Issue
Dear Stackoverflow Community,
I am trying to understand how to calculate the performance limits for DRAM access, but my benchmarks come nowhere near the numbers found in the specs. One would not expect to reach the theoretical limit, of course, but there might be an explanation for why I am so far off.
For example, I measure around 11 GB/s for DRAM access on my system, whereas WikiChip and the JEDEC spec list the peak performance of a dual-channel DDR4-2400 system as 38.4 GB/s.
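For reference, I assume the 38.4 GB/s figure is simply the per-channel transfer rate times the 8-byte bus width times the number of channels, e.g. in Julia:

transfers_per_second = 2400e6   # DDR4-2400: 2400 MT/s per channel
bytes_per_transfer   = 8        # 64-bit memory bus
channels             = 2        # dual channel
peak_GBps = transfers_per_second * bytes_per_transfer * channels / 1e9   # 38.4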
Is my measurement flawed or are these just not the right numbers to calculate peak memory performance?
The Measurement
On my system with a Core i7-8550U at 1.8 GHz (Kaby Lake microarchitecture), lshw shows two memory entries:
*-memory
...
*-bank:0
...
slot: ChannelA-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:1
...
slot: ChannelB-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
so these two should run in "dual channel" mode then (is that automatically the case?).
I set up the system to reduce the measurement noise (a sketch of these settings follows the list) by
- disabling frequency scaling
- disabling Address Space Layout Randomization (ASLR)
- setting the scaling_governor to performance
- using cpuset to isolate the benchmark on its own core
- setting a niceness of -20
- using a headless system with a minimal number of processes running
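A rough Julia sketch of the first three settings, assuming Linux with the intel_pstate driver and root privileges (the helper name quiet_system! is mine; cpuset isolation and the niceness are applied when launching the benchmark itself):

# sketch: apply the noise-reduction settings listed above; sysfs paths may
# differ on other systems
function quiet_system!()
    for cpu in filter(d -> occursin(r"^cpu\d+$", d), readdir("/sys/devices/system/cpu"))
        # pin every core to the `performance` governor
        write("/sys/devices/system/cpu/$cpu/cpufreq/scaling_governor", "performance")
    end
    write("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")   # disable frequency scaling ("turbo")
    write("/proc/sys/kernel/randomize_va_space", "0")             # disable ASLR
    return nothing
end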
Then I started out with the ScanWrite256PtrUnrollLoop benchmark of the pmbw - Parallel Memory Bandwidth Benchmark / Measurement program:
pmbw -f ScanWrite256PtrUnrollLoop -p 1 -P 1
The inner loop can be examined with
gdb -batch -ex "disassemble/rs ScanWrite256PtrUnrollLoop" `which pmbw` | c++filt
It seems that this benchmark creates a "stream" of vmovdqa (Move Aligned Packed Integer Values) AVX 256-bit instructions to saturate the CPU's memory subsystem:
<+44>:  vmovdqa %ymm0,(%rax)
vmovdqa %ymm0,0x20(%rax)
vmovdqa %ymm0,0x40(%rax)
vmovdqa %ymm0,0x60(%rax)
vmovdqa %ymm0,0x80(%rax)
vmovdqa %ymm0,0xa0(%rax)
vmovdqa %ymm0,0xc0(%rax)
vmovdqa %ymm0,0xe0(%rax)
vmovdqa %ymm0,0x100(%rax)
vmovdqa %ymm0,0x120(%rax)
vmovdqa %ymm0,0x140(%rax)
vmovdqa %ymm0,0x160(%rax)
vmovdqa %ymm0,0x180(%rax)
vmovdqa %ymm0,0x1a0(%rax)
vmovdqa %ymm0,0x1c0(%rax)
vmovdqa %ymm0,0x1e0(%rax)
add $0x200,%rax
cmp %rsi,%rax
jb 0x37dc <ScanWrite256PtrUnrollLoop(char*, unsigned long, unsigned long)+44>
As a similar benchmark in Julia I came up with the following:
const C = NTuple{K,VecElement{Float64}} where K

@inline function Base.fill!(dst::Vector{C{K}}, x::C{K}, ::Val{NT} = Val(8)) where {NT,K}
    NB = div(length(dst), NT)
    k = 0
    @inbounds for i in Base.OneTo(NB)
        @simd for j in Base.OneTo(NT)
            dst[k += 1] = x
        end
    end
end
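To get numbers comparable to pmbw, the fill! above can be timed over buffers of increasing size, roughly like this sketch (measure_bandwidth and the chosen sizes are only illustrative, not part of the original benchmark):

using Printf

# minimal timing harness: write `n_bytes` repeatedly via fill! and report the
# best run as GB/s
function measure_bandwidth(n_bytes::Int; reps::Int = 10)
    K   = 4                                        # 4 × Float64 = 32 bytes per element
    dst = Vector{C{K}}(undef, n_bytes ÷ (K * sizeof(Float64)))
    x   = ntuple(_ -> VecElement(1.0), K)          # value to be written
    fill!(dst, x)                                  # warm-up / compilation
    t = minimum([@elapsed(fill!(dst, x)) for _ in 1:reps])
    @printf("%10d KiB: %6.2f GB/s\n", n_bytes ÷ 1024, n_bytes / t / 1e9)
end

for kib in (32, 256, 2048, 65536, 524288)          # spans L1-sized up to DRAM-sized buffers
    measure_bandwidth(kib * 1024)
end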
When investigating the inner loop of this fill! function with
code_native(fill!, (Vector{C{4}}, C{4}, Val{16}), debuginfo=:none)
we can see that it also creates a similar "stream" of vmovups (Move Unaligned Packed Single-Precision Floating-Point Values) instructions:
L32:
vmovups %ymm0, -480(%rcx)
vmovups %ymm0, -448(%rcx)
vmovups %ymm0, -416(%rcx)
vmovups %ymm0, -384(%rcx)
vmovups %ymm0, -352(%rcx)
vmovups %ymm0, -320(%rcx)
vmovups %ymm0, -288(%rcx)
vmovups %ymm0, -256(%rcx)
vmovups %ymm0, -224(%rcx)
vmovups %ymm0, -192(%rcx)
vmovups %ymm0, -160(%rcx)
vmovups %ymm0, -128(%rcx)
vmovups %ymm0, -96(%rcx)
vmovups %ymm0, -64(%rcx)
vmovups %ymm0, -32(%rcx)
vmovups %ymm0, (%rcx)
leaq 1(%rdx), %rsi
addq $512, %rcx
cmpq %rax, %rdx
movq %rsi, %rdx
jne L32
Now, all these benchmarks show the distinct performance plateaus for the three cache levels and for main memory, but interestingly they are all bound to around 11 GB/s for larger test sizes. Using multiple threads and (re)activating frequency scaling (which doubles the CPU's frequency) has an impact on the smaller test sizes but does not really change these findings for the larger ones.
Solution
After some investigation, I found a blog post describing the same problem, with the recommendation to use so-called non-temporal write operations. There are multiple other resources, including an LWN article by Ulrich Drepper, with many more details to study from here on.
In Julia this can be achieved with vstorent from the SIMD.jl package:
function Base.fill!(p::Ptr{T}, len::Int64, y::Y, ::Val{K}, ::Val{NT} = Val(16)) where
        {K,NT,T,Y <: Union{NTuple{K,T},T,Vec{K,T}}}
    # @assert Int64(p) % K*S == 0
    x  = Vec{K,T}(y)
    nb = max(div(len, K*NT), 0)
    S  = sizeof(T)
    p0 = p + nb*NT*K*S
    while p < p0
        for j in Base.OneTo(NT)
            vstorent(x, p)   # non-temporal, `K*S` aligned store
            p += K*S
        end
    end
    Threads.atomic_fence()
    return nothing
end
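A hypothetical way to call this on a large buffer and time it (the 1 GiB size and the timing code are only illustrative; Julia's large array allocations should be sufficiently aligned for the 32-byte non-temporal stores, which is worth verifying):

using SIMD

n   = 1 << 27                                 # 2^27 Float64s = 1 GiB
buf = Vector{Float64}(undef, n)
GC.@preserve buf begin
    p = pointer(buf)
    fill!(p, n, 0.0, Val(4), Val(16))         # warm-up / compilation
    t = @elapsed fill!(p, n, 1.0, Val(4), Val(16))
end
println(round(n * sizeof(Float64) / t / 1e9; digits = 1), " GB/s")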
which, for a vector width of 4 × Float64 and an unrolling factor of 16,
code_native(fill!, (Ptr{Float64}, Int64, Float64, Val{4}, Val{16}), debuginfo=:none)
compiles down to vmovntps (Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint) instructions:
... # y is register-passed via %xmm0
vbroadcastsd %xmm0, %ymm0 # x = Vec{4,Float64}(y)
nop
L48:
vmovntps %ymm0, (%rdi)
vmovntps %ymm0, 32(%rdi)
vmovntps %ymm0, 64(%rdi)
vmovntps %ymm0, 96(%rdi)
vmovntps %ymm0, 128(%rdi)
vmovntps %ymm0, 160(%rdi)
vmovntps %ymm0, 192(%rdi)
vmovntps %ymm0, 224(%rdi)
vmovntps %ymm0, 256(%rdi)
vmovntps %ymm0, 288(%rdi)
vmovntps %ymm0, 320(%rdi)
vmovntps %ymm0, 352(%rdi)
vmovntps %ymm0, 384(%rdi)
vmovntps %ymm0, 416(%rdi)
vmovntps %ymm0, 448(%rdi)
vmovntps %ymm0, 480(%rdi)
addq $512, %rdi # imm = 0x200
cmpq %rax, %rdi
jb L48
L175:
mfence # Threads.atomic_fence()
vzeroupper
retq
nopw %cs:(%rax,%rax)
This reaches up to 34 GB/s, almost 90% of the theoretical maximum of 38.4 GB/s.
The achieved bandwidth still depends on various things, though, such as the number of threads and whether frequency scaling is enabled. On my system the measured peak performance can be reached with a single thread when the frequency scales to its maximum (4 GHz "turbo" instead of 1.8 GHz "no_turbo"), but it cannot be reached without frequency scaling (i.e. at 1.8 GHz), not even with multiple threads.
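One way to run the non-temporal fill! with multiple threads is to give each thread its own chunk of the buffer, roughly like this sketch (threaded_fill! is only an illustration, not part of SIMD.jl):

# each thread writes a disjoint chunk of the buffer with the non-temporal fill!
# above; chunks are whole multiples of the 64-element inner block, so every
# store stays 32-byte aligned
function threaded_fill!(buf::Vector{Float64}, y::Float64)
    n     = length(buf)
    block = 4 * 16                                  # K * NT elements per inner iteration
    chunk = cld(n ÷ block, Threads.nthreads()) * block
    GC.@preserve buf begin
        Threads.@threads for t in 1:Threads.nthreads()
            lo  = (t - 1) * chunk
            len = min(chunk, n - lo)
            len > 0 && fill!(pointer(buf) + lo * sizeof(Float64), len, y, Val(4), Val(16))
        end
    end
    return buf
end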
Answered By - christianl