Issue
Dear Stackoverflow Community,
I am trying to understand how to calculate the performance limits for DRAM access, but my benchmarks come nowhere near the numbers found in the specs. One would not expect to reach the theoretical limit, of course, but there might be an explanation for why I am so far off.
For example, I measure around 11 GB/s for DRAM access on my system, whereas WikiChip and the JEDEC spec list the peak performance of a dual-channel DDR4-2400 system as 38.4 GB/s.
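For reference, I assume the 38.4 GB/s figure is simply the per-channel transfer rate times the 8-byte bus width times the number of channels, e.g. in Julia:

transfers_per_second = 2400e6   # DDR4-2400: 2400 MT/s per channel
bytes_per_transfer   = 8        # 64-bit memory bus
channels             = 2        # dual channel
peak_GBps = transfers_per_second * bytes_per_transfer * channels / 1e9   # 38.4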
Is my measurement flawed or are these just not the right numbers to calculate peak memory performance?
The Measurement
On my system with a Core i7-8550U at 1.8 GHz (Kaby Lake microarchitecture), lshw shows two memory entries:
*-memory
...
*-bank:0
...
slot: ChannelA-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:1
...
slot: ChannelB-DIMM0
width: 64 bits
clock: 2400MHz (0.4ns)
so these two should run in "dual channel" mode then (is that automatically the case?).
I set up the system to reduce the measurement noise (a sketch of these settings follows the list) by
- disabling frequency scaling
- disabling Address Space Layout Randomization (ASLR)
- setting the scaling_governor to performance
- using cpuset to isolate the benchmark on its own core
- setting a niceness of -20
- using a headless system with a minimal number of processes running
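A rough Julia sketch of the first three settings, assuming Linux with the intel_pstate driver and root privileges (the helper name quiet_system! is mine; cpuset isolation and the niceness are applied when launching the benchmark itself):

# sketch: apply the noise-reduction settings listed above; sysfs paths may
# differ on other systems
function quiet_system!()
    for cpu in filter(d -> occursin(r"^cpu\d+$", d), readdir("/sys/devices/system/cpu"))
        # pin every core to the `performance` governor
        write("/sys/devices/system/cpu/$cpu/cpufreq/scaling_governor", "performance")
    end
    write("/sys/devices/system/cpu/intel_pstate/no_turbo", "1")   # disable frequency scaling ("turbo")
    write("/proc/sys/kernel/randomize_va_space", "0")             # disable ASLR
    return nothing
end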
Then I started out with the ScanWrite256PtrUnrollLoop benchmark of the pmbw - Parallel Memory Bandwidth Benchmark / Measurement program:
pmbw -f ScanWrite256PtrUnrollLoop -p 1 -P 1
The inner loop can be examined with
gdb -batch -ex "disassemble/rs ScanWrite256PtrUnrollLoop" `which pmbw` | c++filt
It seems that this benchmark creates a "stream" of vmovdqa (Move Aligned Packed Integer Values) AVX 256-bit instructions to saturate the CPU's memory subsystem:
<+44>:  vmovdqa %ymm0,(%rax)
vmovdqa %ymm0,0x20(%rax)
vmovdqa %ymm0,0x40(%rax)
vmovdqa %ymm0,0x60(%rax)
vmovdqa %ymm0,0x80(%rax)
vmovdqa %ymm0,0xa0(%rax)
vmovdqa %ymm0,0xc0(%rax)
vmovdqa %ymm0,0xe0(%rax)
vmovdqa %ymm0,0x100(%rax)
vmovdqa %ymm0,0x120(%rax)
vmovdqa %ymm0,0x140(%rax)
vmovdqa %ymm0,0x160(%rax)
vmovdqa %ymm0,0x180(%rax)
vmovdqa %ymm0,0x1a0(%rax)
vmovdqa %ymm0,0x1c0(%rax)
vmovdqa %ymm0,0x1e0(%rax)
add $0x200,%rax
cmp %rsi,%rax
jb 0x37dc <ScanWrite256PtrUnrollLoop(char*, unsigned long, unsigned long)+44>
As a similar benchmark in Julia I came up with the following:
const C = NTuple{K,VecElement{Float64}} where K

@inline function Base.fill!(dst::Vector{C{K}}, x::C{K}, ::Val{NT} = Val(8)) where {NT,K}
    NB = div(length(dst), NT)
    k = 0
    @inbounds for i in Base.OneTo(NB)
        @simd for j in Base.OneTo(NT)
            dst[k += 1] = x
        end
    end
end
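To get numbers comparable to pmbw, the fill! above can be timed over buffers of increasing size, roughly like this sketch (measure_bandwidth and the chosen sizes are only illustrative, not part of the original benchmark):

using Printf

# minimal timing harness: write `n_bytes` repeatedly via fill! and report the
# best run as GB/s
function measure_bandwidth(n_bytes::Int; reps::Int = 10)
    K   = 4                                        # 4 × Float64 = 32 bytes per element
    dst = Vector{C{K}}(undef, n_bytes ÷ (K * sizeof(Float64)))
    x   = ntuple(_ -> VecElement(1.0), K)          # value to be written
    fill!(dst, x)                                  # warm-up / compilation
    t = minimum([@elapsed(fill!(dst, x)) for _ in 1:reps])
    @printf("%10d KiB: %6.2f GB/s\n", n_bytes ÷ 1024, n_bytes / t / 1e9)
end

for kib in (32, 256, 2048, 65536, 524288)          # spans L1-sized up to DRAM-sized buffers
    measure_bandwidth(kib * 1024)
end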
When investigating the inner loop of this fill! function with
code_native(fill!, (Vector{C{4}}, C{4}, Val{16}), debuginfo=:none)
we can see that it also creates a similar "stream" of vmovups (Move Unaligned Packed Single-Precision Floating-Point Values) instructions:
L32:
vmovups %ymm0, -480(%rcx)
vmovups %ymm0, -448(%rcx)
vmovups %ymm0, -416(%rcx)
vmovups %ymm0, -384(%rcx)
vmovups %ymm0, -352(%rcx)
vmovups %ymm0, -320(%rcx)
vmovups %ymm0, -288(%rcx)
vmovups %ymm0, -256(%rcx)
vmovups %ymm0, -224(%rcx)
vmovups %ymm0, -192(%rcx)
vmovups %ymm0, -160(%rcx)
vmovups %ymm0, -128(%rcx)
vmovups %ymm0, -96(%rcx)
vmovups %ymm0, -64(%rcx)
vmovups %ymm0, -32(%rcx)
vmovups %ymm0, (%rcx)
leaq 1(%rdx), %rsi
addq $512, %rcx
cmpq %rax, %rdx
movq %rsi, %rdx
jne L32
Now, all these benchmarks show the distinct performance plateaus for the three cache levels and for main memory, but interestingly they are all bound to around 11 GB/s for larger test sizes. Using multiple threads and (re)activating frequency scaling (which doubles the CPU's frequency) has an impact on the smaller test sizes but does not really change these findings for the larger ones.
Solution
After some investigation, I found a blog post describing the same problem, with the recommendation to use so-called non-temporal write operations. There are multiple other resources, including an LWN article by Ulrich Drepper, with many more details to study from here on.
In Julia this can be achieved with vstorent from the SIMD.jl package:
function Base.fill!(p::Ptr{T}, len::Int64, y::Y, ::Val{K}, ::Val{NT} = Val(16)) where
        {K,NT,T,Y <: Union{NTuple{K,T},T,Vec{K,T}}}
    # @assert Int64(p) % K*S == 0
    x  = Vec{K,T}(y)
    nb = max(div(len, K*NT), 0)
    S  = sizeof(T)
    p0 = p + nb*NT*K*S
    while p < p0
        for j in Base.OneTo(NT)
            vstorent(x, p)   # non-temporal, `K*S` aligned store
            p += K*S
        end
    end
    Threads.atomic_fence()
    return nothing
end
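A hypothetical way to call this on a large buffer and time it (the 1 GiB size and the timing code are only illustrative; Julia's large array allocations should be sufficiently aligned for the 32-byte non-temporal stores, which is worth verifying):

using SIMD

n   = 1 << 27                                 # 2^27 Float64s = 1 GiB
buf = Vector{Float64}(undef, n)
GC.@preserve buf begin
    p = pointer(buf)
    fill!(p, n, 0.0, Val(4), Val(16))         # warm-up / compilation
    t = @elapsed fill!(p, n, 1.0, Val(4), Val(16))
end
println(round(n * sizeof(Float64) / t / 1e9; digits = 1), " GB/s")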
which, for a vector width of 4 × Float64 and an unrolling factor of 16,
code_native(fill!, (Ptr{Float64}, Int64, Float64, Val{4}, Val{16}), debuginfo=:none)
compiles down to vmovntps (Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint) instructions:
... # y is register-passed via %xmm0
vbroadcastsd %xmm0, %ymm0 # x = Vec{4,Float64}(y)
nop
L48:
vmovntps %ymm0, (%rdi)
vmovntps %ymm0, 32(%rdi)
vmovntps %ymm0, 64(%rdi)
vmovntps %ymm0, 96(%rdi)
vmovntps %ymm0, 128(%rdi)
vmovntps %ymm0, 160(%rdi)
vmovntps %ymm0, 192(%rdi)
vmovntps %ymm0, 224(%rdi)
vmovntps %ymm0, 256(%rdi)
vmovntps %ymm0, 288(%rdi)
vmovntps %ymm0, 320(%rdi)
vmovntps %ymm0, 352(%rdi)
vmovntps %ymm0, 384(%rdi)
vmovntps %ymm0, 416(%rdi)
vmovntps %ymm0, 448(%rdi)
vmovntps %ymm0, 480(%rdi)
addq $512, %rdi # imm = 0x200
cmpq %rax, %rdi
jb L48
L175:
mfence # Threads.atomic_fence()
vzeroupper
retq
nopw %cs:(%rax,%rax)
This reaches up to 34 GB/s, almost 90% of the theoretical maximum of 38.4 GB/s.
The achieved bandwidth still depends on various things, though, such as the number of threads and whether frequency scaling is enabled. On my system the measured peak performance can be reached with a single thread when the frequency scales to its maximum (4 GHz "turbo" instead of 1.8 GHz "no_turbo"), but it cannot be reached without frequency scaling (i.e. at 1.8 GHz), not even with multiple threads.
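One way to run the non-temporal fill! with multiple threads is to give each thread its own chunk of the buffer, roughly like this sketch (threaded_fill! is only an illustration, not part of SIMD.jl):

# each thread writes a disjoint chunk of the buffer with the non-temporal fill!
# above; chunks are whole multiples of the 64-element inner block, so every
# store stays 32-byte aligned
function threaded_fill!(buf::Vector{Float64}, y::Float64)
    n     = length(buf)
    block = 4 * 16                                  # K * NT elements per inner iteration
    chunk = cld(n ÷ block, Threads.nthreads()) * block
    GC.@preserve buf begin
        Threads.@threads for t in 1:Threads.nthreads()
            lo  = (t - 1) * chunk
            len = min(chunk, n - lo)
            len > 0 && fill!(pointer(buf) + lo * sizeof(Float64), len, y, Val(4), Val(16))
        end
    end
    return buf
end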
Answered By - christianl