Thursday, October 6, 2022

[SOLVED] Raspberry PI benchmark timings oddities with std::chrono::steady_clock

Issue

I'm trying to benchmark a piece of DSP code on a Raspberry Pi 4 using the std::chrono::steady_clock, but the results I'm getting are peculiar. Because GNU profiling tools don't work on Raspberry Pi, I'm stuck with benchmarking to evaluate code optimizations, so this is rather a big deal.

What would cause performance to vary by 10% between executions of the benchmark program, while remaining consistent +/- 1% when the same test is run multiple times in the same execution of the program?

Results for a ~6-second benchmark vary by ~10%, but the variance seems to be sticky for a particular execution of the benchmark. I run the benchmark three times in a row each time the program is run, and get roughly the same results, +/- 1%. When I re-run the program, the results of the three benchmarks vary by +/- 10% from the previous run, while the three results within the new run again agree to within +/- 1%.

e.g.:

Run 1:
9:21:37. Performance: 0.0912333 x realtime
9:21:42. Performance: 0.0910667 x realtime
9:21:47. Performance: 0.0910667 x realtime

Run 2:
9:20:15. Performance: 0.106667 x realtime
9:20:21. Performance: 0.1062 x realtime
9:20:28. Performance: 0.106117 x realtime

Results vary randomly roughly between those two extremes for each run. But the peculiar thing here is that the results are consistent to +/- 1% between the three tests performed each time the program is run.

I'm an experienced programmer, so I get that benchmarks will vary somewhat. But the ~10% variance is unworkable for what I'm trying to do, and I am unable to come up with a reasonable theory as to why the variance changes from invocation to invocation.
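For reference, the measurement itself is nothing exotic. Conceptually it is just a loop timed with std::chrono::steady_clock, along the lines of the sketch below (renderTestAudio is a hypothetical stand-in for the real DSP code; buffer sizes and iteration counts are made up):

#include <chrono>
#include <cstddef>
#include <iostream>

// Hypothetical stand-in for the real DSP code (an LSTM->Dense network in the
// actual benchmark); here it just does trivial placeholder work.
static void renderTestAudio(float *buffer, std::size_t frames) {
    for (std::size_t i = 0; i < frames; ++i) {
        buffer[i] = 0.001f * static_cast<float>(i);
    }
}

int main() {
    using clock = std::chrono::steady_clock;

    constexpr double sampleRate = 44100.0;
    constexpr std::size_t framesPerPass = 64;
    constexpr std::size_t passes = 50000;   // made-up iteration count (~72 s of audio)

    static float buffer[framesPerPass];

    for (int test = 0; test < 3; ++test) {  // three benchmarks per launch
        auto start = clock::now();
        for (std::size_t i = 0; i < passes; ++i) {
            renderTestAudio(buffer, framesPerPass);
        }
        double elapsed =
            std::chrono::duration<double>(clock::now() - start).count();
        double audioSeconds = (passes * framesPerPass) / sampleRate;
        // e.g. "Performance: 0.09 x realtime" means rendering takes 9% of the
        // audio's playback time.
        std::cout << "Performance: " << (elapsed / audioSeconds)
                  << " x realtime\n";
    }
}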

The code under test is a machine-learning algorithm (LSTM->Dense) used to generate real-time audio. The bulk of the execution time (~90%) is matrix and vector arithmetic written with hand-optimized NEON intrinsics. The data footprint is about 13 kB (fits comfortably in the L1 d-cache). The code footprint is unknown, and may not fit in the L1 i-cache. Most of the code pipelines beautifully, so it may be running close to L1-cache bandwidth limits. So far, optimization has improved performance from ~0.18 x realtime to 0.093 x realtime. I think there's probably another ~15% improvement available, but the timing inaccuracies are getting in the way at this point. The code under test gets executed three times (taking ~0.3 x realtime in total), so further optimizations are in fact critical.
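To give a sense of the kind of inner loop involved, a NEON matrix-vector multiply looks roughly like the sketch below. This is a generic illustration, not the actual kernels from the code under test; it also shows the sort of 16-byte alignment asserts mentioned in the checklist that follows:

#include <arm_neon.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative NEON matrix * vector kernel, in the general style of the code
// under test (not the actual implementation). Assumes the matrix and vector
// are 16-byte aligned and that cols is a multiple of 4.
void matVec(const float *matrix, const float *vec, float *out,
            std::size_t rows, std::size_t cols)
{
    assert((reinterpret_cast<std::uintptr_t>(matrix) & 0xF) == 0);
    assert((reinterpret_cast<std::uintptr_t>(vec) & 0xF) == 0);
    assert(cols % 4 == 0);

    for (std::size_t r = 0; r < rows; ++r) {
        const float *row = matrix + r * cols;
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (std::size_t c = 0; c < cols; c += 4) {
            float32x4_t m = vld1q_f32(row + c);
            float32x4_t v = vld1q_f32(vec + c);
            acc = vfmaq_f32(acc, m, v);  // acc += m * v (fused multiply-add)
        }
        out[r] = vaddvq_f32(acc);        // horizontal sum (AArch64 only)
    }
}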

Things that have been checked:

  • Not a NEON alignment problem. All matrices, matrix rows, and vectors are 16-byte aligned (checked with asserts in debug compiles).

  • Not a CPU frequency issue. The CPU scaling governors have been set to performance, and all CPUs are running at 1.8 GHz.

  • I don't think it's related to cache competition between processes. htop indicates ~6% CPU use at idle when connected by VNC, and about 0.3% (the Wi-Fi supplicant) when connected via SSH. The pattern doesn't change significantly when connected via SSH.

  • I don't think it varies depending on which CPU core the code is running on -- although I'm only able to determine which core the code is running on in a particular run using htop, which isn't completely definitive (sched_getcpu() is a more direct check; see the sketch after this list). Test runs seem to occasionally get shifted to a different CPU core, but for the most part they seem to run on a single randomly selected core for the duration of the 3 tests per execution run.

  • I don't think it's heat throttling. The CPU temperature is a very modest 47°C, and I don't think Raspberry Pi 4s heat throttle until they reach 80°C.

  • Vector operations rely on GCC auto-vectorization; they have been properly annotated with restrict declarations and verified to produce optimal NEON vectorization (with better instruction scheduling than I could produce with NEON intrinsics).

  • Not a timer resolution problem. Consecutive calls to std::chrono::steady_clock::now() produce increments of between 37 and 56 ns (see the sketch after this list).

  • Not a problem with choice of clock. steady_clock, system_clock, and high_resolution_clock all exhibit the same behaviour.
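
Two of the checks above -- which core the code is running on, and the granularity of the clock -- can be sanity-checked with something like this (a quick sketch, not the actual test harness):

#include <chrono>
#include <iostream>
#include <sched.h>

int main() {
    using clock = std::chrono::steady_clock;

    // Which core is this thread on right now? (glibc; the scheduler may still
    // migrate the thread later.)
    std::cout << "running on core " << sched_getcpu() << "\n";

    // Smallest observable increments of steady_clock.
    for (int i = 0; i < 10; ++i) {
        auto t0 = clock::now();
        auto t1 = clock::now();
        while (t1 == t0) t1 = clock::now();  // spin until the clock advances
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        std::cout << ns.count() << " ns\n";
    }
}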

Verified CPU frequency:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor 
performance
performance
performance
performance

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 
1800000
1800000
1800000
1800000

Things that I don't know that you may be able to help with:

  • How std::chrono::steady_clock is implemented on Raspberry Pi. Is it based on a CPU cycle counter? Any details appreciated. (A quick way to inspect the underlying clock is sketched after this list.)

  • Whether heat throttling is reflected in /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq. I think it is, but I'm not certain.

  • I'm clearly missing something important.
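
For what it's worth, on Linux libstdc++'s steady_clock is (as far as I can tell) a thin wrapper around clock_gettime(CLOCK_MONOTONIC), which on the Pi 4 appears to be driven by the 54 MHz ARM generic timer rather than by a per-core cycle counter; the 37-56 ns increments mentioned above are consistent with that (one tick ≈ 18.5 ns). The underlying clock can be poked at directly (a quick sketch, glibc assumed):

#include <cstdio>
#include <ctime>

int main() {
    // Resolution the kernel reports for CLOCK_MONOTONIC. Note that this is
    // usually 1 ns regardless of the actual hardware timer granularity.
    timespec res{};
    clock_getres(CLOCK_MONOTONIC, &res);
    std::printf("reported resolution: %ld ns\n", res.tv_nsec);

    // Raw reading of the clock that steady_clock is typically built on.
    timespec t{};
    clock_gettime(CLOCK_MONOTONIC, &t);
    std::printf("CLOCK_MONOTONIC: %lld.%09ld s\n",
                static_cast<long long>(t.tv_sec), t.tv_nsec);
}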

Technical Details:

  • Raspberry Pi 4b 8GB
  • Linux raspberrypi 5.15.61-v8+ #1579 SMP PREEMPT Fri Aug 26 11:16:44 BST 2022 aarch64 GNU/Linux
  • gcc version 10.2.1 20210110 (Debian 10.2.1-6)
  • Tests run under the catch2 test framework.

Solution

I finally identified the source of the problem. The issue seems to be very mild competition for L1 cache contents, presumably from some background system process.

Performance counters exhibit the same odd behavior as the benchmarks do: the 3 consecutive runs performed each time the test program is launched show ~1% variance in results, but results vary by ~10% from launch to launch.
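(On Linux, the usual way to read these counters from inside the program is the perf_event_open syscall. The sketch below shows the general shape of such a measurement, trimmed down to a single L1D-read-miss counter. It's an illustration rather than the exact harness used here, and it may require lowering /proc/sys/kernel/perf_event_paranoid or running as root.)

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open one hardware counter for the calling thread (error handling trimmed).
static int openCounter(std::uint32_t type, std::uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(
        syscall(__NR_perf_event_open, &attr, 0 /*this thread*/, -1 /*any cpu*/,
                -1 /*no group*/, 0));
}

int main() {
    // L1 data-cache read misses, as defined by the generic perf cache events.
    int fd = openCounter(PERF_TYPE_HW_CACHE,
                         PERF_COUNT_HW_CACHE_L1D |
                         (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                         (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... run the code under test here; placeholder work shown ...
    volatile double sink = 0;
    for (int i = 0; i < 1000000; ++i) sink = sink + i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    if (read(fd, &misses, sizeof(misses)) == static_cast<ssize_t>(sizeof(misses))) {
        std::printf("L1D read misses: %llu\n",
                    static_cast<unsigned long long>(misses));
    }
    close(fd);
    return 0;
}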

It is peculiar that the differences in performance between runs of the test are consistent and persist over several seconds. But given how minor the disturbance of the L1 cache is, it's hard to guess which of the hundred-odd running system processes is interfering with the benchmark, and why that rather unfortunate pattern manifests, especially since they could be running at any scheduler priority.

The performance counter measurements illustrate the problem: surprisingly, an average of only ~30 extra L1 data-cache misses per iteration of a function that executes 2,995 instructions accounts for the ~10% variance in benchmark results.

I am unable to guess what sort of system process would pollute the L1 data cache at a rate that remains consistent over an 18-second period but varies on longer timescales.

The good news: the code under test is remarkably close to optimal. An LSTM unit with two substantial multiplies and a significant number of vectorized ArcTan and Sigmoid function calls manages to use over 75% of the available cache memory bandwidth and to issue almost two instructions per clock cycle. Woohoo!
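(From the counter data below, for a good run: roughly 2,995 instructions / 1,694 cycles ≈ 1.77 instructions per clock, and 1,244 L1D accesses / 1,694 cycles ≈ 0.73 loads/stores per cycle.)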

Test Data

Performance counter measurements averaged per iteration of the test code. Each launch of the program runs the ~six-second benchmark three times.

A good run of the test program:

CpuClk      :           1,694
L1D Access  :           1,244
L1D Miss    :               6
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              12
L2 Miss     :               0

---
CpuClk      :           1,694
L1D Access  :           1,244
L1D Miss    :               6
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              12
L2 Miss     :               0

---
CpuClk      :           1,693
L1D Access  :           1,244
L1D Miss    :               6
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              12
L2 Miss     :               0

A bad run:

CpuClk      :           1,797
L1D Access  :           1,244
L1D Miss    :              37
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              78
L2 Miss     :               0

---
CpuClk      :           1,794
L1D Access  :           1,244
L1D Miss    :              37
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              78
L2 Miss     :               0

---
CpuClk      :           1,797
L1D Access  :           1,244
L1D Miss    :              37
L1I Miss    :               0
Instructions:           2,995
L2 Access   :              78
L2 Miss     :               0


Answered By - Robin Davies