Issue
I have a test application like this:
int main()
{
// calls sched_setaffinity() to set affinity to core 0
while(true)
{
}
return 0;
}
I have 4x logical cores across 2x physical cores.
I would like to see perf event counters but only for the CPU core my app uses.
I run this Perf command (killing it after a few seconds):
sudo perf stat -e cycles:u --cpu=0 --delay=1000 ./app
four times, each time changing the cpu id, it shows cycles:u
> 0.
How can all four cpus be executing userspace cycles for an application which pins to core 0?
Surely three of the cores should not be executing userspace cycles for my application?
Solution
It appears the docs for perf stat --cpu=
are wrong or misleading when they say The -a
option is still necessary to activate system-wide monitoring.
It seems to behave identically with or without -a
, so the counts are from other processes, like GUI animations, cursor blink, etc. Or for cycles:k
, from any interrupt handler regardless of which process is the kernel's current
on that core.
I tested this on Arch GNU/Linux system with kernel and perf 6.5. The smoking gun here is running another infinite-loop process pinned to a different core, and seeing cycles:u
on that core the same as the core running the command from the perf command-line.
Baseline without any funky options, with awk 'BEGIN{for(i=0;i<100000000;i++){}}'
as the command being profiled. (It keeps a CPU core busy for a few seconds, making no system calls except at start and exit, and having a small cache footprint.)
$ taskset -c 1 perf stat -e task-clock,cycles:u --delay=1000 awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Events disabled
Events enabled
Performance counter stats for 'awk BEGIN{for(i=0;i<100000000;i++){}}':
2,625.29 msec task-clock # 1.000 CPUs utilized
10,226,974,370 cycles:u
2.625290561 seconds time elapsed
vs. with --cpu
, two signs of not limiting itself to the awk
process: task-clock is 8x longer than the elapsed time. And the header it prints: Performance counter stats for 'CPU(s) 0-7': rather than for our process.
taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7 --delay=1000 awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Events disabled
Events enabled
Performance counter stats for 'CPU(s) 0-7':
23,621.12 msec task-clock # 8.000 CPUs utilized
12,747,549,694 cycles:u
2.952544251 seconds time elapsed
And with -A
aka --no-aggr
to not aggregate across CPUs, each one has task-clock running for the full interval even though our process definitely isn't on it (since it and perf itself started pinned to CPU 1):
$ taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7 -A --delay=1000 awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Events disabled
Events enabled
Performance counter stats for 'CPU(s) 0-7':
CPU0 2,605.76 msec task-clock # 1.000 CPUs utilized
CPU1 2,605.75 msec task-clock # 1.000 CPUs utilized
CPU2 2,605.76 msec task-clock # 1.000 CPUs utilized
CPU3 2,605.75 msec task-clock # 1.000 CPUs utilized
CPU4 2,605.78 msec task-clock # 1.000 CPUs utilized
CPU5 2,605.78 msec task-clock # 1.000 CPUs utilized
CPU6 2,605.78 msec task-clock # 1.000 CPUs utilized
CPU7 2,605.78 msec task-clock # 1.000 CPUs utilized
CPU0 162,311,044 cycles:u
CPU1 10,153,813,410 cycles:u
CPU2 12,114,460 cycles:u
CPU3 49,619,566 cycles:u
CPU4 180,412,777 cycles:u
CPU5 51,045,799 cycles:u
CPU6 132,275,644 cycles:u
CPU7 344,942,273 cycles:u
2.605703012 seconds time elapsed
And with a dummy load (like another awk process, or in this case I ran program that just loops on pause
instruction, which isn't special other than using less power), we see it uses as many cycles core CPU 2 as awk
did on CPU 1.
$ taskset -c 2 timeout 10m ~/src/SO/pauseloop &
# actually on another terminal
$ taskset -c 1 perf stat -e task-clock,cycles:u --cpu=0-7 -A --delay=1000 awk 'BEGIN{for(i=0;i<100000000;i++){}}'
Events disabled
Events enabled
Performance counter stats for 'CPU(s) 0-7':
CPU0 2,721.71 msec task-clock # 1.000 CPUs utilized
CPU1 2,721.71 msec task-clock # 1.000 CPUs utilized
CPU2 2,721.71 msec task-clock # 1.000 CPUs utilized
CPU3 2,721.71 msec task-clock # 1.000 CPUs utilized
CPU4 2,721.70 msec task-clock # 1.000 CPUs utilized
CPU5 2,721.70 msec task-clock # 1.000 CPUs utilized
CPU6 2,721.69 msec task-clock # 1.000 CPUs utilized
CPU7 2,721.69 msec task-clock # 1.000 CPUs utilized
CPU0 861,492,255 cycles:u
CPU1 10,485,755,009 cycles:u
CPU2 10,485,655,861 cycles:u
CPU3 559,708,162 cycles:u
CPU4 1,207,134,025 cycles:u
CPU5 556,142,162 cycles:u
CPU6 623,020,558 cycles:u
CPU7 3,721,131,901 cycles:u
2.721620976 seconds time elapsed
Note the 10,485,755,009
vs. 10,485,655,861
counts for cycles:u
on cores 1 and 2. The same to within one part per 10k. (At 3.9GHz on my i7-6700k, and the absolute difference is about 1 million, or about 250 nanoseconds, which could easily be accounted for by differences in starting / stopping counting on different cores.)
-a --cpu
only changes the header message, not counts
Adding -a
to any of these (except the one without --cpu
) changes nothing except the header message, to Performance counter stats for 'system wide': instead of 'CPU(s) 0-7'
Everything I'm seeing is consistent with --cpu
actually counting everything happening on that core wile perf stat
is running, just like -a
mode.
Answered By - Peter Cordes Answer Checked By - David Goodson (WPSolving Volunteer)