Issue
I wanted to benchmark glibc
's strlen
function for some reason and found out it apparently performs much slower with optimizations enabled in GCC and I have no idea why.
Here's my code:
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *s = calloc(1 << 20, 1);
memset(s, 65, 1000000);
clock_t start = clock();
for (int i = 0; i < 128; ++i) {
s[strlen(s)] = 'A';
}
clock_t end = clock();
printf("%lld\n", (long long)(end - start));
return 0;
}
On my machine it outputs:
$ gcc test.c && ./a.out
13336
$ gcc -O1 test.c && ./a.out
199004
$ gcc -O2 test.c && ./a.out
83415
$ gcc -O3 test.c && ./a.out
83415
Somehow, enabling optimizations causes it to execute longer.
Solution
Testing your code on Godbolt's Compiler Explorer provides this explanation:
- at
-O0
or without optimisations, the generated code calls the C library functionstrlen
; - at
-O1
the generated code uses a simple inline expansion using arep scasb
instruction; - at
-O2
and above, the generated code uses a more elaborate inline expansion.
Benchmarking your code repeatedly shows substantial variations from one run to another, but increasing the number of iterations shows that:
- the
-O1
code is much slower than the C library implementation:32240
vs3090
- the
-O2
code is faster than the-O1
but still substantially slower than the C ibrary code:8570
vs3090
.
This behavior is specific to gcc
and the GNU libc. The same test on OS/X with clang
and Apple's Libc does not show significant differences, which is not a surprise as Godbolt shows that clang
generates a call to the C library strlen
at all optimisation levels.
This could be considered a bug in gcc/glibc but more extensive benchmarking might show that the overhead of calling strlen
has a more important impact than the lack of performance of the inline code for small strings. The strings in your benchmark are uncommonly large, so focusing the benchmark on ultra-long strings might not give meaningful results.
I improved this benchmark and tested various string lengths. It appears from the benchmarks on linux with gcc (Debian 4.7.2-5) 4.7.2 running on an Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz that the inline code generated by -O1
is always slower, by as much as a factor of 10 for moderately long strings, while -O2
is only slightly faster than the libc strlen
for very short strings and half as fast for longer strings. From this data, the GNU C library version of strlen
is quite efficient for most string lengths, at least on my specific hardware. Also keeping in mind that cacheing has a major impact on benchmark measurements.
Here is the updated code:
#include <stdlib.h>
#include <string.h>
#include <time.h>
void benchmark(int repeat, int minlen, int maxlen) {
char *s = malloc(maxlen + 1);
memset(s, 'A', minlen);
long long bytes = 0, calls = 0;
clock_t clk = clock();
for (int n = 0; n < repeat; n++) {
for (int i = minlen; i < maxlen; ++i) {
bytes += i + 1;
calls += 1;
s[i] = '\0';
s[strlen(s)] = 'A';
}
}
clk = clock() - clk;
free(s);
double avglen = (minlen + maxlen - 1) / 2.0;
double ns = (double)clk * 1e9 / CLOCKS_PER_SEC;
printf("average length %7.0f -> avg time: %7.3f ns/byte, %7.3f ns/call\n",
avglen, ns / bytes, ns / calls);
}
int main() {
benchmark(10000000, 0, 1);
benchmark(1000000, 0, 10);
benchmark(1000000, 5, 15);
benchmark(100000, 0, 100);
benchmark(100000, 50, 150);
benchmark(10000, 0, 1000);
benchmark(10000, 500, 1500);
benchmark(1000, 0, 10000);
benchmark(1000, 5000, 15000);
benchmark(100, 1000000 - 50, 1000000 + 50);
return 0;
}
Here is the output:
chqrlie> gcc -std=c99 -O0 benchstrlen.c && ./a.out average length 0 -> avg time: 14.000 ns/byte, 14.000 ns/call average length 4 -> avg time: 2.364 ns/byte, 13.000 ns/call average length 10 -> avg time: 1.238 ns/byte, 13.000 ns/call average length 50 -> avg time: 0.317 ns/byte, 16.000 ns/call average length 100 -> avg time: 0.169 ns/byte, 17.000 ns/call average length 500 -> avg time: 0.074 ns/byte, 37.000 ns/call average length 1000 -> avg time: 0.068 ns/byte, 68.000 ns/call average length 5000 -> avg time: 0.064 ns/byte, 318.000 ns/call average length 10000 -> avg time: 0.062 ns/byte, 622.000 ns/call average length 1000000 -> avg time: 0.062 ns/byte, 62000.000 ns/call chqrlie> gcc -std=c99 -O1 benchstrlen.c && ./a.out average length 0 -> avg time: 20.000 ns/byte, 20.000 ns/call average length 4 -> avg time: 3.818 ns/byte, 21.000 ns/call average length 10 -> avg time: 2.190 ns/byte, 23.000 ns/call average length 50 -> avg time: 0.990 ns/byte, 50.000 ns/call average length 100 -> avg time: 0.816 ns/byte, 82.000 ns/call average length 500 -> avg time: 0.679 ns/byte, 340.000 ns/call average length 1000 -> avg time: 0.664 ns/byte, 664.000 ns/call average length 5000 -> avg time: 0.651 ns/byte, 3254.000 ns/call average length 10000 -> avg time: 0.649 ns/byte, 6491.000 ns/call average length 1000000 -> avg time: 0.648 ns/byte, 648000.000 ns/call chqrlie> gcc -std=c99 -O2 benchstrlen.c && ./a.out average length 0 -> avg time: 10.000 ns/byte, 10.000 ns/call average length 4 -> avg time: 2.000 ns/byte, 11.000 ns/call average length 10 -> avg time: 1.048 ns/byte, 11.000 ns/call average length 50 -> avg time: 0.337 ns/byte, 17.000 ns/call average length 100 -> avg time: 0.299 ns/byte, 30.000 ns/call average length 500 -> avg time: 0.202 ns/byte, 101.000 ns/call average length 1000 -> avg time: 0.188 ns/byte, 188.000 ns/call average length 5000 -> avg time: 0.174 ns/byte, 868.000 ns/call average length 10000 -> avg time: 0.172 ns/byte, 1716.000 ns/call average length 1000000 -> avg time: 0.172 ns/byte, 172000.000 ns/call
Answered By - chqrlie Answer Checked By - Candace Johnson (WPSolving Volunteer)