Issue
I wrote a Fortran program to simulate a molecular system. I developed it on a desktop computer whose processor is an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz. Later, to launch large-scale simulations, I used compute blades whose processors are AMD Opteron(tm) Processor 6176 @ 2.3 GHz. I was surprised to find that the execution time increased by a factor of about 3.
So I decided to learn how to use perf, asm, ... to optimize the program.
After a lot of experimenting, I finally wrote this short program, and I still see a factor of about 3.
program simple_pgm
  integer :: i, res
  res = 0
  do i=1,1000000000
    res = res + i
  enddo
  write(*,*) res
end program simple_pgm
Compilation command: gfortran -g -Wall -O2 simple.f90 -o simple
When I annotated MAIN__ in perf report -n, the assembly code was more or less the same on both machines.
On the i7 processor :
       │     res = 0
       │     do i=1,1000000000
       │       mov    $0x1,%eax
       │       xchg   %ax,%ax
       │     res = res + i
  1095 │10:   add    %eax,%edx
       │     do i=1,1000000000
     1 │       add    $0x1,%eax
       │       cmp    $0x3b9aca01,%eax
       │     ↑ jne    10
       │     enddo
And on the Opteron one :
       │     res = 0
       │     do i=1,1000000000
       │       mov    $0x1,%eax
       │     program simple_pgm
       │       sub    $0x220,%rsp
       │       nop
       │     res = res + i
  1972 │10:   add    %eax,%edx
       │     do i=1,1000000000
  1524 │       add    $0x1,%eax
       │       cmp    $0x3b9aca01,%eax
       │     ↑ jne    10
       │     enddo
I wonder why, in the samples column, the count for the instruction add $0x1,%eax is so large on the Opteron processor (1524). Could it explain the factor of about 3 in execution speed?
Thanks for any answers. As I am learning ASM, processor and computer architecture, perf, ... (many things for a beginner), any comments or suggestions would be appreciated. I am aware that I could be on the wrong track.
Solution
The AMD Opteron 6176 uses the K10 microarchitecture, whereas the Intel Core i7 6700 uses the Skylake microarchitecture. K10 is very old, too old apparently to have its information listed on uops.info or to be available in code analyzers.
Going by Agner Fog's microarchitecture information, it seems that K10 needed at least 2 cycles per iteration of small loops. Skylake does not have that limitation, and in this particular case should be able to reach 1 cycle per iteration.
0.25 ns per iteration (assuming the i7 6700 runs at its 4 GHz turbo) is almost, but not quite, 3 times as fast as 0.63 ns per iteration (assuming the Opteron runs at a 3.2 GHz turbo; that figure isn't very well supported, and perhaps the maximum turbo frequency was not used).
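Written out, the arithmetic behind those figures looks roughly like this. It is only a back-of-the-envelope check: it assumes the loop is limited purely by the cycles-per-iteration figures above (1 for Skylake, 2 for K10) and that the clock stays constant for the whole run. The second half swaps in the nominal clocks quoted in the question (3.4 GHz and 2.3 GHz), which is an assumption about what the machines may have been running at, not a measurement.

```latex
% Requires amsmath (for align*, \text and \intertext).
\begin{align*}
t_{\text{iter}} &= \frac{\text{cycles per iteration}}{f_{\text{clock}}} \\
\intertext{With the assumed turbo clocks (4.0 GHz and 3.2 GHz):}
t_{\text{Skylake}} &= \frac{1}{4.0\ \text{GHz}} = 0.25\ \text{ns} \\
t_{\text{K10}} &= \frac{2}{3.2\ \text{GHz}} = 0.625\ \text{ns} \\
\frac{t_{\text{K10}}}{t_{\text{Skylake}}} &= \frac{0.625}{0.25} = 2.5 \\
\intertext{With the nominal clocks from the question (3.4 GHz and 2.3 GHz):}
t_{\text{Skylake}} &= \frac{1}{3.4\ \text{GHz}} \approx 0.294\ \text{ns} \\
t_{\text{K10}} &= \frac{2}{2.3\ \text{GHz}} \approx 0.870\ \text{ns} \\
\frac{t_{\text{K10}}}{t_{\text{Skylake}}} &= \frac{2 \times 3.4}{2.3} \approx 2.96
\end{align*}
```

Since the loop runs 10^9 iterations, the per-iteration time in nanoseconds is numerically equal to the expected total loop time in seconds, so these estimates can be compared directly against wall-clock timings of the test program. Under the nominal-clock assumption the ratio comes out close to the observed factor of about 3.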
Answered By - harold Answer Checked By - Marie Seifert (WPSolving Admin)