I need to profile an application that does a lot of array copying, so I ended up profiling this very simple function:
typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}
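For anyone who wants to reproduce this, here is a minimal sketch of a self-checking wrapper around the function. The helper name check_copy and the fill pattern are my own assumptions for illustration; they are not part of the profiled application.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned char UChar;

/* The function from the question, behavior unchanged. */
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}

/* Hypothetical helper (not in the original application):
 * fills a source buffer with a simple pattern, copies it with
 * copy_mem, and returns 1 if the destination matches, 0 otherwise. */
int check_copy(unsigned int len) {
    size_t alloc = len ? len : 1;   /* avoid a zero-sized malloc */
    UChar *src = malloc(alloc);
    UChar *dst = malloc(alloc);
    unsigned int i;
    int ok;

    if (!src || !dst) {
        free(src);
        free(dst);
        return 0;
    }
    for (i = 0; i < len; i++)
        src[i] = (UChar)(i * 31u + 7u);   /* arbitrary test pattern */
    memset(dst, 0, alloc);

    copy_mem(src, dst, len);
    ok = (memcmp(src, dst, len) == 0);

    free(src);
    free(dst);
    return ok;
}
```

Calling check_copy with a few sizes (0, 1, a few thousand) from a small main is enough to confirm that both the optimized and the non-optimized build copy correctly before comparing their timings.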
I use Intel VTune for the actual profiling, and it showed significant differences between compiling with gcc -O3 and with "plain" gcc (version 4.4, no optimization flags).
To understand why, I generated the assembly output of both builds.
Non-optimized version:
.L3:
movl 8(%ebp), %eax
movzbl (%eax), %edx
movl 12(%ebp), %eax
movb %dl, (%eax)
addl $1, 12(%ebp)
addl $1, 8(%ebp)
.L2:
movl 8(%ebp), %eax
cmpl -4(%ebp), %eax
jb .L3
leave
So I see that it first loads src from the stack and reads one byte from *src, zero-extended into edx, then stores that byte to *dst and updates both pointers: simple enough.
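To make that reading concrete, here is a rough C model of the non-optimized loop with one statement per instruction. The function name copy_mem_O0_model is hypothetical, and the initial jump to .L2 is an assumption on my part: it is not in the listing above, but it is how gcc usually lays out a while loop.

```c
typedef unsigned char UChar;

/* Hypothetical model of the -O0 loop; every variable lives in memory
 * (the stack frame), so each step reloads and stores it again. */
void copy_mem_O0_model(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;        /* end is kept at -4(%ebp) */
    goto check;                    /* assumed: the loop is entered at .L2 */
body:                              /* .L3 */
    {
        UChar *p = src;            /* movl 8(%ebp), %eax */
        unsigned int byte = *p;    /* movzbl (%eax), %edx : one byte, zero-extended */
        UChar *q = dst;            /* movl 12(%ebp), %eax */
        *q = (UChar)byte;          /* movb %dl, (%eax) */
        dst = dst + 1;             /* addl $1, 12(%ebp) */
        src = src + 1;             /* addl $1, 8(%ebp) */
    }
check:                             /* .L2 */
    if (src < end)                 /* movl 8(%ebp), %eax ; cmpl -4(%ebp), %eax ; jb .L3 */
        goto body;
}
```

The point of the model is that every iteration copies a single byte and touches the stack several times for the pointers alone, which is presumably where much of the difference from the -O3 build comes from.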
Then I looked at the optimized version, and I could not make sense of it.
EDIT: the optimized assembly is here.