GCC optimization when copying an array

I need to profile an application that performs many array copies, so I ended up profiling this very simple function:

typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
        UChar *end = src + len;
        while (src < end)
                *dst++ = *src++;
}

I use Intel VTune for the actual profiling, and it shows significant differences between a build with gcc -O3 and a "plain" build with no optimization flags (gcc 4.4).

To understand why and how, I looked at the assembly output of both builds.

Non-optimized version:

.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave

So, as far as I can tell, each iteration reloads src from the stack, reads one byte from *src zero-extended into edx (movzbl), stores the low byte (dl) to *dst, and then increments both pointers in memory: simple enough.

Then I looked at the optimized version and could not make sense of it.

EDIT: The optimized build is here.

So my question is: what is gcc doing to this function when it optimizes it?









The fastest x86 assembly instruction that gcc could generate here is rep movsd, which copies 4 bytes at a time. The standard libc function memcpy in <string.h>, together with gcc's special inlining of memcpy and many other <string.h> functions, will give you the fastest results.
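As a minimal sketch of that suggestion, the loop can be replaced by a single memcpy call. The name copy_mem_memcpy is made up for this illustration, and the const qualifier and the assert are additions that are not in the original function. The sketch assumes src and dst never overlap. If they might, memmove is the safe call instead.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char UChar;

/* Hypothetical replacement for copy_mem: same argument order (src, dst, len).
 * Assumption: the two buffers do not overlap, as memcpy requires. */
void copy_mem_memcpy(const UChar *src, UChar *dst, unsigned int len) {
        /* Passing a null pointer to memcpy is undefined, so check in debug builds. */
        assert(src != NULL && dst != NULL);
        /* gcc may expand this call inline or hand it to libc's tuned memcpy. */
        memcpy(dst, src, len);
}
```

Whether this actually beats the hand-written loop in a given build is something to confirm in the profiler rather than assume.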


You can also use the restrict qualifier here, which tells the compiler that src and dst do not overlap.
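A minimal sketch of what that could look like, keeping the original loop. The name copy_mem_restrict is made up for this illustration, and the const qualifier and the assert are additions. restrict is a C99 keyword, so this assumes compiling with -std=c99 or newer. In gcc's older gnu89 mode the spelling __restrict__ is accepted instead. The qualifier is a promise from the caller: passing overlapping buffers makes the behavior undefined.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char UChar;

/* Hypothetical variant of copy_mem with restrict-qualified pointers.
 * Assumption: the caller guarantees the two buffers never overlap. */
void copy_mem_restrict(const UChar *restrict src, UChar *restrict dst,
                       unsigned int len) {
        assert(src != NULL && dst != NULL);
        const UChar *end = src + len;
        /* With no aliasing possible, the compiler is free to load and store
         * several bytes per iteration without overlap checks. */
        while (src < end)
                *dst++ = *src++;
}
```

Comparing the -O3 assembly with and without restrict shows whether gcc takes advantage of the promise for this loop.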

