I prefer your first option over the second. Alternatively, you can let the work proceed in parallel: use four local variables, each holding one byte shifted by the correct amount, and in the final line return b0shifted | b1shifted | b2shifted | b3shifted.
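A rough sketch of the four-local-variables idea, not a definitive implementation. The function name read_u32_le and the byte order (first byte is the least significant) are my assumptions, since neither is stated here; swap the shift amounts if your data is big-endian.

```c
#include <stdint.h>

/* Hypothetical helper: assemble a 32-bit value from four bytes.
   Assumption: p[0] is the least significant byte (little-endian data). */
uint32_t read_u32_le(const unsigned char *p)
{
    /* Each byte is widened to uint32_t before shifting, so the shift
       cannot overflow a narrower promoted int. The four lines do not
       depend on each other, so the compiler may schedule them freely. */
    uint32_t b0shifted = (uint32_t)p[0];
    uint32_t b1shifted = (uint32_t)p[1] << 8;
    uint32_t b2shifted = (uint32_t)p[2] << 16;
    uint32_t b3shifted = (uint32_t)p[3] << 24;

    return b0shifted | b1shifted | b2shifted | b3shifted;
}
```

Because it reads one unsigned char at a time, this version gives the same result whatever the host byte order is and does not require the buffer to be aligned.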
In any case, it all depends on your compiler. The second option contains more load/store operations, so in the abstract the first option performs fewer operations.
Other points to keep in mind: endianness, alignment, and the assumption that CHAR_BIT == 8.