So, while I was writing those atomic functions in assembly, I figured I might as well write the endian conversion functions too. The functions themselves are trivial to write, using rol for 16-bit flipping and bswap for 32-bit and 64-bit flipping. But after I got done with that, Merlin asked me why I didn't just use the VC++ flipping functions (_byteswap_ushort, _byteswap_ulong, and _byteswap_uint64). Well, the truth was that I didn't know about those. So I went back and added a special case to the endian functions for when VC++ is used. After writing all that, I went to test the code (or #defines, in this case). Of course, I first built the test program (called ThingyTron, in this case). Stepping through the code led me to the implementation of the _byteswap_uint64 function, which looks like this:
unsigned __int64 __cdecl _byteswap_uint64(unsigned __int64 i)
{
    unsigned __int64 j;
    j =  (i << 56);
    j += (i << 40) & 0x00FF000000000000;
    j += (i << 24) & 0x0000FF0000000000;
    j += (i <<  8) & 0x000000FF00000000;
    j += (i >>  8) & 0x00000000FF000000;
    j += (i >> 24) & 0x0000000000FF0000;
    j += (i >> 40) & 0x000000000000FF00;
    j += (i >> 56);
    return j;
}
And if you think that's ugly, you should see the assembly the compiler generates from it. Fortunately, Merlin was right: in a release build, these functions are translated into intrinsics (basically identical to the assembly I wrote, only inlined in the calling function). Because of this, not only are they as fast as the functions I wrote, but they also avoid the overhead of my function calls, making them extremely fast. And so, everything worked out beautifully in the end. I do have to wonder, though, what on earth prompted them to use such a horrible flipping algorithm for the non-intrinsic version.
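For anyone curious what that VC++ special case might look like, here is a minimal sketch. It is not my actual code: the SWAP16/SWAP32/SWAP64 macro names, the *_fallback helpers, and endian_self_check are all made up for illustration, and the non-VC++ branch uses plain C shifts as a stand-in for the rol/bswap assembly.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
/* VC++: use the built-in flipping functions, declared in <stdlib.h>. */
#include <stdlib.h>
#define SWAP16(x) _byteswap_ushort((unsigned short)(x))
#define SWAP32(x) _byteswap_ulong((unsigned long)(x))
#define SWAP64(x) _byteswap_uint64((unsigned __int64)(x))
#else
/* Everything else: hypothetical portable fallbacks written in plain C. */
static inline uint16_t swap16_fallback(uint16_t x)
{
    /* Same effect as rotating a 16-bit value by 8 bits. */
    return (uint16_t)((x << 8) | (x >> 8));
}

static inline uint32_t swap32_fallback(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x >> 8) & 0x0000FF00u)  |
           (x >> 24);
}

static inline uint64_t swap64_fallback(uint64_t x)
{
    /* Swap each 32-bit half, then exchange the halves. */
    uint32_t lo = swap32_fallback((uint32_t)x);
    uint32_t hi = swap32_fallback((uint32_t)(x >> 32));
    return ((uint64_t)lo << 32) | hi;
}

#define SWAP16(x) swap16_fallback((uint16_t)(x))
#define SWAP32(x) swap32_fallback((uint32_t)(x))
#define SWAP64(x) swap64_fallback((uint64_t)(x))
#endif

/* Quick sanity check that all three widths flip correctly. */
void endian_self_check(void)
{
    assert(SWAP16(0x1234) == 0x3412);
    assert(SWAP32(0x12345678UL) == 0x78563412UL);
    assert(SWAP64(0x0123456789ABCDEFULL) == 0xEFCDAB8967452301ULL);
}
```

Calling endian_self_check() once at startup of a test program is enough to confirm that whichever branch got compiled in actually reverses the bytes.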
Thursday, June 09, 2005