
Monday, July 18, 2005

Depravity

So I had an evil, perverted urge to learn MMX assembly (I would learn SSE2, but I don't have that on my computer), yet nothing to use MMX for. Then I got the idea to write a function to convert a string from lower case to upper case using MMX (if the length is known in advance to be a multiple of 8 bytes, MMX can process all 8 bytes at once). The general idea is this (since obviously branches are out in SIMD programming):

1. Read a block of chars (8 chars - the size of an MMX register)
2. Generate comparison mask of chars that are greater than or equal to 'a' (0x61)
3. Generate comparison mask of chars that are less than or equal to 'z' (0x7a)
4. AND masks to get a mask where each byte is 0xFF if the char is lowercase, otherwise 0
5. AND that mask with a vector of all 0x20 bytes ('a' - 'A') to form a subtraction mask, where bytes corresponding to lower case chars are 0x20, otherwise 0
6. Subtract that mask from the block of chars
7. Write block of chars back
8. Rinse and repeat

This sounds like a major pain, and really it is. But hopefully it'll be faster than traditional char-by-char conversion (and if nothing else, it's an introduction to MMX assembly).
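If it helps to see those steps in a more readable form first, here's a rough sketch in C using the compiler's MMX intrinsics (mmintrin.h). The function name is made up for the sketch, and as stated, the length has to be a multiple of 8:

#include <mmintrin.h> /* MMX intrinsics */

/* Sketch of the steps above. Note that pcmpgtb (_mm_cmpgt_pi8) is a
   signed compare, so "char >= 'a'" is done as "char > 'a' - 1" and
   "char <= 'z'" as "'z' + 1 > char". */
void to_upper_mmx_sketch(char *s, unsigned len)
{
    __m64 lo  = _mm_set1_pi8('a' - 1); /* 0x60 in every byte */
    __m64 hi  = _mm_set1_pi8('z' + 1); /* 0x7b in every byte */
    __m64 sub = _mm_set1_pi8(0x20);    /* 'a' - 'A' in every byte */
    unsigned i;

    for (i = 0; i < len; i += 8) {
        __m64 c  = *(__m64 *)(s + i);          /* 1. read 8 chars */
        __m64 ge = _mm_cmpgt_pi8(c, lo);       /* 2. mask: char >= 'a' */
        __m64 le = _mm_cmpgt_pi8(hi, c);       /* 3. mask: char <= 'z' */
        __m64 m  = _mm_and_si64(ge, le);       /* 4. 0xFF where lowercase */
        m = _mm_and_si64(m, sub);              /* 5. 0x20 where lowercase */
        *(__m64 *)(s + i) = _mm_sub_pi8(c, m); /* 6+7. subtract, write back */
    }
    _mm_empty(); /* leave MMX state so the FPU is usable again */
}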

But before we get into the actual function, we need something to compare it to. The following is a simple branchless function I just wrote; it takes 13 cycles per char on a Pentium 4, plus the memory accesses, which will generally be cached:

push ebx
xchg ecx, edx ; ecx is now the length, edx is the string

test ecx, ecx ; bail out on zero-length strings
je Done

Begin:
mov al, [edx] ; load one char
cmp al, 'a'
setl bl ; bl = 1 if char < 'a'
cmp al, 'z'
setg bh ; bh = 1 if char > 'z'
or bl, bh ; bl = 1 if char is not lowercase
sub bl, 1 ; 1 -> 0, 0 -> 0FFh
and bl, 20h ; 20h for lowercase chars, else 0
sub al, bl ; convert to uppercase
mov [edx], al ; store the char

add edx, 1
sub ecx, 1
jne Begin

Done:
pop ebx
ret
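For comparison, the same branchless trick in C looks something like this (the function name is made up, and a compiler won't necessarily emit the exact instruction sequence above):

/* Branchless C equivalent: comparisons yield 0 or 1, so subtracting 1
   from "not lowercase" gives an all-ones byte exactly when the char is
   lowercase, and ANDing with 0x20 gives the amount to subtract. */
void to_upper_branchless(char *s, unsigned len)
{
    while (len--) {
        unsigned char c = (unsigned char)*s;
        unsigned char not_lower = (c < 'a') | (c > 'z'); /* 0 or 1 */
        *s++ = (char)(c - ((unsigned char)(not_lower - 1) & 0x20));
    }
}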

This process could be made faster, under certain circumstances, by using a simple lookup table; a lookup-table version takes about 6 cycles per char. The problem is that it also produces many memory accesses that will take dozens of cycles each if uncached. Thus the particular circumstance where this is beneficial: when the function is called so frequently that the table stays entirely in the processor's data cache. If that condition is not met, the function will be significantly slower than the branchless one above.
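A minimal sketch of that lookup-table version (here the table gets built once at startup; a static, precomputed table would work just as well):

static unsigned char upper_table[256];

/* Build the table once: identity mapping except for 'a'..'z'. */
void init_upper_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        upper_table[i] = (unsigned char)((i >= 'a' && i <= 'z') ? i - 0x20 : i);
}

void to_upper_table(char *s, unsigned len)
{
    while (len--) {
        *s = (char)upper_table[(unsigned char)*s];
        s++;
    }
}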

Now, the MMX version. This function is more than a little cryptic because I did some optimizations on it. One bottleneck in this function is that movq instructions (move quadword - 64 bits) are very slow - they take about 6 cycles to complete (although another instruction may begin executing after 1 cycle, provided it doesn't use the result of the movq). All the other instructions used here take 2 cycles, but again another instruction may begin after 1 cycle, provided it doesn't depend on the previous instruction's result.

The biggest oddity in how I wrote this function, however, is due to the fact that there are no instructions to load an MMX register with an immediate, and memory access is slow. Yet we need 3 masks: 1 of 0x20 bytes ('a' - 'A'), 1 of 0x60 bytes ('a' - 1), and 1 of 0x7b bytes ('z' + 1), corresponding to the compares we're going to make. What I decided to do (because it was by far the fastest solution I could come up with) was to load a 16-bit value from a normal register into an MMX register, then use the shuffle instruction (pshufw) to spam that same value into the other 48 bits of the register. This is much faster than loading three MMX registers from memory, given that we can load 2 16-bit values at once with movd (move doubleword - 32 bits), and even that only takes 2 cycles.
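In intrinsics terms the trick looks roughly like this (pshufw maps to _mm_shuffle_pi16; strictly speaking it arrived with the SSE extensions to MMX rather than with base MMX, but it runs anywhere MMX + SSE is available):

#include <mmintrin.h>
#include <xmmintrin.h> /* _mm_shuffle_pi16 (pshufw) */

/* Build the 0x60 and 0x7b byte masks from one 32-bit immediate,
   mirroring the movd + pshufw trick in the routine below. */
void make_bounds_masks(__m64 *lower, __m64 *upper)
{
    __m64 packed = _mm_cvtsi32_si64(0x7b7b6060); /* movd mm, eax */
    *lower = _mm_shuffle_pi16(packed, 0x00);     /* word 0 to all 4 words */
    *upper = _mm_shuffle_pi16(packed, 0x55);     /* word 1 to all 4 words */
}

And here's the function itself: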

push ebx
test edx, edx ; edx is the block count on entry (xchg doesn't affect flags)
xchg ecx, edx ; ecx is now number of blocks, edx is the string
je Done

mov eax, 7b7b6060h ; low word contains the lower bounds mask, high word contains the upper bounds mask
mov ebx, 20202020h ; the subtraction mask

Begin:
movq mm1, [edx] ; the packed characters
movd mm2, eax
movd mm3, ebx
movq mm0, mm1 ; keep an untouched copy of the chars
pshufw mm4, mm2, 00000000b ; broadcast word 0: the lower bounds mask (60h bytes)
pshufw mm3, mm3, 00000000b ; broadcast word 0: the subtraction mask (20h bytes)
pshufw mm2, mm2, 01010101b ; broadcast word 1: the upper bounds mask (7Bh bytes)
pcmpgtb mm1, mm4 ; 0FFh where char > 60h (char >= 'a')
pcmpgtb mm2, mm0 ; 0FFh where 7Bh > char (char <= 'z')
pand mm3, mm1 ; combine the two masks...
pand mm3, mm2 ; ...leaving 20h only in lowercase bytes
psubb mm0, mm3 ; subtract 20h from the lowercase chars
movq [edx], mm0 ; write the block back

add edx, 8
sub ecx, 1
jnz Begin

Done:
emms ; clear the MMX state so the FPU is usable again
pop ebx
ret

As you can see, once we've computed the two conditional masks, we reach a bottleneck - we can no longer shove new instructions into the processor every cycle. Still, the overall time is pretty nice. The non-MMX instructions before the loop are negligible, and the instructions inside the loop take about 23 cycles + 1 memory access (and these accesses will be cached a majority of the time, so there's reduced delay). That's about 3 cycles and 1/8 of a memory access per char, which is about twice as fast as the lookup table version (in ideal conditions) and four times as fast as the branchless non-MMX version.
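Calling this from C takes a prototype matching the register usage; judging from the xchg at the top, it's __fastcall with the string in ecx and the block count in edx. The prototype and wrapper below are my assumptions, sketched for MSVC:

/* Assumed prototype for the routine above: string first (ecx),
   number of 8-byte blocks second (edx), MSVC __fastcall. */
extern void __fastcall mmx_to_upper(char *s, unsigned blocks);

void to_upper(char *s, unsigned len)
{
    unsigned i;
    mmx_to_upper(s, len / 8); /* whole 8-byte blocks via MMX */
    s += len & ~7u;           /* handle any leftover tail chars */
    for (i = 0; i < (len & 7u); i++)
        if (s[i] >= 'a' && s[i] <= 'z')
            s[i] -= 0x20;
}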

So, was the speed increase worth the effort? Probably not, but I certainly learned some stuff in the process, and that, at least, was worth it.

Oh, and one last thing: all the information I used on the speed of instructions is based on the Intel Pentium 4 optimization manual. Other processors may be faster or slower for each instruction. This means that the way I've got the instructions ordered for maximum speed may not be optimal on non-P4 processors (like mine).
