Previous | Table of Contents | Next |
However, it generally is. Sure, if the length is odd, John’s approach incurs a penalty approximately equal to the REP startup time for MOVSB. However, if the length is even, John’s approach doesn’t branch, saving cycles and not emptying the prefetch queue. If copy lengths are evenly distributed between even and odd, John’s approach is faster in most x86 systems. (Not on the 486, though.)
John also points out that on the 386, multiple LEAs can be combined to perform multiplications that can’t be handled by a single LEA, much as multiple shifts and adds can be used for multiplication, only faster. LEA can be used to multiply in a single instruction on the 386, but only by the values 2, 3, 4, 5, 8, and 9; several LEAs strung together can handle a much wider range of values. For example, video programmers are undoubtedly familiar with the following code to multiply AX times 80 (the width in bytes of the bitmap in most PC display modes):
SHL AX,1 ;*2 SH LAX,1 ;*4 SH LAX,1 ;*8 SH LAX,1 ;*16 MO VBX,AX SH LAX,1 ;*32 SH LAX,1 ;*64 ADD AX,BX ;*80
Using LEA on the 386, the above could be reduced to
LEA EAX,[EAX*2] ;*2 LEA EAX,[EAX*8] ;*16 LEA EAX,[EAX+EAX*4] ;*80
which still isn’t as fast as using a lookup table like
MOV EAX,MultiplesOf80Table[EAX*4]
but is close and takes a great deal less space.
Of course, on the 386, the shift and add version could also be reduced to this considerably more efficient code:
SH LAX,4 ;*16 MOV BX,AX SHL AX,2 ;*64 ADD AX,BX ;*80
That brings us to multiplication, one of the slowest of x86 operations and one that allows for considerable optimization. One way to speed up multiplication is to use shift and add, LEA, or a lookup table to hard-code a multiplication operation for a fixed multiplier, as shown above. Another is to take advantage of the early-out feature of the 386 (and the 486, but in the interests of brevity I’ll just say “386” from now on) by arranging your operands so that the multiplier (always the rightmost operand following MUL or IMUL) is no larger than the other operand.
Why? Because the 386 processes one multiplier bit per cycle and immediately ends a multiplication when all significant bits of the multiplier have been processed, so fewer cycles are required to multiply a large multiplicand times a small multiplier than a small multiplicand times a large multiplier, by a factor of about 1 cycle for each significant multiplier bit eliminated. |
(There’s a minimum execution time on this trick; below 3 significant multiplier bits, no additional cycles are saved.) For example, multiplication of 32,767 times 1 is 12 cycles faster than multiplication of 1 times 32,727.
Choosing the right operand as the multiplier can work wonders. According to published specs, the 386 takes 38 cycles to multiply by a multiplier with 32 significant bits but only 9 cycles to multiply by a multiplier of 2, a performance improvement of more than four times! (My tests regularly indicate that multiplication takes 3 to 4 cycles longer than the specs indicate, but the cycle-per-bit advantage of smaller multipliers holds true nonetheless.)
This highlights another interesting point: MUL and IMUL on the 386 are so fast that alternative multiplication approaches, while generally still faster, are worthwhile only in truly time-critical code.
On 386SXs and uncached 386s, where code size can significantly affect performance due to instruction prefetching, the compact MUL and IMUL instructions can approach and in some cases even outperform the “optimized” alternatives. |
All in all, MUL and IMUL are reasonable performers on the 386, no longer to be avoided in most cases—and you can help that along by arranging your code to make the smaller operand the multiplier whenever you know which operand is smaller.
That doesn’t mean that your code should test and swap operands to make sure the smaller one is the multiplier; that rarely pays off. I’m speaking more of the case where you’re scaling an array up by a value that’s always in the range of, say, 2 to 10; because the scale value will always be small and the array elements may have any value, the scale value is the logical choice for the multiplier.
Rob Williams writes with a wonderful optimization to the REPNZ SCASB-based optimized searching routine I discussed in Chapter 5. As a quick refresher, I described searching a buffer for a text string as follows: Scan for the first byte of the text string with REPNZ SCASB, then use REPZ CMPS to check for a full match whenever REPNZ SCASB finds a match for the first character, as shown in Figure 9.1. The principle is that most buffer characters won’t match the first character of any given string, so REPNZ SCASB, by far the fastest way to search on the PC, can be used to eliminate most potential matches; each remaining potential match can then be checked in its entirety with REPZ CMPS.
Figure 9.1 Simple searching method for locating a text string.
Rob’s revelation, which he credits without explanation to Edgar Allen Poe (search nevermore?), was that by far the slowest part of the whole deal is handling REPNZ SCASB matches, which require checking the remainder of the string with REPZ CMPS and restarting REPNZ SCASB if no match is found.
Rob points out that the number of REPNZ SCASB matches can easily be reduced simply by scanning for the character in the searched-for string that appears least often in the buffer being searched. |
Imagine, if you will, that you’re searching for the string “EQUAL.” By my approach, you’d use REPNZ SCASB to scan for each occurrence of “E,” which crops up quite often in normal text. Rob points out that it would make more sense to scan for “Q,” then back up one character and check the whole string when a “Q” is found, as shown in Figure 9.2. “Q” is likely to occur much less often, resulting in many fewer whole-string checks and much faster processing.
Previous | Table of Contents | Next |