Optimize your algorithm first before you try these optimizations!
That's usually where you get the biggest speed increases.
Before:
' a very slowly converging formula for pi
' This takes a minute or so on a fast Pentium III.
r# = 0
s% = 1
FOR i& = 1 TO 90000000 STEP 2
r# = r# + s% * 1# / i&
s% = -s%
NEXT
pi! = CSNG(r# * 4)
PRINT pi!
After:
' a fast! constant time formula for pi
' This is faster than you can blink.
PI! = ATN(1) * 4
PRINT pi!
Once you've optimized the algorithm all that you can, you can start
looking at algebraic and loop optimizations.
The classic one is to use DEFINT A-Z. This forces you to use
as many integer variables as possible.
Use integer variables to index FOR loops. This may require
substitution and algebraic simplification.
Before:
FOR i! = 0 to 0.3 STEP 0.01
p! = i! * 3
NEXT
After:
FOR i% = 0 to 30
p! = i% * 0.03
NEXT
If your code has a lot of floating point calculations that need
high accuracy, compile with QB 4.0.
e.g. a raytracer
If your code has a lot of floating point calculations that don't
need more than 8 bits of accuracy, then definitely convert it to fixed
point. Even if it needs up to 16 bits of accuracy, it might be worth
converting to fixed point, if it is being used in the main loop.
e.g. a rotozoomer or voxel terrain.
Don't use IFs (conditional branches). Some comparison results
can be directly be used in a calculation. Note that in QB,
a TRUE boolean expression equals -1, and a FALSE one equals 0.
Before:
IF a > 4 THEN
b = 5
ELSE
b = 0
END IF
After:
b = -5 * (a > 4)
Actually, the above example is too simple for the After: version
to be faster. But for more complicated expressions involving
multiplication and division, it can make a difference.
Buffer your reads from a file. This is especially useful in a
non-disk-cached environment like DOS 4.0.
Before:
DIM c AS STRING * 1
OPEN "file.bin" FOR BINARY AS #1
FOR i = 0 TO 10 * 256
GET #1, , c
NEXT
CLOSE #1
After:
DIM buffer(0) AS STRING * 256
OPEN "file.bin" FOR BINARY AS #1
FOR i = 0 TO 10
GET #1, , buffer(0)
NEXT
CLOSE #1
Use an assembler keyboard handler or INP(&H60) plus keyboard buffer
clearing routines instead of INKEY$.
e.g. For a user controlled floormapper routine, this made a huge
difference in rending fps.
For a straight QB multikey handler, don't bother to clear the
keyboard buffer every vertical retrace. Instead, slow down
the keyboard repeat rate, and check every few frames.
e.g. This made a huge difference in QBMKEY.BAS
Use integer division for integers.
Before:
x% = x% / y%
After:
x% = x% \ y%
Make an integer division lookup table if there is a division
slowing down the inner loop.
Store the results of complicated expressions in look-up tables.
Before:
pi = ATN(1) * 4
DO
FOR i = 0 to 360
x! = 100 + COS(i * pi / 180!)
y! = 100 + SIN(i * pi / 180!)
PSET (x!, y!), c
NEXT i
LOOP UNTIL LEN(INKEY$)
After:
pi = ATN(1) * 4
Editor's note: I think some code is missing here...
If several complicated expressions in a loop has common subexpressions,
move the common subexpressions out of the loop.
Before:
FOR x = 1 to 32767
FOR y = 1 to 10
c = sin(x) * 30 + sqr(x) + y
NEXT y
NEXT x
After:
FOR x = 1 to 32767
xc = sin(x) * 30 + sqr(x)
FOR y = 1 to 10
c = xc + y
NEXT
NEXT
Make constants CONST. Unfortunately, you can't use transcendental
functions like ATN on the right side anymore.
Before:
pi = ATN(1) * 4
piover2 = pi / 2
After:
CONST pi = 3.14159265358979#
CONST piover2 = pi/2
Unroll short loops.
Before:
FOR a = 1 to 8
POKE a, a
NEXT
After:
POKE 1,1
POKE 2,2
POKE 3,3
POKE 4,4
POKE 5,5
POKE 6,6
POKE 7,7
POKE 8,8
Partially unroll long loops.
Before:
FOR x = 0 TO 319
POKE x,a
NEXT
After:
' this is a silly example, you should be using
' MMX filling or REP STOSB at least.
FOR x = 0 TO 319 STEP 4
POKE x, a
POKE x + 1, a
POKE x + 2, a
POKE x + 3, a
NEXT x
Move junk outside of the inner loops (code movement).
Before:
FOR y = 0 TO 199
FOR x = 0 TO 319
a = x * 4 + COS(t)
b = y * 3 + SIN(t)
NEXT
NEXT
After:
FOR y = 0 TO 199
b = y * 3 + SIN(t)
FOR x = 0 TO 319
a = x * 4 + COS(t)
NEXT
NEXT
Use cache sensitive programming. This means, try to
access your arrays in a sequential manner if possible.
If not, access them in small blocks that are adjacent to
each other. For example, QB arrays are usually stored in
a column major order, so dimension your arrays as
vscreen(xmax,ymax) if you are doing scanline-based algorithms,
and only change move in the x (scanline) direction in the
inner loop.
Before:
'$DYNAMIC
xmax = 319: ymax = 199
DIM buf(xmax, ymax)
FOR x = 0 TO xmax
FOR y = 0 TO ymax
buf(x, y) = INT(RND * 256)
NEXT
NEXT
DEF SEG
After:
'$DYNAMIC
xmax = 319: ymax = 199
DIM buf(xmax, ymax)
FOR y = 0 TO ymax
FOR x = 0 TO xmax
buf(x, y) = INT(RND * 256)
NEXT
NEXT
DEF SEG
Use a precalculated (canned) pseudo-random number sequence.
Before:
'main loop
FOR i = 1 TO 1000
x = INT(RND * 256)
y = INT(RND * 256)
c = INT(RND * 256)
PSET (x, y), c
NEXT i
After:
'precalculation
DIM rand(8191)
FOR i = 0 TO 8191
rand(i) = INT(RND * 256)
NEXT i
'main loop
count = 0
FOR i = 1 TO 1000
x = rand(count)
y = rand(count + 1)
c = rand(count + 2)
PSET (x, y), c
count = (count + 3) ' AND 8192 (needed in general)
NEXT i
Prefer array indexing over user defined TYPEs. (1)
Warning: This makes code unreadable unless it is well commented.
Cache often-used array elements in scalar variables. (2)
Cache intermediate values into temporary variables. (3)
Example of both optimizations being used.
Before:
TYPE PtType
x AS INTEGER
y AS INTEGER
z AS INTEGER
END TYPE
TYPE TriType
pt1 AS INTEGER 'index of first point in points array
pt2 AS INTEGER 'index of second point in points array
pt3 AS INTEGER 'index of third point in points array
END TYPE
DIM points(numpoints, 1 TO 3) AS PtType
DIM tri(numtriangles) AS TriType
CONST screendist = 200
CONST lightx = 1, lighty = 0, lightz = 0
CALL loadobject(filename$, points())
FOR i = 1 TO numtriangles
V1x = points(tri(i).pt2).x - points(tri(i).pt1).x
V2x = points(tri(i).pt3).x - points(tri(i).pt1).x
V1y = points(tri(i).pt2).y - points(tri(i).pt1).y
V2y = points(tri(i).pt3).y - points(tri(i).pt1).y
V1z = points(tri(i).pt2).z - points(tri(i).pt1).z
V2z = points(tri(i).pt3).z - points(tri(i).pt1).z
length1 = SQR(V1x * V1x + V1y * V1y + V1z + V1z)
length2 = SQR(V2x * V2x + V2y * V2y + V2z + V2z)
vx = V1y * V2z - V2y * V1z
vy = V2x * V1z - V1x * V2z
vz = V1x * V2y - V2x * V1y
CALL normalize(vx, vy, vz)
brightness = vx * lightx + vy * lighty + vz * lightz
xp1 = screendist * x1 / z1
yp1 = screendist * y1 / z1
xp2 = screendist * y1 / z1
yp2 = screendist * y2 / z2
'... and so on...
NEXT i
After:
' index 1 = x coordinate of point
' index 2 = y coordinate of point
' index 3 = z coordinate of point
DIM points(numpoints, 1 TO 3)
DIM tri(numtriangles, 1 TO 3)
CONST screendist = 200
CONST lightx = 1, lighty = 0, lightz = 0
CALL loadobject(filename$, points())
FOR i = 1 TO numtriangles
x1 = points(tri(i, 1), 1) ' example of optimization 1
y1 = points(tri(i, 1), 2) ' and optimization 2
z1 = points(tri(i, 1), 3)
x2 = points(tri(i, 2), 1)
y2 = points(tri(i, 2), 2)
z2 = points(tri(i, 2), 3)
x3 = points(tri(i, 2), 1)
y3 = points(tri(i, 2), 2)
z3 = points(tri(i, 2), 3)
V1x = (x2 - x1): V2x = (x3 - x1)
V1y = (y2 - y1): V2y = (y3 - y1)
V1z = (z2 - z1): V2z = (z3 - z1)
length1 = SQR(V1x * V1x + V1y * V1y + V1z + V1z)
length2 = SQR(V2x * V2x + V2y * V2y + V2z + V2z)
vx = V1y * V2z - V2y * V1z
vy = V2x * V1z - V1x * V2z
vz = V1x * V2y - V2x * V1y
CALL normalize(vx, vy, vz)
brightness = vx * lightx + vy * lighty + vz * lightz
xp1 = screendist * x1 / z1
yp1 = screendist * y1 / z1
xp2 = screendist * y1 / z1
yp2 = screendist * y2 / z2
'... and so on...
NEXT i
Use REDIM to clear a large array instead of using
a FOR loop to set each element to zero.
Before:
DIM x(32000)
DO
FOR i = 0 TO 32000
x(i) = 0 'clear array slowly
NEXT i
x(RND * 32000) = 50
x(RND * 32000) = 93
LOOP UNTIL LEN(INKEY$)
After:
DIM x(32000)
DO
REDIM x(32000) 'clear array faster
x(RND * 32000) = 50
x(RND * 32000) = 93
LOOP UNTIL LEN(INKEY$)
Avoid multidimensional arrays.
Use array head lookup tables like in the POKE vs. PSET example for
faster access of single dimension arrays as multidimensional ones.
Before:
DIM x(63, 63)
After:
DIM x(4095)
Don't waste an extra element. Unlike C arrays, the declaration of QB arrays
specify the first and last element indicies rather than the size of the array.
This matters when you want to make a 64KB array without using '$DYNAMIC.
Before:
DIM x(256, 256) 'allocate 66049 elements
After:
DIM x(0 TO 255, 0 TO 255) 'allocate 66536 elements
or
'OPTION BASE 0
DIM x(255, 255)
Use incremental calculation instead of evaluating the entire equation
every loop. This usually means multiplies will be replaced by addition.
It's very important that you do this in any linear interpolation
function you use for Gouraud Shading, Texture Mapping, etc. Most
line DDAs (digital difference analyzers) use this method.
Before:
slope! = 0.1
FOR x = 0 TO max
y! = slope! * x
NEXT x
After:
slope! = 0.1
y! = 0
FOR x = 0 TO max
y! = y! + slope!
NEXT x
Use POKE instead of PSET.
This is a simple way to get 2x performance in graphics intensive apps.
Before:
CONST xmax = 319, ymax = 199, scansize& = 320
FOR i = 0 TO 255
PSET (i, 0), i
NEXT i
FOR i = 0 TO 255
PSET (i, 10), i
NEXT i
After:
CONST xmax = 319, ymax = 199, scansize& = 320
DIM ytab&(ymax)
FOR y = 0 to ymax
ytab&(y) = y * scansize&
NEXT y
DEF SEG = &HA000
FOR i = 0 TO 255
POKE i, i
NEXT i
FOR i = 0 TO 255
POKE ytab&(10) + i, i
NEXT i
DEF SEG
Use INTEGER variables instead of LONGs for unsigned integers in
the range 0 to 65535. This will only work when the program is compiled.
PEEKing from video memory is slower than PEEKing from system
memory. Therefore, use double buffering when you need to do feedback effects.
Use DEF SEG sparingly. You don't need to DEF SEG back to the default
segment when you are accessing arrays in the default segment. DEF SEG
only applies to PEEK and POKE and SETMEM.
Don't use '$DYNAMIC.
QB arrays in the default segment are accessed at blazing speed, because
there is no segment switching. However, '$DYNAMIC puts them in different
segments, which need extra instructions to accessed, slowing them down.
This makes a big difference in programs that use large lookup tables
in their inner loop. It seems that huge arrays (allowed using the QB/AH
command) are the slowest to access.
Before:
'$DYNAMIC
DIM hugetable(319, 199)
FOR y = 0 TO 199
FOR x = 0 TO 319
xo = (x - 160) \ 2
yo = (y - 100) \ 2
hugetable(x, y) = xo * xo + yo * yo
NEXT
NEXT
After:
'$STATIC
DIM hugetable1(319, 99)
DIM hugetable2(319, 99)
FOR y = 0 TO 99
FOR x = 0 TO 319
xo = (x - 160) \ 2
yo = (y - 100) \ 2
hugetable(x, y) = xo * xo + yo * yo
NEXT
NEXT
FOR y = 100 TO 199
FOR x = 0 TO 319
xo = (x - 160) \ 2
yo = (y - 100) \ 2
hugetable(x, y - 100) = xo * xo + yo * yo
NEXT
NEXT
Use SELECT CASE instead of a bunch of ELSEIFs.
The only exception is when one case executes much
more often than the others.
Before:
IF i = 1 THEN
CALL DrawSprite
ELSEIF i = 6 THEN
CALL PlaySound
ELSEIF i > 9 AND i < 16 THEN
CALL Calculate(i)
ELSE
PRINT "."
END IF
After:
SELECT CASE i
CASE 1
CALL DrawSprite
CASE 6
CALL PlaySound
CASE 10 TO 15
CALL Calculate(i)
CASE ELSE
PRINT "."
END SELECT
Use AND instead of MOD for MODing by a power of 2.
Before:
a = b MOD 64
After:
a = b AND 63
Simplify compares against zero.
Before:
IF a <> 0 THEN
b% = b% - 1
END IF
After:
IF a% THEN 'note <>0 is gone
b% = b% - 1
END IF
Use -x to find the negative of a number instead of -1*x.
This is an obvious optimization if you know that the CPU
has a NEG instruction, which is faster than IMUL.
Don't put the main loop in the main code-- put it in a SUB.
This makes a difference in the IDE, probably because the
p-code interpreter has less variables to wade through when
you are in a SUB.
- Use static storage for non-recursive SUB parameters.
This makes very little improvement in speed, unless there are
tons of variables passed to a SUB.
Before:
SUB drawcircle(x%,y%,r%)
'routine to draw a circle
END SUB
After:
SUB drawcircle(x%,y%,r%) STATIC
'routine to draw a circle
END SUB
Pass dummy parameters to functions that take an odd number of
arguments in order to improve data alignment.
The dummy parameter is not used by the function, but is there
to encourage burst memory writes. This only makes a minimal
difference in speed.
Before:
CALL drawcircle(x%, y%, r%)
After:
CALL drawcircle(x%, y%, r%, dummy%)
Don't initialize QB array elements to zero. Warning: this is
a dangerous habit to get into, if you plan to use C or C++ later on.
This is because C does not initialize variables by default.
Before:
DIM div320(32767)
FOR i = 0 TO 32767
div320(i) = i \ 320
NEXT
After:
DIM div320(32767)
FOR i = 320 TO 32767
div320(i) = i \ 320
NEXT
For floating point, multiply by the reciprocal of a number instead of
dividing by a number.
Before:
SUB normalize (x!, y!, z!)
norm! = SQR(x! * x! + y! * y! + z! * z!)
x! = x! / norm!
y! = y! / norm!
z! = z! / norm!
END SUB
After:
SUB normalize (x!, y!, z!)
recipnorm! = 1 / SQR(x! * x! + y! * y! + z! * z!)
x! = x! * recipnorm!
y! = y! * recipnorm!
z! = z! * recipnorm!
END SUB
Simplify comparisons using simpler monotonic functions. Monotonic
functions are functions that always grow upwards or always grow downwards.
For example, x^2 is a monotonic function of x, so is 2*x. In the example,
an expensive square root was removed by squaring both sides, since squaring
is a monotonic function.
Before:
dist = SQR(x * x + y * y)
IF dist < radius THEN
'inside circle
END IF
After:
r2 = radius * radius
distsquared = x * x + y * y
IF distsquared < r2 THEN
'inside circle
END IF