on tuning code

preface

TODO: Add code snippets and profiling results to demonstrate the techniques.

This document demonstrates low-level code optimization techniques as well as simple data structures useful for high-performance data processing. The goal is to introduce the reader to some aspects of the magic of tuning code.

Compilers massage code, people tune code. There is a difference. The best compiler cannot undo a bad design. Tuning code may not be rocket science, but it's definitely art and lots of fun.

kiss

Keep it simple, stupid. The philosophy behind tuning code. CPUs are reasonably simple and good at doing simple things. The less you make them do, the faster the code will run.

The fact that algorithmic optimizations, i.e. picking efficient algorithms is the first part of optimization won't be given much attention in this document. Instead, I will concentrate on lower-level optimizations such as data organization.

I'm of the belief that tuning code is mostly about knowing the machine. Therefore, I will discuss CPU internals (on a generic level that should be widely applicable). You will see that there's more to efficient code than first meets the eye.

c language

The C language is one of the biggest contributions to software technology. C is basically as low-level as you can get without resorting to assembly. C has been given the nickname of "pseudo-assembly". The basic CPU environment of registers, memory, and I/O has been abstracted, but thinly. If this abstraction gets too much on your way, you can usually use inline assembly, i.e. mix assembly code with C.

You can, most of the time, think of variables as registers. Most C programs use pointers quite a bit; they can be considered memory addresses. In short, C programmers are dealing with what the CPU is; operations on and between registers and memory.

c integer operations


op      tick/op usecs   count
--      ------- -----   -----
++      6       17      1024
--      6       18      1024
+       6       17      1024
-       6       18      1024
*       8       21      1024
/       12      32      1024
%       36      86      1024

logical ops
op      tick/op usecs   count
--      ------- -----   -----
&       2       8       1024
|       2       8       1024
^       2       9       1024
~       2       8       1024
!       4       12      1024

notes

The speed measurements above were run on a single P3-450 CPU. Note that whereas most arithmetic operations are about equal in speed, integer division is a bit slower and modulus is very slow. Therefore, it's worth knowing that for integral x and power-of-two value p2,

y = x % p2;
is equivalent with
y = x & (p2 - 1);

To find out if x is a power of two, you can use

#define powerof2(x) (!((x) & ((x) - 1)))

In other words, you can replace the very slow modulus with very fast logical operations. Note how logical operations run quite a bit faster than arithmetic ones. ;)

c floating-point operations

float
-----
op      tick/op usecs   count
--      ------- -----   -----
+       9       23      1024
-       9       24      1024
*       109     250     1024
/       124     284     1024

double
------
op      tick/op usecs   count
--      ------- -----   -----
+       9       25      1024
-       9       25      1024
*       51      118     1024
/       125     286     1024

long double
-----------
op      tick/op usecs   count
--      ------- -----   -----
+       11      29      1024
-       12      30      1024
*       14      35      1024
/       40      95      1024


notes

As is the case with integer math, addition and subtraction are reasonably fast. However, division and modulus are very slow.

An interesting thing to notice is that added precision may even speed things up. :)

computer architecture

von neumann machine

A Von Neumann machine consists of three parts - CPU, memory, and I/O facilities.

dies and cores

A core is a CPU execution unit, basically a set of hardware registers and mathematical units that operate on them.

Multicore systems have several cores on the same die, i.e. the same package. So called SMP (symmetric multiprocessor) systems are more traditional. They consist of several dies with one core on each.

Most of the time, the programmer need not be concerned about the hardware details. It may, however, pay back to implement CPU-intensive applications using multiple threads in the hopes the system might be capable of executing several of the threads simultaneously.

alu

ALU, arithmetic-logical unit, is a central part of a CPU. Implementation is, naturally, architecture-dependent, but the unit is used for reasonably simple arithmetic and logical operations.

fpu

FPU, floating point unit, is used for floating-point calculations. There may, for example, be machine instructions for operations such as sine and cosine.

Even though the days of slow floating-point operations may be history, there's still a point in not using floating-point to minimize the number of registers that need to be saved on task switches.

As a special case, I'm trying to keep my kernel all-integer in order to speed up in-kernel context switches.

mmu

MMU, memory management unit, is a part of a CPU that handles [kernel-level] virtual memory. Virtual addresses can be mapped to arbitrary physical addresses to facilitate paging data in and out transparently to the programmer. MMUs implement schemes to protect memory from unwanted access.

As a fast form of IPC (interprocess communications), shared memory deserves to be mentioned. Shared memory can be implemented by mapping pieces of physical memory (RAM) to address spaces of several processes.

prefetch queue

Prefetch queues are used to read chunks of instruction data at a time. It's a good idea not to use many branching constructs, i.e. jump around in code, to keep the CPU from not having to flush its prefetch queue often.

pipeline

CPU pipelines are used to split instruction execution into several phases and try to execute different phases of adjacent instructions in parallel.

A common technique to schedule instructions is to avoid data dependencies in adjacent operations; targets of an operation shouldn't be sources for the next one. This will reduce pipeline stalls and keep the CPU running code faster.

basic techniques

inlining

Inlining is a technique to duplicate pieces of code in order to avoid function calls. Function calls cause overhead because register state needs to be saved and restored; As this overhead shouldn't be long on current computers, inlining should probably only be used for small operations.

To reduce code cache misses, it's probably better to avoid code-duplication from excess inlining.

If you can deal with the so-called debugging possibilities, macros are perhaps the best and definitely the most portable way to inline code.

loop unrolling

Unrolling a loop means executing several loop iterations in one actual loop iteration. This is done to avoid looping overhead, e.g. updating index variables and pipeline stalls.

constants

If you see a constant used repetitively in code, it may be a good idea to assign it to a variable to hint the compiler to store it into a register. This way, instructions may become shorter as there is no need to encode [big] constant values into instruction.

It may be fruitful to make constants as small as possible; the compiler should realize to know this, but I find code more appealing to read when hex constants of 2, 4, 8, or 16 digits are used to represent 8-bit, 16-bit, 32-bit, and 64-bit values respectively.

memory access

In short, memory access should be avoided. When you need to do it, try to avoid cache misses and keep your data aligned.

alignment

CPUs read memory a fixed number of bytes at a time. This means memory operands are fetched a given number of bytes at a time; you should minimize this number. For small-size (up to 64-bit) values, aligning to the boundary of the size of the type should be a good way to optimize memory accesses.

cache

A bit higher, CPUs read a cacheline's worth of data at a time. To avoid cache misses and so slow access to main physical memory, you should keep your data on as few cachelines as possible; in structures, try to keep related values close to each other.

simple data structures

bitmaps

Bitmaps are useful for storing simple state information. For example, if you really only have to differentiate two states from each other, say free and busy, it would be a big overhead to store boolean values into words.

lookup tables

If, for example, property determination gets complex, especially if it leads to lots of branching, it may be feasible to use lookup tables to store operation values that can be accessed in constant time; note, however, that there's a memory access which is often quite an expensive task.

As a useful trick, long switch statements can often be replaced byy a short piece of code to index a function pointer tables with the 'cases'. This way, the execution time will be constant.

hash tables

The idea of hash tables is to map given keys to indices into a [fixed-size] table. Key collisions can be dealt with simple linked lists. Picking a proper hash function for calculating keys, e.g. from character strings, should ensure that the lists (chains) don't get long so lookups will run fast. Note, though, that hash tables are practically impossible to sort without using other data structures to help. Hash tables may use lots of memory; pay attention to the table size.

stacks

Stacks are a very simple data structure, in some ways limited, but also very fast for some of simple operations.

A basic stack provides push and pop operations to store and receive the topmost item, respectively. This is easy to implement as a (possibly single-direction) linked list.

example

/* stk.c */

#include <stdio.h>

struct page *pagestk;

struct page {
    void        *addr;
    struct page *next;
};

#define pushpage(page) ((page)->next = pagestk, pagestk = (page))
#define poppage(tmp)   ((tmp) = pagestk, \
                        pagestk = ((tmp) ? (tmp)->next : NULL), \
                        (tmp))

int
main(int argc,
     char *argv)
{
    struct page *tmp;
    struct page  pg1 = { &pg1, NULL };
    struct page  pg2 = { &pg2, NULL };

    fprintf(stderr, "push\t%p\n", &pg1);
    pushpage(&pg1);
    fprintf(stderr, "push\t%p\n", &pg2);
    pushpage(&pg2);
    fprintf(stderr, "pop\t%p\n", poppage(tmp));
    fprintf(stderr, "pop\t%p\n", poppage(tmp));
    fprintf(stderr, "pop\t%p\n", poppage(tmp));

    exit(0);
}

tricks

/* zero page, 32-bit version. */
void
pzero(void *ptr)
{
    unsigned long  zero = 0;        /* do not embed big constant into loop. */
    unsigned long  cnt = PGSZ >> 4; /* loop count. */
    unsigned long *ulptr = ptr;     /* unsigned long at a time. */

    /*
     * - using small constants as pointer offsets is faster than variables
     *   for me.
     *   -vendu
     */
    while (cnt--) {
        ulptr[0] = ulptr[1] = ulptr[2] = ulptr[3] = zero;
        ulptr += 4;
    }

    return;
}

appendices

tools

valgrindgreat memory debugger (and cache simulator)
papiperformance monitor

suggested reading

hyde, randallwrite great code
warren, henry s jr.hacker's delight
kaspersky, kriscode optimization: effective memory usage
booth, rickinner loops

links

aggregatemagic algorithms
stanfordbit twiddling hacks
mithakmem

allocator

TODO: write a case study on zero kernel allocator.

Copyright (C) 2006 Tuomo Venäläinen