Instruction scheduling for LEON3

The LEON3 is a 32-bit microprocessor designed by Aeroflex Gaisler. The processor is available as a VHDL model under the GPL license.

The LEON3 implements the SPARC V8 instruction set. While the SPARC architecture describes the semantics of all machine instructions, it does not specify the amount of time it takes to execute instructions. Indeed, instruction timing varies among different implementations of the SPARC architecture. But detailed knowledge of instruction timing is essential for low-level code optimization.

The first part of this article describes instruction timing of the LEON3. The second part investigates an implementation of LEON3-specific instruction scheduling in GCC.

The LEON3 instruction pipeline

This section looks into instruction timing of the LEON3, i.e. the number of clock cycles used to execute specific instructions. Most of this information can also be found in the LEON3 datasheet.

Several factors contribute to the amount of time needed to execute a code fragment. To get a first estimate, one could simply add the intrinsic execution times of the individual instructions in the code. However, for pipelined processors the full picture is more complicated.

In the LEON3, each instruction goes through several stages during its execution. Pipelining means that the execution of adjacent instructions is overlapped in time. While a new instruction executes its first stage, the previous instruction executes its second stage and so on. If the execution of an instruction requires the result of a previous instruction which is still being processed, a pipeline stall may occur. In that case the processor inserts extra delay cycles until the required result is available. As a consequence, the execution time of a program depends not only on the instructions used, but also on the order in which the instructions occur.

Another performance factor is bus access time: the time needed to transfer data between the processor and main memory. Unfortunately, it is very difficult to take this into account. One reason is that it is hard to predict whether an access will cause a cache hit or a cache miss; another is that bus timing depends on the entire system, not just the processor. Bus access time is therefore ignored in this article. We simply pretend that every access is a cache hit, so no bus transfers are ever needed. It should be noted that this is a simplification. Performance analyses of real systems should definitely take cache effects and bus transfers into account.

Instruction timing model

The following table describes instruction cycle timing of the LEON3 integer unit. For each category of instructions, the table mentions the intrinsic execution time of the instruction, followed by possible pipeline effects. These data are valid for GRLIB 1.1.0; load delay set to 1 (lddel=1); MMU uses fast write (tlb_type=2).

Data movement, integer arithmetic, bitwise logic and shift
  1 cycle per instruction.
  No pipeline stalls between any instructions in this group. The result of one instruction can be used immediately in the next instruction.

Load instructions: LD, LDD, LDUB, ...
  1 cycle (word, half-word, byte); except for LDD (load double-word), which takes 2 cycles.
  +1 extra cycle if the next instruction depends on the result of the load.
  However, if the next instruction is a store, the extra cycle is only taken if the result of the load is used to calculate the address of the store, not when it is merely the data to be stored.
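
For example (register choices are arbitrary):

    LD  [%g1], %g2
    ST  %g2, [%g3]       ! no stall: the loaded value is only the store data

    LD  [%g1], %g2
    ST  %g4, [%g2]       ! +1 cycle: the loaded value forms the store address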

Store instructions: ST, STD, STH, ...
  2 cycles (word, half-word, byte); except for STD (store double-word), which takes 3 cycles.
  If the MMU is enabled, +2 extra cycles for some processor configurations (shared TLB, or split TLB without fast write buffer).

Unconditional branch (B, BA)
  1 cycle to execute the branch.

Annulling the branch delay slot does not improve performance. An annulled delay slot takes 1 extra cycle, just as if the delay slot were not annulled and contained a NOP.
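
For example, both of the following forms take 2 cycles:

    BA,A  target         ! annulled delay slot

    BA    target
    NOP                  ! delay slot containing NOP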

Conditional branch (BE, BNE, BG, ...)
  1 cycle to execute the branch.
  +2 extra cycles if the immediately preceding instruction modifies condition codes;
  otherwise, +1 extra cycle if the second preceding instruction modifies condition codes.

If static branch prediction is enabled, a taken conditional branch (i.e. the condition turns out to be true) does not use extra cycles, even if condition codes were modified just before the branch. Not-taken conditional branches still take up to 2 extra cycles, as described above.

To be clear: when I say preceding instruction, I mean the instruction appearing before the branch; I don't mean the instruction in the delay slot. For example, in the sequence [ CMP %g0, 1 ; BE somewhere ; NOP ], I would say there are zero instructions between compare and branch, therefore this code hits a 2-cycle pipeline stall.
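
For example, spelling out the rule with unrelated filler instructions (register choices are arbitrary):

    CMP %g1, 20          ! modifies condition codes
    BE  somewhere        ! +2 cycles: codes modified immediately before

    CMP %g1, 20
    ADD %g2, %g3, %g4    ! one unrelated instruction in between
    BE  somewhere        ! +1 cycle

    CMP %g1, 20
    ADD %g2, %g3, %g4
    SUB %g5, %g6, %g7    ! two unrelated instructions in between
    BE  somewhere        ! no extra cycles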

Annulling the branch delay slot does not improve performance. (Similar to unconditional branching.)

JMPL and CALL
  3 cycles for JMPL (including its synthetic friends JMP, RET, RETL, and CALL to an absolute address).
  1 cycle for CALL to a PC-relative address (this is the real CALL instruction).
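
For example (RETL is the synthetic form of JMPL %o7+8, %g0):

    CALL  subroutine     ! PC-relative call: 1 cycle
    NOP                  ! delay slot

    RETL                 ! JMPL-based return: 3 cycles
    NOP                  ! delay slot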

Multiplication: SMUL, UMUL (SPARC V8)
  1 cycle if the 2-cycle (32x32 pipelined) multiplier is configured.
  4 cycles if the 4-cycle (16x16 standard) multiplier is configured.
  5 cycles if the 5-cycle (16x16 pipelined) multiplier is configured.
  +1 extra cycle if the next instruction depends on the result of the MUL.
  No extra cycle is taken if the next instruction reads the Y register (RDY or MOV %y, ...).
  If the next instruction is a store, the extra cycle is only taken if the result of the MUL is used to calculate the address of the store, not when it is merely the data to be stored.

If MULCC appears shortly before a conditional branch, in certain cases 1 extra cycle is taken in addition to the normal rules for conditional branching.

Funny fact: MUL immediately followed by DIV takes 2 extra cycles.

Division: SDIV, UDIV (SPARC V8)
  35 cycles to execute DIV.
  +1 extra cycle if the next instruction depends on the result of the DIV.

NOTE: Some instructions are not covered here: atomic load/store, SWAP, SAVE, RESTORE, traps, floating-point instructions. See the LEON3 datasheet for details.

Optimized instruction scheduling

There is often some flexibility in the order of instructions in a program. For example, if two adjacent instructions operate on different registers, they can change places without changing the meaning of the program. Rearranging such instructions may change the performance of the program as a result of pipeline effects.

Careful ordering of instructions thus provides an opportunity for optimization. This is called instruction scheduling, a well-known optimization technique implemented in many compilers.

The LEON3 has only a few cases where timing is affected by instruction order. These cases can be summarized in the form of the following three guidelines:

  1. Avoid using the result of a load immediately in the next instruction. Try to put at least one unrelated instruction in between. This rule does not apply if the load is immediately followed by a store to a non-dependent address, but it does apply if the result of the load is used to calculate the address of the store.

    In the following example, SUB depends on LD, which can be solved by swapping ADD and SUB.

    original:
        LD  [%g3], %g1
        (1 cycle stall)
        SUB %g1, 7, %g2
        ADD %g3, 4, %g3
        4 cycles total

    optimized:
        LD  [%g3], %g1
        ADD %g3, 4, %g3
        SUB %g1, 7, %g2
        3 cycles total
  2. Avoid using the result of multiplication immediately in the next instruction. Try to put at least one unrelated instruction in between. Reading the Y register immediately after multiplication is fine; this does not cause a pipeline stall.

    In the following example, ADDCC depends on MUL. This can be solved by moving MOV %y, but only at the expense of using an extra temporary register.

    original:
        MUL   %g1, %g1, %g2
        (1 cycle stall)
        ADDCC %g2, %g5, %g5
        MOV   %y, %g2
        ADDX  %g2, %g4, %g4
        9 cycles total

    optimized:
        MUL   %g1, %g1, %g2
        MOV   %y, %g3
        ADDCC %g2, %g5, %g5
        ADDX  %g3, %g4, %g4
        8 cycles total
  3. Avoid testing a condition immediately before the conditional branch. Try to put at least two unrelated instructions between the test and the branch. If static branch prediction is enabled, this rule does not apply to branches which are almost always taken (loops, for example). Static branch prediction is a relatively new feature of LEON3 processors, introduced in GRLIB 1.0.22.

    In the following example, there is a 2-cycle stall between CMP and BE. This can be improved by swapping CMP and ADD, which incidentally also solves a load stall.

    original:
        LD  [%g1], %g2
        (1 cycle stall)
        ADD %g2, %g3, %g3
        CMP %g1, 20
        (2 cycles stall)
        BE  loop
        ADD %g1, 4, %g1
        8 cycles total

    optimized:
        LD  [%g1], %g2
        CMP %g1, 20
        ADD %g2, %g3, %g3
        (1 cycle stall)
        BE  loop
        ADD %g1, 4, %g1
        6 cycles total

These guidelines may be useful during manual optimization of SPARC assembly code. Of course, such manual optimizations are only sensible in cases where performance is extremely important. It is usually impossible to follow the guidelines completely, because dependencies between instructions already enforce a certain order. If there are many inter-instruction dependencies, there are few opportunities to change the order of instructions without changing the meaning of the program.

Measuring instruction timing

The information presented above is based partly on the LEON3 datasheet, and partly on actual measurements. Tests were done with a LEON3 soft-processor in a Xilinx Spartan-3 FPGA on a Pender GR-XC3S1500 board.

A test program (leon3perf.c) executes several sequences of SPARC instructions. A timer measures the number of clock cycles needed for each sequence. Information about the execution time of individual instructions can be derived by comparing the execution times of different code sequences.

The first part of the test determines the intrinsic execution time of various SPARC instructions, ignoring pipeline effects. This is done by measuring the time needed to execute a selected instruction, surrounded by a large number of NOP instructions.

The second part of the test looks for pipeline stalls. It measures the time needed to execute a sequence of two selected instructions, separated by a variable number of NOPs. Several combinations of instructions are tried, with different kinds of data dependencies between the two instructions. A pipeline stall is revealed when a sequence uses more cycles than the sum of the intrinsic execution times of all instructions in the sequence.

Finally, conditional branches are tested. This is done by executing an instruction that modifies condition codes, followed by a variable number of NOPs, followed by a conditional branch.

test intrinsic time:
    NOP ; NOP ; NOP
    MUL %g1, %g1, %g2
    NOP ; NOP ; NOP

test data dependency:
    NOP ; NOP ; NOP
    MUL %g1, %g1, %g2
    (insert 0 to 3 NOPs)
    MOV %g2, %g3
    NOP ; NOP ; NOP

test conditional branch:
    NOP ; NOP ; NOP
    ADDCC %g1, %g1, %g2
    (insert 0 to 3 NOPs)
    BE skip
    NOP
skip:
    NOP ; NOP ; NOP
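
In outline, one such measurement looks like the sketch below. This is a simplified illustration, not the actual leon3perf.c: the timer address is an assumption (GRLIB designs conventionally map a GPTIMER at base address 0x80000300, with the timer 1 counter value at offset 0x10, counting down), it presumes the timer ticks once per processor clock with interrupts disabled, and the function name is illustrative. The overhead of the timer reads cancels out when two sequences are compared.

    #include <stdint.h>

    /* Assumed GRLIB GPTIMER mapping: timer 1 counter value register.
       The counter decrements; it must be set up elsewhere to tick
       once per processor clock. */
    #define TIMER1_VALUE (*(volatile uint32_t *)0x80000310)

    /* Time one test sequence: MUL followed by a dependent MOV.
       Repeating this with 0 to 3 NOPs between MUL and MOV, and
       subtracting a NOP-only reference run, exposes the stall. */
    static uint32_t time_mul_dependency(void)
    {
        uint32_t start, end;

        start = TIMER1_VALUE;
        __asm__ __volatile__(
            "nop; nop; nop\n\t"
            "umul %%o4, %%o4, %%o5\n\t"   /* instruction under test */
            "mov  %%o5, %%o4\n\t"         /* depends on the MUL result */
            "nop; nop; nop\n\t"
            : : : "%o4", "%o5", "memory");
        end = TIMER1_VALUE;

        return start - end;               /* down-counting timer */
    }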

Tuning GCC for LEON3

Instruction scheduling is a common optimization strategy in many compilers, including GCC. The compiler needs a model of the instruction pipeline in the target processor in order to perform effective instruction scheduling. Because different SPARC implementations have significantly different pipelines, the model must be designed for a specific SPARC model.

Adding LEON3-specific scheduling to GCC

Several SPARC models are already supported by GCC. These models are selected through a command line switch, for example -mtune=supersparc or -mtune=hypersparc. However, the LEON3 is not yet one of the supported SPARC processors.
(Note: GCC 4.6 now includes support for LEON processors, developed by Gaisler, not related to the tests described in this article.)

As an experiment, I added LEON3-specific instruction scheduling to GCC 4.5.2. This new optimization is enabled via -mtune=leon3. The following changes were made in the GCC source code.

  • A machine description file, leon3.md, was created. This file describes the LEON3 instruction timing model, written in a Lisp-like language.

    GCC provides a generic infrastructure for modelling the instruction pipeline of a target processor. This infrastructure is sufficiently powerful to describe complicated super-scalar architectures. The LEON3 is a relatively simple processor, and can be described quite easily within GCC's framework; a simplified sketch of such a description follows after this list.

  • The top-level files for the SPARC architecture, sparc.c, sparc.h and sparc.md were modified to add references to the LEON3 pipeline description. These files also contain cost tables, used by the compiler to guide code generation. A new cost table was added for the LEON3.
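
To give a flavour of such a description, here is a simplified sketch. This is not the actual leon3.md from the patch: it assumes a "leon3" value for the cpu attribute, uses instruction type attributes that sparc.md already defines, and models the 5-cycle multiplier configuration.

    ;; One in-order integer pipeline. A load gets latency 2, so an
    ;; instruction that consumes the loaded value in the next slot
    ;; costs one extra cycle; DIV occupies the pipeline for 35 cycles.

    (define_automaton "leon3")
    (define_cpu_unit "leon3_ip" "leon3")

    (define_insn_reservation "leon3_ialu" 1
      (and (eq_attr "cpu" "leon3") (eq_attr "type" "ialu,shift"))
      "leon3_ip")

    (define_insn_reservation "leon3_load" 2
      (and (eq_attr "cpu" "leon3") (eq_attr "type" "load,sload"))
      "leon3_ip")

    (define_insn_reservation "leon3_imul" 6
      (and (eq_attr "cpu" "leon3") (eq_attr "type" "imul"))
      "leon3_ip*5")

    (define_insn_reservation "leon3_idiv" 36
      (and (eq_attr "cpu" "leon3") (eq_attr "type" "idiv"))
      "leon3_ip*35")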

These changes are available here in the form of a patch against GCC-4.5.2: gcc-4.5.2_leon3_tune6.diff

Benchmark results of LEON3-specific scheduling

The effect of the LEON3-specific changes in GCC was measured with a series of benchmarks, consisting of seven programs. Source code of these benchmarks is available here: benchmark_20110131.tar.gz.

The benchmarks were run on a LEON3 soft-processor, using a Pender GR-XC3S1500 board. Each benchmark runs as an application under RTEMS 4.10.

Benchmark setup
  Processor:    40 MHz LEON3, GRLIB 1.1.0, 5-cycle multiplier, no branch prediction,
                no FPU, 2-way instruction cache, 2-way data cache
  Compiler:     GCC-4.5.2 for RTEMS 4.10, newlib 1.18.0
  Optimization: either -O2, -O3 or -Os, whichever worked best for the baseline run

  Baseline run:     -mcpu=v8 -msoft-float
                    Table shows run time in seconds, or Dhrystone score.
  Leon3 (app):      -mcpu=v8 -msoft-float -mtune=leon3
                    Application recompiled for LEON3, but libraries not recompiled.
                    Table shows improvement over baseline run.
  Leon3 (app+libs): -mcpu=v8 -msoft-float -mtune=leon3
                    Libraries (libgcc and newlib) also recompiled with LEON3-specific optimizations.
                    Table shows improvement over baseline run.

Benchmark results

  benchmark     opt  baseline   leon3 (app)   leon3 (app+libs)
  bench_sort    -O2  145.69 s   3.7 % faster  3.9 % faster
  bench_aes     -O2   30.17 s   3.2 % faster  3.2 % faster
  bench_sat     -O3  279.48 s   2.7 % faster  3.4 % faster
  bench_odelco  -Os   10.65 s   4.1 % faster  4.1 % faster
  bench_eigen   -O2  107.38 s   no change     3.5 % faster
  bench_zip     -O2   31.26 s   2.6 % faster  2.7 % faster
  dhrystone     -O2  59995      1.0 % faster  3.9 % faster

  bench_sort:   several sorting algorithms (integers only), written by Paul Hsieh and Attractive Chaos. See this blog.
  bench_aes:    AES encryption with a 256-bit key, using code from OpenSSL 0.9.8.
  bench_sat:    MiniSat 2.2.0 boolean satisfiability solver; the input is frb30-15-1.cnf from the DIMACS SAT benchmark.
  bench_odelco: digital IIR filters (integers only).
  bench_eigen:  solving a 10x10 linear system through LU decomposition with Eigen 2.0.15 (software floating-point emulation).
  bench_zip:    compression and decompression with zlib 1.2.5.
  dhrystone:    the classic Dhrystone benchmark by Reinhold P. Weicker, version 2.1. http://www.netlib.org/benchmark/dhry-c

The LEON3-specific optimizations show improvements of 1% to 4% compared to the baseline. This is nice, but not as good as I had hoped.

I should mention that these results may be a little biased, because I used the same benchmarks to solve problems in earlier versions of the LEON3-specific tuning.

With earlier versions of the LEON3-specific optimizations, some of the benchmarks would run slower than with default settings. In one case, it turned out that the instruction scheduler increased the number of temporary registers, thus increasing spilling. Recent versions of GCC have an option (-fsched-pressure) which avoids spilling due to scheduling.

In another case, the scheduler had moved a number of store instructions closer to a following load instruction. If a load instruction causes a cache miss, the LEON3 needs to flush its write buffers before it can continue. By reducing the average distance between store instructions and the nearest following load instruction, the scheduler accidentally increased the cost of cache-miss processing. Both of these problems disappeared after upgrading from GCC 4.4.5 to GCC 4.5.2.

The tests performed so far have used only integer instructions. It would be interesting to test the effect of instruction scheduling on floating point instructions. The pipeline latency of GRFPU operations may present additional opportunities for performance improvements.

References