Scalar Intermediate Representation

The IR code is included in src/ir/ of the compiler code base. The IR as designed in this compiler is the fruit of a long reflection I mostly had with Thomas Raoux. Note that I usually call it "Gen IR".

Scalar vs vector IR

This is actually the major question: do we need a vector IR or a scalar IR? On the LLVM side, we have both: LLVM IR can manipulate vectors and scalars (and even generalized values, but we can ignore those for now).

For that reason, the Clang front-end generates both scalar and vector code. Typically, a uint4 variable will yield a vector of 4 integers, and arithmetic computations will be done directly on vector variables.

On the HW side, the situation is completely different:

- regular Gen instructions are basically scalar: in SIMD mode, each instruction operates lane by lane on scalar values;
- the send instructions (used for memory reads and writes, samplers and so on) are the exception: they consume and produce contiguous vectors of registers.
One may therefore argue that we need vector instructions to handle the sends: sends indeed require both vector destinations and vector sources. This may be a strong argument for vectors in the IR. However, the situation is not that simple.

Indeed, if we look carefully at the send instructions, we see that they require vectors that are not vectors in LLVM IR. Take this code, for example:

__global uint4 *src;
uint4 x = src[get_global_id(0)];

will be translated into an untyped write in the Gen ISA. Unfortunately, the address and the values to write end up in the same vector, whereas LLVM IR will output a store like:

store(%addr, %value)

which basically uses one scalar (the address) and one value (the vector to write). Therefore, even if we handled vectors in the IR, that would not directly solve the problem we face at the end with the send instructions.

We therefore decided to go the other direction:

- the IR is fully scalarized: the vector loads, stores and arithmetic coming from LLVM IR are split into scalar operations;
- the vectors required by the send instructions are rebuilt at the very end, in the instruction selection and register allocation passes.

This leads to a very simple intermediate representation which is mostly a pure scalar RISC machine.
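
To make the scalarization concrete, here is a minimal Python sketch (the instruction tuples and register numbering are invented for illustration, not the actual Gen IR classes): a vector value is simply a run of consecutive virtual registers, and a uint4 addition lowers to four independent scalar ADDs.

```python
# A vector value is modeled as 4 consecutive virtual registers, so
# "uint4 c = a + b" becomes 4 independent scalar ADD instructions.
# (Instruction tuples and register numbering are invented for this sketch.)

def scalarize_add4(dst_base, a_base, b_base):
    """Lower a uint4 addition into scalar IR instructions."""
    return [("ADD", dst_base + lane, a_base + lane, b_base + lane)
            for lane in range(4)]
```

Since no IR instruction ever sees a vector, later passes only need to reason about scalars; only the final stages must remember which scalar registers belong together for a send.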

Very limited IR

The other major question, in particular when you look at similar stacks like NVIDIA PTX, is:

do we need to encode register modifiers (abs, negate...) and immediate operands (as in add.f x y 1.0) in the IR?

Contrary to other IRs (PTX, and even LLVM IR, which both support immediates), we chose to have a very simple IR, much simpler than the final ISA, and to merge back what we need in the instruction selection pass. Since we need instruction selection anyway, let us keep the IR simple.
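
As a sketch of what "merging back at instruction selection" means (the opcode names are hypothetical, not the real Gen ISA), a selector can pattern-match a constant load feeding an addition and emit an add-with-immediate, even though the IR itself has no immediate operands:

```python
# The IR only knows LOADI (load a constant into a register) and ADD on
# registers. The selector folds a LOADI that feeds an ADD into a single
# ADD_IMM. Simplification for this sketch: we assume the constant
# register has no other use, so the LOADI can be dropped.

def select(ir):
    imm = {}   # virtual register -> constant it holds
    out = []
    for insn in ir:
        if insn[0] == "LOADI":                     # ("LOADI", dst, value)
            imm[insn[1]] = insn[2]                 # remember it, emit nothing yet
        elif insn[0] == "ADD" and insn[3] in imm:  # ("ADD", dst, src0, src1)
            out.append(("ADD_IMM", insn[1], insn[2], imm[insn[3]]))
        else:
            out.append(insn)
    return out
```

The same pattern-matching idea extends to source modifiers: an IR-level negate or abs instruction feeding an arithmetic operation can be folded into the consumer at selection time.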

Also, there are many major issues that cannot be covered in the IR and require specific handling at the very end of the code generation:

For these reasons, most of the problems directly related to Gen naturally find their solutions in either the instruction selection or the register allocator.

This leads to the following strategy:

This classical approach limits the complexity of the IR while forcing us to write the proper tools in the final stages.

Why not use LLVM IR directly?

We hesitated a long time between writing a dedicated IR (as we did) and just using LLVM IR. Indeed, LLVM comes with a large set of tools that are part of the "LLVM backend" infrastructure: LLVM provides a lot of machinery to perform instruction selection (SelectionDAG) and register allocation. Two things however prevented us from choosing this path:

Since we would need to do significant work anyway, this led us to choose a more hard-coded path with an in-house IR. Note that this will not prevent us from later implementing an LLVM backend "by the book", as NVIDIA does today with PTX (using an LLVM backend to do the LLVM IR -> PTX conversion).

SSA or no SSA

Since we have a purely scalar IR, implementing an SSA transformation on it could be convenient. However, most of the literature about compiler back-ends uses non-SSA representations of the code. Since the primary goal is to write a compiler back-end (instruction selection, register allocation and instruction scheduling), we keep the code in non-SSA form, leaving the higher-level optimizations to LLVM.

Types, registers, instructions, functions and units

The IR is organized around a few simple objects: scalar types, virtual registers, scalar instructions, functions, and units that gather them.

Function arguments and pushed constants

Gen can push values into the register file, i.e. some registers are preset when the kernel starts to run. As detailed previously, the PTX ABI is convenient since every argument is either one register or one pointer to load from or store to.

However, when a pointer is used for an argument, loads are issued that could be avoided by pushing constants instead.

Once again, OCL makes the task a bit harder than expected: C semantics apply to function arguments as well.

Look at these three examples:

Case 1. Direct loads -> constant push can be used

struct foo { int x; int y; };
__kernel void case1(__global int *dst, struct foo bar)
{
  dst[get_global_id(0)] = bar.x + bar.y;
}

We use direct loads for bar with bar.x and bar.y. The values can be pushed into registers, and we can replace the loads by register reads.
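
A minimal sketch of this lowering (the instruction forms and register numbering are assumptions for illustration, not the actual lowering pass): each direct load at a known constant offset into the argument is rewritten into a read of the corresponding pushed register.

```python
# Each distinct constant offset into the argument gets a pushed register;
# direct loads then become plain register moves.
# (Instruction tuples are invented for this sketch.)

def lower_direct_loads(ir, push_base):
    pushed = {}   # argument offset -> pushed register
    out = []
    for insn in ir:
        if insn[0] == "LOAD_ARG":     # ("LOAD_ARG", dst, offset)
            _, dst, off = insn
            reg = pushed.setdefault(off, push_base + len(pushed))
            out.append(("MOV", dst, reg))
        else:
            out.append(insn)
    return out
```

For case1 above, the loads of bar.x and bar.y (offsets 0 and 4) would both turn into register reads, and no memory traffic remains for the argument.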

Case 2. Indirect loads -> we need to load the values from memory

struct foo { int x[16]; };
__kernel void case2(__global int *dst, struct foo bar)
{
  dst[get_global_id(0)] = bar.x[get_local_id(0)];
}

We use an indirect load with bar.x[get_local_id(0)]. Here we need to issue a load from memory (well, actually, we could do a gather from registers, but that is not supported yet).

Case 3. Writes to arguments -> we need to spill the values to memory first

struct foo { int x[16]; };
__kernel void case3(__global int *dst, struct foo bar)
{
  bar.x[0] = get_global_id(1);
  dst[get_global_id(0)] = bar.x[get_local_id(0)];
}

Here the values are written before being read. This causes some trouble since we are running in SIMD mode: we only have one instance of the function arguments in memory, while many SIMD lanes (and actually many hardware threads) are running at the same time. This means that we cannot write the data back to that shared copy; we need to allocate a private area for each SIMD lane.

In that case, we spill the function arguments back to memory, once per SIMD lane. Then we read from this private area rather than from the function arguments directly.
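
The addressing for such a per-lane private area can be sketched as follows (the back-to-back layout is an assumption for illustration): each lane's copy of the spilled struct lives at a fixed stride from a private base address.

```python
# Each SIMD lane owns one copy of the spilled argument struct, laid out
# back to back, so a lane accesses one of its fields at:
#   private_base + lane_id * struct_size + field_offset
# (This particular layout is an assumption made for the sketch.)

def lane_address(private_base, lane_id, struct_size, field_offset):
    return private_base + lane_id * struct_size + field_offset
```

With this scheme, the write in case3 lands in the current lane's copy only, and the subsequent indirect read is served from the same private copy, preserving C semantics per lane.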

This analysis is partially done today in src/ir/lowering.*pp. We identify all the cases, but only the constant-push case is fully implemented. The last two cases are actually easy to implement, but this requires one or two days of work.

Value and liveness analysis tools

You may also notice that we provide a complete framework for value analysis (i.e. to figure out when a value or instruction destination is used and where the instruction sources come from). The code is in src/ir/value.*pp. Well, today, this code burns a crazy amount of memory (std::set all over the place), but it at least provides the analysis required by many other passes. Compacting the data structures and using O(n) algorithms instead of the O(n log(n)) ones is definitely on the TODO list :-)

Finally, we also provide a liveness analysis tool which simply figures out which registers are alive at the end of each block (the classical "live out" sets).
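
The "live out" computation is the classic backward dataflow fixed point; here is a compact Python sketch (the block and CFG representations are invented for illustration, not the actual src/ir code):

```python
# Classic backward liveness on a CFG:
#   liveOut(b) = union of liveIn(s) over successors s of b
#   liveIn(b)  = use(b) | (liveOut(b) - def(b))
# Iterate until nothing changes; a fixed point is always reached
# because the sets only grow.

def live_out(blocks, succ):
    """blocks: {name: (use_set, def_set)}; succ: {name: [successor names]}."""
    live_in = {b: set() for b in blocks}
    out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b, (use, defs) in blocks.items():
            new_out = set().union(*(live_in[s] for s in succ[b]))
            new_in = use | (new_out - defs)
            if new_out != out[b] or new_in != live_in[b]:
                out[b], live_in[b] = new_out, new_in
                changed = True
    return out
```

For example, a register defined in the entry block and read inside a loop stays in the live-out set of both the entry and the loop block, which is exactly the information the register allocator needs.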
