Overview of instruction-level parallelism

Published: 2019/10/16
Last edited: 2019/11/06

This talk by Josh Fisher is an overview of instruction-level parallelism (ILP). It is a great talk that covers a lot of material in an understandable way. Here are my notes from the talk.

ILP is the idea that, as a computer executes a stream of assembly instructions, it should run instructions that are independent of each other in parallel. It is a very important idea that has been around since the early days of computer architecture.
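To make "independent of each other" concrete, here is a minimal sketch (my own illustration, not from the talk) of a dependence check on hypothetical three-address instructions, each represented as a destination register plus source registers:

```python
# Hypothetical three-address instructions: (dest, srcs).
# Two instructions are independent (and can run in parallel) when
# neither one reads or writes the register the other one writes.
def independent(a, b):
    (da, sa), (db, sb) = a, b
    return (da != db            # no output dependence
            and da not in sb    # b does not read a's result
            and db not in sa)   # a does not read b's result

i1 = ("r1", ["r2", "r3"])   # r1 = r2 + r3
i2 = ("r4", ["r5", "r6"])   # r4 = r5 + r6  (independent of i1)
i3 = ("r7", ["r1", "r4"])   # r7 = r1 + r4  (reads i1's result)

print(independent(i1, i2))  # True
print(independent(i1, i3))  # False
```

A real compiler or chip must also handle memory dependences, not just registers, but the register case already captures the core idea.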

People wanted to find out how much ILP there is in typical programs, and conducted the following experiment: suppose you have a computer with infinitely many functional units. How much faster will ILP make your program? Several research groups looked into this theoretical speedup and found that it is only around 1.5x to 2.5x. The conclusion was that ILP is not a path worth pursuing. But these experiments assumed that the programs stay as they are; compiler transformations can change a program to expose more ILP.

Another experiment looking at theoretical speedups enumerated all possible paths through a program (assuming no loops in the program?), and got a set of branch-free programs from that. Assuming infinite computing power again, how fast can we run each program in the set? This time the speedup was 1 to 2 orders of magnitude. So, if you can guess which way the branches go in a program, you get many more opportunities for ILP and huge speedups. It turns out that the vast majority of branches are predictable in practice. In current hardware, the technique that exploits this predictability is called speculative execution.
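As an illustration of just how predictable branches are, here is a sketch of a classic two-bit saturating-counter predictor (a standard hardware technique; this example is mine, not from the talk). On a loop-closing branch, it mispredicts only the final fall-through:

```python
# Two-bit saturating counter: states 0-1 predict not-taken,
# states 2-3 predict taken. Returns the prediction accuracy over
# a sequence of observed branch outcomes.
def predict(history):
    state, correct = 2, 0
    for taken in history:
        if (state >= 2) == taken:
            correct += 1
        # nudge the counter toward the observed outcome
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / len(history)

# A loop branch: taken 99 times, then falls through once.
loop_branch = [True] * 99 + [False]
print(predict(loop_branch))  # 0.99 -- only the exit is mispredicted
```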

Say you want to build a chip with ILP. There are two possibilities on opposite ends of the design spectrum.

  1. You put smarts in the hardware to figure out which instructions are independent of each other, and execute them in parallel. This is called a superscalar architecture.
  2. You put smarts in the compiler to extract the parallelism, and the compiler tells the hardware which instructions to execute in parallel. This way, the hardware can be simpler. This is called a VLIW (very long instruction word) architecture. Instructions that execute together form a bundle. A bundle has a certain number of slots, and the number of slots is the upper bound on the number of instructions that can be executed in parallel. The instructions in different slots occupy different functional units when they are executed (and that's why their execution can happen in parallel).
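The compiler's side of the VLIW deal can be sketched as a greedy bundler (my own simplified illustration, with an assumed bundle width of 4): pack instructions into the current bundle until one depends on an instruction already in it, or the slots run out.

```python
SLOTS = 4  # assumed bundle width

def bundle(instrs):
    """Greedily pack (dest, srcs) instructions into VLIW bundles.
    A new bundle starts when an instruction reads or rewrites a
    register written by the current bundle, or the slots are full."""
    bundles, current, written = [], [], set()
    for dest, srcs in instrs:
        conflict = dest in written or any(s in written for s in srcs)
        if conflict or len(current) == SLOTS:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

prog = [("r1", ["r2"]), ("r3", ["r4"]),   # independent of each other
        ("r5", ["r1", "r3"])]             # needs both results
print(len(bundle(prog)))  # 2 bundles: {i1, i2} then {i3}
```

A production VLIW compiler does much more (it models latencies and functional-unit types, and reorders instructions rather than taking them in program order), but the bundle/slot structure is the same.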

Superscalar and VLIW are not the only two options. There is a spectrum of techniques in-between. E.g., he mentions a hybrid system where, for each instruction, the compiler figures out how many instructions immediately before it are independent of it. In the emitted code, each instruction is tagged with the number of prior independent instructions. Then, the hardware can try to execute some or all of those in parallel.
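My reading of that tagging scheme can be sketched as follows (a hypothetical illustration, not the actual system he describes): for each instruction, scan backwards and count consecutive independent predecessors, stopping at the first dependence.

```python
def tag(instrs):
    """For each (dest, srcs) instruction, count how many
    instructions immediately before it are independent of it,
    stopping at the first dependence."""
    tags = []
    for i, (dest, srcs) in enumerate(instrs):
        n = 0
        for pdest, psrcs in reversed(instrs[:i]):
            # dependent if the earlier instruction writes one of our
            # sources, writes our destination, or reads our destination
            if pdest in srcs or pdest == dest or dest in psrcs:
                break
            n += 1
        tags.append(n)
    return tags

prog = [("r1", ["r2"]), ("r3", ["r4"]), ("r5", ["r1"])]
print(tag(prog))  # [0, 1, 1]
```

The last instruction gets tag 1: the instruction right before it (`r3 = ...`) is independent, but the one before that writes `r1`, which it reads.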

A difficulty with VLIW is that there is no object-code compatibility between different generations of the hardware. The talk was given in 1992. I suppose this is not a problem anymore, because users often download their software from the internet; they don't have a CD or whatever other medium that needs to run on many generations of hardware. Also, VLIW is mostly used in specialized domains, not in typical CPUs, and in these domains recompiling the source code for new hardware is not an issue.

A difficulty with superscalar is that, the more functional units you have on the chip, the harder it is to find that many parallel instructions to execute in real time. The scheduling becomes really complicated.

Towards the end of the talk, he briefly discusses code generation techniques for ILP.

  • Function inlining and loop unrolling are important, because they produce long instruction sequences, with more opportunities for ILP.
  • Software pipelining: the compiler rewrites loop iterations in a way that overlaps memory transfers and compute.
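The shape of a software-pipelined loop can be sketched like this (my own illustration; on real hardware the overlapped stages genuinely run in parallel, while this Python version only shows the restructured prologue / kernel / epilogue):

```python
def pipelined(src):
    """A load -> compute -> store loop, rewritten so that each
    iteration's load sits next to the previous iteration's compute,
    where a scheduler could overlap them."""
    dst = []
    x = src[0]                    # prologue: first load
    for i in range(1, len(src)):
        nxt = src[i]              # load for iteration i ...
        dst.append(x * 2)         # ... beside the compute for i-1
        x = nxt
    dst.append(x * 2)             # epilogue: last compute/store
    return dst

print(pipelined([1, 2, 3]))  # [2, 4, 6]
```

The result is the same as the plain loop; what changes is that the kernel's load and compute touch different iterations, so they have no dependence between them.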

Comments

  1. Landed on this page. Great write up!

    BTW, software pipelining != parallelizing loop iterations.

    1. Thanks Amit! I rewrote the description to make it more accurate.

