Summary of the paper "A Domain-Specific Architecture for Deep Neural Networks"

A team at Google developed a specialized processor for machine-learning inference, and deployed it in 2015. The processor is called a TPU (tensor processing unit). This post is a short summary of a recent paper in the Communications of the ACM describing the TPU.

Motivation for creating the TPU:

Moore's law, which states that the number of transistors on a chip doubles roughly every two years, has ended. Dennard scaling, the observation that power density stays constant as transistors shrink because the required voltage and current scale down with their dimensions, has also ended. Because of these trends, and because the multicore path has already been explored, we should not expect great efficiency gains from general-purpose processors. Architects now believe that the only remaining path to major performance improvements is domain-specific architectures: build a chip that does one thing, and does it well.

Main features of the TPU:

A single-core design built around a large matrix multiply unit, implemented as a systolic array. Quantized arithmetic: 8-bit integers instead of 32-bit floats. A CISC-style instruction set rather than RISC.
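
To make the quantized-arithmetic point concrete, below is a minimal sketch of multiplying two float matrices via 8-bit integers with wider integer accumulation. The symmetric scaling scheme here is my own simplification for illustration, not the TPU's actual quantization pipeline; the idea is just that the multipliers operate on 8-bit operands and accumulate into wider integer registers.

    import numpy as np

    # Illustrative only: simple symmetric int8 quantization, not the TPU's
    # actual scheme. Floats are scaled to 8-bit integers, multiplied and
    # accumulated in 32-bit integers, then rescaled back to float.

    def quantize_int8(x):
        """Quantize a float array to int8; return the values and the scale factor."""
        scale = np.max(np.abs(x)) / 127.0
        return np.round(x / scale).astype(np.int8), scale

    def quantized_matmul(a, b):
        """Multiply two float matrices using int8 inputs and int32 accumulation."""
        qa, sa = quantize_int8(a)
        qb, sb = quantize_int8(b)
        acc = qa.astype(np.int32) @ qb.astype(np.int32)   # integer multiply-accumulate
        return acc.astype(np.float32) * (sa * sb)         # rescale back to float

    a = np.random.randn(64, 128).astype(np.float32)
    b = np.random.randn(128, 32).astype(np.float32)
    print(np.abs(quantized_matmul(a, b) - a @ b).max())   # small quantization error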

Benefits of the TPU:

Compared to inference on a CPU or a GPU, they found that the TPU has much better performance per watt: about 30x better than the GPU and 80x better than the CPU. If I understand their "roofline" chart correctly, they found that even if the CPU and the GPU ran at peak throughput on their benchmarks, the TPU would still be much faster. Because the TPU is specialized for ML, it omits many features that complicate general-purpose processors and cost power: caches, branch prediction, out-of-order execution, multithreading, and context switching. Inference has strict latency targets, and it is easier to meet them with a single thread. Finally, 8-bit integers speed up computation and reduce memory bandwidth requirements compared to 32-bit floats.
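
For readers unfamiliar with the roofline model, here is a minimal sketch of the idea behind that chart. Attainable throughput is the lower of peak compute and memory bandwidth times operational intensity (operations per byte fetched from memory); the hardware numbers below are made-up placeholders, not figures from the paper.

    # Roofline model sketch. Peak throughput and bandwidth are hypothetical
    # placeholders, not the TPU/CPU/GPU figures from the paper.

    def roofline(peak_ops_per_sec, mem_bytes_per_sec, ops_per_byte):
        """Attainable ops/sec for a kernel with the given operational intensity."""
        return min(peak_ops_per_sec, mem_bytes_per_sec * ops_per_byte)

    peak = 90e12       # hypothetical peak compute: 90 tera-ops/sec
    bandwidth = 30e9   # hypothetical memory bandwidth: 30 GB/sec
    for intensity in (1, 10, 100, 1000, 10000):   # ops per byte
        print(f"{intensity:6d} ops/byte -> {roofline(peak, bandwidth, intensity) / 1e12:.2f} TOPS")

Plotting attainable throughput against operational intensity gives the slanted-then-flat "roofline"; each benchmark lands somewhere under it, limited either by memory bandwidth or by peak compute.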

Overall, a very interesting paper. Worth reading.

Comments

  1. Thanks for the summary. Speaking of domain-specific architectures, I thought this post by Herb Sutter (author of The Free Lunch is Over) was a good summary of what led to this point and why domain-specific processors are the way to go: https://herbsutter.com/welcome-to-the-jungle/

