Running Large Language Models at Scale on Groq's LPU Machine Learning Chips
Satnam Singh
Groq (Mountain View, California, USA)
Talk
Satnam Singh is a Fellow at Groq where he applies the power of functional programming languages to the design of machine learning chips and their programming models. He previously worked at Google (machine learning chips, cluster management), Facebook (Android optimization), Microsoft (parallel and concurrent programming), and Xilinx (Lava DSL for hardware design). He started his career as an academic at the University of Glasgow (FPGA-based application acceleration and functional programming).
AG 1, AG 2, AG 3, INET, AG 4, AG 5, D6, SWS, RG1, MMCI
Groq's core technology comprises silicon chips designed to accelerate machine learning inference, a compiler for programming these chips, and rack-scale deployments of foundation large language models (LLMs) with public API access. This presentation gives an overview of the Groq hardware architecture, with a focus on the deterministic characteristics that prove advantageous both for achieving very low latency implementations of open-weight foundation LLMs (e.g. Llama3-70B, Gemma2, Mixtral 8x7B) and for deploying large rack-scale systems with predictable performance. An overview will also be given of the compiler we have developed for the Groq architecture: the front end is based on MLIR (consuming ONNX from PyTorch, as well as our own linear algebra representation, with some support for TensorFlow/JAX), while the back end uses a custom intermediate representation and is written in Haskell. I'll say a few words about the specific things I have worked on personally, which include the design of power management hardware features, an experimental domain-specific language (DSL) for programming our chips written in Haskell, and the formal verification of our hardware using temporal logic and model checking (SystemVerilog Assertions).
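To give a flavour of the combinator style of hardware DSL mentioned above, the sketch below is a minimal, purely illustrative Haskell example in the spirit of Lava: circuits are ordinary Haskell functions over an embedded Signal type, so they can be composed and simulated with the host language. All of the names here (Signal, and2, halfAdder, simulate) are invented for this sketch and do not describe Groq's actual DSL or compiler.

  module TinyLava where

  -- Deep embedding of a combinational signal.
  data Signal
    = Input String
    | Low
    | High
    | And Signal Signal
    | Or  Signal Signal
    | Not Signal
    deriving Show

  -- Primitive gates expressed as ordinary Haskell functions.
  and2, or2 :: (Signal, Signal) -> Signal
  and2 (a, b) = And a b
  or2  (a, b) = Or a b

  inv :: Signal -> Signal
  inv = Not

  -- A circuit built by composition: a half adder producing (sum, carry).
  halfAdder :: (Signal, Signal) -> (Signal, Signal)
  halfAdder (a, b) = (xor2 (a, b), and2 (a, b))
    where xor2 (x, y) = or2 (and2 (x, inv y), and2 (inv x, y))

  -- Interpret a signal over a boolean environment (simulation).
  simulate :: [(String, Bool)] -> Signal -> Bool
  simulate env sig = case sig of
    Input n -> maybe (error ("unbound input: " ++ n)) id (lookup n env)
    Low     -> False
    High    -> True
    And a b -> simulate env a && simulate env b
    Or  a b -> simulate env a || simulate env b
    Not a   -> not (simulate env a)

  -- Example: simulate the half adder on inputs a=1, b=1.
  main :: IO ()
  main =
    let (s, c) = halfAdder (Input "a", Input "b")
        env    = [("a", True), ("b", True)]
    in print (simulate env s, simulate env c)  -- prints (False,True)

Because the circuit description is just a data structure, the same description can be simulated (as above) or walked to emit a netlist, which is the property that makes the embedded-DSL approach attractive for hardware design.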