Lower Energy, High Performance LLM on FPGA Without Matrix Multiplication

A new technical paper titled “Scalable MatMul-free Language Modeling” was published by researchers at UC Santa Cruz, Soochow University, UC Davis, and LuxiTech. The paper notes that matrix multiplication (MatMul) dominates the computational cost of large language models (LLMs) and proposes a method for eliminating MatMul operations entirely while maintaining strong performance at billion-parameter scales. Experiments show that the MatMul-free models perform on par with state-of-the-art Transformers while requiring far less memory during inference. The authors also present a GPU-efficient implementation that reduces memory usage by up to 61% during training and by more than 10x during inference compared with an unoptimized baseline. In addition, they built a custom FPGA accelerator that processes billion-parameter-scale models at high efficiency. The work not only demonstrates how far LLMs can be stripped back while remaining effective, but also points to the operations future accelerators should optimize for when processing lightweight LLMs. The paper was released as a preprint in June 2024.
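The key to removing the MatMul, per the preprint, is constraining weights to the ternary set {-1, 0, +1}, so every dot product collapses into additions and subtractions. Below is a minimal sketch of that idea in plain Python/NumPy, not the authors' implementation; the function name and the dense reference check are illustrative only.

```python
import numpy as np

def ternary_linear(x, w_ternary):
    """MatMul-free linear layer: with weights restricted to {-1, 0, +1},
    y = x @ W reduces to adding, subtracting, or skipping input elements.
    x: (batch, in_features), w_ternary: (in_features, out_features)."""
    batch = x.shape[0]
    out_features = w_ternary.shape[1]
    y = np.zeros((batch, out_features), dtype=x.dtype)
    for j in range(out_features):
        col = w_ternary[:, j]
        # Add inputs where the weight is +1, subtract where it is -1;
        # zero weights contribute nothing. No multiplications occur.
        y[:, j] = x[:, col == 1].sum(axis=1) - x[:, col == -1].sum(axis=1)
    return y

# Sanity check against an ordinary dense MatMul using the same ternary weights.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8)).astype(np.float32)
w = rng.integers(-1, 2, size=(8, 4)).astype(np.float32)  # entries in {-1, 0, 1}
assert np.allclose(ternary_linear(x, w), x @ w, atol=1e-5)
```

This add/subtract structure is also what makes dedicated hardware attractive: an accelerator built for ternary accumulation needs adders rather than multiplier arrays, which is plausibly where the FPGA design's efficiency comes from.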

Source link: https://semiengineering.com/lower-energy-high-performance-llm-on-fpga-without-matrix-multiplication/
