
Alibaba reveals network and datacenter design for language model. #AItraining


Alibaba has unveiled its datacenter design for Large Language Model (LLM) training, built around an Ethernet-based network in which each host contains eight GPUs and nine NICs. The design aims to make full use of each GPU's PCIe bandwidth and to maximize the host's network send/receive capacity. NVLink handles the intra-host network, and NIC ports are connected to different top-of-rack switches so the design avoids single points of failure.

Each pod contains 15,000 GPUs and is housed in a single datacenter building with a power constraint of 18 MW. Alibaba's cooling mechanism uses a vapor chamber heat sink with wicked pillars to manage chip temperatures efficiently.

The company anticipates model parameter counts growing from one trillion to 10 trillion, and the new architecture is planned to scale to 100,000 GPUs. This network architecture addresses the specific traffic patterns and fault-tolerance requirements of LLM training, providing scalability and high performance. The design has been in production for eight months and is tailored to the demands of training large language models efficiently.
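The figures above imply some simple topology arithmetic. A minimal sketch, using only the numbers quoted in the article (the variable names and the calculations are our own illustration, not Alibaba's published design):

```python
# Back-of-envelope scale of the design as summarized above.
# All constants come from the article; everything else is assumed.

GPUS_PER_HOST = 8       # eight GPUs per host
NICS_PER_HOST = 9       # nine NICs per host
GPUS_PER_POD = 15_000   # one pod, one 18 MW building
TARGET_GPUS = 100_000   # scale the architecture is planned to support

hosts_per_pod = GPUS_PER_POD // GPUS_PER_HOST        # 1,875 hosts
nics_per_pod = hosts_per_pod * NICS_PER_HOST         # 16,875 NICs
pods_for_target = -(-TARGET_GPUS // GPUS_PER_POD)    # ceiling division: 7 pods

print(f"{hosts_per_pod} hosts/pod, {nics_per_pod} NICs/pod, "
      f"{pods_for_target} pods to reach {TARGET_GPUS:,} GPUs")
```

Even at this rough granularity, the NIC count per pod shows why fault tolerance at the top-of-rack layer matters: tens of thousands of links per building make single points of failure unacceptable.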


Source link: https://www.techradar.com/pro/website-hosting/alibaba-unveils-its-network-and-datacenter-design-for-large-language-model-training

