Alibaba Cloud unveils datacenter design, homebrew network for LLM training #cloudcomputing

Alibaba Cloud has developed a specialized Ethernet-based network for training large language models, which has been in production for eight months. Ethernet was chosen to avoid vendor lock-in and to benefit from the rapid evolution of the broader Ethernet ecosystem. The network design, called High Performance Network (HPN), addresses issues such as hash polarization and single points of failure in AI infrastructure.
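Hash polarization, one of the problems HPN targets, occurs when successive switch tiers use the same ECMP hash function: the second tier's path choice is fully determined by the first, so some paths never carry traffic. The minimal sketch below illustrates the effect (the hash function, two-tier topology, and salting fix are illustrative assumptions, not details from the article):

```python
import hashlib

def ecmp_choice(flow, num_paths, salt=""):
    # Pick a next-hop by hashing the flow ID (plus an optional per-switch salt).
    digest = hashlib.md5((salt + flow).encode()).hexdigest()
    return int(digest, 16) % num_paths

flows = [f"10.0.0.{i}:5000->10.0.1.{i}:80" for i in range(1000)]

# Both tiers do 2-way ECMP with the SAME hash: the second decision always
# mirrors the first, so only 2 of the 4 path combinations ever get used.
used_polarized = set()
for f in flows:
    first = ecmp_choice(f, 2)
    second = ecmp_choice(f, 2)   # identical hash -> identical decision
    used_polarized.add((first, second))

# A per-switch salt de-correlates the tiers and restores even spreading.
used_salted = set()
for f in flows:
    first = ecmp_choice(f, 2, salt="tier1")
    second = ecmp_choice(f, 2, salt="tier2")
    used_salted.add((first, second))

print(len(used_polarized))  # 2 of 4 path combinations used
print(len(used_salted))     # all 4 used
```

Real switches mitigate this by perturbing the hash seed per device; the salt here plays that role.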

Each host used for training contains eight GPUs and nine network interface cards, with a dedicated network for intra-host communication. The design aims to maximize GPU capabilities and network throughput. Alibaba Cloud prefers single-chip switches for their stability and lower failure rates compared to multi-chip switches.
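The 8-GPU/9-NIC split can be pictured as a simple routing rule. The sketch below assumes (this split is an illustration, not confirmed by the summary) that eight NICs are pinned one-per-GPU for inter-host training traffic, with the ninth handling everything else; NIC names are hypothetical:

```python
NUM_GPUS = 8
# Assumed layout: GPU i gets dedicated backend NIC eth_i; eth8 is shared.
backend_nics = {gpu: f"eth{gpu}" for gpu in range(NUM_GPUS)}
frontend_nic = "eth8"

def nic_for(traffic_type, gpu=None):
    """Send training traffic out the GPU's dedicated NIC; everything else
    (storage, management, frontend) uses the shared ninth NIC."""
    if traffic_type == "training":
        return backend_nics[gpu]
    return frontend_nic

print(nic_for("training", gpu=3))  # eth3
print(nic_for("storage"))          # eth8
```

Pinning a NIC per GPU keeps inter-host GPU traffic off a shared bottleneck, which is consistent with the stated goal of maximizing GPU capability and network throughput.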

The network design includes a DIY heatsink to prevent switches from overheating, as well as pods housing 15,000 GPUs in a single datacenter building. The company is already planning for the next generation of network architecture with higher capacity switches.
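Combining the two figures the article gives (15,000 GPUs per pod, eight GPUs per host) yields a quick back-of-the-envelope host count:

```python
# Hosts per pod, using only the two figures stated in the article.
gpus_per_pod = 15_000
gpus_per_host = 8
hosts_per_pod = gpus_per_pod // gpus_per_host
print(hosts_per_pod)  # 1875 hosts in a single datacenter building
```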

Alibaba Cloud’s training of large language models relies on distributed training clusters spanning tens of thousands of GPUs. The company’s Qwen model, with 110 billion parameters, indicates the scale of its operations and the need for continued expansion. The network design and infrastructure improvements aim to support the growing demands of AI workloads in the future.

Source link: https://www.theregister.com/AMP/2024/06/27/alibaba_network_datacenter_designs_revealed/
