# FedLLM-Bench: Benchmarking Federated Learning for Large Language Models

Large language models (LLMs) have succeeded across many domains, but training them centrally is costly because data must be collected and annotated in one place. Federated learning (FL) offers an alternative: multiple parties train a model collaboratively on decentralized data without sharing the raw data, which helps preserve privacy. Federated learning of LLMs (FedLLM), however, still lacks realistic benchmarks. Existing frameworks and methods do address data heterogeneity in FL, but the artificially constructed datasets they rely on fail to capture the properties of real-world data.
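To make the idea of collaborative training on decentralized data concrete, here is a minimal sketch of FedAvg-style aggregation, a standard federated averaging rule. It is an illustration only: the function name `fedavg`, the flat list-of-floats weight format, and the toy numbers are assumptions for this example, not code or settings from FedLLM-Bench.

```python
def fedavg(client_weights, client_sizes):
    """Average client parameters, weighting each client by its sample count.

    client_weights: list of parameter vectors (one flat list of floats per client).
    client_sizes:   list of local dataset sizes, in the same client order.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Toy example (hypothetical values): two clients, two parameters each.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]  # the second client holds three times as much data
global_weights = fedavg(client_weights, client_sizes)
print(global_weights)
```

Only the locally trained parameters and the sample counts reach the server in this sketch; the raw training examples stay with each client.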

Researchers from Shanghai Jiao Tong University, Tsinghua University, and Shanghai AI Laboratory introduce FedLLM-Bench, which they present as the first realistic benchmark for FedLLM. It includes four datasets partitioned by real-world user IDs, so the client splits capture properties such as language diversity, data quality, and user preferences. The benchmark also integrates eight baseline methods and six evaluation metrics to make comparisons easier and to support the exploration of new research directions.
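As a rough illustration of what splitting by user ID means, the sketch below groups records by a user identifier so that each user becomes one federated client. The field names (`user_id`, `lang`, `text`), the sample records, and the helper `partition_by_user` are hypothetical and are not taken from FedLLM-Bench.

```python
from collections import defaultdict


def partition_by_user(records, user_key="user_id"):
    """Group records so that each distinct user ID forms one client dataset."""
    clients = defaultdict(list)
    for record in records:
        clients[record[user_key]].append(record)
    return dict(clients)


# Hypothetical records; a real dataset would be loaded from disk.
records = [
    {"user_id": "u1", "lang": "en", "text": "How do I sort a list?"},
    {"user_id": "u2", "lang": "zh", "text": "如何排序列表?"},
    {"user_id": "u1", "lang": "en", "text": "What is a dictionary?"},
    {"user_id": "u3", "lang": "es", "text": "¿Qué es una tupla?"},
]

clients = partition_by_user(records)
for user_id, samples in clients.items():
    print(user_id, len(samples))
```

Because the split follows actual users rather than a synthetic sampling rule, differences in language, data volume, and style between clients come from the data itself.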

Extensive experiments on FedLLM-Bench show that federated methods outperform local training on average, and they point to open research directions such as language personalization and improving instruction-following ability. Federated methods consistently surpass local training in both single-turn and multi-turn conversations, with clear benefits across datasets. The benchmark aims to reduce research effort, enable fair comparisons between methods, and drive progress in FedLLM research.

The study introduces a comprehensive testbed for collaborative, privacy-preserving training of large language models, giving the research community a practical tool. The benchmark pairs diverse datasets with a range of training methods and evaluation metrics, and its results illustrate how federated learning can improve model performance on data that reflects real-world complexity.

