The article walks through three miniature LLMs, nano_gpt, torch_gpt, and mini_llama3, each a scaled-down version of a full model with its own parameter count and setup. Every model is self-contained in a single file and has distinct characteristics in tokenization, embedding, and data flow. The training code and parameter initialization are similar across all three, and because the models are so small they can be pre-trained easily, even together, on a consumer-grade GPU, using the shakespeare_char dataset.

The models differ in their tokenizers and embedding layers: mini_llama3 incorporates the ColumnParallelLinear, RowParallelLinear, and VocabParallelEmbedding layers from Meta's fairscale package. It also trains in bf16, which gives it a smaller checkpoint than the other two models. A comparison of validation loss against iteration number shows that mini_llama3 converges quickly but may overfit; further parameter search and tuning are expected to improve its validation loss. Example generations from all three models are also included.
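As a back-of-the-envelope illustration of why a bf16 checkpoint comes out smaller than an fp32 one, the sketch below estimates raw weight storage from bytes per parameter; the parameter count used here is hypothetical and not taken from the article.

```python
def checkpoint_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate on-disk size of raw model weights in megabytes."""
    return n_params * bytes_per_param / 1e6

# Hypothetical ~10M-parameter mini model (illustrative, not from the article).
n_params = 10_000_000
fp32_mb = checkpoint_size_mb(n_params, 4)  # float32: 4 bytes per parameter
bf16_mb = checkpoint_size_mb(n_params, 2)  # bfloat16: 2 bytes per parameter

# bf16 halves the bytes per parameter, so the checkpoint is roughly half
# the size (real checkpoints also carry optimizer state and metadata).
print(f"fp32: {fp32_mb:.0f} MB, bf16: {bf16_mb:.0f} MB")
```

In practice a saved checkpoint also includes optimizer state, so the real ratio depends on what is stored, but the weight tensors themselves shrink by the ratio of the dtypes' widths.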
Source link: https://whatdhack.medium.com/pre-training-mini-versions-of-llms-gpt-and-llama3-7cf69ac00280?source=rss——llm-5