
Gemma 2: A New Addition to the Gemma Series

Papers Explained 157: Gemma 2 | by Ritvik Rastogi | Jul 2024

Gemma 2 is a new addition to the Gemma family, introducing technical modifications such as interleaved local-global attention and grouped-query attention (GQA). The smaller models are trained with knowledge distillation from a larger teacher instead of plain next-token prediction, resulting in better performance for their size. The models are available on HuggingFace.

Gemma 2 models have a context length of 8192 tokens, use Rotary Position Embeddings (RoPE), and employ an approximated GeGLU non-linearity. Notable differences from Gemma 1 include alternating between local sliding-window and global attention across layers, using RMSNorm to stabilize training, and soft-capping attention logits.

The pre-training data consists primarily of English text from a variety of sources. Gemma 2 models are fine-tuned with the same control tokens as Gemma 1, but with a different formatting schema.

Pre-training evaluations show that Gemma 2 models outperform comparable, and in some cases larger, models in their category. Post-training evaluations indicate that Gemma 2 27B sets a new state of the art for open-weights models of its size, while Gemma 2 9B outperforms other models in its weight class. Post-training involves supervised fine-tuning, reinforcement learning from human feedback, and model merging. The Gemma 2 report provides more detailed information on the models. Recommended reading: Gemini/Gemma Models and Small LLMs.
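To make the alternating attention pattern concrete, here is a minimal sketch of how per-layer causal masks could be built. The even/odd layer split, the helper name, and the mask construction are illustrative assumptions; the 4096-token sliding window and 8192-token global span are the values reported for Gemma 2.

```python
import torch

def gemma2_style_mask(seq_len: int, layer_idx: int, window: int = 4096) -> torch.Tensor:
    """Boolean causal mask: even layers use a local sliding window, odd layers are global."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    causal = j <= i                         # standard causal mask: no attending to future tokens
    if layer_idx % 2 == 0:                  # assumed: even layers are the local ones
        return causal & (i - j < window)    # additionally restrict to the last `window` tokens
    return causal                           # global layer: attends over the full context

# Tiny example: an 8-token sequence with a 4-token window on layer 0.
print(gemma2_style_mask(8, layer_idx=0, window=4).int())
```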
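The logit capping mentioned above is a soft cap rather than a hard clip. The tanh formulation and the cap values (50.0 for attention logits, 30.0 for the final-layer logits) follow the Gemma 2 report; the surrounding code is a sketch.

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes values into (-cap, cap): approximately the identity for
    # small logits, saturating for large ones, which keeps scores numerically stable.
    return cap * torch.tanh(logits / cap)

scores = torch.randn(4, 4) * 100           # exaggerated raw attention logits
capped_attn = soft_cap(scores, cap=50.0)   # attention-logit cap from the report
# Final-layer logits are capped the same way with cap=30.0.
```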
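The distillation objective can be sketched as follows: instead of a one-hot next-token target, the student matches the teacher's full next-token distribution by minimizing the cross-entropy against the teacher's probabilities (equivalent to KL divergence up to a constant). The tensor shapes and function name below are illustrative.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the student against the teacher's soft next-token targets.

    Both tensors have shape (batch, seq_len, vocab_size).
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)      # soft targets P_T(x | context)
    student_logp = F.log_softmax(student_logits, dim=-1)   # log P_S(x | context)
    return -(teacher_probs * student_logp).sum(-1).mean()  # average over batch and positions
```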
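Since the weights are on HuggingFace, a checkpoint can be loaded with the transformers library; the model id below is the instruction-tuned 9B variant as published at release, and `apply_chat_template` wraps messages in the `<start_of_turn>`/`<end_of_turn>` control tokens shared with Gemma 1. Treat this as a usage sketch, not an official snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-9b-it"  # instruction-tuned 9B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# apply_chat_template emits Gemma's control tokens around the message:
# <start_of_turn>user ... <end_of_turn><start_of_turn>model
messages = [{"role": "user", "content": "Summarize grouped-query attention."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt")
output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```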


Source link: https://ritvik19.medium.com/papers-explained-157-gemma-2-f1b75b56b9f2?source=rss——artificial_intelligence-5

