Microsoft introduces Phi-2, a revolutionary model with 2.7 billion parameters
Microsoft has announced Phi-2, a language model with 2.7 billion parameters that sets a new standard for language understanding and reasoning among models with fewer than 13 billion parameters.
Continuing in the tradition of its predecessors, Phi-1 and Phi-1.5, Phi-2 matches or surpasses models up to 25 times larger. This feat is the result of innovations in scaling techniques and meticulous curation of training data.
The relatively compact size of Phi-2 makes it an ideal research tool, enabling in-depth work on mechanistic interpretability, safety improvements, and fine-tuning experiments across a variety of tasks.
The success of Phi-2 relies on two key elements:
- High-Quality Training Data:
Microsoft emphasizes the critical importance of high-quality training data for the model's performance. Phi-2 leverages "manually curated" data, focusing on synthetic datasets designed to develop common-sense reasoning and general knowledge. The training corpus is enriched with web data selected for its educational value and high-quality content.
- Innovative Scaling Techniques:
Building on its predecessor, Phi-1.5, Microsoft applied new scaling techniques: by transferring knowledge from the 1.3-billion-parameter model into Phi-2, training converges faster and benchmark scores improve significantly (see the sketch after this list).
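Microsoft has not published the details of this scaled knowledge transfer. As a purely illustrative sketch, one common way to bootstrap a larger model from a smaller trained one is to copy the smaller model's weights into the matching slices of the larger model's parameter tensors and leave the remaining entries at their fresh initialization. The helper `small_to_large_init` and the toy models below are hypothetical and are not Microsoft's actual procedure.

```python
import torch
import torch.nn as nn

def copy_into(large: torch.Tensor, small: torch.Tensor) -> None:
    """Copy a smaller weight tensor into the top-left slice of a larger one (illustrative only)."""
    slices = tuple(slice(0, s) for s in small.shape)
    with torch.no_grad():
        large[slices].copy_(small)

def small_to_large_init(small_model: nn.Module, large_model: nn.Module) -> None:
    """Hypothetical "knowledge transfer": seed a larger model with a smaller model's weights.

    Parameters whose names match and whose shapes fit are partially copied;
    everything else keeps its random initialization.
    """
    small_params = dict(small_model.named_parameters())
    for name, large_param in large_model.named_parameters():
        small_param = small_params.get(name)
        if (
            small_param is not None
            and small_param.dim() == large_param.dim()
            and all(s <= l for s, l in zip(small_param.shape, large_param.shape))
        ):
            copy_into(large_param, small_param)

# Toy usage: grow a 2-layer MLP from hidden size 128 to hidden size 256.
small = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 16))
large = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 16))
small_to_large_init(small, large)
```

At Phi-2's scale the real procedure almost certainly involves more than tensor slicing (different depths and widths, followed by continued pretraining on the full corpus), but the idea of reusing a smaller model's learned weights to speed up convergence is the same.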
Performance Evaluation
Phi-2 has undergone rigorous evaluation on a range of benchmarks, including BIG-Bench Hard, commonsense reasoning, language understanding, mathematics, and coding.
With only 2.7 billion parameters, Phi-2 outperforms larger models such as Mistral and Llama-2, and matches or exceeds Google's recently announced Gemini Nano 2.
Phi-2 also holds up in real-world use: tests with prompts commonly used by the research community show that it can solve physics problems and correct students' errors, highlighting its versatility beyond standard assessments.
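For readers who want to try such prompts themselves, the weights are published on Hugging Face under the id microsoft/phi-2. The snippet below is a minimal sketch using the transformers library; the physics prompt, the generation settings, and the use of device_map="auto" (which requires the accelerate package) are illustrative choices, not part of Microsoft's announcement.

```python
# Minimal sketch: prompting Phi-2 through Hugging Face transformers.
# Assumes `pip install transformers accelerate` and enough GPU/CPU memory for a 2.7B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load weights in the checkpoint's native precision
    device_map="auto",    # place layers on available devices (needs accelerate)
)

# An illustrative physics prompt of the kind mentioned in the announcement.
prompt = (
    "A ball is thrown straight up with an initial speed of 12 m/s. "
    "Ignoring air resistance, how high does it rise? Explain step by step."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```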
Phi-2 is a Transformer-based model trained on 1.4 trillion tokens from synthetic and web datasets; training ran for 14 days on 96 A100 GPUs. Microsoft states that the model was developed with safety in mind and that it surpasses available open-source models in reducing toxicity and bias.
With the announcement of Phi-2, Microsoft continues to push the boundaries of what more compact language models can achieve.