Microsoft's Phi-4-multimodal is slightly superior to Google's Gemini 2.0 Flash, benchmarks say
Two new small models have arrived
Key notes
- Microsoft launched Phi-4-multimodal (5.6B parameters) and Phi-4-mini (3.8B parameters).
- Phi-4-multimodal excels in speech, vision, and text tasks, achieving a 6.14% word error rate in speech recognition.
- Phi-4-mini focuses on text-based tasks, with a 128,000-token context window and strong performance in reasoning, math, and coding.
Microsoft has just welcomed two new additions to its Phi-4 model family: Phi-4-multimodal and Phi-4-mini. According to the Redmond tech giant, Phi-4-multimodal is slightly superior to Google’s new Gemini 2.0 Flash, which currently powers the Gemini chatbot.
Being a multimodal model means it can handle speech, vision, and text processing at once. Phi-4-multimodal, which has 5.6 billion parameters, outperforms Gemini 2.0 Flash in key benchmarks, including speech recognition, where it achieves a word error rate of just 6.14%, and speech translation.
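For context on that 6.14% figure: word error rate (WER) is the standard speech-recognition metric, computed as the word-level edit distance (substitutions, insertions, deletions) between the model's transcript and a reference transcript, divided by the number of reference words. A minimal sketch of the computation (example sentences are illustrative, not from Microsoft's benchmark):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words
    # into the first j hypothesis words (Levenshtein over words)
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One dropped word out of six reference words -> WER of 1/6 (~16.7%)
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 6.14% therefore means roughly six word-level errors per hundred reference words; lower is better.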
The Phi-4-multimodal also excels in visual reasoning, document understanding, and OCR, outperforming Gemini 2.0 Flash despite having fewer parameters. The model shows strong performance in mathematical and scientific reasoning.
Despite its compact size of just 5.6 billion parameters, the model stands out for its impressive performance in speech-related tasks, also outperforming larger models like WhisperV3 and SeamlessM4T-v2-Large.
Or at least it does in the cherry-picked benchmarks Microsoft chose to showcase the model.
In the same announcement, Microsoft also shipped another small model, Phi-4-mini, with 3.8 billion parameters, built for speed and efficiency. It excels in text-based tasks like reasoning, math, coding, and instruction-following, and can process sequences of up to 128,000 tokens.
Both models are available for developers to explore through platforms like Azure AI Foundry, HuggingFace, and NVIDIA’s API Catalog. They are designed for low-latency, on-device processing and offer a range of applications across industries, from smart devices to automotive systems.