Florence-2, Microsoft Azure AI's new model, is purpose-built for vision tasks

It comes in 232M and 771M parameter variants.


Key notes

  • Microsoft has released Florence-2, a versatile vision model on Hugging Face.
  • It handles diverse vision tasks using unified prompts, outperforming larger models.
  • Florence-2 integrates spatial and semantic understanding for efficient performance.
Florence-2 model

Microsoft has dropped yet another AI model, fresh off announcing the Phi-3 family not long ago. The Azure AI team has just launched Florence-2, a versatile, unified vision model that is now available on Hugging Face and GitHub.

In its announcement, Microsoft says that Florence-2 excels in various vision tasks, performing as well as or better than many larger models. It handles tasks such as captioning and object detection through a single prompt-based interface, and it was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images.
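Since the model is public on Hugging Face, that prompt-based workflow is easy to see in code. Below is a minimal sketch following the usage pattern from the model's Hugging Face page, assuming the transformers library is installed; the image URL is a placeholder, and task tokens such as "<OD>" and "<CAPTION>" select the task:

```python
# Minimal sketch of Florence-2's prompt-based usage, based on the
# pattern shown on the model's Hugging Face page. The image URL is a
# placeholder; the "<OD>" task token requests object detection.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # 771M variant; Florence-2-base is the 232M one
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
prompt = "<OD>"  # swap for "<CAPTION>" or "<DETAILED_CAPTION>" to change tasks

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw_output = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation comes with the model's custom processor code and
# turns the raw text output into structured results (boxes and labels for "<OD>")
result = processor.post_process_generation(
    raw_output, task=prompt, image_size=(image.width, image.height)
)
print(result)
```

Note that only the task token changes between tasks; the rest of the code stays the same, which is the unified-prompt design the announcement describes.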

“The model’s sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model,” Microsoft says.

Sure, traditional LLMs are great at handling text-based tasks flexibly. But they struggle with complex visual tasks that require spatial and fine-grained visual understanding. That makes a model like this crucial for tackling diverse visual challenges effectively.

Florence-2 comes with an open MIT license, meaning it’s free to use and modify. The model comes in two sizes, with 232 million and 771 million parameters. In Microsoft’s tests, it outperformed larger models such as DeepMind’s 80B-parameter Flamingo and even some of Microsoft’s own specialized models.

“Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data,” Microsoft’s researchers write in the model’s original paper.