Florence-2, Microsoft Azure AI's new model, is purpose-built for vision tasks

It comes in 232M and 771M parameter variants.


Key notes

  • Microsoft has released Florence-2, a versatile vision model on Hugging Face.
  • It handles diverse vision tasks using unified prompts, outperforming larger models.
  • Florence-2 integrates spatial and semantic understanding for efficient performance.
Florence-2 model

Microsoft has dropped yet another AI model, fresh off announcing the Phi-3 family not long ago. The Azure AI team has just launched Florence-2, a versatile, unified vision model that is now available on Hugging Face and GitHub.

In its announcement, Microsoft says that Florence-2 excels in various vision tasks, performing as well as or better than many larger models. It handles tasks such as captioning and object detection through a single prompt-based interface, and it was trained on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images.
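Since the model is public on Hugging Face, that prompt-based workflow is easy to see in code. Below is a minimal sketch following the usage pattern from the model's Hugging Face page, assuming the transformers library is installed; the image URL is a placeholder, and task tokens such as "<OD>" and "<CAPTION>" select the task:

```python
# Minimal sketch of Florence-2's prompt-based usage, based on the
# pattern shown on the model's Hugging Face page. The image URL is a
# placeholder; the "<OD>" task token requests object detection.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"  # 771M variant; Florence-2-base is the 232M one
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/photo.jpg", stream=True).raw)
prompt = "<OD>"  # swap for "<CAPTION>" or "<DETAILED_CAPTION>" to change tasks

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw_output = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# post_process_generation comes with the model's custom processor code and
# turns the raw text output into structured results (boxes and labels for "<OD>")
result = processor.post_process_generation(
    raw_output, task=prompt, image_size=(image.width, image.height)
)
print(result)
```

Note that only the task token changes between tasks; the rest of the code stays the same, which is the unified-prompt design the announcement describes.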

“The model’s sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model,” Microsoft says.

Sure, traditional LLMs are great at handling text-based tasks flexibly. But they struggle with complex visual tasks that require spatial and fine-grained visual understanding. That makes a model like this crucial for tackling diverse visual challenges effectively.

Florence-2 comes with an open MIT license, meaning it’s free to use and modify. The model comes in two sizes, with 232 million and 771 million parameters. In Microsoft’s tests, it outperformed larger models such as DeepMind’s 80B-parameter Flamingo and even some of Microsoft’s own specialized models.

“Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data,” Microsoft’s researchers write in the model’s original paper.