Elon Musk's xAI announces Grok-1.5 Vision, with multimodal capability

Key notes

  • Elon Musk’s xAI has announced Grok-1.5 Vision or Grok-1.5V.
  • Grok-1.5V is the company’s first multimodal model and will be available to early testers and existing Grok users soon.
  • Grok-1.5V can process text and visual information.

Last month, Elon Musk’s xAI launched the Grok-1.5 LLM, days after Google launched Gemini 1.5. While xAI claimed that the model comes close to GPT-4 performance, it lacked multimodal capability. The company’s newly announced Grok-1.5 Vision doesn’t have that limitation, as it can process both text and visual information.

What’s Grok-1.5 Vision (Grok-1.5V) and when will it be available?

Grok-1.5V is xAI’s first-generation multimodal model that aims to connect the digital and physical worlds. “Grok outperforms its peers in our new RealWorldQA benchmark that measures real-world spatial understanding,” the company said in a blog post. Additionally, Grok-1.5V can “process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs.”

For example, it can write code from a diagram, calculate calories, make up bedtime stories based on drawings, help you understand a meme, and more. xAI claims that Grok-1.5V outperforms rival models, including GPT-4V, Claude 3 Sonnet, Claude 3 Opus, and Gemini Pro, on the RealWorldQA benchmark.

Grok-1.5V isn’t available yet, but it’s coming soon to early testers and existing Grok users as a preview. While xAI hasn’t specified a launch date, it has promised to further advance “multimodal understanding” and “generation capabilities” and to bring improvements across modalities such as images, audio, and video.
