Apple's ReALM AI model can 'see' and understand screen context; 'substantially outperformed' GPT-4



Key notes

  • Apple’s ReALM understands what’s on your screen and responds to your requests accordingly.
  • ReALM outperformed GPT-4 on various tasks despite having fewer parameters.
  • ReALM excels at understanding user intent for domain-specific queries.

Apple researchers unveiled a new AI system called ReALM that can understand what’s on your screen and respond to your requests accordingly.  This breakthrough comes after Apple acquired DarwinAI last month.

ReALM achieves this by converting information on your screen to text, allowing it to function on devices without requiring bulky image recognition. It can consider what’s on the screen and tasks running in the background.
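To illustrate the idea, here is a minimal sketch of what a text-only screen encoding could look like. All names and the output format are illustrative assumptions, not the paper's actual scheme: on-screen entities are sorted into reading order and flattened into plain text with indices, so a language model can resolve a reference like "call this business" without any image input.

```python
# Hypothetical sketch of a ReALM-style textual screen encoding.
# Entity types, fields, and the output format are assumptions for
# illustration; the paper's actual representation may differ.

def encode_screen(entities):
    """Flatten a list of UI entities into one text block for an LLM."""
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    ordered = sorted(entities, key=lambda e: (e["top"], e["left"]))
    lines = []
    for i, e in enumerate(ordered):
        # Tag each entity with an index so the model can refer back to it.
        lines.append(f'[{i}] {e["type"]}: "{e["text"]}"')
    return "\n".join(lines)

# Example screen: a business listing with a name, phone number, and button.
screen = [
    {"type": "title", "text": "Joe's Pizza", "top": 10, "left": 5},
    {"type": "phone", "text": "555-0123", "top": 40, "left": 5},
    {"type": "button", "text": "Directions", "top": 40, "left": 120},
]

print(encode_screen(screen))
```

Because the result is ordinary text, it can be prepended to the user's request and fed to a comparatively small on-device model, which is what lets this approach skip bulky image recognition entirely.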

According to a research paper, Apple’s larger ReALM models significantly surpassed GPT-4 despite having fewer parameters.

Imagine browsing a webpage and finding a business you’d like to call. With ReALM, you could tell Siri to “call this business,” and Siri would be able to “see” the phone number and initiate the call directly.

This is just one example of how ReALM’s understanding of on-screen information can improve user experience. By integrating ReALM into future Siri updates, Apple could create a more seamless and hands-free user experience. Apple also happens to be working on MM1, which can reduce the need for multiple prompts to get the desired result, and an AI image manipulator.

The research paper also details benchmarks where ReALM outperformed previous models on various datasets, including conversational, synthetic, and unseen conversational datasets. Notably, ReALM performed competitively with GPT-4 on tasks involving on-screen information, even though GPT-4 was given access to screenshots while ReALM relied solely on textual encoding.

The paper also explores the benefits of ReALM’s different model sizes. While all models perform better with more parameters, the improvement is most pronounced for processing on-screen information, suggesting that this is the most complex of the tasks.

When evaluating performance on completely new, unseen domains, both ReALM and GPT-4 showed similar results. However, ReALM outperformed GPT-4 when it came to domain-specific queries due to being fine-tuned on user requests. This allows ReALM to grasp the nuances of user intent and respond accordingly.

Overall, the research demonstrates how ReALM uses LLMs for reference resolution. ReALM can understand the user’s screen and their requests by converting on-screen entities into natural language text, even while remaining efficient for on-device applications.

While ReALM effectively encodes the position of on-screen entities, the researchers say that it might not capture every detail for intricate user queries requiring a complex understanding of spatial relationships.