For years, Apple has been intensively researching models designed to bring artificial intelligence directly to the device. With Ferret-UI Lite, the company has now introduced a model that, despite its compact size of only 3 billion parameters, sometimes outperforms competing models with up to 24 times as many parameters. The model is designed as an on-device solution: it runs entirely locally, sends no data to the cloud, and can interact independently with app interfaces. At first glance, this sounds unrealistic for such a small model. A closer look at the architecture and training strategy, however, quickly reveals why it works.
The story of Ferret-UI Lite doesn't begin with this model, but with a study from December 2023. At that time, a nine-member research team at Apple published a paper titled "FERRET: Refer and Ground Anything Anywhere at Any Granularity." In it, the researchers presented a multimodal large language model, or MLLM for short, capable of relating natural language descriptions to specific areas of an image and precisely identifying them. The basic idea was simple, yet effective: A model should understand when someone says "the red button in the top left" and be able to find and work with that exact element in the image.
Apple's Ferret model family
Apple built upon this foundation in the following months, successively developing Ferret v2, Ferret-UI, and Ferret-UI 2, with each new version adding new capabilities. Ferret-UI was the first variant designed specifically for mobile user interfaces. The original Ferret-UI model was based on 13 billion parameters and trained to understand screenshots from mobile devices, precisely the kind of interfaces people see on their smartphones every day. The researchers justified this focus by explaining that while general MLLMs perform well at analyzing natural images, they regularly fail on user interfaces. Icons, buttons, menus, and text elements in an app interface follow different rules than a photo of a dog or a landscape.
Ferret-UI 2 subsequently expanded the system with support for multiple platforms and higher-resolution image perception. With Ferret-UI Lite, Apple has now added a completely new direction to the series: instead of making the model ever larger and more powerful, the researchers have radically shrunk it and optimized it for use directly on the device.
Why a small model makes sense at all
Before delving into the technical details of Ferret-UI Lite, it's worth pausing on a fundamental question: why should Apple develop a small model that runs on-device at all when large, server-side models demonstrably perform better?
The answer lies in two factors: latency and privacy. A model running on a server requires a network connection, has to send data back and forth, and has a measurable delay. This is impractical for an agent that interacts with app interfaces on the device and responds to user input. Then there's the question of what data is actually being transmitted: screenshots, interaction logs, app content. These are all things that many people would prefer not to have on third-party servers. An on-device model completely avoids this problem because, quite simply, no data has to leave the device.
Apple has pursued this approach in recent years with Apple Intelligence, and Ferret-UI Lite is a clear expression of this strategy: powerful AI that runs locally.
What makes Ferret-UI Lite technically special
A model with 3 billion parameters
Ferret-UI Lite has 3 billion parameters. For comparison, the original Ferret-UI had 13 billion, and many of the server-side competitor models with which Ferret-UI Lite is compared have 7 billion, 13 billion, or even 72 billion parameters. The central finding of the study, supported by concrete benchmark results, is that a 3-billion-parameter model can keep pace with or even surpass these much larger models.
The researchers describe Ferret-UI Lite as a model "built on the basis of insights gained from training small language models with several key components." While this may sound abstract, it can be broken down into three concrete building blocks: diverse training data, an intelligent runtime image processing technique, and a combination of supervised and reinforcement learning.
Training data from real and synthetic sources
Ferret-UI Lite was trained using a mix of real and synthetic training data from multiple GUI domains. This means the model saw not only real screenshots and interactions, but also machine-generated examples specifically created for training. This combination is important because, while real data is realistic, it is often sparse in certain areas, whereas synthetic data can specifically fill in those gaps.
What's particularly interesting is how Apple generated the synthetic training data. To do this, the researchers developed a multi-stage, multi-agent system that interacts directly with live GUI platforms. This system consists of four components that work together: a curriculum task generator, a planning agent, a grounding agent, and a critique model.
The curriculum task generator suggests increasingly challenging task goals, ensuring that training doesn't stagnate on simple tasks. The planning agent breaks these goals down into concrete individual steps. The grounding agent executes these steps on the screen. Finally, the critique model evaluates whether the result was correct and only adds high-quality examples to the training dataset.
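To make the division of labor more tangible, here is a minimal Python sketch of how such a four-agent loop could be organized. All class names, method signatures, and the quality threshold are illustrative assumptions; Apple has not published its actual pipeline code.

```python
# Illustrative sketch of the four-agent data-generation loop described above.
# Names, interfaces, and thresholds are assumptions, not Apple's implementation.

from dataclasses import dataclass, field

@dataclass
class Trajectory:
    goal: str
    steps: list = field(default_factory=list)   # (instruction, action, screenshot) tuples
    score: float = 0.0

def generate_training_data(task_generator, planner, grounder, critic,
                           env, num_tasks=100, min_score=0.8):
    """Run the multi-agent loop against a live GUI and keep high-quality trajectories."""
    dataset = []
    for difficulty in range(num_tasks):
        # 1. Curriculum task generator: propose goals that get harder over time.
        goal = task_generator.propose(difficulty=difficulty)

        # 2. Planning agent: break the goal into concrete UI steps.
        plan = planner.decompose(goal, observation=env.screenshot())

        traj = Trajectory(goal=goal)
        for instruction in plan:
            # 3. Grounding agent: locate the target element and act on the live GUI.
            action = grounder.ground(instruction, env.screenshot())
            env.execute(action)
            traj.steps.append((instruction, action, env.screenshot()))

        # 4. Critique model: score the outcome; only high-quality trajectories
        #    (which may include informative failure-and-recovery sequences) are kept.
        traj.score = critic.evaluate(goal, traj.steps)
        if traj.score >= min_score:
            dataset.append(traj)
    return dataset
```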
What makes this approach particularly valuable is that the system captures the ambiguity of real-world interactions. It documents not only successful processes, but also errors, unexpected system states, and the strategies the model uses to resolve these states. This would hardly be possible to this extent with manually annotated data, because people tend to document clean, error-free processes, while real-world usage is more chaotic.
On-the-fly cropping and zooming
One of the most technically striking solutions in Ferret-UI Lite is its on-the-fly cropping and zooming technique. Small models face a fundamental problem when processing screen captures: they can only handle a limited number of image tokens at once. A full app screenshot, however, often contains many relevant details that all need to be in view at the same time.
Ferret-UI Lite solves this problem with a two-stage process. In the first step, the model makes a rough prediction about where the relevant information is located on the screen. Then, the area around this initial prediction is cropped and enlarged. The model then makes a new, more precise prediction within this cropped area. The result is an iterative process in which the model gradually narrows its focus, strategically utilizing its limited processing capacity instead of distributing it evenly across the entire screen.
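The following Python sketch illustrates this coarse-to-fine idea. The model interface (predict_box), the number of refinement rounds, and the crop margin are assumptions for illustration, not details from the paper.

```python
# Sketch of the two-stage crop-and-zoom grounding described above.
# The model interface and the margin value are illustrative assumptions.

from PIL import Image

def iterative_grounding(model, screenshot: Image.Image, query: str,
                        rounds: int = 2, margin: float = 0.15):
    """Coarse-to-fine localization: predict, crop around the prediction, re-predict."""
    region = (0, 0, screenshot.width, screenshot.height)  # current search window
    for _ in range(rounds):
        crop = screenshot.crop(region)
        # The model only sees the cropped (effectively zoomed-in) view, so its
        # limited image-token budget covers more detail per pixel.
        x0, y0, x1, y1 = model.predict_box(crop, query)  # box in crop coordinates

        # Map the box back to full-screenshot coordinates.
        bx0, by0 = region[0] + x0, region[1] + y0
        bx1, by1 = region[0] + x1, region[1] + y1

        # Expand the box by a margin and use it as the next, tighter search window.
        w, h = bx1 - bx0, by1 - by0
        region = (max(0, int(bx0 - margin * w)), max(0, int(by0 - margin * h)),
                  min(screenshot.width, int(bx1 + margin * w)),
                  min(screenshot.height, int(by1 + margin * h)))
    return (bx0, by0, bx1, by1)
```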
This technique is not entirely new, but its integration into an on-device model of this size and its consistent application to GUI grounding tasks is a clear step forward.
Supervised learning and reinforcement learning
Ferret-UI Lite combines two different training approaches. Supervised fine-tuning ensures that the model learns correct answers for clearly defined tasks. Reinforcement learning goes further, rewarding the model for behaviors that lead to good results, even if the exact path to those results wasn't predetermined. This combination is particularly useful for agents that have to deal with changing and unpredictable environments, which is the norm in GUI interactions.
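As a rough illustration of how the two phases differ, here is a simplified PyTorch sketch: a supervised step that learns a labeled action directly, and a REINFORCE-style step that reinforces sampled actions according to a reward. The model interface, reward function, and objective are assumptions; the paper does not disclose the exact RL formulation used.

```python
# Conceptual sketch of combining supervised fine-tuning (SFT) with a simple
# reward-weighted RL phase. Shapes, reward design, and the REINFORCE objective
# are illustrative assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def sft_step(model, optimizer, screenshots, target_actions):
    """Supervised phase: learn the correct action for clearly defined tasks."""
    logits = model(screenshots)                    # (batch, num_actions)
    loss = F.cross_entropy(logits, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_step(model, optimizer, screenshots, env_reward_fn):
    """RL phase: sample actions and reinforce those that lead to good outcomes,
    even when no single 'correct' path was predetermined."""
    logits = model(screenshots)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    rewards = env_reward_fn(actions)               # tensor of per-sample rewards
    # REINFORCE-style objective: raise the log-probability of rewarded actions.
    loss = -(dist.log_prob(actions) * rewards).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key difference is that the supervised step needs a ground-truth action for every screenshot, while the RL step only needs a scalar signal indicating whether the outcome was good, which is exactly what changing, unpredictable GUI environments tend to provide.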
Where Ferret-UI Lite was tested
An interesting aspect of the study is the choice of test environments. Ferret-UI and Ferret-UI 2 were primarily evaluated using iPhone screenshots and other Apple-specific interfaces. Ferret-UI Lite, on the other hand, was trained and evaluated on Android, web, and desktop GUI environments. AndroidWorld and OSWorld, two of the best-known and most reproducible test environments for GUI agent research, were used as benchmarks.
The researchers do not give an explicit reason for this change. However, it is likely that the availability of standardized and reproducible test environments played a key role. AndroidWorld and OSWorld are widely used in the research community and allow for direct comparison with other models, which is important for a scientific study.
Strengths and limitations of the model
What Ferret-UI Lite does well
Ferret-UI Lite performs strongly in short-term, clearly defined tasks. The model can accurately identify UI elements, respond to user requests, and independently perform simple interactions. In these areas, it sometimes outperforms models with many more parameters. This is a direct result of its specialized training strategy and crop-and-zoom technique.
Where the boundaries lie
The model's weaknesses become apparent in more complex, multi-stage interactions. Tasks requiring many consecutive steps, long-term planning, or the ability to react flexibly to unexpected intermediate results understandably overwhelm a 3-billion-parameter model more than a server-side model with 70 billion parameters. The researchers consider this compromise to be expected and do not portray it as a flaw, but rather as a conscious trade-off between model size, hardware requirements, and performance.
This means that Ferret-UI Lite is not a jack-of-all-trades, nor is it intended to be. It is a specialized, streamlined model for defined tasks that functions without a network connection or cloud infrastructure.
Apple's course towards private on-device AI
Ferret-UI Lite is a significant step in Apple's AI strategy. The model demonstrates that it's possible to develop a high-performance GUI agent in a format that runs on a device, protects privacy, and remains competitive. The combination of self-generated training data, a clever runtime image processing technique, and a robust training approach based on supervised and reinforcement learning makes Ferret-UI Lite technically interesting, even beyond the Apple context.
Apple has not yet announced whether or how it will integrate this technology into future products. However, the research direction is clear: powerful AI that runs locally on the device, transmits no data, and can interact autonomously with app interfaces. With Ferret-UI Lite, Apple has demonstrated that this approach works not only in theory but also in practice.