Apple has been researching the interface between humans and machines for years. The user-friendliness of its products is at the core of its brand. A new research report shows how the company is now using artificial intelligence to address a long-standing problem in human-computer interaction: understanding app interfaces. In collaboration with Aalto University in Finland, Apple has presented a new AI model that can not only recognize app user interfaces but also understand their content. The model is called ILuvUI.
ILuvUI is a vision language model (VLM) that processes visual information from screenshots and natural language at the same time. It builds on the open-source LLaVA model but has been adapted specifically for graphical user interfaces. The goal is an artificial intelligence that understands complex app interfaces the way humans do: the AI should not only recognize what is on the screen, but also grasp what individual elements mean and how to interact with them.
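ILuvUI itself has not been released, but because it builds on LLaVA, the basic interaction pattern can be sketched with the open LLaVA 1.5 checkpoint. The model name, prompt format, and screenshot path below are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a screenshot-plus-question query against a LLaVA-style
# vision language model. ILuvUI is not public, so the open LLaVA 1.5
# checkpoint stands in here; the file name is hypothetical.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # open baseline, not ILuvUI itself
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

screenshot = Image.open("settings_screen.png")  # hypothetical app screenshot
prompt = "USER: <image>\nWhich control on this screen enables dark mode? ASSISTANT:"

inputs = processor(images=screenshot, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```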
A model for app interfaces, not for street dogs
Many existing vision language models are trained to interpret natural images such as animals, buildings, or traffic signs. App user interfaces, however, pose entirely different challenges. A single screen can contain multiple layers of information at once (lists, buttons, checkboxes, text input fields) whose meaning often depends on context. Conventional models deliver results of only limited use here because they are not optimized for this type of content. ILuvUI addresses this gap: it has been trained specifically to analyze structured user interfaces, combining the visual structure of the interface with additional text input in natural language. The result is a significantly more precise understanding of app interfaces.
How ILuvUI was trained
The research team adapted LLaVA for this purpose. First, they created synthetic text-image pairs, i.e. screenshots of apps combined with matching descriptive text. They also used so-called "golden examples" containing especially precisely worded interactions. The final dataset contained several types of information; a sketch of what a single training example might look like follows the list:
- Question-and-answer dialogues about app usage
- Full descriptions of the screen content
- Predictions of what action a user would perform
- Step-by-step instructions for more complex operations (e.g., starting a podcast or changing display settings)
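The paper's exact data schema is not public; the sketch below shows what a single instruction-tuning example of this kind typically looks like in the LLaVA ecosystem, with the screenshot path, question, and answer invented for illustration.

```python
# Hypothetical training example in LLaVA-style instruction-tuning format.
# Only the general structure (one screenshot plus a conversation about it)
# follows common practice; all content is invented.
training_example = {
    "image": "screenshots/podcast_app_home.png",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nHow do I start playing the latest episode?",
        },
        {
            "from": "gpt",
            "value": "Tap the show at the top of the list, then tap the Play "
                     "button next to the most recent episode.",
        },
    ],
}
```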
After training with this data, ILuvUI was able to outperform the original LLaVA model in machine benchmarks and in tests with human subjects. The AI demonstrated a better understanding of app logic, higher accuracy in predicting user goals, and clearer explanations in dialog format.
No “region of interest” required
A notable difference from previous models: ILuvUI eliminates the need for manual selection of screen areas. The model automatically analyzes the entire context of a screenshot and simultaneously processes text input—for example, a question about app usage. This makes the system versatile. ILuvUI can, for example, explain how to use certain app functions or which steps are necessary to solve a problem.
Practical applications
According to Apple, ILuvUI is particularly interesting for two areas of application: accessibility and automated UI testing. People with visual or mobility impairments could use this type of AI to navigate complex app layouts more easily; the AI recognizes the correct interaction steps automatically and can provide guidance via voice output or assistance systems. For developers, ILuvUI offers a tool for automated testing: the AI can simulate operating sequences, identify errors, and evaluate the logical structure of user interfaces. The model could also be used in training or support systems, anywhere a technical system needs to explain how an app works.
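As an illustration of the testing idea, the fragment below shows one way a VLM's answer about a screen could be turned into a simple automated check. The helper function and the keyword heuristic are assumptions made for this sketch, not a method described by Apple.

```python
# Sketch: turning a vision language model's description of a screen into a
# pass/fail check for an automated UI test. The keyword heuristic is a
# deliberately simple stand-in for a real evaluation step.
def screen_mentions_all(vlm_answer: str, expected_elements: list[str]) -> bool:
    """Return True if the model's answer mentions every expected UI element."""
    answer = vlm_answer.lower()
    return all(element.lower() in answer for element in expected_elements)

# Hypothetical answer produced by the model for a login screen.
answer = "The screen shows a username field, a password field and a Sign In button."
assert screen_mentions_all(answer, ["username", "password", "sign in"])
```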
Outlook on future developments
ILuvUI is still under development. The current version is based on open components; going forward, the researchers plan to integrate larger image encoders, increase screenshot resolution, and improve output formats. The goal is for ILuvUI to work directly with standard formats such as JSON that are used in modern UI frameworks. Combined with other Apple research projects, such as predicting in-app actions, a clear direction emerges: systems that can not only see and describe, but also think and act.
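The report does not spell out the target schema, but the kind of JSON output the researchers hint at might look roughly like the following; every field name and value here is an assumption.

```python
# Hypothetical JSON description of a screen, of the kind a future ILuvUI
# version might emit for UI frameworks. Field names and values are invented.
import json

screen_description = {
    "screen": "Podcast player",
    "elements": [
        {"type": "button", "label": "Play", "action": "starts playback"},
        {"type": "slider", "label": "Progress", "action": "scrubs through the episode"},
        {"type": "list", "label": "Up Next", "action": "opens the selected episode"},
    ],
}
print(json.dumps(screen_description, indent=2))
```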
ILuvUI brings structure to complex app interfaces
With ILuvUI, Apple has developed an AI model that can analyze app user interfaces both visually and linguistically: in detail, precisely, and with a clear focus on practical use. The system not only recognizes the individual elements of an app but also understands their meaning and the user paths they enable. This could noticeably improve the interaction between humans and technology. ILuvUI opens up new possibilities, especially for accessible interface concepts and automated testing procedures. The project demonstrates how AI could become a central component of app usage in the future. (Image: Shutterstock / Rabbi Creative)