Apple AI creates 3D models from just one image Apfelpatient

Apple has unveiled a new AI model that enables a remarkable advancement in 3D reconstruction. Instead of requiring multiple images from different perspectives, as was previously the case, a single image is sufficient. The model then generates a complete 3D object, taking into account realistic lighting conditions such as reflections and highlights.

This development shows how far Apple has come in the field of artificial intelligence and what practical applications are now possible as a result.

To understand how this model works, a brief look at the concept of latent space is helpful. The concept is not new in machine learning, but it has gained considerable importance through modern AI models, especially those based on the Transformer architecture, as well as through so-called world models.

In simple terms, a latent space is a representation in which information is converted into numerical vectors and arranged in a multidimensional space, so that related items lie close together. This allows relationships between data to be computed efficiently. A classic example:

The vector for "king" minus the vector for "man" plus the vector for "woman" lands close to the vector for "queen" in latent space.
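The analogy can be tried out with a few hand-made vectors. The following is a minimal sketch, not a real language model: the three-dimensional embeddings in EMBEDDINGS are hypothetical values chosen for illustration, whereas real models learn hundreds of dimensions from data.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# The axes loosely mean (royalty, maleness, femaleness).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z
              for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = (w for w in EMBEDDINGS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))

print(analogy("king", "man", "woman"))  # prints "queen"
```

The result is found by nearest-neighbor search rather than exact equality, which is why the analogy is usually described as landing "close to" the target word.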

Although this example comes from natural language processing, the same principle can be applied to other data types, such as images or 3D information. This is precisely what Apple is using in its new study.

LiTo: Surface Light Field Tokenization

In the study titled "LiTo: Surface Light Field Tokenization," Apple presents a new method. The goal is a 3D latent representation that depicts two things simultaneously:

the geometry of an object
its appearance as a function of the viewing angle

Previous approaches had clear limitations. Many models focused either on the pure form of an object or on a simplified appearance that was independent of the viewing angle. As a result, realistic effects were often lost.

Apple's approach combines both in a unified model. The system builds on the insight that RGB-depth images (color images with a depth value per pixel) can be understood as samples of a so-called surface light field.
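To make this idea concrete: a surface light field assigns a color to each pair of surface point and viewing direction, and every pixel of a color image with depth supplies one such pair. The sketch below assumes a simple pinhole camera with z-depth. The function names and the intrinsics (fx, fy, cx, cy) are hypothetical and not taken from the study.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with z-depth `depth` into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def light_field_sample(u, v, depth, rgb, fx, fy, cx, cy):
    """Turn one RGB-D pixel into (surface point, direction to camera, color)."""
    point = unproject(u, v, depth, fx, fy, cx, cy)
    norm = math.sqrt(sum(c * c for c in point))
    # The camera sits at the origin, so the direction from the surface
    # point back to the camera is the negated, normalized point.
    view_dir = tuple(-c / norm for c in point)
    return point, view_dir, rgb

# Hypothetical intrinsics for a 640x480 camera.
sample = light_field_sample(u=420, v=240, depth=5.0, rgb=(200, 180, 90),
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(sample)
```

Seeing the same surface point from a second camera yields a different viewing direction and possibly a different color, which is exactly the angle-dependent information a light field captures.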

By encoding random subsamples of this light field into compact latent vectors, the model learns to represent both shape and light behavior together. This allows even complex effects to be reproduced correctly, including:

Reflections
Highlights
Fresnel reflections

These effects remain consistent across different perspectives.
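Why such effects depend on the viewing angle can be illustrated with Schlick's approximation of Fresnel reflectance, a standard formula from computer graphics. It is shown here only as background and is not described as part of LiTo. The default f0 of 0.04 is a commonly used reflectance at normal incidence for non-metallic materials.

```python
import math

def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick's approximation of Fresnel reflectance.

    cos_theta: cosine of the angle between viewing direction and surface normal.
    f0: reflectance when looking straight at the surface (normal incidence).
    """
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

# Reflectance rises sharply as the view approaches a grazing angle.
for angle_deg in (0, 45, 80, 89):
    r = schlick_fresnel(math.cos(math.radians(angle_deg)))
    print(f"{angle_deg:2d} deg -> reflectance {r:.3f}")
```

A representation that stores only one fixed color per surface point cannot express this change, which is the limitation of view-independent appearance models.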

How the model works

The basic structure of the system is an encoder-decoder approach.

The encoder compresses an object's information. Instead of storing each detail individually, it creates a condensed mathematical representation in latent space. This representation contains both the object's shape and information about how light interacts with its surface.

The decoder then takes over the reconstruction. From the compact representation, it creates a complete 3D object. In doing so, it also calculates how light behaves depending on the viewing angle.

The result is a model that not only reproduces the structure of an object, but also realistically simulates its visual properties.
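The encoder-decoder idea can be sketched as a toy interface. The attention-based pooling used here is an assumption for illustration, since the article does not specify the actual architecture. All weights are random and untrained, so the sketch demonstrates only the data flow and the shapes involved, not meaningful output.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head dot-product attention: a weighted average of `values`."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                       for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def encode(samples, latent_queries):
    """Compress any number of samples into len(latent_queries) latent vectors."""
    return [attend(q, samples, samples) for q in latent_queries]

def decode(query, latents):
    """Answer one query (e.g. a point plus view direction) from the latents."""
    return attend(query, latents, latents)

rng = random.Random(0)
DIM = 9  # hypothetical feature size: xyz + view direction + rgb
samples = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(1000)]
# Hypothetical latent queries; in a trained model these would be learned.
latent_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]

latents = encode(samples, latent_queries)
print(len(samples), "samples ->", len(latents), "latent vectors of size", len(latents[0]))
```

The point of the design is that the latent set has a fixed size regardless of how many samples go in, which is what makes the representation compact.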

Training of the LiTo model

Apple used a large database for the training:

Thousands of different objects
each rendered from 150 different viewpoints
under three different lighting conditions

Instead of using all the data directly, the system randomly selected small subsets of this information. These were then transferred into latent space.

The decoder was then trained to reconstruct the entire object, including all lighting and perspective effects, from this incomplete data.
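The subset-based training setup might look roughly like this. The article does not describe how the subsets are drawn, so the sketch assumes whole renderings are picked at random within a single lighting condition. The subset size of 8 and the function name make_training_pair are hypothetical.

```python
import random

NUM_VIEWS = 150      # viewpoints per object, as stated above
NUM_LIGHTINGS = 3    # lighting conditions per object, as stated above

def make_training_pair(rng, subset_size=8):
    """Pick a random subset of renderings as encoder input.

    The reconstruction target is every view under the chosen lighting,
    so the decoder must fill in what the encoder never saw.
    """
    lighting = rng.randrange(NUM_LIGHTINGS)
    all_views = [(view, lighting) for view in range(NUM_VIEWS)]
    encoder_input = rng.sample(all_views, subset_size)
    return encoder_input, all_views

rng = random.Random(42)
inputs, targets = make_training_pair(rng)
print(len(inputs), "of", len(targets), "renderings go to the encoder")
```

Because the subset changes on every draw, the encoder cannot rely on any particular view being present and has to learn a representation that works from incomplete data.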

During this training, the model learned a representation that reliably depicts both the geometry and the changes in appearance depending on the viewing angle.

Additionally, another model was trained. This model takes a single image as input and predicts the appropriate latent representation. Based on this, the decoder can then generate the complete 3D object.

This makes reconstruction possible from just one image.

Comparison with existing methods

Apple compared its model to existing systems, including one called TRELLIS. According to the results, LiTo performs significantly better, especially under complex lighting conditions.

While other models struggle to accurately represent reflective surfaces or angle-dependent effects, the representation remains stable and realistic with LiTo.

Apple's project page provides corresponding comparisons, including interactive side-by-side views, where the differences can be observed directly.

Meaning and possible applications

The technology has several practical applications:

Augmented reality: more realistic representation of digital objects
E-commerce: 3D product visualization from a single photo
Game development: more efficient asset creation
Film and design: faster and more precise visualization

The major advantage is that significantly less input data is required without compromising quality.

Apple's progress in 3D AI

With the LiTo model, Apple demonstrates how powerful modern AI has become in the field of visual processing. The combination of 3D reconstruction and realistic lighting effects from just one image represents a clear advancement over previous methods.

The use of latent space as a key technology enables a compact yet precise representation of complex relationships. With this, Apple takes another step towards efficient, practical AI applications that can be used in many areas.