Apple AI generates speech and sound from silent videos Apfelpatient

Apple is backing a new AI model that addresses a long-standing problem: generating realistic sound and speech from completely silent videos. The model, called VSSFlow, was developed by three Apple researchers in collaboration with six researchers from Renmin University of China. The goal was a unified system that generates sound effects and speech together rather than separately, with results that are competitive with specialized models.

Previous approaches in this area were mostly highly specialized. Video-to-audio models could generate ambient sounds but struggled with speech. Text-to-speech models produced clear voices but were not designed to generate non-speech sounds like footsteps, wind, or machine noise. Attempts to combine both tasks often relied on separate training steps, based on the assumption that joint training would degrade performance. This led to complex pipelines and limited results. VSSFlow deliberately takes a different approach and challenges this assumption.

The initial problem

Separating sound and speech generation had clear disadvantages. Models were either good at sound effects or good at speech, but rarely at both. Systems intended to handle both tasks became unnecessarily complex and often lagged behind specialized solutions. This was insufficient for realistic videos with dialogue and background noise.

The idea behind VSSFlow

VSSFlow is designed as a unified AI model that learns and generates sound effects and speech together. Instead of combining two separate systems, a single model processes visual information from the video and text-based information from transcripts directly in the audio generation process.

Several concepts from generative AI are used in this process. Spoken texts are first converted into phoneme sequences, i.e., basic sound units. For the actual audio generation, the model uses flow matching. It learns to gradually reconstruct a structured audio signal from random noise until the desired result is achieved.
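The flow matching idea can be illustrated with a toy sketch. This is not the VSSFlow implementation: the "audio" is just a short list of numbers, and the trained network is replaced by the closed-form velocity toward a single known target, which is an assumption made purely for illustration. The function names `velocity` and `sample` are hypothetical.

```python
import random


def velocity(x, target, t):
    """Velocity that moves x toward target along a straight path.

    In a real model a neural network would predict this from the current
    state, the time step, and the video/text conditions. The closed form
    used here is a stand-in for that network (illustrative assumption).
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def sample(target, steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) toward t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from random noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so 1 - t is never zero
        v = velocity(x, target, t)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    print(sample([0.5, -0.25, 0.0, 1.0]))
```

Each Euler step moves the noisy starting point a little further along the path, which mirrors the gradual reconstruction described above. In an actual system the velocity would come from a trained network conditioned on video frames and phonemes.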

These mechanisms are embedded in a ten-layer architecture that simultaneously considers video frames and transcript information. This allows the model to process speech and sound effects in a single system.

Joint training instead of competition

A key finding of the research is that speech and sound training do not hinder each other. On the contrary, learning together led to better results in both tasks. Speech benefited from sound training, and the sound effects became more precise through speech training. This mutually reinforcing effect contradicts the previous assumption that multitasking in this area inevitably leads to a decrease in performance.

Training data and procedure

To train VSSFlow, the researchers used a combination of different data types:

silent videos paired with ambient sounds (video-to-sound),
silent videos of speakers paired with transcripts (visual text-to-speech),
classic text-to-speech datasets.

All data were used in a continuous end-to-end training process. This allowed the model to capture both sounds and speech in a unified learning process.
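One simple way to picture joint training on several data types is a sampler that draws examples from every task into the same batch. The article does not describe how the researchers actually mixed their data, so the uniform sampling, the task names, and the function `mixed_batches` below are all assumptions.

```python
import random


def mixed_batches(datasets, batch_size, num_batches, seed=0):
    """Yield batches whose examples are drawn from all tasks at once.

    datasets maps a task name to a list of examples. Each drawn example is
    tagged with its task so a single model can process every task in one
    training loop. Uniform task sampling is an illustrative assumption.
    """
    rng = random.Random(seed)
    tasks = sorted(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            task = rng.choice(tasks)
            batch.append((task, rng.choice(datasets[task])))
        yield batch


if __name__ == "__main__":
    # Placeholder examples; real entries would be video/audio/text samples.
    data = {
        "video_to_sound": ["v1", "v2"],
        "visual_tts": ["s1"],
        "tts": ["t1", "t2", "t3"],
    }
    for batch in mixed_batches(data, batch_size=4, num_batches=2):
        print(batch)
```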

Fine-tuning for simultaneous output

In its original version, VSSFlow couldn't automatically generate background noise and spoken dialogue simultaneously in a single output. To overcome this issue, the model was subsequently fine-tuned. The researchers used large quantities of synthetic examples in which speech and ambient noise were mixed. In this way, the model learned how the two should sound together.
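A common way to build such synthetic mixtures is to add an ambient recording to a speech recording at a chosen signal-to-noise ratio. Whether the researchers did it this way is not stated in the article, so the sketch below is an assumption. Waveforms are plain lists of floats, and `rms` and `mix_at_snr` are hypothetical helper names.

```python
import math


def rms(signal):
    """Root-mean-square level of a waveform given as a list of floats."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, ambient, snr_db):
    """Add ambient sound to speech so that speech is snr_db louder (in dB).

    Both inputs are trimmed to the shorter length. The ambient signal must
    not be all zeros, since its level is used as a divisor.
    """
    n = min(len(speech), len(ambient))
    speech, ambient = speech[:n], ambient[:n]
    # Choose gain so 20*log10(rms(speech) / rms(gain * ambient)) == snr_db.
    gain = rms(speech) / (rms(ambient) * 10 ** (snr_db / 20.0))
    return [s + gain * a for s, a in zip(speech, ambient)]


if __name__ == "__main__":
    speech = [math.sin(0.1 * i) for i in range(1000)]
    ambient = [0.3 * math.sin(0.37 * i + 1.0) for i in range(1000)]
    mixed = mix_at_snr(speech, ambient, snr_db=10.0)
    print(len(mixed))
```

Varying `snr_db` across examples would give a model mixtures in which speech is sometimes prominent and sometimes nearly buried in background sound.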

Deployment and results

At inference time, generation starts from random noise. Visual cues are extracted from the video at approximately ten frames per second to guide the ambient sounds. At the same time, a transcript tells the model exactly what the generated voice should say.
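Sampling a video at roughly ten frames per second amounts to keeping only a subset of the original frames. The sketch below shows one straightforward way to pick those frames. The function name `frame_indices` and the truncation-based rounding are assumptions, not details taken from the VSSFlow code.

```python
def frame_indices(num_frames, source_fps, target_fps=10.0):
    """Indices of source frames to keep when resampling to target_fps.

    The clip duration fixes how many frames are kept. Each kept frame is
    the source frame at or just before the ideal sampling time.
    """
    duration = num_frames / source_fps   # clip length in seconds
    count = int(duration * target_fps)   # number of frames to keep
    step = source_fps / target_fps       # source frames per kept frame
    return [min(int(i * step), num_frames - 1) for i in range(count)]


if __name__ == "__main__":
    # A 3-second clip recorded at 30 fps keeps every third frame.
    print(frame_indices(90, 30.0))
```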

Compared to specialized models designed solely for sound effects or solely for speech, VSSFlow achieved competitive results. In several key metrics, the model even outperformed the competition, despite combining both tasks in a single system.

The researchers published numerous demos, including examples of pure sound generation, pure speech generation, and combined output from videos. They also provided direct comparisons with alternative models.

Open source and outlook

The VSSFlow code has been released as open source on GitHub. The researchers are also working on making the model weights accessible and providing an inference demo.

They see several open challenges for the future. A key constraint is the scarcity of high-quality video-speech-audio data. Developing better representations for sound and speech also remains an important issue, especially if speech details are to be preserved without making the models unnecessarily large.

Apple is pushing ahead with integrated audio AI

With VSSFlow, Apple demonstrates that a unified model for video-based audio and speech generation is feasible and even offers advantages over separate approaches. The combined learning of sound and speech proves to be a strength rather than a weakness. This work thus provides a clear impetus for future research and underscores Apple's role in the advancement of modern AI systems.

(Image: Shutterstock / gnepphoto)
