apple patient

Apple AI generates speech and sound from silent videos

by Milan
February 9, 2026
Apple AI

Image: Shutterstock / gnepphoto

Apple is backing a new AI model that addresses a long-standing problem: generating realistic sound and speech for completely silent videos. The model, called VSSFlow, was developed by three Apple researchers in collaboration with six researchers from Renmin University in China. The goal was a unified system that generates sound effects and speech together rather than separately, and the reported results are competitive with specialized models.

Previous approaches in this area were mostly highly specialized. Video-to-audio models could generate ambient sounds but struggled with speech. Text-to-speech models produced clear voices but were not designed to generate non-speech sounds like footsteps, wind, or machine noise. Attempts to combine both tasks often relied on separate training steps, based on the assumption that joint training would degrade performance. This led to complex pipelines and limited results. VSSFlow deliberately takes a different approach and challenges this assumption.

The initial problem

Separating sound and speech generation had clear disadvantages. Models were either good at sound effects or good at speech, but rarely at both. Systems intended to handle both tasks became unnecessarily complex and often lagged behind specialized solutions. That falls short for realistic videos, which typically contain both dialogue and background noise.

The idea behind VSSFlow

VSSFlow is designed as a unified AI model that learns and generates sound effects and speech together. Instead of combining two separate systems, a single model processes visual information from the video and text-based information from transcripts directly in the audio generation process.

Several concepts from generative AI are used in this process. Spoken texts are first converted into phoneme sequences, i.e., basic sound units. For the actual audio generation, the model uses flow matching. It learns to gradually reconstruct a structured audio signal from random noise until the desired result is achieved.
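The flow matching idea can be illustrated with a minimal sketch. The article does not specify the exact formulation used in VSSFlow, so this assumes the common straight-line path between noise and target, where the model is trained to predict the velocity `target - noise`. The function names and the four-value "audio latent" are hypothetical, and the trained network is replaced by the exact velocity for a single example.

```python
import random


def interpolate(noise, target, t):
    """Point on the straight path from noise (t=0) to target (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def target_velocity(noise, target):
    """Velocity a model would be trained to predict on the straight path."""
    return [x - n for n, x in zip(noise, target)]


def euler_sample(noise, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target = [0.5, -0.25, 1.0, 0.0]  # hypothetical stand-in for an audio latent
noise = [rng.gauss(0.0, 1.0) for _ in target]

# Stand-in for a trained network: the exact velocity for this one pair.
v_star = target_velocity(noise, target)
sample = euler_sample(noise, lambda x, t: v_star, steps=10)
```

With a perfect velocity field, the Euler steps carry the random starting point to the target. A real model only approximates that field and conditions it on video and transcript features.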

These mechanisms are embedded in a ten-layer architecture that simultaneously considers video frames and transcript information. This allows the model to process speech and sound effects in a single system.
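The transcript side of this input can be sketched as a simple text-to-phoneme lookup. This is only an illustration: the article does not say which phoneme converter the researchers used, and the tiny lexicon, the ARPAbet-style symbols, and the `<unk>` marker below are all assumptions made for the example.

```python
import re

# Hypothetical toy lexicon with ARPAbet-style symbols; a real system would
# use a full pronunciation dictionary or a grapheme-to-phoneme model.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def to_phonemes(text, lexicon=TOY_LEXICON, unknown="<unk>"):
    """Convert text to a flat phoneme sequence via dictionary lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        # Words missing from the lexicon map to a single unknown marker.
        phonemes.extend(lexicon.get(word, [unknown]))
    return phonemes


sequence = to_phonemes("Hello, world!")
```

The resulting sequence is what a model would consume in place of raw characters, alongside the features extracted from video frames.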

Joint training instead of competition

A key finding of the research is that speech and sound training do not hinder each other. On the contrary, learning together led to better results in both tasks. Speech benefited from sound training, and the sound effects became more precise through speech training. This mutually reinforcing effect contradicts the previous assumption that multitasking in this area inevitably leads to a decrease in performance.

Training data and procedure

To train VSSFlow, the researchers used a combination of different data types:

  • silent videos paired with their ambient sounds (video-to-sound),
  • silent videos paired with transcripts for spoken content (visual text-to-speech),
  • classic text-to-speech datasets.

All data were used in a continuous end-to-end training process. This allowed the model to capture both sounds and speech in a unified learning process.
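One way to picture joint training over these three data types is a sampler that picks a task for every training example and records which conditioning inputs that task provides. The article gives no mixing ratios or implementation details, so the task names, the equal weights, and the helper functions below are assumptions for illustration only.

```python
import random

# Which conditioning inputs each data type provides (per the list above).
TASKS = {
    "video_to_sound": {"video": True, "transcript": False},
    "visual_tts": {"video": True, "transcript": True},
    "tts": {"video": False, "transcript": True},
}


def sample_batch_tasks(batch_size, weights, rng):
    """Draw one task name per batch element according to the given weights."""
    names = list(TASKS)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


def conditions_for(task):
    """Return the conditioning inputs available for a task."""
    return TASKS[task]


rng = random.Random(0)
weights = {"video_to_sound": 1.0, "visual_tts": 1.0, "tts": 1.0}  # assumed
batch = sample_batch_tasks(8, weights, rng)
batch_conditions = [conditions_for(task) for task in batch]
```

In such a setup a single model sees all three tasks in the same training run, with missing inputs (for example, no video in plain text-to-speech data) simply marked as absent.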

Fine-tuning for simultaneous output

In its original version, VSSFlow couldn't automatically generate background noise and spoken dialogue simultaneously in a single output. To overcome this issue, the model was subsequently fine-tuned. The researchers used large quantities of synthetic examples in which speech and ambient noise were mixed. In this way, the model learned how the two should sound together.
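A synthetic training example of this kind could be built by overlaying a speech clip on an ambient clip at a chosen loudness ratio. The article does not describe how the researchers mixed their examples, so the signal-to-noise-ratio approach, the function names, and the toy signals below are assumptions.

```python
import math


def rms(signal):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, ambient, snr_db):
    """Scale ambient so the speech-to-ambient RMS ratio equals snr_db, then sum."""
    gain = rms(speech) / (rms(ambient) * 10 ** (snr_db / 20.0))
    scaled = [gain * a for a in ambient]
    mixed = [s + a for s, a in zip(speech, scaled)]
    return mixed, scaled


# Toy signals standing in for one second of speech and ambient noise.
speech = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
ambient = [0.1 * math.cos(2 * math.pi * 13 * i / 100) + 0.05 for i in range(100)]

mixed, scaled = mix_at_snr(speech, ambient, snr_db=10.0)
achieved_snr = 20 * math.log10(rms(speech) / rms(scaled))
```

Varying `snr_db` across examples would expose a model to dialogue that is sometimes clearly in the foreground and sometimes partly masked by background noise.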

Deployment and results

At inference time, VSSFlow starts from random noise. Visual cues are extracted from the video at approximately ten frames per second to guide the ambient sounds, while a transcript supplies the exact content for the generated voice.
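Sampling a video at roughly ten frames per second amounts to picking evenly spaced frame indices from the native frame rate. The helper below is a hypothetical sketch of that step; the article does not describe the actual extraction code, and the function name and parameters are assumptions.

```python
def frame_indices(num_frames, native_fps, target_fps=10.0):
    """Indices of frames to keep when subsampling a video to target_fps."""
    count = int(num_frames * target_fps / native_fps)
    step = native_fps / target_fps
    # Floor each position and clamp to the last valid frame index.
    return [min(num_frames - 1, int(i * step)) for i in range(count)]


# A 3-second clip recorded at 30 fps yields 30 sampled frames.
indices = frame_indices(num_frames=90, native_fps=30.0)
```

Each selected frame would then be passed through a visual encoder, and the resulting features would condition the audio generation.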

Compared to specialized models designed solely for sound effects or solely for speech, VSSFlow achieved competitive results. In several key metrics, the model even outperformed the competition, despite combining both tasks in a single system.

The researchers published numerous demos, including examples of pure sound generation, pure speech generation, and combined output from videos. They also provided direct comparisons with alternative models.

Open Source and Outlook

The VSSFlow code has been released as open source on GitHub. The researchers are also working on making the model weights accessible and providing an inference demo.

They see several open challenges for the future. A key limitation is the scarcity of high-quality video-speech-audio data. Furthermore, developing better representations for sound and speech remains an important issue, especially if speech details are to be preserved without making the models unnecessarily large.

Apple is pushing ahead with integrated audio AI

With VSSFlow, Apple demonstrates that a unified model for video-based audio and speech generation is feasible and even offers advantages over separate approaches. The combined learning of sound and speech proves to be a strength rather than a weakness. This work thus provides a clear impetus for future research and underscores Apple's role in the advancement of modern AI systems.



© 2026 Apfelpatient. All rights reserved. | Sitemap
