---
title: MMAudio — generating synchronized audio from video/text
emoji: 🔊
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
sdk_version: 6.3.0
---
# Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

[[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
**Note:** This repository is still under construction. Single-example inference should work as expected; the training code will be added later, and the code is subject to non-backward-compatible changes.
## Highlight
MMAudio generates synchronized audio given video and/or text inputs.
Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
Moreover, a synchronization module aligns the generated audio with the video frames.
## Results
(All audio from our algorithm MMAudio)
Videos from Sora:
https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
Videos from MovieGen/Hunyuan Video/VGGSound:
https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
## Installation

We have only tested this on Ubuntu.

### Prerequisites

We recommend using a miniforge environment.
- Python 3.8+
- PyTorch 2.5.1+ and corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/)
- ffmpeg<7 (required by torchaudio; in a miniforge environment you can install it with `conda install -c conda-forge 'ffmpeg<7'`)
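As a quick sanity check of the ffmpeg requirement (a minimal sketch, not part of the repository), you can confirm that the ffmpeg on your `PATH` is below major version 7:

```python
import re
import shutil
import subprocess

def ffmpeg_major_version(banner: str) -> int:
    """Parse the major version out of an `ffmpeg -version` banner line,
    e.g. 'ffmpeg version 6.1.1 Copyright ...' -> 6."""
    match = re.search(r"ffmpeg version n?(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized banner: {banner!r}")
    return int(match.group(1))

if __name__ == "__main__":
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found on PATH")
    else:
        first_line = subprocess.run(
            ["ffmpeg", "-version"], capture_output=True, text=True
        ).stdout.splitlines()[0]
        major = ffmpeg_major_version(first_line)
        print("OK for torchaudio" if major < 7 else "ffmpeg too new (need <7)")
```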
Clone our repository:

```bash
git clone https://github.com/hkchengrex/MMAudio.git
```

Install with pip:

```bash
cd MMAudio
pip install -e .
```

(If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
### Pretrained models

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
| Model | Download link | File size |
|---|---|---|
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz (recommended) | mmaudio_large_44k.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
The 44.1kHz vocoder will be downloaded automatically.
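Since MD5 checksums for the checkpoints ship in `mmaudio/utils/download_utils.py`, a downloaded file can also be verified by hand. Below is a minimal stdlib sketch (the commented-out checksum is a placeholder, not a real value; take the expected digests from that file):

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB checkpoints
    do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (placeholder digest -- compare against the values in
# mmaudio/utils/download_utils.py):
# assert md5_of_file(Path("weights/mmaudio_large_44k.pth")) == "<expected md5>"
```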
The expected directory structure (full):

```
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   └── mmaudio_large_44k.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):

```
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k.pth
└── ...
```
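To confirm the layout before running the demo, a small stdlib check (an illustrative sketch; file names are taken from the minimal layout above) can report which expected checkpoints are missing:

```python
from pathlib import Path

# Minimal set of files needed for the recommended large_44k model,
# per the directory layout above.
REQUIRED = [
    "ext_weights/synchformer_state_dict.pth",
    "ext_weights/v1-44.pth",
    "weights/mmaudio_large_44k.pth",
]

def missing_weights(repo_root: Path, required=REQUIRED) -> list:
    """Return the relative paths that are not present under repo_root."""
    return [rel for rel in required if not (repo_root / rel).is_file()]

if __name__ == "__main__":
    # Print what still needs downloading, if anything.
    for rel in missing_weights(Path("MMAudio")):
        print(f"missing: {rel}")
```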
## Demo

By default, these scripts use the `large_44k` model. In our experiments, inference takes only around 6 GB of GPU memory (in 16-bit mode), which should fit most modern GPUs.
### Command-line interface

With `demo.py`:

```bash
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```

The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`. See the file for more options.
Simply omit the `--video` option for text-to-audio synthesis. The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
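As a small worked example of what the duration flag implies (the helper below is illustrative, not part of the repository), the default 8-second output corresponds to 8 × 44100 = 352,800 waveform samples with the 44.1 kHz models:

```python
def num_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of waveform samples for a clip of the given duration."""
    return round(duration_s * sample_rate_hz)

print(num_samples(8, 44_100))  # default duration, 44.1 kHz models -> 352800
print(num_samples(8, 16_000))  # same duration, 16 kHz model -> 128000
```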
### Gradio interface

Supports video-to-audio and text-to-audio synthesis:

```bash
python gradio_demo.py
```
## Known limitations

- The model sometimes generates undesired, unintelligible human-speech-like sounds
- The model sometimes generates undesired background music
- The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing"

We believe all three of these limitations can be addressed with more high-quality training data.
## Training

Work in progress.

## Evaluation

Work in progress.
## Acknowledgement

Many thanks to:
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model
- BigVGAN
- Synchformer