---
title: MMAudio — generating synchronized audio from video/text
emoji: 🔊
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
sdk_version: 6.3.0
---
# Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, Yuki Mitsufuji

University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation

[[Paper (being prepared)]](https://hkchengrex.github.io/MMAudio) [[Project Page]](https://hkchengrex.github.io/MMAudio)
**Note:** This repository is still under construction. Single-example inference should work as expected; the training code will be added later, and the code is subject to non-backward-compatible changes.
## Highlight
MMAudio generates synchronized audio given video and/or text inputs.
Our key innovation is multimodal joint training which allows training on a wide range of audio-visual and audio-text datasets.
Moreover, a synchronization module aligns the generated audio with the video frames.
## Results
(All audio from our algorithm MMAudio)
Videos from Sora:
https://github.com/user-attachments/assets/82afd192-0cee-48a1-86ca-bd39b8c8f330
Videos from MovieGen/Hunyuan Video/VGGSound:
https://github.com/user-attachments/assets/29230d4e-21c1-4cf8-a221-c28f2af6d0ca
For more results, visit https://hkchengrex.com/MMAudio/video_main.html.
## Installation

We have only tested this on Ubuntu.

### Prerequisites

We recommend using a miniforge environment.
- Python 3.8+
- PyTorch 2.5.1+ and corresponding torchvision/torchaudio (pick your CUDA version at https://pytorch.org/)
- ffmpeg<7 (required by torchaudio; in a miniforge environment you can install it with `conda install -c conda-forge 'ffmpeg<7'`)
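As a quick sanity check of the ffmpeg requirement (a minimal sketch, not part of the repository), you can confirm that the ffmpeg on your `PATH` is below major version 7:

```python
import re
import shutil
import subprocess

def ffmpeg_major_version(banner: str) -> int:
    """Parse the major version out of an `ffmpeg -version` banner line,
    e.g. 'ffmpeg version 6.1.1 Copyright ...' -> 6."""
    match = re.search(r"ffmpeg version n?(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognized banner: {banner!r}")
    return int(match.group(1))

if __name__ == "__main__":
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found on PATH")
    else:
        first_line = subprocess.run(
            ["ffmpeg", "-version"], capture_output=True, text=True
        ).stdout.splitlines()[0]
        major = ffmpeg_major_version(first_line)
        print("OK for torchaudio" if major < 7 else "ffmpeg too new (need <7)")
```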
Clone our repository:

```bash
git clone https://github.com/hkchengrex/MMAudio.git
```

Install with pip:

```bash
cd MMAudio
pip install -e .
```

(If you encounter a `File "setup.py" not found` error, upgrade pip with `pip install --upgrade pip`.)
### Pretrained models

The models will be downloaded automatically when you run the demo script. MD5 checksums are provided in `mmaudio/utils/download_utils.py`.
| Model | Download link | File size |
|---|---|---|
| Flow prediction network, small 16kHz | mmaudio_small_16k.pth | 601M |
| Flow prediction network, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| Flow prediction network, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| Flow prediction network, large 44.1kHz (recommended) | mmaudio_large_44k.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
The 44.1kHz vocoder will be downloaded automatically.
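Since MD5 checksums for the checkpoints ship in `mmaudio/utils/download_utils.py`, a downloaded file can also be verified by hand. Below is a minimal stdlib sketch (the commented-out checksum is a placeholder, not a real value; take the expected digests from that file):

```python
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so multi-GB checkpoints
    do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (placeholder digest -- compare against the values in
# mmaudio/utils/download_utils.py):
# assert md5_of_file(Path("weights/mmaudio_large_44k.pth")) == "<expected md5>"
```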
The expected directory structure (full):

```
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   └── mmaudio_large_44k.pth
└── ...
```
The expected directory structure (minimal, for the recommended model only):

```
MMAudio
├── ext_weights
│   ├── synchformer_state_dict.pth
│   └── v1-44.pth
├── weights
│   └── mmaudio_large_44k.pth
└── ...
```
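To confirm the layout before running the demo, a small stdlib check (an illustrative sketch; file names are taken from the minimal layout above) can report which expected checkpoints are missing:

```python
from pathlib import Path

# Minimal set of files needed for the recommended large_44k model,
# per the directory layout above.
REQUIRED = [
    "ext_weights/synchformer_state_dict.pth",
    "ext_weights/v1-44.pth",
    "weights/mmaudio_large_44k.pth",
]

def missing_weights(repo_root: Path, required=REQUIRED) -> list:
    """Return the relative paths that are not present under repo_root."""
    return [rel for rel in required if not (repo_root / rel).is_file()]

if __name__ == "__main__":
    # Print what still needs downloading, if anything.
    for rel in missing_weights(Path("MMAudio")):
        print(f"missing: {rel}")
```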
## Demo

By default, these scripts use the `large_44k` model. In our experiments, inference takes only around 6 GB of GPU memory (in 16-bit mode), which should fit most modern GPUs.
### Command-line interface

With `demo.py`:

```bash
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```

The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`. See the file for more options.
Simply omit the `--video` option for text-to-audio synthesis. The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
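As a small worked example of what the duration flag implies (the helper below is illustrative, not part of the repository), the default 8-second output corresponds to 8 × 44100 = 352,800 waveform samples with the 44.1 kHz models:

```python
def num_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of waveform samples for a clip of the given duration."""
    return round(duration_s * sample_rate_hz)

print(num_samples(8, 44_100))  # default duration, 44.1 kHz models -> 352800
print(num_samples(8, 16_000))  # same duration, 16 kHz model -> 128000
```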
### Gradio interface

Supports video-to-audio and text-to-audio synthesis:

```bash
python gradio_demo.py
```
## Known limitations

- The model sometimes generates undesired, unintelligible human-speech-like sounds
- The model sometimes generates undesired background music
- The model struggles with unfamiliar concepts, e.g., it can generate "gunfires" but not "RPG firing"

We believe all three of these limitations can be addressed with more high-quality training data.
## Training

Work in progress.

## Evaluation

Work in progress.
## Acknowledgement

Many thanks to:
- Make-An-Audio 2 for the 16kHz BigVGAN pretrained model
- BigVGAN
- Synchformer