🧠 Model

VibeVoice-1.5B

by microsoft

Languages: en, zh
License: MIT
Pipeline tag: text-to-speech
Tags: Podcast
Library: transformers

🕐 Updated 12/19/2025

About

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic a...

📝 Limitations & Considerations

• Benchmark scores may vary based on evaluation methodology and hardware configuration.
• VRAM requirements are estimates; actual usage depends on quantization and batch size.
• FNI scores are relative rankings and may change as new models are added.
• Data source: Hugging Face, https://huggingface.co/microsoft/VibeVoice-1.5B (fetched 2025-12-19T07:41:01.175Z, adapter version 3.2.0).
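
The VRAM caveat above can be illustrated with a rough weight-only estimate. This is a minimal sketch, not a measurement: it assumes the "1.5B" in the model name is the nominal parameter count (the page does not state one), and the `weight_memory_gib` helper and the dtype table are hypothetical names introduced here for illustration. It ignores activations, caches, and framework overhead, which is where batch size matters.

```python
# Rough weight-only memory estimate.
# Not modeled: activations, caches, framework overhead (these grow with batch size).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


# Assumption: 1.5e9 is read off the model name, not a verified parameter count.
NOMINAL_PARAMS = 1.5e9

estimates = {
    dtype: weight_memory_gib(NOMINAL_PARAMS, dtype) for dtype in BYTES_PER_PARAM
}

for dtype, gib in estimates.items():
    print(f"{dtype}: {gib:.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is why quantization changes the estimate so much; real usage will be higher than these figures once inference buffers are included.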

📚 Related Resources

📄 Related Papers

No related papers linked yet. Check the model's official documentation for research papers.

📊 Training Datasets

Training data information not available. Refer to the original model card for details.

🔗 Related Models V6.2

• phi-2 (sibling)
• phi-4 (sibling)
• Florence-2-large (sibling)
• OmniParser (sibling)
• Phi-3-mini-128k-instruct (sibling)

Model Information Summary
Model Name	VibeVoice-1.5B
Author	microsoft
Type	text-to-speech
Downloads	600,844
Likes	2,099
Source	Hugging Face
Last Updated	December 19, 2025
