WhisperSpeech
| Entity Passport | |
|---|---|
| Registry ID | gh-model--whisperspeech--whisperspeech |
| License | MIT |
| Provider | github |
Cite this tool
Academic & Research Attribution
@misc{gh_model__whisperspeech__whisperspeech,
author = {WhisperSpeech},
title = {WhisperSpeech Tool},
year = {2026},
howpublished = {\url{https://free2aitools.com/tool/gh-model--whisperspeech--whisperspeech}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
🔬 Technical Deep Dive
Quick Commands
pip install whisperspeech
⚖️ Nexus Index V2.0
💬 Index Insight
FNI V2.0 for WhisperSpeech: Semantic (S:50), Authority (A:0), Popularity (P:63), Recency (R:76), Quality (Q:70).
Verification Authority
📋 Specs
- Language: Python
- License: MIT
- Version: 1.0.0
Technical Documentation
WhisperSpeech
Join us in the #audio-generation channel on the LAION Discord to chat, ask questions, or contribute!
WhisperSpeech is an open-source, text-to-speech (TTS) system created by “inverting” OpenAI Whisper.
Our goal is to be for speech what Stable Diffusion is for images—powerful, hackable, and commercially safe.
- All code is Apache-2.0 / MIT.
- Models are trained only on properly licensed data.
- Current release: English (LibriLight). Multilingual release coming next.
Sample output →
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
🚀 Progress Updates
[2024-01-29] – Tiny S2A multilingual voice-cloning
We trained a tiny S2A model on an en + pl + fr dataset. It can clone French voices even though the semantic tokens were pre-trained and frozen on English + Polish only, which suggests that a single semantic tokeniser could cover all languages.
https://github.com/collabora/WhisperSpeech/assets/107984/267f2602-7eec-4646-a43b-059ff91b574e
https://github.com/collabora/WhisperSpeech/assets/107984/fbf08e8e-0f9a-4b0d-ab5e-747ffba2ccb9
[2024-01-18] – 12× real-time on a 4090 + voice-cloning demo
- Added torch.compile, KV-caching, and layer tweaks → 12× faster-than-real-time on a consumer RTX 4090.
- Seamlessly code-switch within one sentence:
To jest pierwszy test wielojęzycznego Whisper Speech modelu … (Polish with English mixed in: "This is the first test of the multilingual Whisper Speech model …")
https://github.com/collabora/WhisperSpeech/assets/107984/d7092ef1-9df7-40e3-a07e-fdc7a090ae9e
- One-click voice-cloning—example based on Winston Churchill’s “Be Ye Men of Valour” (radio static preserved by design):
https://github.com/collabora/WhisperSpeech/assets/107984/bd28110b-31fb-4d61-83f6-c997f560bc26
Test it on Colab (≤ 30 s install). Hugging Face Space coming soon.
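The 12× figure above is a ratio of generated audio duration to wall-clock generation time. A minimal sketch of that arithmetic follows. The function names are illustrative and not part of WhisperSpeech, and real throughput will vary with hardware, model size, and text length.

```python
def speedup_over_real_time(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    if audio_seconds < 0 or wall_seconds <= 0:
        raise ValueError("audio_seconds must be >= 0 and wall_seconds must be > 0")
    return audio_seconds / wall_seconds


def generation_time(audio_seconds: float, speedup: float = 12.0) -> float:
    """Expected wall-clock time to generate a clip at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be > 0")
    return audio_seconds / speedup


# One minute of speech at 12x faster than real time takes about five seconds.
print(generation_time(60.0))  # 5.0
```

To measure the speed-up on your own hardware, time one generation call and pass the clip length and elapsed time to `speedup_over_real_time`.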
[2024-01-10] – Faster SD S2A + first cloning example
A new SD‑size S2A model brings major speed‑ups without sacrificing quality; cloning example added.
Try it on Colab.
[2023-12-10] – Multilingual trio (EN/PL)
- English (female voice transferred from a Polish dataset):
https://github.com/collabora/WhisperSpeech/assets/107984/aa5a1e7e-dc94-481f-8863-b022c7fd7434
- Polish (male voice):
https://github.com/collabora/WhisperSpeech/assets/107984/4da14b03-33f9-4e2d-be42-f0fcf1d4a6ec
📊 Community Benchmarks
Unofficial speed & memory‑usage results from the community can be found here.
📦 Downloads
- Quick start: open the Colab above or run the notebook locally.
- Manual:
- Pre‑trained models – https://huggingface.co/collabora/whisperspeech
- Converted datasets – https://huggingface.co/datasets/collabora/whisperspeech
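For a local run outside the notebook, the sketch below wraps text-to-file synthesis in one function. It assumes the `whisperspeech` package exposes `Pipeline` in `whisperspeech.pipeline` with a `generate_to_file(path, text)` method and an optional `s2a_ref` argument, as in the project's inference examples. Check the installed version before relying on it. The wrapper name `synthesize_to_file` is hypothetical, and the first call downloads model weights from the Hugging Face repository listed above.

```python
from typing import Optional


def synthesize_to_file(text: str,
                       out_path: str = "output.wav",
                       s2a_ref: Optional[str] = None) -> str:
    """Generate speech for `text` and write it to `out_path`.

    Assumption: whisperspeech.pipeline.Pipeline and its generate_to_file
    method behave as in the project's examples; verify against your install.
    """
    # Imported lazily so this module still loads without the package installed.
    from whisperspeech.pipeline import Pipeline

    # Pass s2a_ref only when given; otherwise use the package defaults.
    kwargs = {} if s2a_ref is None else {"s2a_ref": s2a_ref}
    pipe = Pipeline(**kwargs)
    pipe.generate_to_file(out_path, text)
    return out_path
```

Calling `synthesize_to_file("Hello from WhisperSpeech.")` should leave an `output.wav` in the working directory, provided the package and a supported PyTorch build are installed.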
🗺️ Roadmap
- Gather large emotive‑speech dataset
- Condition generation on emotion & prosody
- Community drive for freely licensed multilingual speech
- Train final multilingual models
⚙️ Architecture
WhisperSpeech follows the two‑stage, token‑based pipeline popularised by
AudioLM, Google’s SPEAR TTS, and Meta’s MusicGen:
| Stage | Model | Purpose |
|---|---|---|
| Semantic | Whisper | Transcription ➜ semantic tokens |
| Acoustic | EnCodec | Tokenise waveform (1.5 kbps) |
| Vocoder | Vocos | High‑fidelity audio |
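The 1.5 kbps figure in the table can be checked with a little arithmetic. The sketch below assumes the published EnCodec 24 kHz configuration (75 frames per second, 1024-entry codebooks). Those constants come from the EnCodec paper rather than from this document, and the helper names are made up for illustration.

```python
import math

FRAME_RATE_HZ = 75     # assumed: EnCodec 24 kHz model, 24000 samples / hop of 320
CODEBOOK_SIZE = 1024   # assumed: entries per codebook, i.e. 10 bits per token


def bitrate_bps(n_codebooks: int) -> float:
    """Bits per second when each frame carries n_codebooks tokens."""
    return FRAME_RATE_HZ * n_codebooks * math.log2(CODEBOOK_SIZE)


def acoustic_tokens(duration_seconds: float, n_codebooks: int = 2) -> int:
    """Number of acoustic tokens needed for a clip of the given length."""
    frames = round(duration_seconds * FRAME_RATE_HZ)
    return frames * n_codebooks


# Two codebooks give 75 * 2 * 10 = 1500 bits/s, i.e. 1.5 kbps.
print(bitrate_bps(2))         # 1500.0
# A 10-second clip is 750 frames, so 1500 acoustic tokens at 1.5 kbps.
print(acoustic_tokens(10.0))  # 1500
```

Under these assumptions, the acoustic stage has to predict about 150 tokens per second of speech.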
EnCodec architecture diagram

Conference talks (deep dives)

Tricks Learned from Scaling WhisperSpeech Models to 80k+ Hours of Speech – Jakub Cłapa, Collabora

Open‑Source TTS Projects: WhisperSpeech – In‑Depth Discussion
🙏 Appreciation
Made possible by:
- Collabora – code & training
- LAION – community & datasets
- Jülich Supercomputing Centre – JUWELS Booster
Additional compute funded by the Gauss Centre for Supercomputing via the John von Neumann Institute for Computing (NIC).
Special thanks to individual contributors:
- @inevitable-2031 (qwerty_qwer on Discord) for dataset curation
💼 Consulting
Need help with open‑source or proprietary AI projects?
Contact us via Collabora or DM us on Discord.
📚 Citations
@article{SpearTTS,
title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
url = {https://arxiv.org/abs/2302.03540},
author = {Kharitonov, Eugene and Vincent, Damien and Borsos, Zalán and Marinier, Raphaël and Girgin, Sertan and Pietquin, Olivier and Sharifi, Matt and Tagliasacchi, Marco and Zeghidour, Neil},
publisher = {arXiv},
year = {2023},
}
@article{MusicGen,
title = {Simple and Controllable Music Generation},
url = {https://arxiv.org/abs/2306.05284},
author = {Copet, Jade and Kreuk, Felix and Gat, Itai and Remez, Tal and Kant, David and Synnaeve, Gabriel and Adi, Yossi and Défossez, Alexandre},
publisher = {arXiv},
year = {2023},
}
@article{Whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
url = {https://arxiv.org/abs/2212.04356},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
publisher = {arXiv},
year = {2022},
}
@article{EnCodec,
title = {High Fidelity Neural Audio Compression},
url = {https://arxiv.org/abs/2210.13438},
author = {Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
publisher = {arXiv},
year = {2022},
}
@article{Vocos,
title = {Vocos: Closing the Gap Between Time‑Domain and Fourier‑Based Neural Vocoders for High‑Quality Audio Synthesis},
url = {https://arxiv.org/abs/2306.00814},
author = {Siuzdak, Hubert},
publisher = {arXiv},
year = {2023},
}
AI Summary: Based on GitHub metadata. Not a recommendation.
🛡️ Tool Transparency Report
Technical metadata sourced from upstream repositories.
🆔 Identity & Source
- id: gh-model--whisperspeech--whisperspeech
- slug: whisperspeech--whisperspeech
- source: github
- author: WhisperSpeech
- license: MIT
- tags: pytorch, speech-synthesis, tts, jupyter notebook
⚙️ Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag: automatic-speech-recognition
📊 Engagement & Metrics
- downloads: 0
- stars: 0
- forks: 0
Data indexed from public sources. Updated daily.

