Moss Tts Realtime Onnx
| Entity Passport | |
|---|---|
| Registry ID | hf-model--pltobing--moss-tts-realtime-onnx |
| License | Apache-2.0 |
| Provider | huggingface |
Cite this model
Academic & Research Attribution
@misc{hf_model__pltobing__moss_tts_realtime_onnx,
author = {pltobing},
title = {Moss Tts Realtime Onnx Model},
year = {2026},
howpublished = {\url{https://huggingface.co/pltobing/moss-tts-realtime-onnx}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Quick Commands
huggingface-cli download pltobing/moss-tts-realtime-onnx
Technical Deep Dive
MOSS-TTS-Realtime ONNX Inference
Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.
Overview
This repository provides:
- inferencer_onnx.py: Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
- moss_text_tokenizer.py: Lightweight Qwen3-compatible tokenizer wrapping the tokenizers library, with no transformers dependency.
- test_basic_streaming-onnx.py: End-to-end test script that simulates LLM streaming text and produces a WAV file.
Architecture
Reference Audio ──► Codec Encoder ──► RVQ Audio Codes (voice clone context)
                                           │
                                           ▼
Text Deltas ──► Backbone LLM (Qwen3-1.7B) ──► Hidden States
                                                  │
                                                  ▼
                              Local Transformer ──► 16-codebook Audio Tokens
                                                          │
                                                          ▼
                                          Codec Decoder ──► 24 kHz Waveform
| Component | ONNX Model | Description |
|---|---|---|
| Backbone LLM | backbone_llm.onnx | Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation. |
| Local Transformer | backbone_local.onnx | Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame. |
| Codec Encoder | codec_encoder.onnx | Encodes reference speaker waveform into RVQ codes for voice cloning. Run once per session. |
| Codec Decoder | codec_decoder.onnx | Decodes RVQ audio codes back to 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode. |
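The table above implies a two-level generation loop: the backbone keeps one cache for the whole generation, while the local transformer starts from an empty cache for every frame. Below is a minimal sketch of that control flow, using stand-in functions rather than the real ONNX sessions. The names backbone_step, local_step, and generate_frames and the value of CODEBOOK_SIZE are hypothetical, and the sketch assumes one audio frame per backbone step.

```python
import random

NUM_CODEBOOKS = 16    # RVQ codebook entries per frame (from the table above)
CODEBOOK_SIZE = 1024  # hypothetical value, for illustration only


def backbone_step(token, cache):
    """Stand-in for backbone_llm.onnx: extend the persistent cache, return a fake hidden state."""
    cache.append(token)
    return float(len(cache))


def local_step(hidden, prev_codes, cache, rng):
    """Stand-in for backbone_local.onnx: emit one codebook entry for the current frame."""
    cache.append((hidden, tuple(prev_codes)))
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(tokens, seed=0):
    rng = random.Random(seed)
    backbone_cache = []  # grows across the entire generation
    frames = []
    for token in tokens:
        hidden = backbone_step(token, backbone_cache)
        local_cache = []  # created fresh for this frame, discarded afterwards
        codes = []
        for _ in range(NUM_CODEBOOKS):
            codes.append(local_step(hidden, codes, local_cache, rng))
        frames.append(codes)
    return frames, len(backbone_cache)


frames, backbone_cache_len = generate_frames(["t0", "t1", "t2"])
print(len(frames), len(frames[0]), backbone_cache_len)
```

The point of the sketch is only the cache lifetimes: the backbone cache length equals the number of steps taken so far, while each local cache never holds more than 16 entries.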
Requirements
numpy
onnxruntime
soundfile
librosa
tokenizers
Install with:
pip install numpy onnxruntime soundfile librosa tokenizers
Directory Structure
.
├── inferencer_onnx.py                # Core ONNX inference engine
├── moss_text_tokenizer.py            # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py      # End-to-end test script
├── README.md
├── onnx_models/                      # FP32
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/            # INT8
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── .[wav|mp3|flac]
└── audio_synth/
    └── .wav
Usage
Basic Streaming TTS
Notes 1
- With float32, the loaded models consume about 13 GB in total, and inference runs out of memory (OOM) after about 120 steps on a machine with 16 GB RAM.
- With 16 GB RAM or less, use the quantized (INT8) codec decoder to avoid OOM. A quantized codec encoder can also be used, but it degrades performance.
- With a quantized (INT8) backbone_llm and backbone_local_transformer, output quality is unacceptable: most of the time the model produces gibberish or hallucinates.
- BF16, the format in which the original MOSS-TTS model is saved, is not yet fully supported on most CPUs. If you want to use a GPU, you can convert the FP32 model.
- We also noticed that FP32 inference (torch or ONNX) on backbone_llm and backbone_local is somewhat unstable compared to BF16 (torch). This is probably because the model was trained in bfloat16 and FP32 inference goes outside the numerical range seen during training.
- A better option may therefore be an ONNX model converted to FP16, run on a GPU or on a CPU that supports FP16. We tried m8a.xlarge and m8a.2xlarge instances; their CPUs do not support FP16.
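As a rough guide to the precision trade-offs above, the weight footprint of a model scales with the bytes per parameter. The sketch below assumes the nominal 1.7 billion parameters suggested by the Qwen3-1.7B backbone name. It covers backbone weights only, so it does not reproduce the 13 GB total, which also includes the other three models, activations, KV-caches, and runtime overhead. The helper name weight_footprint_gb is hypothetical.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

# Assumption: nominal parameter count taken from the "Qwen3-1.7B" name.
BACKBONE_PARAMS = 1_700_000_000


def weight_footprint_gb(num_params, dtype):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 1e9


for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_footprint_gb(BACKBONE_PARAMS, dtype):.1f} GB")
# fp32: 6.8 GB
# fp16: 3.4 GB
# int8: 1.7 GB
```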
Notes 2
- The KV-caching mechanism takes the past key/value tensors as an explicit input (past_kv), initialized with a zero-length time dimension, so there is no need to export two separate ONNX models for prefill and step: a single ONNX model handles both the initial call and every continuation. This applies to backbone_llm (Qwen3Model), backbone_local_transformer, and codec_decoder. The codec_encoder always receives the full sequence.
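The single-graph idea can be illustrated in plain NumPy: if the cache starts with a time axis of length 0, concatenating new keys and values onto it is the same operation for the first call and for every later call. This is a toy sketch, not the exported graph. The tensor layout (batch, heads, time, head_dim), the toy sizes, and the names empty_cache and attention_step are assumptions made for illustration.

```python
import numpy as np

NUM_HEADS, HEAD_DIM = 2, 4  # hypothetical toy sizes


def empty_cache(batch=1):
    # Zero-length time axis: shape (batch, heads, 0, head_dim).
    return np.zeros((batch, NUM_HEADS, 0, HEAD_DIM), dtype=np.float32)


def attention_step(q, k_new, v_new, past_k, past_v):
    """Causal attention that serves both prefill (many tokens) and step (one token)."""
    k = np.concatenate([past_k, k_new], axis=2)
    v = np.concatenate([past_v, v_new], axis=2)
    t_new, t_all = q.shape[2], k.shape[2]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(HEAD_DIM)
    # Query i sits at absolute position (t_all - t_new + i) and may only
    # attend to keys at or before that position.
    q_pos = np.arange(t_all - t_new, t_all)[:, None]
    k_pos = np.arange(t_all)[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, k, v


rng = np.random.default_rng(0)
x = rng.standard_normal((1, NUM_HEADS, 4, HEAD_DIM)).astype(np.float32)

# One pass over all 4 tokens, starting from the empty cache.
full_out, _, _ = attention_step(x, x, x, empty_cache(), empty_cache())

# Same function: prefill 3 tokens, then continue with 1 token.
prefill_out, k, v = attention_step(x[:, :, :3], x[:, :, :3], x[:, :, :3],
                                   empty_cache(), empty_cache())
step_out, k, v = attention_step(x[:, :, 3:], x[:, :, 3:], x[:, :, 3:], k, v)
print(k.shape)
```

Splitting the sequence into prefill and step gives the same outputs as the single pass, which is why one exported model is enough.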
Notes 3
- The default text is in Russian; you can change it through the command-line arguments. The speaker prompt has also been changed to Russian; you can edit it in inferencer_onnx.py.
- The prompt and the default decoding hyperparameters (temperature, top_p, top_k, repetition penalty) have been tuned for Russian, so you will probably want to adjust them for your language.
- The default MOSS-TTS prompt is in English. We found that you can modify it slightly, or even translate it into your target language, to get a more consistent accent and nativeness, as we do for Russian within MOSSTTSRealtimeProcessor.
Example
- With quantized (INT8) codec decoder (requires at least 13GB RAM)
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
- With all FP32 (requires > 16GB RAM)
python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav
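Both commands end by writing the synthesized audio to the path given by --out_wav. The sketch below shows one way such a final step could look, using only the Python standard library and assuming mono float samples in the range [-1, 1] at 24 kHz. The test script itself may well use soundfile instead; write_wav is a hypothetical helper, not part of this repository.

```python
import io
import math
import struct
import wave


def write_wav(chunks, target, sample_rate=24000):
    """Write mono float chunks (values in [-1, 1]) as 16-bit PCM; return the sample count."""
    samples = [s for chunk in chunks for s in chunk]
    pcm = struct.pack(
        f"<{len(samples)}h",
        *(int(max(-1.0, min(1.0, s)) * 32767) for s in samples),
    )
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return len(samples)


# Usage: two 0.1 s chunks of a 440 Hz tone written to an in-memory buffer.
buf = io.BytesIO()
tone = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
num_written = write_wav([tone, tone], buf)
print(num_written)
```

Passing a file path instead of the BytesIO buffer writes the same data to disk.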
Programmatic Usage
import json

import onnxruntime as ort

from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

# Load tokenizer and ONNX sessions
# (adjust the model paths to match your local layout; see Directory Structure)
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                              "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession("onnx_models/backbone_llm.onnx",
                                    providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession("onnx_models/backbone_local.onnx",
                                      providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession("onnx_models/codec_decoder.onnx",
                                     providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession("onnx_models/codec_encoder.onnx",
                                     providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode reference speaker for voice cloning
prompt_tokens = inferencer._encode_reference_audio("audio/speaker.wav")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio
# (your_llm_stream() is a placeholder for your own generator of text deltas)
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # push_tokens + audio_chunks for waveform decoding
        ...
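The your_llm_stream() call above is a placeholder. For local experiments, a stand-in that slices a fixed string into small deltas is usually enough. The sketch below is one such stand-in; the function name simulate_llm_stream is hypothetical, and its two parameters are modeled on the --delta_chunk_chars and --delta_delay_s options of the test script rather than copied from its code.

```python
import time


def simulate_llm_stream(text, chunk_chars=1, delay_s=0.0):
    """Yield text in pieces of chunk_chars characters, pausing delay_s between pieces."""
    if chunk_chars < 1:
        raise ValueError("chunk_chars must be >= 1")
    for start in range(0, len(text), chunk_chars):
        if start > 0 and delay_s > 0:
            time.sleep(delay_s)  # imitate the latency between LLM deltas
        yield text[start:start + chunk_chars]


deltas = list(simulate_llm_stream("Hello, world", chunk_chars=5))
print(deltas)  # ['Hello', ', wor', 'ld']
```

Joining the yielded pieces always gives back the original text, so the generator can be swapped in wherever a stream of text deltas is expected.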
Command-Line Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --tokenizer_vocab_path | str | required | Path to tokenizer.json |
| --tokenizer_config_path | str | required | Path to tokenizer_config.json |
| --backbone_llm_path | str | required | Path to backbone LLM ONNX model |
| --backbone_local_path | str | required | Path to local transformer ONNX model |
| --codec_decoder_path | str | required | Path to codec decoder ONNX model |
| --codec_encoder_path | str | required | Path to codec encoder ONNX model |
| --backbone_config_path | str | required | Path to config_backbone.json |
| --codec_config_path | str | required | Path to config_codec.json |
| --prompt_wav | str | required | Reference speaker audio for voice cloning |
| --out_wav | str | out_streaming.wav | Output WAV file path |
| --sample_rate | int | 24000 | Output sample rate (Hz) |
| --temperature | float | 0.725 | Sampling temperature |
| --top_p | float | 0.6 | Nucleus sampling threshold |
| --top_k | int | 34 | Top-k sampling cutoff |
| --repetition_penalty | float | 1.9 | Repetition penalty coefficient |
| --repetition_window | int | 50 | Window for repetition penalty |
| --max_length | int | 5000 | Maximum generation steps |
| --delta_chunk_chars | int | 1 | Characters per simulated LLM delta |
| --delta_delay_s | float | 0.0 | Delay between simulated deltas (seconds) |
| --assistant_text | str | (Russian text) | Text to synthesize |
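To make the sampling arguments concrete, the sketch below applies a windowed repetition penalty, temperature, top-k, and top-p to a vector of logits, using the defaults from the table. It follows a common formulation of these techniques. The order of operations, the penalty rule (divide positive logits, multiply negative ones), and the function name sample_token are assumptions and may differ from what inferencer_onnx.py actually does.

```python
import numpy as np


def sample_token(logits, history, temperature=0.725, top_p=0.6, top_k=34,
                 repetition_penalty=1.9, repetition_window=50, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty on tokens seen in the last repetition_window steps.
    if repetition_window > 0 and len(history) > 0:
        recent = np.unique(np.asarray(history[-repetition_window:], dtype=np.int64))
        vals = logits[recent]
        logits[recent] = np.where(vals > 0, vals / repetition_penalty,
                                  vals * repetition_penalty)

    logits /= temperature

    # Top-k: mask everything below the k-th largest logit.
    k = min(top_k, logits.size)
    kth_value = np.sort(logits)[-k]
    logits[logits < kth_value] = -np.inf

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    final /= final.sum()
    return int(rng.choice(final.size, p=final))


rng = np.random.default_rng(0)
# Token 0 has the higher raw logit but was just generated, so it is penalized.
token = sample_token([1.0, 0.9], history=[0], rng=rng)
print(token)
```

With a penalty as high as 1.9, a token that was just produced loses enough probability that the nucleus here collapses to the other candidate, which shows how strongly this default discourages repeats.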
Acknowledgments
This work builds upon the MOSS-TTS-Realtime model by OpenMOSS Team and the MOSS-Audio-Tokenizer codec.
License
Copyright 2026 Patrick Lumbantobing, Vertox-AI
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Identity & Source
- id: hf-model--pltobing--moss-tts-realtime-onnx
- slug: pltobing--moss-tts-realtime-onnx
- source: huggingface
- author: pltobing
- license: Apache-2.0
- tags: onnx, text-to-speech, tts, moss-tts, voice clone, streaming, qwen3, rvq, multilingual, causal audio tokenizer, ru, zh, en, de, es, fr, ja, it, he, ko, fa, ar, pl, pt, cs, da, sv, hu, el, tr, base_model:openmoss-team/moss-tts, base_model:quantized:openmoss-team/moss-tts, license:apache-2.0, region:us
- pipeline tag: text-to-speech