🧠

Model

Moss Tts Realtime Onnx

Name: Moss Tts Realtime Onnx
Author: pltobing

Nexus Index

37.5 Top 10%

S: Semantic 50

A: Authority 0

P: Popularity 2

R: Recency 97

Q: Quality 50

Tech Context

Vital Performance

45 DL / 30D

0.0%

Audited 37.5 FNI Score

Tiny - Params

- Context

45 Downloads

Commercial APACHE License

Model Information Summary
Entity Passport
Registry ID	hf-model--pltobing--moss-tts-realtime-onnx
License	Apache-2.0
Provider	huggingface

📜

Cite this model

Academic & Research Attribution

BibTeX

@misc{hf_model__pltobing__moss_tts_realtime_onnx,
  author = {pltobing},
  title = {Moss Tts Realtime Onnx Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/pltobing/MOSS-TTS-Realtime-ONNX}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

pltobing. (2026). Moss Tts Realtime Onnx [Model]. Free2AITools. https://huggingface.co/pltobing/MOSS-TTS-Realtime-ONNX

🔬Technical Deep Dive

Quick Commands

🤗 HF Download

huggingface-cli download pltobing/moss-tts-realtime-onnx

Free2AITools Nexus Index

Verification Authority

HuggingFace API GitHub Metadata Arxiv Citation DB System Audit

Unbiased Data Node Refresh: VFS Live

Technical Deep Dive

MOSS-TTS-Realtime ONNX Inference

Pure ONNX Runtime inference pipeline for MOSS-TTS-Realtime, enabling streaming text-to-speech without any PyTorch or Hugging Face Transformers dependency at runtime.

Overview

This repository provides:

inferencer_onnx.py — Core streaming TTS engine that orchestrates four ONNX models (backbone LLM, local transformer, codec encoder, codec decoder) using only NumPy and ONNX Runtime.
moss_text_tokenizer.py — Lightweight Qwen3-compatible tokenizer wrapping the tokenizers library, with no transformers dependency.
test_basic_streaming-onnx.py — End-to-end test script that simulates LLM streaming text and produces a WAV file.
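The last step of the test script, producing a WAV file, can be sketched as follows. This is a minimal sketch and not the script's own code: it assumes audio arrives as float chunks in the range [-1, 1] at 24 kHz mono, and it uses the standard-library `wave` module instead of `soundfile` so that it runs without extra packages. The name `write_wav` is hypothetical.

```python
import io
import wave

import numpy as np


def write_wav(target, chunks, sample_rate=24000):
    """Concatenate float audio chunks and write them as 16-bit mono PCM.

    `target` may be a file path or a writable binary file object.
    Returns the number of samples written.
    """
    audio = np.concatenate(
        [np.asarray(c, dtype=np.float32).reshape(-1) for c in chunks]
    )
    # Clip to [-1, 1] before scaling so out-of-range samples do not wrap around.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return int(pcm.size)


# Demo: two 0.1 s chunks (2400 samples each at 24 kHz) written to memory.
t = np.arange(2400) / 24000.0
tone = 0.2 * np.sin(2 * np.pi * 440.0 * t)
buf = io.BytesIO()
written = write_wav(buf, [tone, tone])

# Read the header back to confirm rate and length.
buf.seek(0)
with wave.open(buf, "rb") as rf:
    read_rate, read_frames = rf.getframerate(), rf.getnframes()
```

In a streaming setup the same function can be called once at the end with the list of decoded chunks, or replaced by incremental writes if memory is tight.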

Architecture

text

Reference Audio ──► Codec Encoder ──► RVQ Audio Codes (voice clone context)
                                           │
                                           ▼
Text Deltas ──► Backbone LLM (Qwen3-1.7B) ──► Hidden States
                                                    │
                                                    ▼
                                            Local Transformer ──► 16-codebook Audio Tokens
                                                                        │
                                                                        ▼
                                                                Codec Decoder ──► 24 kHz Waveform

Component	ONNX Model	Description
Backbone LLM	`backbone_llm.onnx`	Qwen3-based causal LM mapping interleaved text+audio tokens to hidden states. Maintains a growing KV-cache across the entire generation.
Local Transformer	`backbone_local.onnx`	Depth-wise decoder generating 16 RVQ codebook entries per frame from backbone hidden states. Creates and discards a fresh KV-cache per frame.
Codec Encoder	`codec_encoder.onnx`	Encodes reference speaker waveform into RVQ codes for voice cloning. Run once per session.
Codec Decoder	`codec_decoder.onnx`	Decodes RVQ audio codes back to 24 kHz waveform. Maintains four hierarchical KV-caches for streaming decode.
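Because each of the four graphs has its own tensor names and shapes, it can help to list them before wiring the models together. The sketch below relies only on `get_inputs()` and `get_outputs()`, which ONNX Runtime's `InferenceSession` provides. The stand-in session and its tensor names (`input_ids`, `hidden_states`) are hypothetical and exist only so the snippet runs without model files.

```python
from types import SimpleNamespace


def describe_io(session):
    """Map each input and output name of a session to its (shape, type)."""
    return {
        "inputs": {a.name: (a.shape, a.type) for a in session.get_inputs()},
        "outputs": {a.name: (a.shape, a.type) for a in session.get_outputs()},
    }


# Stand-in for a real session (hypothetical names and shapes). With model
# files available, pass onnxruntime.InferenceSession(path, providers=[...]).
fake_session = SimpleNamespace(
    get_inputs=lambda: [
        SimpleNamespace(name="input_ids", shape=[1, "T"], type="tensor(int64)")
    ],
    get_outputs=lambda: [
        SimpleNamespace(name="hidden_states", shape=[1, "T", 8], type="tensor(float)")
    ],
)

info = describe_io(fake_session)
for kind, tensors in info.items():
    for name, (shape, dtype) in tensors.items():
        print(f"{kind:8s} {name:16s} {shape} {dtype}")
```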

Requirements

text

numpy
onnxruntime
soundfile
librosa
tokenizers

Install with:

bash

pip install numpy onnxruntime soundfile librosa tokenizers
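A quick way to confirm the environment before loading any model is to check that each required package can be imported. This is a small sketch; the helper name `missing_packages` is hypothetical, and it assumes the import names match the pip names listed above.

```python
from importlib.util import find_spec

# Packages from the requirements list above.
REQUIRED = ["numpy", "onnxruntime", "soundfile", "librosa", "tokenizers"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```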

Directory Structure

text

.
├── inferencer_onnx.py              # Core ONNX inference engine
├── moss_text_tokenizer.py          # Lightweight Qwen3 tokenizer
├── test_basic_streaming-onnx.py    # End-to-end test script
├── README.md
├── onnx_models/  # FP32
│   ├── backbone_f32/
│   │   └── backbone_f32.onnx
│   ├── local_transformer/
│   │   └── local_transformer_f32.onnx
│   ├── codec_decoder/
│   │   └── codec_decoder.onnx
│   └── codec_encoder/
│       └── codec_encoder.onnx
├── onnx_models_quantized/  # INT8
│   └── codec_decoder_int8/
│       └── codec_decoder_int8.onnx
├── configs/
│   ├── config_backbone.json
│   └── config_codec.json
├── tokenizers/
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── audio_ref/
│   └── .[wav|mp3|flac]
└── audio_synth/
    └── .wav

Usage

Basic Streaming TTS

Notes 1

With float32, loading all models consumes about 13 GB of RAM, and inference will run out of memory (OOM) after about 120 steps on a machine with 16 GB of RAM.
With 16 GB of RAM or less, you can use the quantized (INT8) codec decoder to avoid OOM. A quantized codec encoder can also be used, but it degrades performance.
With quantized (INT8) backbone_llm and backbone_local_transformer, the output is unacceptable: most of the time it is gibberish or hallucinated.
BF16, the format in which the original MOSS-TTS model is saved, is not yet fully supported on most CPUs. If you want to use a GPU, you can convert the FP32 model.
We also observed that FP32 inference (torch/ONNX) on backbone_llm and backbone_local is somewhat unstable compared to BF16 (torch), probably because the model was trained in bfloat16 and FP32 inference exceeds the numerical range seen during training.
So the better option may be to use ONNX models converted to FP16 on a GPU or a CPU that supports it. We tried m8a.xlarge and m8a.2xlarge instances; their CPUs do not support FP16.
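The memory guidance above can be turned into a small selection helper. This is a sketch under stated assumptions: the 16 GB threshold comes from the notes, the two paths are the ones used in the example commands, and the names `total_ram_gb` and `pick_codec_decoder` are hypothetical. RAM detection uses `os.sysconf`, which is not available on every platform, so the helper falls back to the INT8 decoder when the amount is unknown.

```python
import os

# Paths as used in the example commands of this README.
INT8_DECODER = "onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx"
FP32_DECODER = "onnx_models/codec_decoder/codec_decoder.onnx"


def total_ram_gb():
    """Return physical RAM in GiB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None  # e.g. platforms without os.sysconf


def pick_codec_decoder(ram_gb, threshold_gb=16.0):
    """Prefer the INT8 decoder at or below the threshold, or when RAM is unknown."""
    if ram_gb is None or ram_gb <= threshold_gb:
        return INT8_DECODER
    return FP32_DECODER


print(pick_codec_decoder(total_ram_gb()))
```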

Notes 2

The KV-caching mechanism is modified so that each model takes a past_kv tensor as input, initialized empty along the time dimension. This removes the need to export two ONNX graphs (one for prefill and one for step): a single graph handles both initializing and continuing the cache. The mechanism applies to backbone_llm (Qwen3Model), backbone_local_transformer, and codec_decoder. The codec_encoder always receives the full sequence.
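The idea of a cache that starts with zero length on the time axis can be illustrated with plain NumPy. This is a sketch, not the exported graphs: the cache layout `(batch, heads, time, head_dim)` and the `step` function are assumptions made for illustration, and the real tensor names and shapes depend on the ONNX export.

```python
import numpy as np

# Hypothetical cache layout: (batch, heads, time, head_dim).
BATCH, HEADS, HEAD_DIM = 1, 2, 4


def empty_past_kv():
    """Return a (key, value) pair with zero length along the time axis."""
    shape = (BATCH, HEADS, 0, HEAD_DIM)
    return np.zeros(shape, dtype=np.float32), np.zeros(shape, dtype=np.float32)


def step(new_k, new_v, past_kv):
    """Stand-in for one model call: append new keys/values to the cache.

    The same code path serves prefill (empty past, many new positions)
    and decoding (non-empty past, one new position).
    """
    past_k, past_v = past_kv
    k = np.concatenate([past_k, new_k], axis=2)
    v = np.concatenate([past_v, new_v], axis=2)
    return k, v


def fake_kv(rng, steps):
    """Random keys or values for `steps` new positions."""
    return rng.standard_normal((BATCH, HEADS, steps, HEAD_DIM)).astype(np.float32)


rng = np.random.default_rng(0)
cache = empty_past_kv()
cache = step(fake_kv(rng, 5), fake_kv(rng, 5), cache)  # prefill with 5 positions
cache = step(fake_kv(rng, 1), fake_kv(rng, 1), cache)  # one decode step
print(cache[0].shape)  # (1, 2, 6, 4)
```

Concatenating onto a zero-length array is a no-op for the first call, which is why no separate prefill graph is needed in this scheme.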

Notes 3

The default text is in Russian; you can change it through the command-line arguments. The speaker prompt has also been changed to Russian; you can edit it in inferencer_onnx.py.
The prompt and the default decoding hyperparameters (temperature, top_p, top_k, repetition penalty) have been tuned for Russian, so you will probably want to adjust them for your language.
The default MOSS-TTS prompt is in English. We found that you can modify it slightly, or even translate it into your target language, to produce a more consistent accent and nativeness, as we do for Russian within the MOSSTTSRealtimeProcessor.

Example

With quantized (INT8) codec decoder (requires at least 13GB RAM)

bash

python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models_quantized/codec_decoder_int8/codec_decoder_int8.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav

With all FP32 (requires > 16GB RAM)

bash

python test_basic_streaming-onnx.py \
  --tokenizer_vocab_path tokenizers/tokenizer.json \
  --tokenizer_config_path tokenizers/tokenizer_config.json \
  --backbone_llm_path onnx_models/backbone_f32/backbone_f32.onnx \
  --backbone_local_path onnx_models/local_transformer_f32/local_transformer_f32.onnx \
  --codec_decoder_path onnx_models/codec_decoder/codec_decoder.onnx \
  --codec_encoder_path onnx_models/codec_encoder/codec_encoder.onnx \
  --backbone_config_path configs/config_backbone.json \
  --codec_config_path configs/config_codec.json \
  --prompt_wav audio_ref/male_stewie.mp3 \
  --out_wav output.wav

Programmatic Usage

python

import json
import onnxruntime as ort
from inferencer_onnx import MossTTSRealtimeInferenceONNX
from moss_text_tokenizer import MOSSTextTokenizer

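# Load tokenizer and ONNX sessions.
# The model paths below are illustrative; point them at the files in your
# checkout (see Directory Structure).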
tokenizer = MOSSTextTokenizer("tokenizers/tokenizer.json",
                               "tokenizers/tokenizer_config.json")
backbone_llm = ort.InferenceSession("onnx_models/backbone_llm.onnx",
                                     providers=["CPUExecutionProvider"])
backbone_local = ort.InferenceSession("onnx_models/backbone_local.onnx",
                                       providers=["CPUExecutionProvider"])
codec_decoder = ort.InferenceSession("onnx_models/codec_decoder.onnx",
                                      providers=["CPUExecutionProvider"])
codec_encoder = ort.InferenceSession("onnx_models/codec_encoder.onnx",
                                      providers=["CPUExecutionProvider"])

with open("configs/config_backbone.json") as f:
    backbone_config = json.load(f)
with open("configs/config_codec.json") as f:
    codec_config = json.load(f)

# Create inferencer
inferencer = MossTTSRealtimeInferenceONNX(
    tokenizer, backbone_llm, backbone_local,
    codec_decoder, codec_encoder,
    backbone_config, codec_config,
)

# Encode reference speaker for voice cloning
prompt_tokens = inferencer._encode_reference_audio("audio/speaker.wav")
input_ids = inferencer.processor.make_ensemble(prompt_tokens.squeeze(1))
inferencer.reset_turn(input_ids=input_ids, include_system_prompt=False,
                      reset_cache=True)

# Stream text and collect audio.
# your_llm_stream() is a placeholder for any iterable of text deltas (str).
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    for frame in audio_frames:
        # push_tokens + audio_chunks for waveform decoding
        ...
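For testing without a live LLM, the text stream can be simulated by slicing a fixed string into small deltas, which mirrors what the `--delta_chunk_chars` and `--delta_delay_s` arguments describe. This is a sketch; the function name `simulate_llm_stream` is hypothetical and the test script may implement the simulation differently.

```python
import time
from typing import Iterator


def simulate_llm_stream(text: str, chunk_chars: int = 1,
                        delay_s: float = 0.0) -> Iterator[str]:
    """Yield `text` in pieces of `chunk_chars` characters, pausing between them."""
    if chunk_chars < 1:
        raise ValueError("chunk_chars must be at least 1")
    for start in range(0, len(text), chunk_chars):
        if delay_s > 0:
            time.sleep(delay_s)  # imitate token latency of a real LLM
        yield text[start:start + chunk_chars]


deltas = list(simulate_llm_stream("hello", chunk_chars=2))
print(deltas)  # ['he', 'll', 'o']
```

Any generator of this shape can stand in for the text source when pushing deltas into the inferencer.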

Command-Line Arguments

Argument	Type	Default	Description
`--tokenizer_vocab_path`	str	required	Path to `tokenizer.json`
`--tokenizer_config_path`	str	required	Path to `tokenizer_config.json`
`--backbone_llm_path`	str	required	Path to backbone LLM ONNX model
`--backbone_local_path`	str	required	Path to local transformer ONNX model
`--codec_decoder_path`	str	required	Path to codec decoder ONNX model
`--codec_encoder_path`	str	required	Path to codec encoder ONNX model
`--backbone_config_path`	str	required	Path to `config_backbone.json`
`--codec_config_path`	str	required	Path to `config_codec.json`
`--prompt_wav`	str	required	Reference speaker audio for voice cloning
`--out_wav`	str	`out_streaming.wav`	Output WAV file path
`--sample_rate`	int	`24000`	Output sample rate (Hz)
`--temperature`	float	`0.725`	Sampling temperature
`--top_p`	float	`0.6`	Nucleus sampling threshold
`--top_k`	int	`34`	Top-k sampling cutoff
`--repetition_penalty`	float	`1.9`	Repetition penalty coefficient
`--repetition_window`	int	`50`	Window for repetition penalty
`--max_length`	int	`5000`	Maximum generation steps
`--delta_chunk_chars`	int	`1`	Characters per simulated LLM delta
`--delta_delay_s`	float	`0.0`	Delay between simulated deltas (seconds)
`--assistant_text`	str	(Russian text)	Text to synthesize
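The decoding arguments in the table interact in a fixed order in most samplers: repetition penalty, temperature, top-k, then top-p. The sketch below shows one common way to combine them in NumPy, using the table's defaults. It is a generic illustration, not the logic in inferencer_onnx.py; the function name `sample_token` and the exact form of the repetition penalty (divide positive logits, multiply negative ones) are assumptions.

```python
import numpy as np


def sample_token(logits, history, rng, temperature=0.725, top_p=0.6, top_k=34,
                 repetition_penalty=1.9, repetition_window=50):
    """Sample one token id from `logits` given recently generated ids."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Repetition penalty: make tokens seen in the recent window less likely.
    recent = np.unique(np.asarray(history[-repetition_window:], dtype=np.int64))
    vals = logits[recent]
    logits[recent] = np.where(vals > 0, vals / repetition_penalty,
                              vals * repetition_penalty)

    # Temperature: values below 1 sharpen the distribution.
    logits = logits / temperature

    # Top-k: drop everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits[logits < kth] = -np.inf

    # Softmax (shifted by the maximum for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(-probs)
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))


rng = np.random.default_rng(0)
dominant = sample_token([10.0, 0.0, 0.0, 0.0], [], rng)        # one clear winner
greedy_like = sample_token([0.1, 0.2, 0.3], [], rng, top_k=1)  # top_k=1 is argmax
```

With a low `top_p` such as 0.6 and a strong repetition penalty, sampling stays close to the most likely codes while discouraging loops, which is consistent with the note that these defaults were tuned for one language and may need adjusting for another.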

Acknowledgments

This work builds upon the MOSS-TTS-Realtime model by the OpenMOSS Team and the MOSS-Audio-Tokenizer codec.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

⚠️ Incomplete Data

Some information about this model is not available. Use with Caution - Verify details from the original source before relying on this data.

📝 Limitations & Considerations

• Benchmark scores may vary based on evaluation methodology and hardware configuration.
• VRAM requirements are estimates; actual usage depends on quantization and batch size.
• FNI scores are relative rankings and may change as new models are added.
• License: the model card states Apache-2.0; verify licensing terms with the original source before commercial use.

Top Tier

Social Proof

HuggingFace Hub

45 Downloads

Hub Discussions

🤗 Data Source: Hugging Face ↗

🔄 Daily sync (03:00 UTC)

AI Summary: Based on Hugging Face metadata. Not a recommendation.

🛡️ Model Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-model--pltobing--moss-tts-realtime-onnx
slug: pltobing--moss-tts-realtime-onnx
source: huggingface
author: pltobing
license: Apache-2.0
tags: onnx, text-to-speech, tts, moss-tts, voice clone, streaming, qwen3, rvq, multilingual, causal audio tokenizer, ru, zh, en, de, es, fr, ja, it, he, ko, fa, ar, pl, pt, cs, da, sv, hu, el, tr, base_model:openmoss-team/moss-tts, base_model:quantized:openmoss-team/moss-tts, license:apache-2.0, region:us

⚙️ Technical Specs

architecture: null
params billions: null
context length: null
pipeline tag: text-to-speech

📊 Engagement & Metrics

downloads: 45
stars: 0
forks: 0

Data indexed from public sources. Updated daily.
