Shrutam 2
| Entity Passport | |
|---|---|
| Registry ID | hf-model--bharatgenai--shrutam-2 |
| Provider | huggingface |
Cite this model
Academic & Research Attribution
@misc{hf_model__bharatgenai__shrutam_2,
author = {bharatgenai},
title = {Shrutam 2 Model},
year = {2026},
howpublished = {\url{https://huggingface.co/bharatgenai/Shrutam-2}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Quick Commands
huggingface-cli download bharatgenai/shrutam-2
⚖️ Free2AITools Nexus Index V2.0
💬 Index Insight
FNI V2.0 for Shrutam 2: Semantic (S:50), Authority (A:0), Popularity (P:0), Recency (R:97), Quality (Q:50).
Technical Deep Dive
Shrutam-2: LLM-Powered Multilingual Indic Speech Recognition
Shrutam-2 is an LLM-based automatic speech recognition system for 12 major Indian languages. It bridges a Conformer speech encoder with a pretrained LLM decoder through a Mixture-of-Experts (MoE) projection layer, enabling high-quality, prompt-controllable transcription across diverse Indic languages.
Architecture Overview
Unlike conventional CTC/Attention ASR systems that map audio directly to text tokens, Shrutam-2 reframes speech recognition as a conditional language generation task. A speech encoder produces frame-level audio representations, which are then projected into the LLM's embedding space and fed to a frozen LLM decoder alongside a text prompt.
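The data flow described above can be pictured in terms of array shapes. The following is a minimal NumPy sketch, not the model's actual code: the sizes, the single random matrix standing in for the projector, and the prompt-before-audio ordering are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
num_frames, encoder_dim, llm_dim, prompt_len = 50, 512, 1024, 6

# Stand-in for the speech encoder output: one vector per audio frame.
encoder_frames = rng.standard_normal((num_frames, encoder_dim))

# Stand-in for the projector: a single linear map into the LLM embedding space.
projection = rng.standard_normal((encoder_dim, llm_dim)) / np.sqrt(encoder_dim)
audio_embeds = encoder_frames @ projection  # shape: (num_frames, llm_dim)

# Stand-in for the embedded text prompt (one vector per prompt token).
prompt_embeds = rng.standard_normal((prompt_len, llm_dim))

# The decoder consumes a single embedding sequence.
# ASSUMPTION: the prompt precedes the audio; the source does not state the order.
decoder_inputs = np.concatenate([prompt_embeds, audio_embeds], axis=0)
print(decoder_inputs.shape)  # (56, 1024)
```

The point of the sketch is only that, after projection, audio frames and prompt tokens live in the same embedding space, so the decoder can treat transcription as ordinary conditional generation.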
The key architectural contribution is the MoE Projector that bridges the encoder and the LLM:
| Component | Details |
|---|---|
| Downsampler | Two-stage Conv1D that reduces the encoder frame rate for efficient LLM consumption |
| MoE Projector | 8 linear experts with SMEAR (Soft Merging of Experts with Adaptive Routing) — utterance-level soft gating computes a weighted merge of all expert parameters into a single projector per input, avoiding discrete top-k routing and its associated load-balancing issues |
Each expert is a two-layer MLP (encoder_dim → 2048 → llm_dim). Rather than routing each frame to a single expert, SMEAR computes frame-wise router probabilities, averages them at the utterance level, and produces a single merged weight matrix per utterance. This yields a smooth, fully differentiable routing mechanism with a simple MSE-based load-balancing loss.
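The merge step described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the router is assumed to be a single linear layer, the hidden activation is assumed to be ReLU, the load-balancing loss is assumed to be the MSE between the utterance-level gate and a uniform distribution, and the dimensions are shrunk so the example runs instantly.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def smear_project(frames, router_w, w1, b1, w2, b2):
    """Project (T, enc_dim) frames with one merged two-layer MLP.

    router_w: (enc_dim, E); w1: (E, enc_dim, H); b1: (E, H);
    w2: (E, H, llm_dim); b2: (E, llm_dim).
    """
    # 1. Frame-wise router probabilities over the E experts: (T, E).
    frame_probs = softmax(frames @ router_w, axis=-1)
    # 2. Average over time to get one soft gate per utterance: (E,).
    gates = frame_probs.mean(axis=0)
    # 3. Merge expert parameters into a single projector (weighted sum over E).
    w1_m = np.tensordot(gates, w1, axes=1)  # (enc_dim, H)
    b1_m = gates @ b1                       # (H,)
    w2_m = np.tensordot(gates, w2, axes=1)  # (H, llm_dim)
    b2_m = gates @ b2                       # (llm_dim,)
    # 4. Apply the merged MLP to every frame (ReLU is an assumption).
    hidden = np.maximum(frames @ w1_m + b1_m, 0.0)
    projected = hidden @ w2_m + b2_m
    # 5. ASSUMED loss form: MSE between the gate and a uniform distribution.
    balance_loss = float(np.mean((gates - 1.0 / gates.size) ** 2))
    return projected, gates, balance_loss

# Toy sizes; the source specifies 8 experts and a 2048-wide hidden layer.
rng = np.random.default_rng(0)
T, enc_dim, hidden_dim, llm_dim, E = 20, 16, 32, 24, 8
frames = rng.standard_normal((T, enc_dim))
router_w = rng.standard_normal((enc_dim, E)) * 0.1
w1 = rng.standard_normal((E, enc_dim, hidden_dim)) * 0.1
b1 = np.zeros((E, hidden_dim))
w2 = rng.standard_normal((E, hidden_dim, llm_dim)) * 0.1
b2 = np.zeros((E, llm_dim))

projected, gates, balance_loss = smear_project(frames, router_w, w1, b1, w2, b2)
print(projected.shape, gates.round(3), balance_loss)
```

Because the experts are merged in parameter space, every frame passes through one MLP regardless of the number of experts, and every step is differentiable with respect to the router weights.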
Why LLM-Based ASR?
Traditional ASR pipelines rely on acoustic models trained exclusively on speech-text pairs. By grounding transcription in a pretrained LLM, this approach gains several advantages:
- Rich linguistic priors — The LLM's language knowledge reduces hallucinations and improves fluency, especially for low-resource languages.
- Prompt controllability — Transcription behavior can be steered through natural-language prompts without retraining.
- Unified multilingual capacity — A single model serves all 12 languages, with the MoE layer learning language-adaptive projections.
Languages Supported
| # | Language | Script | ISO 639-1 |
|---|---|---|---|
| 1 | Hindi | Devanagari | hi |
| 2 | Marathi | Devanagari | mr |
| 3 | Tamil | Tamil | ta |
| 4 | Telugu | Telugu | te |
| 5 | Malayalam | Malayalam | ml |
| 6 | Kannada | Kannada | kn |
| 7 | Odia | Odia | or |
| 8 | Bengali | Bengali | bn |
| 9 | Urdu | Perso-Arabic (Nastaliq) | ur |
| 10 | Assamese | Bengali | as |
| 11 | Gujarati | Gujarati | gu |
| 12 | Punjabi | Gurmukhi | pa |
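For scripting around the model, the table above can be mirrored as a small lookup. This is a convenience sketch only; `SUPPORTED_LANGUAGES` and `language_name` are hypothetical names and are not part of the model's code.

```python
# ISO 639-1 code -> language name, copied from the table above.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi", "mr": "Marathi", "ta": "Tamil", "te": "Telugu",
    "ml": "Malayalam", "kn": "Kannada", "or": "Odia", "bn": "Bengali",
    "ur": "Urdu", "as": "Assamese", "gu": "Gujarati", "pa": "Punjabi",
}

def language_name(code: str) -> str:
    """Return the language name for a supported ISO 639-1 code."""
    key = code.strip().lower()
    if key not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return SUPPORTED_LANGUAGES[key]

print(language_name("ta"))  # Tamil
```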
Extended Capabilities
Note: The capabilities below are not fully tested and are presented as potential directions. They can be unlocked or significantly enhanced with task-specific fine-tuning.
Prompt Customisation
Because the LLM decoder conditions on both audio embeddings and a text prompt, you can control transcription behavior at inference time by changing the prompt.
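If you generate many prompt variants, a small helper keeps their wording consistent. The sketch below is hypothetical: `build_prompt` is not part of the model's code, and it simply reproduces the wording of the example prompts in this section. Whether the model honors a given prompt is untested, as the note above says.

```python
from typing import Optional

def build_prompt(language: Optional[str] = None,
                 domain: Optional[str] = None,
                 script: Optional[str] = None) -> str:
    """Assemble a transcription prompt.

    `domain` is a full noun phrase such as "medical conversation".
    When `domain` is given, `script` is ignored.
    """
    if language is None and domain is None and script is None:
        return "Transcribe speech to text."
    if domain is not None:
        suffix = f" in {language}" if language else ""
        return f"Transcribe the following {domain}{suffix}."
    source = f"{language} speech" if language else "speech"
    target = f"{script} text" if script else "text"
    return f"Transcribe the following {source} to {target}."

print(build_prompt(language="Tamil", script="Devanagari"))
```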
Basic transcription:
"Transcribe speech to text."
Language-specific prompting:
"Transcribe the following Hindi speech to text."
"Transcribe the following Tamil speech to Devanagari text."
Domain-specific prompting:
"Transcribe the following medical conversation in Hindi."
"Transcribe the following legal proceeding in Bengali."
Few-Shot Prompting
The LLM backbone enables few-shot prompting where you provide example transcriptions in the prompt to bias the model toward a specific vocabulary, style, or domain:
"The following are examples of transcriptions from a banking domain:
- 'मुझे अपने खाते का बैलेंस जानना है'
- 'कृपया मेरा पिन रीसेट कर दीजिए'
Now transcribe the following speech to text."
This is particularly useful for:
- Domain adaptation — Bias the decoder toward domain-specific terminology (medical, legal, financial) without retraining.
- Named entity handling — Provide example transcriptions containing proper nouns, brand names, or technical terms so the model calibrates its output vocabulary.
- Script/transliteration control — Guide the model toward a particular script or romanization convention.
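A few-shot prompt of the shape shown above can be assembled programmatically. This is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and the line layout it produces simply follows the banking example.

```python
from typing import Sequence

def build_few_shot_prompt(domain: str, examples: Sequence[str]) -> str:
    """Build a few-shot transcription prompt from example transcriptions."""
    if not examples:
        raise ValueError("provide at least one example transcription")
    lines = [f"The following are examples of transcriptions from a {domain} domain:"]
    lines += [f"- '{example}'" for example in examples]
    lines.append("Now transcribe the following speech to text.")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "banking",
    ["मुझे अपने खाते का बैलेंस जानना है", "कृपया मेरा पिन रीसेट कर दीजिए"],
)
print(prompt)
```

Keeping the examples short matters in practice, since every example token shares the decoder's context with the audio embeddings.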
Code-Switching Support
The multilingual nature of both the speech encoder and the LLM enables handling of code-switched speech (e.g., Hindi-English) when prompted appropriately:
"Transcribe the following Hindi-English code-mixed speech to text."
Usage
Requirements
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.56.3 huggingface_hub==0.36.0 pyyaml
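After installing, you may want to confirm that the environment matches the pinned versions. The check below is a sketch that uses only the standard library; `PINNED` and `check_versions` are hypothetical names, and local build tags such as `+cu118` are ignored when comparing.

```python
from importlib import metadata

# Versions pinned in the install commands above.
PINNED = {
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "torchaudio": "2.3.0",
    "transformers": "4.56.3",
    "huggingface_hub": "0.36.0",
}

def check_versions(pinned):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Strip a local build tag (e.g. "2.3.0+cu118" -> "2.3.0").
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, base == wanted)
    return report

for name, (installed, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: installed={installed} pinned={PINNED[name]} [{status}]")
```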
Quick Start
Update inference_config.yaml with your model paths (see Configuration below), then run:
python inference_script.py
The script loads the full pipeline (encoder, MoE projector, LLM), transcribes the audio file, and prints the output text.
License
This model is released under the BharatGen non-commercial license. Please refer to the LICENSE file for detailed terms and conditions.
For more details about the model, see https://arxiv.org/abs/2601.19451
⚠️ Incomplete Data
Some information about this model is not available. Use with caution, and verify details against the original source before relying on this data.
📝 Limitations & Considerations
- Benchmark scores may vary based on evaluation methodology and hardware configuration.
- VRAM requirements are estimates; actual usage depends on quantization and batch size.
- FNI scores are relative rankings and may change as new models are added.
- ⚠ License not listed in the indexed metadata: the model card cites a BharatGen non-commercial license. Verify licensing terms before commercial use.
AI Summary: Based on Hugging Face metadata. Not a recommendation.
🛡️ Model Transparency Report
Technical metadata sourced from upstream repositories.
🆔 Identity & Source
- id: hf-model--bharatgenai--shrutam-2
- slug: bharatgenai--shrutam-2
- source: huggingface
- author: bharatgenai
- license: (not listed)
- tags: automatic-speech-recognition, multilingual, conformer, mixture-of-experts, llm, speech-to-text, hi, mr, ta, te, ml, kn, or, bn, ur, as, gu, pa, arxiv:2601.19451, region:us, safetensors
⚙️ Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag: automatic-speech-recognition
📊 Engagement & Metrics
- downloads: 0
- stars: 0
- forks: 0
Data indexed from public sources. Updated daily.