Shrutam 2

Name: Shrutam 2
Author: bharatgenai


Free2AITools Nexus Index

37.3 Top 100%

S: Semantic 50

A: Authority 0

P: Popularity 0

R: Recency 97

Q: Quality 50


Model Information Summary
Entity Passport
Registry ID	hf-model--bharatgenai--shrutam-2
Provider	huggingface

📜

Cite this model

Academic & Research Attribution

BibTeX

@misc{hf_model__bharatgenai__shrutam_2,
  author = {bharatgenai},
  title = {Shrutam 2 Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/bharatgenai/Shrutam-2}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

bharatgenai. (2026). Shrutam 2 [Model]. Free2AITools. https://huggingface.co/bharatgenai/Shrutam-2


Quick Commands

🤗 HF Download

huggingface-cli download bharatgenai/shrutam-2


Technical Deep Dive

Shrutam-2: LLM-Powered Multilingual Indic Speech Recognition

Shrutam-2 is an LLM-based automatic speech recognition (ASR) system for 12 major Indian languages. It bridges a Conformer speech encoder with a pretrained LLM decoder through a Mixture-of-Experts (MoE) projection layer, enabling high-quality, prompt-controllable transcription across diverse Indic languages.

Architecture Overview

Unlike conventional CTC/Attention ASR systems that map audio directly to text tokens, Shrutam-2 reframes speech recognition as a conditional language generation task. A speech encoder produces frame-level audio representations, which are then projected into the LLM's embedding space and fed to a frozen LLM decoder alongside a text prompt.

The key architectural contribution is the MoE Projector that bridges the encoder and the LLM:

Component	Details
Downsampler	Two-stage Conv1D that reduces the encoder frame rate for efficient LLM consumption
MoE Projector	8 linear experts with SMEAR (Soft Merging of Experts with Adaptive Routing) — utterance-level soft gating computes a weighted merge of all expert parameters into a single projector per input, avoiding discrete top-k routing and its associated load-balancing issues
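As a rough illustration of the Downsampler row, the arithmetic below shows how two strided Conv1D stages shrink the number of frames the LLM has to consume. The kernel size, stride, and padding values are assumptions chosen for illustration; the source does not state the actual hyperparameters, and the function names are hypothetical.

```python
def conv1d_out_len(n_frames: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Standard Conv1D output length for dilation 1."""
    return (n_frames + 2 * padding - kernel) // stride + 1


def downsampled_len(n_frames: int, stages=((3, 2, 1), (3, 2, 1))) -> int:
    """Apply each (kernel, stride, padding) stage in turn.

    The default of two stride-2 stages is an assumed configuration,
    not the one used by Shrutam-2.
    """
    for kernel, stride, padding in stages:
        n_frames = conv1d_out_len(n_frames, kernel, stride, padding)
    return n_frames


if __name__ == "__main__":
    # Under the assumed settings, 100 encoder frames become 25 projector inputs.
    print(downsampled_len(100))
```

With these assumed settings each stage halves the frame count, so the LLM sees roughly a quarter as many audio positions as the encoder emits.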

Each expert is a two-layer MLP (encoder_dim → 2048 → llm_dim). Rather than routing each frame to a single expert, SMEAR computes frame-wise router probabilities, averages them at the utterance level, and produces a single merged weight matrix per utterance. This yields a smooth, fully differentiable routing mechanism with a simple MSE-based load-balancing loss.
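The merging step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the class name, the ReLU activation, the random initialisation, and the exact form of the MSE load-balancing term (distance of the utterance-level gate from a uniform distribution) are all hypothetical choices, since the source only names the components.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class SmearProjector:
    """Soft-merged mixture of two-layer MLP experts (illustrative sketch)."""

    def __init__(self, encoder_dim, llm_dim, hidden_dim=2048, num_experts=8, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02  # assumed init scale
        self.router = rng.normal(0, scale, (encoder_dim, num_experts))
        self.w1 = rng.normal(0, scale, (num_experts, encoder_dim, hidden_dim))
        self.b1 = np.zeros((num_experts, hidden_dim))
        self.w2 = rng.normal(0, scale, (num_experts, hidden_dim, llm_dim))
        self.b2 = np.zeros((num_experts, llm_dim))

    def __call__(self, frames):
        # frames: (T, encoder_dim) for a single utterance
        frame_probs = softmax(frames @ self.router)  # (T, E) frame-wise router probabilities
        gate = frame_probs.mean(axis=0)              # (E,) utterance-level average

        # Merge expert parameters into one projector for this utterance.
        w1 = np.tensordot(gate, self.w1, axes=1)     # (encoder_dim, hidden_dim)
        b1 = gate @ self.b1
        w2 = np.tensordot(gate, self.w2, axes=1)     # (hidden_dim, llm_dim)
        b2 = gate @ self.b2

        hidden = np.maximum(frames @ w1 + b1, 0.0)   # ReLU is an assumption
        projected = hidden @ w2 + b2                 # (T, llm_dim)

        # Assumed load-balancing term: MSE between the gate and a uniform distribution.
        uniform = np.full_like(gate, 1.0 / gate.size)
        balance_loss = float(np.mean((gate - uniform) ** 2))
        return projected, gate, balance_loss
```

Because the gate is a soft average rather than a top-k selection, every expert contributes to the merged weights for every utterance, which is what keeps the routing differentiable end to end.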

Why LLM-Based ASR?

Traditional ASR pipelines rely on acoustic models trained exclusively on speech-text pairs. By grounding transcription in a pretrained LLM, this approach gains several advantages:

Rich linguistic priors — The LLM's language knowledge reduces hallucinations and improves fluency, especially for low-resource languages.
Prompt controllability — Transcription behavior can be steered through natural-language prompts without retraining.
Unified multilingual capacity — A single model serves all 12 languages, with the MoE layer learning language-adaptive projections.

Languages Supported

#	Language	Script	ISO 639-1
1	Hindi	Devanagari	`hi`
2	Marathi	Devanagari	`mr`
3	Tamil	Tamil	`ta`
4	Telugu	Telugu	`te`
5	Malayalam	Malayalam	`ml`
6	Kannada	Kannada	`kn`
7	Odia	Odia	`or`
8	Bengali	Bengali	`bn`
9	Urdu	Perso-Arabic (Nastaliq)	`ur`
10	Assamese	Bengali-Assamese	`as`
11	Gujarati	Gujarati	`gu`
12	Punjabi	Gurmukhi	`pa`
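For client code that needs to validate a language before building a prompt, the table above can be mirrored as a small lookup. This is a convenience sketch; the names `SUPPORTED_LANGUAGES` and `language_name` are hypothetical and not part of the model's own tooling.

```python
# ISO 639-1 codes and language names copied from the table above.
SUPPORTED_LANGUAGES = {
    "hi": "Hindi",
    "mr": "Marathi",
    "ta": "Tamil",
    "te": "Telugu",
    "ml": "Malayalam",
    "kn": "Kannada",
    "or": "Odia",
    "bn": "Bengali",
    "ur": "Urdu",
    "as": "Assamese",
    "gu": "Gujarati",
    "pa": "Punjabi",
}


def language_name(code: str) -> str:
    """Return the language name for an ISO 639-1 code, or raise ValueError."""
    try:
        return SUPPORTED_LANGUAGES[code.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None
```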

Extended Capabilities

Note: The capabilities below are not fully tested and are presented as potential directions. They can be unlocked or significantly enhanced with task-specific fine-tuning.

Prompt Customisation

Because the LLM decoder conditions on both audio embeddings and a text prompt, you can control transcription behavior at inference time by changing the prompt.
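The prompt patterns shown in this section (basic, language-specific, and domain-specific) can be assembled with a small helper. The function below is a hypothetical convenience wrapper that only reproduces the example wordings from this page; how the prompt string is passed to the model depends on the inference script and is not shown here.

```python
from typing import Optional


def build_prompt(
    language: Optional[str] = None,
    script: Optional[str] = None,
    domain_phrase: Optional[str] = None,
) -> str:
    """Build a transcription prompt in the style of the examples on this page.

    language      -- e.g. "Hindi"; omit for the generic prompt
    script        -- optional target script, e.g. "Devanagari"
    domain_phrase -- e.g. "medical conversation"; requires a language
    """
    if domain_phrase is not None:
        if language is None:
            raise ValueError("domain_phrase requires a language")
        return f"Transcribe the following {domain_phrase} in {language}."
    if language is None:
        return "Transcribe speech to text."
    target = f"{script} text" if script else "text"
    return f"Transcribe the following {language} speech to {target}."


if __name__ == "__main__":
    print(build_prompt())
    print(build_prompt("Tamil", script="Devanagari"))
    print(build_prompt("Bengali", domain_phrase="legal proceeding"))
```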

Basic transcription:

text

"Transcribe speech to text."

Language-specific prompting:

text

"Transcribe the following Hindi speech to text."
"Transcribe the following Tamil speech to Devanagari text."

Domain-specific prompting:

text

"Transcribe the following medical conversation in Hindi."
"Transcribe the following legal proceeding in Bengali."

Few-Shot Prompting

The LLM backbone enables few-shot prompting where you provide example transcriptions in the prompt to bias the model toward a specific vocabulary, style, or domain:

text

"The following are examples of transcriptions from a banking domain:
 - 'मुझे अपने खाते का बैलेंस जानना है'
 - 'कृपया मेरा पिन रीसेट कर दीजिए'
Now transcribe the following speech to text."

This is particularly useful for:

Domain adaptation — Bias the decoder toward domain-specific terminology (medical, legal, financial) without retraining.
Named entity handling — Provide example transcriptions containing proper nouns, brand names, or technical terms so the model calibrates its output vocabulary.
Script/transliteration control — Guide the model toward a particular script or romanization convention.
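A few-shot prompt in the layout shown above can be generated from a list of example transcriptions. The helper name and signature are hypothetical; the wording mirrors the banking example, and whether a given set of examples actually improves output is untested here.

```python
from typing import Sequence


def build_few_shot_prompt(domain: str, examples: Sequence[str]) -> str:
    """Format example transcriptions into a few-shot transcription prompt."""
    if not examples:
        raise ValueError("at least one example transcription is required")
    lines = [f"The following are examples of transcriptions from a {domain} domain:"]
    lines += [f" - '{example}'" for example in examples]
    lines.append("Now transcribe the following speech to text.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_few_shot_prompt(
        "banking",
        ["मुझे अपने खाते का बैलेंस जानना है", "कृपया मेरा पिन रीसेट कर दीजिए"],
    ))
```

Keeping the examples short matters in practice, since every prompt token competes with the audio embeddings for the decoder's context.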

Code-Switching Support

The multilingual nature of both the speech encoder and the LLM enables handling of code-switched speech (e.g., Hindi-English) when prompted appropriately:

text

"Transcribe the following Hindi-English code-mixed speech to text."

Usage

Requirements

bash

pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.56.3 huggingface_hub==0.36.0 pyyaml

Quick Start

Update inference_config.yaml with your model paths (see Configuration below), then run:

bash

python inference_script.py

The script loads the full pipeline (encoder, MoE projector, LLM), transcribes the audio file, and prints the output text.
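Before running the script, the repository files need to be available locally. One way to fetch them from Python is sketched below using `snapshot_download` from `huggingface_hub`. The local directory name and both helper functions are hypothetical, and the check assumes that `inference_config.yaml` and `inference_script.py` ship in the repository, which this page implies but does not confirm. If the repository requires authentication, log in with the Hugging Face CLI first.

```python
from pathlib import Path

REPO_ID = "bharatgenai/shrutam-2"
# Assumed to be present in the repository, based on the Quick Start text.
EXPECTED_FILES = ("inference_config.yaml", "inference_script.py")


def fetch_model_files(local_dir: str = "shrutam-2") -> Path:
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


def has_inference_files(model_dir) -> bool:
    """Check that the files named in the Quick Start exist in model_dir."""
    return all((Path(model_dir) / name).is_file() for name in EXPECTED_FILES)


if __name__ == "__main__":
    model_dir = fetch_model_files()
    if not has_inference_files(model_dir):
        raise SystemExit(f"Expected files not found in {model_dir}")
    print(f"Model files ready in {model_dir}")
```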

License

This model is released under the BharatGen non-commercial license. Please refer to the LICENSE file for detailed terms and conditions.

For more details about the model, see https://arxiv.org/abs/2601.19451


🛡️ Model Transparency Report

Technical metadata sourced from upstream repositories.

Open Metadata

🆔 Identity & Source

id: hf-model--bharatgenai--shrutam-2
slug: bharatgenai--shrutam-2
source: huggingface
author: bharatgenai
license
tags: automatic-speech-recognition, multilingual, conformer, mixture-of-experts, llm, speech-to-text, hi, mr, ta, te, ml, kn, or, bn, ur, as, gu, pa, arxiv:2601.19451, region:us, safetensors

⚙️ Technical Specs

architecture: null
params billions: null
context length: null
pipeline tag: automatic-speech-recognition

📊 Engagement & Metrics

downloads: 0
stars: 0
forks: 0

Data indexed from public sources. Updated daily.
