Dataset

Tts Dataset

by srinathnr hf-dataset--srinathnr--tts_dataset




📜

Cite this dataset

Academic & Research Attribution

BibTeX

@misc{hf_dataset__srinathnr__tts_dataset,
  author = {srinathnr},
  title = {Tts Dataset Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/srinathnr/TTS_DATASET}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}

APA Style

srinathnr. (2026). Tts Dataset [Dataset]. Free2AITools. https://huggingface.co/datasets/srinathnr/TTS_DATASET


Dataset Specification

license: cc
task_categories:
- automatic-speech-recognition
- audio-to-audio
- audio-classification
language:
- en
pretty_name: Phonemized-VCTK (speech + features)
size_categories:
- 10K<n<100K

Phonemized-VCTK (speech + features)

Phonemized-VCTK is a light repack of the VCTK corpus that bundles, for each utterance:

the raw audio (wav/)
the plain transcript (txt/)
the IPA phoneme string (phonemized/)
frame-level pitch-aligned segments (segments/)
sentence-level context embeddings (context_embeddings/)
speaker-level embeddings (speaker_embeddings/)

The goal is to provide a turn-key dataset for forced alignment, prosody modelling, TTS, and speaker adaptation experiments, so that these side products do not have to be regenerated every time.

Folder layout

Folder	Contents	Example files
`wav/<spk>/`	48 kHz 16‑bit mono `.wav` files	`p225_001.wav`, …
`txt/<spk>/`	original plain‑text transcript	`p225_001.txt`, …
`phonemized/<spk>/`	whitespace‑separated IPA symbols, `#h` = word boundary	`p225_001.txt`, …
`segments/<spk>/`	JSON with per‑phoneme timing & mean pitch	`p225_001.json`, …
`context_embeddings/<spk>/`	NumPy float32 `.npy`, sentence embedding of the utterance	`p225_001.npy`, …
`speaker_embeddings/`	NumPy float32 `.npy`, one vector per speaker, generated from NVIDIA `TitaNet-Large` model	`p225.npy`, …
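
The layout above can be turned into a small path helper. The following is a minimal sketch, not part of the dataset itself: the function name `utterance_paths` is hypothetical, and it assumes utterance IDs always have the form `<spk>_<index>` as in `p225_001`.

```python
from pathlib import Path


def utterance_paths(root, utt_id):
    """Map one utterance ID (e.g. "p225_001") to its file in each folder."""
    root = Path(root)
    # Assumption: the speaker ID is everything before the first underscore.
    spk = utt_id.split("_")[0]
    return {
        "wav": root / "wav" / spk / f"{utt_id}.wav",
        "txt": root / "txt" / spk / f"{utt_id}.txt",
        "phonemized": root / "phonemized" / spk / f"{utt_id}.txt",
        "segments": root / "segments" / spk / f"{utt_id}.json",
        "context_embedding": root / "context_embeddings" / spk / f"{utt_id}.npy",
        # Speaker embeddings are stored once per speaker, not per utterance.
        "speaker_embedding": root / "speaker_embeddings" / f"{spk}.npy",
    }


paths = utterance_paths("Phonemized-VCTK", "p225_001")
```

Keeping the mapping in one place makes it easier to check that every side product exists for an utterance before it is used in training.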

Example segments entry

{
  "0": ["h#", {"start_sec":0.0,"end_sec":0.10,"duration_sec":0.10,"mean_pitch":0.0}],
  "1": ["p",  {"start_sec":0.10,"end_sec":0.18,"duration_sec":0.08,"mean_pitch":0.0}],
  "2": ["l",  {"start_sec":0.18,"end_sec":1.32,"duration_sec":1.14,"mean_pitch":1377.16}]
}
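
A segments file in this shape can be flattened into an ordered list of records for further processing. This is a sketch under the assumption that every file follows the structure of the example entry (string index keys, each mapping to a `[phone, info]` pair); the helper name `load_segments` is hypothetical.

```python
import json


def load_segments(json_text):
    """Return segments as a list of dicts, ordered by their numeric key."""
    raw = json.loads(json_text)
    rows = []
    # Keys are strings ("0", "1", ...), so sort them numerically.
    for key in sorted(raw, key=int):
        phone, info = raw[key]
        rows.append({"phone": phone, **info})
    return rows


example = """
{
  "0": ["h#", {"start_sec":0.0,"end_sec":0.10,"duration_sec":0.10,"mean_pitch":0.0}],
  "1": ["p",  {"start_sec":0.10,"end_sec":0.18,"duration_sec":0.08,"mean_pitch":0.0}],
  "2": ["l",  {"start_sec":0.18,"end_sec":1.32,"duration_sec":1.14,"mean_pitch":1377.16}]
}
"""

rows = load_segments(example)
# Phones with a non-zero mean pitch are treated as voiced here.
voiced = [r["phone"] for r in rows if r["mean_pitch"] > 0]
total_sec = rows[-1]["end_sec"] - rows[0]["start_sec"]
```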

Quick start

from datasets import load_dataset

ds_train = load_dataset("srinathnr/TTS_DATASET", split="train", trust_remote_code=True, streaming=True)
ds_val = load_dataset("srinathnr/TTS_DATASET", split="validation", trust_remote_code=True, streaming=True)
ds_test = load_dataset("srinathnr/TTS_DATASET", split="test", trust_remote_code=True, streaming=True)

Custom Data Load

from pathlib import Path

from datasets import Audio
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    """Pairs each utterance's audio with its IPA phoneme sequence."""

    def __init__(self, dataset_folder):
        self.dataset_folder = dataset_folder
        # Collect files recursively, skipping macOS "._" metadata files
        self.audio_files = sorted(
            path for path in (Path(dataset_folder) / 'wav').rglob('*.wav')
            if not path.name.startswith('._')
        )
        self.phoneme_files = sorted(
            path for path in (Path(dataset_folder) / 'phonemized').rglob('*.txt')
            if not path.name.startswith('._')
        )

        # Get the base file names (without extensions) for matching
        audio_basenames = {path.stem for path in self.audio_files}
        phoneme_basenames = {path.stem for path in self.phoneme_files}

        # Intersection of all file sets (excluding speaker embeddings)
        common_basenames = audio_basenames & phoneme_basenames

        # Filter files to only include common base names
        self.audio_files = [path for path in self.audio_files if path.stem in common_basenames]
        self.phoneme_files = [path for path in self.phoneme_files if path.stem in common_basenames]

        # Audio is resampled to 16 kHz when decoded
        self.audio_feature = Audio(sampling_rate=16000)

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio_path = str(self.audio_files[idx])
        phoneme_path = str(self.phoneme_files[idx])

        align_audio = self.audio_feature.decode_example({"path": audio_path, "bytes": None})

        # Phoneme files hold whitespace-separated IPA symbols
        with open(phoneme_path, 'r', encoding='utf-8') as f:
            phoneme = f.read().split()

        return {
            'phoneme': phoneme,
            'align_audio': align_audio
        }

Explore

from pathlib import Path
import json, soundfile as sf
import numpy as np

root = Path("Phonemized-VCTK")
wav, sr = sf.read(root/"wav/p225/p225_001.wav")
text = (root/"txt/p225/p225_001.txt").read_text().strip()
ipa  = (root/"phonemized/p225/p225_001.txt").read_text().strip()
segs = json.loads((root/"segments/p225/p225_001.json").read_text())
ctx  = np.load(root/"context_embeddings/p225/p225_001.npy")
print(text)
print(ipa.split())       # IPA tokens
print(ctx.shape)         # (384,)
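
The per-speaker vectors in `speaker_embeddings/` can be compared in the same way. The sketch below assumes cosine similarity is a reasonable measure for these embeddings (the card does not prescribe one); the helper name is hypothetical, and the file names in the trailing comment are illustrative only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors stand in for real embeddings so the example runs offline.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])

# With the dataset on disk, one might load two speakers like this
# (the second file name is an assumption about which speakers exist):
# spk_a = np.load("Phonemized-VCTK/speaker_embeddings/p225.npy")
# spk_b = np.load("Phonemized-VCTK/speaker_embeddings/p226.npy")
# print(cosine_similarity(spk_a, spk_b))
```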

Known limitations

The phone set is plain IPA, with no stress or intonation markers.
English only (≈109 speakers, various accents).
Mean pitch is 0 on unvoiced phones; interpolate if needed.
Embedding models were chosen for convenience; swap as you like.
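
One possible way to fill the zero-pitch gaps on unvoiced phones is linear interpolation between the neighbouring voiced phones. This is a sketch, not the dataset author's method: it interpolates over the phone index rather than over time, and it holds the nearest voiced value at the edges. The function name `interpolate_pitch` is hypothetical.

```python
import numpy as np


def interpolate_pitch(mean_pitch):
    """Replace zero (unvoiced) pitch values by interpolating voiced neighbours."""
    pitch = np.asarray(mean_pitch, dtype=float)
    voiced = pitch > 0
    if not voiced.any():
        # No voiced phone to interpolate from; return the input unchanged.
        return pitch
    idx = np.arange(len(pitch))
    # np.interp holds the first/last voiced value outside the voiced range.
    return np.interp(idx, idx[voiced], pitch[voiced])


filled = interpolate_pitch([0.0, 100.0, 0.0, 200.0, 0.0])
```

Interpolating over segment midpoints in seconds instead of the index would weight long phones more faithfully; that variant only changes the x-coordinates passed to `np.interp`.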

Citation

Please cite both VCTK and this derivative if you use the corpus:

@misc{yours2025phonvctk,
  title        = {Phonemized-VCTK: An enriched version of VCTK with IPA, alignments and embeddings},
  author       = {Your Name},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/your-handle/phonemized-vctk}}
}

@inproceedings{yamagishi2019cstr,
  title={The CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author={Yamagishi, Junichi et al.},
  booktitle={Proc. LREC},
  year={2019}
}


🆔 Identity & Source

id: hf-dataset--srinathnr--tts_dataset
source: huggingface
author: srinathnr
tags: task_categories:automatic-speech-recognition, task_categories:audio-to-audio, task_categories:audio-classification, language:en, license:cc, size_categories:10k, modality:text, region:us


📊 Engagement & Metrics

likes: 0
downloads: 53,190
