Tts Dataset
| Entity Passport | |
| Registry ID | hf-dataset--srinathnr--tts_dataset |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__srinathnr__tts_dataset,
author = {srinathnr},
title = {Tts Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/srinathnr/TTS_DATASET}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
license: cc
task_categories:
- automatic-speech-recognition
- audio-to-audio
- audio-classification
language:
- en
pretty_name: Phonemized-VCTK (speech + features)
size_categories:
- 10K<n<100K
Phonemized-VCTK (speech + features)
Phonemized-VCTK is a light repack of the VCTK corpus that bundles, per utterance:
- the raw audio (`wav/`)
- the plain transcript (`txt/`)
- the IPA phoneme string (`phonemized/`)
- frame-level pitch-aligned segments (`segments/`)
- sentence-level context embeddings (`context_embeddings/`)
- speaker-level embeddings (`speaker_embeddings/`)
The goal is to provide a turn-key dataset for forced alignment, prosody modelling, TTS, and speaker adaptation experiments without having to regenerate these side-products every time.
Folder layout
| Folder | Contents | Shape / format |
|---|---|---|
| `wav/<spk>/` | 48 kHz 16-bit mono `.wav` files | `p225_001.wav`, … |
| `txt/<spk>/` | original plain-text transcript | `p225_001.txt`, … |
| `phonemized/<spk>/` | whitespace-separated IPA symbols, `#h` = word boundary | `p225_001.txt`, … |
| `segments/<spk>/` | JSON with per-phoneme timing & mean pitch | `p225_001.json`, … |
| `context_embeddings/<spk>/` | NumPy float32 `.npy`, sentence embedding of the utterance | `p225_001.npy`, … |
| `speaker_embeddings/` | NumPy float32 `.npy`, one vector per speaker, generated from NVIDIA TitaNet-Large model | `p225.npy`, … |
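As a minimal sketch of how the layout above maps one utterance ID to its files, the helper below builds all six paths from a root folder and an ID such as `p225_001`. The function name `utterance_paths` is hypothetical, and it assumes the speaker ID is everything before the first underscore, which matches the example file names but is not stated explicitly by the dataset.

```python
from pathlib import Path


def utterance_paths(root, utt_id):
    """Map an utterance ID (e.g. 'p225_001') to its files in each folder.

    Assumption: the speaker ID is the part before the first underscore.
    """
    root = Path(root)
    spk = utt_id.split("_")[0]
    return {
        "wav": root / "wav" / spk / f"{utt_id}.wav",
        "txt": root / "txt" / spk / f"{utt_id}.txt",
        "phonemized": root / "phonemized" / spk / f"{utt_id}.txt",
        "segments": root / "segments" / spk / f"{utt_id}.json",
        "context_embedding": root / "context_embeddings" / spk / f"{utt_id}.npy",
        # Speaker embeddings are stored once per speaker, not per utterance
        "speaker_embedding": root / "speaker_embeddings" / f"{spk}.npy",
    }


paths = utterance_paths("Phonemized-VCTK", "p225_001")
for name, path in paths.items():
    print(f"{name:>18}: {path}")
```

Only the paths are constructed here; nothing is read from disk, so a missing file would surface later, when it is opened.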
Example segments entry
{
"0": ["h#", {"start_sec":0.0,"end_sec":0.10,"duration_sec":0.10,"mean_pitch":0.0}],
"1": ["p", {"start_sec":0.10,"end_sec":0.18,"duration_sec":0.08,"mean_pitch":0.0}],
"2": ["l", {"start_sec":0.18,"end_sec":1.32,"duration_sec":1.14,"mean_pitch":1377.16}]
}
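A minimal sketch for reading an entry like the one above into an ordered list of phone records. It assumes the top-level keys are stringified integers giving the phone order, as in the example; the helper name `load_segments` is hypothetical.

```python
import json

EXAMPLE = """
{
  "0": ["h#", {"start_sec":0.0,"end_sec":0.10,"duration_sec":0.10,"mean_pitch":0.0}],
  "1": ["p",  {"start_sec":0.10,"end_sec":0.18,"duration_sec":0.08,"mean_pitch":0.0}],
  "2": ["l",  {"start_sec":0.18,"end_sec":1.32,"duration_sec":1.14,"mean_pitch":1377.16}]
}
"""


def load_segments(json_text):
    """Return (phone, start_sec, end_sec, mean_pitch) tuples in index order."""
    raw = json.loads(json_text)
    rows = []
    # Sort numerically: a plain string sort would place "10" before "2"
    for key in sorted(raw, key=int):
        phone, info = raw[key]
        rows.append((phone, info["start_sec"], info["end_sec"], info["mean_pitch"]))
    return rows


for phone, start, end, pitch in load_segments(EXAMPLE):
    voiced = "voiced" if pitch > 0 else "unvoiced/silence"
    print(f"{phone:>3} {start:.2f}-{end:.2f}s {voiced}")
```

For a file on disk, the same function could be called with the text of a `segments/<spk>/*.json` file.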
Quick start
from datasets import load_dataset
ds_train = load_dataset("srinathnr/TTS_DATASET", split="train", trust_remote_code=True, streaming=True)
ds_val = load_dataset("srinathnr/TTS_DATASET", split="validation", trust_remote_code=True, streaming=True)
ds_test = load_dataset("srinathnr/TTS_DATASET", split="test", trust_remote_code=True, streaming=True)
Custom Data Load
from pathlib import Path
from datasets import Audio
from torch.utils.data import Dataset
class CustomDataset(Dataset):
    """Pairs each utterance's audio with its phonemized transcript."""

    def __init__(self, dataset_folder):
        self.dataset_folder = dataset_folder
        # Collect files recursively, skipping macOS "._" resource-fork files
        self.audio_files = sorted(
            [path for path in (Path(dataset_folder) / 'wav').rglob('*.wav') if not path.name.startswith('._')]
        )
        self.phoneme_files = sorted(
            [path for path in (Path(dataset_folder) / 'phonemized').rglob('*.txt') if not path.name.startswith('._')]
        )
        # Get the base file names (without extensions) for matching
        audio_basenames = {path.stem for path in self.audio_files}
        phoneme_basenames = {path.stem for path in self.phoneme_files}
        # Keep only utterances that have both an audio file and a phoneme file
        common_basenames = audio_basenames & phoneme_basenames
        self.audio_files = [path for path in self.audio_files if path.stem in common_basenames]
        self.phoneme_files = [path for path in self.phoneme_files if path.stem in common_basenames]
        # Decode audio resampled to 16 kHz (the source files are 48 kHz)
        self.audio_feature = Audio(sampling_rate=16000)

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        audio_path = str(self.audio_files[idx])
        phoneme_path = str(self.phoneme_files[idx])
        align_audio = self.audio_feature.decode_example({"path": audio_path, "bytes": None})
        # IPA text is non-ASCII, so read it explicitly as UTF-8
        with open(phoneme_path, 'r', encoding='utf-8') as f:
            phoneme = f.read().split()  # whitespace-separated IPA tokens
        return {
            'phoneme': phoneme,
            'align_audio': align_audio
        }
Explore
from pathlib import Path
import json, soundfile as sf
import numpy as np
root = Path("Phonemized-VCTK")
wav, sr = sf.read(root/"wav/p225/p225_001.wav")
text = (root/"txt/p225/p225_001.txt").read_text().strip()
ipa = (root/"phonemized/p225/p225_001.txt").read_text().strip()
segs = json.loads((root/"segments/p225/p225_001.json").read_text())
ctx = np.load(root/"context_embeddings/p225/p225_001.npy")
print(text)
print(ipa.split()) # IPA tokens
print(ctx.shape) # (384,)
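The per-speaker vectors in `speaker_embeddings/` load the same way with `np.load`. As a rough sketch of one thing they could be used for, the helper below compares two embedding vectors by cosine similarity. The function name `cosine_similarity` is hypothetical, and the toy vectors stand in for real embeddings so the snippet runs without the dataset on disk.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b)) / denom if denom > 0 else 0.0


# Toy stand-ins for two vectors loaded with np.load(...)
emb_a = np.array([1.0, 0.0, 1.0], dtype=np.float32)
emb_b = np.array([1.0, 0.0, 0.0], dtype=np.float32)
print(round(cosine_similarity(emb_a, emb_b), 3))
```

With the corpus downloaded, the two inputs would instead come from two `.npy` files under `speaker_embeddings/`.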
Known limitations
- The phone set is plain IPA—no stress or intonation markers.
- English only (≈109 speakers, various accents).
- Pitch = 0 on unvoiced phones; interpolate if needed.
- Embedding models were chosen for convenience—swap as you like.
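For the pitch limitation above, one possible way to fill the zeros is linear interpolation between neighbouring voiced phones. The sketch below is only an illustration: it interpolates over phone index rather than time, treats any value of 0 as unvoiced, and uses a hypothetical helper name, `interpolate_unvoiced`.

```python
import numpy as np


def interpolate_unvoiced(mean_pitch):
    """Replace zero (unvoiced) pitch values by interpolating between voiced ones.

    Leading and trailing zeros take the nearest voiced value, because
    np.interp holds the edge values constant outside the known range.
    """
    pitch = np.asarray(mean_pitch, dtype=np.float64)
    voiced = pitch > 0
    if not voiced.any():
        return pitch.copy()  # nothing to interpolate from
    idx = np.arange(len(pitch))
    return np.interp(idx, idx[voiced], pitch[voiced])


print(interpolate_unvoiced([0.0, 100.0, 0.0, 200.0, 0.0]))
```

Interpolating over segment midpoints in seconds instead of phone index would respect phone durations; that variant only changes the x-coordinates passed to `np.interp`.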
Citation
Please cite both VCTK and this derivative if you use the corpus:
@misc{yours2025phonvctk,
title = {Phonemized-VCTK: An enriched version of VCTK with IPA, alignments and embeddings},
author = {Your Name},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/your-handle/phonemized-vctk}}
}
@inproceedings{yamagishi2019cstr,
title={The CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
author={Yamagishi, Junichi et al.},
booktitle={Proc. LREC},
year={2019}
}
🛡️ Dataset Transparency Report
Verified data manifest for traceability and transparency.
🆔 Identity & Source
- id: hf-dataset--srinathnr--tts_dataset
- source: huggingface
- author: srinathnr
- tags: task_categories:automatic-speech-recognition, task_categories:audio-to-audio, task_categories:audio-classification, language:en, license:cc, size_categories:10k, modality:text, region:us
📊 Engagement & Metrics
- likes: 0
- downloads: 53,190