
KZ STT/TTS stack

Date: 2026-05-02 Status: ground-truth benchmark for ADR-0023 KZ stack selection. Hardware tested on: Apple M3 Max (16-inch MBP), MPS + CPU. Production target Mac Studio M3 Ultra 256 GB + RTX PRO 6000 — same model families, faster wall-clock. Test clip: projects/oinap/ip/07_Звук/OiynUp_Sound_KAZ.wav (25.26 s, 24-bit 44.1 kHz stereo, downmixed to 16 kHz mono PCM for inference). Children’s voiceover script naming OIYNUP characters (Домалақ, Трия, …).


| Layer | Recommendation | Why |
| --- | --- | --- |
| STT primary | Uali/whisper-turbo-ksc2-kazakh-finetuned (HF, open) | Only model that recognised the canonical OIYNUP character name **«Домалақпен»**. Clean Kazakh suffixes. Whisper-Turbo architecture, so deployable on RTX PRO 6000 or Mac Studio MPS without a bespoke runtime. |
| STT fallback | issai/soyle_onnx (ISSAI, open) | Closest in suffix accuracy. Stable Kazakh inflection. ONNX → portable to web / mobile / edge. Hallucinated «Мадрид» on this one clip — keep as fallback, not primary. |
| STT generic | mlx-whisper large-v3-turbo (kk) | Fastest on Apple Silicon; baseline only. Mangles KZ proper nouns. Use for non-OIYNUP content (Russian, English) where it’s strong. |
| TTS primary (V1) | facebook/mms-tts-kaz (Meta, open VITS) | Native Kazakh, runs on CPU at RTF ≈ 0.10, 16 kHz output. Single-speaker, clean intelligible Kazakh — usable for naparnik V1 (no voice cloning yet). |
| TTS expressivity (V2) | Fine-tuned F5-TTS or MMS-TTS-kaz on a Kazakh kid speech corpus | F5 stock cross-lingual checkpoint tested 2026-05-02 — does NOT pronounce Kazakh (round-trip via Uali = gibberish). The cross-lingual paper code is not yet a public checkpoint. Needs our own KZ fine-tune. |
| TTS kid voice (V1 stopgap) | Pitch-shifted MMS-TTS-kaz (+4 to +8 semitones via librosa) | The same MMS adult voice transposed up. Sounds “helium-adult”, not “real kid”, but is intelligible Kazakh and ships today. |

Discarded vs ADR-0023 canon:

  • Coqui XTTS-v2 — Coqui shut down December 2025; XTTS-v2 also doesn’t list kk-KZ in its 16 supported languages. Drop from canon.
  • Qwen3-TTS (Jan 2026, Alibaba) — 10 major languages, Kazakh not included. Drop.
  • IndexTTS-2.5 (2026, Bilibili) — 4 languages (Zh/En/Ja/Es), Kazakh not included. Skip.
  • Higgs Audio v2.5 (Jan 2026, Boson AI) — claims 32 languages but Kazakh not documented; quality best for English/Chinese/Spanish per model card. Skip for Kazakh-primary product.
  • Chatterbox Multilingual (Resemble AI) — 22 languages explicitly listed, Kazakh not included. Skip.
  • F5-TTS V1 stock checkpoint cross-lingual to Kazakh — tested 2026-05-02, fails (see “F5-TTS round-trip” section below).
  • Whisper Large-v3 KZ subset as primary — without KSC2 fine-tune it mangles KZ proper nouns badly. Use only as language-agnostic baseline.
  • Soyle enterprise license — already rejected per Jean test 2026-05-02. Open soyle_onnx retained as fallback.

```shell
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav
```

Each STT model was called with `language="kk"`, `task="transcribe"`, `max_new_tokens=400`. Each TTS model was run cold (full model load → first inference → audio dump). Wall-clock was measured with Python’s `time.time()`.
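The cold-run protocol can be sketched as a tiny helper; the lambdas below are stand-ins for the real model loader and inference call, not part of the bench scripts:

```python
import time

def time_cold_run(load_fn, infer_fn):
    """Cold-run wall-clock: model load, then first inference, via time.time()."""
    t0 = time.time()
    model = load_fn()
    load_s = time.time() - t0
    t1 = time.time()
    output = infer_fn(model)
    infer_s = time.time() - t1
    return output, load_s, infer_s

# Stand-in callables for illustration; real runs load the HF checkpoints.
out, load_s, infer_s = time_cold_run(lambda: "model", lambda m: m + "-output")
```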

No ground-truth transcript exists: the original WAV is a children’s voiceover that was never transcribed at the source. We triangulate accuracy by:

  1. Whether the model recognised the OIYNUP-canonical character names (Домалақ, Трия, Патрик).
  2. Whether morphological suffixes are valid Kazakh («-пен», «-мен», «қайталайық»).
  3. Whether output hallucinated non-Kazakh words.
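That triangulation reduces to simple string checks. A minimal sketch (the name and hallucination lists come from this bench; the helper itself is illustrative, not a shipped tool):

```python
CANON_NAMES = ["домалақ", "трия", "патрик"]   # OIYNUP-canonical characters
KNOWN_HALLUCINATIONS = ["мадрид"]             # non-Kazakh words seen in outputs

def triangulate(transcript: str) -> dict:
    """Substring checks for canonical names and known hallucinated words."""
    t = transcript.lower()
    return {
        "names_found": [n for n in CANON_NAMES if n in t],
        "hallucinated": [w for w in KNOWN_HALLUCINATIONS if w in t],
    }

# Fragments of the actual model outputs below:
uali = "Домалақпенен ойнап, подрядпенен ойнап, қане балалар ойнап."
soyle = "дамалақ елен ойнапмадрид билен ойнап триаменен ойнап"
```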

| # | Model | Inference | Output |
| --- | --- | --- | --- |
| 1 | mlx-community/whisper-large-v3-turbo (kk) | 1.6 s | «Тау-мала қейінен ойнап, Патриктпенен ойнап, Трия менен ойнап, Білім алайы. Ойнап теген сөзді, Қайталал екпірге, Қанебалалар ойнап, ойнап, Ойнап!» |
| 2 | mlx-community/whisper-large-v3-mlx (full) (kk) | 1.6 s | «Тау балық әйлен ойнап, Батрикт білін ойнап, Три әміне нойнап, Өлім алайын. Ойнап деген сөзді, Қайталайық бірге, Қарни балалар, Ойнап, ойнап, Ойнап!» |
| 3 | issai/soyle_onnx (ONNX runtime) | 4.21 s | «дамалақ елен ойнапмадрид билен ойнап триаменен ойнап біліп алайық ойнап деген сөзді қайталайық бірге қане балалар ойнап ойнап ойнап» |
| 4 | Uali/whisper-turbo-ksc2-kazakh-finetuned | 10.87 s | «Домалақпенен ойнап, подрядпенен ойнап, бір емесең ойнап білім алайық, ойнап деген сөздің қайталайық бірге, қане балалар ойнап, ойнап, ойнап.» |
  • Character names: Uali ✅ (Домалақпенен) ▶ Soyle ⚠️ (дамалақ — close but lowercase, no inflection) ▶ Whisper-Turbo ❌ («Тау-мала», «Патрикт­пенен») ▶ Whisper-Large-v3 ❌ («Тау балық» = “mountain fish”)
  • Kazakh morphology: Uali ✅ ▶ Soyle ✅ ▶ Whisper-Large-v3 ⚠️ («Қайталайық бірге» correct, rest broken) ▶ Whisper-Turbo ❌
  • Hallucinations: the Whisper variants had none; Soyle hallucinated «Мадрид»; Uali wrote «подряд» (a Russian word present in the KSC2 fine-tune corpus, possibly correct).
  • Speed: Whisper-Turbo (mlx) >> Soyle (ONNX) >> Uali (transformers + CPU). On Mac Studio M3 Ultra MPS or RTX PRO 6000 CUDA, Uali Turbo will drop to ≈ 1–2 s for the same clip.

For OIYNUP we transcribe kids speaking Kazakh while naming OIYNUP characters and education terms. Domain match matters more than raw WER on generic news corpora. KSC2 fine-tune gives Uali the right vocabulary; ONNX Soyle is a safe fallback because it’s pre-packaged for runtime portability. Vanilla Whisper is for non-Kazakh streams (RU/EN cartoon dub QA, parent-side voice notes).


CPU-only, M3 Max, single thread:

| Sample | Text | Inference | Audio | RTF |
| --- | --- | --- | --- | --- |
| tts_mms_intro.wav | «Сәлеметсіңдер ме, балалар! Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | 0.72 s | 6.75 s | 0.106 |
| tts_mms_lesson.wav | «Бүгін біз сандарды үйренеміз. Бір, екі, үш, төрт, бес. Қайталайық!» | 0.65 s | 6.59 s | 0.098 |
| tts_mms_quest.wav | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | 0.61 s | 6.22 s | 0.098 |

Audio: 16 kHz mono float32 PCM. Output files in 2026-05-02-kz-stt-tts-bench/.
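RTF here is inference wall-clock divided by audio duration; for a 16 kHz buffer the duration is `len(pcm) / 16000`. A quick sanity check against the first table row (values agree to rounding):

```python
SAMPLE_RATE = 16_000

def audio_seconds(n_samples: int, sr: int = SAMPLE_RATE) -> float:
    """Duration of a PCM buffer in seconds."""
    return n_samples / sr

def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return inference_s / audio_s

# First row above: 0.72 s of inference for 6.75 s of audio.
r = rtf(0.72, 6.75)
```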

Quality observations (acoustic, by ear — Jean to confirm):

  • Intelligible neutral-female voice. Single speaker (no choice).
  • Pitch control: none (VITS is deterministic at this checkpoint).
  • Prosody: flat children’s-book reading, not playful — fine for V1 lesson narration but not enough for a Tamagotchi-style character.
  • Phonetic accuracy on borrowed Russian words («Спарк») reasonable; native Kazakh phonemes clean.

Question from Jean: can we change the naparnik voice to a kid voice? MMS-TTS-kaz is a single fixed adult-female voice — there’s no speaker parameter. So we tested two paths.

Three pitch levels generated for each of the 3 OIYNUP scripts:

| Variant | Pitch shift | Speed | Sounds like |
| --- | --- | --- | --- |
| `*_kid.wav` | +4 semitones | 1.05× | Older kid / teen, lightly cartoonish |
| `*_kid_high.wav` | +6 semitones | 1.05× | Younger kid, slightly chipmunk |
| `*_small_kid.wav` | +8 semitones | 1.05× | Small child / cartoon mascot |

These are cheap in real time (librosa pitch_shift, sub-second). Intelligibility of the underlying Kazakh is preserved at +4 / +6 and starts to degrade at +8 (formants too high). Acceptable as the V1 naparnik voice if Jean approves the timbre by ear.
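A semitone shift is a frequency ratio of 2^(n/12), which is why +8 st (≈1.59×) pushes formants past the natural child range while +4 st (≈1.26×) stays intelligible. The ratio math, with the actual librosa call shown as a comment since it needs a loaded audio buffer:

```python
def semitone_ratio(n_steps: float) -> float:
    """Frequency multiplier for an n-semitone pitch shift."""
    return 2.0 ** (n_steps / 12.0)

# Shifts used in this bench:
ratios = {n: round(semitone_ratio(n), 3) for n in (4, 6, 8)}

# Actual transform (y = float32 PCM, sr = 16000), sub-second on CPU:
#   y_kid = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
```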

Tested with the OIYNUP source itself as the kid Kazakh reference:

```
ref_audio  = OiynUp_Sound_KAZ.wav cut to 4–14 s (real Kazakh kid voice, 10 s)
ref_text   = «Патрияменен ойнап білім алайық, ойнап деген сөзді қайталайық бірге.»
gen_text   = each of the 3 OIYNUP scripts above
checkpoint = F5-TTS V1 (stock pip install)
device     = MPS (Apple Silicon)
inference  = 30–43 s per clip
```

Round-trip via Uali STT (intended → heard):

| Sample | Intended | What Uali heard from F5 output |
| --- | --- | --- |
| intro | «…Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | «Айдау керекеуің қал ми, айф бәрі сайлэнц бәктерлер, уақыт жеңіспеушісі ев ауф ән пайыз уайз ән фал бір алғай жүрмән бұл ай-майды үйіндеуің» |
| lesson | «…Бір, екі, үш, төрт, бес. Қайталайық!» | «Айма айма айма қағиға қалай қалай қалай қалай …» (mode-collapsed, single syllable repeated 50+ times) |
| quest | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | «А мәдет әншілер шығар, ағай, лучше айтпак, лучше.» |

Diagnosis: the stock F5-TTS V1 checkpoint is trained on English + Chinese only. It clones the kid timbre/prosody from the reference but produces gibberish phonemes when fed Cyrillic Kazakh — Whisper-Uali then tries to fit those phonemes into Kazakh words and mostly fails. The Sep 2025 «Cross-Lingual F5-TTS» paper proposes a framework that fixes this, but the public checkpoint to use that framework has not shipped yet. F5-TTS for Kazakh = not viable today without our own fine-tune.
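The by-ear verdict could be quantified as character error rate between intended and heard text. A self-contained sketch (not part of the bench scripts, which judged the round-trip by eye):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```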

Output kept for ear-test: tts_f5_kid_intro.wav, tts_f5_kid_lesson.wav, tts_f5_kid_quest.wav.

Required: 1–3 hours of clean transcribed Kazakh child speech (ages 5–10). Sources:

  • KSC2 (ISSAI Kazakh Speech Corpus v2) — has age metadata, can filter for <13. Confirm distribution before relying on it.
  • OIYNUP cartoon dub raw stems — kid VO sessions for the 10 episodes airing 11 May 2026 are an exact-domain in-house dataset. Coordinate with the cartoon team (Grinvich Technology).
  • Synthetic augmentation: pitch-shift adult Kazakh corpus down/up to expand kid range (caveat: doesn’t generate real child prosody).
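Assuming KSC2’s age metadata can be exported to something tabular (the column names below are hypothetical and must be checked against the actual corpus distribution), the age filter itself is a few lines:

```python
import csv, io

# Hypothetical KSC2-style metadata rows; real field names are unverified.
META = """utt_id,age,duration_s
ksc2_0001,7,3.2
ksc2_0002,34,4.1
ksc2_0003,11,2.8
"""

def kid_utterances(meta_csv: str, max_age: int = 12) -> list[str]:
    """Select utterance ids whose speaker age is at or below max_age."""
    rows = csv.DictReader(io.StringIO(meta_csv))
    return [r["utt_id"] for r in rows if int(r["age"]) <= max_age]
```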

Fine-tune cost estimate (Phase 1 hardware = 1× RTX PRO 6000 96 GB):

| Model | Hours of training data | Wall-clock to converge | Disk |
| --- | --- | --- | --- |
| MMS-TTS-kaz LoRA | 1–3 h | 4–8 h | 5 GB |
| F5-TTS full fine-tune | 5–10 h | 1–2 days | 30 GB |
| KazakhTTS2 ESPnet (Tacotron2 + FastSpeech2) | 5–10 h | 2–3 days | 20 GB |

MMS-TTS LoRA is the cheapest experiment — start there before committing to full F5 retrain.


| | MMS-TTS-kaz | F5-TTS V1 stock | KazakhTTS2 (ISSAI) |
| --- | --- | --- | --- |
| Native KZ training | ✅ | ❌ (en/zh only) | ✅ (271 h, 5 voices) |
| Voice cloning | ❌ (single fixed voice) | ✅ timbre, ❌ KZ phonemes (verified failing) | ❌ (5 fixed voices) |
| Runtime cost | CPU, RTF 0.1 | MPS, ≈30 s for 10 s output | GPU + ESPnet stack |
| Setup effort | 5 min (HF transformers) | 5 min pip, but Kazakh fails | 1–2 days (ESPnet recipe + tarball) |
| Right for V1 naparnik? | ✅ ship now | ❌ needs KZ fine-tune first | ⚠️ V3 candidate when we want >1 voice |

Implementation order (revised after kid-voice tests, 2026-05-02 round 2):

  1. V1 (Э4): ship MMS-TTS-kaz for narration + librosa pitch-shift +4 semitones for the naparnik kid voice. Single model, post-processing only. The public-facing naparnik has no default name (per CLAUDE.md canon — the child names it themselves; «Спарк» used in this bench is the internal Figma/code label).
  2. V2 (Э5): MMS-TTS-kaz LoRA fine-tune on Kazakh kid corpus (KSC2 age-filtered + OIYNUP cartoon kid VO stems from Grinvich Technology). Goal: real kid timbre instead of pitch-shifted adult. ~4–8 h training on Phase 1 hardware (1× RTX PRO 6000 96 GB).
  3. V3 (post-launch): if quality plateau, evaluate KazakhTTS2 ESPnet recipe (5 voices) and F5-TTS full Kazakh fine-tune for the «3 character × 3 voice» canon from ADR-0001.

All artifacts are in 2026-05-02-kz-stt-tts-bench/:

_reference_kaz_16k.wav — 16 kHz mono input clip
kaz_turbo.txt — Whisper Large-v3-Turbo
kaz_largev3.txt — Whisper Large-v3 full
kaz_soyle.txt — issai/soyle_onnx
kaz_uali.txt — Uali/whisper-turbo-ksc2-kazakh-finetuned
tts_mms_intro.wav — MMS-TTS-kaz: naparnik intro (adult voice)
tts_mms_lesson.wav — MMS-TTS-kaz: number lesson (adult voice)
tts_mms_quest.wav — MMS-TTS-kaz: quest praise (adult voice)
tts_mms_*_kid.wav — MMS-TTS pitch-shifted +4 st (older kid)
tts_mms_*_kid_high.wav — MMS-TTS pitch-shifted +6 st (younger kid)
tts_mms_*_small_kid.wav — MMS-TTS pitch-shifted +8 st (small child)
tts_f5_kid_*.wav — F5-TTS V1 stock cross-lingual attempts (FAILED on Kazakh phonemes — kept for ear-test)
kid_ref_kaz.wav — 10 s kid Kazakh reference cut from OiynUp_Sound_KAZ.wav (4–14 s)
kid_ref_kaz.txt — Uali transcript of the reference clip

Reproduce locally:

```shell
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch torchaudio "transformers>=4.46" scipy numpy soundfile librosa \
  mlx-whisper "optimum[onnxruntime]" onnxruntime
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav
# STT
mlx_whisper kaz_16k.wav --model mlx-community/whisper-large-v3-turbo --language kk
python -c "from transformers import WhisperProcessor, WhisperForConditionalGeneration; \
import soundfile as sf; a,sr = sf.read('kaz_16k.wav'); \
p=WhisperProcessor.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
m=WhisperForConditionalGeneration.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
i=p(a,sampling_rate=sr,return_tensors='pt'); \
ids=m.generate(input_features=i.input_features,language='kk',task='transcribe',max_new_tokens=400); \
print(p.batch_decode(ids,skip_special_tokens=True)[0])"
# TTS (torch.set_grad_enabled replaces the with-block: compound statements
# cannot follow a semicolon once the shell joins the continuation lines)
python -c "from transformers import VitsModel, AutoTokenizer; import torch, scipy.io.wavfile as w; \
t=AutoTokenizer.from_pretrained('facebook/mms-tts-kaz'); \
m=VitsModel.from_pretrained('facebook/mms-tts-kaz'); \
torch.set_grad_enabled(False); \
i=t('Сәлем балалар!', return_tensors='pt'); \
o=m(**i).waveform; \
w.write('out.wav', m.config.sampling_rate, o.squeeze().numpy())"
```

  1. Listen to the three TTS clips. Is MMS-TTS-kaz acoustic quality good enough for V1 naparnik narration, or do we go straight to F5-TTS cross-lingual cloning?
  2. Do we have a transcript of OiynUp_Sound_KAZ.wav from the production team? If yes, we can compute real WER instead of triangulating. If not, we ship Uali as primary and tag «Домалақ» recognition as the canonical sanity check.
  3. OK to update ADR-0023? Replace Coqui-XTTS-v2 + Qwen3-TTS references with MMS-TTS-kaz (V1) + F5-TTS cross-lingual (V2). Soyle stays as fallback per Jean’s earlier direction.