KZ STT/TTS stack
Date: 2026-05-02
Status: ground-truth benchmark for ADR-0023 KZ stack selection.
Hardware tested on: Apple M3 Max (16-inch MBP), MPS + CPU. Production target Mac Studio M3 Ultra 256 GB + RTX PRO 6000 — same model families, faster wall-clock.
Test clip: projects/oinap/ip/07_Звук/OiynUp_Sound_KAZ.wav (25.26 s, 24-bit 44.1 kHz stereo, downmixed to 16 kHz mono PCM for inference). Children’s voiceover script naming OIYNUP characters (Домалақ, Трия, …).
| Layer | Recommendation | Why |
|---|---|---|
| STT primary | Uali/whisper-turbo-ksc2-kazakh-finetuned (HF, open) | Only model that recognised the canonical OIYNUP character name **«Домалақпен»**. Clean Kazakh suffixes. Whisper-Turbo architecture, so deployable on RTX PRO 6000 or Mac Studio MPS without bespoke runtime. |
| STT fallback | issai/soyle_onnx (ISSAI, open) | Closest in suffix accuracy. Stable Kazakh inflection. ONNX → portable to web / mobile / edge. Hallucinated «Мадрид» on this one clip — keep as fallback, not primary. |
| STT generic | mlx-whisper large-v3-turbo (kk) | Fastest on Apple Silicon; baseline only. Mangles KZ proper nouns. Use for non-OIYNUP content (Russian, English) where it’s strong. |
| TTS primary (V1) | facebook/mms-tts-kaz (Meta, open VITS) | Native Kazakh, runs on CPU at RTF ≈ 0.10, 16 kHz output. Single-speaker, clean intelligible Kazakh — usable for naparnik V1 (no voice cloning yet). |
| TTS expressivity (V2) | Fine-tuned F5-TTS or MMS-TTS-kaz on a Kazakh kid speech corpus | F5 stock cross-lingual checkpoint tested 2026-05-02 — does NOT pronounce Kazakh (round-trip via Uali = gibberish). Cross-lingual paper code not yet a public checkpoint. Need our own KZ fine-tune. |
| TTS kid voice (V1 stopgap) | Pitch-shifted MMS-TTS-kaz (+4 to +8 semitones via librosa) | The same MMS adult voice transposed up. Sounds “helium-adult” not “real kid” but is intelligible Kazakh and ships today. |
Discarded vs ADR-0023 canon:
- Coqui XTTS-v2: Coqui shut down December 2025; XTTS-v2 also doesn’t list `kk-KZ` in its 16 supported languages. Drop from canon.
- Qwen3-TTS (Jan 2026, Alibaba): 10 major languages, Kazakh not included. Drop.
- IndexTTS-2.5 (2026, Bilibili): 4 languages (Zh/En/Ja/Es), Kazakh not included. Skip.
- Higgs Audio v2.5 (Jan 2026, Boson AI): claims 32 languages but Kazakh not documented; quality best for English/Chinese/Spanish per model card. Skip for Kazakh-primary product.
- Chatterbox Multilingual (Resemble AI): 22 languages explicitly listed, Kazakh not included. Skip.
- F5-TTS V1 stock checkpoint cross-lingual to Kazakh: tested 2026-05-02, fails (see the “Path B” section below).
- Whisper Large-v3 KZ subset as primary: without KSC2 fine-tune it mangles KZ proper nouns badly. Use only as language-agnostic baseline.
- Soyle enterprise license: already rejected per Jean test 2026-05-02. Open `soyle_onnx` retained as fallback.
```bash
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav
```

Each STT model was called with `language="kk"`, `task="transcribe"`, `max_new_tokens=400`. Each TTS model was run cold (full model load → first inference → audio dump). Wall-clock was measured with Python `time.time()`.
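The cold-run timing described above could be captured with a small helper like the one below. This is a minimal sketch, not the harness used for the benchmark: the `wall_clock` name, the `timings` dictionary, and the stand-in workloads are all hypothetical, and only the use of `time.time()` comes from the text.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store elapsed wall-clock seconds for the enclosed block under `label`."""
    start = time.time()
    try:
        yield
    finally:
        results[label] = time.time() - start


timings = {}

# Stand-ins for the real steps (hypothetical): replace with the actual
# model load and the first inference call of the model under test.
with wall_clock("load", timings):
    model = sum(range(1000))
with wall_clock("first_inference", timings):
    output = model * 2

# A cold run is the load time plus the first inference time.
cold_total = timings["load"] + timings["first_inference"]
print(f"cold run: {cold_total:.3f} s")
```

Keeping load and inference as separate labels makes it possible to report either the cold total or the inference-only figure from the same run.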
No ground-truth transcript exists: the original WAV is a children’s voiceover that was never transcribed in source. We triangulate accuracy by:
- Whether the model recognised the OIYNUP-canonical character names (Домалақ, Трия, Патрик).
- Whether morphological suffixes are valid Kazakh («-пен», «-мен», «қайталайық»).
- Whether output hallucinated non-Kazakh words.
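The three checks above can be partly automated. The sketch below is an assumption-laden illustration rather than the procedure used here: `triage`, `NAME_STEMS`, and `INSTRUMENTAL` are hypothetical names, the suffix list covers only instrumental endings, and the hallucination check is a crude proxy that flags Latin-script tokens only (it would not catch a Cyrillic intruder such as «подряд» or «мадрид»).

```python
import re

# Name stems taken from the benchmark text; matching by stem lets inflected
# forms such as «Домалақпенен» count as a hit.
NAME_STEMS = ("домалақ", "трия", "патрик")
# Instrumental-case endings (illustrative subset, an assumption).
INSTRUMENTAL = ("пенен", "менен", "бенен", "пен", "мен", "бен")
LATIN_TOKEN = re.compile(r"[A-Za-z]+")


def triage(transcript: str) -> dict:
    """Return name hits, inflected name forms, and Latin-script tokens."""
    words = re.findall(r"\w+", transcript.lower())
    names_found = {
        stem: any(w.startswith(stem) for w in words) for stem in NAME_STEMS
    }
    inflected = [
        w
        for w in words
        for stem in NAME_STEMS
        if w.startswith(stem) and w[len(stem):] in INSTRUMENTAL
    ]
    return {
        "names_found": names_found,
        "inflected_names": inflected,
        "latin_tokens": LATIN_TOKEN.findall(transcript),
    }


report = triage("Домалақпенен ойнап, подрядпенен ойнап")
print(report["names_found"], report["inflected_names"])
```

A by-ear review is still needed for morphology in general; the script only narrows down which transcripts deserve a closer look.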
STT results (same 25.26 s clip)
| # | Model | Inference | Output |
|---|---|---|---|
| 1 | mlx-community/whisper-large-v3-turbo (kk) | 1.6 s | «Тау-мала қейінен ойнап, Патриктпенен ойнап, Трия менен ойнап, Білім алайы. Ойнап теген сөзді, Қайталал екпірге, Қанебалалар ойнап, ойнап, Ойнап!» |
| 2 | mlx-community/whisper-large-v3-mlx (full) (kk) | 1.6 s | «Тау балық әйлен ойнап, Батрикт білін ойнап, Три әміне нойнап, Өлім алайын. Ойнап деген сөзді, Қайталайық бірге, Қарни балалар, Ойнап, ойнап, Ойнап!» |
| 3 | issai/soyle_onnx (ONNX runtime) | 4.21 s | «дамалақ елен ойнапмадрид билен ойнап триаменен ойнап біліп алайық ойнап деген сөзді қайталайық бірге қане балалар ойнап ойнап ойнап» |
| 4 | Uali/whisper-turbo-ksc2-kazakh-finetuned | 10.87 s | «Домалақпенен ойнап, подрядпенен ойнап, бір емесең ойнап білім алайық, ойнап деген сөздің қайталайық бірге, қане балалар ойнап, ойнап, ойнап.» |
Verdict per axis
- Character names: Uali ✅ (Домалақпенен) ▶ Soyle ⚠️ (дамалақ: close but lowercase, no inflection) ▶ Whisper-Turbo ❌ («Тау-мала», «Патриктпенен») ▶ Whisper-Large-v3 ❌ («Тау балық» = “mountain fish”)
- Kazakh morphology: Uali ✅ ▶ Soyle ✅ ▶ Whisper-Large-v3 ⚠️ («Қайталайық бірге» correct, rest broken) ▶ Whisper-Turbo ❌
- Hallucinations: Whisper variants none, Soyle hallucinated «Мадрид», Uali wrote «подряд» (Russian word — present in KSC2 fine-tune corpus, possibly correct).
- Speed: Whisper-Turbo (mlx) >> Soyle (ONNX) >> Uali (transformers + CPU). On Mac Studio M3 Ultra MPS or RTX PRO 6000 CUDA, Uali should drop to roughly 1–2 s for the same clip (estimate, not yet measured).
Recommendation rationale
For OIYNUP we transcribe kids speaking Kazakh while naming OIYNUP characters and education terms. Domain match matters more than raw WER on generic news corpora. KSC2 fine-tune gives Uali the right vocabulary; ONNX Soyle is a safe fallback because it’s pre-packaged for runtime portability. Vanilla Whisper is for non-Kazakh streams (RU/EN cartoon dub QA, parent-side voice notes).
TTS results — facebook/mms-tts-kaz
CPU-only, M3 Max, single thread:
| Sample | Text | Inference | Audio | RTF |
|---|---|---|---|---|
| `tts_mms_intro.wav` | «Сәлеметсіңдер ме, балалар! Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | 0.72 s | 6.75 s | 0.106 |
| `tts_mms_lesson.wav` | «Бүгін біз сандарды үйренеміз. Бір, екі, үш, төрт, бес. Қайталайық!» | 0.65 s | 6.59 s | 0.098 |
| `tts_mms_quest.wav` | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | 0.61 s | 6.22 s | 0.098 |
Audio: 16 kHz mono float32 PCM. Output files in 2026-05-02-kz-stt-tts-bench/.
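For reference, RTF here is inference time divided by the duration of the generated audio, so values below 1 mean faster than real time. The sketch below recomputes it from the rounded figures in the table; the function name is an assumption, and because the inputs are already rounded the third decimal may differ from the table by about 0.001.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """RTF = seconds spent synthesising / seconds of audio produced."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return inference_s / audio_s


# (inference seconds, audio seconds) as reported in the table above.
samples = {
    "tts_mms_intro.wav": (0.72, 6.75),
    "tts_mms_lesson.wav": (0.65, 6.59),
    "tts_mms_quest.wav": (0.61, 6.22),
}

for name, (inference_s, audio_s) in samples.items():
    print(f"{name}: RTF = {real_time_factor(inference_s, audio_s):.3f}")
```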
Quality observations (acoustic, by ear — Jean to confirm):
- Intelligible neutral-female voice. Single speaker (no choice).
- Pitch control: none (the checkpoint exposes no pitch or speaker parameter).
- Prosody: flat children’s-book reading, not playful. Fine for V1 lesson narration but not enough for a Tamagotchi-style character.
- Phonetic accuracy on borrowed Russian words («Спарк») reasonable; native Kazakh phonemes clean.
Kid voice tests (added 2026-05-02 round 2)
Question from Jean: can we change the naparnik voice to a kid voice? MMS-TTS-kaz has a single fixed adult-female voice with no speaker parameter, so we tested two paths.
Path A — pitch-shift MMS-TTS via librosa (works, ships today)
Three pitch levels generated for each of the 3 OIYNUP scripts:
| Variant | Pitch shift | Speed | Sounds like |
|---|---|---|---|
| `*_kid.wav` | +4 semitones | 1.05× | Older kid / teen, lightly cartoonish |
| `*_kid_high.wav` | +6 semitones | 1.05× | Younger kid, slightly chipmunk |
| `*_small_kid.wav` | +8 semitones | 1.05× | Small child / cartoon mascot |
These are cheap to generate (librosa `pitch_shift` runs in under a second per clip). Intelligibility of the underlying Kazakh is preserved at +4 / +6 and starts to degrade at +8 (formants too high). Acceptable as the V1 naparnik voice if Jean approves the timbre by ear.
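A pitch-shift pass of this kind might look like the sketch below. It is an illustration under assumptions: the function names and file paths are hypothetical, and the order of operations (pitch shift first, then the 1.05× speed-up) is a guess, since the bench notes do not state it. `semitone_ratio` only shows how far each setting moves the fundamental frequency.

```python
def semitone_ratio(n_steps: float) -> float:
    """Frequency multiplier for a shift of n_steps equal-tempered semitones."""
    return 2.0 ** (n_steps / 12.0)


def make_kid_voice(in_path: str, out_path: str,
                   n_steps: float = 4.0, rate: float = 1.05) -> None:
    """Pitch-shift and slightly speed up a WAV file (hypothetical helper)."""
    # Imported lazily so semitone_ratio works without audio libraries.
    import librosa
    import soundfile as sf

    y, sr = librosa.load(in_path, sr=None)  # sr=None keeps the native rate
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    y = librosa.effects.time_stretch(y, rate=rate)  # rate > 1 speeds up
    sf.write(out_path, y, sr)


for steps in (4, 6, 8):
    print(f"+{steps} semitones -> x{semitone_ratio(steps):.3f} frequency")

# Example call (paths are placeholders):
# make_kid_voice("tts_mms_intro.wav", "tts_mms_intro_kid.wav", n_steps=4)
```

Because `pitch_shift` moves formants along with the fundamental, larger shifts push the voice toward the "chipmunk" quality noted in the table.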
Path B — F5-TTS cross-lingual cloning (FAILS for Kazakh)
Tested with the OIYNUP source itself as the kid Kazakh reference:

```
ref_audio  = OiynUp_Sound_KAZ.wav cut to 4–14 s (real Kazakh kid voice, 10 s)
ref_text   = «Патрияменен ойнап білім алайық, ойнап деген сөзді қайталайық бірге.»
gen_text   = each of the 3 OIYNUP scripts above
checkpoint = F5-TTS V1 (stock pip install)
device     = MPS (Apple Silicon)
inference  = 30–43 s per clip
```

Round-trip via Uali STT (intended → heard):
| Sample | Intended | What Uali heard from F5 output |
|---|---|---|
| intro | «…Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | «Айдау керекеуің қал ми, айф бәрі сайлэнц бәктерлер, уақыт жеңіспеушісі ев ауф ән пайыз уайз ән фал бір алғай жүрмән бұл ай-майды үйіндеуің» |
| lesson | «…Бір, екі, үш, төрт, бес. Қайталайық!» | «Айма айма айма қағиға қалай қалай қалай қалай …» (mode-collapsed, single syllable repeated 50+ times) |
| quest | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | «А мәдет әншілер шығар, ағай, лучше айтпак, лучше.» |
Diagnosis: the stock F5-TTS V1 checkpoint is trained on English + Chinese only. It clones the kid timbre/prosody from the reference but produces gibberish phonemes when fed Cyrillic Kazakh — Whisper-Uali then tries to fit those phonemes into Kazakh words and mostly fails. The Sep 2025 «Cross-Lingual F5-TTS» paper proposes a framework that fixes this, but the public checkpoint to use that framework has not shipped yet. F5-TTS for Kazakh = not viable today without our own fine-tune.
Output kept for ear-test: tts_f5_kid_intro.wav, tts_f5_kid_lesson.wav, tts_f5_kid_quest.wav.
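The mode collapse seen in the lesson sample (one token repeated many times) is easy to flag automatically in a round-trip transcript. The sketch below is one possible heuristic, not part of the benchmark: `repetition_stats` is a hypothetical name, and any threshold applied to its output would need tuning by ear.

```python
import re
from collections import Counter


def repetition_stats(transcript: str) -> dict:
    """Longest run of one repeated word and the share of the most common word."""
    words = re.findall(r"\w+", transcript.lower())
    if not words:
        return {"longest_run": 0, "top_share": 0.0}
    longest = run = 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    top_count = Counter(words).most_common(1)[0][1]
    return {"longest_run": longest, "top_share": top_count / len(words)}


# Visible prefix of the collapsed lesson transcript from the table above.
stats = repetition_stats("Айма айма айма қағиға қалай қалай қалай қалай")
print(stats)
```

A long run or a high top-word share suggests the synthesis collapsed, which is cheaper to detect than scoring the full transcript against the intended text.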
Path C — fine-tune F5 or MMS on Kazakh kid corpus (V2 path)
Required: 1–3 hours of clean transcribed Kazakh child speech (ages 5–10). Sources:
- KSC2 (ISSAI Kazakh Speech Corpus v2): has age metadata, can filter for `<13`. Confirm distribution before relying on it.
- OIYNUP cartoon dub raw stems: kid VO sessions for the 10 episodes airing 11 May 2026 are an exact-domain in-house dataset. Coordinate with the cartoon team (Grinvich Technology).
- Synthetic augmentation: pitch-shift adult Kazakh corpus down/up to expand kid range (caveat: doesn’t generate real child prosody).
Fine-tune cost estimate (Phase 1 hardware = 1× RTX PRO 6000 96 GB):
| Model | Hours of training data | Wall-clock to converge | Disk |
|---|---|---|---|
| MMS-TTS-kaz LoRA | 1–3 h | 4–8 h | 5 GB |
| F5-TTS full fine-tune | 5–10 h | 1–2 days | 30 GB |
| KazakhTTS2 ESPnet (Tacotron2 + FastSpeech2) | 5–10 h | 2–3 days | 20 GB |
MMS-TTS LoRA is the cheapest experiment — start there before committing to full F5 retrain.
Why MMS-TTS first, not F5/KazakhTTS2
| | MMS-TTS-kaz | F5-TTS V1 stock | KazakhTTS2 (ISSAI) |
|---|---|---|---|
| Native KZ training | ✅ | ❌ (en/zh only) | ✅ (271 h, 5 voices) |
| Voice cloning | ❌ | ✅ timbre, ❌ KZ phonemes (verified failing) | ❌ (5 fixed voices) |
| Runtime cost | CPU, RTF 0.1 | MPS, ≈30 s for 10 s output | GPU + ESPnet stack |
| Setup effort | 5 min (HF transformers) | 5 min pip but Kazakh fails | 1–2 days (ESPnet recipe + tarball) |
| Right for V1 naparnik? | ✅ ship now | ❌ needs KZ fine-tune first | ⚠️ V3 candidate when we want >1 voice |
Implementation order (revised after kid-voice tests, 2026-05-02 round 2):
- V1 (Э4): ship MMS-TTS-kaz for narration + librosa pitch-shift +4 semitones for the naparnik kid voice. Single model, post-processing only. The public-facing naparnik has no default name (per CLAUDE.md canon the child names it themselves; «Спарк» used in this bench is the internal Figma/code label).
- V2 (Э5): MMS-TTS-kaz LoRA fine-tune on Kazakh kid corpus (KSC2 age-filtered + OIYNUP cartoon kid VO stems from Grinvich Technology). Goal: real kid timbre instead of pitch-shifted adult. ~4–8 h training on Phase 1 hardware (1× RTX PRO 6000 96 GB).
- V3 (post-launch): if quality plateaus, evaluate the KazakhTTS2 ESPnet recipe (5 voices) and an F5-TTS full Kazakh fine-tune for the «3 character × 3 voice» canon from ADR-0001.
Reproducibility
All artifacts are in `2026-05-02-kz-stt-tts-bench/`:
- `_reference_kaz_16k.wav`: 16 kHz mono input clip
- `kaz_turbo.txt`: Whisper Large-v3-Turbo
- `kaz_largev3.txt`: Whisper Large-v3 full
- `kaz_soyle.txt`: issai/soyle_onnx
- `kaz_uali.txt`: Uali/whisper-turbo-ksc2-kazakh-finetuned
- `tts_mms_intro.wav`: MMS-TTS-kaz, naparnik intro (adult voice)
- `tts_mms_lesson.wav`: MMS-TTS-kaz, number lesson (adult voice)
- `tts_mms_quest.wav`: MMS-TTS-kaz, quest praise (adult voice)
- `tts_mms_*_kid.wav`: MMS-TTS pitch-shifted +4 st (older kid)
- `tts_mms_*_kid_high.wav`: MMS-TTS pitch-shifted +6 st (younger kid)
- `tts_mms_*_small_kid.wav`: MMS-TTS pitch-shifted +8 st (small child)
- `tts_f5_kid_*.wav`: F5-TTS V1 stock cross-lingual attempts (FAILED on Kazakh phonemes, kept for ear-test)
- `kid_ref_kaz.wav`: 10 s kid Kazakh reference cut from OiynUp_Sound_KAZ.wav (4–14 s)
- `kid_ref_kaz.txt`: Uali transcript of the reference clip

Reproduce locally:
```bash
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch torchaudio "transformers>=4.46" scipy numpy soundfile librosa \
  mlx-whisper "optimum[onnxruntime]" onnxruntime
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav

# STT
mlx_whisper kaz_16k.wav --model mlx-community/whisper-large-v3-turbo --language kk
python -c "from transformers import WhisperProcessor, WhisperForConditionalGeneration; \
  import soundfile as sf; a,sr = sf.read('kaz_16k.wav'); \
  p=WhisperProcessor.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
  m=WhisperForConditionalGeneration.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
  i=p(a,sampling_rate=sr,return_tensors='pt'); \
  ids=m.generate(input_features=i.input_features,language='kk',task='transcribe',max_new_tokens=400); \
  print(p.batch_decode(ids,skip_special_tokens=True)[0])"

# TTS (heredoc, because a `with` block cannot follow `;` in a python -c one-liner)
python - <<'PY'
import torch, scipy.io.wavfile as w
from transformers import VitsModel, AutoTokenizer
t = AutoTokenizer.from_pretrained('facebook/mms-tts-kaz')
m = VitsModel.from_pretrained('facebook/mms-tts-kaz')
i = t('Сәлем балалар!', return_tensors='pt')
with torch.no_grad():
    o = m(**i).waveform
w.write('out.wav', m.config.sampling_rate, o.squeeze().numpy())
PY
```

Open questions for Jean
- Listen to the three TTS clips. Is MMS-TTS-kaz acoustic quality good enough for V1 naparnik narration, or do we go straight to F5-TTS cross-lingual cloning?
- Do we have a transcript of `OiynUp_Sound_KAZ.wav` from the production team? If yes, we can compute real WER instead of triangulating. If not, we ship Uali as primary and tag «Домалақ» recognition as the canonical sanity check.
- OK to update ADR-0023? Replace Coqui-XTTS-v2 + Qwen3-TTS references with MMS-TTS-kaz (V1) + F5-TTS cross-lingual (V2). Soyle stays as fallback per Jean’s earlier direction.
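If a production transcript does turn up, the WER mentioned above could be computed along these lines. This is a minimal sketch assuming simple normalisation (lowercasing, punctuation stripped); the function names are hypothetical, and the example strings are short placeholders, not the real reference text.

```python
import re


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = re.findall(r"\w+", reference.lower())
    hyp = re.findall(r"\w+", hypothesis.lower())
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


# Placeholder strings for illustration only.
print(f"WER = {wer('ойнап білім алайық', 'ойнап біліп алайық'):.3f}")
```

Running the same function over all four STT outputs against one reference would replace the by-axis triangulation with a single comparable number per model.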
Source links
- ISSAI Soyle: https://github.com/IS2AI/Soyle, https://huggingface.co/issai/soyle_onnx
- Uali fine-tune: https://huggingface.co/Uali/whisper-turbo-ksc2-kazakh-finetuned
- Meta MMS-TTS-kaz: https://huggingface.co/facebook/mms-tts-kaz
- KazakhTTS2 (ISSAI): https://github.com/IS2AI/Kazakh_TTS, https://huggingface.co/datasets/issai/KazakhTTS
- F5-TTS: https://github.com/SWivid/F5-TTS; cross-lingual paper: https://arxiv.org/abs/2509.14579
- Whisper Large-v3-Turbo: https://huggingface.co/openai/whisper-large-v3-turbo
- Comprehensive KZ STT/TTS survey: https://www.mdpi.com/2078-2489/16/10/879