
KZ STT/TTS stack

Date: 2026-05-02 Status: ground-truth benchmark for ADR-0023 KZ stack selection. Hardware tested on: Apple M3 Max (16-inch MBP), MPS + CPU. Production target Mac Studio M3 Ultra 256 GB + RTX PRO 6000 — same model families, faster wall-clock. Test clip: projects/oinap/ip/07_Звук/OiynUp_Sound_KAZ.wav (25.26 s, 24-bit 44.1 kHz stereo, downmixed to 16 kHz mono PCM for inference). Children’s voiceover script naming OIYNUP characters (Домалақ, Трия, …).


| Layer | Recommendation | Why |
| --- | --- | --- |
| STT primary | Uali/whisper-turbo-ksc2-kazakh-finetuned (HF, open) | Only model that recognised the canonical OIYNUP character name **«Домалақпен»**. Clean Kazakh suffixes. Whisper-Turbo architecture, so deployable on RTX PRO 6000 or Mac Studio MPS without a bespoke runtime. |
| STT fallback | issai/soyle_onnx (ISSAI, open) | Closest in suffix accuracy. Stable Kazakh inflection. ONNX → portable to web / mobile / edge. Hallucinated «Мадрид» on this one clip — keep as fallback, not primary. |
| STT generic | mlx-whisper large-v3-turbo (kk) | Fastest on Apple Silicon; baseline only. Mangles KZ proper nouns. Use for non-OIYNUP content (Russian, English) where it’s strong. |
| TTS primary (V1) | facebook/mms-tts-kaz (Meta, open VITS) | Native Kazakh, runs on CPU at RTF ≈ 0.10, 16 kHz output. Single-speaker, clean intelligible Kazakh — usable for naparnik V1 (no voice cloning yet). |
| TTS expressivity (V2) | Fine-tuned F5-TTS or MMS-TTS-kaz on a Kazakh kid speech corpus | F5 stock cross-lingual checkpoint tested 2026-05-02 — does NOT pronounce Kazakh (round-trip via Uali = gibberish). The cross-lingual paper code is not yet a public checkpoint. Needs our own KZ fine-tune. |
| TTS kid voice (V1 stopgap) | Pitch-shifted MMS-TTS-kaz (+4 to +8 semitones via librosa) | The same MMS adult voice transposed up. Sounds “helium-adult”, not “real kid”, but is intelligible Kazakh and ships today. |

Discarded vs ADR-0023 canon:

  • Coqui XTTS-v2 — Coqui shut down December 2025; XTTS-v2 also doesn’t list kk-KZ in its 16 supported languages. Drop from canon.
  • Qwen3-TTS (Jan 2026, Alibaba) — 10 major languages, Kazakh not included. Drop.
  • IndexTTS-2.5 (2026, Bilibili) — 4 languages (Zh/En/Ja/Es), Kazakh not included. Skip.
  • Higgs Audio v2.5 (Jan 2026, Boson AI) — claims 32 languages but Kazakh not documented; quality best for English/Chinese/Spanish per model card. Skip for Kazakh-primary product.
  • Chatterbox Multilingual (Resemble AI) — 22 languages explicitly listed, Kazakh not included. Skip.
  • F5-TTS V1 stock checkpoint cross-lingual to Kazakh — tested 2026-05-02, fails (see “F5-TTS round-trip” section below).
  • Whisper Large-v3 KZ subset as primary — without KSC2 fine-tune it mangles KZ proper nouns badly. Use only as language-agnostic baseline.
  • Soyle enterprise license — already rejected per Jean test 2026-05-02. Open soyle_onnx retained as fallback.

```shell
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav
```

Each STT model was called with `language="kk"`, `task="transcribe"`, `max_new_tokens=400`. Each TTS model was run cold (full model load → first inference → audio dump). Wall-clock was measured with Python’s `time.time()`.
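The cold-run protocol can be sketched as a tiny helper; the lambdas below are stand-ins for the real model loader and inference call, not part of the bench scripts:

```python
import time

def time_cold_run(load_fn, infer_fn):
    """Cold-run wall-clock: model load, then first inference, via time.time()."""
    t0 = time.time()
    model = load_fn()
    load_s = time.time() - t0
    t1 = time.time()
    output = infer_fn(model)
    infer_s = time.time() - t1
    return output, load_s, infer_s

# Stand-in callables for illustration; real runs load the HF checkpoints.
out, load_s, infer_s = time_cold_run(lambda: "model", lambda m: m + "-output")
```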

No ground-truth transcript exists: the original WAV is a children’s voiceover that was never transcribed at the source. We triangulate accuracy by:

  1. Whether the model recognised the OIYNUP-canonical character names (Домалақ, Трия, Патрик).
  2. Whether morphological suffixes are valid Kazakh («-пен», «-мен», «қайталайық»).
  3. Whether output hallucinated non-Kazakh words.
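That triangulation reduces to simple string checks. A minimal sketch (the name and hallucination lists come from this bench; the helper itself is illustrative, not a shipped tool):

```python
CANON_NAMES = ["домалақ", "трия", "патрик"]   # OIYNUP-canonical characters
KNOWN_HALLUCINATIONS = ["мадрид"]             # non-Kazakh words seen in outputs

def triangulate(transcript: str) -> dict:
    """Substring checks for canonical names and known hallucinated words."""
    t = transcript.lower()
    return {
        "names_found": [n for n in CANON_NAMES if n in t],
        "hallucinated": [w for w in KNOWN_HALLUCINATIONS if w in t],
    }

# Fragments of the actual model outputs below:
uali = "Домалақпенен ойнап, подрядпенен ойнап, қане балалар ойнап."
soyle = "дамалақ елен ойнапмадрид билен ойнап триаменен ойнап"
```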

| # | Model | Inference | Output |
| --- | --- | --- | --- |
| 1 | mlx-community/whisper-large-v3-turbo (kk) | 1.6 s | «Тау-мала қейінен ойнап, Патриктпенен ойнап, Трия менен ойнап, Білім алайы. Ойнап теген сөзді, Қайталал екпірге, Қанебалалар ойнап, ойнап, Ойнап!» |
| 2 | mlx-community/whisper-large-v3-mlx (full) (kk) | 1.6 s | «Тау балық әйлен ойнап, Батрикт білін ойнап, Три әміне нойнап, Өлім алайын. Ойнап деген сөзді, Қайталайық бірге, Қарни балалар, Ойнап, ойнап, Ойнап!» |
| 3 | issai/soyle_onnx (ONNX runtime) | 4.21 s | «дамалақ елен ойнапмадрид билен ойнап триаменен ойнап біліп алайық ойнап деген сөзді қайталайық бірге қане балалар ойнап ойнап ойнап» |
| 4 | Uali/whisper-turbo-ksc2-kazakh-finetuned | 10.87 s | «Домалақпенен ойнап, подрядпенен ойнап, бір емесең ойнап білім алайық, ойнап деген сөздің қайталайық бірге, қане балалар ойнап, ойнап, ойнап.» |
  • Character names: Uali ✅ (Домалақпенен) ▶ Soyle ⚠️ (дамалақ — close but lowercase, no inflection) ▶ Whisper-Turbo ❌ («Тау-мала», «Патрикт­пенен») ▶ Whisper-Large-v3 ❌ («Тау балық» = “mountain fish”)
  • Kazakh morphology: Uali ✅ ▶ Soyle ✅ ▶ Whisper-Large-v3 ⚠️ («Қайталайық бірге» correct, rest broken) ▶ Whisper-Turbo ❌
  • Hallucinations: the Whisper variants had none; Soyle hallucinated «Мадрид»; Uali wrote «подряд» (a Russian word present in the KSC2 fine-tune corpus, possibly correct).
  • Speed: Whisper-Turbo (mlx) >> Soyle (ONNX) >> Uali (transformers + CPU). On Mac Studio M3 Ultra MPS or RTX PRO 6000 CUDA, Uali Turbo will drop to ≈ 1–2 s for the same clip.

For OIYNUP we transcribe kids speaking Kazakh while naming OIYNUP characters and education terms. Domain match matters more than raw WER on generic news corpora. KSC2 fine-tune gives Uali the right vocabulary; ONNX Soyle is a safe fallback because it’s pre-packaged for runtime portability. Vanilla Whisper is for non-Kazakh streams (RU/EN cartoon dub QA, parent-side voice notes).


CPU-only, M3 Max, single thread:

| Sample | Text | Inference | Audio | RTF |
| --- | --- | --- | --- | --- |
| tts_mms_intro.wav | «Сәлеметсіңдер ме, балалар! Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | 0.72 s | 6.75 s | 0.106 |
| tts_mms_lesson.wav | «Бүгін біз сандарды үйренеміз. Бір, екі, үш, төрт, бес. Қайталайық!» | 0.65 s | 6.59 s | 0.098 |
| tts_mms_quest.wav | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | 0.61 s | 6.22 s | 0.098 |

Audio: 16 kHz mono float32 PCM. Output files in 2026-05-02-kz-stt-tts-bench/.
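RTF here is inference wall-clock divided by audio duration; for a 16 kHz buffer the duration is `len(pcm) / 16000`. A quick sanity check against the first table row (values agree to rounding):

```python
SAMPLE_RATE = 16_000

def audio_seconds(n_samples: int, sr: int = SAMPLE_RATE) -> float:
    """Duration of a PCM buffer in seconds."""
    return n_samples / sr

def rtf(inference_s: float, audio_s: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    return inference_s / audio_s

# First row above: 0.72 s of inference for 6.75 s of audio.
r = rtf(0.72, 6.75)
```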

Quality observations (acoustic, by ear — Jean to confirm):

  • Intelligible neutral-female voice. Single speaker (no choice).
  • Pitch control: none (VITS is deterministic at this checkpoint).
  • Prosody: flat children’s-book reading, not playful — fine for V1 lesson narration but not enough for a Tamagotchi-style character.
  • Phonetic accuracy on borrowed Russian words («Спарк») reasonable; native Kazakh phonemes clean.

Question from Jean: can we change the naparnik voice to a kid voice? MMS-TTS-kaz is a single fixed adult-female voice — there’s no speaker parameter. So we tested two paths.

Three pitch levels generated for each of the 3 OIYNUP scripts:

| Variant | Pitch shift | Speed | Sounds like |
| --- | --- | --- | --- |
| `*_kid.wav` | +4 semitones | 1.05× | Older kid / teen, lightly cartoonish |
| `*_kid_high.wav` | +6 semitones | 1.05× | Younger kid, slightly chipmunk |
| `*_small_kid.wav` | +8 semitones | 1.05× | Small child / cartoon mascot |

These are cheap in real time (librosa pitch_shift, sub-second). Intelligibility of the underlying Kazakh is preserved at +4 / +6 and starts to degrade at +8 (formants too high). Acceptable as the V1 naparnik voice if Jean approves the timbre by ear.
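A semitone shift is a frequency ratio of 2^(n/12), which is why +8 st (≈1.59×) pushes formants past the natural child range while +4 st (≈1.26×) stays intelligible. The ratio math, with the actual librosa call shown as a comment since it needs a loaded audio buffer:

```python
def semitone_ratio(n_steps: float) -> float:
    """Frequency multiplier for an n-semitone pitch shift."""
    return 2.0 ** (n_steps / 12.0)

# Shifts used in this bench:
ratios = {n: round(semitone_ratio(n), 3) for n in (4, 6, 8)}

# Actual transform (y = float32 PCM, sr = 16000), sub-second on CPU:
#   y_kid = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
```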

Tested with the OIYNUP source itself as the kid Kazakh reference:

```
ref_audio  = OiynUp_Sound_KAZ.wav cut to 4–14 s (real Kazakh kid voice, 10 s)
ref_text   = «Патрияменен ойнап білім алайық, ойнап деген сөзді қайталайық бірге.»
gen_text   = each of the 3 OIYNUP scripts above
checkpoint = F5-TTS V1 (stock pip install)
device     = MPS (Apple Silicon)
inference  = 30–43 s per clip
```

Round-trip via Uali STT (intended → heard):

| Sample | Intended | What Uali heard from F5 output |
| --- | --- | --- |
| intro | «…Менің атым Спарк. Ойнап үйренеміз, бірге дамимыз!» | «Айдау керекеуің қал ми, айф бәрі сайлэнц бәктерлер, уақыт жеңіспеушісі ев ауф ән пайыз уайз ән фал бір алғай жүрмән бұл ай-майды үйіндеуің» |
| lesson | «…Бір, екі, үш, төрт, бес. Қайталайық!» | «Айма айма айма қағиға қалай қалай қалай қалай …» (mode-collapsed, single syllable repeated 50+ times) |
| quest | «Жарайсың, өте жақсы жауап! Енді келесі тапсырмаға өтейік.» | «А мәдет әншілер шығар, ағай, лучше айтпак, лучше.» |

Diagnosis: the stock F5-TTS V1 checkpoint is trained on English + Chinese only. It clones the kid timbre/prosody from the reference but produces gibberish phonemes when fed Cyrillic Kazakh — Whisper-Uali then tries to fit those phonemes into Kazakh words and mostly fails. The Sep 2025 «Cross-Lingual F5-TTS» paper proposes a framework that fixes this, but the public checkpoint to use that framework has not shipped yet. F5-TTS for Kazakh = not viable today without our own fine-tune.
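The by-ear verdict could be quantified as character error rate between intended and heard text. A self-contained sketch (not part of the bench scripts, which judged the round-trip by eye):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```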

Output kept for ear-test: tts_f5_kid_intro.wav, tts_f5_kid_lesson.wav, tts_f5_kid_quest.wav.

Required: 1–3 hours of clean transcribed Kazakh child speech (ages 5–10). Sources:

  • KSC2 (ISSAI Kazakh Speech Corpus v2) — has age metadata, can filter for <13. Confirm distribution before relying on it.
  • OIYNUP cartoon dub raw stems — kid VO sessions for the 10 episodes airing 11 May 2026 are an exact-domain in-house dataset. Coordinate with the cartoon team (Grinvich Technology).
  • Synthetic augmentation: pitch-shift adult Kazakh corpus down/up to expand kid range (caveat: doesn’t generate real child prosody).
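Assuming KSC2’s age metadata can be exported to something tabular (the column names below are hypothetical and must be checked against the actual corpus distribution), the age filter itself is a few lines:

```python
import csv, io

# Hypothetical KSC2-style metadata rows; real field names are unverified.
META = """utt_id,age,duration_s
ksc2_0001,7,3.2
ksc2_0002,34,4.1
ksc2_0003,11,2.8
"""

def kid_utterances(meta_csv: str, max_age: int = 12) -> list[str]:
    """Select utterance ids whose speaker age is at or below max_age."""
    rows = csv.DictReader(io.StringIO(meta_csv))
    return [r["utt_id"] for r in rows if int(r["age"]) <= max_age]
```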

Fine-tune cost estimate (Phase 1 hardware = 1× RTX PRO 6000 96 GB):

| Model | Hours of training data | Wall-clock to converge | Disk |
| --- | --- | --- | --- |
| MMS-TTS-kaz LoRA | 1–3 h | 4–8 h | 5 GB |
| F5-TTS full fine-tune | 5–10 h | 1–2 days | 30 GB |
| KazakhTTS2 ESPnet (Tacotron2 + FastSpeech2) | 5–10 h | 2–3 days | 20 GB |

MMS-TTS LoRA is the cheapest experiment — start there before committing to full F5 retrain.


| | MMS-TTS-kaz | F5-TTS V1 stock | KazakhTTS2 (ISSAI) |
| --- | --- | --- | --- |
| Native KZ training | ✅ | ❌ (en/zh only) | ✅ (271 h, 5 voices) |
| Voice cloning | ❌ (single fixed voice) | ✅ timbre, ❌ KZ phonemes (verified failing) | ❌ (5 fixed voices) |
| Runtime cost | CPU, RTF 0.1 | MPS, ≈30 s for 10 s output | GPU + ESPnet stack |
| Setup effort | 5 min (HF transformers) | 5 min pip, but Kazakh fails | 1–2 days (ESPnet recipe + tarball) |
| Right for V1 naparnik? | ✅ ship now | ❌ needs KZ fine-tune first | ⚠️ V3 candidate when we want >1 voice |

Implementation order (revised after kid-voice tests, 2026-05-02 round 2):

  1. V1 (Э4): ship MMS-TTS-kaz for narration + librosa pitch-shift +4 semitones for the naparnik kid voice. Single model, post-processing only. The public-facing naparnik has no default name (per CLAUDE.md canon — the child names it themselves; «Спарк» used in this bench is the internal Figma/code label).
  2. V2 (Э5): MMS-TTS-kaz LoRA fine-tune on Kazakh kid corpus (KSC2 age-filtered + OIYNUP cartoon kid VO stems from Grinvich Technology). Goal: real kid timbre instead of pitch-shifted adult. ~4–8 h training on Phase 1 hardware (1× RTX PRO 6000 96 GB).
  3. V3 (post-launch): if quality plateau, evaluate KazakhTTS2 ESPnet recipe (5 voices) and F5-TTS full Kazakh fine-tune for the «3 character × 3 voice» canon from ADR-0001.

All artifacts are in 2026-05-02-kz-stt-tts-bench/:

_reference_kaz_16k.wav — 16 kHz mono input clip
kaz_turbo.txt — Whisper Large-v3-Turbo
kaz_largev3.txt — Whisper Large-v3 full
kaz_soyle.txt — issai/soyle_onnx
kaz_uali.txt — Uali/whisper-turbo-ksc2-kazakh-finetuned
tts_mms_intro.wav — MMS-TTS-kaz: naparnik intro (adult voice)
tts_mms_lesson.wav — MMS-TTS-kaz: number lesson (adult voice)
tts_mms_quest.wav — MMS-TTS-kaz: quest praise (adult voice)
tts_mms_*_kid.wav — MMS-TTS pitch-shifted +4 st (older kid)
tts_mms_*_kid_high.wav — MMS-TTS pitch-shifted +6 st (younger kid)
tts_mms_*_small_kid.wav — MMS-TTS pitch-shifted +8 st (small child)
tts_f5_kid_*.wav — F5-TTS V1 stock cross-lingual attempts (FAILED on Kazakh phonemes — kept for ear-test)
kid_ref_kaz.wav — 10 s kid Kazakh reference cut from OiynUp_Sound_KAZ.wav (4–14 s)
kid_ref_kaz.txt — Uali transcript of the reference clip

Reproduce locally:

```shell
uv venv --python 3.12 .venv && source .venv/bin/activate
uv pip install torch torchaudio "transformers>=4.46" scipy numpy soundfile librosa \
  mlx-whisper "optimum[onnxruntime]" onnxruntime
ffmpeg -i OiynUp_Sound_KAZ.wav -ar 16000 -ac 1 -c:a pcm_s16le kaz_16k.wav
# STT
mlx_whisper kaz_16k.wav --model mlx-community/whisper-large-v3-turbo --language kk
python -c "from transformers import WhisperProcessor, WhisperForConditionalGeneration; \
import soundfile as sf; a,sr = sf.read('kaz_16k.wav'); \
p=WhisperProcessor.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
m=WhisperForConditionalGeneration.from_pretrained('Uali/whisper-turbo-ksc2-kazakh-finetuned'); \
i=p(a,sampling_rate=sr,return_tensors='pt'); \
ids=m.generate(input_features=i.input_features,language='kk',task='transcribe',max_new_tokens=400); \
print(p.batch_decode(ids,skip_special_tokens=True)[0])"
# TTS (torch.set_grad_enabled replaces the with-block: compound statements
# cannot follow a semicolon once the shell joins the continuation lines)
python -c "from transformers import VitsModel, AutoTokenizer; import torch, scipy.io.wavfile as w; \
t=AutoTokenizer.from_pretrained('facebook/mms-tts-kaz'); \
m=VitsModel.from_pretrained('facebook/mms-tts-kaz'); \
torch.set_grad_enabled(False); \
i=t('Сәлем балалар!', return_tensors='pt'); \
o=m(**i).waveform; \
w.write('out.wav', m.config.sampling_rate, o.squeeze().numpy())"
```

  1. Listen to the three TTS clips. Is MMS-TTS-kaz acoustic quality good enough for V1 naparnik narration, or do we go straight to F5-TTS cross-lingual cloning?
  2. Do we have a transcript of OiynUp_Sound_KAZ.wav from the production team? If yes, we can compute real WER instead of triangulating. If not, we ship Uali as primary and tag «Домалақ» recognition as the canonical sanity check.
  3. OK to update ADR-0023? Replace Coqui-XTTS-v2 + Qwen3-TTS references with MMS-TTS-kaz (V1) + F5-TTS cross-lingual (V2). Soyle stays as fallback per Jean’s earlier direction.