# ADR-0023: Local AI infrastructure — DeepSeek V4-Flash + Kimi K2.6 on owned hardware, OIYNUP AI fine-tune
- Status: Accepted
- Deciders: Jean
- Date: 2026-05-02
- Supersedes: Cloud-API LLM dependency assumed in 2026-05-01-realistic-financial-recalibration.md §5 (Anthropic API ₸12M Y1)
- Related SOT slice: Layer 2 → Tech / Infrastructure (AI tutor stack — supplemented from existing PENDING entry)
## Context
The 2026-05-01 recalibration budgeted a ₸40M Y1 software bucket, including ~₸12M Anthropic API spend (Sonnet 4.6 + Opus 4.6 with prompt caching) and an ₸8M Soyle KZ STT enterprise license. Jean directed during the continuation session on 2026-05-02:
- All AI inference local-first on owned hardware. Cloud API used only for unexpected demand spikes — and the mandate is to size hardware so that no cloud-fallback budget is needed.
- No Anthropic API spend baseline in Y1 budget. If hardware insufficient, that’s a hardware sizing problem, not a cloud-spend problem.
- No Soyle enterprise license — Jean tested Soyle, found it worse than current open-source KZ STT/TTS options. Switch to open-source stack + custom voice training on owned hardware.
- No Firebase — self-host backend on EPYC server (Postgres + MinIO + Authelia + Caddy).
Concurrent with this direction, Jean specified: heavy use of agents across studio operations, plus an AI companion (the in-game AI assistant) fine-tuned on owned hardware. Latest open-weight frontier models verified May 2026:
- Kimi K2.6 (released 2026-04-20, Moonshot AI) — 1T parameter MoE / 32B activated, 262K context, native vision + video, agentic 300-swarm proven, SWE-Bench Pro 58.6 / Verified 80.2, HLE w/ tools 54.0 (leads GPT-5.4 + Opus 4.6), open-weight
- DeepSeek V4-Flash (released 2026-04-24, DeepSeek) — 284B parameter MoE / 13B activated, 1M token context, hybrid attention CSA+HCA, open-weight
- DeepSeek V4-Pro (same release) — 1.6T MoE / 49B activated, 1M context, top-tier open-source reasoning, available via API
These models change the unit economics of AI infrastructure significantly: local hosting on Phase 1+2 hardware (Mac Studio M3 Ultra 256GB + dual RTX PRO 6000 96GB ECC = 192GB VRAM) eliminates ~₸105M/yr cloud LLM spend at Y3 scale.
## Decision
Local-first AI infrastructure across two parallel use cases:
### Use case A — Studio agents (productivity)
- Model: Kimi K2.6 (1T MoE, 32B activated, 262K context, open-weight)
- Host: Mac Studio M3 Ultra 256GB unified RAM (Phase 1, ₸5.5M)
- Runtime: MLX (Apple Silicon native), FP8 / AWQ quantization with offload
- Tasks: code review, asset pipeline orchestration (3D fal.ai → Meshy → optimization), partnership research synthesis, B2B copy drafts, financial model maintenance, customer support routing, content QA, design system audits
### Use case B — OIYNUP AI (in-game AI assistant)
- Model base: DeepSeek V4-Flash (284B MoE, 13B activated, 1M context, open-weight)
- Host: Workstation Threadripper PRO 7980X + 1× RTX PRO 6000 96GB ECC + 128GB DDR5 ECC (Phase 1, ₸7.5M) for fine-tuning; a 2nd RTX PRO 6000 96GB (Phase 2, ₸5M) brings production inference to dual-GPU 192GB VRAM total
- Fine-tune corpus:
  - Curated KZ-language corpus (Soyle parallel data + government education datasets + cleaned Wikipedia KK + custom KZ children's speech samples)
  - Kid-safe RU corpus filtered through age-appropriate review
  - Curriculum Q&A pairs from the existing 2,000 validated questions in docs/question_bank/
  - Dialog pairs from cartoon scripts (10 ep × ~4,000 lines × N seasons = 40K+ dialog turns labeled by character)
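The dialog-pair extraction step above can be sketched in a few lines. The script format assumed here (`CHARACTER: line of dialog`) is a hypothetical convention — the real cartoon script format may differ — and the sample names and lines are invented for illustration:

```python
import re
from typing import List, Tuple

# Hypothetical script convention: "CHARACTER: line of dialog".
# Lines that do not match (stage directions, scene headers) are skipped.
LINE_RE = re.compile(r"^(?P<speaker>[A-ZА-ЯЁӘҒҚҢӨҰҮҺІ][\w\- ]*):\s*(?P<text>.+)$")

def extract_dialog_turns(script: str) -> List[Tuple[str, str]]:
    """Turn raw script text into (character, utterance) training pairs."""
    turns = []
    for raw in script.splitlines():
        m = LINE_RE.match(raw.strip())
        if m:
            turns.append((m.group("speaker").strip(), m.group("text").strip()))
    return turns

sample = """\
AIDA: Sälem! Bügin sandy sanaımyz ba?
BATYR: İä! Men birden onğa deıin sanaı alamyn.
(stage direction, not dialog)
AIDA: Tamaşa, bastaıyq!
"""
print(extract_dialog_turns(sample))  # 3 labeled turns; the stage direction is dropped
```

A production pipeline would add deduplication, character-name normalization across episodes, and age-appropriateness filtering before the pairs enter the fine-tune corpus.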
- Output: “OIYNUP AI” — vertical-specialized model serving in-game tutor backend
- Working name canon: OIYNUP AI (per Jean naming choice 2026-05-02)
### KZ STT/TTS open-source stack (Soyle dropped per Jean test result)
Y1H1 task for AI engineer: benchmark and select a stack from these candidates:
- Whisper Large-v3 KZ subset (OpenAI multilingual, 99 languages incl KZ)
- Coqui XTTS-v2 with KZ fine-tune (voice cloning + custom voices)
- F5-TTS open-source (zero-shot voice cloning)
- ISSAI Soyle base open-source release (kept as fallback if quality acceptable)
- Meta MMS multilingual including Turkic languages
- All run on owned hardware, ₸0 license cost
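The STT side of the Y1H1 benchmark will turn on word error rate (WER) over a held-out KZ children's-speech evaluation set. As a minimal sketch of the scoring metric itself (model inference and the evaluation set are out of scope here):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance on word sequences via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution against a 4-word reference -> WER 0.25
print(wer("бір екі үш төрт", "бір екі уш төрт"))  # 0.25
```

In practice each candidate would be scored with the same normalization rules (casing, punctuation, number spelling) so the comparison is fair across systems.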
Custom voice training: 3-character × 3-voice combinations per ADR-0001, fine-tuned on owned hardware using Coqui or F5-TTS pipelines.
### Self-hosted backend (Firebase replaced)
- Postgres (primary database)
- MinIO (S3-compatible object storage)
- Authelia (auth + SSO)
- Caddy (reverse proxy + auto-TLS)
- All running on Phase 2 EPYC 9554 game server (₸6.5M, 256GB ECC + 2× 4TB NVMe + 25GbE NIC)
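The four services above compose naturally on the EPYC host. A minimal sketch — image tags, ports, and volume paths are illustrative assumptions, not decided values:

```yaml
# Sketch only: versions, paths, and ports are placeholders, not canon.
services:
  postgres:
    image: postgres:16
    volumes: ["/data/pg:/var/lib/postgresql/data"]
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/pg_password
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    volumes: ["/data/minio:/data"]
  authelia:
    image: authelia/authelia
    volumes: ["/data/authelia:/config"]
  caddy:
    image: caddy:2
    ports: ["80:80", "443:443"]
    volumes: ["/data/caddy:/data", "./Caddyfile:/etc/caddy/Caddyfile"]
```

Caddy fronts all three internal services with auto-TLS, with Authelia wired in as the forward-auth provider.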
## What we explicitly are NOT doing
- No Anthropic API baseline budget (mandate: hardware sufficient)
- No Soyle KZ STT enterprise license
- No Firebase
- No Adobe Creative Cloud
- No JetBrains licenses
- No GitHub Copilot Business (replaced by local Kimi K2.6 in IDE via Claude Code Max)
- No cartoon-gen hardware tier in Y1 base (deferred Y2-Y3 when local video models catch fal.ai quality)
## Cost effects
| Software bucket Y1 | Recalibration ₸M | This ADR ₸M | Δ |
|---|---|---|---|
| Anthropic API | 12 | 0 | -12 |
| Soyle KZ STT | 8 | 0 | -8 |
| Firebase | 2.5 | 0 | -2.5 |
| Adobe + JetBrains + Copilot | 2.5 | 0 | -2.5 |
| Claude Code Max ($200/mo × 7 seats × 12 mo) | 0 | 8.4 | +8.4 |
| Unity Pro 3 seats | 3.5 | 3.5 | 0 |
| Figma | 1.0 | 1.0 | 0 |
| fal.ai (cartoon + game + toys + marketing) | 0 | 8.0 | +8.0 |
| Notion + Linear + Slack + monitoring | 0.8 | 0.8 | 0 |
| TOTAL | 40 | 22 | -18 |
Net Y1 software savings: ₸18M.
Y3 projected savings (at 300-500K MAU scale): a cloud-only baseline would cost ~₸120M/yr; local hardware handles 70-80% of the workload, leaving ~₸15M/yr residual cloud spend → ₸105M/yr savings from Y3 onward.
Hardware payback period: Phase 1+2 capex ₸40M against ₸105M/yr Y3 savings ≈ 5 months at Y3 scale.
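The headline figures above check out arithmetically; a few lines reproduce them from the table totals and the Y3 projection:

```python
# Totals from the cost table above (₸M, Year 1) and the Y3 projection.
recalibration_total, adr_total = 40, 22
y1_savings = recalibration_total - adr_total          # ₸18M, as stated

y3_cloud_baseline, y3_residual_cloud = 120, 15
y3_savings = y3_cloud_baseline - y3_residual_cloud    # ₸105M/yr

hardware_capex = 40                                   # Phase 1+2, ₸M
payback_months = hardware_capex / y3_savings * 12

print(y1_savings)                # 18
print(y3_savings)                # 105
print(round(payback_months, 1))  # 4.6 — the "~5 months" in the text
```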
## Consequences
- Plus: OIYNUP AI fine-tuned on KZ corpus is a balance-sheet asset — defensible vertical moat. Not extractable by Anthropic/OpenAI price hikes or terms changes.
- Plus: Hardware-amortized inference cost approaches $0 marginal at scale. Y3 saves ~₸105M vs cloud-only.
- Plus: Open-weight base means models run forever even if upstream provider pivots to closed-only future releases. V4-Flash + K2.6 weights already published.
- Plus: Self-hosted backend (Postgres + MinIO + Authelia + Caddy) eliminates Firebase data-residency risk for KZ kids data + COPPA compliance — KZ-residency by default since EPYC server is in Almaty office.
- Plus: Studio agent productivity at Y1 saves ~$60-110K/yr vs full-API operation — frees founder bandwidth.
- Minus: Hardware reliability risk — a single GPU failure halts inference. Mitigation: the Phase 2 dual RTX PRO 6000 setup gives redundancy; a redundant Mac Studio LLM node is planned for Y2 per ADR-0017.
- Minus: Fine-tune quality may trail API quality in Y1, especially before the AI engineer is fully ramped. Mitigation: hybrid mode with Sonnet 4.6 fallback for high-stakes tutor calls until the fine-tune passes the quality bar — but this fallback is BUDGETED ZERO in Y1 per Jean direction (must work locally). Alternative mitigation: a ₸2M emergency cloud-API allocation inside the buffer if the quality gap is unacceptable.
- Minus: In-house skill gap — fine-tuning DeepSeek V4-Flash requires an AI engineer with LoRA / QLoRA / full fine-tune experience. Hire criterion: the AI engineer hired Q1 2027 must have an LLM fine-tuning track record.
- Minus: Self-hosted backend requires DevOps/SRE skills; covered by hardware engineer + Irlen Y1, dedicated DevOps hire Y2-Y3.
- Next steps: Hire AI engineer Q1-Q2 2026 with explicit fine-tuning experience; benchmark + select KZ STT/TTS stack Y1H1; build OIYNUP AI corpus pipeline (cartoon scripts + curriculum + speech samples); fine-tune V4-Flash on Phase 1 hardware; write 2026-05-02-kz-stt-tts-stack.md after benchmarking.
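The LoRA requirement in the hiring criterion above can be grounded with a quick parameter-count estimate of why adapter fine-tuning fits on a single 96GB GPU. The layer dimensions below are illustrative assumptions, not published V4-Flash internals:

```python
# Rough LoRA parameter-count estimate. The hidden size below is an
# illustrative assumption, NOT a published DeepSeek V4-Flash dimension.
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA freezes W (d_out x d_in) and trains a low-rank update B @ A,
    # where A is (rank x d_in) and B is (d_out x rank).
    return rank * (d_in + d_out)

d = 4096            # assumed hidden size of one projection layer
full = d * d        # params in one full-rank weight matrix
adapter = lora_params(d, d, rank=16)

print(f"full matrix: {full:,} params")              # 16,777,216
print(f"rank-16 LoRA: {adapter:,} params")          # 131,072
print(f"trainable fraction: {adapter / full:.2%}")  # 0.78%
```

Sub-1% trainable fractions per layer are what make fine-tuning a 284B MoE tractable on the Phase 1 workstation, especially combined with QLoRA's quantized frozen weights.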
## Alternatives considered
- Anthropic Sonnet 4.6 cloud-only (recalibration baseline): rejected — ₸12M Y1 + scaling cost + dependency on US provider + no fine-tune control + COPPA / KZ-data-residency risk on cross-border API calls.
- OpenAI GPT-5.4 / GPT-5 Plus: same rejection rationale — closed-weight, scaling cost, API dependency.
- Soyle KZ STT enterprise: rejected per Jean’s test result (worse than current open-source options); also closed-weight + license cost.
- Hybrid Anthropic + local (60/40 split): rejected — Jean direction is no Anthropic baseline, mandate hardware-sufficient.
- Smaller local models (DeepSeek V3 / Qwen 3-72B / Llama 4-70B): considered for Y1 launch readiness — V4-Flash significantly better at 13B activated parameters with 1M context. V3 stays as fallback option in Y1H1 benchmarking.
- Larger local model V4-Pro 1.6T directly: rejected — too large for Phase 1+2 hardware in production. V4-Pro available via API for occasional spike loads only.
- Closed-weight frontier model + self-host: not available (no major US lab releases weights at frontier scale).
## Related
- SOT: SOURCE_OF_TRUTH.md §Layer 2 Tech / Infrastructure → AI tutor stack (existing PENDING entry promoted to LOCKED with this ADR)
- ADRs: ADR-0001 (world design — AI tutor 3-axis architecture), ADR-0017 (equipment best-in-class — RTX PRO 6000 + Mac M3 Ultra), ADR-0018 (release dates), ADR-0019 (realistic financial baseline)
- Specs: docs/superpowers/specs/2026-05-02-canon-corrections-jean-direction.md §C7 + §C10
- Memory: project_oinap_ai_tutor_stack.md (update with V4-Flash + K2.6 candidates added to A/B/C cascade options); feedback_no_hardcoded_tech.md (this ADR is product-strategic, not skill-level tech prescription)
- External: DeepSeek V4 API Docs (released 2026-04-24); Kimi K2.6 SiliconANGLE (released 2026-04-20); Kimi K2.6 MarkTechPost