
ADR-0023: Local AI infrastructure — DeepSeek V4-Flash + Kimi K2.6 on owned hardware, OIYNUP AI fine-tune

  • Status: Accepted
  • Deciders: Jean
  • Date: 2026-05-02
  • Supersedes: Cloud-API LLM dependency assumed in 2026-05-01-realistic-financial-recalibration.md §5 (Anthropic API ₸12M Y1)
  • Related SOT slice: Layer 2 → Tech / Infrastructure (AI tutor stack — supplemented from existing PENDING entry)

The 2026-05-01 recalibration budgeted a ₸40M Y1 software bucket, including ~₸12M of Anthropic API spend (Sonnet 4.6 + Opus 4.6 with prompt caching) and an ₸8M Soyle KZ STT enterprise license. During the 2026-05-02 continuation session, Jean directed:

  1. All AI inference is local-first on owned hardware. The cloud API is used only for unexpected demand spikes, and the mandate is to size the hardware so that no cloud-fallback budget is needed.
  2. No baseline Anthropic API spend in the Y1 budget. If the hardware is insufficient, that is a hardware-sizing problem, not a cloud-spend problem.
  3. No Soyle enterprise license: Jean tested Soyle and found it worse than current open-source KZ STT/TTS options. Switch to an open-source stack plus custom voice training on owned hardware.
  4. No Firebase: self-host the backend on the EPYC server (Postgres + MinIO + Authelia + Caddy).

Concurrent with this direction, Jean specified heavy use of agents across studio operations, with the AI companion (the in-game AI assistant) fine-tuned on owned hardware. Latest open-weight frontier models verified in May 2026:

  • Kimi K2.6 (released 2026-04-20, Moonshot AI) — 1T parameter MoE / 32B activated, 262K context, native vision + video, agentic 300-swarm proven, SWE-Bench Pro 58.6 / Verified 80.2, HLE w/ tools 54.0 (leads GPT-5.4 + Opus 4.6), open-weight
  • DeepSeek V4-Flash (released 2026-04-24, DeepSeek) — 284B parameter MoE / 13B activated, 1M token context, hybrid attention CSA+HCA, open-weight
  • DeepSeek V4-Pro (same release) — 1.6T MoE / 49B activated, 1M context, top-tier open-source reasoning, available via API

These models change the unit economics of AI infrastructure significantly: local hosting on Phase 1+2 hardware (Mac Studio M3 Ultra 256GB + dual RTX PRO 6000 96GB ECC = 192GB VRAM) eliminates ~₸105M/yr cloud LLM spend at Y3 scale.

Local-first AI infrastructure across two parallel use cases.

Use case 1, studio operations agents (a minimal local-inference sketch follows the list):

  • Model: Kimi K2.6 (1T MoE, 32B activated, 262K context, open-weight)
  • Host: Mac Studio M3 Ultra 256GB unified RAM (Phase 1, ₸5.5M)
  • Runtime: MLX (Apple Silicon native), FP8 / AWQ quantization with offload
  • Tasks: code review, asset pipeline orchestration (3D fal.ai → Meshy → optimization), partnership research synthesis, B2B copy drafts, financial model maintenance, customer support routing, content QA, design system audits
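
A hedged sketch of what a studio-agent call looks like on this host, assuming the Kimi K2.6 weights have already been converted and quantized for MLX; the local model path and the prompt are placeholders, not canon:

```python
# Minimal local-inference sketch using mlx-lm on the Mac Studio.
# Assumes the open-weight checkpoint has already been converted and
# quantized for MLX; the model path and prompt are placeholders.
from mlx_lm import load, generate

MODEL_PATH = "models/kimi-k2.6-mlx-4bit"  # hypothetical local path

model, tokenizer = load(MODEL_PATH)

prompt = (
    "Review this asset-pipeline script for obvious errors and "
    "summarize the fixes in three bullet points."
)

# generate() runs fully on-device; no cloud API call is involved.
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
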
Use case 2, the in-game AI companion (OIYNUP AI fine-tune of DeepSeek V4-Flash):

  • Model base: DeepSeek V4-Flash (284B MoE, 13B activated, 1M context, open-weight)
  • Host: Workstation Threadripper PRO 7980X + 1× RTX PRO 6000 96GB ECC + 128GB DDR5 ECC (Phase 1, ₸7.5M) for fine-tuning, + 2nd RTX PRO 6000 96GB (Phase 2, ₸5M) for production inference dual-GPU 192GB VRAM total
  • Fine-tune corpus (a JSONL assembly sketch follows this use case):
    • Curated KZ-language corpus (Soyle parallel data + government education datasets + cleaned Wikipedia KK + custom KZ children’s speech samples)
    • Kid-safe RU corpus filtered through age-appropriate review
    • Curriculum Q&A pairs from existing 2 000 validated questions in docs/question_bank/
    • Dialog pairs from cartoon scripts (10 ep × ~4 000 lines × N seasons = 40K+ dialog turns labeled by character)
  • Output: “OIYNUP AI” — vertical-specialized model serving in-game tutor backend
  • Working name canon: OIYNUP AI (per Jean naming choice 2026-05-02)
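
A hedged sketch of the corpus-assembly step, turning the validated question bank and character-labeled dialog turns into instruction-tuning pairs; the file layouts, field names and paths other than docs/question_bank/ are assumptions:

```python
# Sketch: assemble OIYNUP AI fine-tune pairs into a single JSONL file.
# The source-file layouts and field names are assumptions for illustration;
# only docs/question_bank/ is named in the ADR itself.
import json
from pathlib import Path

OUT = Path("data/oiynup_ai_sft.jsonl")

def qa_pairs(question_bank_dir: Path):
    """Curriculum Q&A: one instruction/response pair per validated question."""
    for f in question_bank_dir.glob("*.json"):
        for item in json.loads(f.read_text(encoding="utf-8")):
            yield {
                "instruction": item["question"],
                "response": item["answer"],
                "source": "question_bank",
            }

def dialog_pairs(script_dir: Path):
    """Cartoon scripts: consecutive lines become (context, reply) pairs,
    keeping the speaking character as a tag for persona conditioning."""
    for f in script_dir.glob("*.jsonl"):
        lines = [json.loads(l) for l in f.read_text(encoding="utf-8").splitlines() if l.strip()]
        for prev, cur in zip(lines, lines[1:]):
            yield {
                "instruction": f'{prev["character"]}: {prev["text"]}',
                "response": cur["text"],
                "character": cur["character"],
                "source": "cartoon_script",
            }

with OUT.open("w", encoding="utf-8") as out:
    for rec in qa_pairs(Path("docs/question_bank")):
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
    for rec in dialog_pairs(Path("data/cartoon_scripts")):  # hypothetical path
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
```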

KZ STT/TTS stack, replacing the Soyle enterprise license. Y1H1 task for the AI engineer: benchmark and select the stack from these candidates (a benchmark sketch follows the list):

  • Whisper Large-v3 KZ subset (OpenAI multilingual, 99 languages incl. KZ)
  • Coqui XTTS-v2 with KZ fine-tune (voice cloning + custom voices)
  • F5-TTS open-source (zero-shot voice cloning)
  • ISSAI Soyle base open-source release (kept as fallback if quality acceptable)
  • Meta MMS multilingual, including Turkic languages
  • All run on owned hardware, ₸0 license cost
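
A hedged sketch of the benchmark harness: transcribe a held-out KZ test set with each candidate and compare word error rate. Only the Whisper candidate is wired up here, and the manifest path and format are assumptions; the other candidates would plug into the same loop:

```python
# Sketch of the Y1H1 STT benchmark: transcribe a KZ test set and score WER.
# The manifest path/format are assumptions; each candidate model gets the
# same treatment and the lowest-WER stack wins.
import json
from pathlib import Path

from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def evaluate(manifest_path: str) -> float:
    """Manifest: one JSON object per line with 'audio' (wav path) and 'text' (reference)."""
    refs, hyps = [], []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        item = json.loads(line)
        refs.append(item["text"])
        out = asr(item["audio"],
                  generate_kwargs={"language": "kazakh", "task": "transcribe"})
        hyps.append(out["text"])
    return wer(refs, hyps)

print("Whisper large-v3 KZ WER:", evaluate("data/kz_stt_test.jsonl"))
```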

Custom voice training: the 3-character × 3-voice combinations per ADR-0001, fine-tuned on owned hardware using Coqui or F5-TTS pipelines. A hedged cloning sketch follows.
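
A sketch of zero-shot voice cloning with the stock Coqui XTTS-v2 checkpoint. XTTS-v2 does not ship Kazakh out of the box, so Russian is used purely as an illustration; the production path is the KZ fine-tune (or F5-TTS), and the reference clip, sample line and output paths are placeholders:

```python
# Sketch: zero-shot voice cloning with stock XTTS-v2 (Coqui TTS).
# Kazakh is not in the stock language list, so "ru" is a stand-in until
# the KZ fine-tuned checkpoint replaces the model name below.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Привет! Сегодня мы вместе будем учиться считать.",  # sample line, not canon
    speaker_wav="voices/character_a_reference.wav",  # hypothetical reference clip
    language="ru",  # placeholder until the KZ fine-tune is ready
    file_path="out/character_a_sample.wav",
)
```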

Self-hosted backend, replacing Firebase (a minimal connection sketch follows the list):

  • Postgres (primary database)
  • MinIO (S3-compatible object storage)
  • Authelia (auth + SSO)
  • Caddy (reverse proxy + auto-TLS)
  • All running on Phase 2 EPYC 9554 game server (₸6.5M, 256GB ECC + 2× 4TB NVMe + 25GbE NIC)
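
A minimal sketch of application-side access to this stack; hostnames, credentials, bucket and table names are placeholders, and in production Caddy terminates TLS while Authelia fronts the admin surfaces:

```python
# Sketch: backend access to the self-hosted stack on the EPYC server.
# Hostnames, credentials, bucket and table names are placeholders.
import psycopg2
from minio import Minio

# Postgres: primary database for player profiles and progress.
conn = psycopg2.connect(
    host="db.internal.oiynup.kz",  # placeholder hostname
    dbname="oiynup",
    user="app",
    password="change-me",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM players;")  # placeholder table
    print("players:", cur.fetchone()[0])

# MinIO: S3-compatible object storage for assets and voice samples.
s3 = Minio(
    "minio.internal.oiynup.kz",  # placeholder endpoint
    access_key="app",
    secret_key="change-me",
    secure=True,
)
s3.fput_object("assets", "voices/character_a_sample.wav",
               "out/character_a_sample.wav")
```
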
Removed from the Y1 software budget:

  • No Anthropic API baseline budget (mandate: hardware sufficient)
  • No Soyle KZ STT enterprise license
  • No Firebase
  • No Adobe Creative Cloud
  • No JetBrains licenses
  • No GitHub Copilot Business (replaced by local Kimi K2.6 in IDE via Claude Code Max)
  • No cartoon-gen hardware tier in Y1 base (deferred Y2-Y3 when local video models catch fal.ai quality)

| Software bucket (Y1) | Recalibration ₸M | This ADR ₸M | Δ ₸M |
| --- | --- | --- | --- |
| Anthropic API | 12 | 0 | -12 |
| Soyle KZ STT | 8 | 0 | -8 |
| Firebase | 2.5 | 0 | -2.5 |
| Adobe + JetBrains + Copilot | 2.5 | 0 | -2.5 |
| Claude Code Max ($200/mo × 7 seats × 12 mo) | 0 | 8.4 | +8.4 |
| Unity Pro 3 seats | 3.5 | 3.5 | 0 |
| Figma | 1.0 | 1.0 | 0 |
| fal.ai (cartoon + game + toys + marketing) | 0 | 8.0 | +8.0 |
| Notion + Linear + Slack + monitoring | 0.8 | 0.8 | 0 |
| TOTAL | 40 | 22 | -18 |

Net Y1 software savings: ₸18M.

Y3 projected savings (at MAU 300-500K scale): cloud-only baseline would cost ~₸120M/yr; local handles 70-80% workload, residual cloud ₸15M/yr → savings ₸105M/yr Y3 onwards.

Hardware payback period: Phase 1+2 hardware ₸40M ÷ ₸105M/yr Y3 savings ≈ 0.38 years, i.e. roughly 5 months once Y3 scale is reached.

Consequences:

  • Plus: OIYNUP AI fine-tuned on the KZ corpus is a balance-sheet asset and a defensible vertical moat, not exposed to Anthropic/OpenAI price hikes or terms changes.
  • Plus: Hardware-amortized inference cost approaches $0 marginal at scale. Y3 saves ~₸105M vs cloud-only.
  • Plus: Open-weight base means models run forever even if upstream provider pivots to closed-only future releases. V4-Flash + K2.6 weights already published.
  • Plus: Self-hosted backend (Postgres + MinIO + Authelia + Caddy) eliminates Firebase data-residency risk for KZ kids data + COPPA compliance — KZ-residency by default since EPYC server is in Almaty office.
  • Plus: Studio agent productivity at Y1 saves ~$60-110K/yr vs full-API operation — frees founder bandwidth.
  • Minus: Hardware reliability risk — single GPU failure halts inference. Mitigation: Phase 2 dual RTX 6000 setup gives redundancy. Mac Studio Y2 redundant LLM node planned per ADR-0017.
  • Minus: Fine-tune quality may fall short of API quality in Y1, especially before the AI engineer is fully ramped. Mitigation: hybrid mode with a Sonnet 4.6 fallback for high-stakes tutor calls until the fine-tune passes the quality bar; this fallback is BUDGETED ZERO in Y1 per Jean's direction (must work via local). Alternative mitigation: a ₸2M emergency cloud-API allocation inside the buffer if the quality gap is unacceptable.
  • Minus: In-house skill requirement: fine-tuning DeepSeek V4-Flash requires an AI engineer with LoRA / QLoRA / full fine-tune experience. Hire criterion specified: the Q1 2027 AI engineer must have an LLM fine-tuning track record.
  • Minus: Self-hosted backend requires DevOps/SRE skills; covered by hardware engineer + Irlen Y1, dedicated DevOps hire Y2-Y3.
  • Next steps: hire an AI engineer in Q1-Q2 2027 with explicit fine-tuning experience; benchmark and select the KZ STT/TTS stack in Y1H1; build the OIYNUP AI corpus pipeline (cartoon scripts + curriculum + speech samples); fine-tune V4-Flash on Phase 1 hardware (a hedged QLoRA sketch follows this list); write 2026-05-02-kz-stt-tts-stack.md after benchmarking.
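
A hedged sketch of that fine-tune step, using the standard QLoRA pattern (4-bit quantized base model plus LoRA adapters) that the hire criterion above refers to. The base-model identifier, target modules, dataset path and hyperparameters are placeholders, not tested values:

```python
# Sketch: QLoRA fine-tune of the OIYNUP AI base model on the Phase 1
# workstation. Model id, target modules, dataset path and all
# hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

BASE = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical hub id

tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder set
    task_type="CAUSAL_LM",
))

# JSONL produced by the corpus-assembly sketch earlier in this ADR.
ds = load_dataset("json", data_files="data/oiynup_ai_sft.jsonl")["train"]

def tokenize(batch):
    text = [f"{q}\n{a}" for q, a in zip(batch["instruction"], batch["response"])]
    return tok(text, truncation=True, max_length=2048)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="runs/oiynup-ai-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=ds.map(tokenize, batched=True, remove_columns=ds.column_names),
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```
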
Alternatives considered:

  • Anthropic Sonnet 4.6 cloud-only (recalibration baseline): rejected — ₸12M Y1 + scaling cost + dependency on US provider + no fine-tune control + COPPA / KZ-data-residency risk on cross-border API calls.
  • OpenAI GPT-5.4 / GPT-5 Plus: same rejection rationale — closed-weight, scaling cost, API dependency.
  • Soyle KZ STT enterprise: rejected per Jean’s test result (worse than current open-source options); also closed-weight + license cost.
  • Hybrid Anthropic + local (60/40 split): rejected — Jean direction is no Anthropic baseline, mandate hardware-sufficient.
  • Smaller local models (DeepSeek V3 / Qwen 3-72B / Llama 4-70B): considered for Y1 launch readiness — V4-Flash significantly better at 13B activated parameters with 1M context. V3 stays as fallback option in Y1H1 benchmarking.
  • Larger local model V4-Pro 1.6T directly: rejected — too large for Phase 1+2 hardware in production. V4-Pro available via API for occasional spike loads only.
  • Closed-weight frontier model + self-host: not available (no major US lab releases weights at frontier scale).

References:

  • SOT: SOURCE_OF_TRUTH.md §Layer 2 Tech / Infrastructure → AI tutor stack (existing PENDING entry promoted to LOCKED with this ADR)
  • ADRs: ADR-0001 (world design — AI tutor 3-axis architecture), ADR-0017 (equipment best-in-class — RTX PRO 6000 + Mac M3 Ultra), ADR-0018 (release dates), ADR-0019 (realistic financial baseline)
  • Specs: docs/superpowers/specs/2026-05-02-canon-corrections-jean-direction.md §C7 + §C10
  • Memory: project_oinap_ai_tutor_stack.md (update with V4-Flash + K2.6 candidates added to A/B/C cascade options); feedback_no_hardcoded_tech.md (this ADR is product-strategic, not skill-level tech prescription)
  • External: DeepSeek V4 API Docs (released 2026-04-24); Kimi K2.6 SiliconANGLE (released 2026-04-20); Kimi K2.6 MarkTechPost