
ADR-0023: Local AI infrastructure — DeepSeek V4-Flash + Kimi K2.6 on owned hardware, OIYNUP AI fine-tune

  • Status: Accepted
  • Deciders: Jean
  • Date: 2026-05-02
  • Supersedes: Cloud-API LLM dependency assumed in 2026-05-01-realistic-financial-recalibration.md §5 (Anthropic API ₸12M Y1)
  • Related SOT slice: Layer 2 → Tech / Infrastructure (AI tutor stack — supplemented from existing PENDING entry)

The 2026-05-01 recalibration budgeted a ₸40M Y1 software bucket, including ~₸12M of Anthropic API spend (Sonnet 4.6 + Opus 4.6 with prompt caching) and an ₸8M Soyle KZ STT enterprise license. During the 2026-05-02 continuation session, Jean directed:

  1. All AI inference is local-first on owned hardware. The cloud API is used only for unexpected demand spikes, and the mandate is to size the hardware so that no cloud-fallback budget is needed.
  2. No baseline Anthropic API spend in the Y1 budget. If the hardware is insufficient, that is a hardware-sizing problem, not a cloud-spend problem.
  3. No Soyle enterprise license: Jean tested Soyle and found it worse than current open-source KZ STT/TTS options. Switch to an open-source stack plus custom voice training on owned hardware.
  4. No Firebase: self-host the backend on the EPYC server (Postgres + MinIO + Authelia + Caddy).

Concurrent with this direction, Jean specified heavy use of agents across studio operations, with the AI companion (the in-game AI assistant) fine-tuned on owned hardware. Latest open-weight frontier models verified in May 2026:

  • Kimi K2.6 (released 2026-04-20, Moonshot AI) — 1T parameter MoE / 32B activated, 262K context, native vision + video, agentic 300-swarm proven, SWE-Bench Pro 58.6 / Verified 80.2, HLE w/ tools 54.0 (leads GPT-5.4 + Opus 4.6), open-weight
  • DeepSeek V4-Flash (released 2026-04-24, DeepSeek) — 284B parameter MoE / 13B activated, 1M token context, hybrid attention CSA+HCA, open-weight
  • DeepSeek V4-Pro (same release) — 1.6T MoE / 49B activated, 1M context, top-tier open-source reasoning, available via API

These models change the unit economics of AI infrastructure significantly: local hosting on Phase 1+2 hardware (Mac Studio M3 Ultra 256GB + dual RTX PRO 6000 96GB ECC = 192GB VRAM) eliminates ~₸105M/yr cloud LLM spend at Y3 scale.

Local-first AI infrastructure across two parallel use cases.

Use case 1, studio operations agents (a minimal local-inference sketch follows the list):

  • Model: Kimi K2.6 (1T MoE, 32B activated, 262K context, open-weight)
  • Host: Mac Studio M3 Ultra 256GB unified RAM (Phase 1, ₸5.5M)
  • Runtime: MLX (Apple Silicon native), FP8 / AWQ quantization with offload
  • Tasks: code review, asset pipeline orchestration (3D fal.ai → Meshy → optimization), partnership research synthesis, B2B copy drafts, financial model maintenance, customer support routing, content QA, design system audits
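
A hedged sketch of what a studio-agent call looks like on this host, assuming the Kimi K2.6 weights have already been converted and quantized for MLX; the local model path and the prompt are placeholders, not canon:

```python
# Minimal local-inference sketch using mlx-lm on the Mac Studio.
# Assumes the open-weight checkpoint has already been converted and
# quantized for MLX; the model path and prompt are placeholders.
from mlx_lm import load, generate

MODEL_PATH = "models/kimi-k2.6-mlx-4bit"  # hypothetical local path

model, tokenizer = load(MODEL_PATH)

prompt = (
    "Review this asset-pipeline script for obvious errors and "
    "summarize the fixes in three bullet points."
)

# generate() runs fully on-device; no cloud API call is involved.
print(generate(model, tokenizer, prompt=prompt, max_tokens=512))
```
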
Use case 2, the in-game AI companion (OIYNUP AI fine-tune of DeepSeek V4-Flash):

  • Model base: DeepSeek V4-Flash (284B MoE, 13B activated, 1M context, open-weight)
  • Host: Workstation Threadripper PRO 7980X + 1× RTX PRO 6000 96GB ECC + 128GB DDR5 ECC (Phase 1, ₸7.5M) for fine-tuning, + 2nd RTX PRO 6000 96GB (Phase 2, ₸5M) for production inference dual-GPU 192GB VRAM total
  • Fine-tune corpus (a JSONL assembly sketch follows this use case):
    • Curated KZ-language corpus (Soyle parallel data + government education datasets + cleaned Wikipedia KK + custom KZ children’s speech samples)
    • Kid-safe RU corpus filtered through age-appropriate review
    • Curriculum Q&A pairs from existing 2 000 validated questions in docs/question_bank/
    • Dialog pairs from cartoon scripts (10 ep × ~4 000 lines × N seasons = 40K+ dialog turns labeled by character)
  • Output: “OIYNUP AI” — vertical-specialized model serving in-game tutor backend
  • Working name canon: OIYNUP AI (per Jean naming choice 2026-05-02)
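
A hedged sketch of the corpus-assembly step, turning the validated question bank and character-labeled dialog turns into instruction-tuning pairs; the file layouts, field names and paths other than docs/question_bank/ are assumptions:

```python
# Sketch: assemble OIYNUP AI fine-tune pairs into a single JSONL file.
# The source-file layouts and field names are assumptions for illustration;
# only docs/question_bank/ is named in the ADR itself.
import json
from pathlib import Path

OUT = Path("data/oiynup_ai_sft.jsonl")

def qa_pairs(question_bank_dir: Path):
    """Curriculum Q&A: one instruction/response pair per validated question."""
    for f in question_bank_dir.glob("*.json"):
        for item in json.loads(f.read_text(encoding="utf-8")):
            yield {
                "instruction": item["question"],
                "response": item["answer"],
                "source": "question_bank",
            }

def dialog_pairs(script_dir: Path):
    """Cartoon scripts: consecutive lines become (context, reply) pairs,
    keeping the speaking character as a tag for persona conditioning."""
    for f in script_dir.glob("*.jsonl"):
        lines = [json.loads(l) for l in f.read_text(encoding="utf-8").splitlines() if l.strip()]
        for prev, cur in zip(lines, lines[1:]):
            yield {
                "instruction": f'{prev["character"]}: {prev["text"]}',
                "response": cur["text"],
                "character": cur["character"],
                "source": "cartoon_script",
            }

with OUT.open("w", encoding="utf-8") as out:
    for rec in qa_pairs(Path("docs/question_bank")):
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
    for rec in dialog_pairs(Path("data/cartoon_scripts")):  # hypothetical path
        out.write(json.dumps(rec, ensure_ascii=False) + "\n")
```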

KZ STT/TTS stack, replacing the Soyle enterprise license. Y1H1 task for the AI engineer: benchmark and select the stack from these candidates (a benchmark sketch follows the list):

  • Whisper Large-v3 KZ subset (OpenAI multilingual, 99 languages incl. KZ)
  • Coqui XTTS-v2 with KZ fine-tune (voice cloning + custom voices)
  • F5-TTS open-source (zero-shot voice cloning)
  • ISSAI Soyle base open-source release (kept as fallback if quality acceptable)
  • Meta MMS multilingual, including Turkic languages
  • All run on owned hardware, ₸0 license cost
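
A hedged sketch of the benchmark harness: transcribe a held-out KZ test set with each candidate and compare word error rate. Only the Whisper candidate is wired up here, and the manifest path and format are assumptions; the other candidates would plug into the same loop:

```python
# Sketch of the Y1H1 STT benchmark: transcribe a KZ test set and score WER.
# The manifest path/format are assumptions; each candidate model gets the
# same treatment and the lowest-WER stack wins.
import json
from pathlib import Path

from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def evaluate(manifest_path: str) -> float:
    """Manifest: one JSON object per line with 'audio' (wav path) and 'text' (reference)."""
    refs, hyps = [], []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        item = json.loads(line)
        refs.append(item["text"])
        out = asr(item["audio"],
                  generate_kwargs={"language": "kazakh", "task": "transcribe"})
        hyps.append(out["text"])
    return wer(refs, hyps)

print("Whisper large-v3 KZ WER:", evaluate("data/kz_stt_test.jsonl"))
```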

Custom voice training: the 3-character × 3-voice combinations per ADR-0001, fine-tuned on owned hardware using Coqui or F5-TTS pipelines. A hedged cloning sketch follows.
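
A sketch of zero-shot voice cloning with the stock Coqui XTTS-v2 checkpoint. XTTS-v2 does not ship Kazakh out of the box, so Russian is used purely as an illustration; the production path is the KZ fine-tune (or F5-TTS), and the reference clip, sample line and output paths are placeholders:

```python
# Sketch: zero-shot voice cloning with stock XTTS-v2 (Coqui TTS).
# Kazakh is not in the stock language list, so "ru" is a stand-in until
# the KZ fine-tuned checkpoint replaces the model name below.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Привет! Сегодня мы вместе будем учиться считать.",  # sample line, not canon
    speaker_wav="voices/character_a_reference.wav",  # hypothetical reference clip
    language="ru",  # placeholder until the KZ fine-tune is ready
    file_path="out/character_a_sample.wav",
)
```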

Self-hosted backend, replacing Firebase (a minimal connection sketch follows the list):

  • Postgres (primary database)
  • MinIO (S3-compatible object storage)
  • Authelia (auth + SSO)
  • Caddy (reverse proxy + auto-TLS)
  • All running on Phase 2 EPYC 9554 game server (₸6.5M, 256GB ECC + 2× 4TB NVMe + 25GbE NIC)
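
A minimal sketch of application-side access to this stack; hostnames, credentials, bucket and table names are placeholders, and in production Caddy terminates TLS while Authelia fronts the admin surfaces:

```python
# Sketch: backend access to the self-hosted stack on the EPYC server.
# Hostnames, credentials, bucket and table names are placeholders.
import psycopg2
from minio import Minio

# Postgres: primary database for player profiles and progress.
conn = psycopg2.connect(
    host="db.internal.oiynup.kz",  # placeholder hostname
    dbname="oiynup",
    user="app",
    password="change-me",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM players;")  # placeholder table
    print("players:", cur.fetchone()[0])

# MinIO: S3-compatible object storage for assets and voice samples.
s3 = Minio(
    "minio.internal.oiynup.kz",  # placeholder endpoint
    access_key="app",
    secret_key="change-me",
    secure=True,
)
s3.fput_object("assets", "voices/character_a_sample.wav",
               "out/character_a_sample.wav")
```
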
Removed from the Y1 software budget:

  • No Anthropic API baseline budget (mandate: hardware sufficient)
  • No Soyle KZ STT enterprise license
  • No Firebase
  • No Adobe Creative Cloud
  • No JetBrains licenses
  • No GitHub Copilot Business (replaced by local Kimi K2.6 in IDE via Claude Code Max)
  • No cartoon-gen hardware tier in Y1 base (deferred Y2-Y3 when local video models catch fal.ai quality)

| Software bucket (Y1) | Recalibration ₸M | This ADR ₸M | Δ ₸M |
| --- | --- | --- | --- |
| Anthropic API | 12 | 0 | -12 |
| Soyle KZ STT | 8 | 0 | -8 |
| Firebase | 2.5 | 0 | -2.5 |
| Adobe + JetBrains + Copilot | 2.5 | 0 | -2.5 |
| Claude Code Max ($200/mo × 7 seats × 12 mo) | 0 | 8.4 | +8.4 |
| Unity Pro 3 seats | 3.5 | 3.5 | 0 |
| Figma | 1.0 | 1.0 | 0 |
| fal.ai (cartoon + game + toys + marketing) | 0 | 8.0 | +8.0 |
| Notion + Linear + Slack + monitoring | 0.8 | 0.8 | 0 |
| TOTAL | 40 | 22 | -18 |

Net Y1 software savings: ₸18M.

Y3 projected savings (at MAU 300-500K scale): cloud-only baseline would cost ~₸120M/yr; local handles 70-80% workload, residual cloud ₸15M/yr → savings ₸105M/yr Y3 onwards.

Hardware payback period: Phase 1+2 hardware ₸40M ÷ ₸105M/yr Y3 savings ≈ 0.38 years, i.e. roughly 5 months once Y3 scale is reached.

Consequences:

  • Plus: OIYNUP AI fine-tuned on the KZ corpus is a balance-sheet asset and a defensible vertical moat, not exposed to Anthropic/OpenAI price hikes or terms changes.
  • Plus: Hardware-amortized inference cost approaches $0 marginal at scale. Y3 saves ~₸105M vs cloud-only.
  • Plus: Open-weight base means models run forever even if upstream provider pivots to closed-only future releases. V4-Flash + K2.6 weights already published.
  • Plus: Self-hosted backend (Postgres + MinIO + Authelia + Caddy) eliminates Firebase data-residency risk for KZ kids data + COPPA compliance — KZ-residency by default since EPYC server is in Almaty office.
  • Plus: Studio agent productivity at Y1 saves ~$60-110K/yr vs full-API operation — frees founder bandwidth.
  • Minus: Hardware reliability risk — single GPU failure halts inference. Mitigation: Phase 2 dual RTX 6000 setup gives redundancy. Mac Studio Y2 redundant LLM node planned per ADR-0017.
  • Minus: Fine-tune quality may fall short of API quality in Y1, especially before the AI engineer is fully ramped. Mitigation: hybrid mode with a Sonnet 4.6 fallback for high-stakes tutor calls until the fine-tune passes the quality bar; this fallback is BUDGETED ZERO in Y1 per Jean's direction (must work via local). Alternative mitigation: a ₸2M emergency cloud-API allocation inside the buffer if the quality gap is unacceptable.
  • Minus: In-house skill requirement: fine-tuning DeepSeek V4-Flash requires an AI engineer with LoRA / QLoRA / full fine-tune experience. Hire criterion specified: the Q1 2027 AI engineer must have an LLM fine-tuning track record.
  • Minus: Self-hosted backend requires DevOps/SRE skills; covered by hardware engineer + Irlen Y1, dedicated DevOps hire Y2-Y3.
  • Next steps: hire an AI engineer in Q1-Q2 2027 with explicit fine-tuning experience; benchmark and select the KZ STT/TTS stack in Y1H1; build the OIYNUP AI corpus pipeline (cartoon scripts + curriculum + speech samples); fine-tune V4-Flash on Phase 1 hardware (a hedged QLoRA sketch follows this list); write 2026-05-02-kz-stt-tts-stack.md after benchmarking.
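
A hedged sketch of that fine-tune step, using the standard QLoRA pattern (4-bit quantized base model plus LoRA adapters) that the hire criterion above refers to. The base-model identifier, target modules, dataset path and hyperparameters are placeholders, not tested values:

```python
# Sketch: QLoRA fine-tune of the OIYNUP AI base model on the Phase 1
# workstation. Model id, target modules, dataset path and all
# hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

BASE = "deepseek-ai/DeepSeek-V4-Flash"  # hypothetical hub id

tok = AutoTokenizer.from_pretrained(BASE)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # placeholder set
    task_type="CAUSAL_LM",
))

# JSONL produced by the corpus-assembly sketch earlier in this ADR.
ds = load_dataset("json", data_files="data/oiynup_ai_sft.jsonl")["train"]

def tokenize(batch):
    text = [f"{q}\n{a}" for q, a in zip(batch["instruction"], batch["response"])]
    return tok(text, truncation=True, max_length=2048)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="runs/oiynup-ai-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=ds.map(tokenize, batched=True, remove_columns=ds.column_names),
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```
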
Alternatives considered:

  • Anthropic Sonnet 4.6 cloud-only (recalibration baseline): rejected — ₸12M Y1 + scaling cost + dependency on US provider + no fine-tune control + COPPA / KZ-data-residency risk on cross-border API calls.
  • OpenAI GPT-5.4 / GPT-5 Plus: same rejection rationale — closed-weight, scaling cost, API dependency.
  • Soyle KZ STT enterprise: rejected per Jean’s test result (worse than current open-source options); also closed-weight + license cost.
  • Hybrid Anthropic + local (60/40 split): rejected — Jean direction is no Anthropic baseline, mandate hardware-sufficient.
  • Smaller local models (DeepSeek V3 / Qwen 3-72B / Llama 4-70B): considered for Y1 launch readiness — V4-Flash significantly better at 13B activated parameters with 1M context. V3 stays as fallback option in Y1H1 benchmarking.
  • Larger local model V4-Pro 1.6T directly: rejected — too large for Phase 1+2 hardware in production. V4-Pro available via API for occasional spike loads only.
  • Closed-weight frontier model + self-host: not available (no major US lab releases weights at frontier scale).

References:

  • SOT: SOURCE_OF_TRUTH.md §Layer 2 Tech / Infrastructure → AI tutor stack (existing PENDING entry promoted to LOCKED with this ADR)
  • ADRs: ADR-0001 (world design — AI tutor 3-axis architecture), ADR-0017 (equipment best-in-class — RTX PRO 6000 + Mac M3 Ultra), ADR-0018 (release dates), ADR-0019 (realistic financial baseline)
  • Specs: docs/superpowers/specs/2026-05-02-canon-corrections-jean-direction.md §C7 + §C10
  • Memory: project_oinap_ai_tutor_stack.md (update with V4-Flash + K2.6 candidates added to A/B/C cascade options); feedback_no_hardcoded_tech.md (this ADR is product-strategic, not skill-level tech prescription)
  • External: DeepSeek V4 API Docs (released 2026-04-24); Kimi K2.6 SiliconANGLE (released 2026-04-20); Kimi K2.6 MarkTechPost