Initial release — Darwin-60B-DUO (Hybrid-A: Route 70% / Split-Refine 20% / Ensemble V_1 10%)

Files changed (11)

LICENSE +54 -0
README.md +319 -0
benchmarks/README.md +25 -0
config.json +68 -0
docker/docker-compose.yml +115 -0
gateway/ensemble.py +141 -0
gateway/refine.py +90 -0
gateway/requirements.txt +4 -0
gateway/router.py +186 -0
gateway/server.py +286 -0
tokenizer_info.json +17 -0

LICENSE ADDED

	@@ -0,0 +1,54 @@

+Darwin-60B-DUO — Combined License Notice
+========================================
+This repository aggregates two constituent base models, each governed by its
+own license. The combined repository inherits the more restrictive of the
+two — the Gemma license — as the effective deployment license.
+────────────────────────────────────────────────────────────────────────────
+1.  Constituent base model licenses
+────────────────────────────────────────────────────────────────────────────
+  - Darwin-28B-REASON  (FINAL-Bench/Darwin-28B-REASON)
+      License: Apache License 2.0
+      Source : https://www.apache.org/licenses/LICENSE-2.0
+  - AWAXIS-Think-31B   (Anserwise/AWAXIS-Think-31B)
+      License: Gemma Terms of Use  (inherited from Google Gemma-4 base)
+      Source : https://ai.google.dev/gemma/terms
+────────────────────────────────────────────────────────────────────────────
+2.  Effective combined license for Darwin-60B-DUO
+────────────────────────────────────────────────────────────────────────────
+  Because the Gemma Terms of Use impose more specific restrictions than
+  Apache-2.0 (notably the Gemma Prohibited Use Policy), the combined
+  Darwin-60B-DUO release is distributed under the **Gemma Terms of Use**.
+  Users intending commercial deployment must:
+    - Comply with the Gemma Terms of Use in full
+      https://ai.google.dev/gemma/terms
+    - Comply with the Gemma Prohibited Use Policy
+      https://ai.google.dev/gemma/prohibited_use_policy
+    - Retain all attribution and notices for both constituent models
+────────────────────────────────────────────────────────────────────────────
+3.  Gateway code (this repository's `gateway/`, `docker/`, etc.)
+────────────────────────────────────────────────────────────────────────────
+  The orchestration code authored for Darwin-60B-DUO (FastAPI gateway,
+  router, refine, ensemble, Docker compose) is released under
+  Apache License 2.0 to maximize developer flexibility. The combined
+  license inheritance applies only to the served model behaviour, not the
+  code that orchestrates it.
+────────────────────────────────────────────────────────────────────────────
+4.  Disclaimer
+────────────────────────────────────────────────────────────────────────────
+  This document is a license summary for end-user convenience. In case of
+  any conflict, the original license texts of the constituent models
+  (Apache-2.0 and Gemma Terms of Use) govern. Users should consult those
+  authoritative sources for binding obligations.
+  Copyright (c) 2026 FINAL-Bench, VIDRAFT, Anserwise.

README.md ADDED

	@@ -0,0 +1,319 @@

+---
+license: gemma
+language:
+- ko
+- en
+- multilingual
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- darwin
+- darwin-family
+- darwin-duo
+- duo
+- ensemble
+- mixture-of-models
+- router
+- korean
+- reasoning
+- finalbench
+- vidraft
+base_model:
+- FINAL-Bench/Darwin-28B-REASON
+- Anserwise/AWAXIS-Think-31B
+---
+<div align="center">
+# 🌳 Darwin-60B-DUO
+### Darwin family 최초의 듀오 모델 — 두 SOTA가 하나로
+### *The first DUO of the Darwin family — two SOTAs unified into one model*
+</div>
+---
+## ✨ 한 줄 요약 · TL;DR
+> **HF 공인 GPQA Diamond 3위** Darwin-28B-REASON과
+> **한국 과기부 K-AI 리더보드 1위** AWAXIS-Think-31B를
+> **단일 OpenAI-호환 endpoint** 로 묶은 **Darwin family 최초 듀오 모델**.
+>
+> *Combines the **#3 HF-verified GPQA Diamond** Darwin-28B-REASON with the
+> **#1 Korean K-AI Leaderboard** AWAXIS-Think-31B behind a **single OpenAI-compatible endpoint** — the Darwin family's first DUO release.*
+---
+## 🏆 두 SOTA 모델 구성 · Two SOTA Constituents
+| Constituent | Verified Rank | Strengths | Parameters |
+|-------------|---------------|-----------|------------|
+| **Darwin-28B-REASON** | 🥉 **#3 on the Hugging Face verified GPQA Diamond benchmark** | English graduate-level reasoning · STEM · math · code | 26.9 B |
+| **AWAXIS-Think-31B** | 🥇 **#1 on the national K-AI Leaderboard run by Korea's Ministry of Science and ICT** | Korean understanding/generation · Korean culture · natural tone | 31.27 B |
+| **Darwin-60B-DUO** (this) | *Aggregate Brand* | Combines both SOTA domains + automatic hybrid routing | 58.17 B (≈ 60 B) |
+> 💡 **AWAXIS-Think-31B 역시 Darwin family 입니다.**
+> Darwin 팀이 Google Gemma-4 base 위에 한국어 specialist 분기로 distill 한 모델로,
+> 기존 Darwin (Qwen3.5 계열) lineage 와 함께 Darwin family 양대 축을 형성합니다.
+>
+> *AWAXIS-Think-31B is also part of the **Darwin family** — a Korean specialist branch distilled by the Darwin team on top of Google's Gemma-4 base, complementing the original Qwen3.5-line Darwin lineage as the family's second axis.*
+---
+## 🎯 무엇이 특별한가 · What Makes It Unique
+### 1️⃣ 영역별 SOTA를 한 모델에 (Two SOTA Domains in One Model)
+영어 reasoning과 한국어 자연성을 동시에 SOTA 수준으로 달성하는 단일 LLM은 극히 드뭅니다.
+Darwin-60B-DUO는 각 영역 공인 SOTA 두 모델을 **하나의 API endpoint** 로 묶어,
+사용자가 orchestration 을 인식하지 못한 채 두 강점을 동시에 누립니다.
+*Few single LLMs achieve SOTA in both English reasoning and Korean naturalness simultaneously. Darwin-60B-DUO unifies two domain-verified SOTAs behind one endpoint — users benefit from both without orchestration overhead.*
+### 2️⃣ 자동 Hybrid 라우팅 (Auto Hybrid Routing — "Hybrid-A")
+The gateway analyzes each input and **automatically selects the best strategy for the scenario**.
+| 시나리오 (Scenario) | 라우팅 전략 (Strategy) | 호출 모델 | 비용 (Cost) | 비중 (Share) |
+|---------------------|----------------------|----------|------------|------------|
+| 순수 한국어 (Pure Korean) — 이메일, 한국 정보, 채팅 | **Route → AWAXIS** | 1 model | 1× | ~50 % |
+| 순수 영어 (Pure English) — 코드, 수학, 영어 reasoning | **Route → Darwin** | 1 model | 1× | ~20 % |
+| 한국어 답 + 영어/STEM reasoning 필요 (Korean output needing English/STEM reasoning) | **Split → Darwin reasons → AWAXIS polishes** | 2 models, sequential | 2× | ~15 % |
+| 영어 답 + 한국 정보 필요 (English output needing Korean context) | **Split → AWAXIS retrieves → Darwin polishes** | 2 models, sequential | 2× | ~5 % |
+| 객관식·짧은 답 (MCQ / short answer) | **Ensemble V₁ tournament** | 2 models + cross-verify | 2× | ~10 % |
+**Average cost ≈ 1.3× that of a single 30B model**: 70 % of cases cost 1×, and only the remaining 30 % cost 2×.
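The 1.3× figure is the share-weighted mean of the per-scenario costs in the table. The sketch below recomputes it; the dictionary keys are illustrative labels, and it follows the table in counting every two-model strategy as 2×, even though the ensemble draws N=8 samples per model.

```python
# Shares and per-request cost multipliers copied from the scenario table.
# Assumption: every two-model strategy is counted as 2x, as the table does.
SCENARIOS = {
    "route_korean": (0.50, 1),
    "route_english": (0.20, 1),
    "split_korean_with_reasoning": (0.15, 2),
    "split_english_with_korean_context": (0.05, 2),
    "ensemble_v1_mcq": (0.10, 2),
}


def average_cost(scenarios):
    """Expected cost multiplier: sum of share * cost over all scenarios."""
    total_share = sum(share for share, _ in scenarios.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"shares sum to {total_share}, expected 1.0")
    return sum(share * cost for share, cost in scenarios.values())


print(round(average_cost(SCENARIOS), 2))  # 1.3
```

Changing a share or a cost multiplier in the dictionary shows how sensitive the average is to the traffic mix.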
+### 3️⃣ 단일 모델 façade (Single-Model Façade)
+**A single OpenAI API-compatible endpoint.** Existing tools (LangChain · LlamaIndex · OpenAI SDK · Continue · Cursor, etc.) work unchanged apart from pointing `base_url` at the gateway.
+```python
+from openai import OpenAI
+client = OpenAI(base_url="http://your-server:8000/v1", api_key="anything")
+resp = client.chat.completions.create(
+    model="darwin-60b-duo",   # 한 모델로 호출 / single model name
+    messages=[{"role": "user",
+               "content": "GPT-5와 Claude의 reasoning 차이를 한국어로 정리해줘"}],
+)
+print(resp.choices[0].message.content)
+# 내부: Darwin이 영어 reasoning → AWAXIS가 한국어로 다듬어 반환
+# Internally: Darwin reasons in English → AWAXIS polishes in Korean
+```
+### 4️⃣ 효율적 GPU 운영 (Efficient GPU Footprint)
+- **FP8 quantization**: roughly **30 GB of weights per model** (about 58 GB combined, since 58.17 B parameters take 1 byte each), so both models fit on a **single GPU with at least 80 GB of memory** (H100 or B200), with the remainder left for KV cache
+- BF16: two B200 GPUs, one model each (~60 GB of weights each)
+- vLLM-based high-throughput inference (tensor parallelism and prefix caching supported)
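A rough weight-only estimate can be derived from the parameter counts listed in this repository's `config.json`. The sketch assumes 1 byte per parameter for FP8 and 2 bytes for BF16, and it ignores KV cache, activations, and any layers a quantizer leaves in higher precision, so real usage will be higher.

```python
# Parameter counts as stated in this repository's config.json.
PARAMS = {
    "Darwin-28B-REASON": 26_895_998_464,
    "AWAXIS-Think-31B": 31_273_086_512,
}
# Assumed storage cost per parameter for each dtype.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}


def weight_gb(n_params, dtype):
    """Weight-only footprint in GB (10**9 bytes); excludes KV cache and activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp8", "bf16"):
    per_model = {name: round(weight_gb(n, dtype), 1) for name, n in PARAMS.items()}
    total = round(sum(weight_gb(n, dtype) for n in PARAMS.values()), 1)
    print(dtype, per_model, "total:", total)
```

Comparing the FP8 total with the GPU's memory gives the headroom available for KV cache when both servers share one device.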
+---
+## 🌳 Darwin Family 가족 트리 · Family Tree
+```
+🌳 Darwin Family
+│
+├─ 👴 GRANDPARENTS (Foundation lineage)
+│  ├─ Cohere Command A+               ── English reasoning lineage (218 B)
+│  └─ Google Gemma-4-31B-it           ── Korean/multilingual base
+│
+├─ 👨 PARENTS (Family bases)
+│  ├─ Darwin-9B                       ── omni-modal, ko-en compact
+│  ├─ Darwin-28B-Opus                 ── English reasoning base
+│  ├─ Darwin-31B-Opus                 ── Korean multimodal base
+│  └─ Darwin-218B-Delphi              ── cascade flagship (GPQA Diamond 90.91 %)
+│
+├─ 🧒 SPECIALISTS (Children — domain SOTAs)
+│  ├─ Darwin-28B-REASON 🥉            ── HF GPQA Diamond #3 (English reasoning specialist)
+│  └─ AWAXIS-Think-31B  🥇            ── K-AI Leaderboard #1 (Korean specialist, Gemma-4 branch)
+│
+└─ ⭐ Darwin-60B-DUO ⭐ (you are here)
+   └─ Two specialists unified — 두 specialist 를 단일 모델로
+```
+---
+## 🚀 사용법 · Usage
+### Option A — Docker Compose (권장 / Recommended)
+```bash
+git clone https://huggingface.co/FINAL-Bench/Darwin-60B-DUO
+cd Darwin-60B-DUO
+docker compose -f docker/docker-compose.yml up -d
+# 검증 / Verify
+curl http://localhost:8000/v1/models
+# → {"data":[{"id":"darwin-60b-duo","object":"model"}]}
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"darwin-60b-duo",
+       "messages":[{"role":"user","content":"안녕하세요. 자기 소개 부탁드립니다."}]}'
+```
+### Option B — Manual launch (B200 / H100 × 2)
+```bash
+# 1) Darwin-28B-REASON  (port 8021, GPU 0)
+CUDA_VISIBLE_DEVICES=0 VLLM_DP_MASTER_PORT=45011 \
+  vllm serve FINAL-Bench/Darwin-28B-REASON \
+    --port 8021 --served-model-name darwin-28r \
+    --quantization fp8 --enforce-eager \
+    --limit-mm-per-prompt '{"image":0,"video":0}' &
+# 2) AWAXIS-Think-31B  (port 8022, GPU 1)
+CUDA_VISIBLE_DEVICES=1 VLLM_DP_MASTER_PORT=45012 \
+  vllm serve Anserwise/AWAXIS-Think-31B \
+    --port 8022 --served-model-name awaxis-31b \
+    --quantization fp8 --enforce-eager \
+    --limit-mm-per-prompt '{"image":0,"video":0}' &
+# 3) Gateway  (port 8000) — from this repo
+pip install -r gateway/requirements.txt
+python gateway/server.py --port 8000 \
+    --darwin-url http://127.0.0.1:8021/v1 \
+    --awaxis-url http://127.0.0.1:8022/v1
+```
+> 💡 **Single GPU**: with FP8 quantization each model needs ~30 GB for weights (~58 GB combined), so both can be collocated on one 80 GB GPU. Set `CUDA_VISIBLE_DEVICES=0` for both servers and pass `--gpu-memory-utilization 0.45` to each.
+---
+## ⚙️ 운영 모드 상세 · Operation Modes
+### 🟢 Mode 1 · Route (single-backend routing, 70 % of cases)
+Language + domain detection → a single backend. **Fastest and cheapest.**
+Detection signals:
+- `korean_ratio(prompt) > 0.3` → AWAXIS
+- Code keywords (`def`, `function`, `import`, `class`) → Darwin
+- Math markers (`\boxed`, `equation`, `prove`) → Darwin
+- Else → weighted by dominant language / domain
+### 🟡 Mode 2 · Split / Refine (division of labor, 20 % of cases)
+One model drafts and the other polishes. **Combines the strengths of both models.**
+```
+Example: "엔트로피를 한국어로 쉽게 풀어줘"  ("Explain entropy simply, in Korean")
+Step 1  Darwin (rigorous English reasoning) →
+        "Entropy quantifies the number of microstates compatible with a
+         given macrostate, representing disorder ..."
+Step 2  AWAXIS (natural Korean polish) →
+        "엔트로피는 쉽게 말하면 '무질서함의 정도' 입니다.
+         같은 모습으로 보이지만 사실 그 안에 ..."
+```
+### 🔴 Mode 3 · Ensemble V₁ Tournament (10 % of cases: MCQ / short answer)
+Each model runs **N=8 self-consistency** sampling and takes its own majority vote.
+- If the answers agree → return as-is (strong signal)
+- If they disagree → both models **cross-verify each other's answers** → tournament winner
+```
+Question: "Which of A/B/C/D is correct?"
+  Darwin (8 sample MAJ) → "C"
+  AWAXIS (8 sample MAJ) → "B"
+  → mismatch → ask Darwin "C vs B: which is correct?" and ask AWAXIS the same
+  → verdicts agree → final answer
+```
+---
+## 📦 Repository Layout
+```
+Darwin-60B-DUO/
+├── README.md                  ← this model card
+├── config.json                ← DUO config (base_models reference)
+├── tokenizer_info.json        ← base tokenizer reference info
+├── gateway/
+│   ├── server.py              ← FastAPI orchestrator
+│   ├── router.py              ← Korean/English, domain, complexity classification
+│   ├── refine.py              ← Sequential refine logic
+│   ├── ensemble.py            ← V₁ cross-verification + MAJ@N
+│   └── requirements.txt
+├── docker/
+│   └── docker-compose.yml     ← combined launcher for vLLM ×2 + gateway
+├── benchmarks/
+│   └── README.md              ← evaluation assets (TBA, coming soon)
+└── LICENSE                    ← Gemma + Apache-2.0 dual notice
+```
+---
+## 📊 평가 · Evaluation
+### Verified Constituent Scores
+- **Darwin-28B-REASON**: **#3 on the Hugging Face verified GPQA Diamond benchmark**
+- **AWAXIS-Think-31B**: **#1 on the national K-AI Leaderboard** run by Korea's Ministry of Science and ICT
+### Darwin-60B-DUO Aggregate Bench
+- **GPQA Diamond (full 198Q)**: TBA (formal evaluation scheduled)
+- **KMMLU**: TBA
+- **CLIcK (Korean cultural)**: TBA
+- **Helmet · Ruler (long context)**: TBA
+> 정식 198Q GPQA 및 K-AI 리더보드 DUO 점수는 평가 완료 후 `benchmarks/` 디렉토리에 게재됩니다.
+>
+> *Full 198-question GPQA and K-AI leaderboard DUO scores will be published in `benchmarks/` after formal evaluation.*
+---
+## 📜 라이센스 · License
+**Combined license — Gemma** (the more restrictive of the constituent base models).
+| Constituent | License |
+|-----------|---------|
+| Darwin-28B-REASON | Apache-2.0 |
+| AWAXIS-Think-31B | Gemma (inherited from Gemma-4) |
+| **Darwin-60B-DUO** | **Gemma** (combined-license inheritance rule) |
+*Please review the [Gemma Terms of Use](https://ai.google.dev/gemma/terms) and the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy) before commercial deployment.*
+---
+## ⚠️ Limitations · 한계
+- **No combined weights in this repo**: the gateway calls the vLLM endpoints of the two base models (Darwin-28B-REASON · AWAXIS-Think-31B), and each base model's weights are fetched separately from its own repo.
+- **2× GPU baseline**: BF16 requires two GPUs. With FP8 (~30 GB of weights per model, ~58 GB combined), one GPU with at least 80 GB of memory is enough.
+- **Added latency**: Split and Ensemble modes take roughly 2× as long as a single model.
+- **Training data cut-off**: Darwin-28B-REASON: ~2026-Q1, AWAXIS-Think-31B: ~2026-Q1.
+- **Hallucination**: the usual limitations of LLMs apply unchanged.
+---
+## 🙏 Acknowledgments
+- **민식 (FINAL-Bench team lead)** — Darwin family architecture & DUO concept
+- **Anserwise Korean specialist team** — AWAXIS-Think-31B development
+- **VIDRAFT** — orchestration framework & Hybrid-A routing strategy
+- **Google DeepMind** — Gemma-4 foundation
+- **Cohere & Qwen team** — Command A+ / Qwen3.5 foundation lineage
+---
+## 📞 Contact
+- HF org: [FINAL-Bench](https://huggingface.co/FINAL-Bench)
+- Sister orgs: [Anserwise](https://huggingface.co/Anserwise) · [VIDraft](https://huggingface.co/VIDraft)
+- Issues / discussions: the **Community** tab of this repo
+---
+<div align="center">
+> ⭐ **Darwin-60B-DUO is the Darwin family's first DUO model. One model — two SOTAs.** ⭐
+>
+> ⭐ **Darwin-60B-DUO는 Darwin 패밀리 최초의 듀오 모델입니다. 하나의 모델, 두 개의 SOTA.** ⭐
+</div>

benchmarks/README.md ADDED

	@@ -0,0 +1,25 @@

+# Darwin-60B-DUO Benchmarks
+> 📌 정식 benchmark 결과는 평가 완료 후 본 디렉토리에 게재됩니다.
+> *Formal benchmark results will be posted here after evaluation.*
+## 평가 예정 항목 · Scheduled Evaluations
+| Benchmark | Scope | Constituent score (verified) | DUO aggregate |
+|-----------|-------|-----------------------------|---------------|
+| **GPQA Diamond (full 198Q)** | English graduate reasoning | Darwin-28B-REASON: HF #3 | TBA |
+| **K-AI Leaderboard** | Korean | AWAXIS-Think-31B: MSIT #1 | TBA |
+| **KMMLU** | Korean MMLU | TBA | TBA |
+| **CLIcK** | Korean cultural | TBA | TBA |
+| **Helmet · Ruler** | Long context retrieval | TBA | TBA |
+| **NIAH 32K · 128K** | Needle-in-haystack | NIAH 32K: 5/5 each (sanity) | TBA |
+## Hybrid-A Routing Distribution Validation
+*Production traffic sampling regularly validates that the router distribution (50/20/15/5/10 %) holds in real workloads.*
+## Evaluation Method
+- Per-backend isolation: each base model scored on its own
+- DUO aggregate: score of the final output after passing through the gateway
+- Latency overhead: delay added by the gateway (route mode ≈ 0, split mode ≈ 1x, ensemble mode ≈ 1.5–2x)
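One way the routing-distribution check could be done is to compare the strategy shares seen in a traffic sample against the 50/20/15/5/10 % target. The sketch below is an assumption about how such monitoring might look: the strategy labels mirror the `distribution` keys in `config.json`, while the 5-point tolerance and the sample counts are hypothetical.

```python
from collections import Counter

# Target shares from the Hybrid-A distribution.
EXPECTED = {
    "route_korean": 0.50,
    "route_english": 0.20,
    "split_korean_with_reasoning": 0.15,
    "split_english_with_korean_context": 0.05,
    "ensemble_v1_mcq": 0.10,
}


def distribution_drift(observed_strategies, expected=EXPECTED):
    """Return {strategy: observed_share - expected_share} for a traffic sample."""
    counts = Counter(observed_strategies)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty traffic sample")
    return {name: counts.get(name, 0) / total - share
            for name, share in expected.items()}


def drifted(drift, tolerance=0.05):
    """Strategies whose share is off by more than `tolerance` (hypothetical threshold)."""
    return sorted(name for name, delta in drift.items() if abs(delta) > tolerance)


# Hypothetical sample of 100 logged routing decisions.
sample = (["route_korean"] * 62 + ["route_english"] * 17
          + ["split_korean_with_reasoning"] * 11
          + ["split_english_with_korean_context"] * 4
          + ["ensemble_v1_mcq"] * 6)
print(drifted(distribution_drift(sample)))
```

A fixed tolerance is the simplest choice; for small samples a statistical test would be less prone to false alarms.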

config.json ADDED

	@@ -0,0 +1,68 @@

+{
+  "_model_type_friendly": "duo",
+  "_aggregate_brand": "Darwin-60B-DUO",
+  "architectures": [
+    "DarwinDuoOrchestrator"
+  ],
+  "description": "Darwin family DUO — two SOTA constituents (English reasoning + Korean) served via a single OpenAI-compatible gateway. This repo contains the orchestrator gateway code; backend weights are fetched from the constituent repos at runtime.",
+  "constituents": [
+    {
+      "role": "english_reasoning_specialist",
+      "model_id": "FINAL-Bench/Darwin-28B-REASON",
+      "served_name": "darwin-28r",
+      "architecture": "qwen3_5",
+      "params_total": 26895998464,
+      "params_billion": 26.9,
+      "verified_rank": "Hugging Face GPQA Diamond #3",
+      "default_port": 8021,
+      "default_dp_master_port": 45011,
+      "quantization_recommended": "fp8",
+      "vllm_extra_args": [
+        "--enforce-eager",
+        "--limit-mm-per-prompt", "{\"image\":0,\"video\":0}"
+      ]
+    },
+    {
+      "role": "korean_specialist",
+      "model_id": "Anserwise/AWAXIS-Think-31B",
+      "served_name": "awaxis-31b",
+      "architecture": "gemma4",
+      "params_total": 31273086512,
+      "params_billion": 31.27,
+      "verified_rank": "National K-AI Leaderboard (MSIT, Korea) #1",
+      "darwin_family_branch": "korean_specialist (Gemma-4 base)",
+      "default_port": 8022,
+      "default_dp_master_port": 45012,
+      "quantization_recommended": "fp8",
+      "vllm_extra_args": [
+        "--enforce-eager",
+        "--limit-mm-per-prompt", "{\"image\":0,\"video\":0}"
+      ]
+    }
+  ],
+  "aggregate_params_total": 58169084976,
+  "aggregate_params_billion": 58.17,
+  "active_params_router_mode_billion": 30,
+  "active_params_ensemble_mode_billion": 60,
+  "orchestration": {
+    "strategy_name": "Hybrid-A",
+    "version": "1.0",
+    "distribution": {
+      "route_korean": 0.50,
+      "route_english": 0.20,
+      "split_korean_with_reasoning": 0.15,
+      "split_english_with_korean_context": 0.05,
+      "ensemble_v1_mcq": 0.10
+    },
+    "average_cost_multiplier": 1.3,
+    "modes": ["route", "split_refine", "ensemble_v1"]
+  },
+  "gateway": {
+    "port": 8000,
+    "served_model_name": "darwin-60b-duo",
+    "openai_compatible": true,
+    "endpoints": ["/v1/models", "/v1/chat/completions", "/v1/completions"]
+  },
+  "transformers_compatible": false,
+  "_note": "This is NOT a direct transformers AutoModel.from_pretrained() target. Use the gateway (gateway/server.py) or Docker Compose (docker/docker-compose.yml). See README for full usage."
+}
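The aggregate fields in this config are derived values, so they can be recomputed from the per-constituent entries. The sketch below works on an excerpt of the fields involved; `summarize` is a hypothetical helper, not part of the gateway.

```python
import json

# Excerpt of the fields being checked; values copied from config.json.
CONFIG_EXCERPT = json.loads("""
{
  "constituents": [
    {"model_id": "FINAL-Bench/Darwin-28B-REASON", "params_total": 26895998464},
    {"model_id": "Anserwise/AWAXIS-Think-31B", "params_total": 31273086512}
  ],
  "orchestration": {
    "distribution": {
      "route_korean": 0.50,
      "route_english": 0.20,
      "split_korean_with_reasoning": 0.15,
      "split_english_with_korean_context": 0.05,
      "ensemble_v1_mcq": 0.10
    }
  }
}
""")


def summarize(cfg):
    """Derive the totals that the aggregate fields are expected to match."""
    params = sum(c["params_total"] for c in cfg["constituents"])
    shares = sum(cfg["orchestration"]["distribution"].values())
    return {
        "params_total": params,
        "params_billion": round(params / 1e9, 2),
        "distribution_sum": round(shares, 6),
    }


print(summarize(CONFIG_EXCERPT))
```

Running a check like this in CI would catch a constituent count or routing share that is edited without updating the aggregates.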

docker/docker-compose.yml ADDED

	@@ -0,0 +1,115 @@

+version: "3.9"
+# Darwin-60B-DUO — full-stack launcher
+# Spins up:
+#   - vllm-darwin   (Darwin-28B-REASON,   GPU 0, port 8021 internal)
+#   - vllm-awaxis   (AWAXIS-Think-31B,    GPU 1, port 8022 internal)
+#   - gateway       (FastAPI orchestrator, port 8000 exposed)
+#
+# Single-GPU collocation:
+#   Set CUDA_VISIBLE_DEVICES=0 for both vllm-* and lower
+#   --gpu-memory-utilization to 0.45 each (FP8 weights: ~30GB per model, ~58GB total on an 80GB GPU).
+services:
+  vllm-darwin:
+    image: vllm/vllm-openai:latest
+    container_name: darwin-60b-duo-vllm-darwin
+    runtime: nvidia
+    environment:
+      - CUDA_VISIBLE_DEVICES=0
+      - VLLM_DP_MASTER_PORT=45011
+      - HF_HOME=/root/.cache/huggingface
+      - HF_TOKEN=${HF_TOKEN:-}
+    command: >
+      --model FINAL-Bench/Darwin-28B-REASON
+      --served-model-name darwin-28r
+      --host 0.0.0.0
+      --port 8021
+      --tensor-parallel-size 1
+      --max-model-len 16384
+      --dtype bfloat16
+      --quantization fp8
+      --trust-remote-code
+      --enforce-eager
+      --limit-mm-per-prompt '{"image":0,"video":0}'
+      --gpu-memory-utilization 0.85
+    volumes:
+      - hf_cache:/root/.cache/huggingface
+    ports:
+      - "8021:8021"
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    healthcheck:
+      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:8021/v1/models"]
+      interval: 20s
+      timeout: 5s
+      retries: 60
+  vllm-awaxis:
+    image: vllm/vllm-openai:latest
+    container_name: darwin-60b-duo-vllm-awaxis
+    runtime: nvidia
+    environment:
+      - CUDA_VISIBLE_DEVICES=1
+      - VLLM_DP_MASTER_PORT=45012
+      - HF_HOME=/root/.cache/huggingface
+      - HF_TOKEN=${HF_TOKEN:-}
+    command: >
+      --model Anserwise/AWAXIS-Think-31B
+      --served-model-name awaxis-31b
+      --host 0.0.0.0
+      --port 8022
+      --tensor-parallel-size 1
+      --max-model-len 16384
+      --dtype bfloat16
+      --quantization fp8
+      --trust-remote-code
+      --enforce-eager
+      --limit-mm-per-prompt '{"image":0,"video":0}'
+      --gpu-memory-utilization 0.85
+    volumes:
+      - hf_cache:/root/.cache/huggingface
+    ports:
+      - "8022:8022"
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+    healthcheck:
+      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:8022/v1/models"]
+      interval: 20s
+      timeout: 5s
+      retries: 60
+  gateway:
+    image: python:3.11-slim
+    container_name: darwin-60b-duo-gateway
+    working_dir: /app
+    command: >
+      bash -c "pip install -q -r requirements.txt &&
+               python server.py --host 0.0.0.0 --port 8000
+                 --darwin-url http://vllm-darwin:8021/v1
+                 --awaxis-url http://vllm-awaxis:8022/v1"
+    volumes:
+      - ../gateway:/app
+    ports:
+      - "8000:8000"
+    depends_on:
+      vllm-darwin:
+        condition: service_healthy
+      vllm-awaxis:
+        condition: service_healthy
+    restart: unless-stopped
+volumes:
+  hf_cache:
+    driver: local

gateway/ensemble.py ADDED

	@@ -0,0 +1,141 @@

+# -*- coding: utf-8 -*-
+"""
+Darwin-60B-DUO Ensemble V_1 — MAJ@N self-consistency + cross-verification.
+For MCQ / short-answer queries:
+  1) Each backend produces N samples at temperature τ (default 0.7)
+  2) Each backend's answer = its own majority vote (RSA / self-consistency)
+  3) If both majorities agree → return that answer
+  4) If they disagree → each backend verifies the pair (cross-verification)
+     and the gateway picks the tournament winner
+  5) Tiebreaker on split verdicts: majority-vote-count confidence
+"""
+import asyncio
+import re
+from collections import Counter
+from typing import Any, Dict, List, Optional, Tuple
+_LETTERS = "ABCD"
+def _extract_letter(text: str) -> str:
+    """Extract A/B/C/D letter answer from a free-form response."""
+    if not text:
+        return ""
+    # Strip CoT / thinking tags
+    cleaned = re.sub(r"<\|START_THINKING\|>.*?<\|END_THINKING\|>", "", text, flags=re.S)
+    cleaned = re.sub(r"<think>.*?</think>", "", cleaned, flags=re.S)
+    for tag in ["<|END_THINKING|>", "</think>", "<|START_RESPONSE|>", "<|END_RESPONSE|>"]:
+        if tag in cleaned:
+            cleaned = cleaned.split(tag)[-1]
+    # Common answer patterns
+    patterns = [
+        r"ANSWER:\s*\(?([A-D])\)?",
+        r"\\boxed\{\s*\(?([A-D])\)?\s*\}",
+        r"final answer\s*(?:is|:)?\s*\(?([A-D])\)?",
+        r"answer\s+is\s*\(?([A-D])\)?",
+        r"\(([A-D])\)\s*$",
+    ]
+    for p in patterns:
+        m = re.search(p, cleaned, re.I | re.M)
+        if m:
+            return m.group(1).upper()
+    # Fallback: last A-D token
+    candidates = re.findall(r"\b([A-D])\b", cleaned)
+    return candidates[-1].upper() if candidates else ""
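+def _majority(letters: List[str]) -> Tuple[Optional[str], Dict[str, int]]:
+    # Require a single character: `"" in "ABCD"` is True in Python, so a bare
+    # substring test would count failed extractions as votes.
+    valid = [x for x in letters if len(x) == 1 and x in _LETTERS]
+    if not valid:
+        return None, {}
+    counter = Counter(valid)
+    top, _ = counter.most_common(1)[0]
+    return top, dict(counter)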
+_VERIFY_TEMPLATE = (
+    "You are a graduate-level expert verifier. Given the following multiple-"
+    "choice question and two candidate letter answers, decide which is more "
+    "likely correct.\n\n"
+    "QUESTION:\n{question}\n\n"
+    "CANDIDATE 1 says answer = {a1}\n"
+    "CANDIDATE 2 says answer = {a2}\n\n"
+    "Think briefly, then respond with exactly one line:\n"
+    "VERDICT: 1   (if candidate 1's letter is correct)\n"
+    "VERDICT: 2   (if candidate 2's letter is correct)"
+)
+def _parse_verdict(text: str) -> Optional[int]:
+    m = re.search(r"VERDICT:\s*([12])", text)
+    return int(m.group(1)) if m else None
+def _last_user_text(messages: List[Dict[str, str]]) -> str:
+    for m in reversed(messages):
+        if m.get("role") == "user":
+            return m.get("content", "")
+    return ""
+async def ensemble_v1(
+    darwin,
+    awaxis,
+    messages: List[Dict[str, str]],
+    temperature: float = 0.7,
+    max_tokens: int = 4096,
+    n_rsa: int = 8,
+) -> str:
+    """
+    Run V_1 ensemble. Returns the final answer string formatted as
+    "ANSWER: X" so downstream tooling can parse uniformly.
+    """
+    # --- Phase 1: parallel RSA (each backend N samples) ---
+    d_task = darwin.chat(messages, temperature=temperature, max_tokens=max_tokens, n=n_rsa)
+    a_task = awaxis.chat(messages, temperature=temperature, max_tokens=max_tokens, n=n_rsa)
+    d_outs, a_outs = await asyncio.gather(d_task, a_task)
+    d_letters = [_extract_letter(o) for o in d_outs]
+    a_letters = [_extract_letter(o) for o in a_outs]
+    d_maj, d_votes = _majority(d_letters)
+    a_maj, a_votes = _majority(a_letters)
+    # --- Phase 2: agreement check ---
+    if d_maj is None and a_maj is None:
+        return "ANSWER: (no valid answer extracted)"
+    if d_maj is None:
+        return f"ANSWER: {a_maj}"
+    if a_maj is None:
+        return f"ANSWER: {d_maj}"
+    if d_maj == a_maj:
+        return f"ANSWER: {d_maj}"
+    # --- Phase 3: cross-verification on mismatch ---
+    question = _last_user_text(messages)
+    verify_prompt = _VERIFY_TEMPLATE.format(question=question, a1=d_maj, a2=a_maj)
+    verify_msgs = [{"role": "user", "content": verify_prompt}]
+    d_verify_task = darwin.chat(verify_msgs, temperature=0.0, max_tokens=2048, n=1)
+    a_verify_task = awaxis.chat(verify_msgs, temperature=0.0, max_tokens=2048, n=1)
+    d_verify_outs, a_verify_outs = await asyncio.gather(d_verify_task, a_verify_task)
+    d_verdict = _parse_verdict(d_verify_outs[0])
+    a_verdict = _parse_verdict(a_verify_outs[0])
+    # --- Phase 4: combine verdicts ---
+    if d_verdict == a_verdict and d_verdict is not None:
+        return f"ANSWER: {d_maj if d_verdict == 1 else a_maj}"
+    if d_verdict is None and a_verdict is None:
+        # Fall back to confidence (higher own-vote count wins)
+        d_conf = d_votes.get(d_maj, 0)
+        a_conf = a_votes.get(a_maj, 0)
+        return f"ANSWER: {d_maj if d_conf >= a_conf else a_maj}"
+    if d_verdict is None:
+        return f"ANSWER: {d_maj if a_verdict == 1 else a_maj}"
+    if a_verdict is None:
+        return f"ANSWER: {d_maj if d_verdict == 1 else a_maj}"
+    # Split — confidence tiebreaker
+    d_conf = d_votes.get(d_maj, 0)
+    a_conf = a_votes.get(a_maj, 0)
+    return f"ANSWER: {d_maj if d_conf >= a_conf else a_maj}"
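The voting and tiebreak steps can be exercised without any backend. The following is a minimal sketch of MAJ@N plus the vote-count tiebreaker only; it skips the cross-verification phase, and `majority` and `tiebreak` are illustrative names rather than the gateway's functions.

```python
from collections import Counter

# Set membership: "" and multi-letter strings are rejected as votes.
VALID = frozenset("ABCD")


def majority(letters):
    """Return (winning letter or None, vote counts) over valid single letters."""
    votes = Counter(x for x in letters if x in VALID)
    if not votes:
        return None, {}
    return votes.most_common(1)[0][0], dict(votes)


def tiebreak(d_letters, a_letters):
    """Pick between two backends' majorities by own-vote count (first wins ties)."""
    d_maj, d_votes = majority(d_letters)
    a_maj, a_votes = majority(a_letters)
    if d_maj is None or a_maj is None:
        return d_maj or a_maj
    if d_maj == a_maj:
        return d_maj
    return d_maj if d_votes[d_maj] >= a_votes[a_maj] else a_maj


darwin_samples = ["C", "C", "B", "C", "", "C", "C", "A"]   # 5 votes for C
awaxis_samples = ["B", "B", "C", "B", "B", "B", "B", "D"]  # 6 votes for B
print(tiebreak(darwin_samples, awaxis_samples))  # B
```

A set is used for membership because `"" in "ABCD"` evaluates to `True` for strings, which would let empty extractions count as votes.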

gateway/refine.py ADDED

	@@ -0,0 +1,90 @@

+# -*- coding: utf-8 -*-
+"""
+Darwin-60B-DUO Sequential Refine — two-model collaboration.
+drafter_backend produces the initial draft, then refiner_backend polishes it.
+The polish prompt is built dynamically based on the language combination so
+that:
+  - Darwin (English reasoning) → AWAXIS (Korean polish) for Korean output
+    requiring rigorous English/STEM reasoning
+  - AWAXIS (Korean cultural context) → Darwin (English polish) for English
+    output requiring Korean cultural / linguistic context
+"""
+import re
+from typing import Any, Dict, List
+def _last_user_text(messages: List[Dict[str, str]]) -> str:
+    for m in reversed(messages):
+        if m.get("role") == "user":
+            return m.get("content", "")
+    return ""
+def _korean_ratio(text: str) -> float:
+    if not text:
+        return 0.0
+    return len(re.findall(r"[가-힣]", text)) / len(text)
+async def sequential_refine(
+    drafter,
+    refiner,
+    messages: List[Dict[str, str]],
+    temperature: float = 0.5,
+    max_tokens: int = 4096,
+) -> str:
+    """
+    Step 1: drafter produces the initial answer using the user's messages.
+    Step 2: refiner is given the original messages + the drafter's response +
+            a polish instruction, then produces the final output.
+    The polish instruction is language-adaptive:
+      - If user asked in Korean (kr_ratio > 0.3) → polish to natural Korean
+      - If user asked in English → polish to clearer English
+      - Otherwise → general clarity polish
+    """
+    user_text = _last_user_text(messages)
+    kr = _korean_ratio(user_text)
+    # ---- Step 1: drafter ----
+    draft_outputs = await drafter.chat(
+        messages,
+        temperature=temperature,
+        max_tokens=max_tokens,
+    )
+    draft = draft_outputs[0]
+    # ---- Step 2: refiner polish ----
+    if kr > 0.3:
+        polish_instruction = (
+            "위 초안을 사용자의 원래 질문 의도에 맞게 한국어로 자연스럽고 "
+            "정확하게 다듬어 최종 답변을 작성하세요. 사실관계는 보존하되, "
+            "어색한 표현·번역체·중복은 제거하고, 한국어 독자에게 매끄러운 "
+            "흐름이 되도록 재작성하세요. 새로운 정보 추가 금지 — 표현만 정련하세요."
+        )
+    elif kr < 0.05 and len(user_text) > 0:
+        polish_instruction = (
+            "Polish the draft above into a clearer, more concise, and "
+            "natural-sounding English response that fully addresses the "
+            "user's original question. Preserve all factual content; remove "
+            "redundancy, awkward phrasing, and translation artifacts. Do "
+            "not add new information — refine wording only."
+        )
+    else:
+        polish_instruction = (
+            "Refine the draft above for clarity, naturalness, and "
+            "consistency. Preserve all facts; remove redundancy. Do not "
+            "introduce new information."
+        )
+    refine_messages = list(messages) + [
+        {"role": "assistant", "content": draft},
+        {"role": "user", "content": polish_instruction},
+    ]
+    refined_outputs = await refiner.chat(
+        refine_messages,
+        temperature=max(0.0, temperature - 0.2),  # cooler for polish
+        max_tokens=max_tokens,
+    )
+    return refined_outputs[0]
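The draft-then-polish flow can be tried offline with stub backends. In this sketch `EchoBackend` and `draft_then_polish` are hypothetical stand-ins; the only thing assumed about a real backend is that `chat()` is awaitable and returns a list of strings, as the code above uses it.

```python
import asyncio


class EchoBackend:
    """Hypothetical stand-in for a backend client; chat() returns a list of strings."""

    def __init__(self, name):
        self.name = name
        self.calls = []

    async def chat(self, messages, temperature=0.5, max_tokens=4096):
        # Record the call so the message layout can be inspected afterwards.
        self.calls.append({"messages": messages, "temperature": temperature})
        return [f"[{self.name}] reply to: {messages[-1]['content']}"]


async def draft_then_polish(drafter, refiner, messages, polish_instruction,
                            temperature=0.5):
    """Two-step flow: draft, then re-ask with the draft appended as an assistant turn."""
    draft = (await drafter.chat(messages, temperature=temperature))[0]
    refine_messages = list(messages) + [
        {"role": "assistant", "content": draft},
        {"role": "user", "content": polish_instruction},
    ]
    cooler = max(0.0, temperature - 0.2)  # lower temperature for the polish pass
    return (await refiner.chat(refine_messages, temperature=cooler))[0]


darwin, awaxis = EchoBackend("darwin"), EchoBackend("awaxis")
question = [{"role": "user", "content": "Explain entropy in Korean."}]
final = asyncio.run(
    draft_then_polish(darwin, awaxis, question, "Polish the draft above."))
print(final)
print([m["role"] for m in awaxis.calls[0]["messages"]])
```

The refiner sees the original question, the draft as an assistant turn, and the polish instruction as a new user turn, so it keeps the full conversational context.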

gateway/requirements.txt ADDED

	@@ -0,0 +1,4 @@

+fastapi>=0.110
+uvicorn[standard]>=0.27
+httpx>=0.27
+pydantic>=2.6

gateway/router.py ADDED

	@@ -0,0 +1,186 @@

+# -*- coding: utf-8 -*-
+"""
+Darwin-60B-DUO Router — language + domain + complexity classification.
+Returns a RouteDecision indicating which Hybrid-A strategy to invoke:
+  - "route_darwin"           : English-only single backend
+  - "route_awaxis"           : Korean-only single backend
+  - "split_refine"           : Darwin reasons → AWAXIS polishes (Korean output, English reasoning)
+  - "split_refine_reverse"   : AWAXIS retrieves → Darwin polishes (English output, Korean context)
+  - "ensemble_v1"            : MCQ / short answer requiring cross-verification
+"""
+import re
+from dataclasses import dataclass
+from typing import Optional
+# ---------------------------------------------------------------------------
+# Heuristic keyword lists
+# ---------------------------------------------------------------------------
+ENGLISH_REASONING_KEYWORDS = {
+    # Math
+    "prove", "theorem", "derivative", "integral", "equation", "matrix",
+    "vector", "topology", "manifold",
+    # Code
+    "def ", "function ", "import ", "class ", "return ", "lambda ",
+    "javascript", "python", "rust", "golang", "typescript", "regex",
+    # Sci-tech
+    "gradient", "tensor", "embedding", "transformer", "attention",
+    "rlhf", "rlvr", "quantization", "kernel",
+    # Markers
+    r"\\boxed", r"\\frac", r"\\sum", r"\\int", "<eqn>", "$$",
+}
+KOREAN_CULTURAL_KEYWORDS = {
+    "추석", "설날", "한국", "조선", "고려", "신라", "백제",
+    "k-pop", "케이팝", "한복", "김치", "한국어",
+    "공무원", "정부", "과기부", "교육부", "외교부",
+    "국회", "정책", "법안", "조례",
+}
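+MCQ_PATTERNS = [
+    # Options often sit on separate lines, so match across newlines.
+    r"\(A\)[\s\S]*\(B\)[\s\S]*\(C\)[\s\S]*\(D\)",
+    r"^\s*A\..*\n\s*B\..*\n\s*C\.",
+    # Require a standalone uppercase option letter; under re.IGNORECASE a bare
+    # [A-D] would match ordinary lowercase words following "answer".
+    r"answer\b.*\b(?-i:[A-D])\b",
+    r"정답.*[ABCD가나다라]",
+    r"\bANSWER:",
+]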
+@dataclass
+class RouteDecision:
+    strategy: str
+    reason: str
+    korean_ratio: float = 0.0
+    english_ratio: float = 0.0
+    has_reasoning_marker: bool = False
+    has_korean_cultural_marker: bool = False
+    is_mcq: bool = False
+# ---------------------------------------------------------------------------
+# Detection primitives
+# ---------------------------------------------------------------------------
+def korean_ratio(text: str) -> float:
+    """Fraction of all characters (whitespace included) that are Hangul syllables."""
+    if not text:
+        return 0.0
+    total = len(text)
+    hangul = len(re.findall(r"[가-힣]", text))
+    return hangul / total if total > 0 else 0.0
+def english_ratio(text: str) -> float:
+    """Fraction of ASCII alphabetic characters."""
+    if not text:
+        return 0.0
+    total = len(text)
+    alpha = len(re.findall(r"[a-zA-Z]", text))
+    return alpha / total if total > 0 else 0.0
+def has_reasoning_marker(text: str) -> bool:
+    """English STEM / coding keywords or math markers."""
+    lower = text.lower()
+    for kw in ENGLISH_REASONING_KEYWORDS:
+        # LaTeX markers (leading backslash) are case-sensitive literals, so
+        # match them against the original text, not the lowercased copy.
+        if kw.startswith("\\"):
+            if kw in text:
+                return True
+        elif kw in lower:
+            return True
+    return False
+def has_korean_cultural_marker(text: str) -> bool:
+    lower = text.lower()
+    return any(kw in lower for kw in KOREAN_CULTURAL_KEYWORDS)
+def is_mcq(text: str) -> bool:
+    for pat in MCQ_PATTERNS:
+        if re.search(pat, text, re.IGNORECASE | re.MULTILINE):
+            return True
+    return False
+# ---------------------------------------------------------------------------
+# Strategy selector — Hybrid-A
+# ---------------------------------------------------------------------------
+def select_strategy(text: str) -> RouteDecision:
+    """
+    Hybrid-A strategy decision:
+      1) MCQ-style short answer → ensemble_v1
+      2) Korean output + English/STEM reasoning needed → split_refine
+      3) English output + Korean cultural context needed → split_refine_reverse
+      4) Korean-dominant → route_awaxis
+      5) English-dominant → route_darwin
+      6) Mixed default → route_awaxis (Korean-first preference)
+    """
+    kr = korean_ratio(text)
+    en = english_ratio(text)
+    reasoning = has_reasoning_marker(text)
+    cultural = has_korean_cultural_marker(text)
+    mcq = is_mcq(text)
+    decision = RouteDecision(
+        strategy="route_awaxis",  # default
+        reason="default",
+        korean_ratio=round(kr, 3),
+        english_ratio=round(en, 3),
+        has_reasoning_marker=reasoning,
+        has_korean_cultural_marker=cultural,
+        is_mcq=mcq,
+    )
+    # 1. MCQ shorter than 4000 characters: ensemble (10% case)
+    if mcq and len(text) < 4000:
+        decision.strategy = "ensemble_v1"
+        decision.reason = "mcq_short_answer"
+        return decision
+    # 2. Korean output + reasoning required (15% case)
+    if kr > 0.3 and reasoning:
+        decision.strategy = "split_refine"
+        decision.reason = "korean_output_with_english_reasoning"
+        return decision
+    # 3. English output + Korean cultural context (5% case)
+    if en > 0.5 and kr < 0.05 and cultural:
+        decision.strategy = "split_refine_reverse"
+        decision.reason = "english_output_with_korean_context"
+        return decision
+    # 4. Korean-dominant (50% case)
+    if kr >= 0.3:
+        decision.strategy = "route_awaxis"
+        decision.reason = "korean_dominant"
+        return decision
+    # 5. English-dominant (20% case)
+    if en >= 0.5 and kr < 0.05:
+        decision.strategy = "route_darwin"
+        decision.reason = "english_dominant"
+        return decision
+    # 6. Mixed / ambiguous → AWAXIS (Korean-first default)
+    decision.strategy = "route_awaxis"
+    decision.reason = "mixed_fallback_korean"
+    return decision
+# ---------------------------------------------------------------------------
+# Smoke test
+# ---------------------------------------------------------------------------
+if __name__ == "__main__":
+    samples = [
+        ("Pure Korean chat", "안녕하세요. 오늘 날씨가 어떤가요?"),
+        ("Pure English code", "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)"),
+        ("Korean + English reasoning", "Transformer attention의 작동 원리를 한국어로 설명해줘"),
+        ("English + Korean culture", "Explain the Korean Chuseok holiday in simple English."),
+        ("MCQ", "Which is correct?\n(A) foo\n(B) bar\n(C) baz\n(D) qux"),
+        ("Korean MCQ", "정답은 무엇인가요? A. 1 B. 2 C. 3 D. 4"),
+    ]
+    for name, txt in samples:
+        d = select_strategy(txt)
+        print(f"[{name}] -> {d.strategy} ({d.reason}) kr={d.korean_ratio} en={d.english_ratio}")
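
The language-ratio branches of `select_strategy` can be tried without either backend. What follows is a minimal standalone sketch, not the repository code: it reuses the thresholds from steps 2, 4 and 5 (0.3, 0.5, 0.05) and skips MCQ and keyword detection entirely. The names `char_ratio` and `classify_by_ratio` are hypothetical, and the `reasoning` flag stands in for `has_reasoning_marker`.

```python
import re

HANGUL = re.compile(r"[가-힣]")
ASCII_ALPHA = re.compile(r"[a-zA-Z]")


def char_ratio(pattern: re.Pattern, text: str) -> float:
    """Fraction of characters in `text` matched by `pattern` (0.0 if empty)."""
    return len(pattern.findall(text)) / len(text) if text else 0.0


def classify_by_ratio(text: str, reasoning: bool = False) -> str:
    """Hypothetical helper covering only the ratio-based branches."""
    kr = char_ratio(HANGUL, text)
    en = char_ratio(ASCII_ALPHA, text)
    if kr > 0.3 and reasoning:
        return "split_refine"      # Korean output, English/STEM reasoning
    if kr >= 0.3:
        return "route_awaxis"      # Korean-dominant
    if en >= 0.5 and kr < 0.05:
        return "route_darwin"      # English-dominant
    return "route_awaxis"          # mixed / ambiguous fallback


if __name__ == "__main__":
    prompts = [
        ("안녕하세요. 오늘 날씨가 어떤가요?", False),
        ("def fib(n):\n    return n", False),
        ("Transformer attention의 작동 원리를 한국어로 설명해줘", True),
    ]
    for prompt, flag in prompts:
        print(repr(prompt[:20]), "->", classify_by_ratio(prompt, reasoning=flag))
```

Both ratios divide by the full string length, so whitespace and punctuation lower them; a short Korean prompt padded with a long code block can therefore fall below the 0.3 threshold.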

gateway/server.py ADDED Viewed

	@@ -0,0 +1,286 @@

+# -*- coding: utf-8 -*-
+"""
+Darwin-60B-DUO Gateway — FastAPI OpenAI-compatible orchestrator.
+Exposes a single OpenAI-compatible endpoint ("darwin-60b-duo") that
+internally routes to two backends:
+  - Darwin-28B-REASON (English reasoning specialist, HF GPQA Diamond #3)
+  - AWAXIS-Think-31B  (Korean specialist, K-AI Leaderboard #1)
+Hybrid-A strategy (config.json):
+  - 70% Route (single backend)
+  - 20% Split / Refine (sequential two-model collaboration)
+  - 10% Ensemble V_1 (cross-verification tournament for MCQ / short answers)
+Run:
+    pip install -r requirements.txt
+    python server.py --port 8000 \\
+        --darwin-url http://127.0.0.1:8021/v1 \\
+        --awaxis-url http://127.0.0.1:8022/v1
+License: Gemma (combined-license inheritance — see README).
+"""
+import argparse
+import asyncio
+import json
+import time
+import uuid
+from typing import Any, Dict, List, Optional
+import httpx
+from fastapi import FastAPI, HTTPException
+from fastapi.responses import JSONResponse, StreamingResponse
+from pydantic import BaseModel, Field
+from router import select_strategy, RouteDecision
+from refine import sequential_refine
+from ensemble import ensemble_v1
+# ---------------------------------------------------------------------------
+# Pydantic models — OpenAI Chat Completions API subset
+# ---------------------------------------------------------------------------
+class ChatMessage(BaseModel):
+    role: str
+    content: str
+class ChatCompletionRequest(BaseModel):
+    model: str = "darwin-60b-duo"
+    messages: List[ChatMessage]
+    temperature: float = 0.7
+    top_p: float = 0.95
+    max_tokens: int = 4096
+    n: int = 1
+    stream: bool = False
+    # Optional: force a specific strategy ("route_darwin", "route_awaxis",
+    # "split_refine", "split_refine_reverse", "ensemble_v1"). The default,
+    # "auto", defers to the Hybrid-A router.
+    duo_strategy: Optional[str] = "auto"
+# ---------------------------------------------------------------------------
+# Backend HTTP client
+# ---------------------------------------------------------------------------
+class Backend:
+    def __init__(self, name: str, base_url: str, served_name: str):
+        self.name = name
+        self.base_url = base_url.rstrip("/")
+        self.served_name = served_name
+        self.client = httpx.AsyncClient(timeout=httpx.Timeout(900.0))
+    async def chat(
+        self,
+        messages: List[Dict[str, str]],
+        temperature: float = 0.7,
+        max_tokens: int = 4096,
+        n: int = 1,
+        top_p: float = 0.95,
+    ) -> List[str]:
+        payload = {
+            "model": self.served_name,
+            "messages": messages,
+            "temperature": temperature,
+            "top_p": top_p,
+            "max_tokens": max_tokens,
+            "n": n,
+        }
+        r = await self.client.post(
+            f"{self.base_url}/chat/completions", json=payload
+        )
+        r.raise_for_status()
+        data = r.json()
+        return [c["message"]["content"] for c in data["choices"]]
+    async def health(self) -> bool:
+        try:
+            r = await self.client.get(f"{self.base_url}/models", timeout=5)
+            return r.status_code == 200
+        except Exception:
+            return False
+# ---------------------------------------------------------------------------
+# FastAPI app
+# ---------------------------------------------------------------------------
+app = FastAPI(
+    title="Darwin-60B-DUO Gateway",
+    version="1.0.0",
+    description=(
+        "Single OpenAI-compatible endpoint for the Darwin-60B-DUO "
+        "(Darwin-28B-REASON + AWAXIS-Think-31B). Hybrid-A routing."
+    ),
+)
+# Initialized via CLI args at startup
+DARWIN: Optional[Backend] = None
+AWAXIS: Optional[Backend] = None
+@app.get("/v1/models")
+async def list_models():
+    """Expose only the aggregate model to external callers."""
+    return {
+        "object": "list",
+        "data": [
+            {
+                "id": "darwin-60b-duo",
+                "object": "model",
+                "owned_by": "FINAL-Bench",
+                "created": int(time.time()),
+            }
+        ],
+    }
+@app.get("/health")
+async def health():
+    d_ok = await DARWIN.health() if DARWIN else False
+    a_ok = await AWAXIS.health() if AWAXIS else False
+    status = "ok" if (d_ok and a_ok) else "degraded"
+    return {
+        "status": status,
+        "backends": {
+            "darwin-28r": d_ok,
+            "awaxis-31b": a_ok,
+        },
+        "gateway_version": "1.0.0",
+    }
+def _build_response(content: str, route_meta: Dict[str, Any]) -> Dict[str, Any]:
+    """Build an OpenAI-compatible Chat Completion response with route metadata."""
+    return {
+        "id": f"chatcmpl-{uuid.uuid4().hex[:24]}",
+        "object": "chat.completion",
+        "created": int(time.time()),
+        "model": "darwin-60b-duo",
+        "choices": [
+            {
+                "index": 0,
+                "message": {
+                    "role": "assistant",
+                    "content": content,
+                },
+                "finish_reason": "stop",
+            }
+        ],
+        "usage": {
+            "prompt_tokens": -1,  # Aggregate gateway does not track tokens
+            "completion_tokens": -1,
+            "total_tokens": -1,
+        },
+        # Non-standard metadata for transparency / debugging
+        "_duo_route": route_meta,
+    }
+@app.post("/v1/chat/completions")
+async def chat_completions(req: ChatCompletionRequest):
+    if not req.messages:
+        raise HTTPException(400, "messages must not be empty")
+    user_text = req.messages[-1].content
+    # pydantic v2 (requirements pin >=2.6): model_dump() replaces the deprecated dict()
+    messages_dict = [m.model_dump() for m in req.messages]
+    # ----- Strategy selection -----
+    if req.duo_strategy and req.duo_strategy != "auto":
+        decision = RouteDecision(strategy=req.duo_strategy, reason="user_forced")
+    else:
+        decision = select_strategy(user_text)
+    t0 = time.time()
+    # ----- Execute -----
+    try:
+        if decision.strategy == "route_darwin":
+            outputs = await DARWIN.chat(
+                messages_dict,
+                temperature=req.temperature,
+                max_tokens=req.max_tokens,
+                top_p=req.top_p,
+            )
+            content = outputs[0]
+        elif decision.strategy == "route_awaxis":
+            outputs = await AWAXIS.chat(
+                messages_dict,
+                temperature=req.temperature,
+                max_tokens=req.max_tokens,
+                top_p=req.top_p,
+            )
+            content = outputs[0]
+        elif decision.strategy == "split_refine":
+            # Darwin reasons in English → AWAXIS polishes in Korean
+            content = await sequential_refine(
+                DARWIN, AWAXIS, messages_dict,
+                temperature=req.temperature, max_tokens=req.max_tokens
+            )
+        elif decision.strategy == "split_refine_reverse":
+            # AWAXIS retrieves Korean context → Darwin polishes in English
+            content = await sequential_refine(
+                AWAXIS, DARWIN, messages_dict,
+                temperature=req.temperature, max_tokens=req.max_tokens
+            )
+        elif decision.strategy == "ensemble_v1":
+            # MCQ / short answer: MAJ@N per model + cross-verify if mismatched
+            content = await ensemble_v1(
+                DARWIN, AWAXIS, messages_dict,
+                temperature=req.temperature, max_tokens=req.max_tokens,
+                n_rsa=8,
+            )
+        else:
+            # Fallback: AWAXIS (default for ambiguous / mixed)
+            outputs = await AWAXIS.chat(
+                messages_dict, temperature=req.temperature,
+                max_tokens=req.max_tokens, top_p=req.top_p,
+            )
+            content = outputs[0]
+            decision.strategy = "fallback_awaxis"
+    except httpx.HTTPError as e:
+        raise HTTPException(503, f"backend error: {type(e).__name__}: {e}")
+    elapsed = time.time() - t0
+    route_meta = {
+        "strategy": decision.strategy,
+        "reason": decision.reason,
+        "elapsed_s": round(elapsed, 2),
+        "language_ratio": decision.korean_ratio,
+    }
+    return JSONResponse(_build_response(content, route_meta))
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+def main():
+    p = argparse.ArgumentParser()
+    p.add_argument("--host", default="0.0.0.0")
+    p.add_argument("--port", type=int, default=8000)
+    p.add_argument(
+        "--darwin-url", default="http://127.0.0.1:8021/v1",
+        help="Darwin-28B-REASON vLLM endpoint",
+    )
+    p.add_argument(
+        "--awaxis-url", default="http://127.0.0.1:8022/v1",
+        help="AWAXIS-Think-31B vLLM endpoint",
+    )
+    p.add_argument("--darwin-served-name", default="darwin-28r")
+    p.add_argument("--awaxis-served-name", default="awaxis-31b")
+    args = p.parse_args()
+    global DARWIN, AWAXIS
+    DARWIN = Backend("darwin-28r", args.darwin_url, args.darwin_served_name)
+    AWAXIS = Backend("awaxis-31b", args.awaxis_url, args.awaxis_served_name)
+    import uvicorn
+    uvicorn.run(app, host=args.host, port=args.port, log_level="info")
+if __name__ == "__main__":
+    main()
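
A client can call the gateway like any OpenAI-compatible endpoint and read the chosen strategy from the non-standard `_duo_route` field. The sketch below uses only the standard library and is an illustration under assumptions: the URL assumes the default `--port 8000` on localhost, and `build_payload`, `extract_route` and `ask` are hypothetical helper names, not part of the repository.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

# Assumption: gateway started with the default --port 8000 on this machine.
GATEWAY_URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_payload(prompt: str, duo_strategy: str = "auto",
                  max_tokens: int = 512) -> Dict[str, Any]:
    """Request body matching the ChatCompletionRequest fields."""
    return {
        "model": "darwin-60b-duo",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "duo_strategy": duo_strategy,  # "auto" lets the router decide
    }


def extract_route(response: Dict[str, Any]) -> Optional[str]:
    """Strategy recorded in `_duo_route`, or None if the field is absent."""
    return response.get("_duo_route", {}).get("strategy")


def ask(prompt: str, duo_strategy: str = "auto",
        url: str = GATEWAY_URL) -> Dict[str, Any]:
    """POST one prompt to the gateway and return the parsed JSON response."""
    body = json.dumps(build_payload(prompt, duo_strategy)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=900) as resp:
        return json.load(resp)


# Usage (requires a running gateway and both backends):
#   reply = ask("Explain attention in Korean.", duo_strategy="split_refine")
#   print(extract_route(reply), reply["choices"][0]["message"]["content"])
```

Because the gateway reports `-1` for every `usage` field, a client that needs token accounting has to obtain it elsewhere.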

tokenizer_info.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "_note": "Darwin-60B-DUO uses constituent tokenizers via gateway, not a unified one.",
+  "constituent_tokenizers": {
+    "darwin-28r": {
+      "source_model": "FINAL-Bench/Darwin-28B-REASON",
+      "tokenizer_family": "qwen3_5",
+      "vocab_size_estimate": 151936
+    },
+    "awaxis-31b": {
+      "source_model": "Anserwise/AWAXIS-Think-31B",
+      "tokenizer_family": "gemma4",
+      "vocab_size_estimate": 262144
+    }
+  },
+  "routing_decision_layer": "language detection + domain classification (gateway/router.py) performs tokenization-free routing on the raw text before backend selection",
+  "downstream_token_handling": "Each backend (vLLM serving the respective base model) handles its own tokenization. The gateway operates on text strings, not token IDs."
+}