repomind

Running

File size: 5,651 Bytes

c02083b
2ac3c5b
 
 
 
c02083b
 
 
 
 
 
65f9403
2ac3c5b
53f2b06
2ac3c5b
 
 
 
 
 
 
 
c02083b
 
2ac3c5b
 
337508d
2ac3c5b
 
 
 
 
 
 
 
53f2b06
2ac3c5b
 
 
 
 
 
 
 
 
 
fee907b
2ac3c5b
fee907b
5e74789
fee907b
5e74789
fee907b
 
 
5e74789
 
fee907b
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0a86f4b
fee907b
0a86f4b
2ac3c5b
0a86f4b
 
 
 
 
 
 
 
 
 
 
2ac3c5b
 
 
6f8abcc
2ac3c5b
6f8abcc

---
title: REPOMIND
emoji: 🧠
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Repo-scale coding agent — 256K context on a single MI300X
tags:
  - amd-hackathon-2026
  - amd-developer-hackathon
  - agents
  - coding-agent
  - long-context
  - rocm
  - mi300x
  - qwen3-coder
  - vllm
---

# REPOMIND

> Open-source repo-scale coding agent for self-hosted use. Designed to ingest an entire git repo (256K tokens, FP8) and reason across it on a single AMD MI300X — what NVIDIA H100 80GB cannot accommodate by VRAM accounting (~143GB total > 80GB).

**Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)** · MIT License · [GitHub source](https://github.com/SRKRZ23/repomind)

## Why MI300X?

- Qwen3-Coder-Next-FP8 weights ≈ 80 GB
- 256K KV cache @ FP8 ≈ 38 GB
- activations ≈ 25 GB → **~143 GB total on a single GPU**
- NVIDIA H100 80GB cannot accommodate this configuration on a single card by VRAM accounting (~143 GB > 80 GB). AMD MI300X 192 GB has the headroom.

This is a memory-architecture story, not a CUDA-vs-ROCm one.

## Stack

- **Model**: `Qwen/Qwen3-Coder-Next-FP8` — 80B params, 3B active (MoE)
- **Inference**: vLLM ROCm 7 with `qwen3_coder` tool-call parser
- **Agent loop**: SC-TIR style (PLAN → CALL TOOL → OBSERVE → THINK → ANSWER)
- **Tools**: `read_file` · `grep_codebase` · `execute_code` (sandboxed) · `run_tests` · `git_log`

## Status — verified on real MI300X (2026-05-05 / 2026-05-06)

Full stress test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**

**Memory budget — Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache:**
- ✅ Model weights in VRAM: **77.29 GiB**
- ✅ Available KV cache: **94.58 GiB** (2,065,744 tokens)
- ✅ VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
- ✅ `--max-model-len 262144` started, `Application startup complete`
- ✅ `/v1/models` returns `max_model_len: 262144`

**Concurrency stress (24 cells, default Triton attention, all 144 outputs clean):**
- ✅ **31/31 success at 8K, 16K, 32K, AND 64K** — every realistic-developer context
- ✅ **25/31 at 128K**, **6-8 at 256K** within a 15-minute window (compute-bound, honest ceiling)
- ✅ Aggregate throughput at N=31: 78.5 tok/s @ 8K · 31.4 @ 16K · 12.1 @ 32K · 3.6 @ 64K

**Long-context coherence — needle-in-haystack at 200K:**
- ✅ **3/3 positions passed** (early, middle, late) — model recovers embedded sentinel function and constant
- ✅ This proves 256K window is *usable*, not just *allocated*

**End-to-end repo ingestion — 9/9 questions answered correctly:**
- ✅ REPOMIND self (68K tokens, 68 files) — 3/3
- ✅ pallets/flask (408K total → fitted 180K) — 3/3
- ✅ **pytorch/vision (1.3M tokens, 581 files, 6,799 chunks → fitted 180K) — 3/3** with correct file path citations

**Tuning attempt — measured regression worth reporting:**
- ⚠️ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
- Throughput **2-4× higher** under AITER, TTFT 2.8× faster at 64K
- BUT output **degenerates to repeating-punctuation gibberish** in 137/144 cells under FP8 KV cache
- Default Triton stays the production-safe choice; filed for AMD upstream investigation

**Cost — at AMD Cloud $1.99/hr:**
- ✅ ~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
- ✅ 14.5 active continuous queriers per MI300X, or 70–140 dev seats for typical bursty engineering teams
- ✅ Owned MI300X ($18K) breaks even vs Cursor in 3–6 months at team-of-100 usage

## Demo backend

HF Spaces ship CPU / consumer GPUs by default — not MI300X. So this Space serves a **CPU mock for UI demonstration only**. The verified performance numbers above come from a real MI300X stress test on AMD Developer Cloud (124 min, $4.12).

To wire a real MI300X endpoint, set Space secrets `VLLM_BASE_URL` + `MODEL_NAME=Qwen/Qwen3-Coder-Next-FP8` against a vLLM 0.17.1 server. For a live walkthrough on a hosted MI300X, contact razikovsardor1@gmail.com.

## Evidence

- **1-minute demo video**: <https://youtu.be/BvSBR1QazLU>
- **Lablab project page**: <https://lablab.ai/ai-hackathons/amd-developer/repomind/repomind>
- **AMD Developer Forum thread #505** (AITER FP8 regression filed): <https://devcommunity.amd.com/t/repomind-open-source-repo-scale-coding-agent-on-a-single-mi300x-256k-context-fp8-31-31x-concurrency-verified/505>
- **Full evidence pack** (7 JSON results + 5 PNG plots + e2e prompts/answers + 2× rocm-smi snapshots + run logs): [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
- **Extended PHASE 1+2 narrative** (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended)

Built for the AMD Developer Hackathon 2026 — eligible for the **Hugging Face Special Prize**. If the verified MI300X numbers are useful, a Space like is appreciated. 🤗

## Author

**Sardor Razikov** — Independent ML Engineer · Tashkent 🇺🇿
- Kaggle SPR 2026 #7/371 (Top 1.9%) · S6E3 #23/4,142 · AIMO3 39/50 (XTX $2.2M)
- [Epistemic Curie Benchmark on Zenodo](https://doi.org/10.5281/zenodo.19791329)
- [GitHub](https://github.com/SRKRZ23/repomind) · [LinkedIn](https://www.linkedin.com/in/sardor-razikov-569a5327b) · [X / Twitter](https://x.com/SardorRazi99093)
- Email: razikovsardor1@gmail.com · razikovs777@gmail.com