File size: 5,651 Bytes
c02083b 2ac3c5b c02083b 65f9403 2ac3c5b 53f2b06 2ac3c5b c02083b 2ac3c5b 337508d 2ac3c5b 53f2b06 2ac3c5b fee907b 2ac3c5b fee907b 5e74789 fee907b 5e74789 fee907b 5e74789 fee907b 0a86f4b fee907b 0a86f4b 2ac3c5b 0a86f4b 2ac3c5b 6f8abcc 2ac3c5b 6f8abcc | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 | ---
title: REPOMIND
emoji: π§
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.14.0
python_version: '3.13'
app_file: app.py
pinned: false
license: mit
short_description: Repo-scale coding agent β 256K context on a single MI300X
tags:
- amd-hackathon-2026
- amd-developer-hackathon
- agents
- coding-agent
- long-context
- rocm
- mi300x
- qwen3-coder
- vllm
---
# REPOMIND
> Open-source repo-scale coding agent for self-hosted use. Designed to ingest an entire git repo (256K tokens, FP8) and reason across it on a single AMD MI300X β what NVIDIA H100 80GB cannot accommodate by VRAM accounting (~143GB total > 80GB).
**Built for the [AMD Developer Hackathon 2026](https://lablab.ai/ai-hackathons/amd-developer)** Β· MIT License Β· [GitHub source](https://github.com/SRKRZ23/repomind)
## Why MI300X?
- Qwen3-Coder-Next-FP8 weights β 80 GB
- 256K KV cache @ FP8 β 38 GB
- activations β 25 GB β **~143 GB total on a single GPU**
- NVIDIA H100 80GB cannot accommodate this configuration on a single card by VRAM accounting (~143 GB > 80 GB). AMD MI300X 192 GB has the headroom.
This is a memory-architecture story, not a CUDA-vs-ROCm one.
## Stack
- **Model**: `Qwen/Qwen3-Coder-Next-FP8` β 80B params, 3B active (MoE)
- **Inference**: vLLM ROCm 7 with `qwen3_coder` tool-call parser
- **Agent loop**: SC-TIR style (PLAN β CALL TOOL β OBSERVE β THINK β ANSWER)
- **Tools**: `read_file` Β· `grep_codebase` Β· `execute_code` (sandboxed) Β· `run_tests` Β· `git_log`
## Status β verified on real MI300X (2026-05-05 / 2026-05-06)
Full stress test on a single AMD MI300X x1 (AMD Developer Cloud, $1.99/hr, vLLM 0.17.1 + ROCm 7.2 Quick Start image). **2 sessions, 124 min total, ~$4.12.**
**Memory budget β Qwen3-Coder-Next-FP8 + 256K context, FP8 KV cache:**
- β
Model weights in VRAM: **77.29 GiB**
- β
Available KV cache: **94.58 GiB** (2,065,744 tokens)
- β
VRAM peak: **176 GiB / 191.7 GiB** (92% utilization)
- β
`--max-model-len 262144` started, `Application startup complete`
- β
`/v1/models` returns `max_model_len: 262144`
**Concurrency stress (24 cells, default Triton attention, all 144 outputs clean):**
- β
**31/31 success at 8K, 16K, 32K, AND 64K** β every realistic-developer context
- β
**25/31 at 128K**, **6-8 at 256K** within a 15-minute window (compute-bound, honest ceiling)
- β
Aggregate throughput at N=31: 78.5 tok/s @ 8K Β· 31.4 @ 16K Β· 12.1 @ 32K Β· 3.6 @ 64K
**Long-context coherence β needle-in-haystack at 200K:**
- β
**3/3 positions passed** (early, middle, late) β model recovers embedded sentinel function and constant
- β
This proves 256K window is *usable*, not just *allocated*
**End-to-end repo ingestion β 9/9 questions answered correctly:**
- β
REPOMIND self (68K tokens, 68 files) β 3/3
- β
pallets/flask (408K total β fitted 180K) β 3/3
- β
**pytorch/vision (1.3M tokens, 581 files, 6,799 chunks β fitted 180K) β 3/3** with correct file path citations
**Tuning attempt β measured regression worth reporting:**
- β οΈ Tried `--attention-backend ROCM_AITER_FA` (AMD's hand-tuned MI300X kernels)
- Throughput **2-4Γ higher** under AITER, TTFT 2.8Γ faster at 64K
- BUT output **degenerates to repeating-punctuation gibberish** in 137/144 cells under FP8 KV cache
- Default Triton stays the production-safe choice; filed for AMD upstream investigation
**Cost β at AMD Cloud $1.99/hr:**
- β
~$45.75 / 1M completion tokens (aggregate at 32K, N=31)
- β
14.5 active continuous queriers per MI300X, or 70β140 dev seats for typical bursty engineering teams
- β
Owned MI300X ($18K) breaks even vs Cursor in 3β6 months at team-of-100 usage
## Demo backend
HF Spaces ship CPU / consumer GPUs by default β not MI300X. So this Space serves a **CPU mock for UI demonstration only**. The verified performance numbers above come from a real MI300X stress test on AMD Developer Cloud (124 min, $4.12).
To wire a real MI300X endpoint, set Space secrets `VLLM_BASE_URL` + `MODEL_NAME=Qwen/Qwen3-Coder-Next-FP8` against a vLLM 0.17.1 server. For a live walkthrough on a hosted MI300X, contact razikovsardor1@gmail.com.
## Evidence
- **1-minute demo video**: <https://youtu.be/BvSBR1QazLU>
- **Lablab project page**: <https://lablab.ai/ai-hackathons/amd-developer/repomind/repomind>
- **AMD Developer Forum thread #505** (AITER FP8 regression filed): <https://devcommunity.amd.com/t/repomind-open-source-repo-scale-coding-agent-on-a-single-mi300x-256k-context-fp8-31-31x-concurrency-verified/505>
- **Full evidence pack** (7 JSON results + 5 PNG plots + e2e prompts/answers + 2Γ rocm-smi snapshots + run logs): [github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test)
- **Extended PHASE 1+2 narrative** (24-cell matrix + AITER A/B): [extended/SUMMARY.md](https://github.com/SRKRZ23/repomind/tree/main/benchmarks/2026-05-05-mi300x-stress-test/extended)
Built for the AMD Developer Hackathon 2026 β eligible for the **Hugging Face Special Prize**. If the verified MI300X numbers are useful, a Space like is appreciated. π€
## Author
**Sardor Razikov** β Independent ML Engineer Β· Tashkent πΊπΏ
- Kaggle SPR 2026 #7/371 (Top 1.9%) Β· S6E3 #23/4,142 Β· AIMO3 39/50 (XTX $2.2M)
- [Epistemic Curie Benchmark on Zenodo](https://doi.org/10.5281/zenodo.19791329)
- [GitHub](https://github.com/SRKRZ23/repomind) Β· [LinkedIn](https://www.linkedin.com/in/sardor-razikov-569a5327b) Β· [X / Twitter](https://x.com/SardorRazi99093)
- Email: razikovsardor1@gmail.com Β· razikovs777@gmail.com
|