tazwarrrr committed on
Commit a02870b · 1 Parent(s): 0b5416e

feat: clean HF Space with essential ROCmPort AI files and new short README

Files changed (2)
  1. README.md +9 -278
  2. start.sh +0 -0
README.md CHANGED
@@ -1,286 +1,17 @@
1
- # ROCmPort AI
2
 
3
- ![ROCm](https://img.shields.io/badge/ROCm-7.0-red) ![Hardware](https://img.shields.io/badge/Hardware-MI300X-blue) ![License](https://img.shields.io/badge/License-Apache%202.0-green) ![HuggingFace](https://img.shields.io/badge/Dataset-HuggingFace-yellow)
4
 
5
- > **Live Demos**
6
- >
7
- > 🚀 **Backend API**: https://rocmport-ai-q2b1.onrender.com
8
- >
9
- > 🤗 **HuggingFace Space**: https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/ROCmPort-AI
10
 
11
- A multi-agent pipeline that migrates CUDA kernels to AMD ROCm/HIP catching the bugs that `hipify` misses, compiling with `hipcc`, profiling with `rocprof` on real MI300X hardware, and iterating until the output is correct and fast.
12
 
13
- ---
14
-
15
- ## The Gap hipify Doesn't Close
16
-
17
- `hipify-clang` translates CUDA API calls mechanically. It cannot detect that `if (tid < 32)` in a warp reduction silently skips lanes 32–63 on AMD wavefront-64. The code compiles. The output is wrong. No errors. No warnings.
18
-
19
- **ROCmPort AI catches this before execution.**
20
-
21
- ```cpp
22
- // NVIDIA assumption — silently wrong on AMD (wavefront = 64)
23
- if (tid < 32) {
24
- vsmem[tid] += vsmem[tid + 32]; // lanes 32-63 never participate
25
- ...
26
- }
27
-
28
- // AMD-aware correction
29
- if (tid < 64) {
30
- vsmem[tid] += vsmem[tid + 32];
31
- if (tid < 32) {
32
- vsmem[tid] += vsmem[tid + 16];
33
- vsmem[tid] += vsmem[tid + 8];
34
- vsmem[tid] += vsmem[tid + 4];
35
- vsmem[tid] += vsmem[tid + 2];
36
- vsmem[tid] += vsmem[tid + 1];
37
- }
38
- }
39
- ```
40
-
41
- ---
42
-
43
- ## How It's Different From hipify
44
-
45
- | | hipify-clang | ROCmPort AI |
46
- |---|---|---|
47
- | API renaming | ✅ | ✅ |
48
- | Wavefront-64 bug detection | ❌ | ✅ |
49
- | Compile verification | ❌ | ✅ |
50
- | Profiler feedback loop | ❌ | ✅ |
51
- | Correctness guarantee | ❌ | Partial |
52
- | Fine-tuned model | ❌ | ✅ |
53
-
54
- ---
55
-
56
- ## What ROCmPort AI Does
57
-
58
- 1. **Analyze** — scan CUDA kernel for AMD-specific risks (wavefront size, ballot/shuffle idioms, shared memory layout)
59
- 2. **Translate** — run hipify + LLM-assisted fixes for bugs hipify can't detect
60
- 3. **Compile** — build with `hipcc` targeting gfx942, surface real errors
61
- 4. **Profile** — run `rocprof` and measure actual throughput on MI300X
62
- 5. **Optimize** — propose changes based on profiler feedback, re-test
63
- 6. **Report** — stream full decision trace with per-agent rationale
64
-
65
- If the optimized output underperforms the baseline, the coordinator retries the optimizer (max 3 iterations) before returning the best result found.
66
-
67
- ---
68
-
69
- ## Reproducible Demo Results
70
-
71
- | Kernel | Input | Baseline HIP | Optimized HIP | Result |
72
- |--------|-------|-------------|---------------|--------|
73
- | matrix_multiply | 512×512 | 0.076ms | 0.026ms | **2.91x speedup** |
74
- | vector_add | 32M elements | — | 0.098ms | **3,918 GB/s bandwidth (74% of MI300X peak)** |
75
- | reduction | 16M elements | — | 0.042ms | **correctness PASS (wavefront-64 fix)** |
76
-
77
- > Source: `docs/benchmark_runs/` — real rocprof CSV output, MI300X gfx942, ROCm 7.0, May 8 2026
78
-
79
- ---
80
-
81
- ## Proof of Hardware
82
-
83
- Raw rocprof CSV output committed to this repo:
84
- - [`docs/benchmark_runs/matmul_out.stats.csv`](docs/benchmark_runs/matmul_out.stats.csv)
85
- - [`docs/benchmark_runs/vecadd_out.stats.csv`](docs/benchmark_runs/vecadd_out.stats.csv)
86
- - [`docs/benchmark_runs/reduction.stats.csv`](docs/benchmark_runs/reduction.stats.csv)
87
-
88
- Hardware: AMD Instinct MI300X VF (gfx942), 192GB HBM3, ROCm 7.0, AMD Developer Cloud
89
-
90
- ---
91
-
92
- ## The Dataset No One Else Built
93
-
94
- **170 expert-curated CUDA→ROCm correctness bugs** across 6 categories. Every example includes the original CUDA, the still-broken `hipify` output, and the correct AMD version — with a precise explanation of why the bug manifests on gfx942.
95
-
96
- | Category | Count | Description |
97
- |----------|-------|-------------|
98
- | `warp_size_hardcoded_32` | 50 | `tid & 31`, `tid >> 5`, loop bounds |
99
- | `threadidx_modulo_warpsize` | 30 | `threadIdx.x % 32` for lane ID |
100
- | `shared_memory_no_padding` | 30 | Arrays sized for 32-thread warps |
101
- | `reduction_loop_bound_32` | 20 | Shuffle loops missing offset=32 step |
102
- | `ballot_sync_warp_assumptions` | 20 | `uint32_t` truncating 64-bit ballot |
103
- | `shfl_down_sync_mask_assumptions` | 20 | 32-bit mask on 64-lane wavefront |
104
-
105
- 📦 **[tazwarrrr/cuda-to-rocm-wavefront-bugs](https://huggingface.co/datasets/tazwarrrr/cuda-to-rocm-wavefront-bugs)**
106
-
107
- ---
108
-
109
- ## The Model Trained on AMD Hardware
110
-
111
- Qwen2.5-Coder-7B-Instruct fine-tuned with LoRA (r=16) on the wavefront bug dataset — trained on an AMD Instinct MI300X via AMD Developer Cloud in 79 seconds. Final loss: 1.189, token accuracy: 81%.
112
-
113
- 🤖 **[tazwarrrr/rocmport-qwen-wavefront-finetuned](https://huggingface.co/tazwarrrr/rocmport-qwen-wavefront-finetuned)**
114
-
115
- ---
116
-
117
- ## Agent Architecture
118
-
119
- ```
120
- CUDA Input
121
-        │
122
-        ▼
123
- ┌─────────────┐
124
- │  Analyzer   │ Detect wavefront bugs, classify risk
125
- └──────┬──────┘
126
-        │
127
-        ▼
128
- ┌─────────────┐
129
- │  Translator │ hipify + LLM fix for missed bugs
130
- └──────┬──────┘
131
-        │
132
-        ▼
133
- ┌─────────────┐   speedup < 0.95?
134
- │  Optimizer  │ ◄──────────────────┐
135
- └──────┬──────┘                    │
136
-        │                           │
137
-        ▼                           │
138
- ┌─────────────┐   retry (max 3)    │
139
- │   Tester    │ ───────────────────┘
140
- └──────┬──────┘
141
-        │
142
-        ▼
143
- ┌─────────────┐
144
- │ Coordinator │ Final report + artifacts
145
- └─────────────┘
146
- ```
147
-
148
- | Agent | Model | Role |
149
- |-------|-------|------|
150
- | Analyzer | Qwen2.5-Coder-32B | Code risk analysis |
151
- | Translator | Qwen2.5-Coder-32B | CUDA→HIP translation |
152
- | Optimizer | Qwen2.5-Coder-32B | Hardware-aware optimization |
153
- | Tester | Llama-3.3-70B | Log parsing, compile verification |
154
-
155
- ---
156
-
157
- ## AMD-Specific Technical Considerations
158
-
159
- ROCmPort AI reasons explicitly about MI300X constraints:
160
-
161
- - **Wavefront size 64** — affects reduction trees, ballot/shuffle idioms, launch geometry
162
- - **LDS bank behavior** — tile staging and reuse patterns
163
- - **192GB HBM3** — opportunities to eliminate model sharding in some workflows
164
- - **gfx942 occupancy** — memory access pattern tradeoffs under ROCm compiler
165
-
166
- ---
167
-
168
- ## Why This Is Hard to Replicate
169
-
170
- A basic clone can chain `hipify` and an LLM. The differentiator is:
171
-
172
- - **Decision loop** — detect failure/perf regression, apply next strategy, re-run
173
- - **Explainability** — stream each agent's reasoning via SSE in real time
174
- - **Verification** — every code change paired with compile + profiler evidence
175
- - **Dataset** — 170 labeled correctness bugs that don't exist anywhere else
176
- - **Fine-tuned model** — trained on real AMD hardware on a purpose-built dataset
177
-
178
- ---
179
-
180
- ## Quick Start
181
-
182
- ```bash
183
- # Windows
184
- start.bat
185
-
186
- # Linux/Mac
187
- ./start.sh
188
-
189
- # Manual
190
- python -m venv .venv
191
- # Windows: .venv\Scripts\activate
192
- # Linux/Mac:
193
- . .venv/bin/activate
194
- pip install -r backend/requirements.txt
195
- cp .env.example .env
196
- # Add GROQ_API_KEY
197
- npm --prefix frontend install
198
- npm --prefix frontend run build
199
- python -m uvicorn backend.main:app --reload --port 8000
200
- ```
201
-
202
- Open `http://localhost:8000/index.html` in a browser.
203
-
204
- ### Docker
205
-
206
- ```bash
207
- docker build -t rocmport-ai .
208
- docker run -p 8000:8000 rocmport-ai
209
- ```
210
-
211
- ---
212
-
213
- ## Configuration
214
 
215
- ```bash
216
- GROQ_API_KEY=your_key
217
- GROQ_MODEL=llama-3.3-70b-versatile
218
-
219
- # AMD DevCloud vLLM (production)
220
- USE_VLLM=true
221
- VLLM_BASE_URL=http://your-amd-cloud:8000
222
- VLLM_MODEL=Qwen/Qwen2.5-Coder-32B-Instruct
223
- ROCM_AVAILABLE=true
224
- ```
225
-
226
- ---
227
-
228
- ## Documented Failure Cases
229
-
230
- At least one failure path is documented with source, output, root cause, and fix requirements. See [`docs/FAILURE_CASES.md`](docs/FAILURE_CASES.md).
231
-
232
- Credibility improves when the system's failure boundary is visible.
233
 
234
  ---
235
 
236
- ## Judge Mode
237
-
238
- For technical review, use this flow:
239
-
240
- 1. Show original CUDA kernel
241
- 2. Show baseline HIP from straight `hipify` output
242
- 3. Run ROCmPort AI — watch per-agent trace stream
243
- 4. Show final optimized HIP output
244
- 5. Show measured result vs declared baseline
245
- 6. Show one case with marginal gain or no gain
246
-
247
- Full walkthrough: [`docs/JUDGE_MODE.md`](docs/JUDGE_MODE.md)
248
-
249
- ---
250
-
251
- ## Project Structure
252
-
253
- ```
254
- ROCmPort AI/
255
- ├── backend/
256
- │ ├── agents/ # analyzer, translator, optimizer, tester, coordinator
257
- │ ├── tools/ # hipify_wrapper, rocprof_wrapper, llm_client
258
- │ ├── demo_kernels/ # reduction.cu, matrix_multiply.cu, vector_add.cu
259
- │ └── graph/ # LangGraph StateGraph pipeline
260
- ├── dataset/
261
- │ ├── upload_dataset.py
262
- │ └── finetune_qwen.py
263
- ├── docs/
264
- │ ├── LIVE_RESULTS.md
265
- │ ├── FAILURE_CASES.md
266
- │ └── JUDGE_MODE.md
267
- ├── frontend/
268
- └── BENCHMARKS.md
269
- ```
270
-
271
- ---
272
-
273
- ## Troubleshooting
274
-
275
- | Issue | Resolution |
276
- |-------|-----------|
277
- | `GROQ_API_KEY not found` | Add key to `.env` |
278
- | `hipcc not found` | Install ROCm toolchain or use ROCm-enabled environment |
279
- | Backend unavailable | Verify FastAPI running on port 8000 |
280
- | No improvement observed | Check baseline definition and profiler counters |
281
-
282
- ---
283
-
284
- ## License
285
-
286
- Apache 2.0 — see [`LICENSE`](LICENSE)
 
1
+ # ROCmPort AI
2
 
3
+ **Demo**: [View Live Demo](#)
4
 
5
+ ## What it does
6
 
7
+ ROCmPort AI automatically migrates CUDA GPU code to ROCm (AMD's open-source GPU computing platform), enabling seamless portability across different GPU architectures.
8
 
9
+ ## Key Features
10
 
11
+ - 🚀 **Automated Code Translation** - Converts CUDA kernels and libraries to ROCm HIP code with minimal manual intervention
12
+ - 📊 **Performance Analysis** - Generates detailed migration reports with benchmark comparisons and optimization recommendations
13
+ - 🔧 **Smart Patching** - Intelligently handles library replacements, API mappings, and architecture-specific optimizations
14
 
15
  ---
16
 
17
+ For detailed documentation and examples, see [BENCHMARKS.md](BENCHMARKS.md) and the [docs](docs/) folder.
start.sh CHANGED
File without changes
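
Reviewer note on the removed README: its dataset table lists a `reduction_loop_bound_32` category, described as "Shuffle loops missing offset=32 step". The following is a minimal host-side sketch in plain C++ (not HIP, and not code from this repository) of why that missing step matters on a 64-lane wavefront. The names `wave_reduce`, `wave_reduce_portable`, and `iota_lanes` are hypothetical, and the shuffle-down behavior is modeled under the assumption stated in the comments. On the device side, HIP exposes the wave width as `warpSize`, which plays the role of `lanes.size()` here.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side model of a shuffle-down reduction over one wave.
// Assumptions: the wave width is a power of two, and a shuffle-down whose
// source lane is out of range hands the calling lane its own value back.
long long wave_reduce(std::vector<long long> lanes, std::size_t first_offset) {
    const std::size_t width = lanes.size();
    assert(first_offset < width);  // the starting stride must stay inside the wave
    for (std::size_t offset = first_offset; offset > 0; offset >>= 1) {
        // Every lane reads its partner before any lane writes (lockstep model).
        const std::vector<long long> snapshot = lanes;
        for (std::size_t lane = 0; lane < width; ++lane) {
            const std::size_t src = lane + offset;
            lanes[lane] = snapshot[lane] + (src < width ? snapshot[src] : snapshot[lane]);
        }
    }
    return lanes[0];  // lane 0 holds the reduced value
}

// Derives the starting stride from the wave width instead of hard-coding 16.
long long wave_reduce_portable(const std::vector<long long>& lanes) {
    return wave_reduce(lanes, lanes.size() / 2);
}

// Test helper: lanes holding 1, 2, ..., width.
std::vector<long long> iota_lanes(std::size_t width) {
    std::vector<long long> lanes(width);
    std::iota(lanes.begin(), lanes.end(), 1LL);
    return lanes;
}
```

With lanes holding 1 through 64, `wave_reduce(iota_lanes(64), 16)` leaves lane 0 with 528, the sum of lanes 0 to 31 only, while `wave_reduce_portable(iota_lanes(64))` returns the full 2080. On a 32-lane wave the two strides coincide, which is why the hard-coded loop looks correct on NVIDIA hardware.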