---
title: Router Control Room (ZeroGPU)
emoji: 🛰️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: ZeroGPU UI for CourseGPT-Pro router checkpoints
---
# 🛰️ Router Control Room – ZeroGPU
This Space exposes the CourseGPT-Pro router checkpoints (Gemma3 27B + Qwen3 32B) with an opinionated Gradio UI. It runs entirely on ZeroGPU hardware using **AWQ 4-bit quantization** and **FlashAttention-2** for optimized inference, falling back to 8-bit BitsAndBytes when AWQ is unavailable.
## What's Included
- **Router-specific prompt builder** – inject difficulty, tags, context, acceptance criteria, and additional guidance into the canonical router system prompt.
- **Two curated checkpoints** – `Router-Qwen3-32B-AWQ` and `Router-Gemma3-27B-AWQ`, both merged and optimized with AWQ quantization and FlashAttention-2.
- **JSON extraction + validation** – output is parsed automatically and checked for the required router fields (route_plan, todo_list, metrics, etc.).
- **Raw output + prompt debug** – inspect the verbatim generation and the exact prompt string sent to the checkpoint.
- **One-click clear** – reset the UI between experiments without reloading models.
## Workflow
1. Describe the user task / homework prompt in the main textbox.
2. Optionally provide context, acceptance criteria, and extra guidance.
3. Choose the difficulty tier, tags, model, and decoding parameters.
4. Click **Generate Router Plan**.
5. Review:
   - **Raw Model Output** – plain text returned by the LLM.
   - **Parsed Router Plan** – JSON tree extracted from the output.
   - **Validation Panel** – confirms whether all required fields are present.
   - **Full Prompt** – copy/paste for repro or benchmarking.
If JSON parsing fails, the validation panel will surface the error so you can tweak decoding parameters or the prompt.
## Supported Models
| Name | Base | Notes |
|------|------|-------|
| `Router-Qwen3-32B-AWQ` | Qwen3 32B | Best overall acceptance on CourseGPT-Pro benchmarks. Optimized with AWQ 4-bit quantization and FlashAttention-2. |
| `Router-Gemma3-27B-AWQ` | Gemma3 27B | Slightly smaller, tends to favour math-first plans. Optimized with AWQ 4-bit quantization and FlashAttention-2. |
### Performance Optimizations
- **AWQ (Activation-Aware Weight Quantization)**: 4-bit quantization for faster inference and lower memory usage
- **FlashAttention-2**: Optimized attention mechanism for better throughput
- **TF32 Math**: Enabled for Ampere+ GPUs for faster matrix operations
- **Kernel Warmup**: Automatic CUDA kernel JIT compilation on startup
- **Fast Tokenization**: Uses fast tokenizers with CPU preprocessing
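The TF32 and FlashAttention-2 optimizations above map onto standard PyTorch and `transformers` switches. A minimal sketch, assuming a `transformers`-based loader (the model id is illustrative; the Space's actual loading logic lives in `app.py`):

```python
import torch
from transformers import AutoModelForCausalLM

# TF32 matmuls: faster on Ampere+ GPUs, a no-op on older hardware.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# FlashAttention-2 via transformers' attn_implementation switch.
model = AutoModelForCausalLM.from_pretrained(
    "Alovestocode/Router-Qwen3-32B-AWQ",  # assumed repo id in the Alovestocode namespace
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```

`attn_implementation="flash_attention_2"` requires the `flash-attn` package and a supported GPU; without it, `transformers` raises an error rather than silently degrading, which is why the app treats it as optional.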
Both checkpoints are merged and quantized in the `Alovestocode` namespace and require an `HF_TOKEN` with read access.
## Local Development
```bash
cd Milestone-6/router-agent/zero-gpu-space
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=hf_xxx
python app.py
```
## Notes
- The app attempts **AWQ 4-bit quantization** first (if available), then falls back to **8-bit BitsAndBytes**, and finally to bf16/fp16/fp32 if quantization fails.
- **FlashAttention-2** is automatically enabled when available for improved performance.
- CUDA kernels are warmed up on startup to reduce first-token latency.
- The UI enforces single-turn router generations; conversation history and web search are intentionally omitted to match the Milestone 6 deliverable.
- If you need to re-enable web search or more checkpoints, extend `MODELS` and adjust the prompt builder accordingly.
- **Benchmarking:** run `python Milestone-6/router-agent/tests/run_router_space_benchmark.py --space Alovestocode/ZeroGPU-LLM-Inference --limit 32` (requires `pip install gradio_client`) to call the Space, dump predictions, and evaluate against the Milestone 5 hard suite + thresholds.
- Set `ROUTER_PREFETCH_MODEL` (single value) or `ROUTER_PREFETCH_MODELS=Router-Qwen3-32B-AWQ,Router-Gemma3-27B-AWQ` (comma-separated, `ALL` for every checkpoint) to warm-load weights during startup. Disable background warming by setting `ROUTER_WARM_REMAINING=0`.
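The AWQ → 8-bit BitsAndBytes → plain-dtype cascade from the first note can be sketched as a small decision function. This is a hypothetical simplification for illustration (the function name and the returned `mode` labels are invented); the real fallback logic, including its error handling, lives in `app.py`:

```python
def select_quantization(awq_ok: bool, bnb_ok: bool, cuda_ok: bool) -> dict:
    """Pick a loading strategy following the cascade described above."""
    if awq_ok:
        # AWQ checkpoints ship pre-quantized 4-bit weights; nothing extra to configure.
        return {"mode": "awq"}
    if bnb_ok:
        # e.g. transformers' BitsAndBytesConfig(load_in_8bit=True)
        return {"mode": "bitsandbytes-8bit"}
    # Final fallback: half precision on GPU, full precision on CPU.
    return {"mode": "bf16" if cuda_ok else "fp32"}
```

Each step is tried in order and the first one whose dependencies import cleanly wins, so a missing `autoawq` or `bitsandbytes` install degrades gracefully instead of crashing the Space.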