---
title: Router Control Room (ZeroGPU)
emoji: 🛰️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
short_description: ZeroGPU UI for CourseGPT-Pro router checkpoints
---
# 🛰️ Router Control Room – ZeroGPU

This Space exposes the CourseGPT-Pro router checkpoints (Gemma3 27B + Qwen3 32B) with an opinionated Gradio UI. It runs entirely on ZeroGPU hardware using **AWQ 4-bit quantization** and **FlashAttention-2** for optimized inference, with a fallback to 8-bit BitsAndBytes if AWQ is unavailable.
## ✨ What's Included

- **Router-specific prompt builder** – inject difficulty, tags, context, acceptance criteria, and additional guidance into the canonical router system prompt.
- **Two curated checkpoints** – `Router-Qwen3-32B-AWQ` and `Router-Gemma3-27B-AWQ`, both merged and optimized with AWQ quantization and FlashAttention-2.
- **JSON extraction + validation** – output is parsed automatically and checked for the required router fields (`route_plan`, `todo_list`, `metrics`, etc.).
- **Raw output + prompt debug** – inspect the verbatim generation and the exact prompt string sent to the checkpoint.
- **One-click clear** – reset the UI between experiments without reloading models.
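The JSON extraction + validation step can be sketched as below. This is a minimal illustration, not the Space's actual implementation: the function names, the fence-handling regex, and the exact required-field set (only the three fields this README names) are assumptions.

```python
import json
import re

# Subset of required router fields named in this README; the real app may check more.
REQUIRED_FIELDS = {"route_plan", "todo_list", "metrics"}

def extract_router_json(raw: str) -> dict:
    """Pull the first JSON object out of raw model output, handling ```json fences."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw[raw.find("{"): raw.rfind("}") + 1]
    return json.loads(candidate)

def missing_fields(plan: dict) -> list[str]:
    """Return the required router fields absent from the parsed plan."""
    return sorted(REQUIRED_FIELDS - plan.keys())
```

A `json.JSONDecodeError` raised here is what the validation panel would surface when parsing fails.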
## 🚀 Workflow

1. Describe the user task / homework prompt in the main textbox.
2. Optionally provide context, acceptance criteria, and extra guidance.
3. Choose the difficulty tier, tags, model, and decoding parameters.
4. Click **Generate Router Plan**.
5. Review:
   - **Raw Model Output** – plain text returned by the LLM.
   - **Parsed Router Plan** – JSON tree extracted from the output.
   - **Validation Panel** – confirms whether all required fields are present.
   - **Full Prompt** – copy/paste for repro or benchmarking.

If JSON parsing fails, the validation panel surfaces the error so you can tweak decoding parameters or the prompt.
## 🧠 Supported Models

| Name | Base | Notes |
|------|------|-------|
| `Router-Qwen3-32B-AWQ` | Qwen3 32B | Best overall acceptance on CourseGPT-Pro benchmarks. Optimized with AWQ 4-bit quantization and FlashAttention-2. |
| `Router-Gemma3-27B-AWQ` | Gemma3 27B | Slightly smaller; tends to favour math-first plans. Optimized with AWQ 4-bit quantization and FlashAttention-2. |
### Performance Optimizations

- **AWQ (Activation-Aware Weight Quantization)**: 4-bit quantization for faster inference and lower memory usage
- **FlashAttention-2**: optimized attention mechanism for better throughput
- **TF32 math**: enabled for Ampere+ GPUs for faster matrix operations
- **Kernel warmup**: automatic CUDA kernel JIT compilation on startup
- **Fast tokenization**: uses fast tokenizers with CPU preprocessing

Both checkpoints are merged and quantized in the `Alovestocode` namespace and require an `HF_TOKEN` with read access.
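The AWQ → 8-bit BitsAndBytes → bf16/fp16/fp32 fallback described in the Notes below can be sketched as an ordered-attempt loop. The loader names in the docstring are hypothetical stand-ins, not the app's actual functions:

```python
def load_with_fallback(loaders):
    """Try each (name, loader_fn) pair in order; return the first that succeeds.

    `loaders` would be an ordered list such as
    [("awq", load_awq), ("bnb-8bit", load_bnb8), ("bf16", load_bf16)],
    where each loader wraps a `from_pretrained` call with the matching
    quantization config (loader names here are illustrative only).
    """
    errors = {}
    for name, fn in loaders:
        try:
            return name, fn()
        except Exception as exc:  # e.g. autoawq missing, unsupported GPU, OOM
            errors[name] = repr(exc)
    raise RuntimeError(f"all loading strategies failed: {errors}")
```

Collecting the per-strategy errors makes it easy to log why a higher-priority quantization path was skipped.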
## ⚙️ Local Development

```bash
cd Milestone-6/router-agent/zero-gpu-space
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=hf_xxx
python app.py
```
## 📝 Notes

- The app attempts **AWQ 4-bit quantization** first (if available), then falls back to **8-bit BitsAndBytes**, and finally to bf16/fp16/fp32 if quantization fails.
- **FlashAttention-2** is enabled automatically when available for improved performance.
- CUDA kernels are warmed up on startup to reduce first-token latency.
- The UI enforces single-turn router generations; conversation history and web search are intentionally omitted to match the Milestone 6 deliverable.
- If you need to re-enable web search or add checkpoints, extend `MODELS` and adjust the prompt builder accordingly.
- **Benchmarking:** run `python Milestone-6/router-agent/tests/run_router_space_benchmark.py --space Alovestocode/ZeroGPU-LLM-Inference --limit 32` (requires `pip install gradio_client`) to call the Space, dump predictions, and evaluate against the Milestone 5 hard suite and thresholds.
- Set `ROUTER_PREFETCH_MODEL` (single value) or `ROUTER_PREFETCH_MODELS=Router-Qwen3-32B-AWQ,Router-Gemma3-27B-AWQ` (comma-separated; `ALL` for every checkpoint) to warm-load weights during startup. Disable background warming by setting `ROUTER_WARM_REMAINING=0`.
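For example, a local run that eagerly warms one checkpoint and skips background warming of the rest might look like this (variable names and values are taken from this README; adjust to your setup):

```shell
export HF_TOKEN=hf_xxx                              # read access to the Alovestocode checkpoints
export ROUTER_PREFETCH_MODEL=Router-Qwen3-32B-AWQ   # warm-load this checkpoint at startup
export ROUTER_WARM_REMAINING=0                      # do not warm the other checkpoints in the background
python app.py
```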