---
title: TurboCPP Demo
emoji: 🌀
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: "3.12"
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---

# turbocpp — llama.cpp + TurboQuant

Live demo of [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp). Two tabs:

1. **Run inference** — TinyLlama-1.1B-Chat (Q4_K_M) loaded via `llama-cpp-python` and run on this Space's CPU. Type a prompt, get tokens back, and see tok/s (sketched below).
2. **TurboQuant math viz** — interactive sliders showing how the Hadamard rotation Gaussianizes per-block weight distributions and reduces the per-block max-abs that drives Q4 / Q4_K rounding error (see the numeric sketch below).

## Build details

- **Gradio 5** + **Python 3.12** — Gradio 4 with newer Starlette breaks in ways that version pins don't resolve cleanly (the `TemplateResponse` signature change, the pydantic schema change), so we just upgrade.
- **llama-cpp-python** installed from a **prebuilt wheel** at [AIencoder/llama-cpp-wheels](https://huggingface.co/datasets/AIencoder/llama-cpp-wheels) (variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably build this package from source, so we ship the binary (see the requirements sketch below).
- The first `generate` call cold-starts (~668 MB GGUF download). Subsequent calls are fast, since the model stays loaded in memory.
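
## Sketch: the inference tab

A minimal sketch of what the **Run inference** tab does, not the Space's actual `app.py`. The model source below is an assumption: this README only names "TinyLlama-1.1B-Chat (Q4_K_M)", so the repo id and filename are illustrative. `Llama.from_pretrained` downloads the GGUF from the Hub on first use (the cold start noted above) and reuses the cached file afterwards.

```python
import time

from llama_cpp import Llama

# Assumed model source (hypothetical; the README only says
# "TinyLlama-1.1B-Chat (Q4_K_M)"):
REPO_ID = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
FILENAME = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# Loading is the slow part: the first run downloads the ~668 MB GGUF,
# after which the handle (and its weights) stays resident for later calls.
llm = Llama.from_pretrained(
    repo_id=REPO_ID,
    filename=FILENAME,
    n_ctx=2048,
    verbose=False,
)

def generate(prompt: str, max_tokens: int = 128) -> tuple[str, float]:
    """Run one chat completion and report tokens per second."""
    start = time.perf_counter()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    text = out["choices"][0]["message"]["content"]
    tok_s = out["usage"]["completion_tokens"] / elapsed
    return text, tok_s

if __name__ == "__main__":
    text, tok_s = generate("Say hello in five words.")
    print(f"{text}\n[{tok_s:.1f} tok/s]")
```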
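
## Sketch: the Hadamard rotation effect

The math-viz tab's claim, in numbers. Multiplying a weight block by an orthonormal Hadamard matrix spreads any outlier's energy across all coordinates, so the block's max-abs shrinks, the Q4 scale (max-abs / 7 in the simplified symmetric scheme below) shrinks with it, and the round-to-nearest error on the remaining weights drops. This standalone NumPy sketch is illustrative, not the Space's viz code: it quantizes one outlier-heavy block to 4 bits before and after rotation.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def q4_roundtrip(block: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit round-trip: scale by max-abs, round into [-7, 7]."""
    scale = np.abs(block).max() / 7.0
    return np.clip(np.round(block / scale), -7, 7) * scale

rng = np.random.default_rng(0)
n = 64                               # one quantization block
w = 0.1 * rng.standard_normal(n)     # mostly small weights...
w[3] = 5.0                           # ...plus one outlier that sets the Q4 scale

H = hadamard(n) / np.sqrt(n)         # orthonormal, so norms (and MSE) are preserved
w_rot = H @ w                        # outlier energy now spread over all 64 slots

for name, x in [("raw", w), ("rotated", w_rot)]:
    mse = np.mean((x - q4_roundtrip(x)) ** 2)
    print(f"{name:8s}  max-abs = {np.abs(x).max():.3f}   Q4 MSE = {mse:.6f}")
```

Because the rotation is orthonormal, MSE measured in the rotated domain equals MSE after rotating back with `H.T`, so the two printed errors are directly comparable.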
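
## Sketch: pinning the prebuilt wheel

One common way a Space pins a prebuilt wheel is a direct-URL requirement in `requirements.txt`. The sketch below is an assumption about how this Space does it; only the dataset URL and the variant string come from this README, and the exact wheel filename is left as a placeholder. Check the dataset's file listing for the real name.

```text
# requirements.txt (sketch): direct-URL pin for the prebuilt wheel.
# <...> is a placeholder, not a real filename; look it up in the dataset.
llama-cpp-python @ https://huggingface.co/datasets/AIencoder/llama-cpp-wheels/resolve/main/<0.3.16+basic_avx2_fma_f16c-cp312 wheel>.whl
```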