---
title: TurboCPP Demo
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: "3.12"
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---
# turbocpp – llama.cpp + TurboQuant
Live demo of [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp).
Two tabs:
1. **Run inference** – TinyLlama-1.1B-Chat (Q4_K_M) loaded via
   `llama-cpp-python` and run on this Space's CPU. Type a prompt, get
   tokens back, and see tok/s.
2. **TurboQuant math viz** – interactive sliders showing how the
   Hadamard rotation Gaussianizes per-block weight distributions and
   reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
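The idea behind the math viz can be sketched in a few lines of NumPy (this is an illustrative sketch of the effect, not the Space's actual viz code): a single outlier weight sets a block's max-abs and therefore its Q4 scale, while an orthonormal Hadamard rotation spreads that outlier across all coordinates, shrinking the max-abs and the rounding error.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
n = 64
w = rng.normal(0.0, 0.02, n)   # mostly small weights...
w[3] = 1.0                     # ...plus one outlier that dominates the block's max-abs

H = hadamard(n)
w_rot = H @ w                  # rotation spreads the outlier's energy across the block

def q4_roundtrip(x: np.ndarray) -> np.ndarray:
    """Naive symmetric 4-bit quantization: map the block's max-abs to 7."""
    s = np.abs(x).max() / 7.0
    return np.round(x / s).clip(-8, 7) * s

err_plain = np.abs(q4_roundtrip(w) - w).mean()
# Quantize in the rotated basis, then rotate back (H is orthonormal, so H.T @ H = I).
err_rot = np.abs(H.T @ q4_roundtrip(w_rot) - w).mean()

print(f"max-abs: {np.abs(w).max():.3f} -> {np.abs(w_rot).max():.3f}")
print(f"mean Q4 error: {err_plain:.5f} -> {err_rot:.5f}")
```

With the outlier present, the plain block wastes its 16 levels on one weight and rounds most of the others to zero; after rotation the same 4-bit grid covers a much tighter range.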
## Build details
- **Gradio 5** + **Python 3.12** – the Gradio 4 + newer Starlette combination
  is broken in ways that version pins don't resolve cleanly (TemplateResponse
  signature change, pydantic schema change), so we just upgrade.
- **llama-cpp-python** installed from a **prebuilt wheel** at
[AIencoder/llama-cpp-wheels](https://huggingface.co/datasets/AIencoder/llama-cpp-wheels)
(variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably
build this from source, so we ship the binary.
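  pip accepts direct URL references in `requirements.txt` (PEP 508), so one way
  to wire this up is a line like the following – the filename below is a
  placeholder, not the real wheel name; copy the exact one from the dataset
  listing:

  ```text
  # placeholder filename — use the actual wheel for your Python/CPU variant
  llama-cpp-python @ https://huggingface.co/datasets/AIencoder/llama-cpp-wheels/resolve/main/<wheel-filename>.whl
  ```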
- The first `generate` call cold-starts (~668 MB GGUF download). Subsequent
  calls are fast (the model stays loaded in memory).
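That cold-start behavior falls out of lazy loading plus caching. A minimal sketch of the pattern, with the actual `Llama` construction shown only as a commented-out assumption about the real app's code:

```python
from functools import lru_cache

load_calls = 0  # instrumentation for this sketch only

@lru_cache(maxsize=1)
def get_model():
    """Download + load the GGUF on first use; later calls reuse the cached instance."""
    global load_calls
    load_calls += 1
    # In the real app, this is the slow step (hypothetical path and params):
    #   from llama_cpp import Llama
    #   return Llama(model_path="tinyllama-1.1b-chat.Q4_K_M.gguf", n_ctx=2048)
    return object()  # stand-in for the loaded Llama instance

first = get_model()   # slow path: triggers the one-time load
second = get_model()  # fast path: returns the cached instance
```

Keeping the import and download inside the function also keeps the Space's startup (and Gradio UI render) fast, since nothing heavy runs until the first `generate`.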