---
title: TurboCPP Demo
emoji: πŸŒ€
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: "3.12"
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---

# turbocpp β€” llama.cpp + TurboQuant

Live demo of [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp).

Two tabs:

1. **Run inference** β€” TinyLlama-1.1B-Chat (Q4_K_M) loaded via
   `llama-cpp-python` and run on this Space's CPU. Type a prompt, get
   tokens, see tok/s.
2. **TurboQuant math viz** β€” interactive sliders showing how the
   Hadamard rotation Gaussianizes per-block weight distributions and
   reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
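
The rotation effect in tab 2 is easy to reproduce outside the Space.
Below is a minimal NumPy sketch (illustrative only, not the Space's
actual viz code): an orthonormal Walsh-Hadamard transform spreads an
outlier's energy across the block, shrinking the max-abs that sets the
4-bit rounding step. The `fwht` and `q4_roundtrip` helpers are
assumptions for this demo, and the symmetric ±7 integer grid is a
simplification of the real Q4 / Q4_K block formats.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform (len must be a power of 2).

    With the 1/sqrt(n) scaling, the transform is its own inverse.
    """
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i : i + h].copy()
            b = x[i + h : i + 2 * h].copy()
            x[i : i + h] = a + b
            x[i + h : i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def q4_roundtrip(w: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit quantize/dequantize: max-abs scale, integers in [-7, 7]."""
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

rng = np.random.default_rng(0)
block = rng.normal(size=128)
block[3] = 8.0  # a single outlier inflates the block's max-abs

err_plain = np.abs(q4_roundtrip(block) - block).mean()
# Rotate, quantize in the rotated basis, rotate back, then compare.
err_rot = np.abs(fwht(q4_roundtrip(fwht(block))) - block).mean()

print(f"max-abs:  plain {np.abs(block).max():.2f}  rotated {np.abs(fwht(block)).max():.2f}")
print(f"mean err: plain {err_plain:.4f}  rotated {err_rot:.4f}")
```

Because the transform is orthonormal, the block's norm is unchanged;
the outlier's energy is just spread over all 128 coefficients, so the
rotated max-abs (and with it the rounding step) drops.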

## Build details

- **Gradio 5** + **Python 3.12** — Gradio 4 combined with newer
  Starlette breaks in ways that version pins don't resolve cleanly
  (the `TemplateResponse` signature change, a pydantic schema change),
  so we just upgrade.
- **llama-cpp-python** installed from a **prebuilt wheel** at
  [AIencoder/llama-cpp-wheels](https://huggingface.co/datasets/AIencoder/llama-cpp-wheels)
  (variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably
  build this from source, so we ship the binary.
- The first `generate` call cold-starts (~668 MB GGUF download).
  Subsequent calls are fast (the model stays loaded in memory).
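
For reference, that cold-start behavior amounts to a module-level
cache around the model handle. A minimal sketch, assuming the GGUF is
pulled with `huggingface_hub`; the repo id, filename, and helper names
are illustrative placeholders, not necessarily what `app.py` does:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Illustrative placeholders — the actual repo/filename live in app.py.
MODEL_REPO = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
MODEL_FILE = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

_llm: Llama | None = None  # module-level cache: one download + load per process

def get_llm() -> Llama:
    global _llm
    if _llm is None:
        path = hf_hub_download(MODEL_REPO, MODEL_FILE)  # the ~668 MB cold start
        _llm = Llama(model_path=path, n_ctx=2048)       # resident for later calls
    return _llm

def generate(prompt: str, max_tokens: int = 128) -> str:
    out = get_llm()(prompt, max_tokens=max_tokens)
    return out["choices"][0]["text"]
```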