---
title: TurboCPP Demo
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: '3.12'
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---
# turbocpp: llama.cpp + TurboQuant
Live demo of github.com/Ary5272/turbocpp.
Two tabs:

- **Run inference**: TinyLlama-1.1B-Chat (Q4_K_M), loaded via `llama-cpp-python` and run on this Space's CPU. Type a prompt, get tokens, see tok/s.
- **TurboQuant math viz**: interactive sliders showing how the Hadamard rotation Gaussianizes per-block weight distributions and reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
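The effect the viz tab illustrates can be sketched in a few lines of NumPy. This is not the Space's actual code, just a minimal demonstration of the idea: an orthonormal Hadamard rotation spreads a weight outlier's energy across the block, shrinking the max-abs that sets the Q4 quantization step.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # normalize so that H @ H.T == I

rng = np.random.default_rng(0)
# A 64-weight block with one outlier: the case that hurts block quantization,
# since the block's scale is set by its largest-magnitude entry.
w = rng.normal(0.0, 0.02, 64)
w[7] = 0.5  # outlier inflates max-abs, hence the Q4 rounding step

H = hadamard(64)
w_rot = H @ w  # rotation spreads the outlier across all 64 entries

# Orthonormal, so the block's energy (L2 norm) is unchanged...
assert np.isclose(np.linalg.norm(w), np.linalg.norm(w_rot))
# ...but the peak that drives quantization error drops sharply.
print(f"max-abs before: {np.abs(w).max():.3f}, after: {np.abs(w_rot).max():.3f}")
```

Because the rotation is invertible, dequantized weights can be rotated back with `H.T`, so the smaller rounding step translates directly into lower reconstruction error.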
## Build details
- Gradio 5 + Python 3.12. Gradio 4 with the newer Starlette is broken in ways that version pins don't resolve cleanly (the `TemplateResponse` signature change, the pydantic schema change), so we upgrade instead.
- `llama-cpp-python` installed from a prebuilt wheel at `AIencoder/llama-cpp-wheels` (variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably build this from source, so we ship the binary.
- First `generate` cold-starts (~668 MB GGUF download). Subsequent calls are fast because the model stays loaded in memory.
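The cold-start behavior comes down to caching the model handle at module level. A minimal sketch of the pattern (the names and the stand-in loader are illustrative, not the Space's exact code; in `app.py` the expensive step would be constructing `llama_cpp.Llama` from the downloaded GGUF):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1)
def load_model():
    """Pay the expensive load exactly once; later calls return the cached handle."""
    # Stand-in for the real one-time cost: GGUF download + llama_cpp.Llama(...)
    time.sleep(0.1)
    return object()  # stand-in for the Llama instance

t0 = time.perf_counter(); load_model(); cold = time.perf_counter() - t0
t1 = time.perf_counter(); load_model(); warm = time.perf_counter() - t1
print(f"cold start: {cold * 1e3:.0f} ms, warm call: {warm * 1e3:.3f} ms")
```

`lru_cache(maxsize=1)` on a zero-argument loader is a simple way to get process-lifetime memoization; a module-level global guarded by `if _llm is None:` works equally well.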