
---
title: TurboCPP Demo
emoji: 🌀
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: '3.12'
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---

# turbocpp: llama.cpp + TurboQuant

Live demo of github.com/Ary5272/turbocpp.

Two tabs:

1. **Run inference**: TinyLlama-1.1B-Chat (Q4_K_M), loaded via llama-cpp-python and run on this Space's CPU. Type a prompt, get tokens, see tok/s.
2. **TurboQuant math viz**: interactive sliders showing how the Hadamard rotation Gaussianizes per-block weight distributions and reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
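The rotation idea behind the viz tab can be sketched numerically. This is an illustrative toy, not the Space's actual code: the block size, the heavy-tailed weight sample, and the symmetric 4-bit quantizer are all assumptions chosen for the demo.

```python
import numpy as np

# Toy sketch (illustrative, not the Space's viz code): an orthonormal
# Hadamard rotation spreads a block's outliers across all coordinates,
# shrinking the max-abs that sets the 4-bit quantization step.
rng = np.random.default_rng(0)
n = 64                                  # block size; must be a power of two
w = rng.standard_t(df=3, size=n)        # heavy-tailed stand-in for weights

# Build the n x n Hadamard matrix by Kronecker doubling, then scale by
# 1/sqrt(n) so the rotation is orthonormal (preserves the block's L2 norm).
H = np.array([[1.0]])
while H.shape[0] < n:
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
H /= np.sqrt(n)

w_rot = H @ w                           # rotated block

def q4_rmse(x):
    """Round-trip RMSE of symmetric 4-bit quantization, scale = max-abs / 7."""
    scale = np.max(np.abs(x)) / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return float(np.sqrt(np.mean((q * scale - x) ** 2)))

print(f"max-abs  raw {np.max(np.abs(w)):.3f}  rotated {np.max(np.abs(w_rot)):.3f}")
print(f"Q4 RMSE  raw {q4_rmse(w):.4f}  rotated {q4_rmse(w_rot):.4f}")
```

Because the rotation is orthonormal it preserves the block's norm, so any drop in max-abs translates directly into a finer quantization step for the same dynamic range.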

## Build details

  • Gradio 5 + Python 3.12 β€” Gradio 4 + new Starlette is broken in ways that don't resolve cleanly with version pins (TemplateResponse signature change, pydantic schema change), so we just upgrade.
  • llama-cpp-python installed from a prebuilt wheel at AIencoder/llama-cpp-wheels (variant 0.3.16+basic_avx2_fma_f16c-cp312). HF Spaces don't reliably build this from source, so we ship the binary.
  • First generate cold-starts (~668 MB GGUF download). Subsequent calls are fast (model stays loaded in memory).