---
title: TurboCPP Demo
emoji: 🌀
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.5.0
app_file: app.py
pinned: false
license: mit
python_version: "3.12"
short_description: Live llama.cpp + Hadamard rotation demo (TurboQuant)
---
# turbocpp: llama.cpp + TurboQuant
Live demo of [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp).
Two tabs:
1. **Run inference**: TinyLlama-1.1B-Chat (Q4_K_M), loaded via
   `llama-cpp-python` and run on this Space's CPU. Type a prompt, get
   tokens, see tok/s.
2. **TurboQuant math viz**: interactive sliders showing how the
   Hadamard rotation Gaussianizes per-block weight distributions and
   reduces the per-block max-abs that drives Q4 / Q4_K rounding error
   (see the sketch after this list).
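
The rotation claim in tab 2 is easy to sanity-check. Below is a minimal NumPy sketch, not the Space's code: it builds a normalized Hadamard matrix with `scipy.linalg.hadamard`, rotates one heavy-tailed weight block, and compares the per-block max-abs and the error of a plain symmetric absmax 4-bit round-trip. The `q4_roundtrip` helper and the block size of 32 are illustrative stand-ins, not llama.cpp's exact Q4 / Q4_K layout.

```python
# Sketch of the idea behind the viz tab: rotate a weight block with a
# normalized Hadamard matrix, then compare max-abs and the error of a
# simple symmetric 4-bit round-trip (not llama.cpp's real Q4_K scheme).
import numpy as np
from scipy.linalg import hadamard

BLOCK = 32                              # llama.cpp quantizes in 32-wide sub-blocks
H = hadamard(BLOCK) / np.sqrt(BLOCK)    # orthogonal, so the rotation preserves norms

def q4_roundtrip(x):
    """Symmetric absmax 4-bit quantize/dequantize of one block (illustrative)."""
    scale = np.abs(x).max() / 7.0       # map the largest weight to the edge of [-7, 7]
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=BLOCK)    # heavy-tailed block: one weight tends to dominate
w_rot = H @ w                           # Hadamard rotation spreads that outlier across the block

for name, x in [("raw", w), ("rotated", w_rot)]:
    err = np.sqrt(np.mean((x - q4_roundtrip(x)) ** 2))
    print(f"{name:8s} max|w| = {np.abs(x).max():.3f}   rmse = {err:.4f}")
```

Because the Hadamard transform is orthogonal it leaves the block's norm unchanged; it only spreads any single large weight across all 32 lanes, which is what pulls the max-abs, and hence the quantization scale and rounding error, down.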
## Build details
- **Gradio 5** + **Python 3.12**: Gradio 4 with newer Starlette breaks in
  ways that version pins don't resolve cleanly (the `TemplateResponse`
  signature change, the pydantic schema change), so we simply upgrade.
- **llama-cpp-python** installed from a **prebuilt wheel** at
  [AIencoder/llama-cpp-wheels](https://huggingface.co/datasets/AIencoder/llama-cpp-wheels)
  (variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably
  build this from source, so we ship the binary.
- First `generate` cold-starts (~668 MB GGUF download); subsequent calls
  are fast because the model stays loaded in memory (see the sketch below).
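
The cold-start / warm-path behavior can be structured roughly as below. This is a hedged sketch, not the Space's `app.py`: the GGUF repo id, filename, and the `get_model` / `generate` helpers are assumptions for illustration; only `hf_hub_download`, `Llama`, and `create_chat_completion` are real huggingface_hub / llama-cpp-python APIs.

```python
# Rough sketch of lazy model loading (illustrative; repo id and filename
# below are assumptions, not taken from this Space's source).
from functools import lru_cache

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

@lru_cache(maxsize=1)
def get_model() -> Llama:
    # First call downloads the ~668 MB GGUF and loads it; later calls reuse
    # the same in-memory Llama instance, which is why only the first
    # `generate` is slow.
    path = hf_hub_download(
        repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",  # assumed repo
        filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",   # assumed file
    )
    return Llama(model_path=path, n_ctx=2048, n_threads=2)

def generate(prompt: str, max_tokens: int = 128) -> str:
    llm = get_model()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return out["choices"][0]["message"]["content"]
```

Caching the `Llama` instance with `lru_cache` is what keeps later calls fast; the GGUF download itself is also cached on disk by `huggingface_hub`, so a Space restart only pays the load cost, not the download.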