Space: AIencoder/turboquant-visualizer

AIencoder committed 23 days ago

Commit 5696928 (verified) · 1 parent: 50c695a

v3: install llama-cpp-python from prebuilt wheel (HF dataset, no source build)

Files changed (3)

README.md +8 -4
app.py +4 -13
requirements.txt +7 -1

README.md CHANGED

@@ -25,9 +25,13 @@ Two tabs:
    Hadamard rotation Gaussianizes per-block weight distributions and
    reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
-## Notes
-- Python pinned to 3.12 (3.13 dropped stdlib `audioop` which Gradio's
   pydub dep needs).
-- First call cold-starts the model (~668 MB GGUF download). Subsequent
-  calls are fast.

    Hadamard rotation Gaussianizes per-block weight distributions and
    reduces the per-block max-abs that drives Q4 / Q4_K rounding error.
+## Build details
+- **Python 3.12** pinned (3.13 dropped stdlib `audioop` which Gradio's
   pydub dep needs).
+- **llama-cpp-python** installed from a **prebuilt wheel** at
+  [AIencoder/llama-cpp-wheels](https://huggingface.co/datasets/AIencoder/llama-cpp-wheels)
+  (variant `0.3.16+basic_avx2_fma_f16c-cp312`). HF Spaces don't reliably
+  build this from source, so we ship the binary.
+- First `generate` cold-starts (~668 MB GGUF download). Subsequent calls
+  are fast (model stays loaded in memory).
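
The README's claim that a Hadamard rotation reduces the per-block max-abs can be checked with a small stand-alone sketch. It is not the repository's `block_hadamard_inplace` (that code is not shown in this diff). The function names, the block size of 32, and the synthetic weights with injected outliers are all assumptions made for illustration.

```python
import math
import random


def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(block)
    if n == 0 or n & (n - 1):
        raise ValueError("block length must be a power of two")
    out = list(block)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # keeps the transform norm-preserving
    return [v * scale for v in out]


def rotate_blocks(values, block):
    """Apply the transform independently to each consecutive block."""
    out = []
    for i in range(0, len(values), block):
        out.extend(fwht(values[i:i + block]))
    return out


def block_max_abs(values, block):
    """Largest magnitude in each block (this sets the quantization scale)."""
    return [max(abs(v) for v in values[i:i + block])
            for i in range(0, len(values), block)]


# Illustrative data (assumption): small Gaussian weights plus one outlier per block.
rng = random.Random(0)
BLOCK = 32
weights = [rng.gauss(0.0, 0.02) for _ in range(2 * BLOCK)]
weights[5] = 1.0
weights[40] = -0.8

raw_max = block_max_abs(weights, BLOCK)
rot_max = block_max_abs(rotate_blocks(weights, BLOCK), BLOCK)
for raw, rot in zip(raw_max, rot_max):
    print(f"max-abs before: {raw:.3f}  after: {rot:.3f}")
```

The rotation spreads each outlier's energy across all 32 entries of its block, so the largest value the quantizer has to cover shrinks while the block's L2 norm is unchanged.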

app.py CHANGED

@@ -1,7 +1,8 @@
 """TurboCPP — llama.cpp + TurboQuant — HuggingFace Space.
 Two tabs:
-  1. Run inference: live llama.cpp on TinyLlama-1.1B-Chat-Q4_K_M.
   2. TurboQuant math viz: shows what the offline rotation does to the
      weight distribution that quantization sees.
 """
@@ -23,9 +24,6 @@ from hadamard import block_hadamard_inplace
 from bench import heavy_tailed_weight, measure
-# ---------------------------------------------------------------------------
-# Inference tab — lazy-load llama-cpp-python + a small GGUF.
-# ---------------------------------------------------------------------------
 _llm = None
 _load_error = None
@@ -91,9 +89,6 @@ def chat(prompt: str, max_tokens: int, temperature: float):
     return text or "(empty)", stats
-# ---------------------------------------------------------------------------
-# Visualization tab
-# ---------------------------------------------------------------------------
 def _plot(W_raw, W_rot, block):
     fig, axes = plt.subplots(1, 3, figsize=(13, 3.6))
     raw = W_raw.flatten().numpy()
@@ -149,16 +144,12 @@ def visualize(rows, cols, block, seed):
     return _plot(W, W_rot, int(block)), summary
-# ---------------------------------------------------------------------------
-# UI
-# ---------------------------------------------------------------------------
 with gr.Blocks(title="turbocpp - llama.cpp + TurboQuant",
                theme=gr.themes.Soft()) as demo:
     gr.Markdown("# turbocpp - llama.cpp + TurboQuant")
     gr.Markdown(
-        "Live llama.cpp running TinyLlama-1.1B-Chat (Q4_K_M) plus an "
-        "interactive math visualizer for the Hadamard-rotation "
-        "preprocessor. "
         "Code: [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp)"
     )

 """TurboCPP — llama.cpp + TurboQuant — HuggingFace Space.
 Two tabs:
+  1. Run inference: live llama.cpp on TinyLlama-1.1B-Chat-Q4_K_M via the
+     prebuilt llama-cpp-python wheel from AIencoder/llama-cpp-wheels.
   2. TurboQuant math viz: shows what the offline rotation does to the
      weight distribution that quantization sees.
 """
 from bench import heavy_tailed_weight, measure
 _llm = None
 _load_error = None
     return text or "(empty)", stats
 def _plot(W_raw, W_rot, block):
     fig, axes = plt.subplots(1, 3, figsize=(13, 3.6))
     raw = W_raw.flatten().numpy()
     return _plot(W, W_rot, int(block)), summary
 with gr.Blocks(title="turbocpp - llama.cpp + TurboQuant",
                theme=gr.themes.Soft()) as demo:
     gr.Markdown("# turbocpp - llama.cpp + TurboQuant")
     gr.Markdown(
+        "Live llama.cpp running TinyLlama-1.1B-Chat (Q4_K_M) via a "
+        "prebuilt wheel + interactive Hadamard-rotation visualizer. "
         "Code: [github.com/Ary5272/turbocpp](https://github.com/Ary5272/turbocpp)"
     )
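
The `_llm` / `_load_error` globals and the README's cold-start note suggest a lazy-load pattern, although the loader body is outside the hunks shown here. The following is a minimal sketch of how such a pattern might look, not the Space's actual code. `get_llm`, `_default_loader`, the lock, `MODEL_REPO`, `MODEL_FILE`, and `n_ctx=2048` are hypothetical; the diff does not name the GGUF repository or file.

```python
import threading

# Assumptions: the diff does not say where the ~668 MB GGUF comes from.
MODEL_REPO = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"  # hypothetical
MODEL_FILE = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"    # hypothetical

_llm = None
_load_error = None
_lock = threading.Lock()


def _default_loader():
    """Download the GGUF on first use and open it with llama-cpp-python."""
    # Imported here so the UI can still start if the wheel failed to install.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    path = hf_hub_download(repo_id=MODEL_REPO, filename=MODEL_FILE)
    return Llama(model_path=path, n_ctx=2048, verbose=False)


def get_llm(loader=_default_loader):
    """Return (model, error). The loader runs at most once per process."""
    global _llm, _load_error
    with _lock:
        if _llm is None and _load_error is None:
            try:
                _llm = loader()
            except Exception as exc:  # surface the failure in the UI instead of crashing
                _load_error = f"{type(exc).__name__}: {exc}"
    return _llm, _load_error
```

In this sketch a failed load is cached as well, so later calls report the same error rather than retrying the download; whether the Space behaves that way is not visible in the diff.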

requirements.txt CHANGED

@@ -4,5 +4,11 @@ numpy>=1.24
 torch>=2.0
 pillow>=10.0
 huggingface_hub>=0.24
-llama-cpp-python>=0.3.2
 audioop-lts; python_version >= "3.13"

 torch>=2.0
 pillow>=10.0
 huggingface_hub>=0.24
+# llama-cpp-python: prebuilt wheel from AIencoder/llama-cpp-wheels.
+# CPU-only (no BLAS dep), AVX2 + FMA + F16C — the right baseline for HF
+# Spaces' x86_64 CPU. Avoids source-build, which is unreliable on Spaces.
+https://huggingface.co/datasets/AIencoder/llama-cpp-wheels/resolve/main/llama_cpp_python-0.3.16%2Bbasic_avx2_fma_f16c-cp312-cp312-manylinux_2_31_x86_64.whl
+# Backport for stdlib audioop, removed in Python 3.13.
 audioop-lts; python_version >= "3.13"
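
Because the requirement is now a direct URL to a `cp312` wheel, the install depends on the interpreter matching that tag, which is why the Python 3.12 pin matters. The sketch below parses the wheel filename from the URL above and compares its Python tag with an interpreter version. `parse_wheel_url` and `matches_interpreter` are hypothetical helpers, and the check is simplified: it assumes CPython and a single, uncompressed Python tag.

```python
import sys
from urllib.parse import unquote, urlparse

WHEEL_URL = (
    "https://huggingface.co/datasets/AIencoder/llama-cpp-wheels/resolve/main/"
    "llama_cpp_python-0.3.16%2Bbasic_avx2_fma_f16c-cp312-cp312-manylinux_2_31_x86_64.whl"
)


def parse_wheel_url(url):
    """Split a wheel URL's filename into its name, version, and tag fields."""
    filename = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, py_tag, abi_tag, platform_tag = parts
    elif len(parts) == 6:  # optional build tag between version and Python tag
        name, version, _build, py_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename: {filename}")
    return {
        "name": name,
        "version": version,
        "python_tag": py_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def matches_interpreter(py_tag, version_info=sys.version_info):
    """Simplified check: True if the tag names this CPython major.minor."""
    return py_tag == f"cp{version_info[0]}{version_info[1]}"


info = parse_wheel_url(WHEEL_URL)
print(info["version"], info["python_tag"], info["platform_tag"])
print("compatible with running interpreter:", matches_interpreter(info["python_tag"]))
```

On a 3.13 interpreter this check fails, so pip would be expected to reject the wheel rather than fall back to a source build.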