Space: ysingh-aiml/tinyllama-quantization-gguf
ysingh-aiml committed on Mar 26

Commit 1e16cb3 (1 parent: 659322c)

fix: enable GGUF generation in Space via pre-built llama-cpp-python CPU wheel

Add llama-cpp-python to requirements.txt using the abetlen CPU wheel index so
the HF Space build installs a pre-compiled wheel instead of compiling from source,
restoring the Generate button and interactive Play tab.

Made-with: Cursor

Files changed (4)

README.md +7 -11
app.py +24 -20
requirements-play.txt +2 -4
requirements.txt +4 -4

README.md CHANGED

@@ -16,7 +16,7 @@ Interactive **benchmark dashboard** for TinyLlama LoRA fusion → **GGUF** expor
 ## What this Space shows
-- **Try a quant** — in-browser GGUF chat **when `llama-cpp-python` is installed**. The default Space **`requirements.txt` omits it** so builds finish within the HF time limit (that package often **compiles from source** or lacks wheels for **Python 3.13**). Use **`requirements-play.txt`** locally or a custom Space.
 - Deployment comparison (size, throughput, memory, perplexity, batch TPS)
 - Plots: throughput vs quantization, size vs perplexity, batched inference
 - Instructions to run `llama-server` locally (`inference_server/`)
@@ -29,7 +29,7 @@ Quantized **`.gguf`** files live in the model repo **[ysingh-aiml/tinyllama-alpa
 - [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
 - [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
-You can still export locally with `task4_quantization_gguf.py` or copy from `results/task4/` if you prefer.
 ## Repository layout
@@ -37,8 +37,8 @@ You can still export locally with `task4_quantization_gguf.py` or copy from `res
 task4_hf/
 ├── app.py
 ├── gguf_chat.py
-├── requirements.txt
-├── requirements-play.txt   # optional: llama-cpp-python for Play tab
 ├── runtime.txt
 ├── results/
 │   ├── deployment_comparison.json
@@ -55,21 +55,17 @@ task4_hf/
 │   └── requirements.txt
 └── models/
     ├── README.md
-    └── *.gguf      # optional; `inference_server/server.py` downloads from the Hub if missing
 ```
 ## Hardware note
 Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
-## Local: enable “Try a quant” chat
 ```bash
 cd task4_hf
-pip install -r requirements-play.txt   # adds llama-cpp-python; prefer Python 3.10/3.11
 python app.py
 ```
-## Space build note
-`runtime.txt` requests **Python 3.10**. If the builder still uses 3.13, optional wheels may be missing — keeping `llama-cpp-python` out of the default `requirements.txt` avoids long compiles and **timeouts**.

 ## What this Space shows
+- **Try a quant** — in-browser GGUF chat powered by **`llama-cpp-python`** (pre-built CPU wheel, no compilation needed). First generation downloads the GGUF weights (~640 MB – 1.1 GB) and loads them on CPU — allow ~1 minute.
 - Deployment comparison (size, throughput, memory, perplexity, batch TPS)
 - Plots: throughput vs quantization, size vs perplexity, batched inference
 - Instructions to run `llama-server` locally (`inference_server/`)
 - [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
 - [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
+You can also export locally with `task4_quantization_gguf.py` or copy from `results/task4/`.
 ## Repository layout
 task4_hf/
 ├── app.py
 ├── gguf_chat.py
+├── requirements.txt          # includes llama-cpp-python via pre-built CPU wheel
+├── requirements-play.txt     # convenience alias (-r requirements.txt)
 ├── runtime.txt
 ├── results/
 │   ├── deployment_comparison.json
 │   └── requirements.txt
 └── models/
     ├── README.md
+    └── *.gguf      # optional; inference_server/server.py downloads from the Hub if missing
 ```
 ## Hardware note
 Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
+## Local: run "Try a quant" chat
 ```bash
 cd task4_hf
+pip install -r requirements.txt   # includes llama-cpp-python via pre-built CPU wheel
 python app.py
 ```
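
The README links each quantized file through its `resolve/main` URL in the model repo. As a minimal sketch, a quant label could be mapped to its download URL as follows. The helper name `gguf_url` and the `GGUF_FILES` table are hypothetical and not part of this commit; only the two quantizations linked in the README excerpt are listed.

```python
"""Map a quantization label to its GGUF file URL (illustrative helper)."""

# Model repo named in the README.
REPO_ID = "ysingh-aiml/tinyllama-alpaca-lora-gguf"

# Assumption: only the two files linked in the README excerpt are listed here.
GGUF_FILES = {
    "Q5_K_M": "model-Q5_K_M.gguf",
    "Q8_0": "model-Q8_0.gguf",
}


def gguf_url(quant_label: str, revision: str = "main") -> str:
    """Return the direct-download URL for a known quantization label."""
    try:
        filename = GGUF_FILES[quant_label]
    except KeyError:
        known = ", ".join(sorted(GGUF_FILES))
        raise ValueError(
            f"unknown quant {quant_label!r}; expected one of: {known}"
        ) from None
    return f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{filename}"


if __name__ == "__main__":
    print(gguf_url("Q8_0"))
```

Inside the app, `huggingface_hub.hf_hub_download(repo_id=..., filename=...)` is likely the better choice because it caches the file locally; the plain URL form is mainly useful for `curl` or a browser.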

app.py CHANGED

@@ -18,8 +18,8 @@ import gradio as gr
 ROOT = Path(__file__).resolve().parent
 RESULTS = ROOT / "results"
 PLOTS = ROOT / "plots"
-DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-64cf04.log")
-DEBUG_SESSION_ID = "64cf04"
 def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
@@ -120,7 +120,7 @@ def _play_intro_md() -> str:
         err = f"{type(exc).__name__}: {exc}"
     # region agent log
     _debug_log(
-        "pre-fix-1",
         "H1",
         "task4_hf/app.py:_play_intro_md",
         "Play intro dependency probe",
@@ -138,17 +138,16 @@ def _play_intro_md() -> str:
     if ok:
         return (
             base
-            + "Select a **GGUF**; the first run **downloads** weights then loads them with "
-            "`llama-cpp-python` (CPU — first load can take minutes on small hardware)."
         )
     return (
         base
-        + "### Interactive chat unavailable in this Space build\n\n"
-        "The default `requirements.txt` **does not** install `llama-cpp-python` so Hugging Face "
-        "can finish the image build in time. On many builders, that package **compiles from source** "
-        "(slow) or lacks **prebuilt wheels** for **Python 3.13**, which causes **build timeouts**.\n\n"
         "**Run locally:**\n"
-        "```bash\npip install -r requirements-play.txt\npython app.py\n```\n\n"
         "Or use the **Local server** tab with a system `llama-server` binary."
     )
@@ -165,7 +164,7 @@ def _play_deps_status() -> tuple[bool, str]:
 def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
     # region agent log
     _debug_log(
-        "pre-fix-1",
         "H4",
         "task4_hf/app.py:_play_generate",
         "Generate clicked",
@@ -182,7 +181,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
         ok, err = _deps_ok()
         # region agent log
         _debug_log(
-            "pre-fix-1",
             "H2",
             "task4_hf/app.py:_play_generate",
             "Dependency check result in generate path",
@@ -196,7 +195,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
         if not ok:
             # region agent log
             _debug_log(
-                "pre-fix-1",
                 "H3",
                 "task4_hf/app.py:_play_generate",
                 "Returning missing dependency message",
@@ -216,7 +215,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
         )
         # region agent log
         _debug_log(
-            "pre-fix-1",
             "H4",
             "task4_hf/app.py:_play_generate",
             "Generate returned",
@@ -227,7 +226,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
     except Exception as exc:  # noqa: BLE001
         # region agent log
         _debug_log(
-            "pre-fix-1",
             "H4",
             "task4_hf/app.py:_play_generate",
             "Generate exception",
@@ -242,11 +241,16 @@ def build_app() -> gr.Blocks:
     play_ok, play_err = _play_deps_status()
     # region agent log
     _debug_log(
-        "post-fix-1",
-        "H4",
         "task4_hf/app.py:build_app",
         "Play controls availability decided",
-        {"play_ok": bool(play_ok), "play_err": play_err if not play_ok else ""},
     )
     # endregion
@@ -262,7 +266,7 @@ def build_app() -> gr.Blocks:
 **GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
-The **Try a quant** tab needs **`llama-cpp-python`** (optional on the hosted Space — see that tab). Otherwise use **`inference_server/server.py`** with `llama-server`, or install **`requirements-play.txt`** locally.
 """
     with gr.Blocks(
@@ -307,7 +311,7 @@ The **Try a quant** tab needs **`llama-cpp-python`** (optional on the hosted Spa
                 play_status = gr.Markdown(value=play_status_text)
                 # region agent log
                 _debug_log(
-                    "pre-fix-2",
                     "H5",
                     "task4_hf/app.py:build_app",
                     "Play widget interactivity configured",

 ROOT = Path(__file__).resolve().parent
 RESULTS = ROOT / "results"
 PLOTS = ROOT / "plots"
+DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-365fe9.log")
+DEBUG_SESSION_ID = "365fe9"
 def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
         err = f"{type(exc).__name__}: {exc}"
     # region agent log
     _debug_log(
+        "run-365fe9",
         "H1",
         "task4_hf/app.py:_play_intro_md",
         "Play intro dependency probe",
     if ok:
         return (
             base
+            + "Select a **GGUF** quantization, type a message, and hit **Generate**. "
+            "The first run **downloads** the weights (~640 MB – 1.1 GB) and loads them with "
+            "`llama-cpp-python` (CPU — first generation can take a minute)."
         )
     return (
         base
+        + "### Interactive chat unavailable\n\n"
+        f"`llama-cpp-python` failed to load: `{err}`\n\n"
         "**Run locally:**\n"
+        "```bash\npip install -r requirements.txt\npython app.py\n```\n\n"
         "Or use the **Local server** tab with a system `llama-server` binary."
     )
 def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
     # region agent log
     _debug_log(
+        "run-365fe9",
         "H4",
         "task4_hf/app.py:_play_generate",
         "Generate clicked",
         ok, err = _deps_ok()
         # region agent log
         _debug_log(
+            "run-365fe9",
             "H2",
             "task4_hf/app.py:_play_generate",
             "Dependency check result in generate path",
         if not ok:
             # region agent log
             _debug_log(
+                "run-365fe9",
                 "H3",
                 "task4_hf/app.py:_play_generate",
                 "Returning missing dependency message",
         )
         # region agent log
         _debug_log(
+            "run-365fe9",
             "H4",
             "task4_hf/app.py:_play_generate",
             "Generate returned",
     except Exception as exc:  # noqa: BLE001
         # region agent log
         _debug_log(
+            "run-365fe9",
             "H4",
             "task4_hf/app.py:_play_generate",
             "Generate exception",
     play_ok, play_err = _play_deps_status()
     # region agent log
     _debug_log(
+        "run-365fe9",
+        "H1",
         "task4_hf/app.py:build_app",
         "Play controls availability decided",
+        {
+            "play_ok": bool(play_ok),
+            "play_err": play_err if not play_ok else "",
+            "python_version": os.sys.version,
+            "platform": os.sys.platform,
+        },
     )
     # endregion
 **GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
+The **Try a quant** tab lets you run GGUF generation in-browser via **`llama-cpp-python`** (CPU; first generation downloads weights and may take a minute). Use the **Local server** tab for `llama-server`.
 """
     with gr.Blocks(
                 play_status = gr.Markdown(value=play_status_text)
                 # region agent log
                 _debug_log(
+                    "run-365fe9",
                     "H5",
                     "task4_hf/app.py:build_app",
                     "Play widget interactivity configured",

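The diff shows only the signature of `_debug_log`, and `DEBUG_LOG_PATH` points at a path on a developer machine that will not exist inside the Space container. The sketch below shows one way such a logger could tolerate a missing path. It assumes a JSON-lines file format; the function name `debug_log`, its extra `log_path` and `session_id` parameters, and the record field names are illustrative and not taken from the commit.

```python
import json
import time
from pathlib import Path


def debug_log(
    log_path: Path,
    session_id: str,
    run_id: str,
    hypothesis_id: str,
    location: str,
    message: str,
    data: dict,
) -> bool:
    """Append one JSON line to log_path.

    Returns False instead of raising when the file cannot be opened or
    written, e.g. because the parent directory does not exist.
    """
    record = {
        "sessionId": session_id,  # hypothetical field names
        "runId": run_id,
        "hypothesisId": hypothesis_id,
        "location": location,
        "message": message,
        "data": data,
        "timestamp": int(time.time() * 1000),  # milliseconds since epoch
    }
    try:
        with log_path.open("a", encoding="utf-8") as fh:
            # default=str keeps non-JSON values (paths, exceptions) loggable.
            fh.write(json.dumps(record, default=str) + "\n")
    except OSError:
        return False
    return True
```

Returning a boolean rather than raising keeps debug instrumentation from breaking the Gradio callbacks it is wrapped around.
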
requirements-play.txt CHANGED

@@ -1,5 +1,3 @@
-# Optional: enable the "Try a quant" tab (GGUF chat in-process).
-# Use Python 3.10 or 3.11 on Linux/macOS/Win so pip can install a binary wheel.
-# Avoid Python 3.13 until llama-cpp-python publishes cp313 wheels broadly.
 -r requirements.txt
-llama-cpp-python>=0.2.90,<0.4

+# Enable the "Try a quant" tab (GGUF chat in-process).
+# requirements.txt now includes llama-cpp-python via pre-built CPU wheels.
 -r requirements.txt

requirements.txt CHANGED

@@ -1,9 +1,9 @@
-# Gradio Space — fast install (no llama-cpp-python: it often compiles from source and
-# exceeds HF build time limits, especially on Python 3.13 without prebuilt wheels).
 gradio>=5.10.0,<6
 audioop-lts>=0.2.1
 matplotlib>=3.8.0
 numpy>=1.26.0
 huggingface_hub>=0.20.0
-# Interactive "Try a quant" chat: pip install -r requirements-play.txt (local or custom Space)

+# Gradio Space — uses pre-built CPU-only llama-cpp-python wheels to avoid source compilation.
 gradio>=5.10.0,<6
 audioop-lts>=0.2.1
 matplotlib>=3.8.0
 numpy>=1.26.0
 huggingface_hub>=0.20.0
+# Pre-built CPU wheel avoids source compilation (no BLAS/cmake needed, fast install).
+--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
+llama-cpp-python>=0.2.90,<0.4
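
Whether the pre-built wheel actually installed can be checked at runtime by attempting the import, which is what the `ok, err = _deps_ok()` calls in `app.py` appear to rely on. The body of `_deps_ok` is not shown in this diff, so the following is only a sketch under that assumption; `probe_import` is a hypothetical name. It reuses the `"{type(exc).__name__}: {exc}"` error format visible in `_play_intro_md`, and `llama_cpp` is the import name of the `llama-cpp-python` package.

```python
import importlib


def probe_import(module_name: str = "llama_cpp") -> tuple[bool, str]:
    """Try to import a module and report (ok, error_text).

    error_text is empty on success, otherwise "<ExceptionType>: <message>".
    """
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # noqa: BLE001 - native wheels may raise OSError, not only ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    ok, err = probe_import()
    print("llama-cpp-python available" if ok else f"unavailable: {err}")
```

Catching `Exception` broadly is deliberate here: a wheel built for a different platform or libc can fail while loading its shared library, which surfaces as an error other than `ImportError`.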