Commit 1e16cb3
Parent(s): 659322c
fix: enable GGUF generation in Space via pre-built llama-cpp-python CPU wheel
Add llama-cpp-python to requirements.txt using the abetlen CPU wheel index so
the HF Space build installs a pre-compiled wheel instead of compiling from source,
restoring the Generate button and interactive Play tab.
Made-with: Cursor
- README.md +7 -11
- app.py +24 -20
- requirements-play.txt +2 -4
- requirements.txt +4 -4
README.md
CHANGED
|
@@ -16,7 +16,7 @@ Interactive **benchmark dashboard** for TinyLlama LoRA fusion → **GGUF** expor
|
|
| 16 |
|
| 17 |
## What this Space shows
|
| 18 |
|
| 19 |
-
- **Try a quant** — in-browser GGUF chat **
|
| 20 |
- Deployment comparison (size, throughput, memory, perplexity, batch TPS)
|
| 21 |
- Plots: throughput vs quantization, size vs perplexity, batched inference
|
| 22 |
- Instructions to run `llama-server` locally (`inference_server/`)
|
|
@@ -29,7 +29,7 @@ Quantized **`.gguf`** files live in the model repo **[ysingh-aiml/tinyllama-alpa
|
|
| 29 |
- [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
|
| 30 |
- [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
|
| 31 |
|
| 32 |
-
You can
|
| 33 |
|
| 34 |
## Repository layout
|
| 35 |
|
|
@@ -37,8 +37,8 @@ You can still export locally with `task4_quantization_gguf.py` or copy from `res
|
|
| 37 |
task4_hf/
|
| 38 |
├── app.py
|
| 39 |
├── gguf_chat.py
|
| 40 |
-
├── requirements.txt
|
| 41 |
-
├── requirements-play.txt
|
| 42 |
├── runtime.txt
|
| 43 |
├── results/
|
| 44 |
│ ├── deployment_comparison.json
|
|
@@ -55,21 +55,17 @@ task4_hf/
|
|
| 55 |
│ └── requirements.txt
|
| 56 |
└── models/
|
| 57 |
├── README.md
|
| 58 |
-
└── *.gguf # optional;
|
| 59 |
```
|
| 60 |
|
| 61 |
## Hardware note
|
| 62 |
|
| 63 |
Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
|
| 64 |
|
| 65 |
-
## Local:
|
| 66 |
|
| 67 |
```bash
|
| 68 |
cd task4_hf
|
| 69 |
-
pip install -r requirements
|
| 70 |
python app.py
|
| 71 |
```
|
| 72 |
-
|
| 73 |
-
## Space build note
|
| 74 |
-
|
| 75 |
-
`runtime.txt` requests **Python 3.10**. If the builder still uses 3.13, optional wheels may be missing — keeping `llama-cpp-python` out of the default `requirements.txt` avoids long compiles and **timeouts**.
|
|
|
|
| 16 |
|
| 17 |
## What this Space shows
|
| 18 |
|
| 19 |
+
- **Try a quant** — in-browser GGUF chat powered by **`llama-cpp-python`** (pre-built CPU wheel, no compilation needed). First generation downloads the GGUF weights (~640 MB – 1.1 GB) and loads them on CPU — allow ~1 minute.
|
| 20 |
- Deployment comparison (size, throughput, memory, perplexity, batch TPS)
|
| 21 |
- Plots: throughput vs quantization, size vs perplexity, batched inference
|
| 22 |
- Instructions to run `llama-server` locally (`inference_server/`)
|
|
|
|
| 29 |
- [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
|
| 30 |
- [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
|
| 31 |
|
| 32 |
+
You can also export locally with `task4_quantization_gguf.py` or copy from `results/task4/`.
|
| 33 |
|
| 34 |
## Repository layout
|
| 35 |
|
|
|
|
| 37 |
task4_hf/
|
| 38 |
├── app.py
|
| 39 |
├── gguf_chat.py
|
| 40 |
+
├── requirements.txt # includes llama-cpp-python via pre-built CPU wheel
|
| 41 |
+
├── requirements-play.txt # convenience alias (-r requirements.txt)
|
| 42 |
├── runtime.txt
|
| 43 |
├── results/
|
| 44 |
│ ├── deployment_comparison.json
|
|
|
|
| 55 |
│ └── requirements.txt
|
| 56 |
└── models/
|
| 57 |
├── README.md
|
| 58 |
+
└── *.gguf # optional; inference_server/server.py downloads from the Hub if missing
|
| 59 |
```
|
| 60 |
|
| 61 |
## Hardware note
|
| 62 |
|
| 63 |
Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
|
| 64 |
|
| 65 |
+
## Local: run "Try a quant" chat
|
| 66 |
|
| 67 |
```bash
|
| 68 |
cd task4_hf
|
| 69 |
+
pip install -r requirements.txt # includes llama-cpp-python via pre-built CPU wheel
|
| 70 |
python app.py
|
| 71 |
```
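The "Try a quant" flow described in the README (download a GGUF file from the model repo on first use, then load it on CPU) can be sketched roughly as follows. This is a minimal illustration, not the contents of `gguf_chat.py`: the helper names `gguf_filename` and `generate` are hypothetical, and `n_ctx=2048` is an assumed context size. The file-name pattern follows the `model-Q5_K_M.gguf` and `model-Q8_0.gguf` links listed above.

```python
REPO_ID = "ysingh-aiml/tinyllama-alpaca-lora-gguf"  # model repo named in the README


def gguf_filename(quant: str) -> str:
    """Map a quant label such as "Q8_0" to the file name used in the model repo."""
    return f"model-{quant}.gguf"


def generate(quant: str, prompt: str, max_tokens: int = 64) -> str:
    """Download the chosen quant (cached after the first call) and run one completion."""
    # Imported lazily so this module still imports when the optional packages are absent.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    model_path = hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant))
    # n_ctx is an assumption for illustration; verbose=False silences llama.cpp logging.
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm.create_completion(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]
```

Calling `generate("Q8_0", "Hello")` would trigger the one-time download mentioned in the README, which is why the first generation is slow.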
|
app.py
CHANGED
|
@@ -18,8 +18,8 @@ import gradio as gr
|
|
| 18 |
ROOT = Path(__file__).resolve().parent
|
| 19 |
RESULTS = ROOT / "results"
|
| 20 |
PLOTS = ROOT / "plots"
|
| 21 |
-
DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-
|
| 22 |
-
DEBUG_SESSION_ID = "
|
| 23 |
|
| 24 |
|
| 25 |
def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
|
|
@@ -120,7 +120,7 @@ def _play_intro_md() -> str:
|
|
| 120 |
err = f"{type(exc).__name__}: {exc}"
|
| 121 |
# region agent log
|
| 122 |
_debug_log(
|
| 123 |
-
"
|
| 124 |
"H1",
|
| 125 |
"task4_hf/app.py:_play_intro_md",
|
| 126 |
"Play intro dependency probe",
|
|
@@ -138,17 +138,16 @@ def _play_intro_md() -> str:
|
|
| 138 |
if ok:
|
| 139 |
return (
|
| 140 |
base
|
| 141 |
-
+ "Select a **GGUF**
|
| 142 |
-
"
|
|
|
|
| 143 |
)
|
| 144 |
return (
|
| 145 |
base
|
| 146 |
-
+ "### Interactive chat unavailable
|
| 147 |
-
"
|
| 148 |
-
"can finish the image build in time. On many builders, that package **compiles from source** "
|
| 149 |
-
"(slow) or lacks **prebuilt wheels** for **Python 3.13**, which causes **build timeouts**.\n\n"
|
| 150 |
"**Run locally:**\n"
|
| 151 |
-
"```bash\npip install -r requirements
|
| 152 |
"Or use the **Local server** tab with a system `llama-server` binary."
|
| 153 |
)
|
| 154 |
|
|
@@ -165,7 +164,7 @@ def _play_deps_status() -> tuple[bool, str]:
|
|
| 165 |
def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
|
| 166 |
# region agent log
|
| 167 |
_debug_log(
|
| 168 |
-
"
|
| 169 |
"H4",
|
| 170 |
"task4_hf/app.py:_play_generate",
|
| 171 |
"Generate clicked",
|
|
@@ -182,7 +181,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
|
|
| 182 |
ok, err = _deps_ok()
|
| 183 |
# region agent log
|
| 184 |
_debug_log(
|
| 185 |
-
"
|
| 186 |
"H2",
|
| 187 |
"task4_hf/app.py:_play_generate",
|
| 188 |
"Dependency check result in generate path",
|
|
@@ -196,7 +195,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
|
|
| 196 |
if not ok:
|
| 197 |
# region agent log
|
| 198 |
_debug_log(
|
| 199 |
-
"
|
| 200 |
"H3",
|
| 201 |
"task4_hf/app.py:_play_generate",
|
| 202 |
"Returning missing dependency message",
|
|
@@ -216,7 +215,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
|
|
| 216 |
)
|
| 217 |
# region agent log
|
| 218 |
_debug_log(
|
| 219 |
-
"
|
| 220 |
"H4",
|
| 221 |
"task4_hf/app.py:_play_generate",
|
| 222 |
"Generate returned",
|
|
@@ -227,7 +226,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
|
|
| 227 |
except Exception as exc: # noqa: BLE001
|
| 228 |
# region agent log
|
| 229 |
_debug_log(
|
| 230 |
-
"
|
| 231 |
"H4",
|
| 232 |
"task4_hf/app.py:_play_generate",
|
| 233 |
"Generate exception",
|
|
@@ -242,11 +241,16 @@ def build_app() -> gr.Blocks:
|
|
| 242 |
play_ok, play_err = _play_deps_status()
|
| 243 |
# region agent log
|
| 244 |
_debug_log(
|
| 245 |
-
"
|
| 246 |
-
"
|
| 247 |
"task4_hf/app.py:build_app",
|
| 248 |
"Play controls availability decided",
|
| 249 |
-
{
|
| 250 |
)
|
| 251 |
# endregion
|
| 252 |
|
|
@@ -262,7 +266,7 @@ def build_app() -> gr.Blocks:
|
|
| 262 |
|
| 263 |
**GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
|
| 264 |
|
| 265 |
-
The **Try a quant** tab
|
| 266 |
"""
|
| 267 |
|
| 268 |
with gr.Blocks(
|
|
@@ -307,7 +311,7 @@ The **Try a quant** tab needs **`llama-cpp-python`** (optional on the hosted Spa
|
|
| 307 |
play_status = gr.Markdown(value=play_status_text)
|
| 308 |
# region agent log
|
| 309 |
_debug_log(
|
| 310 |
-
"
|
| 311 |
"H5",
|
| 312 |
"task4_hf/app.py:build_app",
|
| 313 |
"Play widget interactivity configured",
|
|
|
|
| 18 |
ROOT = Path(__file__).resolve().parent
|
| 19 |
RESULTS = ROOT / "results"
|
| 20 |
PLOTS = ROOT / "plots"
|
| 21 |
+
DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-365fe9.log")
|
| 22 |
+
DEBUG_SESSION_ID = "365fe9"
|
| 23 |
|
| 24 |
|
| 25 |
def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
|
|
|
|
| 120 |
err = f"{type(exc).__name__}: {exc}"
|
| 121 |
# region agent log
|
| 122 |
_debug_log(
|
| 123 |
+
"run-365fe9",
|
| 124 |
"H1",
|
| 125 |
"task4_hf/app.py:_play_intro_md",
|
| 126 |
"Play intro dependency probe",
|
|
|
|
| 138 |
if ok:
|
| 139 |
return (
|
| 140 |
base
|
| 141 |
+
+ "Select a **GGUF** quantization, type a message, and hit **Generate**. "
|
| 142 |
+
"The first run **downloads** the weights (~640 MB – 1.1 GB) and loads them with "
|
| 143 |
+
"`llama-cpp-python` (CPU — first generation can take a minute)."
|
| 144 |
)
|
| 145 |
return (
|
| 146 |
base
|
| 147 |
+
+ "### Interactive chat unavailable\n\n"
|
| 148 |
+
f"`llama-cpp-python` failed to load: `{err}`\n\n"
|
|
|
|
|
|
|
| 149 |
"**Run locally:**\n"
|
| 150 |
+
"```bash\npip install -r requirements.txt\npython app.py\n```\n\n"
|
| 151 |
"Or use the **Local server** tab with a system `llama-server` binary."
|
| 152 |
)
|
| 153 |
|
|
|
|
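The hunk above reports a failed import as `{type(exc).__name__}: {exc}`. The body of `_deps_ok()` is not shown in this diff, so the following is only a sketch of how such a dependency probe might look; `probe_dependency` and `runtime_info` are hypothetical names. It also reads the interpreter details from `sys` directly, which avoids relying on the `os.sys` attribute.

```python
import importlib
import sys


def probe_dependency(module_name: str = "llama_cpp") -> tuple[bool, str]:
    """Try to import a module; return (ok, error_text) without raising."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # a broken native wheel can raise more than ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


def runtime_info() -> dict:
    """Collect the interpreter details that the debug log records."""
    return {"python_version": sys.version, "platform": sys.platform}
```

A UI can call `probe_dependency()` once at startup and use the result both to enable the Generate button and to render the error text in the "Interactive chat unavailable" message.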
| 164 |
def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
|
| 165 |
# region agent log
|
| 166 |
_debug_log(
|
| 167 |
+
"run-365fe9",
|
| 168 |
"H4",
|
| 169 |
"task4_hf/app.py:_play_generate",
|
| 170 |
"Generate clicked",
|
|
|
|
| 181 |
ok, err = _deps_ok()
|
| 182 |
# region agent log
|
| 183 |
_debug_log(
|
| 184 |
+
"run-365fe9",
|
| 185 |
"H2",
|
| 186 |
"task4_hf/app.py:_play_generate",
|
| 187 |
"Dependency check result in generate path",
|
|
|
|
| 195 |
if not ok:
|
| 196 |
# region agent log
|
| 197 |
_debug_log(
|
| 198 |
+
"run-365fe9",
|
| 199 |
"H3",
|
| 200 |
"task4_hf/app.py:_play_generate",
|
| 201 |
"Returning missing dependency message",
|
|
|
|
| 215 |
)
|
| 216 |
# region agent log
|
| 217 |
_debug_log(
|
| 218 |
+
"run-365fe9",
|
| 219 |
"H4",
|
| 220 |
"task4_hf/app.py:_play_generate",
|
| 221 |
"Generate returned",
|
|
|
|
| 226 |
except Exception as exc: # noqa: BLE001
|
| 227 |
# region agent log
|
| 228 |
_debug_log(
|
| 229 |
+
"run-365fe9",
|
| 230 |
"H4",
|
| 231 |
"task4_hf/app.py:_play_generate",
|
| 232 |
"Generate exception",
|
|
|
|
| 241 |
play_ok, play_err = _play_deps_status()
|
| 242 |
# region agent log
|
| 243 |
_debug_log(
|
| 244 |
+
"run-365fe9",
|
| 245 |
+
"H1",
|
| 246 |
"task4_hf/app.py:build_app",
|
| 247 |
"Play controls availability decided",
|
| 248 |
+
{
|
| 249 |
+
"play_ok": bool(play_ok),
|
| 250 |
+
"play_err": play_err if not play_ok else "",
|
| 251 |
+
"python_version": os.sys.version,
|
| 252 |
+
"platform": os.sys.platform,
|
| 253 |
+
},
|
| 254 |
)
|
| 255 |
# endregion
|
| 256 |
|
|
|
|
| 266 |
|
| 267 |
**GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
|
| 268 |
|
| 269 |
+
The **Try a quant** tab lets you run GGUF generation in-browser via **`llama-cpp-python`** (CPU; first generation downloads weights and may take a minute). Use the **Local server** tab for `llama-server`.
|
| 270 |
"""
|
| 271 |
|
| 272 |
with gr.Blocks(
|
|
|
|
| 311 |
play_status = gr.Markdown(value=play_status_text)
|
| 312 |
# region agent log
|
| 313 |
_debug_log(
|
| 314 |
+
"run-365fe9",
|
| 315 |
"H5",
|
| 316 |
"task4_hf/app.py:build_app",
|
| 317 |
"Play widget interactivity configured",
|
requirements-play.txt
CHANGED
|
@@ -1,5 +1,3 @@
|
|
| 1 |
-
#
|
| 2 |
-
#
|
| 3 |
-
# Avoid Python 3.13 until llama-cpp-python publishes cp313 wheels broadly.
|
| 4 |
-r requirements.txt
|
| 5 |
-
llama-cpp-python>=0.2.90,<0.4
|
|
|
|
| 1 |
+
# Enable the "Try a quant" tab (GGUF chat in-process).
|
| 2 |
+
# requirements.txt now includes llama-cpp-python via pre-built CPU wheels.
|
|
|
|
| 3 |
-r requirements.txt
|
|
|
requirements.txt
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
-
# Gradio Space —
|
| 2 |
-
# exceeds HF build time limits, especially on Python 3.13 without prebuilt wheels).
|
| 3 |
gradio>=5.10.0,<6
|
| 4 |
audioop-lts>=0.2.1
|
| 5 |
matplotlib>=3.8.0
|
| 6 |
numpy>=1.26.0
|
| 7 |
huggingface_hub>=0.20.0
|
| 8 |
-
|
| 9 |
-
|
|
|
|
|
|
| 1 |
+
# Gradio Space — uses pre-built CPU-only llama-cpp-python wheels to avoid source compilation.
|
|
|
|
| 2 |
gradio>=5.10.0,<6
|
| 3 |
audioop-lts>=0.2.1
|
| 4 |
matplotlib>=3.8.0
|
| 5 |
numpy>=1.26.0
|
| 6 |
huggingface_hub>=0.20.0
|
| 7 |
+
# Pre-built CPU wheel avoids source compilation (no BLAS/cmake needed, fast install).
|
| 8 |
+
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
|
| 9 |
+
llama-cpp-python>=0.2.90,<0.4
|
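To confirm that the wheel installed from the extra index satisfies the `llama-cpp-python>=0.2.90,<0.4` pin, a small check along these lines could be run inside the Space or locally. It is a sketch under simplifying assumptions: `parse_version`, `within_pin`, and `installed_version_ok` are hypothetical helpers, and the parser only handles plain numeric versions such as `0.3.2` (pre-release tags would need a real version parser).

```python
from importlib import metadata


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "0.3.2" into (0, 3, 2); assumes the first three parts are plain integers."""
    return tuple(int(part) for part in text.split(".")[:3])


def within_pin(version: str, lower=(0, 2, 90), upper=(0, 4)) -> bool:
    """Check lower <= version < upper, mirroring the >=0.2.90,<0.4 requirement."""
    return lower <= parse_version(version) < upper


def installed_version_ok(package: str = "llama-cpp-python") -> bool:
    """Return False when the package is missing or falls outside the pin."""
    try:
        return within_pin(metadata.version(package))
    except metadata.PackageNotFoundError:
        return False
```

Tuple comparison keeps the check dependency-free; `(0, 4, 0) < (0, 4)` is false in Python, so version 0.4.0 is correctly rejected by the upper bound.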