ysingh-aiml committed on
Commit
1e16cb3
·
1 Parent(s): 659322c

fix: enable GGUF generation in Space via pre-built llama-cpp-python CPU wheel


Add llama-cpp-python to requirements.txt using the abetlen CPU wheel index so
the HF Space build installs a pre-compiled wheel instead of compiling from source,
restoring the Generate button and interactive Play tab.

Made-with: Cursor

Files changed (4)
  1. README.md +7 -11
  2. app.py +24 -20
  3. requirements-play.txt +2 -4
  4. requirements.txt +4 -4
README.md CHANGED
@@ -16,7 +16,7 @@ Interactive **benchmark dashboard** for TinyLlama LoRA fusion → **GGUF** expor
16
 
17
  ## What this Space shows
18
 
19
- - **Try a quant** — in-browser GGUF chat **when `llama-cpp-python` is installed**. The default Space **`requirements.txt` omits it** so builds finish within the HF time limit (that package often **compiles from source** or lacks wheels for **Python 3.13**). Use **`requirements-play.txt`** locally or a custom Space.
20
  - Deployment comparison (size, throughput, memory, perplexity, batch TPS)
21
  - Plots: throughput vs quantization, size vs perplexity, batched inference
22
  - Instructions to run `llama-server` locally (`inference_server/`)
@@ -29,7 +29,7 @@ Quantized **`.gguf`** files live in the model repo **[ysingh-aiml/tinyllama-alpa
29
  - [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
30
  - [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
31
 
32
- You can still export locally with `task4_quantization_gguf.py` or copy from `results/task4/` if you prefer.
33
 
34
  ## Repository layout
35
 
@@ -37,8 +37,8 @@ You can still export locally with `task4_quantization_gguf.py` or copy from `res
37
  task4_hf/
38
  ├── app.py
39
  ├── gguf_chat.py
40
- ├── requirements.txt
41
- ├── requirements-play.txt # optional: llama-cpp-python for Play tab
42
  ├── runtime.txt
43
  ├── results/
44
  │ ├── deployment_comparison.json
@@ -55,21 +55,17 @@ task4_hf/
55
  │ └── requirements.txt
56
  └── models/
57
  ├── README.md
58
- └── *.gguf # optional; `inference_server/server.py` downloads from the Hub if missing
59
  ```
60
 
61
  ## Hardware note
62
 
63
  Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
64
 
65
- ## Local: enable Try a quant chat
66
 
67
  ```bash
68
  cd task4_hf
69
- pip install -r requirements-play.txt # adds llama-cpp-python; prefer Python 3.10/3.11
70
  python app.py
71
  ```
72
-
73
- ## Space build note
74
-
75
- `runtime.txt` requests **Python 3.10**. If the builder still uses 3.13, optional wheels may be missing — keeping `llama-cpp-python` out of the default `requirements.txt` avoids long compiles and **timeouts**.
 
16
 
17
  ## What this Space shows
18
 
19
+ - **Try a quant** — in-browser GGUF chat powered by **`llama-cpp-python`** (pre-built CPU wheel, no compilation needed). First generation downloads the GGUF weights (~640 MB – 1.1 GB) and loads them on CPU – allow ~1 minute.
20
  - Deployment comparison (size, throughput, memory, perplexity, batch TPS)
21
  - Plots: throughput vs quantization, size vs perplexity, batched inference
22
  - Instructions to run `llama-server` locally (`inference_server/`)
 
29
  - [model-Q5_K_M.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q5_K_M.gguf)
30
  - [model-Q8_0.gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf/resolve/main/model-Q8_0.gguf)
31
 
32
+ You can also export locally with `task4_quantization_gguf.py` or copy from `results/task4/`.
33
 
34
  ## Repository layout
35
 
 
37
  task4_hf/
38
  ├── app.py
39
  ├── gguf_chat.py
40
+ ├── requirements.txt # includes llama-cpp-python via pre-built CPU wheel
41
+ ├── requirements-play.txt # convenience alias (-r requirements.txt)
42
  ├── runtime.txt
43
  ├── results/
44
  │ ├── deployment_comparison.json
 
55
  │ └── requirements.txt
56
  └── models/
57
  ├── README.md
58
+ └── *.gguf # optional; inference_server/server.py downloads from the Hub if missing
59
  ```
60
 
61
  ## Hardware note
62
 
63
  Benchmarks were collected on **Apple Silicon (M-series)** with local **llama.cpp** binaries. Numbers on other CPUs/GPUs will differ.
64
 
65
+ ## Local: run "Try a quant" chat
66
 
67
  ```bash
68
  cd task4_hf
69
+ pip install -r requirements.txt # includes llama-cpp-python via pre-built CPU wheel
70
  python app.py
71
  ```
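
For readers who want a feel for what the "Try a quant" path involves once the wheel is installed, here is a minimal sketch. It is not the contents of `gguf_chat.py`, which this diff does not show. It assumes the `model-<QUANT>.gguf` naming visible in the model-repo links above; the names `gguf_filename` and `generate` and the `n_ctx` value are illustrative assumptions.

```python
REPO_ID = "ysingh-aiml/tinyllama-alpaca-lora-gguf"  # model repo named in the README


def gguf_filename(quant_label: str) -> str:
    """Map a quant label such as "Q8_0" to the file name used in the model repo."""
    return f"model-{quant_label}.gguf"


def generate(quant_label: str, prompt: str, max_tokens: int = 128) -> str:
    """Download the GGUF (cached after the first call) and run one CPU completion."""
    # Imported lazily so gguf_filename() still works when the wheel is missing.
    from huggingface_hub import hf_hub_download
    from llama_cpp import Llama

    model_path = hf_hub_download(repo_id=REPO_ID, filename=gguf_filename(quant_label))
    # n_ctx=2048 is an assumed context size, not a value taken from this repo.
    llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)
    result = llm.create_completion(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]
```

Calling `generate("Q8_0", "Hello")` would trigger the download and model load described above, so the first call is expected to be the slow one.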
 
 
 
 
app.py CHANGED
@@ -18,8 +18,8 @@ import gradio as gr
18
  ROOT = Path(__file__).resolve().parent
19
  RESULTS = ROOT / "results"
20
  PLOTS = ROOT / "plots"
21
- DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-64cf04.log")
22
- DEBUG_SESSION_ID = "64cf04"
23
 
24
 
25
  def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
@@ -120,7 +120,7 @@ def _play_intro_md() -> str:
120
  err = f"{type(exc).__name__}: {exc}"
121
  # region agent log
122
  _debug_log(
123
- "pre-fix-1",
124
  "H1",
125
  "task4_hf/app.py:_play_intro_md",
126
  "Play intro dependency probe",
@@ -138,17 +138,16 @@ def _play_intro_md() -> str:
138
  if ok:
139
  return (
140
  base
141
- + "Select a **GGUF**; the first run **downloads** weights then loads them with "
142
- "`llama-cpp-python` (CPU first load can take minutes on small hardware)."
 
143
  )
144
  return (
145
  base
146
- + "### Interactive chat unavailable in this Space build\n\n"
147
- "The default `requirements.txt` **does not** install `llama-cpp-python` so Hugging Face "
148
- "can finish the image build in time. On many builders, that package **compiles from source** "
149
- "(slow) or lacks **prebuilt wheels** for **Python 3.13**, which causes **build timeouts**.\n\n"
150
  "**Run locally:**\n"
151
- "```bash\npip install -r requirements-play.txt\npython app.py\n```\n\n"
152
  "Or use the **Local server** tab with a system `llama-server` binary."
153
  )
154
 
@@ -165,7 +164,7 @@ def _play_deps_status() -> tuple[bool, str]:
165
  def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
166
  # region agent log
167
  _debug_log(
168
- "pre-fix-1",
169
  "H4",
170
  "task4_hf/app.py:_play_generate",
171
  "Generate clicked",
@@ -182,7 +181,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
182
  ok, err = _deps_ok()
183
  # region agent log
184
  _debug_log(
185
- "pre-fix-1",
186
  "H2",
187
  "task4_hf/app.py:_play_generate",
188
  "Dependency check result in generate path",
@@ -196,7 +195,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
196
  if not ok:
197
  # region agent log
198
  _debug_log(
199
- "pre-fix-1",
200
  "H3",
201
  "task4_hf/app.py:_play_generate",
202
  "Returning missing dependency message",
@@ -216,7 +215,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
216
  )
217
  # region agent log
218
  _debug_log(
219
- "pre-fix-1",
220
  "H4",
221
  "task4_hf/app.py:_play_generate",
222
  "Generate returned",
@@ -227,7 +226,7 @@ def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[s
227
  except Exception as exc: # noqa: BLE001
228
  # region agent log
229
  _debug_log(
230
- "pre-fix-1",
231
  "H4",
232
  "task4_hf/app.py:_play_generate",
233
  "Generate exception",
@@ -242,11 +241,16 @@ def build_app() -> gr.Blocks:
242
  play_ok, play_err = _play_deps_status()
243
  # region agent log
244
  _debug_log(
245
- "post-fix-1",
246
- "H4",
247
  "task4_hf/app.py:build_app",
248
  "Play controls availability decided",
249
- {"play_ok": bool(play_ok), "play_err": play_err if not play_ok else ""},
 
 
 
 
 
250
  )
251
  # endregion
252
 
@@ -262,7 +266,7 @@ def build_app() -> gr.Blocks:
262
 
263
  **GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
264
 
265
- The **Try a quant** tab needs **`llama-cpp-python`** (optional on the hosted Space – see that tab). Otherwise use **`inference_server/server.py`** with `llama-server`, or install **`requirements-play.txt`** locally.
266
  """
267
 
268
  with gr.Blocks(
@@ -307,7 +311,7 @@ The **Try a quant** tab needs **`llama-cpp-python`** (optional on the hosted Spa
307
  play_status = gr.Markdown(value=play_status_text)
308
  # region agent log
309
  _debug_log(
310
- "pre-fix-2",
311
  "H5",
312
  "task4_hf/app.py:build_app",
313
  "Play widget interactivity configured",
 
18
  ROOT = Path(__file__).resolve().parent
19
  RESULTS = ROOT / "results"
20
  PLOTS = ROOT / "plots"
21
+ DEBUG_LOG_PATH = Path("/Users/ysingh/PyCharmMiscProject/.cursor/debug-365fe9.log")
22
+ DEBUG_SESSION_ID = "365fe9"
23
 
24
 
25
  def _debug_log(run_id: str, hypothesis_id: str, location: str, message: str, data: dict) -> None:
 
120
  err = f"{type(exc).__name__}: {exc}"
121
  # region agent log
122
  _debug_log(
123
+ "run-365fe9",
124
  "H1",
125
  "task4_hf/app.py:_play_intro_md",
126
  "Play intro dependency probe",
 
138
  if ok:
139
  return (
140
  base
141
+ + "Select a **GGUF** quantization, type a message, and hit **Generate**. "
142
+ "The first run **downloads** the weights (~640 MB – 1.1 GB) and loads them with "
143
+ "`llama-cpp-python` (CPU — first generation can take a minute)."
144
  )
145
  return (
146
  base
147
+ + "### Interactive chat unavailable\n\n"
148
+ f"`llama-cpp-python` failed to load: `{err}`\n\n"
 
 
149
  "**Run locally:**\n"
150
+ "```bash\npip install -r requirements.txt\npython app.py\n```\n\n"
151
  "Or use the **Local server** tab with a system `llama-server` binary."
152
  )
153
 
 
164
  def _play_generate(quant_label: str, message: str, max_tokens: float) -> tuple[str, str]:
165
  # region agent log
166
  _debug_log(
167
+ "run-365fe9",
168
  "H4",
169
  "task4_hf/app.py:_play_generate",
170
  "Generate clicked",
 
181
  ok, err = _deps_ok()
182
  # region agent log
183
  _debug_log(
184
+ "run-365fe9",
185
  "H2",
186
  "task4_hf/app.py:_play_generate",
187
  "Dependency check result in generate path",
 
195
  if not ok:
196
  # region agent log
197
  _debug_log(
198
+ "run-365fe9",
199
  "H3",
200
  "task4_hf/app.py:_play_generate",
201
  "Returning missing dependency message",
 
215
  )
216
  # region agent log
217
  _debug_log(
218
+ "run-365fe9",
219
  "H4",
220
  "task4_hf/app.py:_play_generate",
221
  "Generate returned",
 
226
  except Exception as exc: # noqa: BLE001
227
  # region agent log
228
  _debug_log(
229
+ "run-365fe9",
230
  "H4",
231
  "task4_hf/app.py:_play_generate",
232
  "Generate exception",
 
241
  play_ok, play_err = _play_deps_status()
242
  # region agent log
243
  _debug_log(
244
+ "run-365fe9",
245
+ "H1",
246
  "task4_hf/app.py:build_app",
247
  "Play controls availability decided",
248
+ {
249
+ "play_ok": bool(play_ok),
250
+ "play_err": play_err if not play_ok else "",
251
+ "python_version": os.sys.version,
252
+ "platform": os.sys.platform,
253
+ },
254
  )
255
  # endregion
256
 
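
The diff shows only the signature of `_debug_log`, not its body. Since `DEBUG_LOG_PATH` is a local macOS path that probably does not exist on the hosted Space, a logger in this position would need to tolerate write failures. The following is a rough sketch of one way to do that, under stated assumptions: the function name `debug_log`, the JSON field names, and the explicit `path`/`session_id` parameters are all hypothetical.

```python
import json
import time
from pathlib import Path


def debug_log(path: Path, session_id: str, run_id: str, hypothesis_id: str,
              location: str, message: str, data: dict) -> bool:
    """Append one JSON line to `path`; return False instead of raising on I/O errors."""
    record = {
        "sessionId": session_id,
        "runId": run_id,
        "hypothesisId": hypothesis_id,
        "location": location,
        "message": message,
        "data": data,
        "timestamp": time.time(),
    }
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a", encoding="utf-8") as fh:
            # default=str keeps the logger from failing on non-JSON values.
            fh.write(json.dumps(record, default=str) + "\n")
    except OSError:
        # Unwritable location (e.g. a read-only container): skip logging.
        return False
    return True
```

Returning a boolean rather than raising keeps a missing log directory from affecting the Gradio callbacks that call the logger.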
 
266
 
267
  **GGUF weights** live in **[ysingh-aiml/tinyllama-alpaca-lora-gguf](https://huggingface.co/ysingh-aiml/tinyllama-alpaca-lora-gguf)**.
268
 
269
+ The **Try a quant** tab lets you run GGUF generation in-browser via **`llama-cpp-python`** (CPU; first generation downloads weights and may take a minute). Use the **Local server** tab for `llama-server`.
270
  """
271
 
272
  with gr.Blocks(
 
311
  play_status = gr.Markdown(value=play_status_text)
312
  # region agent log
313
  _debug_log(
314
+ "run-365fe9",
315
  "H5",
316
  "task4_hf/app.py:build_app",
317
  "Play widget interactivity configured",
requirements-play.txt CHANGED
@@ -1,5 +1,3 @@
1
- # Optional: enable the "Try a quant" tab (GGUF chat in-process).
2
- # Use Python 3.10 or 3.11 on Linux/macOS/Win so pip can install a binary wheel.
3
- # Avoid Python 3.13 until llama-cpp-python publishes cp313 wheels broadly.
4
  -r requirements.txt
5
- llama-cpp-python>=0.2.90,<0.4
 
1
+ # Enable the "Try a quant" tab (GGUF chat in-process).
2
+ # requirements.txt now includes llama-cpp-python via pre-built CPU wheels.
 
3
  -r requirements.txt
 
requirements.txt CHANGED
@@ -1,9 +1,9 @@
1
- # Gradio Space — fast install (no llama-cpp-python: it often compiles from source and
2
- # exceeds HF build time limits, especially on Python 3.13 without prebuilt wheels).
3
  gradio>=5.10.0,<6
4
  audioop-lts>=0.2.1
5
  matplotlib>=3.8.0
6
  numpy>=1.26.0
7
  huggingface_hub>=0.20.0
8
-
9
- # Interactive "Try a quant" chat: pip install -r requirements-play.txt (local or custom Space)
 
 
1
+ # Gradio Space — uses pre-built CPU-only llama-cpp-python wheels to avoid source compilation.
 
2
  gradio>=5.10.0,<6
3
  audioop-lts>=0.2.1
4
  matplotlib>=3.8.0
5
  numpy>=1.26.0
6
  huggingface_hub>=0.20.0
7
+ # Pre-built CPU wheel avoids source compilation (no BLAS/cmake needed, fast install).
8
+ --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
9
+ llama-cpp-python>=0.2.90,<0.4
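
After installing from this file, one quick way to confirm that the pre-built wheel actually loads is an import probe. The sketch below is an assumption about how such a check might look; it is not the repo's `_deps_ok`, whose body is not part of this diff. It reuses the `TypeName: message` error format and the `tuple[bool, str]` return shape that appear in the `app.py` changes.

```python
import importlib


def deps_ok(module_name: str = "llama_cpp") -> tuple[bool, str]:
    """Return (True, "") if the module imports, else (False, "<ErrorType>: <message>")."""
    try:
        importlib.import_module(module_name)
    except Exception as exc:  # noqa: BLE001 - a broken binary wheel can raise more than ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""
```

Running `print(deps_ok())` in the Space's Python environment reports whether `llama_cpp` is importable, and the error string is suitable for display in the "Interactive chat unavailable" message.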