Luigi commited on
Commit
6be837f
·
1 Parent(s): ae53447

Update CLAUDE.md with current implementation details

Browse files
Files changed (1) hide show
  1. CLAUDE.md +112 -18
CLAUDE.md CHANGED
@@ -8,7 +8,7 @@ Tiny Scribe is a transcript summarization tool with two interfaces:
8
  1. **CLI tool** (`summarize_transcript.py`) - Standalone script for local use with SYCL/CPU acceleration
9
  2. **Gradio web app** (`app.py`) - HuggingFace Spaces deployment with streaming UI
10
 
11
- Both use llama-cpp-python to run GGUF quantized models (Qwen3, ERNIE, Granite) and convert output to Traditional Chinese (zh-TW) via OpenCC.
12
 
13
  ## Development Commands
14
 
@@ -85,11 +85,14 @@ User upload → Gradio File → app.py:summarize_streaming()
85
 
86
  | Feature | CLI (`summarize_transcript.py`) | Gradio (`app.py`) |
87
  |---------|--------------------------------|-------------------|
88
- | Model loading | On-demand per run | Global singleton (preloaded) |
89
- | Thinking tags | `<think>...</think>` | `<thinking>...</thinking>` |
 
 
 
90
  | Output | Print to stdout + save files | Yield tuples for dual textboxes |
91
  | GPU support | Configurable via `--cpu` flag | Hardcoded `n_gpu_layers=0` |
92
- | Context window | 32K tokens | 32K tokens |
93
 
94
  ### Model Loading Pattern
95
 
@@ -150,6 +153,37 @@ thinking = '\n\n'.join(match.strip() for match in matches)
150
  summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
151
  ```
152
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
153
  ### Chinese Text Conversion
154
 
155
  All outputs are converted from Simplified to Traditional Chinese (Taiwan standard):
@@ -166,12 +200,28 @@ Applied token-by-token during streaming to maintain real-time display.
166
 
167
  The Gradio app is optimized for HF Spaces Free Tier (2 vCPUs):
168
 
169
- - **Model**: Qwen3-0.6B Q4_K_M (~400MB)
170
  - **Dockerfile**: Uses prebuilt llama-cpp-python wheel (skips 10-min compilation)
171
- - **Context limits**: Truncates inputs to 24K chars (~6K tokens) to leave headroom
172
 
173
  See `DEPLOY.md` for full deployment instructions.
174
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  ### Docker Optimization
176
 
177
  The Dockerfile avoids building llama-cpp-python from source by using a prebuilt wheel:
@@ -201,13 +251,13 @@ git commit -m "Update llama-cpp-python submodule"
201
 
202
  ## Model Format
203
 
204
- Model argument format: `repo_id:quantization`
205
 
206
  Examples:
207
  - `unsloth/Qwen3-0.6B-GGUF:Q4_0` → Searches for `*Q4_0.gguf`
208
  - `unsloth/Qwen3-1.7B-GGUF:Q2_K_L` → Searches for `*Q2_K_L.gguf`
209
 
210
- The `:` separator is parsed in `summarize_transcript.py:126-131`.
211
 
212
  ## Error Handling Notes
213
 
@@ -217,32 +267,76 @@ When modifying streaming logic:
217
  - Gradio error handling: Yield error messages in the summary field, keep thinking field intact
218
  - File upload: Validate file existence and encoding before reading
219
 
220
- ## Common Modifications
221
 
222
- ### Changing the Model
223
 
224
- **CLI:** Use `-m` argument at runtime
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
225
 
226
- **Gradio app:** Edit `app.py` lines 25-26:
 
 
227
  ```python
228
- DEFAULT_MODEL = "unsloth/Qwen3-1.7B-GGUF"
229
- DEFAULT_FILENAME = "*Q2_K_L.gguf"
 
 
 
 
 
 
 
 
 
 
 
230
  ```
231
 
 
 
 
 
 
 
 
 
 
 
232
  ### Adjusting Context Window
233
 
234
- Change `n_ctx` in `Llama.from_pretrained()`:
 
 
 
 
235
  - 32768 (current) = handles ~24KB text input
236
  - 8192 = faster, lower memory, ~6KB text
237
  - 131072 = very slow on CPU, ~100KB text
238
 
239
- Also update `max_chars` truncation in `app.py:119` accordingly (estimate: 4 chars per token).
240
-
241
  ### GPU Acceleration
242
 
243
  **CLI:** Remove `-c` flag (defaults to SYCL/CUDA if available)
244
 
245
- **Gradio app:** Change `app.py:47`:
246
  ```python
247
  n_gpu_layers=-1, # Use all GPU layers
248
  ```
 
8
  1. **CLI tool** (`summarize_transcript.py`) - Standalone script for local use with SYCL/CPU acceleration
9
  2. **Gradio web app** (`app.py`) - HuggingFace Spaces deployment with streaming UI
10
 
11
+ Both use llama-cpp-python to run GGUF quantized models (Qwen3, ERNIE, Granite, Gemma, etc.) and convert output to Traditional Chinese (zh-TW) via OpenCC.
12
 
13
  ## Development Commands
14
 
 
85
 
86
  | Feature | CLI (`summarize_transcript.py`) | Gradio (`app.py`) |
87
  |---------|--------------------------------|-------------------|
88
+ | Model loading | On-demand per run | Global singleton (cached) |
89
+ | Model selection | CLI argument `repo_id:quant` | Dropdown with 10 models |
90
+ | Thinking tags | Supports both formats | Supports both formats + streaming |
91
+ | Reasoning toggle | Not supported | Qwen3: /think or /no_think |
92
+ | Inference settings | Hardcoded per run | Model-specific, dynamic UI |
93
  | Output | Print to stdout + save files | Yield tuples for dual textboxes |
94
  | GPU support | Configurable via `--cpu` flag | Hardcoded `n_gpu_layers=0` |
95
+ | Context window | 32K tokens | Per-model (32K-262K, capped at 32K) |
96
 
97
  ### Model Loading Pattern
98
 
 
153
  summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
154
  ```
155
 
156
+ The Gradio app also handles streaming mode with unclosed `<think>` tags for real-time display.
157
+
158
+ ### Qwen3 Thinking Mode
159
+
160
+ Qwen3 models support a special "thinking mode" that generates `<think>...</think>` blocks for reasoning before the final answer.
161
+
162
+ **Implementation (llama.cpp/llama-cpp-python):**
163
+ - Add `/think` to system prompt or user message to enable thinking mode
164
+ - Add `/no_think` to disable thinking mode (faster, direct output)
165
+ - Most recent instruction takes precedence in multi-turn conversations
166
+
167
+ **Official Recommended Settings (from Unsloth):**
168
+
169
+ | Setting | Non-Thinking Mode | Thinking Mode |
170
+ |---------|------------------|---------------|
171
+ | Temperature | 0.7 | 0.6 |
172
+ | Top_P | 0.8 | 0.95 |
173
+ | Top_K | 20 | 20 |
174
+ | Min_P | 0.0 | 0.0 |
175
+
176
+ **Important Notes:**
177
+ - **DO NOT use greedy decoding** in thinking mode (causes endless repetitions)
178
+ - In thinking mode, model generates `<think>...</think>` block before final answer
179
+ - For non-thinking mode, empty `<think></think>` tags are purposely used
180
+
181
+ **Current Implementation:**
182
+ The Gradio app (`app.py`) implements this via:
183
+ - `enable_reasoning` checkbox (models with `supports_toggle: true`)
184
+ - Dynamic system prompt: `你是一個有助的助手,負責總結轉錄內容。{reasoning_mode}`
185
+ - Where `reasoning_mode = "/think"` or `/no_think"` based on toggle
186
+
187
  ### Chinese Text Conversion
188
 
189
  All outputs are converted from Simplified to Traditional Chinese (Taiwan standard):
 
200
 
201
  The Gradio app is optimized for HF Spaces Free Tier (2 vCPUs):
202
 
203
+ - **Models**: 10 models available (100M to 1.7B parameters), default: Qwen3-0.6B Q4_K_M (~400MB)
204
  - **Dockerfile**: Uses prebuilt llama-cpp-python wheel (skips 10-min compilation)
205
+ - **Context limits**: Per-model context windows (32K to 262K tokens), capped at 32K for CPU performance
206
 
207
  See `DEPLOY.md` for full deployment instructions.
208
 
209
+ ### Deployment Workflow
210
+
211
+ The `deploy.sh` script ensures meaningful commit messages:
212
+
213
+ ```bash
214
+ ./deploy.sh "Add new model: Gemma-3 270M"
215
+ ```
216
+
217
+ The script:
218
+ 1. Checks for uncommitted changes
219
+ 2. Prompts for commit message if not provided
220
+ 3. Warns about generic/short messages
221
+ 4. Shows commits to be pushed
222
+ 5. Confirms before pushing
223
+ 6. Verifies commit message was preserved on remote
224
+
225
  ### Docker Optimization
226
 
227
  The Dockerfile avoids building llama-cpp-python from source by using a prebuilt wheel:
 
251
 
252
  ## Model Format
253
 
254
+ CLI model argument format: `repo_id:quantization`
255
 
256
  Examples:
257
  - `unsloth/Qwen3-0.6B-GGUF:Q4_0` → Searches for `*Q4_0.gguf`
258
  - `unsloth/Qwen3-1.7B-GGUF:Q2_K_L` → Searches for `*Q2_K_L.gguf`
259
 
260
+ The `:` separator is parsed in `summarize_transcript.py:128-130`.
261
 
262
  ## Error Handling Notes
263
 
 
267
  - Gradio error handling: Yield error messages in the summary field, keep thinking field intact
268
  - File upload: Validate file existence and encoding before reading
269
 
270
+ ## Model Registry
271
 
272
+ The Gradio app (`app.py:32-155`) includes a model registry (`AVAILABLE_MODELS`) with:
273
 
274
+ 1. **Model metadata** (repo_id, filename, max context)
275
+ 2. **Model-specific inference settings** (temperature, top_p, top_k, repeat_penalty)
276
+ 3. **Feature flags** (e.g., `supports_toggle` for Qwen3 reasoning mode)
277
+
278
+ Each model has optimized defaults. The UI updates inference controls when model selection changes.
279
+
280
+ ### Available Models
281
+
282
+ | Key | Model | Params | Max Context | Quant |
283
+ |-----|-------|--------|-------------|-------|
284
+ | `falcon_h1_100m` | Falcon-H1 100M | 100M | 32K | Q8_0 |
285
+ | `gemma3_270m` | Gemma-3 270M | 270M | 32K | Q8_0 |
286
+ | `ernie_300m` | ERNIE-4.5 0.3B | 300M | 131K | Q8_0 |
287
+ | `granite_350m` | Granite-4.0 350M | 350M | 32K | Q8_0 |
288
+ | `lfm2_350m` | LFM2 350M | 350M | 32K | Q8_0 |
289
+ | `bitcpm4_500m` | BitCPM4 0.5B | 500M | 128K | q4_0 |
290
+ | `hunyuan_500m` | Hunyuan 0.5B | 500M | 256K | Q8_0 |
291
+ | `qwen3_600m_q4` | Qwen3 0.6B | 600M | 32K | Q4_K_M |
292
+ | `falcon_h1_1.5b_q4` | Falcon-H1 1.5B | 1.5B | 32K | Q4_K_M |
293
+ | `qwen3_1.7b_q4` | Qwen3 1.7B | 1.7B | 32K | Q4_K_M |
294
 
295
+ ### Adding a New Model
296
+
297
+ 1. Add entry to `AVAILABLE_MODELS` in `app.py`:
298
  ```python
299
+ "model_key": {
300
+ "name": "Human-Readable Name",
301
+ "repo_id": "org/model-name-GGUF",
302
+ "filename": "*Quantization.gguf",
303
+ "max_context": 32768,
304
+ "supports_toggle": False, # For Qwen3 /think mode
305
+ "inference_settings": {
306
+ "temperature": 0.6,
307
+ "top_p": 0.95,
308
+ "top_k": 20,
309
+ "repeat_penalty": 1.05,
310
+ },
311
+ },
312
  ```
313
 
314
+ 2. Set `DEFAULT_MODEL_KEY` to the new key if it should be default
315
+
316
+ ## Common Modifications
317
+
318
+ ### Changing the Default Model
319
+
320
+ **CLI:** Use `-m` argument at runtime
321
+
322
+ **Gradio app:** Change `DEFAULT_MODEL_KEY` in `app.py:157`
323
+
324
  ### Adjusting Context Window
325
 
326
+ **CLI:** Change `n_ctx` in `summarize_transcript.py:23`
327
+
328
+ **Gradio app:** The app dynamically calculates `n_ctx` based on input size and model limits. To change the global cap, modify `MAX_USABLE_CTX` in `app.py:29`.
329
+
330
+ Values:
331
  - 32768 (current) = handles ~24KB text input
332
  - 8192 = faster, lower memory, ~6KB text
333
  - 131072 = very slow on CPU, ~100KB text
334
 
 
 
335
  ### GPU Acceleration
336
 
337
  **CLI:** Remove `-c` flag (defaults to SYCL/CUDA if available)
338
 
339
+ **Gradio app:** Change `app.py:206`:
340
  ```python
341
  n_gpu_layers=-1, # Use all GPU layers
342
  ```