Spaces:
Running
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Tiny Scribe is a transcript summarization tool with two interfaces:
- CLI tool (
summarize_transcript.py) - Standalone script for local use with SYCL/CPU acceleration - Gradio web app (
app.py) - HuggingFace Spaces deployment with streaming UI
Both use llama-cpp-python to run GGUF quantized models (Qwen3, ERNIE, Granite, Gemma, etc.) and convert output to Traditional Chinese (zh-TW) via OpenCC.
Development Commands
Running the CLI
# Basic usage (default model: Qwen3-0.6B Q4_0)
python summarize_transcript.py -i ./transcripts/short.txt
# Specify model (format: repo_id:quantization)
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
# Force CPU-only (disable SYCL)
python summarize_transcript.py -c
Running the Gradio App
# Local development
pip install -r requirements.txt
python app.py
# Opens at http://localhost:7860
Testing
No test suite exists in the root project. To test llama-cpp-python submodule:
cd llama-cpp-python
pip install ".[test]"
pytest tests/test_llama.py -v
# Single test
pytest tests/test_llama.py::test_function_name -v
Docker Deployment
# Build locally
docker build -t tiny-scribe .
# Run
docker run -p 7860:7860 tiny-scribe
Architecture
Two Execution Paths
CLI Path:
User โ summarize_transcript.py โ Llama.from_pretrained() โ GGUF model
โ
Stream tokens โ OpenCC (s2twp) โ stdout
โ
parse_thinking_blocks() โ thinking.txt + summary.txt
Gradio Path:
User upload โ Gradio File โ app.py:summarize_streaming()
โ
Llama.create_chat_completion(stream=True)
โ
Token-by-token yield โ OpenCC โ Two textboxes:
โ - Thinking (raw stream)
parse_thinking_blocks() - Summary (parsed output)
Key Differences
| Feature | CLI (summarize_transcript.py) |
Gradio (app.py) |
|---|---|---|
| Model loading | On-demand per run | Global singleton (cached) |
| Model selection | CLI argument repo_id:quant |
Dropdown with 10 models |
| Thinking tags | Supports both formats | Supports both formats + streaming |
| Reasoning toggle | Not supported | Qwen3: /think or /no_think |
| Inference settings | Hardcoded per run | Model-specific, dynamic UI |
| Output | Print to stdout + save files | Yield tuples for dual textboxes |
| GPU support | Configurable via --cpu flag |
Hardcoded n_gpu_layers=0 |
| Context window | 32K tokens | Per-model (32K-262K, capped at 32K) |
Model Loading Pattern
Both scripts use Llama.from_pretrained() with HuggingFace Hub integration:
llm = Llama.from_pretrained(
repo_id="unsloth/Qwen3-0.6B-GGUF",
filename="*Q4_K_M.gguf", # Wildcard for flexible matching
n_gpu_layers=0, # 0=CPU, -1=all layers on GPU
n_ctx=32768, # 32K context window
seed=1337, # Reproducibility
verbose=False, # Reduce log noise
)
Important: Always call llm.reset() after each completion to clear KV cache and ensure state isolation.
Streaming Implementation
The Gradio app (app.py) implements real-time streaming with dual outputs:
- Raw stream โ
thinking_outputtextbox (shows every token as generated) - Parsed summary โ
summary_outputmarkdown (extracts content outside<thinking>tags)
Generator pattern:
def summarize_streaming(...) -> Generator[Tuple[str, str], None, None]:
for chunk in stream:
content = chunk['choices'][0]['delta'].get('content', '')
full_response += content
# Show all tokens in thinking field
current_thinking += content
# Extract summary (content outside thinking tags)
thinking_blocks, summary = parse_thinking_blocks(full_response)
current_summary = summary
# Yield both on every token
yield (current_thinking, current_summary)
Thinking Block Parsing
Models may wrap reasoning in special tags that should be separated from final output.
Both versions now support both tag formats:
<think>reasoning</think>(common with Qwen models)<thinking>reasoning</thinking>(Claude-style)
Regex pattern:
# Matches both <think> and <thinking> tags
pattern = r'<think(?:ing)?>(.*?)</think(?:ing)?>'
matches = re.findall(pattern, content, re.DOTALL)
thinking = '\n\n'.join(match.strip() for match in matches)
summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
The Gradio app also handles streaming mode with unclosed <think> tags for real-time display.
Qwen3 Thinking Mode
Qwen3 models support a special "thinking mode" that generates <think>...</think> blocks for reasoning before the final answer.
Implementation (llama.cpp/llama-cpp-python):
- Add
/thinkto system prompt or user message to enable thinking mode - Add
/no_thinkto disable thinking mode (faster, direct output) - Most recent instruction takes precedence in multi-turn conversations
Official Recommended Settings (from Unsloth):
| Setting | Non-Thinking Mode | Thinking Mode |
|---|---|---|
| Temperature | 0.7 | 0.6 |
| Top_P | 0.8 | 0.95 |
| Top_K | 20 | 20 |
| Min_P | 0.0 | 0.0 |
Important Notes:
- DO NOT use greedy decoding in thinking mode (causes endless repetitions)
- In thinking mode, model generates
<think>...</think>block before final answer - For non-thinking mode, empty
<think></think>tags are purposely used
Current Implementation:
The Gradio app (app.py) implements this via:
enable_reasoningcheckbox (models withsupports_toggle: true)- Dynamic system prompt:
ไฝ ๆฏไธๅๆๅฉ็ๅฉๆ๏ผ่ฒ ่ฒฌ็ธฝ็ต่ฝ้ๅ งๅฎนใ{reasoning_mode} - Where
reasoning_mode = "/think"or/no_think"based on toggle
Chinese Text Conversion
All outputs are converted from Simplified to Traditional Chinese (Taiwan standard):
from opencc import OpenCC
converter = OpenCC('s2twp') # s2twp = Simplified โ Traditional (Taiwan + phrases)
traditional = converter.convert(simplified)
Applied token-by-token during streaming to maintain real-time display.
HuggingFace Spaces Deployment
The Gradio app is optimized for HF Spaces Free Tier (2 vCPUs):
- Models: 10 models available (100M to 1.7B parameters), default: Qwen3-0.6B Q4_K_M (~400MB)
- Dockerfile: Uses prebuilt llama-cpp-python wheel (skips 10-min compilation)
- Context limits: Per-model context windows (32K to 262K tokens), capped at 32K for CPU performance
See DEPLOY.md for full deployment instructions.
Deployment Workflow
The deploy.sh script ensures meaningful commit messages:
./deploy.sh "Add new model: Gemma-3 270M"
The script:
- Checks for uncommitted changes
- Prompts for commit message if not provided
- Warns about generic/short messages
- Shows commits to be pushed
- Confirms before pushing
- Verifies commit message was preserved on remote
Docker Optimization
The Dockerfile avoids building llama-cpp-python from source by using a prebuilt wheel:
RUN pip install --no-cache-dir \
https://huggingface.co/Luigi/llama-cpp-python-wheels-hf-spaces-free-cpu/resolve/main/llama_cpp_python-0.3.22-cp310-cp310-linux_x86_64.whl
This reduces build time from 10+ minutes to ~2 minutes.
Git Submodule
The llama-cpp-python/ directory is a Git submodule tracking upstream development:
# Initialize after clone
git submodule update --init --recursive
# Update to latest
cd llama-cpp-python
git pull origin main
cd ..
git add llama-cpp-python
git commit -m "Update llama-cpp-python submodule"
Model Format
CLI model argument format: repo_id:quantization
Examples:
unsloth/Qwen3-0.6B-GGUF:Q4_0โ Searches for*Q4_0.ggufunsloth/Qwen3-1.7B-GGUF:Q2_K_Lโ Searches for*Q2_K_L.gguf
The : separator is parsed in summarize_transcript.py:128-130.
Error Handling Notes
When modifying streaming logic:
- Always handle
'choices'key presence in chunks - Always check for
'delta'in choice before accessing'content' - Gradio error handling: Yield error messages in the summary field, keep thinking field intact
- File upload: Validate file existence and encoding before reading
Model Registry
The Gradio app (app.py:32-155) includes a model registry (AVAILABLE_MODELS) with:
- Model metadata (repo_id, filename, max context)
- Model-specific inference settings (temperature, top_p, top_k, repeat_penalty)
- Feature flags (e.g.,
supports_togglefor Qwen3 reasoning mode)
Each model has optimized defaults. The UI updates inference controls when model selection changes.
Available Models
| Key | Model | Params | Max Context | Quant |
|---|---|---|---|---|
falcon_h1_100m |
Falcon-H1 100M | 100M | 32K | Q8_0 |
gemma3_270m |
Gemma-3 270M | 270M | 32K | Q8_0 |
ernie_300m |
ERNIE-4.5 0.3B | 300M | 131K | Q8_0 |
granite_350m |
Granite-4.0 350M | 350M | 32K | Q8_0 |
lfm2_350m |
LFM2 350M | 350M | 32K | Q8_0 |
bitcpm4_500m |
BitCPM4 0.5B | 500M | 128K | q4_0 |
hunyuan_500m |
Hunyuan 0.5B | 500M | 256K | Q8_0 |
qwen3_600m_q4 |
Qwen3 0.6B | 600M | 32K | Q4_K_M |
falcon_h1_1.5b_q4 |
Falcon-H1 1.5B | 1.5B | 32K | Q4_K_M |
qwen3_1.7b_q4 |
Qwen3 1.7B | 1.7B | 32K | Q4_K_M |
Adding a New Model
- Add entry to
AVAILABLE_MODELSinapp.py:
"model_key": {
"name": "Human-Readable Name",
"repo_id": "org/model-name-GGUF",
"filename": "*Quantization.gguf",
"max_context": 32768,
"supports_toggle": False, # For Qwen3 /think mode
"inference_settings": {
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"repeat_penalty": 1.05,
},
},
- Set
DEFAULT_MODEL_KEYto the new key if it should be default
Common Modifications
Changing the Default Model
CLI: Use -m argument at runtime
Gradio app: Change DEFAULT_MODEL_KEY in app.py:157
Adjusting Context Window
CLI: Change n_ctx in summarize_transcript.py:23
Gradio app: The app dynamically calculates n_ctx based on input size and model limits. To change the global cap, modify MAX_USABLE_CTX in app.py:29.
Values:
- 32768 (current) = handles ~24KB text input
- 8192 = faster, lower memory, ~6KB text
- 131072 = very slow on CPU, ~100KB text
GPU Acceleration
CLI: Remove -c flag (defaults to SYCL/CUDA if available)
Gradio app: Change app.py:206:
n_gpu_layers=-1, # Use all GPU layers
Note: HF Spaces Free Tier has no GPU access.