# AGENTS.md - Tiny Scribe Project Guidelines

## Project Overview

Tiny Scribe is a Python CLI tool and Gradio web app for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and bilingual summaries (English or Traditional Chinese, zh-TW) via OpenCC.

## Build / Lint / Test Commands

**Run the CLI script:**
```bash
python summarize_transcript.py -i ./transcripts/short.txt              # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW    # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c  # CPU only
```

**Run the Gradio web app:**
```bash
python app.py  # Starts on port 7860
```

**Linting (if ruff installed):**
```bash
ruff check .
ruff format .            # Auto-format code
```

**Type checking (if mypy installed):**
```bash
mypy summarize_transcript.py
mypy app.py
```

**Running tests (root project tests):**
```bash
# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py

# Run single test with pytest
pytest test_e2e.py -v                          # Run all tests in file
pytest test_e2e.py::test_e2e -v               # Run specific function
pytest test_advanced_mode.py -k "test_name"    # Run by name pattern
```

**llama-cpp-python submodule tests:**
```bash
cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v

# Run specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
```

## Code Style Guidelines

**Formatting:**
- 4 spaces indentation, 100 char max line length, double quotes for docstrings
- Two blank lines before functions, one after docstrings

**Imports (ordered):**
```python
# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
from llama_cpp import Llama
import gradio as gr

# Local modules
from meeting_summarizer.trace import Tracer
```

**Type Hints:**
- Use type hints for params/returns
- `Optional[]` for nullable types, `Generator[str, None, None]` for generators
- Example: `def load_model(repo_id: str, filename: str) -> Llama:`
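For instance, a streaming helper annotated per these rules might look like the following (a minimal sketch; `stream_tokens` is a hypothetical illustration, not a function in the repo):

```python
from typing import Generator, Optional


def stream_tokens(
    text: str, chunk_size: int = 4, prefix: Optional[str] = None
) -> Generator[str, None, None]:
    """Yield text in small chunks, as a streaming summarizer would."""
    if prefix is not None:
        yield prefix
    for i in range(0, len(text), chunk_size):
        yield text[i : i + chunk_size]
```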

**Naming Conventions:**
- `snake_case` for functions/variables, `CamelCase` for classes, `UPPER_CASE` for constants
- Descriptive names: `stream_summarize_transcript`, not `summ`

**Error Handling:**
- Use explicit error messages with f-strings, check file existence before operations
- Use `try/except` for external API calls (Hugging Face, model loading)
- Log errors with context for debugging

## Dependencies

**Required:**
- `llama-cpp-python>=0.3.0` - Core inference engine (installed from the llama-cpp-python submodule)
- `gradio>=5.0.0` - Web UI framework
- `gradio_huggingfacehub_search>=0.0.12` - HuggingFace model search component
- `huggingface-hub>=0.23.0` - Model downloading
- `opencc-python-reimplemented>=0.1.7` - Chinese text conversion
- `numpy>=1.24.0` - Numerical operations for embeddings

**Development (optional):**
- `pytest>=7.4.0` - Testing framework
- `ruff` - Linting and formatting
- `mypy` - Type checking

## Project Structure

```
tiny-scribe/
β”œβ”€β”€ summarize_transcript.py    # Main CLI script
β”œβ”€β”€ app.py                     # Gradio web app
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ transcripts/               # Input transcript files
β”œβ”€β”€ test_e2e.py               # E2E test
β”œβ”€β”€ test_advanced_mode.py     # Advanced mode test
β”œβ”€β”€ test_lfm2_extract.py      # LFM2 extraction test
β”œβ”€β”€ meeting_summarizer/       # Core summarization module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ trace.py             # Tracing/logging utilities
β”‚   └── extraction.py        # Extraction and deduplication logic
β”œβ”€β”€ llama-cpp-python/          # Git submodule
└── README.md                  # Project documentation
```

## Usage Patterns

**Model Loading:**
```python
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)
```

**Inference Settings:**
- Extraction models: Low temp (0.1-0.3) for deterministic JSON
- Synthesis models: Higher temp (0.7-0.9) for creative summaries
- Reasoning types: non-reasoning (thinking checkbox hidden), hybrid (thinking toggleable), thinking-only (always on)
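One way to keep these settings consistent is a small lookup helper; the function name and the exact values below are illustrative defaults, not the repo's actual configuration:

```python
def sampling_params(role: str) -> dict:
    """Return llama-cpp sampling kwargs for a model role.

    Extraction favors deterministic JSON output; synthesis allows
    more creative summaries.
    """
    if role == "extraction":
        return {"temperature": 0.2, "top_p": 0.9}
    if role == "synthesis":
        return {"temperature": 0.8, "top_p": 0.95}
    raise ValueError(f"Unknown model role: {role}")
```

The returned kwargs can then be splatted into a call such as `llm.create_chat_completion(messages, **sampling_params("extraction"))`.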

**Environment & GPU:**
```bash
DEFAULT_N_THREADS=2          # CPU threads (1-32)
N_GPU_LAYERS=0               # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300  # Download timeout (seconds)
```

GPU offload detection: `from llama_cpp import llama_supports_gpu_offload`
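Reading those variables with safe defaults might look like this sketch (`runtime_config` is a hypothetical name, not an existing helper):

```python
import os


def runtime_config() -> dict:
    """Collect thread/GPU settings from the environment, with safe defaults."""
    return {
        "n_threads": int(os.environ.get("DEFAULT_N_THREADS", "2")),
        "n_gpu_layers": int(os.environ.get("N_GPU_LAYERS", "0")),
        "hf_timeout": int(os.environ.get("HF_HUB_DOWNLOAD_TIMEOUT", "300")),
    }
```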

## Notes for AI Agents

- Always call `llm.reset()` after completion to ensure state isolation
- Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
- Default language output is English (zh-TW available via `-l zh-TW` or web UI)
- OpenCC conversion is applied only when `output_language` is "zh-TW"
- HuggingFace cache at `~/.cache/huggingface/hub/` - clean periodically
- HF Spaces runs on the CPU tier (2 vCPUs, 16 GB RAM)
- Keep model sizes under 4 GB for reasonable performance on the free tier
- Tests exist in root (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py)
- Submodule tests in llama-cpp-python/tests/
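Splitting the `repo_id:quant` model format could be done roughly like this (hypothetical helper; the actual CLI parsing may differ):

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'repo_id:quant' into its parts; the quant may be absent."""
    repo_id, sep, quant = spec.partition(":")
    return repo_id, (quant if sep else None)
```

The quant string could then select a file via a glob, e.g. `filename=f"*{quant}.gguf"`, as in the Model Loading example above.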