# AGENTS.md - Tiny Scribe Project Guidelines
## Project Overview
Tiny Scribe is a Python CLI tool and Gradio web app that summarizes transcripts with GGUF models (e.g., ERNIE, Qwen, Granite) via llama-cpp-python. It supports live streaming output and summaries in English or Traditional Chinese (zh-TW), with OpenCC handling the Chinese conversion.
## Build / Lint / Test Commands
**Run the CLI script:**
```bash
python summarize_transcript.py -i ./transcripts/short.txt # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c # CPU only
```
**Run the Gradio web app:**
```bash
python app.py # Starts on port 7860
```
**Linting (if ruff installed):**
```bash
ruff check .
ruff format . # Auto-format code
```
**Type checking (if mypy installed):**
```bash
mypy summarize_transcript.py
mypy app.py
```
**Running tests (root project tests):**
```bash
# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py
# Run single test with pytest
pytest test_e2e.py -v # Run all tests in file
pytest test_e2e.py::test_e2e -v # Run specific function
pytest test_advanced_mode.py -k "test_name" # Run by name pattern
```
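As a quick illustration of the pytest pattern above, a root-level test is just a top-level `test_*` function that pytest discovers automatically; the `summarize` helper below is a hypothetical stand-in, not a function from this codebase:

```python
# Hypothetical sketch; summarize() is a toy stand-in for whatever
# function a root-level test file actually exercises.
def summarize(text: str) -> str:
    """Return the first sentence of the text (illustrative only)."""
    return text.split(".")[0].strip()


def test_summarize_returns_first_sentence():
    # pytest collects any top-level function named test_*
    result = summarize("Meeting opened. Budget approved.")
    assert result == "Meeting opened"
```

Running `pytest -k summarize` would then select this test by name pattern, matching the `-k` usage shown above.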
**llama-cpp-python submodule tests:**
```bash
cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v
# Run specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
```
## Code Style Guidelines
**Formatting:**
- 4 spaces indentation, 100 char max line length, double quotes for docstrings
- Two blank lines before top-level functions, one blank line after a docstring
**Imports (ordered):**
```python
# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
import gradio as gr
from llama_cpp import Llama

# Local modules
from meeting_summarizer.trace import Tracer
```
**Type Hints:**
- Use type hints for params/returns
- `Optional[]` for nullable types, `Generator[str, None, None]` for generators
- Example: `def load_model(repo_id: str, filename: str) -> Llama:`
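Applying these conventions, a streaming function might be annotated as below; the function name and body are hypothetical, chosen only to show the `Generator` and `Optional` annotations in use:

```python
from typing import Generator, Optional


def stream_tokens(prompt: str, max_tokens: Optional[int] = None) -> Generator[str, None, None]:
    """Yield output tokens one at a time (illustrative stub, not real inference)."""
    for token in prompt.split()[:max_tokens]:
        yield token
```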
**Naming Conventions:**
- `snake_case` for functions/variables, `CamelCase` for classes, `UPPER_CASE` for constants
- Descriptive names: `stream_summarize_transcript`, not `summ`
**Error Handling:**
- Use explicit error messages with f-strings, check file existence before operations
- Use `try/except` for external API calls (Hugging Face, model loading)
- Log errors with context for debugging
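A minimal sketch of these error-handling rules, assuming a hypothetical `read_transcript` helper (not a function from this codebase): existence is checked up front, and failures re-raise with an f-string message that names the offending path.

```python
import os


def read_transcript(path: str) -> str:
    """Check existence before reading; raise with contextual messages."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Transcript not found: {path}")
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as e:
        raise RuntimeError(f"Failed to read transcript {path}: {e}") from e
```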
## Dependencies
**Required:**
- `llama-cpp-python>=0.3.0` - Core inference engine (installed from the llama-cpp-python submodule)
- `gradio>=5.0.0` - Web UI framework
- `gradio_huggingfacehub_search>=0.0.12` - HuggingFace model search component
- `huggingface-hub>=0.23.0` - Model downloading
- `opencc-python-reimplemented>=0.1.7` - Chinese text conversion
- `numpy>=1.24.0` - Numerical operations for embeddings
**Development (optional):**
- `pytest>=7.4.0` - Testing framework
- `ruff` - Linting and formatting
- `mypy` - Type checking
## Project Structure
```
tiny-scribe/
├── summarize_transcript.py   # Main CLI script
├── app.py                    # Gradio web app
├── requirements.txt          # Python dependencies
├── transcripts/              # Input transcript files
├── test_e2e.py               # E2E test
├── test_advanced_mode.py     # Advanced mode test
├── test_lfm2_extract.py      # LFM2 extraction test
├── meeting_summarizer/       # Core summarization module
│   ├── __init__.py
│   ├── trace.py              # Tracing/logging utilities
│   └── extraction.py         # Extraction and deduplication logic
├── llama-cpp-python/         # Git submodule
└── README.md                 # Project documentation
```
## Usage Patterns
**Model Loading:**
```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)
```
**Inference Settings:**
- Extraction models: Low temp (0.1-0.3) for deterministic JSON
- Synthesis models: Higher temp (0.7-0.9) for creative summaries
- Reasoning types: non-reasoning models (thinking checkbox hidden), hybrid (thinking toggleable), thinking-only (thinking always on)
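The two temperature regimes can be sketched as shared kwargs dicts; `llm` is assumed to be a loaded `llama_cpp.Llama` instance, and the exact values and helper name are illustrative:

```python
# Hypothetical settings illustrating the two regimes described above.
EXTRACTION_KWARGS = {"temperature": 0.2, "max_tokens": 1024}  # deterministic JSON
SYNTHESIS_KWARGS = {"temperature": 0.8, "max_tokens": 2048}   # creative summaries


def run_extraction(llm, prompt: str) -> str:
    """Run a low-temperature completion; llm is a loaded Llama instance."""
    out = llm.create_completion(prompt, **EXTRACTION_KWARGS)
    return out["choices"][0]["text"]
```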
**Environment & GPU:**
```bash
DEFAULT_N_THREADS=2 # CPU threads (1-32)
N_GPU_LAYERS=0 # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300 # Download timeout (seconds)
```
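A minimal sketch of reading these overrides at startup, assuming the defaults listed above (the helper name is hypothetical):

```python
import os


def read_runtime_config() -> dict:
    """Read the environment overrides above, falling back to the defaults."""
    return {
        "n_threads": int(os.environ.get("DEFAULT_N_THREADS", "2")),
        "n_gpu_layers": int(os.environ.get("N_GPU_LAYERS", "0")),
        "hf_timeout": int(os.environ.get("HF_HUB_DOWNLOAD_TIMEOUT", "300")),
    }
```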
GPU offload detection: `from llama_cpp import llama_supports_gpu_offload`
## Notes for AI Agents
- Always call `llm.reset()` after completion to ensure state isolation
- Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
- Default language output is English (zh-TW available via `-l zh-TW` or web UI)
- OpenCC conversion only applied when output_language is "zh-TW"
- HuggingFace cache at `~/.cache/huggingface/hub/` - clean periodically
- HF Spaces runs on CPU tier with 2 vCPUs, 16GB RAM
- Keep model sizes under 4GB for reasonable performance on free tier
- Tests exist in root (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py)
- Submodule tests in llama-cpp-python/tests/
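The `repo_id:quant` model format above can be split with a small helper; this sketch (the function name and the `*{quant}.gguf` glob mapping are assumptions, mirroring the `filename="*Q4_0.gguf"` pattern in the loading example) shows one way to parse it:

```python
from typing import Tuple


def parse_model_spec(spec: str) -> Tuple[str, str]:
    """Split 'repo_id:quant' into (repo_id, filename glob); illustrative sketch."""
    repo_id, _, quant = spec.rpartition(":")
    if not repo_id:
        raise ValueError(f"Expected 'repo_id:quant', got: {spec!r}")
    return repo_id, f"*{quant}.gguf"
```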