# AGENTS.md - Tiny Scribe Project Guidelines

## Project Overview

Tiny Scribe is a Python CLI tool and Gradio web app for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and bilingual summaries (English or Traditional Chinese, zh-TW) via OpenCC.

## Build / Lint / Test Commands

**Run the CLI script:**

```bash
python summarize_transcript.py -i ./transcripts/short.txt          # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c                                  # CPU only
```

**Run the Gradio web app:**

```bash
python app.py  # Starts on port 7860
```

**Linting (if ruff installed):**

```bash
ruff check .
ruff format .  # Auto-format code
```

**Type checking (if mypy installed):**

```bash
mypy summarize_transcript.py
mypy app.py
```

**Running tests (root project tests):**

```bash
# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py

# Run tests with pytest
pytest test_e2e.py -v                       # Run all tests in file
pytest test_e2e.py::test_e2e -v             # Run specific function
pytest test_advanced_mode.py -k "test_name" # Run by name pattern
```

**llama-cpp-python submodule tests:**

```bash
cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v

# Run specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
```

## Code Style Guidelines

**Formatting:**
- 4-space indentation, 100-char max line length, double quotes for docstrings
- Two blank lines before functions, one after docstrings

**Imports (ordered):**

```python
# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
from llama_cpp import Llama
import gradio as gr

# Local modules
from meeting_summarizer.trace import Tracer
```

**Type Hints:**
- Use type hints for params/returns
- `Optional[]` for nullable types, `Generator[str, None, None]` for generators
- Example: `def load_model(repo_id: str, filename: str) -> Llama:`

**Naming Conventions:**
- `snake_case` for functions/variables, `CamelCase` for classes, `UPPER_CASE` for constants
- Descriptive names: `stream_summarize_transcript`, not `summ`

**Error Handling:**
- Use explicit error messages with f-strings; check file existence before operations
- Use `try/except` for external API calls (Hugging Face, model loading)
- Log errors with context for debugging

## Dependencies

**Required:**
- `llama-cpp-python>=0.3.0` - Core inference engine (installed from the llama-cpp-python submodule)
- `gradio>=5.0.0` - Web UI framework
- `gradio_huggingfacehub_search>=0.0.12` - HuggingFace model search component
- `huggingface-hub>=0.23.0` - Model downloading
- `opencc-python-reimplemented>=0.1.7` - Chinese text conversion
- `numpy>=1.24.0` - Numerical operations for embeddings

**Development (optional):**
- `pytest>=7.4.0` - Testing framework
- `ruff` - Linting and formatting
- `mypy` - Type checking

## Project Structure

```
tiny-scribe/
├── summarize_transcript.py   # Main CLI script
├── app.py                    # Gradio web app
├── requirements.txt          # Python dependencies
├── transcripts/              # Input transcript files
├── test_e2e.py               # E2E test
├── test_advanced_mode.py     # Advanced mode test
├── test_lfm2_extract.py      # LFM2 extraction test
├── meeting_summarizer/       # Core summarization module
│   ├── __init__.py
│   ├── trace.py              # Tracing/logging utilities
│   └── extraction.py         # Extraction and deduplication logic
├── llama-cpp-python/         # Git submodule
└── README.md                 # Project documentation
```

## Usage Patterns

**Model Loading:**

```python
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)
```

**Inference Settings:**
- Extraction models: low temperature (0.1-0.3) for deterministic JSON
- Synthesis models: higher temperature (0.7-0.9) for creative summaries
- Reasoning types: non-reasoning (hide checkbox), hybrid (toggleable), thinking-only (always on)

**Environment & GPU:**

```bash
DEFAULT_N_THREADS=2          # CPU threads (1-32)
N_GPU_LAYERS=0               # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300  # Download timeout (seconds)
```

GPU offload detection: `from llama_cpp import llama_supports_gpu_offload`

## Notes for AI Agents

- Always call `llm.reset()` after completion to ensure state isolation
- Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
- Default output language is English (zh-TW available via `-l zh-TW` or the web UI)
- OpenCC conversion is applied only when output_language is "zh-TW"
- HuggingFace cache lives at `~/.cache/huggingface/hub/` - clean it periodically
- HF Spaces runs on the CPU tier with 2 vCPUs and 16 GB RAM
- Keep model sizes under 4 GB for reasonable performance on the free tier
- Root tests live at the top level (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py); submodule tests in llama-cpp-python/tests/
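The `Generator[str, None, None]` convention from the type-hint guidelines can be illustrated with a minimal streaming sketch; `stream_pieces` and its input are hypothetical stand-ins for the real model stream, not functions from the codebase:

```python
from typing import Generator


def stream_pieces(pieces: list[str]) -> Generator[str, None, None]:
    """Yield text fragments one at a time, mirroring the live-streaming pattern."""
    for piece in pieces:
        yield piece


# Consumers can render each fragment as it arrives, or join the full result
summary = "".join(stream_pieces(["Action items: ", "ship v1.0."]))
print(summary)  # Action items: ship v1.0.
```

The `None, None` parameters say the generator neither receives sent values nor returns one, which matches a fire-and-forget token stream.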
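The error-handling guidelines (existence checks before file operations, explicit f-string messages) might look like this in practice; `read_transcript` is an illustrative helper, not part of the project's API:

```python
import os


def read_transcript(path: str) -> str:
    """Read a transcript file, failing early with a contextual message."""
    # Check file existence before operations, per the error-handling guideline
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Transcript not found: {path}")
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Failing before the `open()` call keeps the error message specific to this tool instead of surfacing a bare OS-level traceback.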
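The `repo_id:quant` model format noted above can be split with a small parser; this `parse_model_spec` helper is a sketch of one plausible approach, not the project's actual implementation:

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split `repo_id:quant` (e.g. `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`) into parts."""
    # rsplit on the last colon so the quant suffix never swallows the repo path
    if ":" in spec:
        repo_id, quant = spec.rsplit(":", 1)
        return repo_id, quant
    return spec, None  # No quant given; the caller can fall back to a default
```

Returning `Optional[str]` for the quant follows the project's own type-hint convention for nullable values.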