# AGENTS.md - Tiny Scribe Project Guidelines

## Project Overview

Tiny Scribe is a Python CLI tool and Gradio web app for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and bilingual summaries (English or Traditional Chinese, zh-TW) via OpenCC.

## Build / Lint / Test Commands

**Run the CLI script:**

```bash
python summarize_transcript.py -i ./transcripts/short.txt          # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c                                  # CPU only
```

**Run the Gradio web app:**

```bash
python app.py  # Starts on port 7860
```

**Linting (if ruff installed):**

```bash
ruff check .
ruff format .  # Auto-format code
```

**Type checking (if mypy installed):**

```bash
mypy summarize_transcript.py
mypy app.py
```

**Running tests (root project tests):**

```bash
# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py

# Run tests with pytest
pytest test_e2e.py -v                       # Run all tests in file
pytest test_e2e.py::test_e2e -v             # Run specific function
pytest test_advanced_mode.py -k "test_name" # Run by name pattern
```

**llama-cpp-python submodule tests:**

```bash
cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v

# Run specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
```

## Code Style Guidelines

**Formatting:**
- 4-space indentation, 100-char max line length, double quotes for docstrings
- Two blank lines before functions, one after docstrings

**Imports (ordered):**

```python
# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
from llama_cpp import Llama
import gradio as gr

# Local modules
from meeting_summarizer.trace import Tracer
```

**Type Hints:**
- Use type hints for params/returns
- `Optional[]` for nullable types, `Generator[str, None, None]` for generators
- Example: `def load_model(repo_id: str, filename: str) -> Llama:`

**Naming Conventions:**
- `snake_case` for functions/variables, `CamelCase` for classes, `UPPER_CASE` for constants
- Descriptive names: `stream_summarize_transcript`, not `summ`

**Error Handling:**
- Use explicit error messages with f-strings; check file existence before operations
- Use `try/except` for external API calls (Hugging Face, model loading)
- Log errors with context for debugging

## Dependencies

**Required:**
- `llama-cpp-python>=0.3.0` - Core inference engine (installed from the llama-cpp-python submodule)
- `gradio>=5.0.0` - Web UI framework
- `gradio_huggingfacehub_search>=0.0.12` - HuggingFace model search component
- `huggingface-hub>=0.23.0` - Model downloading
- `opencc-python-reimplemented>=0.1.7` - Chinese text conversion
- `numpy>=1.24.0` - Numerical operations for embeddings

**Development (optional):**
- `pytest>=7.4.0` - Testing framework
- `ruff` - Linting and formatting
- `mypy` - Type checking

## Project Structure

```
tiny-scribe/
├── summarize_transcript.py   # Main CLI script
├── app.py                    # Gradio web app
├── requirements.txt          # Python dependencies
├── transcripts/              # Input transcript files
├── test_e2e.py               # E2E test
├── test_advanced_mode.py     # Advanced mode test
├── test_lfm2_extract.py      # LFM2 extraction test
├── meeting_summarizer/       # Core summarization module
│   ├── __init__.py
│   ├── trace.py              # Tracing/logging utilities
│   └── extraction.py         # Extraction and deduplication logic
├── llama-cpp-python/         # Git submodule
└── README.md                 # Project documentation
```

## Usage Patterns

**Model Loading:**

```python
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)
```

**Inference Settings:**
- Extraction models: low temperature (0.1-0.3) for deterministic JSON
- Synthesis models: higher temperature (0.7-0.9) for creative summaries
- Reasoning types: non-reasoning (hide checkbox), hybrid (toggleable), thinking-only (always on)

**Environment & GPU:**

```bash
DEFAULT_N_THREADS=2          # CPU threads (1-32)
N_GPU_LAYERS=0               # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300  # Download timeout (seconds)
```

GPU offload detection: `from llama_cpp import llama_supports_gpu_offload`

## Notes for AI Agents

- Always call `llm.reset()` after completion to ensure state isolation
- Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
- Default output language is English (zh-TW available via `-l zh-TW` or the web UI)
- OpenCC conversion is applied only when output_language is "zh-TW"
- HuggingFace cache lives at `~/.cache/huggingface/hub/` - clean it periodically
- HF Spaces runs on the CPU tier with 2 vCPUs and 16 GB RAM
- Keep model sizes under 4 GB for reasonable performance on the free tier
- Root tests live at the top level (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py); submodule tests in llama-cpp-python/tests/
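The `Generator[str, None, None]` convention from the type-hint guidelines can be illustrated with a minimal streaming sketch; `stream_pieces` and its input are hypothetical stand-ins for the real model stream, not functions from the codebase:

```python
from typing import Generator


def stream_pieces(pieces: list[str]) -> Generator[str, None, None]:
    """Yield text fragments one at a time, mirroring the live-streaming pattern."""
    for piece in pieces:
        yield piece


# Consumers can render each fragment as it arrives, or join the full result
summary = "".join(stream_pieces(["Action items: ", "ship v1.0."]))
print(summary)  # Action items: ship v1.0.
```

The `None, None` parameters say the generator neither receives sent values nor returns one, which matches a fire-and-forget token stream.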
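The error-handling guidelines (existence checks before file operations, explicit f-string messages) might look like this in practice; `read_transcript` is an illustrative helper, not part of the project's API:

```python
import os


def read_transcript(path: str) -> str:
    """Read a transcript file, failing early with a contextual message."""
    # Check file existence before operations, per the error-handling guideline
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Transcript not found: {path}")
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Failing before the `open()` call keeps the error message specific to this tool instead of surfacing a bare OS-level traceback.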
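The `repo_id:quant` model format noted above can be split with a small parser; this `parse_model_spec` helper is a sketch of one plausible approach, not the project's actual implementation:

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split `repo_id:quant` (e.g. `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`) into parts."""
    # rsplit on the last colon so the quant suffix never swallows the repo path
    if ":" in spec:
        repo_id, quant = spec.rsplit(":", 1)
        return repo_id, quant
    return spec, None  # No quant given; the caller can fall back to a default
```

Returning `Optional[str]` for the quant follows the project's own type-hint convention for nullable values.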