
AGENTS.md - Tiny Scribe Project Guidelines

Project Overview

Tiny Scribe is a Python CLI tool and Gradio web app for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and bilingual summaries (English or Traditional Chinese zh-TW) via OpenCC.

Build / Lint / Test Commands

Run the CLI script:

python summarize_transcript.py -i ./transcripts/short.txt              # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW    # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c  # CPU only

Run the Gradio web app:

python app.py  # Starts on port 7860

Linting (if ruff installed):

ruff check .
ruff format .            # Auto-format code

Type checking (if mypy installed):

mypy summarize_transcript.py
mypy app.py

Running tests (root project tests):

# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py

# Run single test with pytest
pytest test_e2e.py -v                          # Run all tests in file
pytest test_e2e.py::test_e2e -v               # Run specific function
pytest test_advanced_mode.py -k "test_name"    # Run by name pattern

llama-cpp-python submodule tests:

cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v

# Run specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v

Code Style Guidelines

Formatting:

  • 4-space indentation, 100-character maximum line length, double quotes for docstrings
  • Two blank lines before functions, one after docstrings

Imports (ordered):

# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
from llama_cpp import Llama
import gradio as gr

# Local modules
from meeting_summarizer.trace import Tracer

Type Hints:

  • Use type hints for params/returns
  • Optional[] for nullable types, Generator[str, None, None] for generators
  • Example: def load_model(repo_id: str, filename: str) -> Llama:
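A minimal sketch of these conventions (the function below is hypothetical, not part of the codebase):

```python
from typing import Generator, Optional


def stream_chunks(
    text: str, size: int = 40, sep: Optional[str] = None
) -> Generator[str, None, None]:
    """Yield pieces of text, illustrating the hint conventions above."""
    # Optional[] marks the nullable parameter; Generator[str, None, None]
    # declares the yield type, send type, and return type.
    if sep is not None:
        for part in text.split(sep):
            yield part
    else:
        for i in range(0, len(text), size):
            yield text[i:i + size]
```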

Naming Conventions:

  • snake_case for functions/variables, CamelCase for classes, UPPER_CASE for constants
  • Descriptive names: stream_summarize_transcript, not summ

Error Handling:

  • Use explicit error messages with f-strings, check file existence before operations
  • Use try/except for external API calls (Hugging Face, model loading)
  • Log errors with context for debugging

Dependencies

Required:

  • llama-cpp-python>=0.3.0 - Core inference engine (installed from llama-cpp-python submodule)
  • gradio>=5.0.0 - Web UI framework
  • gradio_huggingfacehub_search>=0.0.12 - HuggingFace model search component
  • huggingface-hub>=0.23.0 - Model downloading
  • opencc-python-reimplemented>=0.1.7 - Chinese text conversion
  • numpy>=1.24.0 - Numerical operations for embeddings

Development (optional):

  • pytest>=7.4.0 - Testing framework
  • ruff - Linting and formatting
  • mypy - Type checking

Project Structure

tiny-scribe/
β”œβ”€β”€ summarize_transcript.py    # Main CLI script
β”œβ”€β”€ app.py                     # Gradio web app
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ transcripts/               # Input transcript files
β”œβ”€β”€ test_e2e.py               # E2E test
β”œβ”€β”€ test_advanced_mode.py     # Advanced mode test
β”œβ”€β”€ test_lfm2_extract.py      # LFM2 extraction test
β”œβ”€β”€ meeting_summarizer/       # Core summarization module
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ trace.py             # Tracing/logging utilities
β”‚   └── extraction.py        # Extraction and deduplication logic
β”œβ”€β”€ llama-cpp-python/          # Git submodule
└── README.md                  # Project documentation

Usage Patterns

Model Loading:

llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)

Inference Settings:

  • Extraction models: Low temp (0.1-0.3) for deterministic JSON
  • Synthesis models: Higher temp (0.7-0.9) for creative summaries
  • Model reasoning types: non-reasoning (hide the reasoning checkbox), hybrid (reasoning toggleable), thinking-only (reasoning always on)
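One way these settings might be encoded (the helper and exact values are illustrative, chosen from the ranges above):

```python
def sampling_params(role: str) -> dict:
    """Return sampling settings for a model role (illustrative defaults)."""
    if role == "extraction":
        # Low temperature keeps structured JSON output deterministic.
        return {"temperature": 0.2}
    if role == "synthesis":
        # Higher temperature allows more varied summary phrasing.
        return {"temperature": 0.8}
    raise ValueError(f"Unknown model role: {role}")
```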

Environment & GPU:

DEFAULT_N_THREADS=2          # CPU threads (1-32)
N_GPU_LAYERS=0              # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300  # Download timeout (seconds)
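A sketch of resolving these variables with their documented defaults (the `runtime_config` helper is hypothetical):

```python
import os
from typing import Mapping


def runtime_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve runtime settings from the environment with documented defaults."""
    n_threads = int(env.get("DEFAULT_N_THREADS", "2"))
    return {
        # Clamp thread count to the documented 1-32 range.
        "n_threads": max(1, min(n_threads, 32)),
        "n_gpu_layers": int(env.get("N_GPU_LAYERS", "0")),
        "hf_timeout": int(env.get("HF_HUB_DOWNLOAD_TIMEOUT", "300")),
    }
```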

GPU offload detection: from llama_cpp import llama_supports_gpu_offload

Notes for AI Agents

  • Always call llm.reset() after completion to ensure state isolation
  • Model format: repo_id:quant (e.g., unsloth/Qwen3-1.7B-GGUF:Q2_K_L)
  • Default output language is English (zh-TW available via -l zh-TW or the web UI)
  • OpenCC conversion only applied when output_language is "zh-TW"
  • HuggingFace cache at ~/.cache/huggingface/hub/ - clean periodically
  • HF Spaces runs on CPU tier with 2 vCPUs, 16GB RAM
  • Keep model sizes under 4GB for reasonable performance on free tier
  • Tests exist in root (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py)
  • Submodule tests in llama-cpp-python/tests/
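The repo_id:quant format above can be parsed by splitting on the last colon, since repo IDs contain "/" but not ":" (the helper below is a hypothetical sketch, not an existing function):

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split a repo_id:quant spec into (repo_id, quant); quant may be None."""
    # Split on the last ":" so only the quant suffix is separated.
    if ":" in spec:
        repo_id, quant = spec.rsplit(":", 1)
        return repo_id, quant
    return spec, None
```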