# AGENTS.md - Tiny Scribe Project Guidelines

## Project Overview

Tiny Scribe is a Python CLI tool and Gradio web app for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and bilingual summaries (English or Traditional Chinese zh-TW) via OpenCC.

## Build / Lint / Test Commands
Run the CLI script:

```shell
python summarize_transcript.py -i ./transcripts/short.txt          # Default English output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW # Traditional Chinese output
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
python summarize_transcript.py -c                                  # CPU only
```

Run the Gradio web app:

```shell
python app.py  # Starts on port 7860
```
Linting (if ruff installed):

```shell
ruff check .
ruff format .  # Auto-format code
```

Type checking (if mypy installed):

```shell
mypy summarize_transcript.py
mypy app.py
```
Running tests (root project tests):

```shell
# Run all root tests
python test_e2e.py
python test_advanced_mode.py
python test_lfm2_extract.py

# Run a single test with pytest
pytest test_e2e.py -v                        # Run all tests in file
pytest test_e2e.py::test_e2e -v              # Run a specific function
pytest test_advanced_mode.py -k "test_name"  # Run by name pattern
```

llama-cpp-python submodule tests:

```shell
cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v

# Run a specific test
cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
```
## Code Style Guidelines

**Formatting:**
- 4-space indentation, 100-character max line length, double quotes for docstrings
- Two blank lines before functions, one after docstrings
**Imports (ordered):**

```python
# Standard library
import os
from typing import Tuple, Optional, Generator

# Third-party packages
from llama_cpp import Llama
import gradio as gr

# Local modules
from meeting_summarizer.trace import Tracer
```
**Type Hints:**
- Use type hints for parameters and return values
- `Optional[...]` for nullable types, `Generator[str, None, None]` for generators
- Example: `def load_model(repo_id: str, filename: str) -> Llama:`
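The generator convention above can be illustrated with a small streaming helper. This is an illustrative sketch, not a function from the codebase; the name `stream_chunks` and its chunking behavior are assumptions for the example:

```python
from typing import Generator, Optional


def stream_chunks(text: str, size: int = 4,
                  limit: Optional[int] = None) -> Generator[str, None, None]:
    """Yield fixed-size chunks of text, stopping early if limit is set."""
    emitted = 0
    for start in range(0, len(text), size):
        if limit is not None and emitted >= limit:
            return
        yield text[start:start + size]
        emitted += 1
```

`Optional[int]` marks the nullable `limit` parameter, and the `Generator[str, None, None]` return type matches a function that only yields strings.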
Naming Conventions:
snake_casefor functions/variables,CamelCasefor classes,UPPER_CASEfor constants- Descriptive names:
stream_summarize_transcript, notsumm
**Error Handling:**
- Use explicit error messages with f-strings; check file existence before operations
- Use `try/except` for external API calls (Hugging Face, model loading)
- Log errors with context for debugging
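A minimal sketch of the existence-check-plus-f-string pattern above; `read_transcript` is a hypothetical helper, not necessarily the project's actual function:

```python
from pathlib import Path


def read_transcript(path: str) -> str:
    """Read a transcript file, failing fast with a descriptive error."""
    p = Path(path)
    # Check existence before operating, per the guideline above
    if not p.is_file():
        raise FileNotFoundError(f"Transcript not found: {path}")
    return p.read_text(encoding="utf-8")
```

The f-string in the exception carries the offending path, which is the kind of context the logging guideline asks for.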
## Dependencies

Required:
- `llama-cpp-python>=0.3.0` - Core inference engine (installed from the llama-cpp-python submodule)
- `gradio>=5.0.0` - Web UI framework
- `gradio_huggingfacehub_search>=0.0.12` - HuggingFace model search component
- `huggingface-hub>=0.23.0` - Model downloading
- `opencc-python-reimplemented>=0.1.7` - Chinese text conversion
- `numpy>=1.24.0` - Numerical operations for embeddings

Development (optional):
- `pytest>=7.4.0` - Testing framework
- `ruff` - Linting and formatting
- `mypy` - Type checking
Project Structure
tiny-scribe/
βββ summarize_transcript.py # Main CLI script
βββ app.py # Gradio web app
βββ requirements.txt # Python dependencies
βββ transcripts/ # Input transcript files
βββ test_e2e.py # E2E test
βββ test_advanced_mode.py # Advanced mode test
βββ test_lfm2_extract.py # LFM2 extraction test
βββ meeting_summarizer/ # Core summarization module
β βββ __init__.py
β βββ trace.py # Tracing/logging utilities
β βββ extraction.py # Extraction and deduplication logic
βββ llama-cpp-python/ # Git submodule
βββ README.md # Project documentation
## Usage Patterns

**Model Loading:**

```python
llm = Llama.from_pretrained(
    repo_id="unsloth/Qwen3-0.6B-GGUF",
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
    n_ctx=32768,      # Context window size
    verbose=False,    # Cleaner output
)
```
**Inference Settings:**
- Extraction models: low temperature (0.1-0.3) for deterministic JSON
- Synthesis models: higher temperature (0.7-0.9) for creative summaries
- Reasoning types: non-reasoning (hide checkbox), hybrid (toggleable), thinking-only (always on)
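One way to keep the per-task temperature guidance above in one place is a small preset table. The names `SAMPLING_PRESETS` and `sampling_params` are hypothetical, and the exact values are just mid-range picks from the bands listed above:

```python
# Hypothetical presets mirroring the temperature guidance above.
SAMPLING_PRESETS = {
    "extraction": {"temperature": 0.2, "top_p": 0.9},   # deterministic JSON
    "synthesis": {"temperature": 0.8, "top_p": 0.95},   # creative summaries
}


def sampling_params(task: str) -> dict:
    """Return sampling kwargs for a task, defaulting to extraction settings."""
    return SAMPLING_PRESETS.get(task, SAMPLING_PRESETS["extraction"])
```

The returned dict can be splatted into an inference call (e.g. `llm.create_chat_completion(..., **sampling_params("synthesis"))`).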
**Environment & GPU:**

```shell
DEFAULT_N_THREADS=2          # CPU threads (1-32)
N_GPU_LAYERS=0               # 0=CPU, -1=all GPU
HF_HUB_DOWNLOAD_TIMEOUT=300  # Download timeout (seconds)
```

GPU offload detection: `from llama_cpp import llama_supports_gpu_offload`
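The detection call above can feed the `n_gpu_layers` choice directly. A minimal sketch, assuming a hypothetical `choose_gpu_layers` helper (the `-1`/`0` convention comes from the model-loading example above):

```python
def choose_gpu_layers(force_cpu: bool = False) -> int:
    """Pick n_gpu_layers: -1 (offload all layers) if GPU offload works, else 0."""
    if force_cpu:
        return 0
    try:
        from llama_cpp import llama_supports_gpu_offload
    except ImportError:  # llama-cpp-python not installed
        return 0
    return -1 if llama_supports_gpu_offload() else 0
```

This keeps the CLI's `-c` (CPU-only) flag and automatic detection on one code path.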
## Notes for AI Agents

- Always call `llm.reset()` after completion to ensure state isolation
- Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
- Default output language is English (zh-TW available via `-l zh-TW` or the web UI)
- OpenCC conversion is only applied when output_language is "zh-TW"
- HuggingFace cache lives at `~/.cache/huggingface/hub/` - clean it periodically
- HF Spaces runs on the CPU tier with 2 vCPUs and 16 GB RAM
- Keep model sizes under 4 GB for reasonable performance on the free tier
- Tests live in the project root (test_e2e.py, test_advanced_mode.py, test_lfm2_extract.py)
- Submodule tests are in llama-cpp-python/tests/
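The `repo_id:quant` format noted above can be split with a small helper. This is a sketch; `parse_model_spec` is a hypothetical name, not necessarily what the CLI uses:

```python
from typing import Optional, Tuple


def parse_model_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'repo_id:quant' into (repo_id, quant); quant is None if absent."""
    repo_id, sep, quant = spec.partition(":")
    return repo_id, quant if sep else None
```

`str.partition` keeps the parse unambiguous: a bare `repo_id` with no colon yields `(repo_id, None)` rather than an empty quant string.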