---
title: Tiny Scribe - Transcript Summarizer
emoji: "📄"
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "3.10"
app_file: app.py
pinned: false
license: mit
---

# Tiny Scribe

A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and an Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.

## Features

### Core Capabilities

- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
  - **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction → Deduplication → Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning is enabled

### User Interface

- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**:
  - **Model Thinking Process**: See the AI's reasoning in real time
  - **Final Summary**: Polished, formatted summary
  - **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model

## Usage

### Quick Start (Standard Mode)

1.
   **Configure Global Settings**:
   - **Output Language**: Choose English or Traditional Chinese (zh-TW)
   - **Input Content**: Upload a .txt file or paste your transcript
   - **Hardware Configuration**: Select a CPU thread preset (Free Tier, Upgrade, or Custom)
2. **Select Summarization Mode**:
   - **Standard Mode**: Single-model direct summarization (faster, simpler)
   - **Advanced Mode**: 3-model pipeline with extraction, deduplication, and synthesis (higher quality)
3. **Choose Model** (Standard Mode):
   - **Preset Models**: Select from 24+ curated models
   - **Custom GGUF**: Search and load any GGUF from HuggingFace Hub
4. **Configure Inference Parameters** (optional):
   - Temperature, Top-p, Top-k (auto-populated with model defaults)
   - Max Output Tokens
   - Enable/disable reasoning mode (for supported models)
5. **Generate Summary**: Click "✨ Generate Summary" and watch:
   - **Model Thinking Process** (left): AI's reasoning in real time
   - **Final Summary** (right): Polished result
   - **Generation Metrics**: Performance stats (tokens/sec, generation time)

### Advanced Mode (3-Model Pipeline)

For higher-quality summarization of large transcripts:

1. **Stage 1 - Extraction**: A small model (≤1.7B) extracts key points from transcript windows
2. **Stage 2 - Deduplication**: An embedding model removes duplicate items
3. **Stage 3 - Synthesis**: A large model (1B-30B) generates the executive summary

Configure each stage independently with a dedicated model, context window, and inference settings.

## Custom GGUF Models

Load any GGUF model from HuggingFace Hub:

1. Switch to the **🔧 Custom GGUF** tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5.
   The model will be downloaded and cached locally.

## Model Registry (24 Preset Models)

### Tiny Models (0.1-0.6B)

- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B

### Compact Models (1-2.6B)

- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript

### Standard Models (3-7B)

- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B

### Large Models (21-30B)

- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)

## Technical Details

- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens, depending on model
- **UI Framework**: Gradio with streaming support
- **Model Search**: gradio_huggingfacehub_search component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)

## Hardware Configuration

| Preset | CPU Threads | Best For |
|--------|-------------|----------|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |

## Reasoning Mode

For models that support thinking/reasoning (marked with a ⚡ icon):

- Automatically extends the context window by 50%
- Shows reasoning steps before the final summary
- Can be toggled on/off per generation

## Limitations

- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds, depending on model size
- **CPU Inference**: The free tier runs on CPU; larger models need more time
- **Custom Models**: Must be GGUF format from HuggingFace Hub

## CLI Usage

```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use a specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -c
```

## Requirements

```bash
pip install -r requirements.txt
```

## Repository

[Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)

## License

MIT License
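
## Appendix: Pipeline Sketch

The windowing and deduplication steps of the Advanced Mode pipeline can be illustrated with a small, dependency-free sketch. This is not the app's actual implementation: the real Stage 2 uses an embedding model, which this sketch replaces with a `difflib` string-similarity heuristic so it runs without any model downloads. All function names, window sizes, and thresholds below are illustrative assumptions.

```python
from difflib import SequenceMatcher

def split_windows(text: str, window_chars: int = 2000, overlap: int = 200):
    """Stage 1 prep (illustrative): slice a transcript into overlapping windows.

    The overlap keeps sentences that straddle a boundary visible to both
    windows, so the extraction model does not miss them.
    """
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + window_chars])
        if start + window_chars >= len(text):
            break
        start += window_chars - overlap
    return windows

def dedup_points(points, threshold: float = 0.85):
    """Stage 2 stand-in: drop near-duplicate extracted key points.

    The Space uses an embedding model here; SequenceMatcher is a
    crude approximation used only to make the sketch self-contained.
    """
    kept = []
    for p in points:
        if all(SequenceMatcher(None, p.lower(), k.lower()).ratio() < threshold
               for k in kept):
            kept.append(p)
    return kept

# Example: two near-duplicate points collapse into one.
points = [
    "Revenue grew 12% year over year.",
    "Revenue grew 12% year-over-year.",  # near-duplicate of the first
    "Headcount will stay flat in Q3.",
]
print(dedup_points(points))
```

In the real pipeline the surviving points would then be passed to the Stage 3 synthesis model, which writes the executive summary.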