---
title: Tiny Scribe - Transcript Summarizer
emoji: "π"
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "3.10"
app_file: app.py
pinned: false
license: mit
---
# Tiny Scribe
A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.
## Features
### Core Capabilities
- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
- **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction → Deduplication → Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
### User Interface
- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**:
- **Model Thinking Process**: See the AI's reasoning in real-time
- **Final Summary**: Polished, formatted summary
- **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or Custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
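The auto-populated settings above can be sketched as a simple per-model lookup. This is an illustrative sketch, not the app's actual registry: the dictionary name, the model keys, and the values are assumptions.

```python
# Hypothetical per-model sampling defaults; names and values are illustrative.
MODEL_DEFAULTS = {
    "Qwen3-0.6B": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "Granite-3.3-2B": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
}

# Generic fallback for models without curated defaults (an assumption).
FALLBACK = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

def defaults_for(model_name: str) -> dict:
    """Return sampling defaults for a model, falling back to generic values."""
    return MODEL_DEFAULTS.get(model_name, FALLBACK)
```

Selecting a model in the UI would then seed the Temperature, Top-p, and Top-k sliders from `defaults_for(...)`, which users can still override.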
## Usage
### Quick Start (Standard Mode)
1. **Configure Global Settings**:
- **Output Language**: Choose English or Traditional Chinese (zh-TW)
- **Input Content**: Upload a .txt file or paste your transcript
- **Hardware Configuration**: Select CPU thread preset (Free Tier, Upgrade, or Custom)
2. **Select Summarization Mode**:
- **Standard Mode**: Single-model direct summarization (faster, simpler)
- **Advanced Mode**: 3-model pipeline with extraction, deduplication, synthesis (higher quality)
3. **Choose Model** (Standard Mode):
- **Preset Models**: Select from 24+ curated models
- **Custom GGUF**: Search and load any GGUF from HuggingFace Hub
4. **Configure Inference Parameters** (optional):
- Temperature, Top-p, Top-k (auto-populated with model defaults)
- Max Output Tokens
- Enable/disable reasoning mode (for supported models)
5. **Generate Summary**: Click "✨ Generate Summary" and watch:
- **Model Thinking Process** (left): AI's reasoning in real-time
- **Final Summary** (right): Polished result
- **Generation Metrics**: Performance stats (tokens/sec, generation time)
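The live streaming in step 5 can be sketched as a generator that yields the growing text on every token, in the style of llama-cpp-python's streaming interface. The function name and prompt handling are placeholders, not the app's actual code.

```python
# Minimal streaming sketch assuming a llama-cpp-python style callable:
# calling `llm(..., stream=True)` yields completion chunks one token at a time.
def stream_summary(llm, prompt: str, max_tokens: int = 512):
    """Yield the accumulated summary text as each token arrives."""
    text = ""
    for chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        text += chunk["choices"][0]["text"]
        yield text  # a Gradio output box re-renders on every yield
```

Wiring such a generator to a Gradio event handler is what produces the token-by-token output described above.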
### Advanced Mode (3-Model Pipeline)
For higher quality summarization with large transcripts:
1. **Stage 1 - Extraction**: Small model (≤1.7B) extracts key points from transcript windows
2. **Stage 2 - Deduplication**: Embedding model removes duplicate items
3. **Stage 3 - Synthesis**: Large model (1B-30B) generates executive summary
Configure each stage independently with dedicated model, context window, and inference settings.
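The three stages can be sketched roughly as follows. This is an illustrative sketch only: the window sizes are assumptions, and Jaccard word-overlap stands in for the embedding-model similarity the app actually uses in Stage 2.

```python
def split_windows(words: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Stage 1 input: overlapping word windows of the transcript
    (sizes here are assumptions, not the app's defaults)."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

def dedupe(points: list[str], threshold: float = 0.8) -> list[str]:
    """Stage 2: drop extracted points too similar to one already kept.
    Jaccard word overlap is a stand-in; the app uses an embedding model."""
    kept: list[str] = []
    for p in points:
        ws = set(p.lower().split())
        if all(
            len(ws & set(k.lower().split())) / len(ws | set(k.lower().split())) < threshold
            for k in kept
        ):
            kept.append(p)
    return kept
```

Stage 3 would then feed the deduplicated points to the large synthesis model as a single prompt, which is why each stage can carry its own model and context-window settings.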
## Custom GGUF Models
Load any GGUF model from HuggingFace Hub:
1. Switch to the **🔧 Custom GGUF** tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5. The model will be downloaded and cached locally
## Model Registry (24 Preset Models)
### Tiny Models (0.1-0.6B)
- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B
### Compact Models (1.5-2.6B)
- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript
### Standard Models (3-7B)
- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B
### Large Models (21-30B)
- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)
## Technical Details
- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens depending on model
- **UI Framework**: Gradio with streaming support
- **Model Search**: gradio_huggingfacehub_search component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)
## Hardware Configuration
| Preset | CPU Threads | Best For |
|--------|-------------|----------|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |
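The table above maps directly to a thread-count setting for llama.cpp. A minimal sketch, assuming the preset names in the table and clamping custom values to the documented 1-32 range (the function name is hypothetical):

```python
# Preset names mirror the hardware table; the clamping logic is an assumption.
PRESET_THREADS = {"HF Free Tier": 2, "HF CPU Upgrade": 8}

def resolve_threads(preset: str, custom: int = 4) -> int:
    """Return the CPU thread count for a preset, clamping Custom to 1-32."""
    if preset in PRESET_THREADS:
        return PRESET_THREADS[preset]
    return min(max(custom, 1), 32)
```

The result would be passed as `n_threads` when constructing the llama-cpp-python model.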
## Reasoning Mode
For models that support thinking/reasoning (marked with ⚡ icon):
- Automatically extends context window by 50%
- Provides reasoning steps before the final summary
- Toggle on/off per generation
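The 50% extension described above amounts to a one-line adjustment of the context window. A minimal sketch; the function name and integer rounding are assumptions:

```python
def effective_context(n_ctx: int, reasoning: bool) -> int:
    """Extend the context window by 50% when reasoning mode is enabled,
    leaving room for the model's thinking tokens."""
    return int(n_ctx * 1.5) if reasoning else n_ctx
```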
## Limitations
- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds depending on model size
- **CPU Inference**: Free tier runs on CPU; larger models need more time
- **Custom Models**: Must be GGUF format from HuggingFace Hub
## CLI Usage
```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt
# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
# Use specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
# CPU only
python summarize_transcript.py -c
```
## Requirements
```bash
pip install -r requirements.txt
```
## Repository
[Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)
## License
MIT License