---
title: Tiny Scribe - Transcript Summarizer
emoji: "📄"
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "3.10"
app_file: app.py
pinned: false
license: mit
---
# Tiny Scribe

A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and an Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.
## Features

### Core Capabilities

- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
  - **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction → Deduplication → Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning is enabled
### User Interface

- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**:
  - **Model Thinking Process**: See the AI's reasoning in real time
  - **Final Summary**: Polished, formatted summary
  - **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
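The "Auto Settings" behavior amounts to a per-model lookup merged onto global fallbacks. A minimal sketch — the model names and parameter values below are illustrative, not the app's actual registry:

```python
# Illustrative sampling-default registry; the real table lives in app.py and
# these particular values are assumptions, not the app's actual numbers.
DEFAULTS = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

MODEL_DEFAULTS = {
    "Qwen3-0.6B": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "Gemma-3-270M": {"temperature": 1.0},
}

def sampling_params(model_name: str) -> dict:
    """Merge a model's overrides onto the global defaults."""
    params = dict(DEFAULTS)
    params.update(MODEL_DEFAULTS.get(model_name, {}))
    return params
```

Unknown models simply fall through to the global defaults, so the UI always has something sensible to show.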
## Usage

### Quick Start (Standard Mode)

1. **Configure Global Settings**:
   - **Output Language**: Choose English or Traditional Chinese (zh-TW)
   - **Input Content**: Upload a .txt file or paste your transcript
   - **Hardware Configuration**: Select a CPU thread preset (Free Tier, Upgrade, or Custom)
2. **Select Summarization Mode**:
   - **Standard Mode**: Single-model direct summarization (faster, simpler)
   - **Advanced Mode**: 3-model pipeline with extraction, deduplication, and synthesis (higher quality)
3. **Choose a Model** (Standard Mode):
   - **Preset Models**: Select from 24+ curated models
   - **Custom GGUF**: Search and load any GGUF from HuggingFace Hub
4. **Configure Inference Parameters** (optional):
   - Temperature, Top-p, Top-k (auto-populated with model defaults)
   - Max Output Tokens
   - Enable/disable reasoning mode (for supported models)
5. **Generate the Summary**: Click "✨ Generate Summary" and watch:
   - **Model Thinking Process** (left): The AI's reasoning in real time
   - **Final Summary** (right): The polished result
   - **Generation Metrics**: Performance stats (tokens/sec, generation time)
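The generation metrics are simple arithmetic over the token stream. A minimal sketch of how such stats can be computed (the function name is illustrative, not the app's API):

```python
import time

def stream_with_metrics(token_iter):
    """Consume a stream of tokens and report tokens/sec and total time.
    In the app, each token would also be pushed to the live Gradio output."""
    start = time.perf_counter()
    tokens = list(token_iter)
    elapsed = time.perf_counter() - start
    return {
        "text": "".join(tokens),
        "tokens": len(tokens),
        "seconds": round(elapsed, 2),
        "tokens_per_sec": len(tokens) / elapsed if elapsed > 0 else 0.0,
    }
```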
### Advanced Mode (3-Model Pipeline)

For higher-quality summarization of large transcripts:

1. **Stage 1 - Extraction**: A small model (≤1.7B) extracts key points from windows
2. **Stage 2 - Deduplication**: An embedding model removes duplicate items
3. **Stage 3 - Synthesis**: A large model (1B-30B) generates the executive summary

Configure each stage independently with its own model, context window, and inference settings.
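The windowing and deduplication stages can be sketched in outline. Note that the real pipeline uses an embedding model for Stage 2; plain string similarity stands in for it here, and the window sizes are illustrative:

```python
from difflib import SequenceMatcher

def window(text: str, size: int = 2000, overlap: int = 200):
    """Split a transcript into overlapping character windows (Stage 1 input).
    Sizes here are illustrative; the app configures them per stage."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def dedupe(points, threshold: float = 0.85):
    """Stage 2 sketch: drop near-duplicate key points. The app uses an
    embedding model here; string similarity is a stand-in."""
    kept = []
    for p in points:
        if all(SequenceMatcher(None, p, k).ratio() < threshold for k in kept):
            kept.append(p)
    return kept
```

Stage 3 then feeds the surviving points to the large model as a single synthesis prompt.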
## Custom GGUF Models

Load any GGUF model from HuggingFace Hub:

1. Switch to the **🔧 Custom GGUF** tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5. The model is downloaded and cached locally
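Picking a quantization out of a repo's file list reduces to a small filter. The helper below is an assumption, not the Space's actual selection logic; the download itself would then go through `huggingface_hub.hf_hub_download`:

```python
def pick_gguf(repo_files, quant="Q4_K_M"):
    """Pick the first .gguf file matching a quantization tag.
    Illustrative helper; the Space's own logic may differ."""
    ggufs = [f for f in repo_files if f.lower().endswith(".gguf")]
    for f in ggufs:
        if quant.lower() in f.lower():
            return f
    return ggufs[0] if ggufs else None  # fall back to any GGUF in the repo
```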
## Model Registry (24 Preset Models)

### Tiny Models (0.1-0.6B)
- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B

### Compact Models (1-2.6B)
- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript

### Standard Models (3-7B)
- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B

### Large Models (21-30B)
- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)
## Technical Details

- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens depending on model
- **UI Framework**: Gradio with streaming support
- **Model Search**: gradio_huggingfacehub_search component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)
## Hardware Configuration

| Preset | CPU Threads | Best For |
|--------|-------------|----------|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |
## Reasoning Mode

For models that support thinking/reasoning (marked with a ⚡ icon):

- Automatically extends the context window by 50%
- Provides reasoning steps before the final summary
- Toggle on/off per generation
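The 50% extension above reduces to a one-line calculation; a sketch, assuming the base context window is given in tokens:

```python
def effective_ctx(base_ctx: int, reasoning: bool) -> int:
    """Extend the context window by 50% when reasoning mode is on,
    leaving room for the model's thinking tokens."""
    return int(base_ctx * 1.5) if reasoning else base_ctx
```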
## Limitations

- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds depending on model size
- **CPU Inference**: Free tier runs on CPU; larger models need more time
- **Custom Models**: Must be GGUF format from HuggingFace Hub
## CLI Usage

```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use a specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -c
```
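The flags above map naturally onto an `argparse` parser. A sketch in which the long option names are guesses — only the short flags appear in the examples:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI flags used in the examples; long option names are illustrative."""
    p = argparse.ArgumentParser(
        description="Summarize a transcript with a local GGUF model.")
    p.add_argument("-i", "--input", help="path to a .txt transcript")
    p.add_argument("-l", "--language", default="en", choices=["en", "zh-TW"],
                   help="output language")
    p.add_argument("-m", "--model",
                   help="HF repo:quant, e.g. unsloth/Qwen3-1.7B-GGUF:Q2_K_L")
    p.add_argument("-c", "--cpu", action="store_true",
                   help="force CPU-only inference")
    return p
```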
## Requirements

```bash
pip install -r requirements.txt
```
## Repository

[Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)

## License

MIT License