---
title: Tiny Scribe - Transcript Summarizer
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: '3.10'
app_file: app.py
pinned: false
license: mit
---

# Tiny Scribe

A lightweight transcript summarization tool powered by local LLMs. It offers 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and an Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.

## Features

### Core Capabilities

- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
  - **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction → Deduplication → Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning is enabled

### User Interface

- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**:
  - **Model Thinking Process**: See the AI's reasoning in real time
  - **Final Summary**: Polished, formatted summary
  - **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or a custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
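
The per-model auto-population can be sketched as a defaults table with a fallback. The model names and values below are illustrative, not the app's actual registry:

```python
# Illustrative per-model sampling defaults; the real registry in app.py
# may carry different values for each model.
MODEL_DEFAULTS = {
    "Qwen3-0.6B": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "Granite-3.3-2B": {"temperature": 0.7, "top_p": 0.9, "top_k": 40},
}

FALLBACK = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}


def sampling_defaults(model_name: str) -> dict:
    """Return the sampling settings to pre-fill in the UI for a model."""
    return {**FALLBACK, **MODEL_DEFAULTS.get(model_name, {})}
```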

## Usage

### Quick Start (Standard Mode)

1. **Configure Global Settings**:
   - **Output Language**: Choose English or Traditional Chinese (zh-TW)
   - **Input Content**: Upload a `.txt` file or paste your transcript
   - **Hardware Configuration**: Select a CPU thread preset (Free Tier, Upgrade, or Custom)
2. **Select Summarization Mode**:
   - **Standard Mode**: Single-model direct summarization (faster, simpler)
   - **Advanced Mode**: 3-model pipeline with extraction, deduplication, and synthesis (higher quality)
3. **Choose Model (Standard Mode)**:
   - **Preset Models**: Select from 24+ curated models
   - **Custom GGUF**: Search for and load any GGUF from HuggingFace Hub
4. **Configure Inference Parameters** (optional):
   - Temperature, Top-p, Top-k (auto-populated with model defaults)
   - Max Output Tokens
   - Enable/disable reasoning mode (for supported models)
5. **Generate Summary**: Click "✨ Generate Summary" and watch:
   - **Model Thinking Process** (left): the AI's reasoning in real time
   - **Final Summary** (right): the polished result
   - **Generation Metrics**: performance stats (tokens/sec, generation time)
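
Separating the "Model Thinking Process" from the "Final Summary" amounts to scanning the generated text for reasoning markers. This sketch assumes Qwen3-style `<think>…</think>` tags, which may differ for other model families:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (thinking, summary).

    Assumes the model wraps its reasoning in <think>...</think> tags;
    if no tags are present, everything is treated as summary.
    """
    start, end = text.find("<think>"), text.find("</think>")
    if start == -1 or end == -1:
        return "", text.strip()
    thinking = text[start + len("<think>"):end].strip()
    summary = text[end + len("</think>"):].strip()
    return thinking, summary
```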

### Advanced Mode (3-Model Pipeline)

For higher-quality summarization of large transcripts:

1. **Stage 1 - Extraction**: A small model (≤1.7B) extracts key points from transcript windows
2. **Stage 2 - Deduplication**: An embedding model removes duplicate items
3. **Stage 3 - Synthesis**: A large model (1B-30B) generates the executive summary

Each stage can be configured independently with its own model, context window, and inference settings.
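
The pipeline can be sketched as follows; the window sizes are illustrative, and a trivial string-match deduplicator stands in for the Stage 2 embedding model:

```python
def make_windows(transcript: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Stage 1 helper: slice the transcript into overlapping windows.

    Sizes are character counts here for simplicity; a real pipeline
    would window by tokens.
    """
    step = size - overlap
    return [transcript[i:i + size]
            for i in range(0, max(len(transcript) - overlap, 1), step)]


def deduplicate(points: list[str]) -> list[str]:
    """Stage 2 sketch: drop repeats after whitespace/case normalization.

    The real pipeline uses an embedding model with a similarity
    threshold; exact string matching is only a stand-in.
    """
    seen, unique = set(), []
    for p in points:
        key = " ".join(p.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique
```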

### Custom GGUF Models

Load any GGUF model from HuggingFace Hub:

1. Switch to the 🔧 Custom GGUF tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5. The model is downloaded and cached locally

## Model Registry (24 Preset Models)

### Tiny Models (0.1-0.6B)

- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B

### Compact Models (1-2.6B)

- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript

### Standard Models (3-7B)

- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B

### Large Models (21-30B)

- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)

## Technical Details

- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens depending on the model
- **UI Framework**: Gradio with streaming support
- **Model Search**: `gradio_huggingfacehub_search` component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)

## Hardware Configuration

| Preset | CPU Threads | Best For |
|---|---|---|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |

## Reasoning Mode

For models that support thinking/reasoning (marked with the ⚡ icon):

- Automatically extends the context window by 50%
- Shows reasoning steps before the final summary
- Can be toggled on/off per generation
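
The 50% thinking buffer amounts to enlarging the context window passed to the model at load time; a minimal sketch:

```python
def effective_context(base_ctx: int, reasoning_enabled: bool) -> int:
    """Context window to allocate, including the 50% thinking buffer.

    With reasoning enabled, the window grows by half so the model's
    thinking tokens don't crowd out the transcript and summary.
    """
    return int(base_ctx * 1.5) if reasoning_enabled else base_ctx
```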

## Limitations

- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds depending on model size
- **CPU Inference**: The free tier runs on CPU; larger models need more time
- **Custom Models**: Must be in GGUF format from HuggingFace Hub

## CLI Usage

```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use a specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -c
```

## Requirements

```bash
pip install -r requirements.txt
```

## Repository

Luigi/tiny-scribe

## License

MIT License