---
title: Tiny Scribe - Transcript Summarizer
emoji: "📄"
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "3.10"
app_file: app.py
pinned: false
license: mit
---
# Tiny Scribe
A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.
## Features
### Core Capabilities
- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
  - **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction → Deduplication → Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
### User Interface
- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**:
  - **Model Thinking Process**: See the AI's reasoning in real time
  - **Final Summary**: Polished, formatted summary
  - **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or Custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
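Conceptually, auto-populated settings amount to a lookup table keyed by model name with a fallback for unknown models. The sketch below is illustrative only; the model names and values are placeholders, not the app's actual registry:

```python
# Illustrative per-model sampling defaults; the real registry in app.py
# may differ (model names and values here are placeholders).
MODEL_DEFAULTS = {
    "Qwen3-0.6B":   {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "Gemma-3-270M": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}

FALLBACK = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

def sampling_defaults(model_name: str) -> dict:
    """Return the sampling parameters to pre-fill for a model."""
    return {**FALLBACK, **MODEL_DEFAULTS.get(model_name, {})}
```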
## Usage
### Quick Start (Standard Mode)
1. **Configure Global Settings**:
   - **Output Language**: Choose English or Traditional Chinese (zh-TW)
   - **Input Content**: Upload a .txt file or paste your transcript
   - **Hardware Configuration**: Select a CPU thread preset (Free Tier, Upgrade, or Custom)
2. **Select Summarization Mode**:
   - **Standard Mode**: Single-model direct summarization (faster, simpler)
   - **Advanced Mode**: 3-model pipeline with extraction, deduplication, and synthesis (higher quality)
3. **Choose Model** (Standard Mode):
   - **Preset Models**: Select from 24+ curated models
   - **Custom GGUF**: Search and load any GGUF from HuggingFace Hub
4. **Configure Inference Parameters** (optional):
   - Temperature, Top-p, Top-k (auto-populated with model defaults)
   - Max Output Tokens
   - Enable/disable reasoning mode (for supported models)
5. **Generate Summary**: Click "✨ Generate Summary" and watch:
   - **Model Thinking Process** (left): The AI's reasoning in real time
   - **Final Summary** (right): Polished result
   - **Generation Metrics**: Performance stats (tokens/sec, generation time)
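The live streaming above can be pictured as a generator that yields the accumulated text after every token, with the UI re-rendering each yielded value. This is a simplified stand-in for the actual llama-cpp-python stream:

```python
def stream_summary(tokens):
    """Yield the running summary text after each new token,
    mirroring how the UI updates the output box token by token."""
    text = ""
    for tok in tokens:
        text += tok
        yield text  # each yield triggers a UI refresh

# Example with pre-tokenized output standing in for the model:
chunks = list(stream_summary(["The ", "meeting ", "covered ", "budgets."]))
```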
### Advanced Mode (3-Model Pipeline)
For higher quality summarization with large transcripts:
1. **Stage 1 - Extraction**: Small model (≤1.7B) extracts key points from windows
2. **Stage 2 - Deduplication**: Embedding model removes duplicate items
3. **Stage 3 - Synthesis**: Large model (1B-30B) generates executive summary
Configure each stage independently with dedicated model, context window, and inference settings.
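A minimal sketch of the three stages, with trivial stand-ins for the actual models (the window sizes, the sentence-level extraction, and the exact-match dedup rule below are all illustrative; the app uses LLMs and embeddings for these steps):

```python
def windows(text: str, size: int = 200, overlap: int = 50):
    """Stage 1 input: split the transcript into overlapping windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def extract(window: str) -> list:
    """Stage 1 stand-in: a small LLM would return key points per window."""
    return [line.strip() for line in window.split(".") if line.strip()]

def deduplicate(points: list) -> list:
    """Stage 2 stand-in: the app uses an embedding model; here we drop
    exact duplicates (case-insensitive) while preserving order."""
    seen, unique = set(), []
    for p in points:
        key = p.lower()
        if key not in seen:
            seen.add(key)
            unique.append(p)
    return unique

def synthesize(points: list) -> str:
    """Stage 3 stand-in: a large LLM would write the executive summary."""
    return "Summary: " + "; ".join(points)

transcript = "Budget approved. Hiring frozen. Budget approved."
summary = synthesize(deduplicate([p for w in windows(transcript, 60, 10)
                                  for p in extract(w)]))
```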
## Custom GGUF Models
Load any GGUF model from HuggingFace Hub:
1. Switch to the **🔧 Custom GGUF** tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5. The model will be downloaded and cached locally
## Model Registry (24 Preset Models)
### Tiny Models (0.1-0.6B)
- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B
### Compact Models (1-2.6B)
- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript
### Standard Models (3-7B)
- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B
### Large Models (21-30B)
- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)
## Technical Details
- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens depending on model
- **UI Framework**: Gradio with streaming support
- **Model Search**: gradio_huggingfacehub_search component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)
## Hardware Configuration
| Preset | CPU Threads | Best For |
|--------|-------------|----------|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |
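The presets above boil down to a thread count handed to the inference engine; a sketch of the resolution logic (the preset names mirror the table, and the clamp matches its 1-32 range, but this is not the app's exact code):

```python
def thread_count(preset: str, custom: int = 4) -> int:
    """Resolve a hardware preset to the CPU thread count used for inference."""
    presets = {"HF Free Tier": 2, "HF CPU Upgrade": 8}
    if preset == "Custom":
        return max(1, min(custom, 32))  # clamp to the table's 1-32 range
    return presets[preset]
```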
## Reasoning Mode
For models that support thinking/reasoning (marked with ⚡ icon):
- Automatically extends context window by 50%
- Provides reasoning steps before the final summary
- Toggle on/off per generation
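The thinking buffer is simple arithmetic: with reasoning enabled, the effective context window grows by 50% to leave room for the reasoning tokens before the summary. A sketch of the rule, not the app's exact code:

```python
def effective_ctx(n_ctx: int, reasoning: bool) -> int:
    """Extend the context window by 50% when reasoning mode is on."""
    return int(n_ctx * 1.5) if reasoning else n_ctx

# e.g. an 8K window grows to 12K when the model is allowed to think
assert effective_ctx(8192, reasoning=True) == 12288
```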
## Limitations
- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds depending on model size
- **CPU Inference**: Free tier runs on CPU; larger models need more time
- **Custom Models**: Must be GGUF format from HuggingFace Hub
## CLI Usage
```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use a specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -c
```
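The `-m` flag takes a `repo_id:quant` spec, as in the example above. Parsing it correctly requires splitting from the right, since `repo_id` itself contains a `/` and model filenames can contain colons only in the quant suffix (a hedged sketch; the script's real parsing may differ):

```python
def parse_model_spec(spec: str):
    """Split 'repo_id:quant' into (repo_id, quant); quant is optional.
    Split from the right so the 'org/name' repo id stays intact."""
    repo, sep, quant = spec.rpartition(":")
    return (repo, quant) if sep else (quant, None)
```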
## Requirements
```bash
pip install -r requirements.txt
```
## Repository
[Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)
## License
MIT License