---
title: Tiny Scribe - Transcript Summarizer
emoji: "πŸ“„"
colorFrom: blue
colorTo: green
sdk: docker
sdk_version: "3.10"
app_file: app.py
pinned: false
license: mit
---

# Tiny Scribe

A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes two summarization modes (Standard and Advanced 3-model pipeline), live streaming output, reasoning modes, and flexible deployment options.

## Features

### Core Capabilities
- **24+ Preset Models**: From tiny 100M models to powerful 30B models
- **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub with live search
- **Dual Summarization Modes**:
  - **Standard Mode**: Single-model direct summarization
  - **Advanced Mode**: 3-stage pipeline (Extraction β†’ Deduplication β†’ Synthesis)
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled

### User Interface
- **Clean Two-Column Layout**: Configuration (left) and output (right)
- **Model Source Selection**: Radio button toggle between Preset and Custom models
- **Real-Time Outputs**: 
  - **Model Thinking Process**: See the AI's reasoning in real-time
  - **Final Summary**: Polished, formatted summary
  - **Generation Metrics**: Separate section for performance stats
- **Unified Model Information**: Displays specs for Standard (1 model) or Advanced (3 models)
- **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or Custom thread count
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
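
The per-model auto-settings behavior can be sketched as a simple lookup with a generic fallback. The values and model keys below are illustrative placeholders, not the app's actual registry:

```python
# Illustrative per-model sampling defaults (hypothetical values, NOT the
# app's real table). The UI reads these when a model is selected.
SAMPLING_DEFAULTS = {
    "Qwen3-0.6B": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "Gemma-3-270M": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}

# Generic fallback used when a model has no curated entry.
FALLBACK = {"temperature": 0.7, "top_p": 0.9, "top_k": 40}

def defaults_for(model_name: str) -> dict:
    """Return sampling defaults for a model, merged over the fallback."""
    return {**FALLBACK, **SAMPLING_DEFAULTS.get(model_name, {})}
```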

## Usage

### Quick Start (Standard Mode)

1. **Configure Global Settings**:
   - **Output Language**: Choose English or Traditional Chinese (zh-TW)
   - **Input Content**: Upload a .txt file or paste your transcript
   - **Hardware Configuration**: Select CPU thread preset (Free Tier, Upgrade, or Custom)

2. **Select Summarization Mode**:
   - **Standard Mode**: Single-model direct summarization (faster, simpler)
   - **Advanced Mode**: 3-model pipeline with extraction, deduplication, synthesis (higher quality)

3. **Choose Model** (Standard Mode):
   - **Preset Models**: Select from 24+ curated models
   - **Custom GGUF**: Search and load any GGUF from HuggingFace Hub

4. **Configure Inference Parameters** (optional):
   - Temperature, Top-p, Top-k (auto-populated with model defaults)
   - Max Output Tokens
   - Enable/disable reasoning mode (for supported models)

5. **Generate Summary**: Click "✨ Generate Summary" and watch:
   - **Model Thinking Process** (left): AI's reasoning in real-time
   - **Final Summary** (right): Polished result
   - **Generation Metrics**: Performance stats (tokens/sec, generation time)
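
Splitting the streamed output into the two panes can be sketched as follows. This assumes reasoning models wrap their thought process in `<think>...</think>` tags, as Qwen3-style models do; the app's actual parsing may differ:

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split a model response into (thinking, summary).

    Assumes the reasoning segment is delimited by <think>...</think> tags
    (a convention of Qwen3-style models); a sketch, not the app's exact logic.
    """
    start, end = "<think>", "</think>"
    if start in text and end in text:
        i = text.index(start) + len(start)
        j = text.index(end)
        return text[i:j].strip(), text[j + len(end):].strip()
    # No reasoning segment: everything is the final summary.
    return "", text.strip()

streamed = "<think>The transcript covers Q3 results.</think>Revenue grew 12%."
thinking, summary = split_thinking(streamed)
```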

### Advanced Mode (3-Model Pipeline)

For higher quality summarization with large transcripts:

1. **Stage 1 - Extraction**: Small model (≤1.7B) extracts key points from windows of the transcript
2. **Stage 2 - Deduplication**: Embedding model removes duplicate items
3. **Stage 3 - Synthesis**: Large model (1B-30B) generates executive summary

Configure each stage independently with dedicated model, context window, and inference settings.
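
The windowing and deduplication stages can be sketched with two small helpers. The window size, overlap, and similarity threshold are illustrative, and the Jaccard overlap below is a dependency-free stand-in for the embedding model the pipeline actually uses:

```python
def window(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Stage 1 input prep: split a transcript into overlapping character
    windows. Size/overlap values here are illustrative, not the app's."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def dedupe(points: list[str], threshold: float = 0.8) -> list[str]:
    """Stage 2 stand-in: drop near-duplicate extracted points by token-set
    Jaccard overlap. The real app uses an embedding model; this sketch only
    shows where deduplication sits in the pipeline."""
    kept: list[str] = []
    for p in points:
        toks = set(p.lower().split())
        if all(
            len(toks & set(k.lower().split())) / len(toks | set(k.lower().split())) < threshold
            for k in kept
        ):
            kept.append(p)
    return kept
```

Stage 3 would then feed the deduplicated points to the large synthesis model as a single prompt.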

## Custom GGUF Models

Load any GGUF model from HuggingFace Hub:

1. Switch to the **πŸ”§ Custom GGUF** tab
2. Search for a model (e.g., "qwen", "llama", "phi")
3. Select a GGUF file (quantization level)
4. Click **Load Selected Model**
5. The model will be downloaded and cached locally

## Model Registry (24 Preset Models)

### Tiny Models (0.1-0.6B)
- Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
- Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
- Hunyuan-0.5B, Qwen3-0.6B

### Compact Models (1.5-2.6B)
- Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
- Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript

### Standard Models (3-7B)
- Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B

### Large Models (21-30B)
- ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
- GLM-4.7-Flash-30B (REAP & IQ2 variants)
- Qwen3-30B-A3B (Thinking & Instruct variants)

## Technical Details

- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
- **Context Windows**: 4K–256K tokens depending on model
- **UI Framework**: Gradio with streaming support
- **Model Search**: gradio_huggingfacehub_search component
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (HuggingFace Spaces compatible)

## Hardware Configuration

| Preset | CPU Threads | Best For |
|--------|-------------|----------|
| HF Free Tier | 2 vCPUs | Small models (< 2B) |
| HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
| Custom | 1-32 | Local deployment |

## Reasoning Mode

For models that support thinking/reasoning (marked with ⚑ icon):
- Automatically extends context window by 50%
- Provides reasoning steps before the final summary
- Toggle on/off per generation
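
The 50% thinking-buffer extension is simple arithmetic on the model's context window; for example, an 8K-token window grows to 12K when reasoning is on:

```python
def effective_context(n_ctx: int, reasoning: bool) -> int:
    """Apply the 50% thinking-buffer extension when reasoning is enabled."""
    return int(n_ctx * 1.5) if reasoning else n_ctx
```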

## Limitations

- **Input Size**: Varies by model (4K–256K context windows)
- **First Load**: 10–60 seconds depending on model size
- **CPU Inference**: Free tier runs on CPU; larger models need more time
- **Custom Models**: Must be GGUF format from HuggingFace Hub

## CLI Usage

```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use specific model
python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -c
```

## Requirements

```bash
pip install -r requirements.txt
```

## Repository

[Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)

## License

MIT License