# Tiny Scribe

A lightweight transcript summarization tool powered by local LLMs. It features 22 models ranging from 100M to 30B parameters, with live streaming output, reasoning modes, and flexible deployment options.

## Features

- **22 Local Models**: From tiny 100M models to powerful 30B models
- **Live Streaming**: Real-time summary generation with token-by-token output
- **Model Selection**: Dropdown to choose from the 22 available models
- **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
- **Thinking Buffer**: Automatic 50% context-window extension when reasoning is enabled
- **GPU Acceleration**: Optional GPU layer offloading (set via an environment variable, with CPU-only fallback)
- **File Upload**: Upload .txt files to summarize
- **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- **Auto Settings**: Temperature, top_p, and top_k sliders auto-populate per model
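
The live-streaming behavior can be sketched as a plain Python generator that yields the growing summary text, which is the shape Gradio expects for token-by-token rendering (an illustrative sketch only; `stream_summary` and the sample tokens are hypothetical, not the app's actual code):

```python
from typing import Iterator

def stream_summary(tokens: list[str]) -> Iterator[str]:
    """Yield the summary text grown one token at a time.

    A UI that consumes this generator re-renders each yielded value,
    so the summary appears to build up token by token.
    """
    text = ""
    for token in tokens:
        text += token
        yield text

# The real app would pull tokens from the LLM; here we fake a few.
states = list(stream_summary(["The ", "meeting ", "covered ", "Q3 ", "goals."]))
print(states[-1])  # final accumulated summary
```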

## Model Registry (22 Models)

### Tiny Models (0.1-0.6B)

- **Falcon-H1-100M** - 100M parameters, 4K context
- **Gemma-3-270M** - 270M parameters, 4K context
- **ERNIE-0.3B** - 300M parameters, 4K context
- **Granite-3.1-0.35B-A600M** - 350M parameters, 4K context
- **Granite-3.3-0.35B-A800M** - 350M parameters, 4K context
- **BitCPM4-0.5B** - 500M parameters, 32K context
- **Hunyuan-0.5B** - 500M parameters, 4K context
- **Qwen3-0.6B** - 600M parameters, 4K context

### Compact Models (1-2.6B)

- **Granite-3.1-1B-A400M** - 1B parameters, 4K context
- **Falcon-H1-1.5B** - 1.5B parameters, 32K context
- **Qwen3-1.7B-Thinking** - 1.7B parameters, 32K context (reasoning)
- **Granite-3.3-2B** - 2B parameters, 4K context
- **Youtu-LLM-2B** - 2B parameters, 8K context (reasoning toggle)
- **LFM2-2.6B-Transcript** - 2.6B parameters, 32K context (transcript-specialized)

### Standard Models (3-7B)

- **Granite-3.1-3B-A800M** - 3B parameters, 4K context
- **Qwen3-4B-Thinking** - 4B parameters, 8K context (reasoning)
- **Granite-4.0-Tiny-7B** - 7B parameters, 8K context

### Medium Models (21-30B)

- **ERNIE-4.5-21B-PT** - 21B parameters, 32K context
- **ERNIE-4.5-21B-Thinking** - 21B parameters, 32K context (reasoning)
- **GLM-4.7-Flash-23B-REAP** - 23B parameters, 32K context
- **Qwen3-30B-A3B-Thinking** - 30B parameters, 32K context (reasoning)
- **Qwen3-30B-A3B-Instruct** - 30B parameters, 32K context
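
A registry like the one above is naturally a mapping from display name to model metadata. The sketch below shows one plausible shape (the `ModelSpec` type and the sample entries are illustrative; the app's real data structure is not shown in this README):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    params: str      # human-readable parameter count
    context: int     # context window in tokens
    reasoning: bool  # supports a thinking/reasoning mode

# A few entries from the registry above, keyed by display name.
REGISTRY = {
    "Qwen3-0.6B":             ModelSpec("600M", 4_096, False),
    "Granite-3.3-2B":         ModelSpec("2B", 4_096, False),
    "Qwen3-1.7B-Thinking":    ModelSpec("1.7B", 32_768, True),
    "ERNIE-4.5-21B-Thinking": ModelSpec("21B", 32_768, True),
}

# Which models get the "Use Reasoning Mode" toggle:
reasoning_models = [name for name, spec in REGISTRY.items() if spec.reasoning]
```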

## Usage

1. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
2. **Select Model**: Choose from the dropdown of 22 available models
3. **Configure Settings** (optional):
   - Enable "Use Reasoning Mode" for thinking models
   - Adjust Temperature, Top-p, and Top-k (auto-populated per model)
4. **Upload File**: Upload a .txt file containing your transcript
5. **Click Summarize**: Watch the summary appear in real time

## Technical Details

- **Inference Engine**: llama-cpp-python
- **Model Format**: GGUF (various quantizations: Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0)
- **Context Windows**: 4K–32K tokens depending on model
- **UI Framework**: Gradio with streaming support
- **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
- **Deployment**: Docker (Hugging Face Spaces compatible)

## Reasoning Mode

For models that support thinking/reasoning (marked with a 🔮 icon):

- The context window is automatically extended by 50%
- Reasoning steps are shown before the final summary
- The mode can be toggled on/off per generation
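
The thinking buffer is a simple calculation: when reasoning is enabled, the context window grows by half. A minimal sketch (`effective_context` is a hypothetical helper name, not necessarily the app's function):

```python
def effective_context(n_ctx: int, reasoning: bool) -> int:
    """Return the context window, extended by 50% when reasoning is on."""
    return int(n_ctx * 1.5) if reasoning else n_ctx

# A 4K-context model gains room for its thinking tokens:
print(effective_context(4096, True))   # 6144
print(effective_context(4096, False))  # 4096
```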

## GPU Acceleration

Set the `N_GPU_LAYERS` environment variable:

- `-1` (or any sufficiently high value): offload all layers to the GPU
- `0`: CPU-only inference
- Unset: GPU availability is detected automatically
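
Reading `N_GPU_LAYERS` might look like the following sketch (the helper name and fallback behavior are assumptions; the app's actual auto-detection logic is not shown here):

```python
import os

def resolve_gpu_layers(default: int = -1) -> int:
    """Read N_GPU_LAYERS; fall back to `default` when unset or invalid.

    In llama.cpp terms, -1 offloads all layers and 0 forces CPU-only.
    """
    raw = os.environ.get("N_GPU_LAYERS")
    if raw is None:
        return default  # unset: use the auto/default behaviour
    try:
        return int(raw)
    except ValueError:
        return default  # garbage value: ignore it

os.environ["N_GPU_LAYERS"] = "0"
print(resolve_gpu_layers())  # 0 -> CPU-only inference
```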

## Limitations

- **Input Size**: Varies by model (4K–32K context windows)
- **First Load**: 10–60 seconds depending on model size (0.6B models load quickly; 30B models take longer)
- **CPU Inference**: The free tier runs on CPU; GPU is available with environment configuration
- **Model Size**: Larger models (21B–30B) require more RAM and longer download times

## CLI Usage

```bash
# Default English output
python summarize_transcript.py -i ./transcripts/short.txt

# Traditional Chinese output
python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW

# Use a specific model
python summarize_transcript.py -i ./transcripts/short.txt -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L

# CPU only
python summarize_transcript.py -i ./transcripts/short.txt -c
```
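
The flags above could be wired up with `argparse` roughly as follows (a sketch of the interface only; the real `summarize_transcript.py` is not reproduced in this README):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(
        description="Summarize a transcript with a local LLM")
    p.add_argument("-i", "--input", required=True,
                   help="path to a .txt transcript")
    p.add_argument("-l", "--language", default="en", choices=["en", "zh-TW"],
                   help="output language")
    p.add_argument("-m", "--model", default=None,
                   help="model id, e.g. unsloth/Qwen3-1.7B-GGUF:Q2_K_L")
    p.add_argument("-c", "--cpu", action="store_true",
                   help="force CPU-only inference")
    return p

args = build_parser().parse_args(["-i", "./transcripts/short.txt", "-l", "zh-TW"])
print(args.language)  # zh-TW
```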

## Requirements

```bash
pip install -r requirements.txt
```

See `requirements.txt` for the full list of dependencies, including llama-cpp-python, gradio, and opencc.

## Repository

[tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)

## License

MIT License - see the LICENSE file for details.