Luigi committed
Commit fd459ca · 1 parent: 2129c2f

docs: Update README with 22 models, reasoning modes, and GPU support

Files changed (1): README.md (+99 -15)
README.md CHANGED
@@ -12,36 +12,120 @@ license: mit
  
  # Tiny Scribe
  
- A lightweight transcript summarization tool powered by local LLMs (Qwen3-0.6B).
  
  ## Features
  
  - **Live Streaming**: Real-time summary generation with token-by-token output
  - **File Upload**: Upload .txt files to summarize
- - **Traditional Chinese**: Automatic conversion to zh-TW
- - **CPU Optimized**: Runs efficiently on 2 vCPUs (HuggingFace Spaces Free Tier)
- - **Small Model**: Uses Qwen3-0.6B-GGUF (Q4_K_M quantization) for fast inference
  
  ## Usage
  
- 1. Upload a .txt file containing your transcript
- 2. Click "Summarize"
- 3. Watch the summary appear in real-time!
  
  ## Technical Details
  
- - **Model**: unsloth/Qwen3-0.6B-GGUF (Q4_K_M quantization)
- - **Context Window**: 4096 tokens
- - **Inference**: CPU-only (llama-cpp-python)
- - **UI**: Gradio with streaming support
- - **Output**: Traditional Chinese (zh-TW) via OpenCC
  
  ## Limitations
  
- - Max input: ~3KB of text (truncated if exceeded)
- - First load: 30-60 seconds (model download)
- - CPU-only inference (no GPU acceleration on Free Tier)
  
  ## Repository
  
  [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
  
  # Tiny Scribe
  
+ A lightweight transcript summarization tool powered by local LLMs. Features 22 models ranging from 100M to 30B parameters, with live streaming output, reasoning modes, and flexible deployment options.
  
  ## Features
  
+ - **22 Local Models**: From tiny 100M models to powerful 30B models
  - **Live Streaming**: Real-time summary generation with token-by-token output
+ - **Model Selection**: Dropdown to choose from the 22 available models
+ - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
+ - **Thinking Buffer**: Automatic 50% context-window extension when reasoning is enabled
+ - **GPU Acceleration**: Optional GPU offload via the `N_GPU_LAYERS` environment variable, with CPU-only fallback
  - **File Upload**: Upload .txt files to summarize
+ - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
+ - **Auto Settings**: Temperature, top-p, and top-k sliders auto-populate per model
+ 
+ ## Model Registry (22 Models)
+ 
+ ### Tiny Models (0.1-0.6B)
+ - **Falcon-H1-100M** - 100M parameters, 4K context
+ - **Gemma-3-270M** - 270M parameters, 4K context
+ - **ERNIE-0.3B** - 300M parameters, 4K context
+ - **Granite-3.1-0.35B-A600M** - 350M parameters, 4K context
+ - **Granite-3.3-0.35B-A800M** - 350M parameters, 4K context
+ - **BitCPM4-0.5B** - 500M parameters, 32K context
+ - **Hunyuan-0.5B** - 500M parameters, 4K context
+ - **Qwen3-0.6B** - 600M parameters, 4K context
+ 
+ ### Compact Models (1-2.6B)
+ - **Granite-3.1-1B-A400M** - 1B parameters, 4K context
+ - **Falcon-H1-1.5B** - 1.5B parameters, 32K context
+ - **Qwen3-1.7B-Thinking** - 1.7B parameters, 32K context (reasoning)
+ - **Granite-3.3-2B** - 2B parameters, 4K context
+ - **Youtu-LLM-2B** - 2B parameters, 8K context (reasoning toggle)
+ - **LFM2-2.6B-Transcript** - 2.6B parameters, 32K context (transcript-specialized)
+ 
+ ### Standard Models (3-7B)
+ - **Granite-3.1-3B-A800M** - 3B parameters, 4K context
+ - **Qwen3-4B-Thinking** - 4B parameters, 8K context (reasoning)
+ - **Granite-4.0-Tiny-7B** - 7B parameters, 8K context
+ 
+ ### Medium Models (21-30B)
+ - **ERNIE-4.5-21B-PT** - 21B parameters, 32K context
+ - **ERNIE-4.5-21B-Thinking** - 21B parameters, 32K context (reasoning)
+ - **GLM-4.7-Flash-23B-REAP** - 23B parameters, 32K context
+ - **Qwen3-30B-A3B-Thinking** - 30B parameters, 32K context (reasoning)
+ - **Qwen3-30B-A3B-Instruct** - 30B parameters, 32K context
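The per-model metadata listed above (repo, quantization, context window, reasoning support, sampling defaults) suggests a registry keyed by model name. A minimal sketch, assuming hypothetical field names and default values that are illustrative only, not the repo's actual schema:

```python
# Hypothetical registry shape; field names and sampling defaults are
# illustrative assumptions, not taken from the actual app code.
MODEL_REGISTRY = {
    "Qwen3-0.6B": {
        "repo": "unsloth/Qwen3-0.6B-GGUF", "quant": "Q4_K_M",
        "n_ctx": 4096, "reasoning": False,
        "defaults": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    },
    "Qwen3-1.7B-Thinking": {
        "repo": "unsloth/Qwen3-1.7B-GGUF", "quant": "Q2_K_L",
        "n_ctx": 32768, "reasoning": True,
        "defaults": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    },
}

def sampling_defaults(name: str) -> dict:
    # The UI could read these to auto-populate the
    # Temperature/Top-p/Top-k sliders when a model is selected.
    return MODEL_REGISTRY[name]["defaults"]
```

A dictionary like this also gives the model dropdown its choices (`MODEL_REGISTRY.keys()`) and tells the app which models support the reasoning toggle.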
  
  ## Usage
  
+ 1. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
+ 2. **Select Model**: Choose from the dropdown of 22 available models
+ 3. **Configure Settings** (optional):
+    - Enable "Use Reasoning Mode" for thinking models
+    - Adjust Temperature, Top-p, and Top-k (auto-populated per model)
+ 4. **Upload File**: Upload a .txt file containing your transcript
+ 5. **Click Summarize**: Watch the summary appear in real-time!
  
  ## Technical Details
  
+ - **Inference Engine**: llama-cpp-python
+ - **Model Format**: GGUF (various quantizations: Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0)
+ - **Context Windows**: 4K–32K tokens depending on the model
+ - **UI Framework**: Gradio with streaming support
+ - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
+ - **Deployment**: Docker (HuggingFace Spaces compatible)
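Token-by-token streaming in Gradio is typically driven by a Python generator that yields the growing partial text after each new token. A minimal sketch of that pattern (not the app's actual code):

```python
def stream_summary(token_iter):
    """Yield the accumulated summary after each new token,
    following Gradio's generator-based streaming pattern."""
    partial = ""
    for token in token_iter:
        partial += token
        yield partial  # each yield re-renders the output component
```

In the app, a generator like this would be returned from the Gradio event handler, with `token_iter` coming from the llama-cpp-python streaming API.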
+ 
+ ## Reasoning Mode
+ 
+ For models that support thinking/reasoning (marked with the 🔮 icon):
+ - Automatically extends the context window by 50%
+ - Provides reasoning steps before the final summary
+ - Toggle on/off per generation
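The 50% thinking buffer described above amounts to a simple context-window calculation; a sketch with a hypothetical helper name:

```python
def effective_n_ctx(base_n_ctx: int, reasoning: bool) -> int:
    # Extend the context window by 50% when reasoning is enabled,
    # leaving room for the model's thinking tokens before the summary.
    return int(base_n_ctx * 1.5) if reasoning else base_n_ctx
```

For example, a 4K-context model would be loaded with a 6K window when the reasoning toggle is on.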
+ 
+ ## GPU Acceleration
+ 
+ Set the `N_GPU_LAYERS` environment variable:
+ - `-1` (or any high value): offload all layers to the GPU
+ - `0`: CPU-only inference
+ - Unset: the app auto-detects GPU availability
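A minimal sketch of how `N_GPU_LAYERS` could be read and validated; the helper name and the auto-detection fallback are assumptions, not the app's actual code:

```python
import os

def gpu_layers_from_env(auto_detect_default: int = 0) -> int:
    # -1 offloads all layers; 0 forces CPU; unset or invalid values
    # fall back to whatever auto-detection chose.
    raw = os.environ.get("N_GPU_LAYERS")
    if raw is None:
        return auto_detect_default
    try:
        return int(raw)
    except ValueError:
        return auto_detect_default
```

The resulting value maps onto llama-cpp-python's `Llama(n_gpu_layers=...)` constructor parameter, which controls how many layers are offloaded to the GPU.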
  
  ## Limitations
  
+ - **Input Size**: Varies by model (4K–32K context windows)
+ - **First Load**: 10–60 seconds depending on model size (0.6B loads quickly; 30B takes longer)
+ - **CPU Inference**: The free tier runs on CPU; GPU requires environment configuration
+ - **Model Size**: Larger models (21B–30B) need more RAM and longer download times
+ 
+ ## CLI Usage
+ 
+ ```bash
+ # Default English output
+ python summarize_transcript.py -i ./transcripts/short.txt
+ 
+ # Traditional Chinese output
+ python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
+ 
+ # Use a specific model
+ python summarize_transcript.py -i ./transcripts/short.txt -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
+ 
+ # CPU only
+ python summarize_transcript.py -i ./transcripts/short.txt -c
+ ```
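The flags used above (`-i`, `-l`, `-m`, `-c`) could be parsed with argparse. A sketch in which the long option names and defaults are guesses, not the script's actual interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Short flags mirror the CLI examples above; long names are hypothetical.
    p = argparse.ArgumentParser(
        description="Summarize a transcript with a local GGUF model")
    p.add_argument("-i", "--input", required=True,
                   help="path to a .txt transcript")
    p.add_argument("-l", "--lang", default="en",
                   help="output language (e.g. zh-TW)")
    p.add_argument("-m", "--model", default=None,
                   help="model spec as repo:quant, "
                        "e.g. unsloth/Qwen3-1.7B-GGUF:Q2_K_L")
    p.add_argument("-c", "--cpu-only", action="store_true",
                   help="force CPU inference")
    return p
```

For instance, `parse_args(["-i", "t.txt", "-l", "zh-TW", "-c"])` yields an English-to-zh-TW, CPU-only run with the default model.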
+ 
+ ## Requirements
+ 
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 
+ See `requirements.txt` for the full dependency list, including llama-cpp-python, gradio, and opencc.
  
  ## Repository
  
  [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
+ 
+ ## License
+ 
+ MIT License - see the LICENSE file for details