Luigi committed
Commit 0733923 · 1 parent: 85487c1

docs: update README with new UI features and custom GGUF loading


- Add Custom GGUF loading feature description
- Document tabbed interface (Preset Models / Custom GGUF)
- Add Hardware Configuration table
- New section: Custom GGUF Models how-to
- Update Usage section to match new UI flow
- Add gradio_huggingfacehub_search to Technical Details
- Fix repository link to Luigi/tiny-scribe
- Condense model registry format

Files changed (1): README.md (+59 -61)
README.md CHANGED
@@ -12,93 +12,93 @@ license: mit
 
  # Tiny Scribe
 
- A lightweight transcript summarization tool powered by local LLMs. Features 24 models ranging from 100M to 30B parameters with live streaming output, reasoning modes, and flexible deployment options.
 
  ## Features
 
- - **24 Local Models**: From tiny 100M models to powerful 30B models
  - **Live Streaming**: Real-time summary generation with token-by-token output
- - **Model Selection**: Dropdown to choose from 22 available models
  - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
  - **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
- - **GPU Acceleration**: Optional GPU layers support (set via environment or CPU-only fallback)
  - **File Upload**: Upload .txt files to summarize
  - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- - **Auto Settings**: Temperature, top_p, and top_k sliders auto-populate per model
 
- ## Model Registry (24 Models)
 
  ### Tiny Models (0.1-0.6B)
- - **Falcon-H1-100M** - 100M parameters, 4K context
- - **Gemma-3-270M** - 270M parameters, 4K context
- - **ERNIE-0.3B** - 300M parameters, 4K context
- - **Granite-3.1-0.35B-A600M** - 350M parameters, 4K context
- - **Granite-3.3-0.35B-A800M** - 350M parameters, 4K context
- - **BitCPM4-0.5B** - 500M parameters, 32K context
- - **Hunyuan-0.5B** - 500M parameters, 4K context
- - **Qwen3-0.6B** - 600M parameters, 4K context
 
  ### Compact Models (1.5-2.6B)
- - **Granite-3.1-1B-A400M** - 1B parameters, 4K context
- - **Falcon-H1-1.5B** - 1.5B parameters, 32K context
- - **Qwen3-1.7B-Thinking** - 1.7B parameters, 32K context (reasoning)
- - **Granite-3.3-2B** - 2B parameters, 4K context
- - **Youtu-LLM-2B** - 2B parameters, 8K context (reasoning toggle)
- - **LFM2-2.6B-Transcript** - 2.6B parameters, 32K context (transcript-specialized)
 
  ### Standard Models (3-7B)
- - **Granite-3.1-3B-A800M** - 3B parameters, 4K context
- - **Breeze-3B-Q4** - 3B parameters, 32K context
- - **Qwen3-4B-Thinking** - 4B parameters, 8K context (reasoning)
- - **Granite-4.0-Tiny-7B** - 7B parameters, 8K context
-
- ### Medium Models (21-30B)
- - **ERNIE-4.5-21B-PT** - 21B parameters, 32K context
- - **ERNIE-4.5-21B-Thinking** - 21B parameters, 32K context (reasoning)
- - **GLM-4.7-Flash-30B-REAP** - 30B parameters, 128K context (TQ1_0, REAP variant)
- - **GLM-4.7-Flash-30B-Original-IQ2** - 30B parameters, 128K context (IQ2_XXS 2-bit, original zai-org)
- - **Qwen3-30B-A3B-Thinking** - 30B parameters, 256K context (reasoning)
- - **Qwen3-30B-A3B-Instruct** - 30B parameters, 256K context
 
- ## Usage
-
- 1. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
- 2. **Select Model**: Choose from the dropdown of 24 available models
- 3. **Configure Settings** (optional):
-    - Enable "Use Reasoning Mode" for thinking models
-    - Adjust Temperature, Top-p, and Top-k (auto-populated per model)
- 4. **Upload File**: Upload a .txt file containing your transcript
- 5. **Click Summarize**: Watch the summary appear in real-time!
 
  ## Technical Details
 
  - **Inference Engine**: llama-cpp-python
- - **Model Format**: GGUF (various quantizations: Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0)
- - **Context Windows**: 4K–32K tokens depending on model
  - **UI Framework**: Gradio with streaming support
  - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
  - **Deployment**: Docker (HuggingFace Spaces compatible)
 
  ## Reasoning Mode
 
- For models that support thinking/reasoning (marked with 🔮 icon):
  - Automatically extends context window by 50%
  - Provides reasoning steps before the final summary
  - Toggle on/off per generation
 
- ## GPU Acceleration
-
- Set the `N_GPU_LAYERS` environment variable:
- - `-1` or high value: Use GPU for all layers
- - `0`: CPU-only inference
- - Default: Automatically detects GPU availability
-
  ## Limitations
 
- - **Input Size**: Varies by model (4K–32K context windows)
- - **First Load**: 10–60 seconds depending on model size (0.6B = fast, 30B = slower)
- - **CPU Inference**: Free tier runs on CPU; GPU available with environment configuration
- - **Model Size**: Larger models (21B–30B) require more RAM and download time
 
  ## CLI Usage
 
@@ -110,10 +110,10 @@ python summarize_transcript.py -i ./transcripts/short.txt
  python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
 
  # Use specific model
- python summarize_transcript.py -i ./transcripts/short.txt -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
 
  # CPU only
- python summarize_transcript.py -i ./transcripts/short.txt -c
  ```
 
  ## Requirements
@@ -122,12 +122,10 @@ python summarize_transcript.py -i ./transcripts/short.txt -c
  pip install -r requirements.txt
  ```
 
- See `requirements.txt` for full dependencies including llama-cpp-python, gradio, and opencc.
-
  ## Repository
 
- [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
 
  ## License
 
- MIT License - See LICENSE file for details
 
  # Tiny Scribe
 
+ A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes live streaming output, reasoning modes, and flexible deployment options.
 
  ## Features
 
+ - **24+ Preset Models**: From tiny 100M models to powerful 30B models
+ - **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub
+ - **Tabbed Interface**: Clean separation between Preset Models and Custom GGUF
  - **Live Streaming**: Real-time summary generation with token-by-token output
  - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
  - **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
+ - **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or Custom thread count
  - **File Upload**: Upload .txt files to summarize
  - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
+ - **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
 
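The live-streaming feature above amounts to folding streamed token chunks into a growing string and re-rendering after each one. A minimal sketch, not the app's actual code (`accumulate` is a hypothetical helper; the commented lines show how llama-cpp-python would supply the chunks):

```python
from typing import Iterable, Iterator

def accumulate(token_chunks: Iterable[str]) -> Iterator[str]:
    """Fold streamed token chunks into the growing summary shown in the UI."""
    text = ""
    for tok in token_chunks:
        text += tok
        yield text  # a Gradio generator re-renders the output box on each yield

# With llama-cpp-python the chunks would come from something like:
#   llm = Llama(model_path="model.gguf")
#   stream = llm("Summarize: ...", max_tokens=512, stream=True)
#   tokens = (c["choices"][0]["text"] for c in stream)
```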
+ ## Usage
+
+ 1. **Upload File**: Upload a .txt file containing your transcript
+ 2. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
+ 3. **Choose Model**:
+    - **Preset Models tab**: Select from 24+ curated models
+    - **Custom GGUF tab**: Search and load any GGUF from HuggingFace
+ 4. **Configure Settings** (optional, in Advanced Settings):
+    - Hardware tier (CPU threads)
+    - Temperature, Top-p, Top-k inference parameters
+ 5. **Click Generate Summary**: Watch the thinking process and summary appear in real-time!
+
+ ## Custom GGUF Models
+
+ Load any GGUF model from HuggingFace Hub:
+
+ 1. Switch to the **🔧 Custom GGUF** tab
+ 2. Search for a model (e.g., "qwen", "llama", "phi")
+ 3. Select a GGUF file (quantization level)
+ 4. Click **Load Selected Model**
+ 5. The model will be downloaded and cached locally
+
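The steps above can be approximated with `huggingface_hub` and llama-cpp-python. This is an illustrative sketch, not the app's actual code: `pick_gguf_file` is a hypothetical helper for step 3, and the repo name is just an example.

```python
def pick_gguf_file(filenames, quant):
    """Pick the first .gguf file whose name contains the quantization tag (step 3)."""
    for name in filenames:
        if name.endswith(".gguf") and quant.lower() in name.lower():
            return name
    raise ValueError(f"no GGUF file matching {quant!r}")

if __name__ == "__main__":
    # Downloads are cached locally (step 5); re-loading the same file is fast.
    from huggingface_hub import HfApi, hf_hub_download
    from llama_cpp import Llama

    repo_id = "unsloth/Qwen3-1.7B-GGUF"  # example repo
    files = HfApi().list_repo_files(repo_id)
    path = hf_hub_download(repo_id, pick_gguf_file(files, "Q2_K_L"))
    llm = Llama(model_path=path, n_ctx=4096)
```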
+ ## Model Registry (24 Preset Models)
 
  ### Tiny Models (0.1-0.6B)
+ - Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
+ - Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
+ - Hunyuan-0.5B, Qwen3-0.6B
 
  ### Compact Models (1.5-2.6B)
+ - Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
+ - Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript
 
  ### Standard Models (3-7B)
+ - Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B
 
+ ### Large Models (21-30B)
+ - ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
+ - GLM-4.7-Flash-30B (REAP & IQ2 variants)
+ - Qwen3-30B-A3B (Thinking & Instruct variants)
 
  ## Technical Details
 
  - **Inference Engine**: llama-cpp-python
+ - **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
+ - **Context Windows**: 4K–256K tokens depending on model
  - **UI Framework**: Gradio with streaming support
+ - **Model Search**: gradio_huggingfacehub_search component
  - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
  - **Deployment**: Docker (HuggingFace Spaces compatible)
 
+ ## Hardware Configuration
+
+ | Preset | CPU Threads | Best For |
+ |--------|-------------|----------|
+ | HF Free Tier | 2 vCPUs | Small models (< 2B) |
+ | HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
+ | Custom | 1-32 | Local deployment |
+
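The preset table above boils down to choosing an `n_threads` value for llama-cpp-python. A minimal sketch under that assumption (the names `PRESET_THREADS` and `resolve_threads` are illustrative, not the app's actual code):

```python
# Hypothetical preset table mirroring the Hardware Configuration above.
PRESET_THREADS = {"HF Free Tier": 2, "HF CPU Upgrade": 8}

def resolve_threads(preset, custom=None):
    """Return the n_threads value to pass to llama-cpp-python's Llama(...)."""
    if preset == "Custom":
        if custom is None or not 1 <= custom <= 32:
            raise ValueError("custom thread count must be between 1 and 32")
        return custom
    return PRESET_THREADS[preset]
```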
  ## Reasoning Mode
 
+ For models that support thinking/reasoning (marked with an icon):
  - Automatically extends context window by 50%
  - Provides reasoning steps before the final summary
  - Toggle on/off per generation
 
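The 50% thinking buffer is simple arithmetic on the context window; a sketch (the helper name is hypothetical):

```python
def effective_context(n_ctx, reasoning=False):
    """Extend the context window by 50% when reasoning mode is enabled."""
    return int(n_ctx * 1.5) if reasoning else n_ctx
```

For example, a 4K-context model gets a 6K window while reasoning is on.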
  ## Limitations
 
+ - **Input Size**: Varies by model (4K–256K context windows)
+ - **First Load**: 10–60 seconds depending on model size
+ - **CPU Inference**: Free tier runs on CPU; larger models need more time
+ - **Custom Models**: Must be GGUF format from HuggingFace Hub
 
  ## CLI Usage
 
  python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
 
  # Use specific model
+ python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
 
  # CPU only
+ python summarize_transcript.py -c
  ```
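The `-m` value above uses a `repo_id:quant` form; splitting it on the last colon keeps the `/` in the repo id intact. A hedged sketch (`parse_model_spec` is a hypothetical helper, not the script's actual function):

```python
def parse_model_spec(spec):
    """Split 'unsloth/Qwen3-1.7B-GGUF:Q2_K_L' into (repo_id, quantization)."""
    repo_id, sep, quant = spec.rpartition(":")
    if not sep or not repo_id or not quant:
        raise ValueError("expected format repo_id:quant")
    return repo_id, quant
```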
 
  ## Requirements
 
  pip install -r requirements.txt
  ```
 
  ## Repository
 
+ [Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)
 
  ## License
 
+ MIT License