Luigi committed
Commit 0733923 · 1 parent: 85487c1

docs: update README with new UI features and custom GGUF loading


- Add Custom GGUF loading feature description
- Document tabbed interface (Preset Models / Custom GGUF)
- Add Hardware Configuration table
- New section: Custom GGUF Models how-to
- Update Usage section to match new UI flow
- Add gradio_huggingfacehub_search to Technical Details
- Fix repository link to Luigi/tiny-scribe
- Condense model registry format

Files changed (1): README.md (+59 -61)
README.md CHANGED
@@ -12,93 +12,93 @@ license: mit
 
  # Tiny Scribe
 
- A lightweight transcript summarization tool powered by local LLMs. Features 24 models ranging from 100M to 30B parameters with live streaming output, reasoning modes, and flexible deployment options.
 
  ## Features
 
- - **24 Local Models**: From tiny 100M models to powerful 30B models
  - **Live Streaming**: Real-time summary generation with token-by-token output
- - **Model Selection**: Dropdown to choose from 22 available models
  - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
  - **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
- - **GPU Acceleration**: Optional GPU layers support (set via environment or CPU-only fallback)
  - **File Upload**: Upload .txt files to summarize
  - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
- - **Auto Settings**: Temperature, top_p, and top_k sliders auto-populate per model
 
- ## Model Registry (24 Models)
 
  ### Tiny Models (0.1-0.6B)
- - **Falcon-H1-100M** - 100M parameters, 4K context
- - **Gemma-3-270M** - 270M parameters, 4K context
- - **ERNIE-0.3B** - 300M parameters, 4K context
- - **Granite-3.1-0.35B-A600M** - 350M parameters, 4K context
- - **Granite-3.3-0.35B-A800M** - 350M parameters, 4K context
- - **BitCPM4-0.5B** - 500M parameters, 32K context
- - **Hunyuan-0.5B** - 500M parameters, 4K context
- - **Qwen3-0.6B** - 600M parameters, 4K context
 
  ### Compact Models (1.5-2.6B)
- - **Granite-3.1-1B-A400M** - 1B parameters, 4K context
- - **Falcon-H1-1.5B** - 1.5B parameters, 32K context
- - **Qwen3-1.7B-Thinking** - 1.7B parameters, 32K context (reasoning)
- - **Granite-3.3-2B** - 2B parameters, 4K context
- - **Youtu-LLM-2B** - 2B parameters, 8K context (reasoning toggle)
- - **LFM2-2.6B-Transcript** - 2.6B parameters, 32K context (transcript-specialized)
 
  ### Standard Models (3-7B)
- - **Granite-3.1-3B-A800M** - 3B parameters, 4K context
- - **Breeze-3B-Q4** - 3B parameters, 32K context
- - **Qwen3-4B-Thinking** - 4B parameters, 8K context (reasoning)
- - **Granite-4.0-Tiny-7B** - 7B parameters, 8K context
-
- ### Medium Models (21-30B)
- - **ERNIE-4.5-21B-PT** - 21B parameters, 32K context
- - **ERNIE-4.5-21B-Thinking** - 21B parameters, 32K context (reasoning)
- - **GLM-4.7-Flash-30B-REAP** - 30B parameters, 128K context (TQ1_0, REAP variant)
- - **GLM-4.7-Flash-30B-Original-IQ2** - 30B parameters, 128K context (IQ2_XXS 2-bit, original zai-org)
- - **Qwen3-30B-A3B-Thinking** - 30B parameters, 256K context (reasoning)
- - **Qwen3-30B-A3B-Instruct** - 30B parameters, 256K context
 
- ## Usage
-
- 1. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
- 2. **Select Model**: Choose from the dropdown of 24 available models
- 3. **Configure Settings** (optional):
-    - Enable "Use Reasoning Mode" for thinking models
-    - Adjust Temperature, Top-p, and Top-k (auto-populated per model)
- 4. **Upload File**: Upload a .txt file containing your transcript
- 5. **Click Summarize**: Watch the summary appear in real-time!
 
  ## Technical Details
 
  - **Inference Engine**: llama-cpp-python
- - **Model Format**: GGUF (various quantizations: Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0)
- - **Context Windows**: 4K–32K tokens depending on model
  - **UI Framework**: Gradio with streaming support
  - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
  - **Deployment**: Docker (HuggingFace Spaces compatible)
 
  ## Reasoning Mode
 
- For models that support thinking/reasoning (marked with 🔮 icon):
  - Automatically extends context window by 50%
  - Provides reasoning steps before the final summary
  - Toggle on/off per generation
 
- ## GPU Acceleration
-
- Set the `N_GPU_LAYERS` environment variable:
- - `-1` or high value: Use GPU for all layers
- - `0`: CPU-only inference
- - Default: Automatically detects GPU availability
-
  ## Limitations
 
- - **Input Size**: Varies by model (4K–32K context windows)
- - **First Load**: 10–60 seconds depending on model size (0.6B = fast, 30B = slower)
- - **CPU Inference**: Free tier runs on CPU; GPU available with environment configuration
- - **Model Size**: Larger models (21B–30B) require more RAM and download time
 
  ## CLI Usage
 
@@ -110,10 +110,10 @@ python summarize_transcript.py -i ./transcripts/short.txt
  python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
 
  # Use specific model
- python summarize_transcript.py -i ./transcripts/short.txt -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
 
  # CPU only
- python summarize_transcript.py -i ./transcripts/short.txt -c
  ```
 
  ## Requirements
@@ -122,12 +122,10 @@ python summarize_transcript.py -i ./transcripts/short.txt -c
  pip install -r requirements.txt
  ```
 
- See `requirements.txt` for full dependencies including llama-cpp-python, gradio, and opencc.
-
  ## Repository
 
- [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
 
  ## License
 
- MIT License - See LICENSE file for details
 
  # Tiny Scribe
 
+ A lightweight transcript summarization tool powered by local LLMs. Features 24+ preset models ranging from 100M to 30B parameters, plus the ability to load any GGUF model from HuggingFace Hub. Includes live streaming output, reasoning modes, and flexible deployment options.
 
  ## Features
 
+ - **24+ Preset Models**: From tiny 100M models to powerful 30B models
+ - **Custom GGUF Loading**: Load any GGUF model from HuggingFace Hub
+ - **Tabbed Interface**: Clean separation between Preset Models and Custom GGUF
  - **Live Streaming**: Real-time summary generation with token-by-token output
  - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
  - **Thinking Buffer**: Automatic 50% context window extension when reasoning enabled
+ - **Hardware Presets**: Free Tier (2 vCPUs), Upgrade (8 vCPUs), or Custom thread count
  - **File Upload**: Upload .txt files to summarize
  - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
+ - **Auto Settings**: Temperature, top_p, and top_k auto-populate per model
 
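The live-streaming feature above amounts to folding streamed token chunks into a growing string and re-rendering after each one. A minimal sketch, not the app's actual code (`accumulate` is a hypothetical helper; the commented lines show how llama-cpp-python would supply the chunks):

```python
from typing import Iterable, Iterator

def accumulate(token_chunks: Iterable[str]) -> Iterator[str]:
    """Fold streamed token chunks into the growing summary shown in the UI."""
    text = ""
    for tok in token_chunks:
        text += tok
        yield text  # a Gradio generator re-renders the output box on each yield

# With llama-cpp-python the chunks would come from something like:
#   llm = Llama(model_path="model.gguf")
#   stream = llm("Summarize: ...", max_tokens=512, stream=True)
#   tokens = (c["choices"][0]["text"] for c in stream)
```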
+ ## Usage
+
+ 1. **Upload File**: Upload a .txt file containing your transcript
+ 2. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
+ 3. **Choose Model**:
+    - **Preset Models tab**: Select from 24+ curated models
+    - **Custom GGUF tab**: Search and load any GGUF from HuggingFace
+ 4. **Configure Settings** (optional, in Advanced Settings):
+    - Hardware tier (CPU threads)
+    - Temperature, Top-p, Top-k inference parameters
+ 5. **Click Generate Summary**: Watch the thinking process and summary appear in real-time!
+
+ ## Custom GGUF Models
+
+ Load any GGUF model from HuggingFace Hub:
+
+ 1. Switch to the **🔧 Custom GGUF** tab
+ 2. Search for a model (e.g., "qwen", "llama", "phi")
+ 3. Select a GGUF file (quantization level)
+ 4. Click **Load Selected Model**
+ 5. The model will be downloaded and cached locally
+
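The steps above can be approximated with `huggingface_hub` and llama-cpp-python. This is an illustrative sketch, not the app's actual code: `pick_gguf_file` is a hypothetical helper for step 3, and the repo name is just an example.

```python
def pick_gguf_file(filenames, quant):
    """Pick the first .gguf file whose name contains the quantization tag (step 3)."""
    for name in filenames:
        if name.endswith(".gguf") and quant.lower() in name.lower():
            return name
    raise ValueError(f"no GGUF file matching {quant!r}")

if __name__ == "__main__":
    # Downloads are cached locally (step 5); re-loading the same file is fast.
    from huggingface_hub import HfApi, hf_hub_download
    from llama_cpp import Llama

    repo_id = "unsloth/Qwen3-1.7B-GGUF"  # example repo
    files = HfApi().list_repo_files(repo_id)
    path = hf_hub_download(repo_id, pick_gguf_file(files, "Q2_K_L"))
    llm = Llama(model_path=path, n_ctx=4096)
```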
+ ## Model Registry (24 Preset Models)
 
  ### Tiny Models (0.1-0.6B)
+ - Falcon-H1-100M, Gemma-3-270M, ERNIE-0.3B
+ - Granite-3.1-0.35B, Granite-3.3-0.35B, BitCPM4-0.5B
+ - Hunyuan-0.5B, Qwen3-0.6B
 
  ### Compact Models (1.5-2.6B)
+ - Granite-3.1-1B, Falcon-H1-1.5B, Qwen3-1.7B-Thinking
+ - Granite-3.3-2B, Youtu-LLM-2B, LFM2-2.6B-Transcript
 
  ### Standard Models (3-7B)
+ - Granite-3.1-3B, Breeze-3B, Qwen3-4B-Thinking, Granite-4.0-Tiny-7B
 
+ ### Large Models (21-30B)
+ - ERNIE-4.5-21B-PT, ERNIE-4.5-21B-Thinking
+ - GLM-4.7-Flash-30B (REAP & IQ2 variants)
+ - Qwen3-30B-A3B (Thinking & Instruct variants)
 
  ## Technical Details
 
  - **Inference Engine**: llama-cpp-python
+ - **Model Format**: GGUF (Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0, etc.)
+ - **Context Windows**: 4K–256K tokens depending on model
  - **UI Framework**: Gradio with streaming support
+ - **Model Search**: gradio_huggingfacehub_search component
  - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
  - **Deployment**: Docker (HuggingFace Spaces compatible)
 
+ ## Hardware Configuration
+
+ | Preset | CPU Threads | Best For |
+ |--------|-------------|----------|
+ | HF Free Tier | 2 vCPUs | Small models (< 2B) |
+ | HF CPU Upgrade | 8 vCPUs | Medium models (2-7B) |
+ | Custom | 1-32 | Local deployment |
+
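The preset table above boils down to choosing an `n_threads` value for llama-cpp-python. A minimal sketch under that assumption (the names `PRESET_THREADS` and `resolve_threads` are illustrative, not the app's actual code):

```python
# Hypothetical preset table mirroring the Hardware Configuration above.
PRESET_THREADS = {"HF Free Tier": 2, "HF CPU Upgrade": 8}

def resolve_threads(preset, custom=None):
    """Return the n_threads value to pass to llama-cpp-python's Llama(...)."""
    if preset == "Custom":
        if custom is None or not 1 <= custom <= 32:
            raise ValueError("custom thread count must be between 1 and 32")
        return custom
    return PRESET_THREADS[preset]
```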
  ## Reasoning Mode
 
+ For models that support thinking/reasoning (marked with an icon):
  - Automatically extends context window by 50%
  - Provides reasoning steps before the final summary
  - Toggle on/off per generation
 
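The 50% thinking buffer is simple arithmetic on the context window; a sketch (the helper name is hypothetical):

```python
def effective_context(n_ctx, reasoning=False):
    """Extend the context window by 50% when reasoning mode is enabled."""
    return int(n_ctx * 1.5) if reasoning else n_ctx
```

For example, a 4K-context model gets a 6K window while reasoning is on.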
  ## Limitations
 
+ - **Input Size**: Varies by model (4K–256K context windows)
+ - **First Load**: 10–60 seconds depending on model size
+ - **CPU Inference**: Free tier runs on CPU; larger models need more time
+ - **Custom Models**: Must be GGUF format from HuggingFace Hub
 
  ## CLI Usage
 
  python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
 
  # Use specific model
+ python summarize_transcript.py -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
 
  # CPU only
+ python summarize_transcript.py -c
  ```
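The `-m` value above uses a `repo_id:quant` form; splitting it on the last colon keeps the `/` in the repo id intact. A hedged sketch (`parse_model_spec` is a hypothetical helper, not the script's actual function):

```python
def parse_model_spec(spec):
    """Split 'unsloth/Qwen3-1.7B-GGUF:Q2_K_L' into (repo_id, quantization)."""
    repo_id, sep, quant = spec.rpartition(":")
    if not sep or not repo_id or not quant:
        raise ValueError("expected format repo_id:quant")
    return repo_id, quant
```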
 
  ## Requirements
 
  pip install -r requirements.txt
  ```
 
  ## Repository
 
+ [Luigi/tiny-scribe](https://huggingface.co/spaces/Luigi/tiny-scribe)
 
  ## License
 
+ MIT License