Luigi committed
Commit fd459ca · 1 parent: 2129c2f

docs: Update README with 22 models, reasoning modes, and GPU support

Files changed (1): README.md (+99 -15)
README.md CHANGED
@@ -12,36 +12,120 @@ license: mit
  
  # Tiny Scribe
  
- A lightweight transcript summarization tool powered by local LLMs (Qwen3-0.6B).
  
  ## Features
  
  - **Live Streaming**: Real-time summary generation with token-by-token output
  - **File Upload**: Upload .txt files to summarize
- - **Traditional Chinese**: Automatic conversion to zh-TW
- - **CPU Optimized**: Runs efficiently on 2 vCPUs (HuggingFace Spaces Free Tier)
- - **Small Model**: Uses Qwen3-0.6B-GGUF (Q4_K_M quantization) for fast inference
  
  ## Usage
  
- 1. Upload a .txt file containing your transcript
- 2. Click "Summarize"
- 3. Watch the summary appear in real-time!
  
  ## Technical Details
  
- - **Model**: unsloth/Qwen3-0.6B-GGUF (Q4_K_M quantization)
- - **Context Window**: 4096 tokens
- - **Inference**: CPU-only (llama-cpp-python)
- - **UI**: Gradio with streaming support
- - **Output**: Traditional Chinese (zh-TW) via OpenCC
  
  ## Limitations
  
- - Max input: ~3KB of text (truncated if exceeded)
- - First load: 30-60 seconds (model download)
- - CPU-only inference (no GPU acceleration on Free Tier)
  
  ## Repository
  
  [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
  
  # Tiny Scribe
  
+ A lightweight transcript summarization tool powered by local LLMs. Features 22 models ranging from 100M to 30B parameters, with live streaming output, reasoning modes, and flexible deployment options.
  
  ## Features
  
+ - **22 Local Models**: From tiny 100M models to powerful 30B models
  - **Live Streaming**: Real-time summary generation with token-by-token output
+ - **Model Selection**: Dropdown to choose from the 22 available models
+ - **Reasoning Modes**: Toggle thinking/reasoning for supported models (Qwen3, ERNIE, LFM2)
+ - **Thinking Buffer**: Automatic 50% context-window extension when reasoning is enabled
+ - **GPU Acceleration**: Optional GPU offload via the `N_GPU_LAYERS` environment variable, with CPU-only fallback
  - **File Upload**: Upload .txt files to summarize
+ - **Language Support**: English or Traditional Chinese (zh-TW) output via OpenCC
+ - **Auto Settings**: Temperature, top-p, and top-k sliders auto-populate per model
+ 
+ ## Model Registry (22 Models)
+ 
+ ### Tiny Models (0.1-0.6B)
+ - **Falcon-H1-100M** - 100M parameters, 4K context
+ - **Gemma-3-270M** - 270M parameters, 4K context
+ - **ERNIE-0.3B** - 300M parameters, 4K context
+ - **Granite-3.1-0.35B-A600M** - 350M parameters, 4K context
+ - **Granite-3.3-0.35B-A800M** - 350M parameters, 4K context
+ - **BitCPM4-0.5B** - 500M parameters, 32K context
+ - **Hunyuan-0.5B** - 500M parameters, 4K context
+ - **Qwen3-0.6B** - 600M parameters, 4K context
+ 
+ ### Compact Models (1-2.6B)
+ - **Granite-3.1-1B-A400M** - 1B parameters, 4K context
+ - **Falcon-H1-1.5B** - 1.5B parameters, 32K context
+ - **Qwen3-1.7B-Thinking** - 1.7B parameters, 32K context (reasoning)
+ - **Granite-3.3-2B** - 2B parameters, 4K context
+ - **Youtu-LLM-2B** - 2B parameters, 8K context (reasoning toggle)
+ - **LFM2-2.6B-Transcript** - 2.6B parameters, 32K context (transcript-specialized)
+ 
+ ### Standard Models (3-7B)
+ - **Granite-3.1-3B-A800M** - 3B parameters, 4K context
+ - **Qwen3-4B-Thinking** - 4B parameters, 8K context (reasoning)
+ - **Granite-4.0-Tiny-7B** - 7B parameters, 8K context
+ 
+ ### Medium Models (21-30B)
+ - **ERNIE-4.5-21B-PT** - 21B parameters, 32K context
+ - **ERNIE-4.5-21B-Thinking** - 21B parameters, 32K context (reasoning)
+ - **GLM-4.7-Flash-23B-REAP** - 23B parameters, 32K context
+ - **Qwen3-30B-A3B-Thinking** - 30B parameters, 32K context (reasoning)
+ - **Qwen3-30B-A3B-Instruct** - 30B parameters, 32K context
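The per-model metadata listed above (repo, quantization, context window, reasoning support, sampling defaults) suggests a registry keyed by model name. A minimal sketch, assuming hypothetical field names and default values that are illustrative only, not the repo's actual schema:

```python
# Hypothetical registry shape; field names and sampling defaults are
# illustrative assumptions, not taken from the actual app code.
MODEL_REGISTRY = {
    "Qwen3-0.6B": {
        "repo": "unsloth/Qwen3-0.6B-GGUF", "quant": "Q4_K_M",
        "n_ctx": 4096, "reasoning": False,
        "defaults": {"temperature": 0.7, "top_p": 0.8, "top_k": 20},
    },
    "Qwen3-1.7B-Thinking": {
        "repo": "unsloth/Qwen3-1.7B-GGUF", "quant": "Q2_K_L",
        "n_ctx": 32768, "reasoning": True,
        "defaults": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    },
}

def sampling_defaults(name: str) -> dict:
    # The UI could read these to auto-populate the
    # Temperature/Top-p/Top-k sliders when a model is selected.
    return MODEL_REGISTRY[name]["defaults"]
```

A dictionary like this also gives the model dropdown its choices (`MODEL_REGISTRY.keys()`) and tells the app which models support the reasoning toggle.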
  
  ## Usage
  
+ 1. **Select Output Language**: Choose English or Traditional Chinese (zh-TW)
+ 2. **Select Model**: Choose from the dropdown of 22 available models
+ 3. **Configure Settings** (optional):
+    - Enable "Use Reasoning Mode" for thinking models
+    - Adjust Temperature, Top-p, and Top-k (auto-populated per model)
+ 4. **Upload File**: Upload a .txt file containing your transcript
+ 5. **Click Summarize**: Watch the summary appear in real-time!
  
  ## Technical Details
  
+ - **Inference Engine**: llama-cpp-python
+ - **Model Format**: GGUF (various quantizations: Q2_K_L, Q3_K_XXS, Q4_K_M, Q4_K_L, Q8_0)
+ - **Context Windows**: 4K–32K tokens depending on the model
+ - **UI Framework**: Gradio with streaming support
+ - **Language Conversion**: OpenCC for Traditional Chinese (zh-TW)
+ - **Deployment**: Docker (HuggingFace Spaces compatible)
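Token-by-token streaming in Gradio is typically driven by a Python generator that yields the growing partial text after each new token. A minimal sketch of that pattern (not the app's actual code):

```python
def stream_summary(token_iter):
    """Yield the accumulated summary after each new token,
    following Gradio's generator-based streaming pattern."""
    partial = ""
    for token in token_iter:
        partial += token
        yield partial  # each yield re-renders the output component
```

In the app, a generator like this would be returned from the Gradio event handler, with `token_iter` coming from the llama-cpp-python streaming API.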
+ 
+ ## Reasoning Mode
+ 
+ For models that support thinking/reasoning (marked with the 🔮 icon):
+ - Automatically extends the context window by 50%
+ - Provides reasoning steps before the final summary
+ - Toggle on/off per generation
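The 50% thinking buffer described above amounts to a simple context-window calculation; a sketch with a hypothetical helper name:

```python
def effective_n_ctx(base_n_ctx: int, reasoning: bool) -> int:
    # Extend the context window by 50% when reasoning is enabled,
    # leaving room for the model's thinking tokens before the summary.
    return int(base_n_ctx * 1.5) if reasoning else base_n_ctx
```

For example, a 4K-context model would be loaded with a 6K window when the reasoning toggle is on.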
+ 
+ ## GPU Acceleration
+ 
+ Set the `N_GPU_LAYERS` environment variable:
+ - `-1` (or any high value): offload all layers to the GPU
+ - `0`: CPU-only inference
+ - Unset: the app auto-detects GPU availability
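A minimal sketch of how `N_GPU_LAYERS` could be read and validated; the helper name and the auto-detection fallback are assumptions, not the app's actual code:

```python
import os

def gpu_layers_from_env(auto_detect_default: int = 0) -> int:
    # -1 offloads all layers; 0 forces CPU; unset or invalid values
    # fall back to whatever auto-detection chose.
    raw = os.environ.get("N_GPU_LAYERS")
    if raw is None:
        return auto_detect_default
    try:
        return int(raw)
    except ValueError:
        return auto_detect_default
```

The resulting value maps onto llama-cpp-python's `Llama(n_gpu_layers=...)` constructor parameter, which controls how many layers are offloaded to the GPU.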
  
  ## Limitations
  
+ - **Input Size**: Varies by model (4K–32K context windows)
+ - **First Load**: 10–60 seconds depending on model size (0.6B loads quickly; 30B takes longer)
+ - **CPU Inference**: The free tier runs on CPU; GPU requires environment configuration
+ - **Model Size**: Larger models (21B–30B) need more RAM and longer download times
+ 
+ ## CLI Usage
+ 
+ ```bash
+ # Default English output
+ python summarize_transcript.py -i ./transcripts/short.txt
+ 
+ # Traditional Chinese output
+ python summarize_transcript.py -i ./transcripts/short.txt -l zh-TW
+ 
+ # Use a specific model
+ python summarize_transcript.py -i ./transcripts/short.txt -m unsloth/Qwen3-1.7B-GGUF:Q2_K_L
+ 
+ # CPU only
+ python summarize_transcript.py -i ./transcripts/short.txt -c
+ ```
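The flags used above (`-i`, `-l`, `-m`, `-c`) could be parsed with argparse. A sketch in which the long option names and defaults are guesses, not the script's actual interface:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Short flags mirror the CLI examples above; long names are hypothetical.
    p = argparse.ArgumentParser(
        description="Summarize a transcript with a local GGUF model")
    p.add_argument("-i", "--input", required=True,
                   help="path to a .txt transcript")
    p.add_argument("-l", "--lang", default="en",
                   help="output language (e.g. zh-TW)")
    p.add_argument("-m", "--model", default=None,
                   help="model spec as repo:quant, "
                        "e.g. unsloth/Qwen3-1.7B-GGUF:Q2_K_L")
    p.add_argument("-c", "--cpu-only", action="store_true",
                   help="force CPU inference")
    return p
```

For instance, `parse_args(["-i", "t.txt", "-l", "zh-TW", "-c"])` yields an English-to-zh-TW, CPU-only run with the default model.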
+ 
+ ## Requirements
+ 
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 
+ See `requirements.txt` for the full dependency list, including llama-cpp-python, gradio, and opencc.
  
  ## Repository
  
  [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
+ 
+ ## License
+ 
+ MIT License - see the LICENSE file for details