Spaces:

Luigi
/

tiny-scribe

Running

Luigi commited on Jan 30

Commit

10d339c

1 Parent(s): f175554

Add HuggingFace Spaces demo with Gradio UI

- Create app.py with Gradio interface for file upload and streaming output
- Add Dockerfile for containerized deployment
- Add requirements.txt with prebuilt llama-cpp-python
- Create DEPLOY.md with deployment instructions
- Update README.md for HF Spaces documentation
- Add .gitignore and .gitattributes for proper git handling

Features:
- Live streaming summary generation
- File upload support (.txt files)
- CPU-optimized for HF Spaces Free Tier (2 vCPUs)
- Traditional Chinese (zh-TW) conversion
- Uses Qwen3-0.6B-GGUF model

Files changed (8) hide show

.gitattributes +7 -0
.gitignore +55 -0
AGENTS.md +49 -3
DEPLOY.md +125 -0
Dockerfile +23 -0
README.md +36 -25
app.py +277 -0
requirements.txt +4 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1,7 @@

+*.gguf filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED Viewed

	@@ -0,0 +1,55 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+# Virtual environments
+venv/
+ENV/
+env/
+.venv
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# Model files (large files)
+*.gguf
+*.bin
+models/
+~/.cache/
+# Generated outputs
+summary.txt
+thinking.txt
+# Gradio
+.gradio/
+flagged/
+# OS
+.DS_Store
+Thumbs.db
+# Logs
+*.log

AGENTS.md CHANGED Viewed

@@ -2,7 +2,7 @@
 ## Project Overview
-Tiny Scribe is a Python CLI tool for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen) with llama-cpp-python. It supports live streaming output and Traditional Chinese (zh-TW) conversion via OpenCC.
 ## Build / Lint / Test Commands
@@ -17,6 +17,7 @@ python summarize_transcript.py -c  # CPU only
 ```bash
 ruff check .
 ruff check --select I .  # Import sorting
 ```
 **Type checking (if mypy installed):**
@@ -28,13 +29,16 @@ mypy summarize_transcript.py
 ```bash
 # No test suite in root project yet
 # Tests exist in llama-cpp-python/tests/ submodule
-# To test llama-cpp-python:
 cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v
 ```
 **Single test:**
 ```bash
-pytest tests/test_llama.py::test_function_name -v
 ```
 ## Code Style Guidelines
@@ -51,6 +55,8 @@ pytest tests/test_llama.py::test_function_name -v
 # Standard library first
 import os
 import argparse
 # Third-party packages
 from llama_cpp import Llama
@@ -61,6 +67,7 @@ from opencc import OpenCC
 **Type Hints:**
 - Use type hints for function parameters and return values
 - Use `Optional[]` for nullable types
 - Example: `def load_model(repo_id: str, filename: str, cpu_only: bool = False) -> Llama:`
 **Naming Conventions:**
@@ -78,6 +85,7 @@ from opencc import OpenCC
 - Use explicit error messages with f-strings
 - Check file existence before operations
 - Use `try/except` blocks for external API calls (Hugging Face, model loading)
 ## Dependencies
@@ -102,6 +110,9 @@ tiny-scribe/
 │   └── full.txt
 ├── summary.txt                # Generated output
 ├── llama-cpp-python/          # Git submodule
 │   └── vendor/llama.cpp/      # Core C++ library
 └── README.md                  # Project documentation
 ```
@@ -115,6 +126,7 @@ llm = Llama.from_pretrained(
     filename="*Q4_0.gguf",
     n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
     n_ctx=32768,      # Context window size
 )
 ```
@@ -128,6 +140,28 @@ stream = llm.create_chat_completion(
 )
 ```
 ## Notes for AI Agents
 - This is a simple utility project; no formal CI/CD or test suite in root
@@ -135,3 +169,15 @@ stream = llm.create_chat_completion(
 - Always call `llm.reset()` after completion to ensure state isolation
 - Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
 - Default language output is Traditional Chinese (zh-TW) via OpenCC conversion

 ## Project Overview
+Tiny Scribe is a Python CLI tool for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and Traditional Chinese (zh-TW) conversion via OpenCC.
 ## Build / Lint / Test Commands
 ```bash
 ruff check .
 ruff check --select I .  # Import sorting
+ruff format .            # Auto-format code
 ```
 **Type checking (if mypy installed):**
 ```bash
 # No test suite in root project yet
 # Tests exist in llama-cpp-python/tests/ submodule
 cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v
 ```
 **Single test:**
 ```bash
+# Run specific test function
+cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
+# Run with traceback
+cd llama-cpp-python && pytest --full-trace -v
 ```
 ## Code Style Guidelines
 # Standard library first
 import os
 import argparse
+import re
+from typing import Tuple, Optional, Generator
 # Third-party packages
 from llama_cpp import Llama
 **Type Hints:**
 - Use type hints for function parameters and return values
 - Use `Optional[]` for nullable types
+- Use `Generator[str, None, None]` for generator yields
 - Example: `def load_model(repo_id: str, filename: str, cpu_only: bool = False) -> Llama:`
 **Naming Conventions:**
 - Use explicit error messages with f-strings
 - Check file existence before operations
 - Use `try/except` blocks for external API calls (Hugging Face, model loading)
+- Log errors with context for debugging
 ## Dependencies
 │   └── full.txt
 ├── summary.txt                # Generated output
 ├── llama-cpp-python/          # Git submodule
+│   ├── tests/                 # Test suite
+│   │   ├── test_llama.py
+│   │   └── test_llama_grammar.py
 │   └── vendor/llama.cpp/      # Core C++ library
 └── README.md                  # Project documentation
 ```
     filename="*Q4_0.gguf",
     n_gpu_layers=-1,  # -1 for all GPU, 0 for CPU
     n_ctx=32768,      # Context window size
+    verbose=False,    # Cleaner output
 )
 ```
 )
 ```
+**Thinking Block Parsing:**
+```python
+# Extract thinking/reasoning blocks from model output
+THINKING_PATTERN = re.compile(r'<thinking>(.*?)</thinking>', re.DOTALL)
+for chunk in stream:
+    delta = chunk["choices"][0]["delta"]
+    if content := delta.get("content", ""):
+        buffer += content
+        thinking_match = THINKING_PATTERN.search(buffer)
+        if thinking_match:
+            thinking = thinking_match.group(1).strip()
+            buffer = buffer[:thinking_match.start()] + buffer[thinking_match.end():]
+```
+**Chinese Text Conversion:**
+```python
+# Convert Simplified Chinese to Traditional Chinese (Taiwan)
+converter = OpenCC('s2twp')  # s2twp = Simplified to Traditional (Taiwan)
+traditional_text = converter.convert(simplified_text)
+```
 ## Notes for AI Agents
 - This is a simple utility project; no formal CI/CD or test suite in root
 - Always call `llm.reset()` after completion to ensure state isolation
 - Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
 - Default language output is Traditional Chinese (zh-TW) via OpenCC conversion
+- Claude permissions configured in `.claude/settings.local.json` for tool access
+- HuggingFace cache at `~/.cache/huggingface/hub/` - clean periodically
+## Git Submodule Management
+```bash
+# Initialize/update submodules
+git submodule update --init --recursive
+# Update llama-cpp-python to latest
+cd llama-cpp-python && git pull origin main && cd .. && git add llama-cpp-python
+```

DEPLOY.md ADDED Viewed

	@@ -0,0 +1,125 @@

+# HuggingFace Spaces Deployment Guide
+## Quick Start
+### 1. Create Space on HuggingFace
+1. Go to [huggingface.co/spaces](https://huggingface.co/spaces)
+2. Click "Create new Space"
+3. Select:
+   - **Space name**: `tiny-scribe` (or your preferred name)
+   - **SDK**: Docker
+   - **Space hardware**: CPU (Free Tier - 2 vCPUs)
+4. Click "Create Space"
+### 2. Upload Files
+Upload these files to your Space:
+- `app.py` - Main Gradio application
+- `Dockerfile` - Container configuration
+- `requirements.txt` - Python dependencies
+- `README.md` - Space documentation
+- `transcripts/` - Example files (optional)
+Using Git:
+```bash
+git clone https://huggingface.co/spaces/your-username/tiny-scribe
+cd tiny-scribe
+# Copy files from this repo
+git add .
+git commit -m "Initial HF Spaces deployment"
+git push
+```
+### 3. Wait for Build
+The Space will automatically:
+1. Build the Docker container (~2-5 minutes)
+2. Install dependencies (llama-cpp-python wheel is prebuilt)
+3. Start the Gradio app
+### 4. Access Your App
+Once built, visit: `https://your-username-tiny-scribe.hf.space`
+## Configuration
+### Model Selection
+The default model (`unsloth/Qwen3-0.6B-GGUF` Q4_K_M) is optimized for CPU:
+- Small: 0.6B parameters
+- Fast: ~2-5 seconds for short texts
+- Efficient: Uses ~400MB RAM
+To change models, edit `app.py`:
+```python
+DEFAULT_MODEL = "unsloth/Qwen3-1.7B-GGUF"  # Larger model
+DEFAULT_FILENAME = "*Q2_K_L.gguf"  # Lower quantization for speed
+```
+### Performance Tuning
+For Free Tier (2 vCPUs):
+- Keep `n_ctx=4096` (context window)
+- Use `max_tokens=512` (output length)
+- Set `temperature=0.6` (balance creativity/coherence)
+### Environment Variables
+Optional settings in Space Settings:
+```
+MODEL_REPO=unsloth/Qwen3-0.6B-GGUF
+MODEL_FILENAME=*Q4_K_M.gguf
+MAX_TOKENS=512
+TEMPERATURE=0.6
+```
+## Features
+1. **File Upload**: Drag & drop .txt files
+2. **Live Streaming**: Real-time token output
+3. **Traditional Chinese**: Auto-conversion to zh-TW
+4. **Progressive Loading**: Model downloads on first use (~30-60s)
+5. **Responsive UI**: Works on mobile and desktop
+## Troubleshooting
+### Build Fails
+- Check Docker Hub status
+- Verify requirements.txt syntax
+- Ensure no large files in repo
+### Out of Memory
+- Reduce `n_ctx` (context window)
+- Use smaller model (Q2_K quantization)
+- Limit input file size
+### Slow Inference
+- Normal for CPU-only Free Tier
+- First request downloads model (~400MB)
+- Subsequent requests are faster
+## Architecture
+```
+User Upload → Gradio Interface → app.py → llama-cpp-python → Qwen Model
+                                    ↓
+                              OpenCC (s2twp)
+                                    ↓
+                         Streaming Output → User
+```
+## Local Testing
+Before deploying to HF Spaces:
+```bash
+pip install -r requirements.txt
+python app.py
+```
+Then open: http://localhost:7860
+## License
+MIT - See LICENSE file for details.

Dockerfile ADDED Viewed

	@@ -0,0 +1,23 @@

+FROM python:3.10-slim
+WORKDIR /app
+# Install system dependencies (minimal for prebuilt wheels)
+RUN apt-get update && apt-get install -y \
+    libopencc-dev \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements first for better caching
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy application files
+COPY app.py .
+COPY transcripts/ ./transcripts/
+# Pre-download model on build (optional, speeds up first run)
+# RUN python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='unsloth/Qwen3-0.6B-GGUF', filename='Qwen3-0.6B-Q4_K_M.gguf', local_dir='./models')"
+EXPOSE 7860
+CMD ["python", "app.py"]

README.md CHANGED Viewed

@@ -1,37 +1,48 @@
-# Transcript Summarization Script
-This script provides functionality to summarize transcripts using the Falcon-H1-Tiny-Multilingual model with SYCL acceleration. It focuses on live streaming summarization for immediate feedback.
-## Key Features
-### 1. State Isolation
-Each summarization call ensures a clean state by calling `llm.reset()` after each operation. This prevents any carryover from previous summarizations, ensuring consistent and independent results.
-### 2. Live Streaming Summary
-The script implements a live streaming summary feature that generates the summary in real-time, displaying tokens as they are produced by the model. This provides immediate feedback.
-### 3. Multi-language Support
-The script supports both English and Traditional Chinese (zh-TW) summarization.
-## Functions
-### `stream_summarize_transcript(llm, transcript, language='zh-TW')`
-Performs live streaming summary by generating the summary in real-time and displaying tokens as they are produced by the model.
-## Improvements Made
-1. **Streaming-Only Workflow**: Simplified the script to focus on real-time streaming for all summaries.
-2. **State Isolation**: Added `llm.reset()` calls after each summarization to ensure clean state between operations.
-3. **True Live Streaming**: Implemented real-time token streaming using `create_chat_completion` for immediate output display.
-4. **Reduced Verbosity**: Set `verbose=False` for cleaner output during model operations.
-## Usage
-```bash
-python summarize_transcript.py
-```
-The script will:
-1. Load the model.
-2. Generate Chinese and English summaries using live streaming.
-3. Save the summaries to `chinese_summary.txt` and `english_summary.txt`.

+---
+title: Tiny Scribe - Transcript Summarizer
+emoji:
+colorFrom: blue
+colorTo: green
+sdk: docker
+sdk_version: "3.10"
+app_file: app.py
+pinned: false
+license: mit
+---
+# Tiny Scribe
+A lightweight transcript summarization tool powered by local LLMs (Qwen3-0.6B).
+## Features
+- **Live Streaming**: Real-time summary generation with token-by-token output
+- **File Upload**: Upload .txt files to summarize
+- **Traditional Chinese**: Automatic conversion to zh-TW
+- **CPU Optimized**: Runs efficiently on 2 vCPUs (HuggingFace Spaces Free Tier)
+- **Small Model**: Uses Qwen3-0.6B-GGUF (Q4_K_M quantization) for fast inference
+## Usage
+1. Upload a .txt file containing your transcript
+2. Click "Summarize"
+3. Watch the summary appear in real-time!
+## Technical Details
+- **Model**: unsloth/Qwen3-0.6B-GGUF (Q4_K_M quantization)
+- **Context Window**: 4096 tokens
+- **Inference**: CPU-only (llama-cpp-python)
+- **UI**: Gradio with streaming support
+- **Output**: Traditional Chinese (zh-TW) via OpenCC
+## Limitations
+- Max input: ~3KB of text (truncated if exceeded)
+- First load: 30-60 seconds (model download)
+- CPU-only inference (no GPU acceleration on Free Tier)
+## Repository
+[tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)

app.py ADDED Viewed

	@@ -0,0 +1,277 @@

+#!/usr/bin/env python3
+"""
+Tiny Scribe - HuggingFace Spaces Demo
+A Gradio app for summarizing transcripts using GGUF models with live streaming output.
+Optimized for HuggingFace Spaces Free CPU Tier (2 vCPUs).
+"""
+import os
+import re
+import gradio as gr
+from typing import Tuple, Generator
+from llama_cpp import Llama
+from opencc import OpenCC
+import logging
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+# Global model instance (loaded once)
+llm = None
+converter = None
+# Default model optimized for CPU (small, fast)
+DEFAULT_MODEL = "unsloth/Qwen3-0.6B-GGUF"
+DEFAULT_FILENAME = "*Q4_K_M.gguf"  # Good balance of speed/quality
+def load_model():
+    """Load the model once at startup."""
+    global llm, converter
+    if llm is not None:
+        return
+    logger.info(f"Loading model: {DEFAULT_MODEL}")
+    try:
+        # Initialize OpenCC converter for Traditional Chinese (Taiwan)
+        converter = OpenCC('s2twp')
+        # Load model optimized for CPU
+        # n_ctx=4096 is sufficient for most transcripts and uses less memory
+        llm = Llama.from_pretrained(
+            repo_id=DEFAULT_MODEL,
+            filename=DEFAULT_FILENAME,
+            n_gpu_layers=0,  # CPU only for HF Spaces Free Tier
+            n_ctx=4096,      # Reduced context for CPU efficiency
+            verbose=False,   # Cleaner logs
+            seed=1337,
+        )
+        logger.info("Model loaded successfully")
+    except Exception as e:
+        logger.error(f"Error loading model: {e}")
+        raise
+def parse_thinking_blocks(content: str) -> Tuple[str, str]:
+    """
+    Parse thinking blocks from model output.
+    Args:
+        content: Full model response
+    Returns:
+        Tuple of (thinking_content, summary_content)
+    """
+    pattern = r'<thinking>(.*?)</thinking>'
+    matches = re.findall(pattern, content, re.DOTALL)
+    if not matches:
+        return ("", content)
+    thinking = '\n\n'.join(match.strip() for match in matches)
+    summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
+    return (thinking, summary)
+def summarize_streaming(file_obj, max_tokens: int = 512, temperature: float = 0.6) -> Generator[str, None, None]:
+    """
+    Stream summary generation from uploaded file.
+    Args:
+        file_obj: Gradio file object
+        max_tokens: Maximum tokens to generate
+        temperature: Sampling temperature
+    Yields:
+        Partial summary text for streaming display
+    """
+    global llm, converter
+    # Ensure model is loaded
+    if llm is None:
+        load_model()
+    # Read uploaded file
+    try:
+        if hasattr(file_obj, 'name'):
+            # Gradio file object
+            with open(file_obj.name, 'r', encoding='utf-8') as f:
+                transcript = f.read()
+        else:
+            # Direct file path
+            with open(file_obj, 'r', encoding='utf-8') as f:
+                transcript = f.read()
+    except Exception as e:
+        yield f"Error reading file: {str(e)}"
+        return
+    # Validate content
+    if not transcript.strip():
+        yield "Error: File is empty"
+        return
+    # Check length (rough estimate: 4 chars per token)
+    max_chars = 3000  # Leave room for generation
+    if len(transcript) > max_chars:
+        transcript = transcript[:max_chars] + "...\n[Content truncated due to length limits]"
+        yield "Note: Content was truncated to fit model context window.\n\n"
+    # Prepare messages
+    messages = [
+        {"role": "system", "content": "你是一個有助的助手，負責總結轉錄內容。"},
+        {"role": "user", "content": f"請總結以下內容：\n\n{transcript}"}
+    ]
+    # Generate streaming response
+    full_response = ""
+    buffer = ""
+    try:
+        stream = llm.create_chat_completion(
+            messages=messages,
+            max_tokens=max_tokens,
+            temperature=temperature,
+            min_p=0.0,
+            top_p=0.95,
+            top_k=20,
+            stop=["<|end_of_text|>", "<|eot_id|>", "<|eom_id|>"],
+            stream=True
+        )
+        for chunk in stream:
+            if 'choices' in chunk and len(chunk['choices']) > 0:
+                delta = chunk['choices'][0].get('delta', {})
+                content = delta.get('content', '')
+                if content:
+                    # Convert to Traditional Chinese (Taiwan)
+                    converted = converter.convert(content)
+                    buffer += converted
+                    full_response += converted
+                    # Parse and clean thinking blocks for display
+                    thinking, summary = parse_thinking_blocks(buffer)
+                    if summary:
+                        yield summary
+        # Final parse to remove any remaining thinking blocks
+        thinking, final_summary = parse_thinking_blocks(full_response)
+        if final_summary:
+            yield final_summary
+        # Reset model state
+        llm.reset()
+    except Exception as e:
+        logger.error(f"Error during generation: {e}")
+        yield f"\n\nError during generation: {str(e)}"
+# Create Gradio interface
+def create_interface():
+    """Create and configure the Gradio interface."""
+    with gr.Blocks(
+        title="Tiny Scribe - Transcript Summarizer",
+        theme=gr.themes.Soft(),
+        css="""
+        .output-text { font-size: 16px; line-height: 1.6; }
+        .info-text { color: #666; font-size: 14px; }
+        """
+    ) as demo:
+        gr.Markdown("""
+        # Tiny Scribe
+        Summarize your text files (transcripts, notes, documents) with AI.
+        **Features:**
+        - Live streaming output
+        - Traditional Chinese (zh-TW) conversion
+        - Optimized for CPU inference
+        - Supports .txt files
+        """)
+        with gr.Row():
+            with gr.Column(scale=1):
+                # Input section
+                gr.Markdown("### Upload File")
+                file_input = gr.File(
+                    label="Upload .txt file",
+                    file_types=[".txt"],
+                    type="filepath"
+                )
+                with gr.Accordion("Advanced Settings", open=False):
+                    max_tokens = gr.Slider(
+                        minimum=128,
+                        maximum=1024,
+                        value=512,
+                        step=64,
+                        label="Max Tokens"
+                    )
+                    temperature = gr.Slider(
+                        minimum=0.1,
+                        maximum=1.0,
+                        value=0.6,
+                        step=0.1,
+                        label="Temperature"
+                    )
+                submit_btn = gr.Button(
+                    "Summarize",
+                    variant="primary",
+                    size="lg"
+                )
+                gr.Markdown("""
+                <div class="info-text">
+                <strong>Note:</strong> First load may take 30-60 seconds as the model downloads.
+                <br>Max file size: ~3KB of text (context window limit).
+                </div>
+                """)
+            with gr.Column(scale=2):
+                # Output section
+                gr.Markdown("### Summary Output")
+                output = gr.Markdown(
+                    label="Summary",
+                    elem_classes=["output-text"]
+                )
+        # Event handlers
+        submit_btn.click(
+            fn=summarize_streaming,
+            inputs=[file_input, max_tokens, temperature],
+            outputs=output,
+            show_progress=True
+        )
+        # Note: File upload examples don't work well in HF Spaces UI
+        # Users can upload their own .txt files
+    return demo
+# Main entry point
+if __name__ == "__main__":
+    # Pre-load model on startup
+    try:
+        load_model()
+    except Exception as e:
+        logger.error(f"Failed to pre-load model: {e}")
+        logger.info("Model will be loaded on first request")
+    # Create and launch interface
+    demo = create_interface()
+    demo.launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        share=False,
+        show_error=True
+    )

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+gradio>=5.0.0
+opencc-python-reimplemented>=0.1.7
+huggingface-hub>=0.23.0
+llama-cpp-python>=0.3.0