Luigi commited on
Commit
10d339c
·
1 Parent(s): f175554

Add HuggingFace Spaces demo with Gradio UI

Browse files

- Create app.py with Gradio interface for file upload and streaming output
- Add Dockerfile for containerized deployment
- Add requirements.txt with prebuilt llama-cpp-python
- Create DEPLOY.md with deployment instructions
- Update README.md for HF Spaces documentation
- Add .gitignore and .gitattributes for proper git handling

Features:
- Live streaming summary generation
- File upload support (.txt files)
- CPU-optimized for HF Spaces Free Tier (2 vCPUs)
- Traditional Chinese (zh-TW) conversion
- Uses Qwen3-0.6B-GGUF model

Files changed (8) hide show
  1. .gitattributes +7 -0
  2. .gitignore +55 -0
  3. AGENTS.md +49 -3
  4. DEPLOY.md +125 -0
  5. Dockerfile +23 -0
  6. README.md +36 -25
  7. app.py +277 -0
  8. requirements.txt +4 -0
.gitattributes ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ *.gguf filter=lfs diff=lfs merge=lfs -text
2
+ *.bin filter=lfs diff=lfs merge=lfs -text
3
+ *.pt filter=lfs diff=lfs merge=lfs -text
4
+ *.pth filter=lfs diff=lfs merge=lfs -text
5
+ *.h5 filter=lfs diff=lfs merge=lfs -text
6
+ *.onnx filter=lfs diff=lfs merge=lfs -text
7
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual environments
24
+ venv/
25
+ ENV/
26
+ env/
27
+ .venv
28
+
29
+ # IDE
30
+ .vscode/
31
+ .idea/
32
+ *.swp
33
+ *.swo
34
+ *~
35
+
36
+ # Model files (large files)
37
+ *.gguf
38
+ *.bin
39
+ models/
40
+ ~/.cache/
41
+
42
+ # Generated outputs
43
+ summary.txt
44
+ thinking.txt
45
+
46
+ # Gradio
47
+ .gradio/
48
+ flagged/
49
+
50
+ # OS
51
+ .DS_Store
52
+ Thumbs.db
53
+
54
+ # Logs
55
+ *.log
AGENTS.md CHANGED
@@ -2,7 +2,7 @@
2
 
3
  ## Project Overview
4
 
5
- Tiny Scribe is a Python CLI tool for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen) with llama-cpp-python. It supports live streaming output and Traditional Chinese (zh-TW) conversion via OpenCC.
6
 
7
  ## Build / Lint / Test Commands
8
 
@@ -17,6 +17,7 @@ python summarize_transcript.py -c # CPU only
17
  ```bash
18
  ruff check .
19
  ruff check --select I . # Import sorting
 
20
  ```
21
 
22
  **Type checking (if mypy installed):**
@@ -28,13 +29,16 @@ mypy summarize_transcript.py
28
  ```bash
29
  # No test suite in root project yet
30
  # Tests exist in llama-cpp-python/tests/ submodule
31
- # To test llama-cpp-python:
32
  cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v
33
  ```
34
 
35
  **Single test:**
36
  ```bash
37
- pytest tests/test_llama.py::test_function_name -v
 
 
 
 
38
  ```
39
 
40
  ## Code Style Guidelines
@@ -51,6 +55,8 @@ pytest tests/test_llama.py::test_function_name -v
51
  # Standard library first
52
  import os
53
  import argparse
 
 
54
 
55
  # Third-party packages
56
  from llama_cpp import Llama
@@ -61,6 +67,7 @@ from opencc import OpenCC
61
  **Type Hints:**
62
  - Use type hints for function parameters and return values
63
  - Use `Optional[]` for nullable types
 
64
  - Example: `def load_model(repo_id: str, filename: str, cpu_only: bool = False) -> Llama:`
65
 
66
  **Naming Conventions:**
@@ -78,6 +85,7 @@ from opencc import OpenCC
78
  - Use explicit error messages with f-strings
79
  - Check file existence before operations
80
  - Use `try/except` blocks for external API calls (Hugging Face, model loading)
 
81
 
82
  ## Dependencies
83
 
@@ -102,6 +110,9 @@ tiny-scribe/
102
  │ └── full.txt
103
  ├── summary.txt # Generated output
104
  ├── llama-cpp-python/ # Git submodule
 
 
 
105
  │ └── vendor/llama.cpp/ # Core C++ library
106
  └── README.md # Project documentation
107
  ```
@@ -115,6 +126,7 @@ llm = Llama.from_pretrained(
115
  filename="*Q4_0.gguf",
116
  n_gpu_layers=-1, # -1 for all GPU, 0 for CPU
117
  n_ctx=32768, # Context window size
 
118
  )
119
  ```
120
 
@@ -128,6 +140,28 @@ stream = llm.create_chat_completion(
128
  )
129
  ```
130
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
131
  ## Notes for AI Agents
132
 
133
  - This is a simple utility project; no formal CI/CD or test suite in root
@@ -135,3 +169,15 @@ stream = llm.create_chat_completion(
135
  - Always call `llm.reset()` after completion to ensure state isolation
136
  - Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
137
  - Default language output is Traditional Chinese (zh-TW) via OpenCC conversion
 
 
 
 
 
 
 
 
 
 
 
 
 
2
 
3
  ## Project Overview
4
 
5
+ Tiny Scribe is a Python CLI tool for summarizing transcripts using GGUF models (e.g., ERNIE, Qwen, Granite) with llama-cpp-python. It supports live streaming output and Traditional Chinese (zh-TW) conversion via OpenCC.
6
 
7
  ## Build / Lint / Test Commands
8
 
 
17
  ```bash
18
  ruff check .
19
  ruff check --select I . # Import sorting
20
+ ruff format . # Auto-format code
21
  ```
22
 
23
  **Type checking (if mypy installed):**
 
29
  ```bash
30
  # No test suite in root project yet
31
  # Tests exist in llama-cpp-python/tests/ submodule
 
32
  cd llama-cpp-python && pip install ".[test]" && pytest tests/test_llama.py -v
33
  ```
34
 
35
  **Single test:**
36
  ```bash
37
+ # Run specific test function
38
+ cd llama-cpp-python && pytest tests/test_llama.py::test_function_name -v
39
+
40
+ # Run with traceback
41
+ cd llama-cpp-python && pytest --full-trace -v
42
  ```
43
 
44
  ## Code Style Guidelines
 
55
  # Standard library first
56
  import os
57
  import argparse
58
+ import re
59
+ from typing import Tuple, Optional, Generator
60
 
61
  # Third-party packages
62
  from llama_cpp import Llama
 
67
  **Type Hints:**
68
  - Use type hints for function parameters and return values
69
  - Use `Optional[]` for nullable types
70
+ - Use `Generator[str, None, None]` for generator yields
71
  - Example: `def load_model(repo_id: str, filename: str, cpu_only: bool = False) -> Llama:`
72
 
73
  **Naming Conventions:**
 
85
  - Use explicit error messages with f-strings
86
  - Check file existence before operations
87
  - Use `try/except` blocks for external API calls (Hugging Face, model loading)
88
+ - Log errors with context for debugging
89
 
90
  ## Dependencies
91
 
 
110
  │ └── full.txt
111
  ├── summary.txt # Generated output
112
  ├── llama-cpp-python/ # Git submodule
113
+ │ ├── tests/ # Test suite
114
+ │ │ ├── test_llama.py
115
+ │ │ └── test_llama_grammar.py
116
  │ └── vendor/llama.cpp/ # Core C++ library
117
  └── README.md # Project documentation
118
  ```
 
126
  filename="*Q4_0.gguf",
127
  n_gpu_layers=-1, # -1 for all GPU, 0 for CPU
128
  n_ctx=32768, # Context window size
129
+ verbose=False, # Cleaner output
130
  )
131
  ```
132
 
 
140
  )
141
  ```
142
 
143
+ **Thinking Block Parsing:**
144
+ ```python
145
+ # Extract thinking/reasoning blocks from model output
146
+ THINKING_PATTERN = re.compile(r'<thinking>(.*?)</thinking>', re.DOTALL)
147
+
148
+ for chunk in stream:
149
+ delta = chunk["choices"][0]["delta"]
150
+ if content := delta.get("content", ""):
151
+ buffer += content
152
+ thinking_match = THINKING_PATTERN.search(buffer)
153
+ if thinking_match:
154
+ thinking = thinking_match.group(1).strip()
155
+ buffer = buffer[:thinking_match.start()] + buffer[thinking_match.end():]
156
+ ```
157
+
158
+ **Chinese Text Conversion:**
159
+ ```python
160
+ # Convert Simplified Chinese to Traditional Chinese (Taiwan)
161
+ converter = OpenCC('s2twp') # s2twp = Simplified to Traditional (Taiwan)
162
+ traditional_text = converter.convert(simplified_text)
163
+ ```
164
+
165
  ## Notes for AI Agents
166
 
167
  - This is a simple utility project; no formal CI/CD or test suite in root
 
169
  - Always call `llm.reset()` after completion to ensure state isolation
170
  - Model format: `repo_id:quant` (e.g., `unsloth/Qwen3-1.7B-GGUF:Q2_K_L`)
171
  - Default language output is Traditional Chinese (zh-TW) via OpenCC conversion
172
+ - Claude permissions configured in `.claude/settings.local.json` for tool access
173
+ - HuggingFace cache at `~/.cache/huggingface/hub/` - clean periodically
174
+
175
+ ## Git Submodule Management
176
+
177
+ ```bash
178
+ # Initialize/update submodules
179
+ git submodule update --init --recursive
180
+
181
+ # Update llama-cpp-python to latest
182
+ cd llama-cpp-python && git pull origin main && cd .. && git add llama-cpp-python
183
+ ```
DEPLOY.md ADDED
@@ -0,0 +1,125 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # HuggingFace Spaces Deployment Guide
2
+
3
+ ## Quick Start
4
+
5
+ ### 1. Create Space on HuggingFace
6
+
7
+ 1. Go to [huggingface.co/spaces](https://huggingface.co/spaces)
8
+ 2. Click "Create new Space"
9
+ 3. Select:
10
+ - **Space name**: `tiny-scribe` (or your preferred name)
11
+ - **SDK**: Docker
12
+ - **Space hardware**: CPU (Free Tier - 2 vCPUs)
13
+ 4. Click "Create Space"
14
+
15
+ ### 2. Upload Files
16
+
17
+ Upload these files to your Space:
18
+ - `app.py` - Main Gradio application
19
+ - `Dockerfile` - Container configuration
20
+ - `requirements.txt` - Python dependencies
21
+ - `README.md` - Space documentation
22
+ - `transcripts/` - Example files (optional)
23
+
24
+ Using Git:
25
+ ```bash
26
+ git clone https://huggingface.co/spaces/your-username/tiny-scribe
27
+ cd tiny-scribe
28
+ # Copy files from this repo
29
+ git add .
30
+ git commit -m "Initial HF Spaces deployment"
31
+ git push
32
+ ```
33
+
34
+ ### 3. Wait for Build
35
+
36
+ The Space will automatically:
37
+ 1. Build the Docker container (~2-5 minutes)
38
+ 2. Install dependencies (llama-cpp-python wheel is prebuilt)
39
+ 3. Start the Gradio app
40
+
41
+ ### 4. Access Your App
42
+
43
+ Once built, visit: `https://your-username-tiny-scribe.hf.space`
44
+
45
+ ## Configuration
46
+
47
+ ### Model Selection
48
+
49
+ The default model (`unsloth/Qwen3-0.6B-GGUF` Q4_K_M) is optimized for CPU:
50
+ - Small: 0.6B parameters
51
+ - Fast: ~2-5 seconds for short texts
52
+ - Efficient: Uses ~400MB RAM
53
+
54
+ To change models, edit `app.py`:
55
+ ```python
56
+ DEFAULT_MODEL = "unsloth/Qwen3-1.7B-GGUF" # Larger model
57
+ DEFAULT_FILENAME = "*Q2_K_L.gguf" # Lower quantization for speed
58
+ ```
59
+
60
+ ### Performance Tuning
61
+
62
+ For Free Tier (2 vCPUs):
63
+ - Keep `n_ctx=4096` (context window)
64
+ - Use `max_tokens=512` (output length)
65
+ - Set `temperature=0.6` (balance creativity/coherence)
66
+
67
+ ### Environment Variables
68
+
69
+ Optional settings in Space Settings:
70
+ ```
71
+ MODEL_REPO=unsloth/Qwen3-0.6B-GGUF
72
+ MODEL_FILENAME=*Q4_K_M.gguf
73
+ MAX_TOKENS=512
74
+ TEMPERATURE=0.6
75
+ ```
76
+
77
+ ## Features
78
+
79
+ 1. **File Upload**: Drag & drop .txt files
80
+ 2. **Live Streaming**: Real-time token output
81
+ 3. **Traditional Chinese**: Auto-conversion to zh-TW
82
+ 4. **Progressive Loading**: Model downloads on first use (~30-60s)
83
+ 5. **Responsive UI**: Works on mobile and desktop
84
+
85
+ ## Troubleshooting
86
+
87
+ ### Build Fails
88
+ - Check Docker Hub status
89
+ - Verify requirements.txt syntax
90
+ - Ensure no large files in repo
91
+
92
+ ### Out of Memory
93
+ - Reduce `n_ctx` (context window)
94
+ - Use smaller model (Q2_K quantization)
95
+ - Limit input file size
96
+
97
+ ### Slow Inference
98
+ - Normal for CPU-only Free Tier
99
+ - First request downloads model (~400MB)
100
+ - Subsequent requests are faster
101
+
102
+ ## Architecture
103
+
104
+ ```
105
+ User Upload → Gradio Interface → app.py → llama-cpp-python → Qwen Model
106
+
107
+ OpenCC (s2twp)
108
+
109
+ Streaming Output → User
110
+ ```
111
+
112
+ ## Local Testing
113
+
114
+ Before deploying to HF Spaces:
115
+
116
+ ```bash
117
+ pip install -r requirements.txt
118
+ python app.py
119
+ ```
120
+
121
+ Then open: http://localhost:7860
122
+
123
+ ## License
124
+
125
+ MIT - See LICENSE file for details.
Dockerfile ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.10-slim
2
+
3
+ WORKDIR /app
4
+
5
+ # Install system dependencies (minimal for prebuilt wheels)
6
+ RUN apt-get update && apt-get install -y \
7
+ libopencc-dev \
8
+ && rm -rf /var/lib/apt/lists/*
9
+
10
+ # Copy requirements first for better caching
11
+ COPY requirements.txt .
12
+ RUN pip install --no-cache-dir -r requirements.txt
13
+
14
+ # Copy application files
15
+ COPY app.py .
16
+ COPY transcripts/ ./transcripts/
17
+
18
+ # Pre-download model on build (optional, speeds up first run)
19
+ # RUN python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='unsloth/Qwen3-0.6B-GGUF', filename='Qwen3-0.6B-Q4_K_M.gguf', local_dir='./models')"
20
+
21
+ EXPOSE 7860
22
+
23
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,37 +1,48 @@
1
- # Transcript Summarization Script
 
 
2
 
3
- This script provides functionality to summarize transcripts using the Falcon-H1-Tiny-Multilingual model with SYCL acceleration. It focuses on live streaming summarization for immediate feedback.
 
 
 
 
 
 
 
4
 
5
- ## Key Features
6
 
7
- ### 1. State Isolation
8
- Each summarization call ensures a clean state by calling `llm.reset()` after each operation. This prevents any carryover from previous summarizations, ensuring consistent and independent results.
9
 
10
- ### 2. Live Streaming Summary
11
- The script implements a live streaming summary feature that generates the summary in real-time, displaying tokens as they are produced by the model. This provides immediate feedback.
12
 
13
- ### 3. Multi-language Support
14
- The script supports both English and Traditional Chinese (zh-TW) summarization.
 
 
 
15
 
16
- ## Functions
17
 
18
- ### `stream_summarize_transcript(llm, transcript, language='zh-TW')`
19
- Performs live streaming summary by generating the summary in real-time and displaying tokens as they are produced by the model.
 
20
 
21
- ## Improvements Made
22
 
23
- 1. **Streaming-Only Workflow**: Simplified the script to focus on real-time streaming for all summaries.
24
- 2. **State Isolation**: Added `llm.reset()` calls after each summarization to ensure clean state between operations.
25
- 3. **True Live Streaming**: Implemented real-time token streaming using `create_chat_completion` for immediate output display.
26
- 4. **Reduced Verbosity**: Set `verbose=False` for cleaner output during model operations.
 
27
 
28
- ## Usage
 
 
 
 
29
 
30
- ```bash
31
- python summarize_transcript.py
32
- ```
33
 
34
- The script will:
35
- 1. Load the model.
36
- 2. Generate Chinese and English summaries using live streaming.
37
- 3. Save the summaries to `chinese_summary.txt` and `english_summary.txt`.
 
1
+ ---
2
+ title: Tiny Scribe - Transcript Summarizer
3
+ emoji:
4
 
5
+ colorFrom: blue
6
+ colorTo: green
7
+ sdk: docker
8
+ sdk_version: "3.10"
9
+ app_file: app.py
10
+ pinned: false
11
+ license: mit
12
+ ---
13
 
14
+ # Tiny Scribe
15
 
16
+ A lightweight transcript summarization tool powered by local LLMs (Qwen3-0.6B).
 
17
 
18
+ ## Features
 
19
 
20
+ - **Live Streaming**: Real-time summary generation with token-by-token output
21
+ - **File Upload**: Upload .txt files to summarize
22
+ - **Traditional Chinese**: Automatic conversion to zh-TW
23
+ - **CPU Optimized**: Runs efficiently on 2 vCPUs (HuggingFace Spaces Free Tier)
24
+ - **Small Model**: Uses Qwen3-0.6B-GGUF (Q4_K_M quantization) for fast inference
25
 
26
+ ## Usage
27
 
28
+ 1. Upload a .txt file containing your transcript
29
+ 2. Click "Summarize"
30
+ 3. Watch the summary appear in real-time!
31
 
32
+ ## Technical Details
33
 
34
+ - **Model**: unsloth/Qwen3-0.6B-GGUF (Q4_K_M quantization)
35
+ - **Context Window**: 4096 tokens
36
+ - **Inference**: CPU-only (llama-cpp-python)
37
+ - **UI**: Gradio with streaming support
38
+ - **Output**: Traditional Chinese (zh-TW) via OpenCC
39
 
40
+ ## Limitations
41
+
42
+ - Max input: ~3KB of text (truncated if exceeded)
43
+ - First load: 30-60 seconds (model download)
44
+ - CPU-only inference (no GPU acceleration on Free Tier)
45
 
46
+ ## Repository
 
 
47
 
48
+ [tiny-scribe](https://huggingface.co/spaces/your-username/tiny-scribe)
 
 
 
app.py ADDED
@@ -0,0 +1,277 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Tiny Scribe - HuggingFace Spaces Demo
4
+ A Gradio app for summarizing transcripts using GGUF models with live streaming output.
5
+ Optimized for HuggingFace Spaces Free CPU Tier (2 vCPUs).
6
+ """
7
+
8
+ import os
9
+ import re
10
+ import gradio as gr
11
+ from typing import Tuple, Generator
12
+ from llama_cpp import Llama
13
+ from opencc import OpenCC
14
+ import logging
15
+
16
+ # Configure logging
17
+ logging.basicConfig(level=logging.INFO)
18
+ logger = logging.getLogger(__name__)
19
+
20
+ # Global model instance (loaded once)
21
+ llm = None
22
+ converter = None
23
+
24
+ # Default model optimized for CPU (small, fast)
25
+ DEFAULT_MODEL = "unsloth/Qwen3-0.6B-GGUF"
26
+ DEFAULT_FILENAME = "*Q4_K_M.gguf" # Good balance of speed/quality
27
+
28
+
29
+ def load_model():
30
+ """Load the model once at startup."""
31
+ global llm, converter
32
+
33
+ if llm is not None:
34
+ return
35
+
36
+ logger.info(f"Loading model: {DEFAULT_MODEL}")
37
+
38
+ try:
39
+ # Initialize OpenCC converter for Traditional Chinese (Taiwan)
40
+ converter = OpenCC('s2twp')
41
+
42
+ # Load model optimized for CPU
43
+ # n_ctx=4096 is sufficient for most transcripts and uses less memory
44
+ llm = Llama.from_pretrained(
45
+ repo_id=DEFAULT_MODEL,
46
+ filename=DEFAULT_FILENAME,
47
+ n_gpu_layers=0, # CPU only for HF Spaces Free Tier
48
+ n_ctx=4096, # Reduced context for CPU efficiency
49
+ verbose=False, # Cleaner logs
50
+ seed=1337,
51
+ )
52
+
53
+ logger.info("Model loaded successfully")
54
+ except Exception as e:
55
+ logger.error(f"Error loading model: {e}")
56
+ raise
57
+
58
+
59
+ def parse_thinking_blocks(content: str) -> Tuple[str, str]:
60
+ """
61
+ Parse thinking blocks from model output.
62
+
63
+ Args:
64
+ content: Full model response
65
+
66
+ Returns:
67
+ Tuple of (thinking_content, summary_content)
68
+ """
69
+ pattern = r'<thinking>(.*?)</thinking>'
70
+ matches = re.findall(pattern, content, re.DOTALL)
71
+
72
+ if not matches:
73
+ return ("", content)
74
+
75
+ thinking = '\n\n'.join(match.strip() for match in matches)
76
+ summary = re.sub(pattern, '', content, flags=re.DOTALL).strip()
77
+
78
+ return (thinking, summary)
79
+
80
+
81
+ def summarize_streaming(file_obj, max_tokens: int = 512, temperature: float = 0.6) -> Generator[str, None, None]:
82
+ """
83
+ Stream summary generation from uploaded file.
84
+
85
+ Args:
86
+ file_obj: Gradio file object
87
+ max_tokens: Maximum tokens to generate
88
+ temperature: Sampling temperature
89
+
90
+ Yields:
91
+ Partial summary text for streaming display
92
+ """
93
+ global llm, converter
94
+
95
+ # Ensure model is loaded
96
+ if llm is None:
97
+ load_model()
98
+
99
+ # Read uploaded file
100
+ try:
101
+ if hasattr(file_obj, 'name'):
102
+ # Gradio file object
103
+ with open(file_obj.name, 'r', encoding='utf-8') as f:
104
+ transcript = f.read()
105
+ else:
106
+ # Direct file path
107
+ with open(file_obj, 'r', encoding='utf-8') as f:
108
+ transcript = f.read()
109
+ except Exception as e:
110
+ yield f"Error reading file: {str(e)}"
111
+ return
112
+
113
+ # Validate content
114
+ if not transcript.strip():
115
+ yield "Error: File is empty"
116
+ return
117
+
118
+ # Check length (rough estimate: 4 chars per token)
119
+ max_chars = 3000 # Leave room for generation
120
+ if len(transcript) > max_chars:
121
+ transcript = transcript[:max_chars] + "...\n[Content truncated due to length limits]"
122
+ yield "Note: Content was truncated to fit model context window.\n\n"
123
+
124
+ # Prepare messages
125
+ messages = [
126
+ {"role": "system", "content": "你是一個有助的助手,負責總結轉錄內容。"},
127
+ {"role": "user", "content": f"請總結以下內容:\n\n{transcript}"}
128
+ ]
129
+
130
+ # Generate streaming response
131
+ full_response = ""
132
+ buffer = ""
133
+
134
+ try:
135
+ stream = llm.create_chat_completion(
136
+ messages=messages,
137
+ max_tokens=max_tokens,
138
+ temperature=temperature,
139
+ min_p=0.0,
140
+ top_p=0.95,
141
+ top_k=20,
142
+ stop=["<|end_of_text|>", "<|eot_id|>", "<|eom_id|>"],
143
+ stream=True
144
+ )
145
+
146
+ for chunk in stream:
147
+ if 'choices' in chunk and len(chunk['choices']) > 0:
148
+ delta = chunk['choices'][0].get('delta', {})
149
+ content = delta.get('content', '')
150
+ if content:
151
+ # Convert to Traditional Chinese (Taiwan)
152
+ converted = converter.convert(content)
153
+ buffer += converted
154
+ full_response += converted
155
+
156
+ # Parse and clean thinking blocks for display
157
+ thinking, summary = parse_thinking_blocks(buffer)
158
+ if summary:
159
+ yield summary
160
+
161
+ # Final parse to remove any remaining thinking blocks
162
+ thinking, final_summary = parse_thinking_blocks(full_response)
163
+ if final_summary:
164
+ yield final_summary
165
+
166
+ # Reset model state
167
+ llm.reset()
168
+
169
+ except Exception as e:
170
+ logger.error(f"Error during generation: {e}")
171
+ yield f"\n\nError during generation: {str(e)}"
172
+
173
+
174
+ # Create Gradio interface
175
+ def create_interface():
176
+ """Create and configure the Gradio interface."""
177
+
178
+ with gr.Blocks(
179
+ title="Tiny Scribe - Transcript Summarizer",
180
+ theme=gr.themes.Soft(),
181
+ css="""
182
+ .output-text { font-size: 16px; line-height: 1.6; }
183
+ .info-text { color: #666; font-size: 14px; }
184
+ """
185
+ ) as demo:
186
+
187
+ gr.Markdown("""
188
+ # Tiny Scribe
189
+
190
+ Summarize your text files (transcripts, notes, documents) with AI.
191
+
192
+ **Features:**
193
+ - Live streaming output
194
+ - Traditional Chinese (zh-TW) conversion
195
+ - Optimized for CPU inference
196
+ - Supports .txt files
197
+ """)
198
+
199
+ with gr.Row():
200
+ with gr.Column(scale=1):
201
+ # Input section
202
+ gr.Markdown("### Upload File")
203
+ file_input = gr.File(
204
+ label="Upload .txt file",
205
+ file_types=[".txt"],
206
+ type="filepath"
207
+ )
208
+
209
+ with gr.Accordion("Advanced Settings", open=False):
210
+ max_tokens = gr.Slider(
211
+ minimum=128,
212
+ maximum=1024,
213
+ value=512,
214
+ step=64,
215
+ label="Max Tokens"
216
+ )
217
+ temperature = gr.Slider(
218
+ minimum=0.1,
219
+ maximum=1.0,
220
+ value=0.6,
221
+ step=0.1,
222
+ label="Temperature"
223
+ )
224
+
225
+ submit_btn = gr.Button(
226
+ "Summarize",
227
+ variant="primary",
228
+ size="lg"
229
+ )
230
+
231
+ gr.Markdown("""
232
+ <div class="info-text">
233
+ <strong>Note:</strong> First load may take 30-60 seconds as the model downloads.
234
+ <br>Max file size: ~3KB of text (context window limit).
235
+ </div>
236
+ """)
237
+
238
+ with gr.Column(scale=2):
239
+ # Output section
240
+ gr.Markdown("### Summary Output")
241
+ output = gr.Markdown(
242
+ label="Summary",
243
+ elem_classes=["output-text"]
244
+ )
245
+
246
+ # Event handlers
247
+ submit_btn.click(
248
+ fn=summarize_streaming,
249
+ inputs=[file_input, max_tokens, temperature],
250
+ outputs=output,
251
+ show_progress=True
252
+ )
253
+
254
+ # Note: File upload examples don't work well in HF Spaces UI
255
+ # Users can upload their own .txt files
256
+
257
+ return demo
258
+
259
+
260
+ # Main entry point
261
+ if __name__ == "__main__":
262
+ # Pre-load model on startup
263
+ try:
264
+ load_model()
265
+ except Exception as e:
266
+ logger.error(f"Failed to pre-load model: {e}")
267
+ logger.info("Model will be loaded on first request")
268
+
269
+ # Create and launch interface
270
+ demo = create_interface()
271
+
272
+ demo.launch(
273
+ server_name="0.0.0.0",
274
+ server_port=7860,
275
+ share=False,
276
+ show_error=True
277
+ )
requirements.txt ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ gradio>=5.0.0
2
+ opencc-python-reimplemented>=0.1.7
3
+ huggingface-hub>=0.23.0
4
+ llama-cpp-python>=0.3.0