---
title: SmartScribe
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.9.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Transcription, Summarization & Translation
---
<div align="center">

# SmartScribe
[Python](https://www.python.org/downloads/)
[Whisper](https://openai.com/research/whisper)
[faster-whisper](https://github.com/guillaumekln/faster-whisper)
[HuggingFace Space](https://huggingface.co/spaces/itsasutosha/SmartScribe)
[Llama](https://www.llama.com/)
[Gradio](https://gradio.app/)
**AI-Powered Audio Transcription, Meeting Minutes Generation, and Multi-Language Translation**
</div>
<div align="center">
<h2>📋 Table of Contents</h2>
<table>
<tr>
<td><a href="#features">✨ Features</a></td>
<td><a href="#supported-models">🤖 Supported Models</a></td>
<td><a href="#requirements">📦 Requirements</a></td>
<td><a href="#installation">🔧 Installation</a></td>
</tr>
<tr>
<td><a href="#configuration">⚙️ Configuration</a></td>
<td><a href="#usage">🎮 Usage</a></td>
<td><a href="#architecture">🏗️ Architecture</a></td>
<td><a href="#troubleshooting">🐛 Troubleshooting</a></td>
</tr>
</table>
</div>
---
## ✨ Features
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
<div>
### 🎙️ Audio/Video Transcription
- Convert YouTube links or local audio/video files to text
- Support for multiple audio formats (MP3, WAV, M4A, etc.)
- GPU-accelerated transcription using Faster-Whisper
- Timestamped transcription output
### 🌍 Multi-Language Translation
- Translate transcriptions into any supported language
- Language validation using pycountry
- Clean, paragraph-formatted output
- Preserves original meaning and tone
### 🖥️ Interactive Web UI
- Beautiful Gradio interface
- Drag-and-drop file upload
- YouTube link support
- Side-by-side input and output panels
- Model selection dropdown
- Real-time streaming responses
</div>
<div>
### 📝 Minutes of Meeting Generation
- Automatically generate structured MOM documents
- Professional summary with participants and date
- Key discussion points extraction
- Takeaways and conclusions identification
- Actionable items with clear ownership and deadlines
- Markdown-formatted output
### 🤖 Multi-Model Support
- LLAMA 3.2 3B Instruct
- PHI 4 Mini Instruct
- QWEN 3 4B Instruct
- DeepSeek R1 Distill Qwen 1.5B
- Google Gemma 3 4B IT
### ⚡ Performance Optimization
- 4-bit quantization for efficient inference
- GPU acceleration support
- Memory-efficient model loading
- Garbage collection and cache clearing
</div>
</div>
---
## 🤖 Supported Models
| Model | Provider | Size | Speed | Quality | Best For |
|-------|----------|------|-------|---------|----------|
| LLAMA | Meta | 3B | ⚡⚡ | ⭐⭐⭐⭐ | Balanced |
| PHI | Microsoft | 3.8B | ⚡⚡ | ⭐⭐⭐⭐ | General |
| QWEN | Alibaba | 4B | ⚡⚡⚡ | ⭐⭐⭐⭐ | Fast |
| DEEPSEEK | DeepSeek | 1.5B | ⚡⚡⚡ | ⭐⭐⭐ | Minimal Resources |
| Gemma | Google | 4B | ⚡⚡⚡ | ⭐⭐⭐⭐ | Efficient |
---
## 📦 Requirements
### System Requirements
- **Python 3.8+**
- **CUDA-capable GPU** (recommended for transcription)
- **8GB+ RAM**
- **FFmpeg** for audio processing
### Python Dependencies
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.30.0
faster-whisper>=0.10.0
yt-dlp>=2023.0.0
pydub>=0.25.0
bitsandbytes>=0.41.0
accelerate>=0.20.0
pycountry>=23.0.0
huggingface-hub>=0.16.0
```
---
## 🔧 Local Installation
### 1. Create Virtual Environment
```bash
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
```
### 2. Install Dependencies
```bash
pip install -r requirements.txt
```
### 3. Setup HuggingFace Token
Create a `.env` file in the project root:
```env
HF_TOKEN=your_huggingface_token_here
```
Get your token from [HuggingFace Settings](https://huggingface.co/settings/tokens)
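If you prefer not to add a dependency for reading the `.env` file, it can be parsed with a few lines of standard-library code. The `read_env` helper below is a hypothetical sketch, not part of `app.py`:

```python
import os

def read_env(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines from a .env file (hypothetical helper)."""
    env = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                # Skip blanks and comments; split on the first '='
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    except FileNotFoundError:
        pass
    return env

# Export parsed values so huggingface_hub can pick up HF_TOKEN
os.environ.update(read_env())
```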
### 4. Setup YouTube Cookies (Optional)
For YouTube link support, set environment variable or create `cookies.txt`:
```bash
export YOUTUBE_COOKIES="your_cookies_content"
```
Or create `cookies.txt` with Netscape HTTP Cookie format.
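At startup, the environment variable can be materialized into a cookie file that yt-dlp reads through its `cookiefile` option. A minimal sketch (the helper name is an assumption, not a function in `app.py`):

```python
import os

def write_cookie_file(path: str = "cookies.txt") -> bool:
    """Write YOUTUBE_COOKIES env content to a cookie file; True if written."""
    cookies = os.getenv("YOUTUBE_COOKIES")
    if not cookies:
        return False
    with open(path, "w") as f:
        f.write(cookies)
    return True

# yt-dlp can then be pointed at the file, e.g.:
# ydl_opts = {"cookiefile": "cookies.txt", "format": "bestaudio"}
```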
---
## ⚙️ Configuration
### Model Selection
Edit model paths in `app.py`:
```python
LLAMA = "meta-llama/Llama-3.2-3B-Instruct"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
PHI = "microsoft/Phi-4-mini-instruct"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
Gemma = "google/gemma-3-4b-it"
```
### Quantization Configuration
```python
quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type='nf4'
)
```
### Server Configuration
```python
ui.launch(server_name="0.0.0.0", server_port=7860)
```
---
## ☁️ Deployment
### HuggingFace Spaces
SmartScribe is deployed and available at: [https://huggingface.co/spaces/itsasutosha/SmartScribe](https://huggingface.co/spaces/itsasutosha/SmartScribe)
**Features:**
- ✅ Free to use
- ✅ No installation needed
- ✅ GPU-accelerated inference
- ✅ Persistent storage for temporary files
- ✅ Real-time streaming output
**To Deploy Your Own:**
1. Create a HuggingFace account at [huggingface.co](https://huggingface.co)
2. Create a new Space
3. Select "Gradio" as the framework
4. Upload your repository files
5. Add secrets in Space settings:
- `HF_TOKEN`: Your HuggingFace token
- `YOUTUBE_COOKIES`: (Optional) YouTube authentication cookies
6. Space will automatically build and deploy
---
## 🎮 Usage
### Quick Start - Live Demo
#### 🌐 Try Online
Visit the live application at: **[SmartScribe on HuggingFace Spaces](https://huggingface.co/spaces/itsasutosha/SmartScribe)**
No installation required! Just upload your audio/video or paste a YouTube link.
#### 1. Launch Application (Local Setup)
```bash
python app.py
```
The server binds to `0.0.0.0:7860`; open `http://localhost:7860` in your browser.
#### 2. Using the Web UI
1. **Upload Content**:
- Upload audio/video file directly, OR
- Paste YouTube link
2. **Choose Operation**:
- Click "Transcribe" to extract text from audio
- Click "Summarize" to generate Minutes of Meeting
- Click "Translate" for multi-language translation
3. **Select Model**: Choose preferred LLM from dropdown
4. **View Results**: See output in corresponding text areas
### Programmatic Usage
#### Transcribe Audio
```python
from app import transcription_whisper
formatted_output, segments = transcription_whisper("audio.mp3")
print(formatted_output)
# Access individual segments
for seg in segments:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")
```
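Segment times come back as raw seconds. A small hypothetical helper (not part of `app.py`) can render them as readable timestamps:

```python
def fmt_ts(seconds: float) -> str:
    """Format a second offset as HH:MM:SS for display next to segment text."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

# e.g. print(f"[{fmt_ts(seg['start'])} - {fmt_ts(seg['end'])}] {seg['text']}")
```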
#### Generate Minutes of Meeting
```python
from app import optimize
for chunk in optimize("LLAMA", "audio.mp3"):
    print(chunk, end="", flush=True)
```
#### Translate Transcription
```python
from app import optimize_translate
for chunk in optimize_translate("LLAMA", "audio.mp3", "Spanish"):
    print(chunk, end="", flush=True)
```
---
## 🏗️ Architecture
### Component Overview
```
┌───────────────────────────────────────────────────────────┐
│              Gradio Web Interface (UI Layer)              │
├───────────────────────────────────────────────────────────┤
│                                                           │
│   ┌────────────────────┐      ┌────────────────┐          │
│   │ Audio/Video Input  │      │  Model Select  │          │
│   └────────────────────┘      └────────────────┘          │
│                                                           │
│   ┌──────────────────────────────────────────────┐        │
│   │   Transcription | MOM | Translation Output   │        │
│   └──────────────────────────────────────────────┘        │
├───────────────────────────────────────────────────────────┤
│               Multi-Module Processing Layer               │
├─────────────────┬──────────────────┬──────────────────────┤
│  Transcription  │  MOM Generation  │     Translation      │
│     Module      │      Module      │        Module        │
│ • Download      │ • System Prompt  │ • Language           │
│ • Convert       │ • User Prompt    │   Validation         │
│ • Transcribe    │ • Generation     │ • Extraction         │
│                 │                  │ • Translation        │
├─────────────────┴──────────────────┴──────────────────────┤
│                   LLM Integration Layer                   │
├───────────────────────────────────────────────────────────┤
│                                                           │
│           LLAMA | PHI | QWEN | DEEPSEEK | Gemma           │
│        (with 4-bit Quantization & GPU Acceleration)       │
│                                                           │
└───────────────────────────────────────────────────────────┘
```
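The layers above compose into a simple generator pipeline. The sketch below stubs each stage so it runs standalone; the real implementations and signatures live in `app.py` and may differ:

```python
from typing import Iterator

# Stub stages standing in for the real implementations in app.py
def transcription_whisper(audio: str) -> tuple:
    """Stub: return (formatted transcript, segment list)."""
    return (f"transcript of {audio}", [])

def messages_for(transcript: str) -> list:
    """Stub: wrap the transcript in a chat-style message array."""
    return [{"role": "user", "content": transcript}]

def generate(model: str, messages: list) -> Iterator[str]:
    """Stub: stream LLM output chunk by chunk."""
    yield f"[{model}] MOM for: {messages[0]['content']}"

def optimize(model: str, audio: str) -> Iterator[str]:
    """UI entry point: transcribe, build prompts, stream the result."""
    transcript, _segments = transcription_whisper(audio)
    yield from generate(model, messages_for(transcript))
```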
### Key Functions
| Function | Purpose | Input | Output |
|----------|---------|-------|--------|
| `transcription_whisper()` | Convert audio to text | Audio file/URL | Formatted transcript |
| `user_prompt_for()` | Build MOM generation prompt | Audio source | User prompt string |
| `messages_for()` | Build message structure | Audio source | Message array |
| `generate()` | Route to LLM for MOM | Model, audio | Generator yielding output |
| `optimize()` | Execute MOM generation | Model, audio | Streaming MOM content |
| `user_prompt_translate()` | Build translation prompt | Audio, language | Translation prompt |
| `messages_for_translate()` | Build translation messages | Audio, language | Message array |
| `translate_transcribe()` | Execute translation | Model, audio, lang | Streaming translation |
| `optimize_translate()` | Route translation task | Model, audio, lang | Streaming result |
| `valid_language()` | Validate language code | Language string | Boolean |
---
## 🐛 Troubleshooting
### Issue: YouTube download fails
**Solution**: Update YouTube cookies or use direct file upload
```bash
export YOUTUBE_COOKIES="your_updated_cookies"
# or use direct file upload instead
```
### Issue: CUDA out of memory
**Solution**: Reduce model size or use CPU inference
```python
device = "cpu" # Force CPU usage
```
### Issue: HuggingFace authentication failed
**Solution**: Verify HF_TOKEN in .env file
```bash
huggingface-cli login # Interactive login
```
### Issue: Transcription is slow
**Solution**: Ensure CUDA is properly configured
```python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```
### Issue: Language validation fails
**Solution**: Use full language name or ISO code
```python
# Valid formats:
valid_language("English") # Full name
valid_language("en") # ISO 639-1 code
valid_language("eng") # ISO 639-3 code
```
### Issue: Memory issues with large files
**Solution**: Reduce chunk size or break audio into segments
```python
# Process smaller chunks
segment_duration = 300 # 5 minutes per segment
```
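One way to apply this is to compute fixed windows over the recording and process each window separately; `chunk_bounds` below is a hypothetical helper, not a function in `app.py`:

```python
def chunk_bounds(duration_s: float, segment_s: int = 300):
    """Yield (start, end) second offsets covering the full duration."""
    start = 0.0
    while start < duration_s:
        yield (start, min(start + segment_s, duration_s))
        start += segment_s
```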
### Issue: Generated MOM missing action items
**Solution**: Try a different model or update the system prompt
- Larger instruct models from the supported list (e.g., LLAMA or Gemma) typically produce better-structured output
- QWEN is faster and generally reliable
---
## 📁 File Structure
```
smartscribe/
├── app.py              # Main application
├── requirements.txt    # Python dependencies
├── cookies.txt         # YouTube cookies (optional)
├── README.md           # This file
├── LICENSE             # Apache 2.0 License
└── .env                # Environment variables (git-ignored)
---
## 📄 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
---
## 📚 Citation
If you use SmartScribe in your project, please cite:
```bibtex
@software{smartscribe2025,
author = {Asutosha Nanda},
title = {SmartScribe},
year = {2025},
url = {https://huggingface.co/spaces/itsasutosha/SmartScribe}
}
```
---
<div align="center">
**[⬆ Back to Top](#smartscribe)**
**Intelligent Audio Transcription & Meeting Documentation**
Powered by Advanced LLMs and Faster-Whisper
Deployed on [HuggingFace Spaces](https://huggingface.co/spaces/itsasutosha/SmartScribe)
</div> |