---
title: SmartScribe
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Transcription, Summarization & Translation
---
# SmartScribe

**AI-Powered Audio Transcription, Meeting Minutes Generation, and Multi-Language Translation**
## 📋 Table of Contents

| ✨ Features | 🤖 Supported Models | 📦 Requirements | 🔧 Installation |
|---|---|---|---|
| ⚙️ Configuration | 🎮 Usage | 🏗️ Architecture | 🔍 Troubleshooting |
## ✨ Features

### 🎙️ Audio/Video Transcription
- Convert YouTube links or local audio/video files to text
- Support for multiple audio formats (MP3, WAV, M4A, etc.)
- GPU-accelerated transcription using Faster-Whisper
- Timestamped transcription output
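The timestamped output described above can be sketched directly with Faster-Whisper. The model size `"small"`, the CUDA device settings, and the `audio.mp3` file name below are placeholder assumptions, not the exact values used in `app.py`:

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one segment as a timestamped line, e.g. '[0.00s - 3.20s] Hello'."""
    return f"[{start:.2f}s - {end:.2f}s] {text.strip()}"

def transcribe(audio_path: str, model_size: str = "small"):
    """Sketch of the Faster-Whisper call; requires `pip install faster-whisper`."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency
    # device/compute_type assume a CUDA GPU; use "cpu"/"int8" otherwise
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path)
    return [format_segment(s.start, s.end, s.text) for s in segments]
```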
### 🌍 Multi-Language Translation
- Translate transcriptions into any supported language
- Language validation using pycountry
- Clean, paragraph-formatted output
- Preserves original meaning and tone
### 🖥️ Interactive Web UI
- Beautiful Gradio interface
- Drag-and-drop file upload
- YouTube link support
- Side-by-side input and output panels
- Model selection dropdown
- Real-time streaming responses
### 📝 Minutes of Meeting Generation
- Automatically generate structured MOM documents
- Professional summary with participants and date
- Key discussion points extraction
- Takeaways and conclusions identification
- Actionable items with clear ownership and deadlines
- Markdown-formatted output
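This structure is typically enforced through the system prompt. The exact prompt lives in `app.py`; the version below is only an illustrative sketch:

```python
# Illustrative sketch -- the real system prompt is defined in app.py.
MOM_SYSTEM_PROMPT = """You are an assistant that converts a meeting transcript into
Minutes of Meeting, formatted in Markdown. Always include these sections:

## Summary
Participants, date, and a brief overview of the meeting.

## Key Discussion Points

## Takeaways and Conclusions

## Action Items
Each action item must name a clear owner and a deadline.
"""
```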
### 🤖 Multi-Model Support

- Llama 3.2 3B Instruct
- Phi-4 Mini Instruct
- Qwen3 4B Instruct
- DeepSeek R1 Distill Qwen 1.5B
- Google Gemma 3 4B IT
### ⚡ Performance Optimization
- 4-bit quantization for efficient inference
- GPU acceleration support
- Memory-efficient model loading
- Garbage collection and cache clearing
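The cleanup step can be sketched as a small helper (a hypothetical function, not the exact code in `app.py`), called after a model is no longer referenced:

```python
import gc

def free_memory() -> int:
    """Run after `del model` to reclaim RAM/VRAM; returns the number of
    unreachable objects the garbage collector found."""
    collected = gc.collect()
    try:
        import torch  # optional: also clear the CUDA allocator cache
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return collected
```

Swapping models without a step like this tends to exhaust GPU memory, since the old weights linger in the CUDA cache.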
## 🤖 Supported Models

| Model | Provider | Size | Speed | Quality | Best For |
|---|---|---|---|---|---|
| LLAMA | Meta | 3B | ⚡⚡ | ⭐⭐⭐⭐ | Balanced |
| PHI | Microsoft | 4B | ⚡⚡ | ⭐⭐⭐⭐ | General |
| QWEN | Alibaba | 4B | ⚡⚡⚡ | ⭐⭐⭐⭐ | Fast |
| DEEPSEEK | DeepSeek | 1.5B | ⚡⚡⚡ | ⭐⭐⭐ | Minimal resources |
| Gemma | Google | 4B | ⚡⚡⚡ | ⭐⭐⭐⭐ | Efficient |
## 📦 Requirements

### System Requirements
- Python 3.8+
- CUDA-capable GPU (recommended for transcription)
- 8GB+ RAM
- FFmpeg for audio processing
### Python Dependencies

```text
gradio>=4.0.0
torch>=2.0.0
transformers>=4.30.0
faster-whisper>=0.10.0
yt-dlp>=2023.0.0
pydub>=0.25.0
bitsandbytes>=0.41.0
accelerate>=0.20.0
pycountry>=23.0.0
huggingface-hub>=0.16.0
```
## 🔧 Local Installation

### 1. Create a Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows
```
### 2. Install Dependencies

```bash
pip install -r requirements.txt
```
### 3. Set Up a HuggingFace Token

Create a `.env` file in the project root:

```bash
HF_TOKEN=your_huggingface_token_here
```

Get your token from your [HuggingFace settings](https://huggingface.co/settings/tokens).
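Loading the token at runtime is usually done with `python-dotenv`; the minimal reader below (a hypothetical `read_env` helper, for illustration only) shows the idea without the extra dependency:

```python
from pathlib import Path

def read_env(path: str = ".env") -> dict:
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# hf_token = read_env().get("HF_TOKEN")
```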
### 4. Set Up YouTube Cookies (Optional)

For YouTube link support, set an environment variable or create `cookies.txt`:

```bash
export YOUTUBE_COOKIES="your_cookies_content"
```

Alternatively, create `cookies.txt` in Netscape HTTP Cookie format.
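On platforms where only environment variables are available (such as Spaces secrets), the variable can be written out to `cookies.txt` at startup. `materialize_cookies` below is a hypothetical helper illustrating this, not part of `app.py`:

```python
import os
from pathlib import Path

def materialize_cookies(path: str = "cookies.txt") -> bool:
    """Write the YOUTUBE_COOKIES environment variable to a cookie file.
    Returns True if a file was written, False if the variable is unset."""
    content = os.environ.get("YOUTUBE_COOKIES")
    if not content:
        return False
    Path(path).write_text(content)
    return True
```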
## ⚙️ Configuration

### Model Selection

Edit the model paths in `app.py`:

```python
LLAMA = "meta-llama/Llama-3.2-3B-Instruct"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
PHI = "microsoft/Phi-4-mini-instruct"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
Gemma = "google/gemma-3-4b-it"
```
### Quantization Configuration

```python
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
```
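A config like this is passed to `from_pretrained` when a model is loaded. The sketch below (a hypothetical `load_quantized` helper with assumed defaults such as `device_map="auto"`, not the exact code in `app.py`) shows one way to wire it up:

```python
# The 4-bit settings as a plain dict, so they can be inspected without
# importing transformers.
QUANT_KWARGS = dict(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

def load_quantized(model_id: str):
    """Load a causal LM with 4-bit NF4 quantization (needs a CUDA GPU,
    plus the bitsandbytes and accelerate packages)."""
    import torch  # heavy imports kept local so the module imports anywhere
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant_config, device_map="auto"
    )
    return tokenizer, model
```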
### Server Configuration

```python
ui.launch(server_name="0.0.0.0", server_port=7860)
```
## ☁️ Deployment

### HuggingFace Spaces

SmartScribe is deployed and available at: https://huggingface.co/spaces/itsasutosha/SmartScribe

Features:

- ✅ Free to use
- ✅ No installation needed
- ✅ GPU-accelerated inference
- ✅ Persistent storage for temporary files
- ✅ Real-time streaming output
### To Deploy Your Own

1. Create a HuggingFace account at huggingface.co
2. Create a new Space
3. Select "Gradio" as the framework
4. Upload your repository files
5. Add secrets in the Space settings:
   - `HF_TOKEN`: your HuggingFace token
   - `YOUTUBE_COOKIES`: (optional) YouTube authentication cookies
6. The Space will automatically build and deploy
## 🎮 Usage

### Quick Start: Live Demo

🌐 **Try Online**

Visit the live application: [SmartScribe on HuggingFace Spaces](https://huggingface.co/spaces/itsasutosha/SmartScribe)

No installation required! Just upload your audio/video or paste a YouTube link.
### 1. Launch the Application (Local Setup)

```bash
python app.py
```

The application will start at http://0.0.0.0:7860
### 2. Using the Web UI

1. **Upload content:** upload an audio/video file directly, or paste a YouTube link
2. **Choose an operation:**
   - Click "Transcribe" to extract text from audio
   - Click "Summarize" to generate Minutes of Meeting
   - Click "Translate" for multi-language translation
3. **Select a model:** choose your preferred LLM from the dropdown
4. **View results:** output appears in the corresponding text areas
### Programmatic Usage

#### Transcribe Audio

```python
from app import transcription_whisper

formatted_output, segments = transcription_whisper("audio.mp3")
print(formatted_output)

# Access individual segments
for seg in segments:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")
```
#### Generate Minutes of Meeting

```python
from app import optimize

for chunk in optimize("LLAMA", "audio.mp3"):
    print(chunk, end="", flush=True)
```
#### Translate Transcription

```python
from app import optimize_translate

for chunk in optimize_translate("LLAMA", "audio.mp3", "Spanish"):
    print(chunk, end="", flush=True)
```
## 🏗️ Architecture

### Component Overview

```text
┌────────────────────────────────────────────────────────┐
│            Gradio Web Interface (UI Layer)             │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ┌────────────────────┐       ┌────────────────┐       │
│  │ Audio/Video Input  │       │  Model Select  │       │
│  └────────────────────┘       └────────────────┘       │
│                                                        │
│  ┌──────────────────────────────────────────────┐      │
│  │  Transcription | MOM | Translation Output    │      │
│  └──────────────────────────────────────────────┘      │
├────────────────────────────────────────────────────────┤
│              Multi-Module Processing Layer             │
├──────────────────┬─────────────────┬───────────────────┤
│  Transcription   │  MOM Generation │   Translation     │
│     Module       │     Module      │     Module        │
│  • Download      │ • System Prompt │  • Language       │
│  • Convert       │ • User Prompt   │    Validation     │
│  • Transcribe    │ • Generation    │  • Extraction     │
│                  │                 │  • Translation    │
├──────────────────┴─────────────────┴───────────────────┤
│                 LLM Integration Layer                  │
│                                                        │
│         LLAMA | PHI | QWEN | DEEPSEEK | Gemma          │
│      (with 4-bit Quantization & GPU Acceleration)      │
└────────────────────────────────────────────────────────┘
```
### Key Functions

| Function | Purpose | Input | Output |
|---|---|---|---|
| `transcription_whisper()` | Convert audio to text | Audio file/URL | Formatted transcript |
| `user_prompt_for()` | Build MOM generation prompt | Audio source | User prompt string |
| `messages_for()` | Build message structure | Audio source | Message array |
| `generate()` | Route to LLM for MOM | Model, audio | Generator yielding output |
| `optimize()` | Execute MOM generation | Model, audio | Streaming MOM content |
| `user_prompt_translate()` | Build translation prompt | Audio, language | Translation prompt |
| `messages_for_translate()` | Build translation messages | Audio, language | Message array |
| `translate_transcribe()` | Execute translation | Model, audio, lang | Streaming translation |
| `optimize_translate()` | Route translation task | Model, audio, lang | Streaming result |
| `valid_language()` | Validate language code | Language string | Boolean |
## 🔍 Troubleshooting

**Issue: YouTube download fails**

Solution: Update your YouTube cookies or use direct file upload.

```bash
export YOUTUBE_COOKIES="your_updated_cookies"
# or use direct file upload instead
```
**Issue: CUDA out of memory**

Solution: Reduce model size or use CPU inference.

```python
device = "cpu"  # Force CPU usage
```
**Issue: HuggingFace authentication failed**

Solution: Verify `HF_TOKEN` in the `.env` file.

```bash
huggingface-cli login  # Interactive login
```
**Issue: Transcription is slow**

Solution: Ensure CUDA is properly configured.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```
**Issue: Language validation fails**

Solution: Use a full language name or ISO code.

```python
# Valid formats:
valid_language("English")  # Full name
valid_language("en")       # ISO 639-1 code
valid_language("eng")      # ISO 639-3 code
```
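`valid_language` in `app.py` uses pycountry; a re-implementation sketch (with a tiny hypothetical fallback table for when pycountry is not installed, illustration only) looks like this:

```python
def valid_language(name: str) -> bool:
    """Accepts full names ("English") and ISO 639-1/639-3 codes ("en", "eng")."""
    try:
        import pycountry  # lazy import: optional dependency
        return pycountry.languages.lookup(name) is not None
    except LookupError:   # pycountry found nothing matching `name`
        return False
    except ImportError:   # fallback table, for illustration only
        known = {"english", "en", "eng", "spanish", "es", "spa"}
        return name.strip().lower() in known
```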
**Issue: Memory issues with large files**

Solution: Reduce the chunk size or break the audio into segments.

```python
# Process smaller chunks
segment_duration = 300  # 5 minutes per segment
```
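Segment-based processing can be sketched as below. The `chunk_bounds` and `split_audio` helpers are hypothetical, not part of `app.py`; pydub needs FFmpeg installed to decode most formats:

```python
def chunk_bounds(total_ms: int, segment_ms: int = 300_000):
    """Yield (start_ms, end_ms) windows covering the whole duration."""
    for start in range(0, total_ms, segment_ms):
        yield start, min(start + segment_ms, total_ms)

def split_audio(path: str, segment_ms: int = 300_000):
    """Slice an audio file into 5-minute chunks with pydub."""
    from pydub import AudioSegment  # lazy import: requires FFmpeg
    audio = AudioSegment.from_file(path)  # len(audio) is in milliseconds
    return [audio[start:end] for start, end in chunk_bounds(len(audio), segment_ms)]
```

Each chunk can then be transcribed independently and the results concatenated, keeping peak memory bounded by the segment length rather than the file length.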
**Issue: Generated MOM missing action items**

Solution: Try a different model or update the system prompt.

- Larger models such as LLAMA and Gemma typically produce better-structured output
- QWEN is faster and generally reliable
## 📁 File Structure

```text
smartscribe/
├── app.py              # Main application
├── requirements.txt    # Python dependencies
├── cookies.txt         # YouTube cookies (optional)
├── README.md           # This file
├── LICENSE             # Apache 2.0 License
└── .env                # Environment variables (git-ignored)
```
## 📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## 📖 Citation

If you use SmartScribe in your project, please cite:

```bibtex
@software{smartscribe2025,
  author = {Asutosha Nanda},
  title  = {SmartScribe},
  year   = {2025},
  url    = {https://huggingface.co/spaces/itsasutosha/SmartScribe}
}
```
---

*Intelligent Audio Transcription & Meeting Documentation*

*Powered by Advanced LLMs and Faster-Whisper | Deployed on HuggingFace Spaces*