---
title: SmartScribe
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Transcription, Summarization & Translation
---

# SmartScribe

Python · Whisper · Faster-Whisper · HuggingFace · LLAMA · Gradio

**AI-Powered Audio Transcription, Meeting Minutes Generation, and Multi-Language Translation**


## ✨ Features

### 🎙️ Audio/Video Transcription

- Convert YouTube links or local audio/video files to text
- Support for multiple audio formats (MP3, WAV, M4A, etc.)
- GPU-accelerated transcription using Faster-Whisper
- Timestamped transcription output
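Each Faster-Whisper segment carries start and end times, so the timestamped transcript is a simple join over segments. A minimal sketch of that formatting step (the `format_segment` helper and the sample data are illustrative, not part of `app.py`):

```python
def format_segment(start, end, text):
    """Render one transcription segment as a timestamped line."""
    return f"[{start:.2f}s - {end:.2f}s] {text.strip()}"

# Illustrative segments in the shape Faster-Whisper returns (start, end, text)
segments = [
    {"start": 0.0, "end": 3.52, "text": " Welcome to the meeting."},
    {"start": 3.52, "end": 7.1, "text": " Let's review the agenda."},
]
transcript = "\n".join(
    format_segment(s["start"], s["end"], s["text"]) for s in segments
)
```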

๐ŸŒ Multi-Language Translation

  • Translate transcriptions into any supported language
  • Language validation using pycountry
  • Clean, paragraph-formatted output
  • Preserves original meaning and tone

๐Ÿ–ฅ๏ธ Interactive Web UI

  • Beautiful Gradio interface
  • Drag-and-drop file upload
  • YouTube link support
  • Side-by-side input and output panels
  • Model selection dropdown
  • Real-time streaming responses
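Real-time streaming works the way Gradio handlers usually do: the event handler is a generator that yields a progressively longer string, and each yield refreshes the output textbox. A sketch of the pattern (the function name is illustrative):

```python
def stream_response(chunks):
    """Yield the cumulative text so the output textbox grows as tokens arrive."""
    partial = ""
    for chunk in chunks:
        partial += chunk
        yield partial

# Each yielded value replaces the textbox contents in the UI
updates = list(stream_response(["Meeting ", "minutes ", "follow."]))
```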

๐Ÿ“ Minutes of Meeting Generation

  • Automatically generate structured MOM documents
  • Professional summary with participants and date
  • Key discussion points extraction
  • Takeaways and conclusions identification
  • Actionable items with clear ownership and deadlines
  • Markdown-formatted output
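Generation follows the standard chat-model recipe: a system prompt fixes the MOM structure, and the transcript arrives as the user message (this is what `messages_for()` builds; see the Key Functions table). The prompt wording below is an assumption for illustration, not the exact text in `app.py`:

```python
SYSTEM_PROMPT = (
    "You write Minutes of Meeting in Markdown with a summary, "
    "key discussion points, takeaways, and action items with owners."
)  # Assumed wording; the real prompt lives in app.py

def messages_for(transcript):
    """Build the chat-style message array passed to the LLM."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Write minutes for this transcript:\n{transcript}"},
    ]

msgs = messages_for("Alice: let's ship Friday. Bob: agreed.")
```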

### 🤖 Multi-Model Support

- LLAMA 3.2 3B Instruct
- PHI 4 Mini Instruct
- QWEN 3 4B Instruct
- DeepSeek R1 Distill Qwen 1.5B
- Google Gemma 3 4B IT

### ⚡ Performance Optimization

- 4-bit quantization for efficient inference
- GPU acceleration support
- Memory-efficient model loading
- Garbage collection and cache clearing
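The cleanup step amounts to dropping model references, running Python's garbage collector, and emptying the CUDA cache. A hedged sketch (the helper name is illustrative); it degrades gracefully when torch or a GPU is absent:

```python
import gc

def free_memory(model=None, tokenizer=None):
    """Collect garbage and clear the CUDA cache so the next model can load."""
    del model, tokenizer  # Drop this frame's references before collecting
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # Return cached GPU blocks to the allocator
    except ImportError:
        pass  # CPU-only environment: nothing to clear
```

Calling `free_memory()` between model switches keeps peak VRAM close to a single model's footprint.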

## 🤖 Supported Models

| Model    | Provider  | Size | Speed | Quality | Best For          |
|----------|-----------|------|-------|---------|-------------------|
| LLAMA    | Meta      | 3B   | ⚡⚡    | ⭐⭐⭐⭐    | Balanced          |
| PHI      | Microsoft | 4B   | ⚡⚡    | ⭐⭐⭐⭐    | General           |
| QWEN     | Alibaba   | 4B   | ⚡⚡⚡   | ⭐⭐⭐⭐    | Fast              |
| DEEPSEEK | DeepSeek  | 1.5B | ⚡⚡⚡   | ⭐⭐⭐     | Minimal Resources |
| Gemma    | Google    | 4B   | ⚡⚡⚡   | ⭐⭐⭐⭐    | Efficient         |

## 📦 Requirements

### System Requirements

- Python 3.8+
- CUDA-capable GPU (recommended for transcription)
- 8GB+ RAM
- FFmpeg for audio processing

### Python Dependencies

```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.30.0
faster-whisper>=0.10.0
yt-dlp>=2023.0.0
pydub>=0.25.0
bitsandbytes>=0.41.0
accelerate>=0.20.0
pycountry>=23.0.0
huggingface-hub>=0.16.0
```

## 🔧 Local Installation

### 1. Create a Virtual Environment

```bash
python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate  # On Windows
```

### 2. Install Dependencies

```bash
pip install -r requirements.txt
```

### 3. Set Up Your HuggingFace Token

Create a `.env` file in the project root:

```
HF_TOKEN=your_huggingface_token_here
```

Get your token from [HuggingFace Settings](https://huggingface.co/settings/tokens).

### 4. Set Up YouTube Cookies (Optional)

For YouTube link support, set an environment variable or create `cookies.txt`:

```bash
export YOUTUBE_COOKIES="your_cookies_content"
```

Alternatively, create `cookies.txt` in Netscape HTTP Cookie format.
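One way to reconcile the two options is to materialize `YOUTUBE_COOKIES` into a cookie file that `yt-dlp` can read via its `cookiefile` option. A sketch under that assumption (the helper is illustrative, not necessarily how `app.py` does it):

```python
import os

def resolve_cookies(path="cookies.txt"):
    """Write YOUTUBE_COOKIES into a cookie file; return its path if one exists."""
    content = os.environ.get("YOUTUBE_COOKIES")
    if content:
        with open(path, "w") as f:
            f.write(content)
    return path if os.path.exists(path) else None
```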


โš™๏ธ Configuration

Model Selection

Edit model paths in app.py:

LLAMA = "meta-llama/Llama-3.2-3B-Instruct"
QWEN = "Qwen/Qwen3-4B-Instruct-2507"
PHI = "microsoft/Phi-4-mini-instruct"
DEEPSEEK = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
Gemma = 'google/gemma-3-4b-it'

Quantization Configuration

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4'
)

Server Configuration

ui.launch(server_name="0.0.0.0", server_port=7860)

โ˜๏ธ Deployment

HuggingFace Spaces

SmartScribe is deployed and available at: https://huggingface.co/spaces/itsasutosha/SmartScribe

Features:

  • โœ… Free to use
  • โœ… No installation needed
  • โœ… GPU-accelerated inference
  • โœ… Persistent storage for temporary files
  • โœ… Real-time streaming output

To Deploy Your Own:

  1. Create a HuggingFace account at huggingface.co

  2. Create a new Space

  3. Select "Gradio" as the framework

  4. Upload your repository files

  5. Add secrets in Space settings:

    • HF_TOKEN: Your HuggingFace token
    • YOUTUBE_COOKIES: (Optional) YouTube authentication cookies
  6. Space will automatically build and deploy


## 🎮 Usage

### Quick Start: Live Demo

🌐 **Try Online**

Visit the live application: [SmartScribe on HuggingFace Spaces](https://huggingface.co/spaces/itsasutosha/SmartScribe)

No installation required. Just upload your audio/video or paste a YouTube link.

### 1. Launch the Application (Local Setup)

```bash
python app.py
```

The server binds to 0.0.0.0 on port 7860; open http://localhost:7860 in your browser.

### 2. Using the Web UI

1. **Upload content:** upload an audio/video file directly, or paste a YouTube link
2. **Choose an operation:**
   - Click "Transcribe" to extract text from audio
   - Click "Summarize" to generate Minutes of Meeting
   - Click "Translate" for multi-language translation
3. **Select a model:** choose your preferred LLM from the dropdown
4. **View results:** output appears in the corresponding text areas

### Programmatic Usage

#### Transcribe Audio

```python
from app import transcription_whisper

formatted_output, segments = transcription_whisper("audio.mp3")
print(formatted_output)

# Access individual segments
for seg in segments:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")
```

#### Generate Minutes of Meeting

```python
from app import optimize

for chunk in optimize("LLAMA", "audio.mp3"):
    print(chunk, end="", flush=True)
```

#### Translate a Transcription

```python
from app import optimize_translate

for chunk in optimize_translate("LLAMA", "audio.mp3", "Spanish"):
    print(chunk, end="", flush=True)
```

๐Ÿ—๏ธ Architecture

Component Overview

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚            Gradio Web Interface (UI Layer)               โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”              โ”‚
โ”‚  โ”‚ Audio/Video Input  โ”‚  โ”‚  Model Select  โ”‚              โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜              โ”‚
โ”‚                                                            โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”‚
โ”‚  โ”‚  Transcription | MOM | Translation Output     โ”‚      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚          Multi-Module Processing Layer                   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                 โ”‚                  โ”‚                  โ”‚
โ”‚  Transcription  โ”‚  MOM Generation  โ”‚   Translation   โ”‚
โ”‚  Module         โ”‚  Module          โ”‚   Module        โ”‚
โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€    โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€  โ”‚   โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€  โ”‚
โ”‚  โ€ข Download     โ”‚  โ€ข System Prompt โ”‚  โ€ข Language     โ”‚
โ”‚  โ€ข Convert      โ”‚  โ€ข User Prompt   โ”‚    Validation   โ”‚
โ”‚  โ€ข Transcribe   โ”‚  โ€ข Generation    โ”‚  โ€ข Extraction   โ”‚
โ”‚                 โ”‚                  โ”‚  โ€ข Translation  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚              LLM Integration Layer                      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                          โ”‚
โ”‚  LLAMA | PHI | QWEN | DEEPSEEK | Gemma                โ”‚
โ”‚  (with 4-bit Quantization & GPU Acceleration)          โ”‚
โ”‚                                                          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

### Key Functions

| Function | Purpose | Input | Output |
|----------|---------|-------|--------|
| `transcription_whisper()` | Convert audio to text | Audio file/URL | Formatted transcript |
| `user_prompt_for()` | Build MOM generation prompt | Audio source | User prompt string |
| `messages_for()` | Build message structure | Audio source | Message array |
| `generate()` | Route to LLM for MOM | Model, audio | Generator yielding output |
| `optimize()` | Execute MOM generation | Model, audio | Streaming MOM content |
| `user_prompt_translate()` | Build translation prompt | Audio, language | Translation prompt |
| `messages_for_translate()` | Build translation messages | Audio, language | Message array |
| `translate_transcribe()` | Execute translation | Model, audio, lang | Streaming translation |
| `optimize_translate()` | Route translation task | Model, audio, lang | Streaming result |
| `valid_language()` | Validate language code | Language string | Boolean |
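`valid_language()` can be implemented directly on top of pycountry, whose `languages.lookup()` accepts full names as well as ISO 639 codes and raises `LookupError` for anything unknown. A sketch along those lines (guarded so it degrades when pycountry is missing):

```python
def valid_language(name):
    """True if pycountry recognizes `name` as a language name or ISO 639 code."""
    try:
        import pycountry
    except ImportError:
        return False  # pycountry not installed; cannot validate
    try:
        pycountry.languages.lookup(name)  # Raises LookupError when unknown
        return True
    except LookupError:
        return False
```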

๐Ÿ› Troubleshooting

Issue: YouTube download fails

Solution: Update YouTube cookies or use direct file upload

export YOUTUBE_COOKIES="your_updated_cookies"
# or use direct file upload instead

Issue: CUDA out of memory

Solution: Reduce model size or use CPU inference

device = "cpu"  # Force CPU usage

Issue: HuggingFace authentication failed

Solution: Verify HF_TOKEN in .env file

huggingface-cli login  # Interactive login

Issue: Transcription is slow

Solution: Ensure CUDA is properly configured

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Issue: Language validation fails

Solution: Use full language name or ISO code

# Valid formats:
valid_language("English")  # Full name
valid_language("en")       # ISO 639-1 code
valid_language("eng")      # ISO 639-3 code

### Issue: Memory issues with large files

**Solution:** reduce the chunk size or break the audio into segments.

```python
# Process smaller chunks
segment_duration = 300  # 5 minutes per segment
```
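Splitting by duration reduces peak memory because only one window of audio is decoded at a time. The boundary arithmetic can be sketched as (the helper name is illustrative):

```python
def chunk_bounds(total_seconds, segment_duration=300):
    """Split a duration into (start, end) windows of at most segment_duration."""
    return [
        (start, min(start + segment_duration, total_seconds))
        for start in range(0, total_seconds, segment_duration)
    ]

# A 700-second recording becomes three windows
bounds = chunk_bounds(700)
```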

### Issue: Generated MOM is missing action items

**Solution:** try a different model or adjust the system prompt.

- Models vary in how reliably they follow the MOM structure; experiment with the dropdown
- QWEN is faster and generally reliable

๐Ÿ“ File Structure

smartscribe/
โ”œโ”€โ”€ app.py                      # Main application
โ”œโ”€โ”€ requirements.txt            # Python dependencies
โ”œโ”€โ”€ cookies.txt                # YouTube cookies (optional)
โ”œโ”€โ”€ README.md                  # This file
โ”œโ”€โ”€ LICENSE                    # MIT License
โ””โ”€โ”€ .env                       # Environment variables (git-ignored)

## 📄 License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.


## 🎓 Citation

If you use SmartScribe in your project, please cite:

```bibtex
@software{smartscribe2025,
  author = {Asutosha Nanda},
  title = {SmartScribe},
  year = {2025},
  url = {https://huggingface.co/spaces/itsasutosha/SmartScribe}
}
```

[⬆ Back to Top](#smartscribe)

*Intelligent Audio Transcription & Meeting Documentation*
*Powered by Advanced LLMs and Faster-Whisper*

Deployed on [HuggingFace Spaces](https://huggingface.co/spaces/itsasutosha/SmartScribe)