Fnu Mahnoor committed on
Commit 4f54a59 · 1 Parent(s): bf2d622

update readme

Files changed (1)
  1. README.md +63 -215
README.md CHANGED
@@ -1,245 +1,93 @@
- # Voice Summarizer - Open-Source Speech-to-Text Transcriber
-
- A comprehensive, open-source speech-to-text transcription application with AI-powered meeting analysis. Uses Faster Whisper for local transcription and local LLMs for intelligent analysis - no external APIs required after initial setup.
-
- ## Features
-
- - **🎤 Live Transcription**: Real-time speech-to-text from microphone input
- - **🌐 Web Interface**: Modern Gradio-based UI with multiple transcription modes
- - **📹 Video URL Support**: Transcribe audio from YouTube, Vimeo, Teams recordings, and 1000+ other platforms
- - **🤖 AI Meeting Analysis**: Local LLM analysis for meeting notes, action items, and key insights
- - **💾 Auto-Saving**: Automatic saving of transcripts and analyses with timestamps
- - **🔄 Multiple Modes**: Real-time streaming, after-speech accumulation, file upload, and video URL processing
- - **⚡ Optimized Performance**: Uses Faster Whisper for fast, accurate transcription
- - **🔒 Privacy-First**: All processing happens locally, no data sent to external servers
-
- ## Prerequisites
-
- - **Python 3.8+** (3.12 recommended)
- - **FFmpeg** (required for video URL processing)
- - **Git** (for cloning the repository)
- - **Conda/Miniconda** (recommended for environment management)
-
- ## Installation
-
- ### 1. Clone the Repository
-
- ```bash
- git clone https://github.com/yourusername/voice-summarizer.git
- cd voice-summarizer
- ```
-
- ### 2. Set Up Python Environment
-
- #### Using Conda (Recommended)
-
- ```bash
- # Create a new conda environment
- conda create -n voice-summarizer python=3.12
- conda activate voice-summarizer
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- #### Using venv (Alternative)
-
- ```bash
- # Create virtual environment
- python -m venv venv
- source venv/bin/activate  # On Windows: venv\Scripts\activate
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- ### 3. Install FFmpeg
-
- FFmpeg is required for processing video URLs. Choose one of the following methods:
-
- #### Windows (Chocolatey)
- ```bash
- choco install ffmpeg
- ```
-
- #### Windows (Conda)
- ```bash
- conda install ffmpeg -c conda-forge
- ```
-
- #### Windows (Manual)
- 1. Download from https://ffmpeg.org/download.html
- 2. Extract to a folder (e.g., `C:\ffmpeg`)
- 3. Add `C:\ffmpeg\bin` to your system PATH
-
- #### Linux
- ```bash
- sudo apt install ffmpeg  # Ubuntu/Debian
- # or
- sudo dnf install ffmpeg  # Fedora
- ```
-
- #### macOS
- ```bash
- brew install ffmpeg
- ```
-
- ### 4. Configure Hugging Face Token
-
- Create a `.env` file in the project root:
-
- ```bash
- # Create .env file
- echo "HF_TOKEN=your_hugging_face_token_here" > .env
- ```
-
- Get your token from: https://huggingface.co/settings/tokens
-
- **Note**: The token is required for downloading models. Without it, you'll get authentication errors.
-
 
- ## Usage
-
- ### Web Application (Recommended)
-
- Launch the interactive web interface:
-
- ```bash
- python app.py
- ```
-
- This opens a Gradio web app with three main tabs:
-
- #### 1. Live Recording Tab
- - **Real-time Mode**: Start speaking immediately - transcription appears as you speak
- - **After Speech Mode**: Speak first, then click "Transcribe Accumulated" to process
- - **Analysis**: Click "Analyze Transcription" for AI-powered meeting insights
-
- #### 2. File Upload Tab
- - Upload audio/video files (WAV, MP3, M4A, MP4, etc.)
- - Automatic transcription and optional AI analysis
-
- #### 3. Video URL Tab
- - Paste URLs from YouTube, Vimeo, Teams recordings, etc.
- - Supports Microsoft Stream, OneDrive, SharePoint (for Teams meetings)
- - Automatic audio extraction and transcription
-
- ### Command-Line Interface
-
- #### Live Transcription
- ```bash
- python cli.py live
- ```
-
- #### File Transcription
- ```bash
- python cli.py transcribe path/to/audio.wav --model base --analyze
- ```
-
- #### Available Models
- - `tiny` (fastest, least accurate)
- - `base` (good balance)
- - `small` (better accuracy)
- - `medium` (high accuracy)
- - `large` (best accuracy, slowest)
-
- ## Outputs
-
- All results are automatically saved to the `outputs/` directory with timestamps:
-
- ```
- outputs/
- ├── 2026-01-18_14-30-00_transcript.txt
- ├── 2026-01-18_14-30-00_analysis.txt
- ├── 2026-01-18_14-45-15_transcript.txt
- └── 2026-01-18_14-45-15_analysis.txt
- ```
-
- ## Supported Formats
-
- ### Audio Files
- - WAV, MP3, M4A, FLAC, OGG, AAC
- - Any format supported by librosa/soundfile
-
- ### Video URLs
- - YouTube, Vimeo, Dailymotion
- - Microsoft Stream/OneDrive/SharePoint (Teams recordings)
- - TikTok, Instagram, Twitter
- - 1000+ platforms supported by yt-dlp
-
- ## Troubleshooting
-
- ### Common Issues
-
- #### "FFmpeg not found" Error
- - Ensure FFmpeg is installed and in your PATH
- - Test with: `ffmpeg -version`
-
- #### "Authentication failed" for Hugging Face
- - Check your `.env` file has a valid `HF_TOKEN`
- - Regenerate token if needed
-
- #### Video URL Not Working
- - Some private/protected videos require authentication
- - Try downloading manually and use the File Upload tab
- - Check yt-dlp logs for specific errors
-
- #### LLM Analysis Not Working
- - Ensure you have a Hugging Face token
- - Check internet connection for model downloads
- - First run may take time to download models
-
- #### Microphone Not Detected
- - Check browser permissions for microphone access
- - Try refreshing the page
- - Ensure no other applications are using the microphone
-
- ### Performance Tips
-
- - Use smaller Whisper models (`tiny`, `base`) for faster processing
- - Close other applications to free up CPU/GPU resources
- - For GPU acceleration, ensure CUDA is available
-
- ## Project Structure
-
- ```
- voice-summarizer/
- ├── app.py             # Main Gradio web application
- ├── cli.py             # Command-line interface
- ├── requirements.txt   # Python dependencies
- ├── .env               # Environment variables (create this)
- ├── outputs/           # Auto-saved transcripts and analyses
- └── src/
-     ├── transcription/ # Transcription modules
-     │   ├── streaming_transcriber.py
-     │   └── file_transcriber.py
-     ├── analysis/      # LLM analysis modules
-     │   └── llm.py
-     ├── handlers/      # Request handlers
-     │   ├── transcription_handler.py
-     │   └── analysis_handler.py
-     └── io/            # Input/output utilities
-         └── saver.py
- ```
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create a feature branch
- 3. Make your changes
- 4. Test thoroughly
- 5. Submit a pull request
-
- ## License
-
- This project uses open-source libraries:
- - Faster Whisper: MIT License
- - Transformers: Apache 2.0
- - Gradio: Apache 2.0
- - yt-dlp: Unlicense
-
- ## Acknowledgments
-
- - OpenAI Whisper for the base transcription model
- - Faster Whisper for optimized implementation
- - Hugging Face for model hosting and API
- - yt-dlp for video downloading capabilities
 
+ # 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text
+
+ **A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**
+
+ VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thoughts. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and streamlining it into detailed guidelines within local hardware constraints.
+
+ ---
+
+ ## ✨ Features
+
+ * **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
+ * **🤖 AI Meeting Analysis**: Integrated "Meeting Manager" logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
+ * **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
+ * **📹 Universal Video Support**: Ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
+ * **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
+ * **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
+ * **💾 Auto-Scribe**: Automatic persistence of all sessions to the `outputs/` directory with unique timestamps.
+ * **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
+
+ ---
+
 
+ ## 🏗️ Technical Architecture
+
+ To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:
+
+ ### 1. Signal Normalization
+ **PyAudio** samples audio at 16 kHz, and each raw 16-bit integer sample is normalized into a `float32` value. This is the essential digital handshake between the microphone and the neural network.
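The normalization step above can be sketched as follows (a minimal NumPy illustration; the project's actual buffer handling may differ):

```python
import numpy as np

def normalize_chunk(raw_bytes: bytes) -> np.ndarray:
    """Convert a raw 16-bit PCM chunk into float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    # int16 spans -32768..32767; dividing by 32768 maps it into [-1.0, 1.0)
    return samples.astype(np.float32) / 32768.0

# Example: a chunk containing the loudest and quietest possible samples
chunk = np.array([32767, 0, -32768], dtype=np.int16).tobytes()
print(normalize_chunk(chunk))  # values close to [1.0, 0.0, -1.0]
```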
+ ### 2. Contextual Anchoring
+ The transcriber keeps a **sliding window** of history: the last 200 characters of the transcript are fed back into the `initial_prompt` of the next pass, which curbs phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
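That feedback loop can be sketched roughly like this (the names here are illustrative; `transcribe_chunk` stands in for the project's actual ASR call):

```python
WINDOW = 200  # characters of history carried into the next pass

def run_session(chunks, transcribe_chunk):
    """Transcribe chunks in order, anchoring each pass on recent context."""
    transcript = ""
    for chunk in chunks:
        # Feed the tail of the running transcript back in as context
        context = transcript[-WINDOW:]
        transcript += transcribe_chunk(chunk, initial_prompt=context)
    return transcript

# Toy stand-in: "transcription" simply echoes the chunk text
result = run_session(["hello ", "world"], lambda c, initial_prompt: c)
print(result)  # hello world
```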
 
 
 
+ ### 3. Inference Pipeline
+ * **ASR:** `faster-whisper` (`base` model) with `int8` quantization for CPU efficiency.
+ * **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" that aligns scattered thoughts into a streamlined roadmap.
 
 
 
 
 
 
+ ---
+
+ ## 📂 Project Structure
+
+ ```plaintext
+ .
+ ├── app.py             # Main entry point (Gradio UI)
+ ├── src/
+ │   ├── transcription/ # ASR logic (live, file, and streaming engines)
+ │   ├── analysis/      # Llama-3.2-3B integration
+ │   ├── handlers/      # Orchestration between audio and text processing
+ │   └── io/            # Logic for persistent storage
+ ├── outputs/           # Local storage for transcripts and AI analysis
+ └── requirements.txt   # Project dependencies
+ ```
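The timestamped persistence handled under `src/io/` can be sketched like this (an illustrative `save_transcript` helper mirroring the `outputs/` naming scheme, not the project's actual code):

```python
from datetime import datetime
from pathlib import Path

def save_transcript(text: str, outdir: str = "outputs") -> Path:
    """Persist a session transcript with a unique timestamp,
    e.g. outputs/2026-01-18_14-30-00_transcript.txt."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = out / f"{stamp}_transcript.txt"
    path.write_text(text, encoding="utf-8")
    return path
```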
+ ## 🚀 Getting Started
+
+ ### 1. Prerequisites
+
+ * **Python 3.10+**
+ * **FFmpeg**: essential for audio stream handling and URL processing.
+   * Windows: `choco install ffmpeg`
+   * Mac: `brew install ffmpeg`
+   * Linux: `sudo apt install ffmpeg`
 
+ ### 2. Installation
+
+ Clone the repository and set up a local environment:
+
+ ```bash
+ git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
+ cd vocal-sync-speech-to-text
+ python -m venv venv
+ source venv/bin/activate  # Windows: venv\Scripts\activate
+ pip install -r requirements.txt
+ ```
+ ### 3. Environment Setup
+
+ Create a `.env` file in the root directory:
+
+ ```bash
+ HF_TOKEN=your_huggingface_token
+ ```
+ ### 4. Running the Experiment
+
+ Launch the interface to start the live thought-collection process:
+
+ ```bash
+ python app.py
+ ```
+ ## 🎓 Findings & Learning Autopsy
+
+ * **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1s silent `np.zeros` buffer at launch to initialize the engine.
+ * **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to prevent the model from hallucinating text during silent periods or background noise.
+ * **Context > Model Size**: Discovered that a `base` model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a `large` model listening in a vacuum.
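The first two findings can be sketched together (`transcribe` and `speech_prob` are stand-ins for the project's actual engine and VAD, not its real API):

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz mono, as used throughout the project

def warm_up(transcribe):
    """Push 1s of silence through the engine at launch so the first
    real words of a session aren't dropped by cold-start lag."""
    transcribe(np.zeros(SAMPLE_RATE, dtype=np.float32))

def gated_transcribe(chunk, transcribe, speech_prob, threshold=0.5):
    """Only invoke ASR when voice activity is likely; returning an empty
    string for silence prevents hallucinated text."""
    if speech_prob(chunk) < threshold:
        return ""
    return transcribe(chunk)
```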
 
 
 
+ **Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.