---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---

# 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text

**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**

VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thoughts. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and distilling it into detailed guidelines under local hardware constraints.

---

## ✨ Features

* **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **🤖 AI Meeting Analysis**: Integrated Meeting Manager logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **📹 Universal Video Support**: Ingests and transcribes audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **💾 Auto-Scribe**: Automatic persistence of all sessions to the `/outputs` directory with unique timestamps.
* **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.

---

## 🏗️ Technical Architecture

To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:

### 1. Signal Normalization

**PyAudio** samples sound at 16 kHz, and the 16-bit integers are normalized into `float32` decimals. This is the essential digital handshake between the microphone and the neural network.

### 2. Contextual Anchoring

The engine implements a **Sliding Window** history.
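As a minimal sketch of that window, the helper below only assembles the prompt text; the function name and the word-boundary trim are illustrative, not the project's actual code:

```python
def sliding_window_prompt(transcript: str, window: int = 200) -> str:
    """Return the tail of the running transcript for use as the ASR
    engine's initial prompt, trimmed to a word boundary so the model
    never anchors on a half-word."""
    if len(transcript) <= window:
        return transcript
    tail = transcript[-window:]
    # Drop a leading partial word introduced by the hard character cut.
    if " " in tail:
        tail = tail.split(" ", 1)[1]
    return tail
```

On each chunk, the previous output is appended to the running transcript and this tail is handed back to the transcriber, biasing decoding toward the vocabulary already established.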
By feeding the last 200 characters of the transcript back into the `initial_prompt`, the system fixes phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").

### 3. Inference Pipeline

* **ASR**: `faster-whisper` (Base model) using `int8` quantization for CPU efficiency.
* **LLM**: `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.

---

## 📂 Project Structure

```plaintext
.
├── app.py              # Main entry point (Gradio UI)
├── src/
│   ├── transcription/  # ASR Logic (Live, File, and Streaming engines)
│   ├── analysis/       # Llama-3.2-3B Integration
│   ├── handlers/       # Orchestration between audio and text processing
│   └── io/             # Logic for persistent storage
├── outputs/            # Local storage for transcripts and AI analysis
└── requirements.txt    # Project dependencies
```

---

## 🚀 Getting Started

### 1. Prerequisites

* **Python 3.10+**
* **FFmpeg**: Essential for audio stream handling and URL processing.
  * Windows: `choco install ffmpeg`
  * Mac: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`

### 2. Installation

Clone the repository and set up a local environment:

```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

### 3. Environment Setup

Create a `.env` file in the root directory:

```bash
HF_TOKEN=your_huggingface_token
```

### 4. Running the Experiment

Launch the interface to start the live thought-collection process:

```bash
python app.py
```

---

## 🎓 Findings & Learning Autopsy

* **The Warm-up Pulse**: Solved the "Cold Start" lag, where the model would miss the first few words, by injecting a 1 s silent `np.zeros` buffer at launch to initialize the engine.
* **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to prevent the model from hallucinating text during silent periods or background noise.
* **Context > Model Size**: Discovered that a "Base" model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a "Large" model listening in a vacuum.

**Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.
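For concreteness, the signal-normalization handshake and the warm-up pulse described above boil down to a few lines of NumPy. This is a sketch; the function names are illustrative, not the project's actual API:

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, matching the PyAudio capture rate

def int16_to_float32(pcm: np.ndarray) -> np.ndarray:
    """Normalize 16-bit PCM samples into [-1.0, 1.0] float32 --
    the digital handshake the neural network expects."""
    return pcm.astype(np.float32) / 32768.0

def warmup_buffer(seconds: float = 1.0) -> np.ndarray:
    """One second of float32 silence, fed through the engine once
    at launch to absorb the cold-start lag."""
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)
```

Running the silent buffer through the model at startup forces weights and caches to load before the first real utterance arrives, which is why the opening words stop getting dropped.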