---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text

**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**

VocalSync Intelligence is a learning experiment that explores the bridge between raw audio waves and structured digital text. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and distilling it into detailed guidelines, all within local hardware constraints.
---
## ✨ Features

* **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **🤖 AI Meeting Analysis**: Integrated "Meeting Manager" logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **📹 Universal Video Support**: Ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **💾 Auto-Scribe**: Automatic persistence of every session to the `/outputs` directory with unique timestamps.
* **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
---
## 🏗️ Technical Architecture

To balance semantic clarity with local CPU limitations, the project rests on three technical pillars:
### 1. Signal Normalization

**PyAudio** samples sound at 16 kHz, and the raw 16-bit integer samples are normalized into `float32` values. This is the essential digital handshake between the microphone and the neural network.
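As a rough sketch of that handshake (the repo's actual capture code lives in `src/transcription/`; `normalize_chunk` is a hypothetical helper name):

```python
import numpy as np

def normalize_chunk(raw_bytes: bytes) -> np.ndarray:
    """Convert raw 16-bit PCM bytes into float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    # int16 spans [-32768, 32767]; dividing by 32768 maps it into [-1, 1)
    return samples.astype(np.float32) / 32768.0

# Example: a loud positive peak and a quieter negative sample
chunk = np.array([16384, -8192], dtype=np.int16).tobytes()
print(normalize_chunk(chunk))  # values 0.5 and -0.25
```

Whisper-family models expect exactly this format: mono `float32` audio at 16 kHz.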
### 2. Contextual Anchoring

A **sliding-window** history: by feeding the last 200 characters of the transcript back into the `initial_prompt`, the system curbs phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
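A minimal sketch of the sliding window (function and variable names here are illustrative, not the repo's actual API):

```python
MAX_CONTEXT_CHARS = 200  # size of the window fed back to the decoder

def update_context(history: str, new_text: str) -> tuple[str, str]:
    """Append freshly decoded text and return (full_history, prompt_window).

    The window is later passed as `initial_prompt` so the model decodes the
    next chunk with the running vocabulary already in scope.
    """
    history = (history + " " + new_text).strip()
    return history, history[-MAX_CONTEXT_CHARS:]

history, prompt = update_context("", "Today we plan the AI roadmap")
# next chunk: model.transcribe(next_audio, initial_prompt=prompt)
```

Because only the tail of the transcript is kept, prompt length stays bounded no matter how long the session runs.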
### 3. Inference Pipeline

* **ASR:** `faster-whisper` (Base model) with `int8` quantization for CPU efficiency.
* **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.
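Putting the pieces together, the ASR side can be instantiated roughly like this (a sketch using the `faster-whisper` package's public API; the repo's wrapper code may differ):

```python
from faster_whisper import WhisperModel

# int8 weights keep the Base model small and fast on commodity CPUs
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(path: str, prompt: str | None = None) -> str:
    # `segments` is a lazy generator; joining it drives the actual decoding
    segments, _info = model.transcribe(path, initial_prompt=prompt)
    return " ".join(seg.text.strip() for seg in segments)
```

The `initial_prompt` parameter is where the sliding-window context from pillar 2 plugs in.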
---
## 📁 Project Structure

```plaintext
.
├── app.py               # Main entry point (Gradio UI)
├── src/
│   ├── transcription/   # ASR logic (live, file, and streaming engines)
│   ├── analysis/        # Llama-3.2-3B integration
│   ├── handlers/        # Orchestration between audio and text processing
│   └── io/              # Logic for persistent storage
├── outputs/             # Local storage for transcripts and AI analysis
└── requirements.txt     # Project dependencies
```
## 🚀 Getting Started

### 1. Prerequisites

* **Python 3.10+**
* **FFmpeg** (essential for audio stream handling and URL processing):
  * Windows: `choco install ffmpeg`
  * Mac: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`
### 2. Installation

Clone the repository and set up a local environment:

```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Environment Setup

Create a `.env` file in the root directory:

```bash
HF_TOKEN=your_huggingface_token
```
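For reference, loading that token at startup can be as simple as the sketch below (a hand-rolled stand-in for the `python-dotenv` package; `load_env_file` is a hypothetical helper, not the repo's code):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: one KEY=VALUE per line, '#' starts a comment."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps any value already exported in the shell
        os.environ.setdefault(key.strip(), value.strip())

# Typical startup usage:
# load_env_file()
# token = os.environ["HF_TOKEN"]
```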
### 4. Running the Experiment

Launch the interface to start the live thought-collection process:

```bash
python app.py
```
## 🔬 Findings & Learning Autopsy

* **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1-second silent `np.zeros` buffer at launch to initialize the engine.
* **VAD Gating**: A Voice Activity Detection threshold of 0.5 prevents the model from hallucinating text during silent periods or background noise.
* **Context > Model Size**: A "Base" model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a "Large" model listening in a vacuum.
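The first two fixes can be sketched together as follows (illustrative helper names, assuming the 16 kHz pipeline described above):

```python
import numpy as np

SAMPLE_RATE = 16_000  # matches the capture rate from Signal Normalization

def warmup_buffer(seconds: float = 1.0) -> np.ndarray:
    """One second of silence; transcribing it once at launch pre-loads the
    engine so the first real words of a session are not dropped."""
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

def passes_vad(speech_probability: float, threshold: float = 0.5) -> bool:
    """Gate decoding on a VAD score so silence never reaches the model."""
    return speech_probability >= threshold

# At startup: model.transcribe(warmup_buffer()) once, discarding the result.
# Per chunk:  only transcribe when passes_vad(score) is True.
```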
> **Note:** This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.