---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# πŸŽ™οΈ VocalSync Intelligence: Deconstructing Speech-to-Text
**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**
VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thoughts. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and distilling it into detailed guidelines, all within local hardware constraints.
---
## ✨ Features
* **🎀 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **πŸ€– AI Meeting Analysis**: Integrated Meeting Manager logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **πŸ“Ή Universal Video Support**: Ability to ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **πŸ”„ Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚑ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **πŸ’Ύ Auto-Scribe**: Automatic persistence of all sessions to the `/outputs` directory with unique timestamps.
* **πŸ”’ Privacy-First**: 100% local processing: no audio data or transcripts ever leave your machine.
---
## πŸ—οΈ Technical Architecture
To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:
### 1. Signal Normalization
**PyAudio** captures sound at 16 kHz, and each 16-bit integer sample is normalized into a `float32` value in `[-1.0, 1.0]`. This is the essential digital handshake between the microphone and the neural network.
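A minimal sketch of that normalization step using NumPy (the function name and example frame are illustrative, not the project's actual code):

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, matching Whisper's expected input rate

def normalize_frame(raw_bytes: bytes) -> np.ndarray:
    """Convert a raw int16 PCM frame (as delivered by PyAudio)
    into float32 samples in the range [-1.0, 1.0]."""
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    return samples.astype(np.float32) / 32768.0

# Example: mid-scale maps to 0.5, the int16 minimum maps to -1.0
frame = np.array([0, 16384, -32768], dtype=np.int16).tobytes()
normalized = normalize_frame(frame)
```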
### 2. Contextual Anchoring
Implementing a **Sliding Window** history. By feeding the last 200 characters of the transcript back into the `initial_prompt`, the system fixes phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
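The sliding-window idea can be sketched in a few lines; the helper name is illustrative, and the commented `model.transcribe` call shows where the prompt would plug into `faster-whisper`:

```python
WINDOW_CHARS = 200  # tail of the transcript fed back as context

def build_initial_prompt(transcript: str, window: int = WINDOW_CHARS) -> str:
    """Return the last `window` characters of the running transcript,
    to be passed as the next chunk's initial_prompt."""
    return transcript[-window:]

# Illustrative usage with faster-whisper:
# segments, _ = model.transcribe(chunk, initial_prompt=build_initial_prompt(history))
```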
### 3. Inference Pipeline
* **ASR:** `faster-whisper` (Base model) using `int8` quantization for CPU efficiency.
* **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.
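A hedged sketch of the ASR engine setup (the config dict and loader function are illustrative; `WhisperModel` is the real `faster-whisper` entry point, loaded lazily so the sketch degrades gracefully if the package is absent):

```python
ASR_CONFIG = {
    "model_size": "base",    # small enough for local CPU inference
    "device": "cpu",
    "compute_type": "int8",  # int8 quantization for speed
}

def load_asr_engine(config: dict = ASR_CONFIG):
    """Construct the Whisper model, or return None if the
    faster-whisper package is not installed."""
    try:
        from faster_whisper import WhisperModel
    except ImportError:
        return None
    return WhisperModel(
        config["model_size"],
        device=config["device"],
        compute_type=config["compute_type"],
    )
```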
---
## πŸ“‚ Project Structure
```plaintext
.
β”œβ”€β”€ app.py              # Main entry point (Gradio UI)
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ transcription/  # ASR Logic (Live, File, and Streaming engines)
β”‚   β”œβ”€β”€ analysis/       # Llama-3.2-3B Integration
β”‚   β”œβ”€β”€ handlers/       # Orchestration between audio and text processing
β”‚   └── io/             # Logic for persistent storage
β”œβ”€β”€ outputs/            # Local storage for transcripts and AI analysis
└── requirements.txt    # Project dependencies
```
---
## πŸš€ Getting Started
### 1. Prerequisites
* **Python 3.10+**
* **FFmpeg**: Essential for audio stream handling and URL processing.
  * Windows: `choco install ffmpeg`
  * Mac: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`
### 2. Installation
Clone the repository and set up a local environment:
```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Environment Setup
Create a `.env` file in the root directory:
```bash
HF_TOKEN=your_huggingface_token
```
### 4. Running the Experiment
Launch the interface to start the live thought-collection process:
```bash
python app.py
```
---
## πŸŽ“ Findings & Learning Autopsy
* **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1s silent `np.zeros` buffer at launch to initialize the engine.
* **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to prevent the model from hallucinating text during silent periods or background noise.
* **Context > Model Size**: Discovered that a "Base" model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a "Large" model listening in a vacuum.

**Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.
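The warm-up and VAD-gating fixes above can be sketched as follows. This is a minimal illustration: the function names are invented here, and `speech_prob` stands in for the probability returned by a real VAD (e.g. Silero):

```python
import numpy as np

SAMPLE_RATE = 16_000
VAD_THRESHOLD = 0.5

def warmup_buffer(seconds: float = 1.0) -> np.ndarray:
    """One second of silence fed through the model at launch,
    absorbing the cold-start lag before real speech arrives."""
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

def should_transcribe(speech_prob: float) -> bool:
    """Gate chunks below the VAD threshold so silence and background
    noise never reach the model and cannot produce hallucinated text."""
    return speech_prob >= VAD_THRESHOLD
```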