---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text
**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**
VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thoughts. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and streamlining it into detailed guidelines, all within local hardware constraints.
---
## ✨ Features
* **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **🤖 AI Meeting Analysis**: Integrated Meeting Manager logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **📹 Universal Video Support**: Ability to ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **💾 Auto-Scribe**: Automatic persistence of all sessions to the `/outputs` directory with unique timestamps (a persistence sketch follows this list).
* **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
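The Auto-Scribe behavior boils down to a timestamped write into `outputs/`. Here is a minimal sketch of that persistence step; the function name and file naming scheme are illustrative, not taken from the repo:

```python
from datetime import datetime
from pathlib import Path

def save_transcript(text: str, out_dir: str = "outputs") -> Path:
    """Persist a session transcript under a unique timestamped filename."""
    Path(out_dir).mkdir(exist_ok=True)
    path = Path(out_dir) / f"transcript_{datetime.now():%Y%m%d_%H%M%S}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```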
---
## 🏗️ Technical Architecture
To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:
### 1. Signal Normalization
**PyAudio** samples sound at 16 kHz, and the raw 16-bit integers are normalized into `float32` values. This is the essential digital handshake between the microphone and the neural network.
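A minimal sketch of that handshake, assuming blocking reads from a PyAudio stream (`CHUNK` is an illustrative buffer size):

```python
import numpy as np
import pyaudio

RATE = 16_000   # 16 kHz sample rate expected by Whisper
CHUNK = 1024    # frames per buffer (illustrative value)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)

raw = stream.read(CHUNK)                      # bytes of 16-bit PCM
samples = np.frombuffer(raw, dtype=np.int16)  # ints in [-32768, 32767]
audio = samples.astype(np.float32) / 32768.0  # floats in [-1.0, 1.0)
```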
### 2. Contextual Anchoring
The system maintains a **sliding window** of history: by feeding the last 200 characters of the transcript back into the `initial_prompt`, it fixes phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
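A minimal sketch of the anchoring loop, assuming chunk-by-chunk transcription with `faster-whisper` (the `model` object and the chunking strategy here are illustrative):

```python
# Illustrative sketch: `model` is a faster_whisper.WhisperModel instance.
history = ""

def transcribe_chunk(audio_chunk):
    """Transcribe one chunk, anchoring the decoder on recent context."""
    global history
    segments, _ = model.transcribe(
        audio_chunk,
        initial_prompt=history[-200:] or None,  # last 200 chars of context
    )
    text = "".join(segment.text for segment in segments)
    history += text
    return text
```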
### 3. Inference Pipeline
* **ASR:** `faster-whisper` (Base model) using `int8` quantization for CPU efficiency (see the initialization sketch after this list).
* **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.
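A sketch of how the ASR half of the pipeline can be initialized with `faster-whisper` (the model size and compute type mirror the description above; the input file name is illustrative):

```python
from faster_whisper import WhisperModel

# Base model with int8 quantization keeps inference practical on CPU
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("meeting.wav")
transcript = " ".join(segment.text.strip() for segment in segments)
print(info.language, transcript)
```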
---
## 📁 Project Structure
```plaintext
.
├── app.py               # Main entry point (Gradio UI)
├── src/
│   ├── transcription/   # ASR Logic (Live, File, and Streaming engines)
│   ├── analysis/        # Llama-3.2-3B Integration
│   ├── handlers/        # Orchestration between audio and text processing
│   └── io/              # Logic for persistent storage
├── outputs/             # Local storage for transcripts and AI analysis
└── requirements.txt     # Project dependencies
```
## 🚀 Getting Started
### 1. Prerequisites
* **Python 3.10+**
* **FFmpeg**: Essential for audio stream handling and URL processing.
  * Windows: `choco install ffmpeg`
  * Mac: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`
### 2. Installation
Clone the repository and set up a local environment:
```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Environment Setup
Create a `.env` file in the root directory:
```bash
HF_TOKEN=your_huggingface_token
```
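One way the token might then be read at startup, assuming the `python-dotenv` package (how `app.py` actually loads it may differ):

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the process environment
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise RuntimeError("HF_TOKEN is missing; add it to your .env file")
```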
### 4. Running the Experiment
Launch the interface to start the live thought-collection process:
```bash
python app.py
```
## 🔍 Findings & Learning Autopsy
* **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1s silent `np.zeros` buffer at launch to initialize the engine (see the sketch after this list).
* **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to stop the model from hallucinating text during silent periods or background noise.
* **Context > Model Size**: Discovered that a "Base" model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a "Large" model listening in a vacuum.
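A sketch combining the first two findings, reusing the `model` object from the architecture section (the warm-up call and parameter values mirror the notes above; `audio` is an illustrative input array):

```python
import numpy as np

# Warm-up pulse: transcribe 1s of silence at launch so the first real
# utterance is not clipped by lazy engine initialization.
warmup = np.zeros(16_000, dtype=np.float32)  # 1 second of silence at 16 kHz
segments, _ = model.transcribe(warmup)
list(segments)  # exhaust the generator to force the actual inference

# VAD gating: drop non-speech audio before decoding to avoid hallucinated
# text during silence; 0.5 is the threshold noted above.
segments, _ = model.transcribe(
    audio,
    vad_filter=True,
    vad_parameters={"threshold": 0.5},
)
```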
**Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.