---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: 🎙️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---
# 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text

**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**

VocalSync Intelligence is a learning experiment that explores the bridge between raw audio waves and structured digital text. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and distilling it into detailed guidelines, all within local hardware constraints.
---
## ✨ Features

* **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **🤖 AI Meeting Analysis**: Integrated "Meeting Manager" logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **📹 Universal Video Support**: Ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **💾 Auto-Scribe**: Automatic persistence of every session to the `/outputs` directory with unique timestamps.
* **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
---
## 🏗️ Technical Architecture

To balance semantic clarity with local CPU limitations, the project rests on three technical pillars:
### 1. Signal Normalization

**PyAudio** samples sound at 16 kHz, and the raw 16-bit integer samples are normalized into `float32` values. This is the essential digital handshake between the microphone and the neural network.
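As a rough sketch of that handshake (the repo's actual capture code lives in `src/transcription/`; `normalize_chunk` is a hypothetical helper name):

```python
import numpy as np

def normalize_chunk(raw_bytes: bytes) -> np.ndarray:
    """Convert raw 16-bit PCM bytes into float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    # int16 spans [-32768, 32767]; dividing by 32768 maps it into [-1, 1)
    return samples.astype(np.float32) / 32768.0

# Example: a loud positive peak and a quieter negative sample
chunk = np.array([16384, -8192], dtype=np.int16).tobytes()
print(normalize_chunk(chunk))  # values 0.5 and -0.25
```

Whisper-family models expect exactly this format: mono `float32` audio at 16 kHz.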
### 2. Contextual Anchoring

A **sliding-window** history: by feeding the last 200 characters of the transcript back into the `initial_prompt`, the system curbs phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
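A minimal sketch of the sliding window (function and variable names here are illustrative, not the repo's actual API):

```python
MAX_CONTEXT_CHARS = 200  # size of the window fed back to the decoder

def update_context(history: str, new_text: str) -> tuple[str, str]:
    """Append freshly decoded text and return (full_history, prompt_window).

    The window is later passed as `initial_prompt` so the model decodes the
    next chunk with the running vocabulary already in scope.
    """
    history = (history + " " + new_text).strip()
    return history, history[-MAX_CONTEXT_CHARS:]

history, prompt = update_context("", "Today we plan the AI roadmap")
# next chunk: model.transcribe(next_audio, initial_prompt=prompt)
```

Because only the tail of the transcript is kept, prompt length stays bounded no matter how long the session runs.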
### 3. Inference Pipeline

* **ASR:** `faster-whisper` (Base model) with `int8` quantization for CPU efficiency.
* **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.
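Putting the pieces together, the ASR side can be instantiated roughly like this (a sketch using the `faster-whisper` package's public API; the repo's wrapper code may differ):

```python
from faster_whisper import WhisperModel

# int8 weights keep the Base model small and fast on commodity CPUs
model = WhisperModel("base", device="cpu", compute_type="int8")

def transcribe(path: str, prompt: str | None = None) -> str:
    # `segments` is a lazy generator; joining it drives the actual decoding
    segments, _info = model.transcribe(path, initial_prompt=prompt)
    return " ".join(seg.text.strip() for seg in segments)
```

The `initial_prompt` parameter is where the sliding-window context from pillar 2 plugs in.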
---
## 📁 Project Structure

```plaintext
.
├── app.py               # Main entry point (Gradio UI)
├── src/
│   ├── transcription/   # ASR logic (live, file, and streaming engines)
│   ├── analysis/        # Llama-3.2-3B integration
│   ├── handlers/        # Orchestration between audio and text processing
│   └── io/              # Logic for persistent storage
├── outputs/             # Local storage for transcripts and AI analysis
└── requirements.txt     # Project dependencies
```
## 🚀 Getting Started

### 1. Prerequisites

* **Python 3.10+**
* **FFmpeg** (essential for audio stream handling and URL processing):
  * Windows: `choco install ffmpeg`
  * Mac: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`
### 2. Installation

Clone the repository and set up a local environment:

```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Environment Setup

Create a `.env` file in the root directory:

```bash
HF_TOKEN=your_huggingface_token
```
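For reference, loading that token at startup can be as simple as the sketch below (a hand-rolled stand-in for the `python-dotenv` package; `load_env_file` is a hypothetical helper, not the repo's code):

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: one KEY=VALUE per line, '#' starts a comment."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps any value already exported in the shell
        os.environ.setdefault(key.strip(), value.strip())

# Typical startup usage:
# load_env_file()
# token = os.environ["HF_TOKEN"]
```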
### 4. Running the Experiment

Launch the interface to start the live thought-collection process:

```bash
python app.py
```
## 🔬 Findings & Learning Autopsy

* **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1-second silent `np.zeros` buffer at launch to initialize the engine.
* **VAD Gating**: A Voice Activity Detection threshold of 0.5 prevents the model from hallucinating text during silent periods or background noise.
* **Context > Model Size**: A "Base" model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a "Large" model listening in a vacuum.
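The first two fixes can be sketched together as follows (illustrative helper names, assuming the 16 kHz pipeline described above):

```python
import numpy as np

SAMPLE_RATE = 16_000  # matches the capture rate from Signal Normalization

def warmup_buffer(seconds: float = 1.0) -> np.ndarray:
    """One second of silence; transcribing it once at launch pre-loads the
    engine so the first real words of a session are not dropped."""
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

def passes_vad(speech_probability: float, threshold: float = 0.5) -> bool:
    """Gate decoding on a VAD score so silence never reaches the model."""
    return speech_probability >= threshold

# At startup: model.transcribe(warmup_buffer()) once, discarding the result.
# Per chunk:  only transcribe when passes_vad(score) is True.
```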
> **Note:** This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.