---
title: "VocalSync Intelligence: Speech-to-Text"
emoji: πŸŽ™οΈ
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.4.0
python_version: '3.10'
app_file: app.py
pinned: false
---


# πŸŽ™οΈ VocalSync Intelligence: Deconstructing Speech-to-Text

**A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**

VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thought. Rather than treating AI as a black box, the project deconstructs how scattered brainstorming is captured and distilled into detailed guidelines, all within the constraints of local hardware.

---

## ✨ Features

* **🎀 Live Transcription**: Real-time speech-to-text conversion from microphone input.
* **πŸ€– AI Meeting Analysis**: Integrated Meeting Manager logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
* **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
* **πŸ“Ή Universal Video Support**: Ability to ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
* **πŸ”„ Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
* **⚑ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
* **πŸ’Ύ Auto-Scribe**: Automatic persistence of all sessions to the `/outputs` directory with unique timestamps.
* **πŸ”’ Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
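The Auto-Scribe behavior above can be sketched as a small helper. The `outputs/` directory matches the project structure described later in this README, but the function name and filename pattern here are illustrative assumptions:

```python
from datetime import datetime
from pathlib import Path

def save_transcript(text: str, out_dir: str = "outputs") -> Path:
    """Persist a session transcript with a unique timestamp (illustrative sketch)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # e.g. outputs/transcript_20250101_120000.txt
    path = out / f"transcript_{datetime.now():%Y%m%d_%H%M%S}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```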

---

## πŸ—οΈ Technical Architecture

To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:

### 1. Signal Normalization
**PyAudio** samples audio at 16 kHz, and the raw 16-bit integer samples are normalized into `float32` values. This is the essential digital handshake between the microphone and the neural network.
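A minimal sketch of that normalization, assuming NumPy and raw PCM bytes as PyAudio delivers them (the function name is an illustrative assumption):

```python
import numpy as np

def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert raw 16-bit PCM bytes (as delivered by PyAudio) to float32 in [-1.0, 1.0]."""
    samples = np.frombuffer(raw, dtype=np.int16)
    # Divide by 32768 so the int16 range maps onto [-1.0, 1.0)
    return samples.astype(np.float32) / 32768.0
```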



### 2. Contextual Anchoring
The system maintains a **sliding-window** history: by feeding the last 200 characters of the transcript back into the `initial_prompt`, it curbs phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
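The anchoring step can be sketched as a small helper, assuming the running transcript is kept in a string; the 200-character window comes from the text above, and the function name is an illustrative assumption:

```python
def build_initial_prompt(transcript: str, window: int = 200) -> str:
    """Keep only the last `window` characters of the running transcript,
    biasing the next decoding pass toward recent context."""
    return transcript[-window:]
```

Each new audio chunk would then be decoded with `initial_prompt=build_initial_prompt(history)`; `initial_prompt` is the parameter that faster-whisper's `transcribe` exposes for this kind of conditioning.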

### 3. Inference Pipeline
* **ASR:** `faster-whisper` (Base model) using `int8` quantization for CPU efficiency.
* **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" to align scattered thoughts into a streamlined roadmap.

---

## πŸ“‚ Project Structure

```plaintext
.
β”œβ”€β”€ app.py                  # Main entry point (Gradio UI)
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ transcription/      # ASR Logic (Live, File, and Streaming engines)
β”‚   β”œβ”€β”€ analysis/           # Llama-3.2-3B Integration
β”‚   β”œβ”€β”€ handlers/           # Orchestration between audio and text processing
β”‚   └── io/                 # Logic for persistent storage
β”œβ”€β”€ outputs/                # Local storage for transcripts and AI analysis
└── requirements.txt        # Project dependencies
```

---

## πŸš€ Getting Started
### 1. Prerequisites

* **Python 3.10+**
* **FFmpeg** (essential for audio stream handling and URL processing):
  * Windows: `choco install ffmpeg`
  * macOS: `brew install ffmpeg`
  * Linux: `sudo apt install ffmpeg`

### 2. Installation

Clone the repository and set up a local environment:

```bash
git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
cd vocal-sync-speech-to-text
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 3. Environment Setup

Create a `.env` file in the root directory:

```bash
HF_TOKEN=your_huggingface_token
```
### 4. Running the Experiment

Launch the interface to start the live thought-collection process:

```bash
python app.py
```
---

## πŸŽ“ Findings & Learning Autopsy
* **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1-second silent `np.zeros` buffer at launch to initialize the engine.

* **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to prevent the model from hallucinating text during silent periods or background noise.

* **Context > Model Size**: Discovered that a "Base" model with a smart sliding-window prompt can often provide more coherent brainstorming flow than a "Large" model listening in a vacuum.
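The warm-up and gating steps can be sketched together. The 1-second buffer and 0.5 threshold come from the findings above, and the 16 kHz rate from the architecture section; the function names are illustrative assumptions, with a generic gate standing in for the real VAD model:

```python
import numpy as np

SAMPLE_RATE = 16000  # matches the 16 kHz capture rate described above

def make_warmup_buffer(seconds: float = 1.0) -> np.ndarray:
    """One second of silence: run a throwaway inference on this at launch
    so the first real utterance is not clipped by model initialization."""
    return np.zeros(int(SAMPLE_RATE * seconds), dtype=np.float32)

def gate_chunks(chunks, speech_probs, threshold=0.5):
    """Drop audio chunks whose VAD speech probability falls below the
    threshold, so silence and background noise never reach the decoder."""
    return [c for c, p in zip(chunks, speech_probs) if p >= threshold]
```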

> **Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.