Fnu Mahnoor committed on
Commit 4f54a59 · 1 Parent(s): bf2d622

update readme

Files changed (1)
  1. README.md +63 -215
README.md CHANGED
@@ -1,245 +1,93 @@
- # Voice Summarizer - Open-Source Speech-to-Text Transcriber
-
- A comprehensive, open-source speech-to-text transcription application with AI-powered meeting analysis. Uses Faster Whisper for local transcription and local LLMs for intelligent analysis - no external APIs required after initial setup.
-
- ## Features
-
- - **🎤 Live Transcription**: Real-time speech-to-text from microphone input
- - **🌐 Web Interface**: Modern Gradio-based UI with multiple transcription modes
- - **📹 Video URL Support**: Transcribe audio from YouTube, Vimeo, Teams recordings, and 1000+ other platforms
- - **🤖 AI Meeting Analysis**: Local LLM analysis for meeting notes, action items, and key insights
- - **💾 Auto-Saving**: Automatic saving of transcripts and analyses with timestamps
- - **🔄 Multiple Modes**: Real-time streaming, after-speech accumulation, file upload, and video URL processing
- - **⚡ Optimized Performance**: Uses Faster Whisper for fast, accurate transcription
- - **🔒 Privacy-First**: All processing happens locally, no data sent to external servers
-
- ## Prerequisites
-
- - **Python 3.8+** (3.12 recommended)
- - **FFmpeg** (required for video URL processing)
- - **Git** (for cloning the repository)
- - **Conda/Miniconda** (recommended for environment management)
-
- ## Installation
-
- ### 1. Clone the Repository
-
- ```bash
- git clone https://github.com/yourusername/voice-summarizer.git
- cd voice-summarizer
- ```
-
- ### 2. Set Up Python Environment
-
- #### Using Conda (Recommended)
-
- ```bash
- # Create a new conda environment
- conda create -n voice-summarizer python=3.12
- conda activate voice-summarizer
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- #### Using venv (Alternative)
-
- ```bash
- # Create virtual environment
- python -m venv venv
- source venv/bin/activate  # On Windows: venv\Scripts\activate
-
- # Install dependencies
- pip install -r requirements.txt
- ```
-
- ### 3. Install FFmpeg
-
- FFmpeg is required for processing video URLs. Choose one of the following methods:
-
- #### Windows (Chocolatey)
- ```bash
- choco install ffmpeg
- ```
-
- #### Windows (Conda)
- ```bash
- conda install ffmpeg -c conda-forge
- ```
-
- #### Windows (Manual)
- 1. Download from https://ffmpeg.org/download.html
- 2. Extract to a folder (e.g., `C:\ffmpeg`)
- 3. Add `C:\ffmpeg\bin` to your system PATH
-
- #### Linux
- ```bash
- sudo apt install ffmpeg  # Ubuntu/Debian
- # or
- sudo dnf install ffmpeg  # Fedora
- ```
-
- #### macOS
- ```bash
- brew install ffmpeg
- ```
-
- ### 4. Configure Hugging Face Token
-
- Create a `.env` file in the project root:
-
- ```bash
- # Create .env file
- echo "HF_TOKEN=your_hugging_face_token_here" > .env
- ```
-
- Get your token from: https://huggingface.co/settings/tokens
-
- **Note**: The token is required for downloading models. Without it, you'll get authentication errors.
-
 
- ## Usage
-
- ### Web Application (Recommended)
-
- Launch the interactive web interface:
-
- ```bash
- python app.py
- ```
-
- This opens a Gradio web app with three main tabs:
-
- #### 1. Live Recording Tab
- - **Real-time Mode**: Start speaking immediately - transcription appears as you speak
- - **After Speech Mode**: Speak first, then click "Transcribe Accumulated" to process
- - **Analysis**: Click "Analyze Transcription" for AI-powered meeting insights
-
- #### 2. File Upload Tab
- - Upload audio/video files (WAV, MP3, M4A, MP4, etc.)
- - Automatic transcription and optional AI analysis
-
- #### 3. Video URL Tab
- - Paste URLs from YouTube, Vimeo, Teams recordings, etc.
- - Supports Microsoft Stream, OneDrive, SharePoint (for Teams meetings)
- - Automatic audio extraction and transcription
-
- ### Command-Line Interface
-
- #### Live Transcription
- ```bash
- python cli.py live
- ```
-
- #### File Transcription
- ```bash
- python cli.py transcribe path/to/audio.wav --model base --analyze
- ```
-
- #### Available Models
- - `tiny` (fastest, least accurate)
- - `base` (good balance)
- - `small` (better accuracy)
- - `medium` (high accuracy)
- - `large` (best accuracy, slowest)
-
- ## Outputs
-
- All results are automatically saved to the `outputs/` directory with timestamps:
-
- ```
- outputs/
- ├── 2026-01-18_14-30-00_transcript.txt
- ├── 2026-01-18_14-30-00_analysis.txt
- ├── 2026-01-18_14-45-15_transcript.txt
- └── 2026-01-18_14-45-15_analysis.txt
- ```
-
- ## Supported Formats
-
- ### Audio Files
- - WAV, MP3, M4A, FLAC, OGG, AAC
- - Any format supported by librosa/soundfile
-
- ### Video URLs
- - YouTube, Vimeo, Dailymotion
- - Microsoft Stream/OneDrive/SharePoint (Teams recordings)
- - TikTok, Instagram, Twitter
- - 1000+ platforms supported by yt-dlp
-
- ## Troubleshooting
-
- ### Common Issues
-
- #### "FFmpeg not found" Error
- - Ensure FFmpeg is installed and in your PATH
- - Test with: `ffmpeg -version`
-
- #### "Authentication failed" for Hugging Face
- - Check your `.env` file has a valid `HF_TOKEN`
- - Regenerate token if needed
-
- #### Video URL Not Working
- - Some private/protected videos require authentication
- - Try downloading manually and use the File Upload tab
- - Check yt-dlp logs for specific errors
-
- #### LLM Analysis Not Working
- - Ensure you have a Hugging Face token
- - Check internet connection for model downloads
- - First run may take time to download models
-
- #### Microphone Not Detected
- - Check browser permissions for microphone access
- - Try refreshing the page
- - Ensure no other applications are using the microphone
-
- ### Performance Tips
-
- - Use smaller Whisper models (`tiny`, `base`) for faster processing
- - Close other applications to free up CPU/GPU resources
- - For GPU acceleration, ensure CUDA is available
-
- ## Project Structure
-
- ```
- voice-summarizer/
- ├── app.py             # Main Gradio web application
- ├── cli.py             # Command-line interface
- ├── requirements.txt   # Python dependencies
- ├── .env               # Environment variables (create this)
- ├── outputs/           # Auto-saved transcripts and analyses
- └── src/
-     ├── transcription/ # Transcription modules
-     │   ├── streaming_transcriber.py
-     │   └── file_transcriber.py
-     ├── analysis/      # LLM analysis modules
-     │   └── llm.py
-     ├── handlers/      # Request handlers
-     │   ├── transcription_handler.py
-     │   └── analysis_handler.py
-     └── io/            # Input/output utilities
-         └── saver.py
- ```
-
- ## Contributing
-
- 1. Fork the repository
- 2. Create a feature branch
- 3. Make your changes
- 4. Test thoroughly
- 5. Submit a pull request
-
- ## License
-
- This project uses open-source libraries:
- - Faster Whisper: MIT License
- - Transformers: Apache 2.0
- - Gradio: Apache 2.0
- - yt-dlp: Unlicense
-
- ## Acknowledgments
-
- - OpenAI Whisper for the base transcription model
- - Faster Whisper for optimized implementation
- - Hugging Face for model hosting and API
- - yt-dlp for video downloading capabilities
 
+ # 🎙️ VocalSync Intelligence: Deconstructing Speech-to-Text
+
+ **A curiosity-driven experiment in deconstructing the ASR-to-LLM pipeline.**
+
+ VocalSync Intelligence is a learning experiment designed to explore the bridge between raw audio waves and structured digital thoughts. Instead of treating AI as a "black box," this project deconstructs the process of capturing scattered brainstorming and streamlining it into detailed guidelines within local hardware constraints.
+
+ ---
+
+ ## ✨ Features
+
+ * **🎤 Live Transcription**: Real-time speech-to-text conversion from microphone input.
+ * **🤖 AI Meeting Analysis**: Integrated "Meeting Manager" logic using Llama-3.2-3B to generate action items and key insights from raw transcripts.
+ * **🌐 Web Interface**: A modern Gradio-based UI designed for seamless interaction with the ASR engine.
+ * **📹 Universal Video Support**: Ingest and transcribe audio from YouTube, Vimeo, Teams, and 1000+ other platforms via URL.
+ * **🔄 Hybrid Modes**: Support for real-time streaming, after-speech accumulation, and direct file uploads.
+ * **⚡ Optimized Engine**: Leverages Faster Whisper with `int8` quantization for high-speed local CPU inference.
+ * **💾 Auto-Scribe**: Automatic persistence of all sessions to the `outputs/` directory with unique timestamps.
+ * **🔒 Privacy-First**: 100% local processing; no audio data or transcripts ever leave your machine.
+
+ ---
+
 
+ ## 🏗️ Technical Architecture
+
+ To balance semantic clarity with local CPU limitations, the project focuses on three technical pillars:
+
+ ### 1. Signal Normalization
+ **PyAudio** samples audio at 16 kHz, and each raw 16-bit integer sample is normalized into a `float32` value. This is the essential digital handshake between the microphone and the neural network.
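The normalization step above can be sketched as follows (a minimal NumPy illustration; the project's actual buffer handling may differ):

```python
import numpy as np

def normalize_chunk(raw_bytes: bytes) -> np.ndarray:
    """Convert a raw 16-bit PCM chunk into float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    # int16 spans -32768..32767; dividing by 32768 maps it into [-1.0, 1.0)
    return samples.astype(np.float32) / 32768.0

# Example: a chunk containing the loudest and quietest possible samples
chunk = np.array([32767, 0, -32768], dtype=np.int16).tobytes()
print(normalize_chunk(chunk))  # values close to [1.0, 0.0, -1.0]
```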
+ ### 2. Contextual Anchoring
+ The transcriber keeps a **sliding window** of history: the last 200 characters of the transcript are fed back into the `initial_prompt` of the next pass, which curbs phonetic hallucinations (e.g., ensuring "AI" isn't misheard as "Ali").
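That feedback loop can be sketched roughly like this (the names here are illustrative; `transcribe_chunk` stands in for the project's actual ASR call):

```python
WINDOW = 200  # characters of history carried into the next pass

def run_session(chunks, transcribe_chunk):
    """Transcribe chunks in order, anchoring each pass on recent context."""
    transcript = ""
    for chunk in chunks:
        # Feed the tail of the running transcript back in as context
        context = transcript[-WINDOW:]
        transcript += transcribe_chunk(chunk, initial_prompt=context)
    return transcript

# Toy stand-in: "transcription" simply echoes the chunk text
result = run_session(["hello ", "world"], lambda c, initial_prompt: c)
print(result)  # hello world
```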
 
 
 
+ ### 3. Inference Pipeline
+ * **ASR:** `faster-whisper` (`base` model) with `int8` quantization for CPU efficiency.
+ * **LLM:** `Llama-3.2-3B-Instruct` acting as a "Meeting Manager" that aligns scattered thoughts into a streamlined roadmap.
 
 
 
 
 
 
+ ---
+
+ ## 📂 Project Structure
+
+ ```plaintext
+ .
+ ├── app.py             # Main entry point (Gradio UI)
+ ├── src/
+ │   ├── transcription/ # ASR logic (live, file, and streaming engines)
+ │   ├── analysis/      # Llama-3.2-3B integration
+ │   ├── handlers/      # Orchestration between audio and text processing
+ │   └── io/            # Logic for persistent storage
+ ├── outputs/           # Local storage for transcripts and AI analysis
+ └── requirements.txt   # Project dependencies
+ ```
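The timestamped persistence handled under `src/io/` can be sketched like this (an illustrative `save_transcript` helper mirroring the `outputs/` naming scheme, not the project's actual code):

```python
from datetime import datetime
from pathlib import Path

def save_transcript(text: str, outdir: str = "outputs") -> Path:
    """Persist a session transcript with a unique timestamp,
    e.g. outputs/2026-01-18_14-30-00_transcript.txt."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = out / f"{stamp}_transcript.txt"
    path.write_text(text, encoding="utf-8")
    return path
```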
+ ## 🚀 Getting Started
+
+ ### 1. Prerequisites
+
+ * **Python 3.10+**
+ * **FFmpeg**: essential for audio stream handling and URL processing.
+   * Windows: `choco install ffmpeg`
+   * Mac: `brew install ffmpeg`
+   * Linux: `sudo apt install ffmpeg`
 
+ ### 2. Installation
+
+ Clone the repository and set up a local environment:
+
+ ```bash
+ git clone https://github.com/mahnoor-khalid9/vocal-sync-speech-to-text.git
+ cd vocal-sync-speech-to-text
+ python -m venv venv
+ source venv/bin/activate  # Windows: venv\Scripts\activate
+ pip install -r requirements.txt
+ ```
+ ### 3. Environment Setup
+
+ Create a `.env` file in the root directory:
+
+ ```bash
+ HF_TOKEN=your_huggingface_token
+ ```
+ ### 4. Running the Experiment
+
+ Launch the interface to start the live thought-collection process:
+
+ ```bash
+ python app.py
+ ```
+ ## 🎓 Findings & Learning Autopsy
+
+ * **The Warm-up Pulse**: Solved the "cold start" lag, where the model would miss the first few words, by injecting a 1s silent `np.zeros` buffer at launch to initialize the engine.
+ * **VAD Gating**: Implemented a Voice Activity Detection threshold of 0.5 to prevent the model from hallucinating text during silent periods or background noise.
+ * **Context > Model Size**: Discovered that a `base` model with a smart sliding-window prompt can often produce a more coherent brainstorming flow than a `large` model listening in a vacuum.
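The first two findings can be sketched together (`transcribe` and `speech_prob` are stand-ins for the project's actual engine and VAD, not its real API):

```python
import numpy as np

SAMPLE_RATE = 16000  # 16 kHz mono, as used throughout the project

def warm_up(transcribe):
    """Push 1s of silence through the engine at launch so the first
    real words of a session aren't dropped by cold-start lag."""
    transcribe(np.zeros(SAMPLE_RATE, dtype=np.float32))

def gated_transcribe(chunk, transcribe, speech_prob, threshold=0.5):
    """Only invoke ASR when voice activity is likely; returning an empty
    string for silence prevents hallucinated text."""
    if speech_prob(chunk) < threshold:
        return ""
    return transcribe(chunk)
```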
 
 
 
+ **Note**: This project is a learning exercise in seeing how data architecture, from signal normalization to metadata syncing, directly influences AI behavior.