prthm11 committed on
Commit
105dda6
·
verified ·
1 Parent(s): a0b8bc0

Update README.md

Files changed (1)
  1. README.md +238 -10
README.md CHANGED
@@ -1,12 +1,240 @@
1
  ---
2
- title: AudioTransDiar
3
- emoji: 📚
4
- colorFrom: pink
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- license: apache-2.0
9
- short_description: Real Time Transcription with Speaker Diarization
10
- ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ #### Initialization & Configuration
2
+
3
+ Configuration constants: `FORMAT`, `CHANNELS`, `RATE`, `CHUNK`, `CHUNK_DURATION_SECS`, `OUTPUT_DIR`, `CHUNKS_DIR`, `FINAL_WAV`, `TRANSCRIPT_FILE`, `MODEL_NAME`.
4
+
5
+ #### Device Listing
6
+
7
+ ###### list_input_devices():
8
+
9
+ - Lists all available audio input devices (microphones, loopbacks, etc.) with their indices and channel counts.
10
+
11
+ 1. **Create a PyAudio Instance**
12
+ - Initialize a new `PyAudio` object to interact with the audio hardware.
13
+ 2. **Print Header**
14
+ - Print a message indicating that available audio input devices will be listed.
15
+ 3. **Iterate Over All Devices**
16
+ - For each device index from `0` to `get_device_count() - 1`:
17
+ - Retrieve device information using `get_device_info_by_index(i)`.
18
+ 4. **Filter Input Devices**
19
+ - For each device, check if `"maxInputChannels"` is greater than `0` (i.e., it can record audio).
20
+ 5. **Print Device Info**
21
+ - If the device is an input device, print its index, name, and number of input channels.
22
+ 6. **Terminate PyAudio**
23
+ - After listing, terminate the `PyAudio` instance to free resources.
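The listing steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's code: here the function receives the PyAudio-like object as a parameter (an assumption, so that it can be exercised without audio hardware) and also returns the devices it prints.

```python
def list_input_devices(audio):
    """Print and return (index, name, input_channels) for input-capable devices.

    `audio` is expected to behave like a pyaudio.PyAudio instance. Passing it
    in, instead of creating it inside the function, is an assumption of this
    sketch.
    """
    print("Available audio input devices:")
    found = []
    for i in range(audio.get_device_count()):
        info = audio.get_device_info_by_index(i)
        channels = int(info.get("maxInputChannels", 0))
        if channels > 0:  # keep only devices that can record
            name = info.get("name", "?")
            print(f"  [{i}] {name} (input channels: {channels})")
            found.append((i, name, channels))
    return found
```

With real hardware, the caller would create `audio = pyaudio.PyAudio()`, call `list_input_devices(audio)`, and then call `audio.terminate()` to release resources.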
24
+
25
+ #### Audio Stream Handling
26
+
27
+ ###### open_stream_for_device(device_index, channels):
28
+
29
+ - Opens a PyAudio input stream for the given device index and channel count.
30
+
31
+ 1. **Input Parameters**
32
+ - `device_index`: The index of the audio input device to use (e.g., microphone or system audio).
33
+ - `channels`: Number of audio channels to record (default is 1, i.e., mono).
34
+ 2. **Open Audio Stream**
35
+ - Use the global `audio` object (an instance of `pyaudio.PyAudio`).
36
+ - Call `audio.open()` with the following parameters:
37
+ - `format=FORMAT` (audio sample format, e.g., 16-bit int)
38
+ - `channels=channels` (number of channels)
39
+ - `rate=RATE` (sample rate, e.g., 44100 Hz)
40
+ - `input=True` (open for input/recording)
41
+ - `frames_per_buffer=CHUNK` (buffer size per read)
42
+ - `input_device_index=device_index` (which device to use)
43
+ 3. **Return Stream**
44
+ - Return the opened stream object to the caller.
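A minimal sketch of the call described above. It assumes the PyAudio instance and the sample format are passed in explicitly rather than read from the globals `audio` and `FORMAT`; the defaults of 44100 Hz and 1024 frames per buffer are placeholder values, not values confirmed by the README.

```python
def open_stream_for_device(audio, sample_format, device_index, channels=1,
                           rate=44100, frames_per_buffer=1024):
    """Open an input stream on `device_index` and return it.

    `audio` should behave like a pyaudio.PyAudio instance; `sample_format`
    would typically be pyaudio.paInt16. The default rate and buffer size are
    assumptions of this sketch.
    """
    return audio.open(
        format=sample_format,
        channels=channels,
        rate=rate,
        input=True,                        # open for recording
        frames_per_buffer=frames_per_buffer,
        input_device_index=device_index,
    )
```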
45
+
46
+ #### Audio File Operations
47
+
48
+ ###### save_wav_from_frames(path: Path, frames: list, nchannels=1):
49
+
50
+ 1. **Input Parameters**
51
+ - `path`: The file path where the WAV file will be saved.
52
+ - `frames`: A list of audio frames (byte strings) to write.
53
+ - `nchannels`: Number of audio channels (default is 1).
54
+ 2. **Open WAV File for Writing**
55
+ - Use the `wave.open()` function to open the file at `path` in write-binary (`'wb'`) mode.
56
+ 3. **Set WAV File Parameters**
57
+ - Set the number of channels using `setnchannels(nchannels)`.
58
+ - Set the sample width using `setsampwidth(audio.get_sample_size(FORMAT))`.
59
+ - Set the frame rate (sample rate) using `setframerate(RATE)`.
60
+ 4. **Write Audio Data**
61
+ - Concatenate all frames in the `frames` list into a single bytes object.
62
+ - Write the concatenated bytes to the WAV file using `writeframes()`.
63
+ 5. **Close the File**
64
+ - The `with` statement ensures the file is properly closed after writing.
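The steps above map closely onto the standard-library `wave` module. In this sketch the sample width and rate are explicit parameters (`sampwidth=2` for 16-bit audio and `rate=44100` are assumed defaults) instead of being derived from `audio.get_sample_size(FORMAT)` and `RATE`.

```python
import wave
from pathlib import Path


def save_wav_from_frames(path: Path, frames: list, nchannels: int = 1,
                         sampwidth: int = 2, rate: int = 44100) -> None:
    """Write a list of raw PCM byte strings to a WAV file.

    `sampwidth` (bytes per sample) and `rate` are assumptions of this sketch;
    the README derives them from PyAudio and the RATE constant.
    """
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(nchannels)
        wf.setsampwidth(sampwidth)
        wf.setframerate(rate)
        wf.writeframes(b"".join(frames))  # concatenate, then write once
```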
65
+
66
+ ###### merge_mono_files_to_stereo(mic_path: Path, sys_path: Path, out_path: Path):
67
+
68
+ - Merges two mono WAV files (mic and system) into a stereo WAV file.
69
+
70
+ 1. **Check for numpy Availability**
71
+ - If numpy is not available, print a message and exit the function.
72
+ 2. **Open Input WAV Files**
73
+ - Open the microphone WAV file (`mic_path`) for reading.
74
+ - Open the system audio WAV file (`sys_path`) for reading.
75
+ 3. **Validate Audio Properties**
76
+ - Assert that both files have the same sample rate (`RATE`).
77
+ 4. **Read Audio Data**
78
+ - Get the sample width from the mic file.
79
+ - Determine the minimum number of frames available in both files.
80
+ - Read that many frames from both files into `mic_bytes` and `sys_bytes`.
81
+ 5. **Convert Bytes to Arrays**
82
+ - Convert `mic_bytes` and `sys_bytes` to numpy arrays of type `int16`.
83
+ 6. **Interleave Channels for Stereo**
84
+ - Create an empty numpy array of size `nframes * 2` (for stereo).
85
+ - Assign mic samples to even indices (left channel).
86
+ - Assign system samples to odd indices (right channel).
87
+ 7. **Write Stereo WAV File**
88
+ - Open the output WAV file (`out_path`) for writing.
89
+ - Set number of channels to 2 (stereo).
90
+ - Set sample width and frame rate.
91
+ - Write the interleaved stereo data to the file.
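The interleaving step is the core of the merge. The README describes it with numpy; the sketch below deliberately swaps in the standard-library `array` module so it runs without extra dependencies, and `interleave_mono_to_stereo` is a hypothetical helper name. It works on the raw 16-bit PCM bytes read from the two files.

```python
from array import array


def interleave_mono_to_stereo(mic_bytes: bytes, sys_bytes: bytes) -> bytes:
    """Interleave two mono int16 PCM buffers into one stereo buffer.

    Mic samples go to even indices (left), system samples to odd indices
    (right). The longer input is truncated to the shorter one. Samples are
    only moved, never interpreted, so byte order is preserved.
    """
    mic = array("h")
    mic.frombytes(mic_bytes)
    system = array("h")
    system.frombytes(sys_bytes)

    nframes = min(len(mic), len(system))
    stereo = array("h", [0]) * (nframes * 2)
    stereo[0::2] = mic[:nframes]     # left channel
    stereo[1::2] = system[:nframes]  # right channel
    return stereo.tobytes()
```

The result can be written with `setnchannels(2)` and the same sample width and frame rate as the inputs.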
92
+
93
+ #### Transcription
94
+
95
+ ###### **Transcriber class**:
96
+
97
+ - Loads the `faster-whisper` model if available.
98
+
99
+ 1. **Initialization (`__init__` method)**
100
+
101
+ - Set `self.model` to `None`.
102
+ - If `faster-whisper` is available:
103
+ - Print a message about loading the model.
104
+ - Try to import `torch` and detect device:
105
+ - If CUDA is available, set `device = "cuda"`.
106
+ - Else, set `device = "cpu"`.
107
+ - Set `compute_type`: `"float16"` if device is `"cuda"`, else `"float32"`.
108
+ - Try to instantiate the `WhisperModel` with the selected model name, device, and compute type:
109
+ - If successful, assign the model to `self.model` and print a success message.
110
+ - If failed, print an error and set `self.model = None`.
111
+ - Else (if `faster-whisper` not available):
112
+ - Print a message that transcription is disabled.
113
+ 2. **Transcription (`transcribe_file` method)**
114
+
115
+ - If `self.model` is not set, return `None`.
116
+ - Try to transcribe the given WAV file using `self.model.transcribe()`:
117
+ - Use `beam_size=5` for decoding.
118
+ - Concatenate all segment texts into a single string.
119
+ - Return the transcribed text.
120
+ - If an error occurs, print an error message and return `None`.
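The device selection and the `transcribe_file` logic can be sketched as plain functions. The helper names `pick_device_and_compute_type` and `join_segment_texts` are hypothetical, and joining segment texts with a single space is an assumption; the README only says the texts are concatenated. `model` is expected to behave like a `faster_whisper.WhisperModel`, whose `transcribe()` returns a `(segments, info)` pair.

```python
from typing import Optional, Tuple


def pick_device_and_compute_type(cuda_available: bool) -> Tuple[str, str]:
    """Return (device, compute_type) following the rule described above."""
    device = "cuda" if cuda_available else "cpu"
    compute_type = "float16" if device == "cuda" else "float32"
    return device, compute_type


def join_segment_texts(segments) -> str:
    """Join the `.text` of each segment; the space separator is an assumption."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe_file(model, wav_path: str) -> Optional[str]:
    """Transcribe one WAV file, returning None if no model or on error."""
    if model is None:
        return None
    try:
        segments, _info = model.transcribe(wav_path, beam_size=5)
        return join_segment_texts(segments)
    except Exception as exc:  # broad on purpose: any failure disables this chunk
        print(f"Transcription failed for {wav_path}: {exc}")
        return None
```

With the real library, the model would be created along the lines of `WhisperModel(MODEL_NAME, device=device, compute_type=compute_type)`.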
121
+
122
+ #### Diarization
123
+
124
+ ###### diarization_hook(audio_path: str):
125
+
126
+ - Runs speaker diarization and returns a list of (start, end, speaker) tuples.
127
+
128
+ 1. **Check Diarization Availability**
129
+ - If `DIARIZATION_AVAILABLE` is `False`, return `None`.
130
+ 2. **Run Diarization Pipeline**
131
+ - Use the global `diarization_pipeline` to process the audio file at `audio_path`.
132
+ - Store the result in a variable (e.g., `diarization`).
133
+ 3. **Extract Speaker Segments**
134
+ - Initialize an empty list called `results`.
135
+ - For each segment in `diarization.itertracks(yield_label=True)`:
136
+ - Extract the segment's start time, end time, and speaker label.
137
+ - Append a tuple `(turn.start, turn.end, speaker)` to `results`.
138
+ 4. **Return Results**
139
+ - Return the `results` list containing tuples of (start, end, speaker) for each detected speaker segment.
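A minimal sketch of the hook described above. It assumes the pipeline is passed as an argument (with `None` standing in for `DIARIZATION_AVAILABLE` being `False`) rather than read from the global `diarization_pipeline`; the pipeline result is expected to expose `itertracks(yield_label=True)` as pyannote.audio annotations do.

```python
def diarization_hook(audio_path: str, pipeline=None):
    """Return a list of (start, end, speaker) tuples, or None if unavailable.

    Passing `pipeline` explicitly is an assumption of this sketch.
    """
    if pipeline is None:
        return None
    diarization = pipeline(audio_path)
    results = []
    # itertracks(yield_label=True) yields (segment, track_id, label) triples
    for turn, _track, speaker in diarization.itertracks(yield_label=True):
        results.append((turn.start, turn.end, speaker))
    return results
```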
140
+
141
+ #### Recording Threads
142
+
143
+ ###### record_loop(device_index, out_queue, label="mic"):
144
+
145
+ - Continuously reads bytes from the device stream and pushes each completed chunk of frames to a queue.
146
+
147
+ 1. **Open Audio Stream**
148
+ - Open a PyAudio input stream for the given device index and channel count.
149
+ 2. **Read Audio Data**
150
+ - Continuously read audio data in chunks.
151
+ - After enough frames for a chunk are collected, put them (with a timestamped filename) into a queue.
152
+ - Runs in a thread for each device (mic and optionally system).
153
+ 3. **Error Handling**
154
+ - If repeated read errors occur, the thread will stop for that device.
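The loop above can be sketched as follows. Several details are assumptions: the already-open stream and a `stop_event` are passed in, the number of reads per chunk is computed as `rate / chunk * chunk_secs`, the filename pattern is illustrative, and `max_errors` is a hypothetical threshold for "repeated read errors".

```python
import queue
import time


def record_loop(stream, out_queue, stop_event, rate=44100, chunk=1024,
                chunk_secs=5, label="mic", max_errors=5):
    """Read from `stream` until stopped, queueing (filename, frames) per chunk."""
    reads_per_chunk = max(1, int(rate / chunk * chunk_secs))
    frames, errors = [], 0
    while not stop_event.is_set():
        try:
            frames.append(stream.read(chunk))
            errors = 0  # a successful read resets the error counter
        except OSError:
            errors += 1
            if errors >= max_errors:
                break  # give up on this device
            continue
        if len(frames) >= reads_per_chunk:
            filename = f"{label}_{time.strftime('%Y%m%d_%H%M%S')}.wav"
            out_queue.put((filename, frames))
            frames = []
```

In this sketch, frames collected after the last full chunk are dropped when the loop stops; whether the project flushes that partial chunk is not stated in the README.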
155
+
156
+ #### Chunk Writing & Transcription
157
+
158
+ ###### chunk_writer_and_transcribe_worker(in_queue, final_frames_list, transcriber, single_channel_label)
159
+
160
+ - Waits for audio chunks from the queue.
161
+ - Saves each chunk as a WAV file.
162
+ - Appends frames to a list for final concatenation.
163
+ - If transcription is enabled, transcribes the chunk and appends the result to a transcript file.
164
+ - Calls diarization on each chunk and aligns speaker segments with transcription.
165
+ - Runs in a thread for each device.
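The README does not say how speaker segments are aligned with the transcription. One common approach, shown here purely as an illustration, is to give each transcript segment the speaker whose turn overlaps it the most. The function name `assign_speakers` and the `"UNKNOWN"` fallback are hypothetical, and the sketch assumes per-segment timestamps are available for the transcript.

```python
def assign_speakers(transcript_segments, speaker_turns):
    """Label transcript segments by maximum time overlap with speaker turns.

    transcript_segments: list of (start, end, text)
    speaker_turns:       list of (start, end, speaker), e.g. from diarization
    Returns a list of (speaker, text); "UNKNOWN" when nothing overlaps.
    """
    labeled = []
    for t_start, t_end, text in transcript_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for s_start, s_end, speaker in speaker_turns:
            overlap = min(t_end, s_end) - max(t_start, s_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled
```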
166
+
167
+ #### Main Recording Orchestration
168
+
169
+ ###### run_recording(mic_index, sys_index=None, chunk_secs=CHUNK_DURATION_SECS, model_name=MODEL_NAME, no_transcribe=False)
170
+
171
+ - Sets up and starts the recording and writer threads for mic and (optionally) system audio.
172
+ - Handles stopping and joining threads on KeyboardInterrupt.
173
+ - Saves the final concatenated WAV file(s).
174
+ - If both mic and system were recorded, merges them into a stereo WAV.
175
+ - Terminates PyAudio and prints completion message.
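The thread orchestration can be illustrated with a generic producer/consumer pair. This is a sketch of the pattern only: `run_pipeline` is a hypothetical name, and the `None` sentinel and `stop_event` are assumptions about how the threads are told to finish.

```python
import queue
import threading


def run_pipeline(producer_fn, consumer_fn):
    """Run one producer and one consumer thread until both finish.

    producer_fn(out_queue, stop_event) pushes items; consumer_fn(item)
    handles each one. A None sentinel tells the consumer to exit.
    """
    chunk_queue = queue.Queue()
    stop_event = threading.Event()

    def produce():
        try:
            producer_fn(chunk_queue, stop_event)
        finally:
            chunk_queue.put(None)  # always unblock the consumer

    def consume():
        while True:
            item = chunk_queue.get()
            if item is None:
                break
            consumer_fn(item)

    threads = [threading.Thread(target=produce, daemon=True),
               threading.Thread(target=consume, daemon=True)]
    for t in threads:
        t.start()
    try:
        # Join with a timeout so Ctrl+C is noticed promptly.
        while any(t.is_alive() for t in threads):
            for t in threads:
                t.join(timeout=0.1)
    except KeyboardInterrupt:
        stop_event.set()
        for t in threads:
            t.join()
```

In the project, the producer role corresponds to `record_loop` and the consumer role to `chunk_writer_and_transcribe_worker`, with one pair per device.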
176
+
177
+ #### CLI Wrapper (cli.py)
178
+
179
+ ###### Algorithm of cli.py
180
+
181
+ - Provides a command-line interface for recording, chunking, and optional transcription.
182
+
183
+ 1. **Argument Parsing**
184
+
185
+ - Uses `argparse` to define and parse command-line arguments:
186
+ - `--mic` / `-m`: Device index for microphone (optional)
187
+ - `--sys` / `-s`: Device index for system/loopback (optional)
188
+ - `--chunk-secs`: Chunk length in seconds (default from config)
189
+ - `--model`: Model name for transcription (default from config)
190
+ - `--no-transcribe`: Disable transcription if set
191
+ 2. **Device Selection**
192
+
193
+ - If `--mic` is not provided:
194
+ - Calls `list_input_devices()` to show available devices.
195
+ - Prompts the user to enter a mic device index (or uses default if blank).
196
+ - If `--sys` is not provided:
197
+ - Prompts the user whether to record system audio (loopback).
198
+ - If yes, prompts for system device index.
199
+ 3. **Run Recording**
200
+
201
+ - Calls `run_recording()` with the selected device indexes, chunk length, model, and transcription flag.
202
+ 4. **Entrypoint**
203
+
204
+ - If the script is run directly, calls `main()` to start the CLI workflow.
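The argument parsing step can be sketched with `argparse` as below. The default values for `CHUNK_DURATION_SECS` and `MODEL_NAME` are placeholders (the README does not state them), and treating `--chunk-secs` as an integer is an assumption.

```python
import argparse

# Placeholder defaults; the real values live in the project's configuration.
CHUNK_DURATION_SECS = 5
MODEL_NAME = "small"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the options listed above."""
    parser = argparse.ArgumentParser(
        description="Record audio in chunks with optional transcription.")
    parser.add_argument("--mic", "-m", type=int, default=None,
                        help="Device index for microphone")
    parser.add_argument("--sys", "-s", type=int, default=None,
                        help="Device index for system/loopback audio")
    parser.add_argument("--chunk-secs", type=int, default=CHUNK_DURATION_SECS,
                        help="Chunk length in seconds")
    parser.add_argument("--model", default=MODEL_NAME,
                        help="Model name for transcription")
    parser.add_argument("--no-transcribe", action="store_true",
                        help="Disable transcription")
    return parser
```

Parsing `["--mic", "6", "--sys", "8", "--chunk-secs", "5", "--model", "medium"]` yields `mic=6`, `sys=8`, `chunk_secs=5`, `model="medium"`, and `no_transcribe=False`; when `--mic` is omitted, `args.mic` is `None`, which is the cue to prompt interactively.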
205
+
206
  ---
207
 
208
+ #### Usage
209
+
210
+ **Command-line (interactive):**
211
+
212
+ ```sh
213
+ python cli.py
214
+ ```
215
+
216
+ You will be prompted to select device indexes for mic and (optionally) system audio.
217
+
218
+ **Command-line (with arguments):**
219
+
220
+ ```sh
221
+ python cli.py --mic 6 --sys 8 --chunk-secs 5 --model medium
222
+ ```
223
+
224
+ #### Requirements
225
+
226
+ - Python 3.8+
227
+ - pyaudio
228
+ - numpy
229
+ - faster-whisper (optional, for transcription)
230
+ - pyannote.audio (optional, for diarization)
231
+
232
+ Install requirements:
233
+
234
+ ```sh
235
+ pip install pyaudio numpy
236
+ # For transcription:
237
+ pip install faster-whisper
238
+ # For diarization:
239
+ pip install pyannote.audio
240
+ ```