grider-transwithai committed 3e96984 · verified · 1 parent: ea2f6a4

Update README.md

---
language:
- ja
- multilingual
tags:
- voice-activity-detection
- vad
- whisper
- onnx
- speech-detection
- audio-classification
- asmr
- japanese
- whispered-speech
license: mit
base_model: openai/whisper-base
library_name: transformers
pipeline_tag: audio-classification
---

# Whisper-base Voice Activity Detection (VAD) for Japanese ASMR - ONNX

## Model Description

This is a refined Whisper-based Voice Activity Detection (VAD) model that leverages the pre-trained Whisper encoder with a lightweight non-autoregressive decoder for high-precision speech activity detection. While fine-tuned on Japanese ASMR content for optimal performance on soft speech and whispers, the model retains Whisper's robust multilingual foundation, enabling effective speech detection across diverse languages and acoustic conditions. It has been optimized and exported to ONNX format for efficient inference across different platforms.

This work builds upon recent research demonstrating the positive transfer of Whisper's speech representations to VAD tasks, as shown in [WhisperSeg](https://github.com/nianlonggu/WhisperSeg) and related work.

### Key Features

- **Architecture**: Encoder-decoder model based on whisper-base
- **Frame Resolution**: 20ms per frame for precise temporal detection
- **Input Duration**: Processes 30-second audio chunks
- **Output**: Frame-level speech/non-speech predictions
- **Optimized**: ONNX format for cross-platform deployment
- **Real-time capable**: Fast non-autoregressive inference

### Model Architecture Details

- **Base Model**: OpenAI whisper-base encoder (frozen during training)
- **Decoder**: 2-layer transformer decoder with 8 attention heads
- **Processing**:
  - Input: 30-second audio chunks (480,000 samples @ 16kHz)
  - Features: 80-channel log-mel spectrogram
  - Output: 1500 frame predictions (20ms per frame)

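The frame geometry above implies a fixed mapping between frame indices and timestamps: 1500 frames cover one 30-second chunk, so each frame spans 20ms. A minimal sketch of the conversion (the constant and function names here are illustrative, not part of the repo):

```python
# 1500 frames span one 30-second chunk, so each frame covers 20 ms.
FRAMES_PER_CHUNK = 1500
CHUNK_SECONDS = 30.0
FRAME_SECONDS = CHUNK_SECONDS / FRAMES_PER_CHUNK  # 0.02 s = 20 ms

def frame_to_time(frame_index: int) -> float:
    """Start time (seconds) of a frame within its chunk."""
    return frame_index * FRAME_SECONDS

def time_to_frame(t: float) -> int:
    """Index of the frame containing time t (seconds)."""
    return int(t / FRAME_SECONDS)
```
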
## Performance

- **Frame Duration**: 20ms per frame for precise temporal detection
- **Processing Speed**: ~100x real-time on CPU (single-sample processing)
- **Batch Processing**: Currently limited to batch size of 1 due to ONNX export constraints, but single-sample inference is extremely fast
- **Specialized Training**: Japanese ASMR and whispered speech
- **Generalization**: Despite being fine-tuned on Japanese ASMR, the model inherits Whisper's strong multilingual capabilities and can effectively detect speech in various languages and acoustic environments

### Advantages over Native Whisper VAD

- **No hallucinations**: A discriminative model cannot generate spurious text
- **Much faster**: Single forward pass, non-autoregressive inference
- **Higher precision**: 20ms frame-level temporal resolution vs. Whisper's 30s chunks
- **Robust**: Focal loss training handles speech/silence imbalance effectively
- **Lightweight**: The decoder adds minimal parameters to the base Whisper encoder

## Usage

### Quick Start with ONNX Runtime

```python
import numpy as np
import onnxruntime as ort
from transformers import WhisperFeatureExtractor
import librosa

# Load model
session = ort.InferenceSession("model.onnx")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

# Load and preprocess audio
audio, sr = librosa.load("audio.wav", sr=16000)
audio_chunk = audio[:480000]  # 30 seconds

# Extract features (the extractor pads shorter audio to 30 s by default)
inputs = feature_extractor(
    audio_chunk,
    sampling_rate=16000,
    return_tensors="np"
)

# Run inference
outputs = session.run(None, {session.get_inputs()[0].name: inputs.input_features})
predictions = outputs[0]  # Shape: [1, 1500] - 1500 frames of 20ms each

# Apply threshold
speech_frames = predictions[0] > 0.5
```

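The quick start ends with a boolean per-frame mask. One way to turn that mask into timed segments, similar to what the bundled script does (this helper and its name are illustrative, not part of the repo):

```python
def frames_to_segments(speech_frames, frame_seconds=0.02):
    """Collapse a boolean per-frame speech mask into (start, end) tuples in seconds."""
    segments = []
    start = None
    for i, active in enumerate(speech_frames):
        if active and start is None:
            start = i  # a speech run begins
        elif not active and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:  # mask ended while speech was still active
        segments.append((start * frame_seconds, len(speech_frames) * frame_seconds))
    return segments
```
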
### Using the Provided Inference Script

The model repository includes a comprehensive `inference.py` script with advanced features:

```python
from inference import WhisperVADInference

# Initialize model
vad = WhisperVADInference(
    model_path="model.onnx",
    threshold=0.5,               # Speech detection threshold
    min_speech_duration=0.25,    # Minimum speech segment duration
    min_silence_duration=0.1     # Minimum silence between segments
)

# Process audio file
segments = vad.process_audio("audio.wav")

# Segments format: List of (start_time, end_time) tuples
for start, end in segments:
    print(f"Speech detected: {start:.2f}s - {end:.2f}s")
```

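The `min_speech_duration` and `min_silence_duration` parameters correspond to standard VAD post-processing: close short silence gaps, then discard segments that are too short. A minimal sketch of that logic (the helper name and exact rules are illustrative, not the script's actual implementation):

```python
def merge_segments(segments, min_speech=0.25, min_silence=0.1):
    """Merge segments separated by short silences, then drop too-short segments.

    `segments` is a list of (start, end) tuples in seconds, sorted by start time.
    """
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < min_silence:
            merged[-1] = (merged[-1][0], end)  # gap too short: fuse with previous segment
        else:
            merged.append((start, end))
    return [(s, e) for (s, e) in merged if e - s >= min_speech]
```
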
### Streaming/Real-time Processing

```python
# Process audio stream in chunks
vad = WhisperVADInference("model.onnx", streaming=True)

for audio_chunk in audio_stream:
    speech_active = vad.process_chunk(audio_chunk)
    if speech_active:
        # Handle speech detection
        pass
```

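Because the ONNX graph expects fixed 30-second windows, a streaming caller has to accumulate incoming samples into full chunks before inference. A minimal buffering sketch (the `ChunkBuffer` class is an illustrative assumption, not part of the repo):

```python
import numpy as np

class ChunkBuffer:
    """Accumulate streamed samples and emit fixed 30 s windows (480,000 samples @ 16 kHz)."""

    def __init__(self, chunk_samples=480_000):
        self.chunk_samples = chunk_samples
        self._buffer = np.zeros(0, dtype=np.float32)

    def push(self, samples):
        """Append new samples; return a list of complete chunks ready for inference."""
        self._buffer = np.concatenate([self._buffer, np.asarray(samples, dtype=np.float32)])
        chunks = []
        while len(self._buffer) >= self.chunk_samples:
            chunks.append(self._buffer[:self.chunk_samples])
            self._buffer = self._buffer[self.chunk_samples:]
        return chunks

    def flush(self):
        """Zero-pad and return the remaining partial chunk, or None if the buffer is empty."""
        if len(self._buffer) == 0:
            return None
        out = np.pad(self._buffer, (0, self.chunk_samples - len(self._buffer)))
        self._buffer = np.zeros(0, dtype=np.float32)
        return out
```
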
## Input/Output Specifications

### Input
- **Audio Format**: 16kHz mono audio
- **Chunk Size**: 30 seconds (480,000 samples)
- **Feature Type**: 80-channel log-mel spectrogram
- **Shape**: `[1, 80, 3000]` (batch size fixed to 1 - see note below)

### Output
- **Type**: Frame-level probabilities
- **Shape**: `[1, 1500]` (batch size fixed to 1)
- **Frame Duration**: 20ms per frame
- **Range**: [0, 1] probability of speech presence

**Note on Batch Processing**: Currently, the ONNX model only supports batch size of 1 due to export limitations between PyTorch transformers and ONNX. However, single-sample inference is highly optimized and runs extremely fast (~100x real-time on CPU), making sequential processing still very efficient for most use cases.

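Since the input shape is fixed, shorter audio must be brought to exactly 480,000 samples before feature extraction (note that `WhisperFeatureExtractor` also pads to 30 s by default). A small helper sketch, with an illustrative name:

```python
import numpy as np

def pad_to_chunk(audio, chunk_samples=480_000):
    """Truncate or zero-pad a 16 kHz waveform to exactly one 30 s chunk."""
    if len(audio) >= chunk_samples:
        return audio[:chunk_samples]
    return np.pad(audio, (0, chunk_samples - len(audio)))
```
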
## Training Details

### Training Configuration
- **Dataset**: ~500 Japanese ASMR audio recordings with accurate speech timestamps
- **Loss Function**: Focal loss (α=0.25, γ=2.0) for class imbalance
- **Optimizer**: AdamW with learning rate 1.5e-3
- **Batch Size**: 128
- **Training Duration**: 5 epochs
- **Hardware**: Single GPU training with mixed precision (bf16)

### Data Processing
- Audio segmented into 30-second chunks
- Frame-level labels generated from word-level timestamps
- Augmentation: none (relying on Whisper's pre-training robustness)

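For reference, the binary focal loss with the α and γ values above down-weights easy, confidently classified frames relative to plain cross-entropy, which is what makes it suitable for the heavy silence/speech imbalance in ASMR audio. A NumPy sketch of the standard formulation (from Lin et al.; this is not the repo's actual training code):

```python
import numpy as np

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-frame focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted speech probabilities in [0, 1]; y: binary frame labels (1 = speech).
    """
    p = np.clip(p, eps, 1.0 - eps)                 # numerical safety for log()
    p_t = np.where(y == 1, p, 1.0 - p)             # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha) # class-balance weight
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With γ=2, a frame predicted correctly at 0.99 contributes almost nothing, while a badly misclassified frame dominates the loss.
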
## Limitations and Considerations

1. **Fixed Duration**: The model expects 30-second chunks; shorter audio needs padding
2. **Training Specialization**: While the model performs well across languages and environments due to Whisper's strong multilingual foundation, it excels particularly at:
   - Japanese ASMR content (primary training data)
   - Whispered and soft speech detection
   - Quiet, intimate audio environments
3. **Generalization**: The model can effectively handle various languages and normal speech volumes, though performance may be slightly better on content similar to the training data
4. **Background Noise**: Performance may degrade in very noisy conditions
5. **Music/Singing**: Primarily trained on speech; performance on singing may vary

## Model Files

- `model.onnx`: ONNX model file
- `model_metadata.json`: Model configuration and parameters
- `inference.py`: Ready-to-use inference script with post-processing
- `requirements.txt`: Python dependencies

## Installation

```bash
pip install onnxruntime  # or onnxruntime-gpu for GPU support
pip install librosa transformers numpy
```

## Applications

- **ASMR Content Processing**: Detect whispered speech and subtle vocalizations in ASMR recordings
- **Japanese Audio Processing**: Optimized for Japanese language content, especially soft speech
- **Transcription Pre-processing**: Filter out silence before ASR, particularly effective for whispered content
- **Audio Indexing**: Identify speech segments in long recordings
- **Real-time Communication**: Detect active speech in calls/meetings
- **Audio Analytics**: Speech/silence ratio analysis for ASMR and meditation content
- **Subtitle Alignment**: Accurate timing for subtitles, including whispered dialogue

## Citation

If you use this model, please cite:

```bibtex
@misc{whisper-vad,
  title={Whisper-VAD: Whisper-based Voice Activity Detection},
  author={Grider},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/TransWithAI/Whisper-Vad-EncDec-ASMR-onnx}}
}
```

## References

- Original Whisper paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- WhisperSeg: [Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection](https://doi.org/10.1101/2023.09.30.560270)
- WhisperSeg on GitHub: [https://github.com/nianlonggu/WhisperSeg](https://github.com/nianlonggu/WhisperSeg)

## License

MIT License

## Acknowledgments

This model builds upon OpenAI's Whisper model and implements architectural refinements for efficient voice activity detection.