# ONNX Real-Time DOA Streaming
Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real-time and displays detected sound source directions.
## Overview
The script performs the following process:
1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to get 4 channels
3. **Feature Extraction**: Computes STFT features (magnitude plus phase encoded as cosine and sine) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real-time
## Prerequisites
### Hardware
- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone)
  Expected microphone geometry (positions in meters):
  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```
- **NVIDIA GPU** (optional, for faster inference)
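As a quick sanity check, the mic angles listed above fall directly out of the (x, y) coordinates via `atan2`:

```python
import math

# Mic positions (x, y) in meters, from the hardware section above
positions = [(0.0277, 0.0), (0.0, 0.0277), (-0.0277, 0.0), (0.0, -0.0277)]
angles = [math.degrees(math.atan2(y, x)) % 360 for x, y in positions]
print(angles)  # ≈ [0.0, 90.0, 180.0, 270.0]
```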
### Software Dependencies
Install the required packages:
```bash
conda activate doaEnv
pip install onnxruntime-gpu # For GPU inference
# OR
pip install onnxruntime # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```
### ONNX Model
You need a converted ONNX model file. If you haven't converted your PyTorch model yet:
```bash
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```
## Quick Start
### 1. List Available Audio Devices
First, find your ReSpeaker device index:
```bash
python onnx_stream_microphone.py --list-devices
```
Look for a device named "ReSpeaker" or "Seeed", or one whose name contains "2886", and note its device index.
### 2. Stop PulseAudio (Required)
On Linux, PulseAudio often locks the ALSA devices. You need to temporarily stop it:
```bash
pulseaudio --kill
```
**Note**: You can use the helper script `run_onnx_stream.sh` which automates this (see below).
### 3. Run the Streaming Script
Basic usage:
```bash
python onnx_stream_microphone.py \
--onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
--device-index 9
```
### 4. Restart PulseAudio
After you're done, restart PulseAudio:
```bash
pulseaudio --start
```
## Using the Helper Script
A helper script automates PulseAudio management:
```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```
This script will:
1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)
## Command-Line Arguments
### Required Arguments
- `--onnx PATH`: Path to the ONNX model file
### Audio Configuration
- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size (stride between consecutive windows) in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit
### Model Configuration
- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)
### Histogram Detection Parameters
These control how DOA peaks are detected from the model logits:
- `--K INT`: Number of azimuth bins (default: 72, should match model)
- `--tau FLOAT`: Softmax temperature for histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)
### Event Gate Parameters
These control when detections are considered valid (filtering noise):
- `--level-delta-on-db FLOAT`: Level increase threshold to open gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep gate open after detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to open gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)
### Onset Detection Parameters
- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)
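The flux EMA drives the `FLUXz` value shown in the console output. A minimal sketch of how such a z-score could be tracked (the function and state names here are illustrative, not the script's actual API):

```python
import numpy as np

def flux_zscore(prev_mag, mag, state, alpha=0.05):
    """Half-wave-rectified spectral flux, normalized by EMA statistics.

    state carries the running mean/variance of the flux between calls.
    """
    flux = np.sum(np.maximum(mag - prev_mag, 0.0))
    mean = state.get("mean", flux)
    var = state.get("var", 1.0)
    z = (flux - mean) / np.sqrt(var + 1e-8)
    # EMA update, corresponding to the --onset-alpha parameter
    state["mean"] = (1 - alpha) * mean + alpha * flux
    state["var"] = (1 - alpha) * var + alpha * (flux - mean) ** 2
    return z
```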
## Example with Custom Parameters
```bash
python onnx_stream_microphone.py \
--onnx doa_model.onnx \
--device-index 9 \
--window-ms 400 \
--hop-ms 100 \
--K 72 \
--max-sources 2 \
--tau 0.8 \
--smooth-k 1 \
--min-peak-height 0.08 \
--min-window-mass 0.16 \
--min-sep-deg 22.5 \
--min-active-ratio 0.15 \
--level-delta-on-db 4.0 \
--level-delta-off-db 1.5 \
--level-min-dbfs -55.0 \
--level-ema-alpha 0.05 \
--event-hold-ms 320 \
--event-refractory-ms 200 \
--min-R-clip 0.30 \
--onset-alpha 0.05
```
## Understanding the Output
### Console Output
Each line shows:
```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```
- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from background (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees
### Visual Output
A polar plot window shows:
- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)
### Azimuth Convention
- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)
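With the default K=72, each histogram bin spans 5°. Assuming bin 0 is centered on the 0° (front) direction, bin indices map to azimuths as:

```python
def bin_to_azimuth(k, K=72):
    """Center azimuth in degrees of histogram bin k; 0° is the array front."""
    return (k * 360.0 / K) % 360.0
```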
## How It Works
### 1. Audio Processing Pipeline
```
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
```
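In code, the selection step amounts to deinterleaving the raw PCM buffer and indexing the channel axis (the int16 sample format and function name are assumptions about the ReSpeaker's output):

```python
import numpy as np

# The script keeps channels [1, 4, 3, 2], in that order
CHANNEL_ORDER = [1, 4, 3, 2]

def select_channels(raw, n_channels=6):
    """Deinterleave a raw PCM buffer and pick the 4 model channels."""
    pcm = np.frombuffer(raw, dtype=np.int16).reshape(-1, n_channels)
    audio = pcm.astype(np.float32) / 32768.0  # normalize to [-1, 1)
    return audio[:, CHANNEL_ORDER]            # shape: (samples, 4)
```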
### 2. Feature Extraction
For each analysis window:
- Compute the STFT for all 4 channels
- Extract magnitude and phase components, encoding phase as cosine and sine
- Result: `(T_frames, 12_features, F_freq_bins)` (4 channels × 3 components each)
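A sketch of the per-window feature computation; the FFT and hop sizes here are placeholders, since the real values come from `config.yaml`:

```python
import numpy as np

def stft_features(audio, n_fft=512, hop=256):
    """Per-channel STFT, stacked as [magnitude, cos(phase), sin(phase)].

    audio: (samples, 4). Output: (T_frames, 12, F_bins), i.e.
    4 channels x 3 components = 12 feature maps per frame.
    """
    win = np.hanning(n_fft)
    n_frames = 1 + (audio.shape[0] - n_fft) // hop
    feats = []
    for ch in range(audio.shape[1]):
        frames = np.stack([audio[i * hop:i * hop + n_fft, ch] * win
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=-1)  # (T, F)
        feats += [np.abs(spec), np.cos(np.angle(spec)), np.sin(np.angle(spec))]
    return np.stack(feats, axis=1)           # (T, 12, F)
```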
### 3. Model Inference
- Batch-process features through the ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame yields K scores, one per azimuth bin, which the softmax turns into probabilities
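Batched inference could look like the following; the session would be created with `onnxruntime.InferenceSession(path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])`, and the input name and batching details are assumptions:

```python
import numpy as np

def run_model(session, input_name, features, batch_size=25):
    """Push per-frame features through the model in fixed-size batches
    (the script uses batch_size=25 internally)."""
    logits = []
    for i in range(0, len(features), batch_size):
        batch = features[i:i + batch_size].astype(np.float32)
        out = session.run(None, {input_name: batch})
        logits.append(out[0])
    return np.concatenate(logits)  # (T_frames, K_bins)
```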
### 4. Histogram Aggregation
- Apply softmax with temperature `tau` to logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
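The aggregation step can be sketched as follows; the exact coherence weighting in the script may differ, and `r_clip` is assumed here to be one weight per frame:

```python
import numpy as np

def aggregate_histogram(logits, r_clip, tau=0.8, smooth_k=1):
    """Temperature softmax per frame, coherence-weighted sum over frames,
    then circular smoothing of the resulting azimuth histogram."""
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)      # (T, K) per-frame posteriors
    hist = (r_clip[:, None] * p).sum(axis=0)
    hist /= hist.sum() + 1e-12             # normalize to unit mass
    if smooth_k > 1:
        kernel = np.ones(smooth_k) / smooth_k
        # pad with wrap-around because azimuth is circular
        padded = np.concatenate([hist[-smooth_k:], hist, hist[:smooth_k]])
        hist = np.convolve(padded, kernel, mode="same")[smooth_k:-smooth_k]
    return hist
```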
### 5. Peak Detection
- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks
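A simplified sketch of circular peak picking with parabolic refinement (the real script additionally validates each peak's window mass):

```python
import numpy as np

def find_peaks_circular(hist, min_height=0.10, min_sep_deg=20.0, max_sources=3):
    """Greedy peak picking on a circular histogram; returns azimuths in degrees."""
    K = len(hist)
    min_sep = int(round(min_sep_deg * K / 360.0))
    order = np.argsort(hist)[::-1]  # bins by descending mass
    bins, angles = [], []
    for k in order:
        if hist[k] < min_height or len(bins) >= max_sources:
            break
        if hist[k] < hist[(k - 1) % K] or hist[k] < hist[(k + 1) % K]:
            continue  # not a local maximum
        dist = [min((k - p) % K, (p - k) % K) for p in bins]
        if any(d < min_sep for d in dist):
            continue  # too close to an already accepted peak
        # parabolic interpolation for sub-bin accuracy
        y0, y1, y2 = hist[(k - 1) % K], hist[k], hist[(k + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.5 * (y0 - y2) / denom if abs(denom) > 1e-12 else 0.0
        bins.append(k)
        angles.append(((k + delta) * 360.0 / K) % 360.0)
    return angles
```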
### 6. Event Gating
- Track audio level with exponential moving average
- Open gate when:
- Level increases by `level_delta_on_db` OR
- Valid peaks detected AND R_clip > `min_R_clip`
- Close gate when level drops and no valid peaks
- Apply hold and refractory periods to prevent flickering
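The gating logic above can be sketched as a small state machine; the class name and the exact combination of conditions are illustrative, and the real script also folds in the onset flux:

```python
import time

class EventGate:
    """Hysteresis gate with hold and refractory periods (sketch)."""

    def __init__(self, on_db=2.5, off_db=1.0, hold_ms=300, refractory_ms=120,
                 min_r_clip=0.18, alpha=0.05):
        self.on_db, self.off_db = on_db, off_db
        self.hold = hold_ms / 1000.0
        self.refractory = refractory_ms / 1000.0
        self.min_r = min_r_clip
        self.alpha = alpha
        self.bg = None            # EMA of background level (dBFS)
        self.open = False
        self.last_change = -1e9
        self.opened_at = -1e9

    def update(self, level_db, r_clip, has_peaks, now=None):
        now = time.monotonic() if now is None else now
        if self.bg is None:
            self.bg = level_db
        diff = level_db - self.bg
        # only track the background while the gate is closed
        if not self.open:
            self.bg = (1 - self.alpha) * self.bg + self.alpha * level_db
        if now - self.last_change >= self.refractory:
            if not self.open and (diff > self.on_db
                                  or (has_peaks and r_clip > self.min_r)):
                self.open, self.last_change, self.opened_at = True, now, now
            elif (self.open and diff < self.off_db and not has_peaks
                  and now - self.opened_at >= self.hold):
                self.open, self.last_change = False, now
        return self.open
```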
## Troubleshooting
### "Invalid number of channels" Error
**Problem**: Device reports 0 channels or PyAudio can't open it.
**Solution**:
1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`
Or use the helper script `run_onnx_stream.sh`.
### No Audio Detected
- Check microphone connections
- Verify device index with `--list-devices`
- Check audio levels (should be above `level_min_dbfs`)
- Adjust `level_delta_on_db` to be more sensitive
### GPU Not Used
- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model loading message
### Model Mismatch Errors
- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify config.yaml matches training configuration
### Poor DOA Accuracy
- Increase `--window-ms` for longer analysis windows (more stable)
- Adjust `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning
## Performance Tips
- **GPU Inference**: Use `onnxruntime-gpu` for 5-10x speedup
- **Window Size**: Larger windows (400ms) = more stable but higher latency
- **Hop Size**: Smaller hops (50ms) = more responsive but more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage
## Stopping the Script
Press **Ctrl+C** to stop the stream. The script will:
- Close the audio stream
- Close the visualization window
- Clean up resources
## Integration
To use this in your own code, see `onnx_doa_inference.py` which provides a standalone inference class that can be integrated into other projects.