# ONNX Real-Time DOA Streaming

Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real time and displays the detected sound source directions.
## Overview

The script performs the following steps:

1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to obtain 4 microphone channels
3. **Feature Extraction**: Computes STFT features (magnitude, cosine and sine of phase) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real time
## Prerequisites

### Hardware

- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone) with the following geometry:

  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```

- **NVIDIA GPU** (optional, for faster inference)
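As a sanity check on the geometry, the mic coordinates can be converted back to angles and radii with a few lines of NumPy (`positions` is copied from the YAML above):

```python
import numpy as np

# Mic coordinates copied from the YAML geometry above (meters, x-y plane)
positions = np.array([
    [0.0277, 0.0],    # Mic 0
    [0.0, 0.0277],    # Mic 1
    [-0.0277, 0.0],   # Mic 2
    [0.0, -0.0277],   # Mic 3
])

# Recover each mic's azimuth (counter-clockwise from +x) and radius
angles_deg = np.degrees(np.arctan2(positions[:, 1], positions[:, 0])) % 360
radius_m = np.linalg.norm(positions, axis=1)
# angles_deg ≈ [0, 90, 180, 270]; radius_m ≈ 0.0277 for every mic
```

All four mics sit on a 2.77 cm radius circle at 90° spacing, matching the comments in the config.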
### Software Dependencies

Install the required packages:

```bash
conda activate doaEnv
pip install onnxruntime-gpu  # For GPU inference
# OR
pip install onnxruntime     # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```
### ONNX Model

You need a converted ONNX model file. If you haven't converted your PyTorch model yet:

```bash
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```
## Quick Start

### 1. List Available Audio Devices

First, find your ReSpeaker device index:

```bash
python onnx_stream_microphone.py --list-devices
```

Look for a device whose name contains "ReSpeaker", "Seeed", or "2886", and note its index.
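If you script the device search, a small matcher for those name hints can help (an illustrative helper, not part of `onnx_stream_microphone.py`):

```python
def looks_like_respeaker(device_name: str) -> bool:
    """Case-insensitive check for the ReSpeaker name hints above."""
    name = device_name.lower()
    return any(key in name for key in ("respeaker", "seeed", "2886"))
```

You would call this on each device's `name` field while iterating over PyAudio's device infos.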
### 2. Stop PulseAudio (Required)

On Linux, PulseAudio often locks the ALSA devices. You need to stop it temporarily:

```bash
pulseaudio --kill
```

**Note**: The helper script `run_onnx_stream.sh` automates this (see below).
### 3. Run the Streaming Script

Basic usage:

```bash
python onnx_stream_microphone.py \
    --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
    --device-index 9
```

### 4. Restart PulseAudio (After Stopping)

When you're done, restart PulseAudio:

```bash
pulseaudio --start
```
## Using the Helper Script

A helper script automates PulseAudio management:

```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```

This script will:

1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)
## Command-Line Arguments

### Required Arguments

- `--onnx PATH`: Path to the ONNX model file

### Audio Configuration

- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size (stride between consecutive windows) in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit

### Model Configuration

- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)

### Histogram Detection Parameters

These control how DOA peaks are detected from the model logits:

- `--K INT`: Number of azimuth bins (default: 72; must match the model)
- `--tau FLOAT`: Softmax temperature for the histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)

### Event Gate Parameters

These control when detections are considered valid (filtering noise):

- `--level-delta-on-db FLOAT`: Level increase threshold to open the gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close the gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep the gate open after a detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to open the gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)

### Onset Detection Parameters

- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)
## Example with Custom Parameters

```bash
python onnx_stream_microphone.py \
    --onnx doa_model.onnx \
    --device-index 9 \
    --window-ms 400 \
    --hop-ms 100 \
    --K 72 \
    --max-sources 2 \
    --tau 0.8 \
    --smooth-k 1 \
    --min-peak-height 0.08 \
    --min-window-mass 0.16 \
    --min-sep-deg 22.5 \
    --min-active-ratio 0.15 \
    --level-delta-on-db 4.0 \
    --level-delta-off-db 1.5 \
    --level-min-dbfs -55.0 \
    --level-ema-alpha 0.05 \
    --event-hold-ms 320 \
    --event-refractory-ms 200 \
    --min-R-clip 0.30 \
    --onset-alpha 0.05
```
## Understanding the Output

### Console Output

Each line shows:

```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```

- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from the background level (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees
### Visual Output

A polar plot window shows:

- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to the confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)

### Azimuth Convention

- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)
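In code, this compass-style convention (0° at the front, increasing clockwise) maps to x-y coordinates with +y pointing front and +x pointing right; a minimal sketch:

```python
import math

def azimuth_to_xy(azimuth_deg: float) -> tuple[float, float]:
    """Map the compass azimuth above (0° = front, 90° = right)
    to (x, y), with +y pointing front and +x pointing right."""
    a = math.radians(azimuth_deg)
    return (math.sin(a), math.cos(a))
```

For example, `azimuth_to_xy(0.0)` points straight ahead along +y, and `azimuth_to_xy(90.0)` points right along +x.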
## How It Works

### 1. Audio Processing Pipeline

```
Microphone (6 ch) → Channel Selection [1, 4, 3, 2] → 4-channel audio
```
### 2. Feature Extraction

For each analysis window:

- Compute the STFT for all 4 channels
- Extract the magnitude and the cosine and sine of the phase per channel
- Result: `(T_frames, 12_features, F_freq_bins)` (4 channels × 3 components)
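The feature layout can be sketched with NumPy as follows; `n_fft` and `hop` here are illustrative values, not necessarily the ones the script derives from its config:

```python
import numpy as np

def extract_features(audio: np.ndarray, n_fft: int = 512, hop: int = 160) -> np.ndarray:
    """STFT feature sketch for a (4, n_samples) audio window.

    Per channel we keep magnitude, cos(phase), and sin(phase),
    giving 4 channels x 3 components = 12 features per frame.
    """
    chans = []
    window = np.hanning(n_fft)
    for ch in audio:
        # Frame the signal and apply the analysis window
        n_frames = 1 + (len(ch) - n_fft) // hop
        frames = np.stack([ch[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=-1)   # (T, F) complex spectrum
        phase = np.angle(spec)
        chans.extend([np.abs(spec), np.cos(phase), np.sin(phase)])
    # Stack to (T_frames, 12, F_freq_bins)
    return np.stack(chans, axis=1)
```

A 200 ms window at 16 kHz (3200 samples, 4 channels) yields a `(17, 12, 257)` array with these settings.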
### 3. Model Inference

- Batch-process features through the ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame gets K scores, one per azimuth bin
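The batching can be sketched independently of the model. Here `infer` is a stand-in for an onnxruntime session call (e.g. `lambda x: sess.run(None, {"input": x})[0]`, with the input name assumed), and `batch_size=25` mirrors the internal batch size noted under Performance Tips:

```python
import numpy as np

def run_batched(infer, feats: np.ndarray, batch_size: int = 25) -> np.ndarray:
    """Run (T, 12, F) frame features through a model in fixed-size batches.

    `infer` maps a (B, 12, F) batch to (B, K) logits; the outputs are
    concatenated back into a single (T, K) array.
    """
    outs = []
    for start in range(0, len(feats), batch_size):
        outs.append(infer(feats[start:start + batch_size]))
    return np.concatenate(outs, axis=0)
```

Batching keeps GPU utilization high without waiting for all frames before the first inference call.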
### 4. Histogram Aggregation

- Apply softmax with temperature `tau` to the logits
- Weight frames by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
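These four steps can be condensed into one function; this is a sketch of the procedure, not the script's exact code, and the optional `weights` argument stands in for the per-frame coherence weighting:

```python
import numpy as np

def aggregate_histogram(logits, tau=0.8, smooth_k=1, weights=None):
    """Aggregate per-frame logits (T, K) into one circular histogram (K,)."""
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)      # per-frame softmax
    if weights is None:
        weights = np.ones(len(logits))     # e.g. per-frame coherence
    hist = (weights[:, None] * p).sum(axis=0)
    hist /= hist.sum()
    if smooth_k > 0:
        # Pad circularly so the smoothing wraps around 0°/360°
        kernel = np.ones(2 * smooth_k + 1) / (2 * smooth_k + 1)
        padded = np.concatenate([hist[-smooth_k:], hist, hist[:smooth_k]])
        hist = np.convolve(padded, kernel, mode="valid")
    return hist
```

The circular padding matters: a source near 0° should contribute mass to bins on both sides of the wrap-around.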
### 5. Peak Detection

- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks
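A minimal version of this selection (omitting the window-mass and active-ratio checks for brevity; defaults match the CLI values) might look like:

```python
import numpy as np

def find_peaks_circular(hist, min_height=0.10, min_sep_deg=20.0, max_sources=3):
    """Pick up to max_sources local maxima from a circular histogram (K,)."""
    K = len(hist)
    bin_deg = 360.0 / K
    # Local maxima with circular neighbours
    is_peak = (hist > np.roll(hist, 1)) & (hist >= np.roll(hist, -1))
    cands = [i for i in np.argsort(hist)[::-1]
             if is_peak[i] and hist[i] >= min_height]
    peaks = []
    for i in cands:
        # Parabolic refinement from the three bins around the peak
        y0, y1, y2 = hist[(i - 1) % K], hist[i], hist[(i + 1) % K]
        denom = y0 - 2 * y1 + y2
        offset = 0.5 * (y0 - y2) / denom if abs(denom) > 1e-12 else 0.0
        angle = ((i + offset) * bin_deg) % 360.0
        # Enforce a minimum circular separation from already-kept peaks
        sep = lambda a, b: min(abs(a - b), 360.0 - abs(a - b))
        if all(sep(angle, p) >= min_sep_deg for p in peaks):
            peaks.append(angle)
        if len(peaks) == max_sources:
            break
    return peaks
```

The parabolic step fits a parabola through a peak bin and its two neighbours, giving sub-bin (here sub-5°) angular resolution.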
### 6. Event Gating

- Track the audio level with an exponential moving average
- Open the gate when:
  - the level increases by `level_delta_on_db`, OR
  - valid peaks are detected AND R_clip > `min_R_clip`
- Close the gate when the level drops and there are no valid peaks
- Apply hold and refractory periods to prevent flickering
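The level-based part of the gate reduces to a hysteresis comparator around an EMA background; a sketch under the CLI defaults, with the hold/refractory timing and peak/coherence conditions omitted for brevity:

```python
class LevelGate:
    """Hysteresis gate on the audio level in dBFS (level-tracking sketch)."""

    def __init__(self, delta_on_db=2.5, delta_off_db=1.0, alpha=0.05):
        self.delta_on = delta_on_db
        self.delta_off = delta_off_db
        self.alpha = alpha          # EMA coefficient for the background
        self.background = None
        self.open = False

    def update(self, level_dbfs: float) -> bool:
        if self.background is None:
            self.background = level_dbfs
        diff = level_dbfs - self.background
        # Hysteresis: a higher threshold to open than to stay open
        if self.open:
            self.open = diff > self.delta_off
        else:
            self.open = diff > self.delta_on
        # Track the background only while closed, so loud events
        # do not drag the reference level upward
        if not self.open:
            self.background += self.alpha * (level_dbfs - self.background)
        return self.open
```

The asymmetric on/off thresholds are what keep the gate from flickering on levels that hover near a single threshold.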
## Troubleshooting

### "Invalid number of channels" Error

**Problem**: The device reports 0 channels or PyAudio can't open it.

**Solution**:

1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`

Or use the helper script `run_onnx_stream.sh`.
### No Audio Detected

- Check the microphone connections
- Verify the device index with `--list-devices`
- Check the audio levels (they should be above `level_min_dbfs`)
- Lower `level_delta_on_db` to make detection more sensitive
### GPU Not Used

- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model-loading message
### Model Mismatch Errors

- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify that config.yaml matches the training configuration
### Poor DOA Accuracy

- Increase `--window-ms` for longer, more stable analysis windows
- Adjust the `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check the microphone array calibration and positioning
## Performance Tips

- **GPU Inference**: Use `onnxruntime-gpu` for a 5-10x speedup
- **Window Size**: Larger windows (400 ms) are more stable but add latency
- **Hop Size**: Smaller hops (50 ms) are more responsive but cost more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage
## Stopping the Script

Press **Ctrl+C** to stop the stream. The script will:

- Close the audio stream
- Close the visualization window
- Clean up resources

## Integration

To use this in your own code, see `onnx_doa_inference.py`, which provides a standalone inference class that can be integrated into other projects.