# ONNX Real-Time DOA Streaming

Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real time and displays detected sound source directions.

## Overview

The script performs the following process:

1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to get 4 channels
3. **Feature Extraction**: Computes STFT features (magnitude, phase, cosine, sine) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real time

## Prerequisites

### Hardware

- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone) with microphone positions:

  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```

- **NVIDIA GPU** (optional, for faster inference)

### Software Dependencies

Install the required packages:

```bash
conda activate doaEnv
pip install onnxruntime-gpu  # For GPU inference
# OR
pip install onnxruntime      # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```

### ONNX Model

You need a converted ONNX model file. If you haven't converted your PyTorch model yet:

```bash
python convert_to_onnx.py \
    --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt \
    --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```

## Quick Start

### 1. List Available Audio Devices

First, find your ReSpeaker device index:

```bash
python onnx_stream_microphone.py --list-devices
```

Look for a device whose name contains "ReSpeaker", "Seeed", or "2886", and note its index.

### 2. Stop PulseAudio (Required)

On Linux, PulseAudio often holds the ALSA device, so you need to stop it temporarily:

```bash
pulseaudio --kill
```

**Note**: The helper script `run_onnx_stream.sh` automates this (see below).

### 3. Run the Streaming Script

Basic usage:

```bash
python onnx_stream_microphone.py \
    --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
    --device-index 9
```

### 4. Restart PulseAudio (After Stopping)

After you're done, restart PulseAudio:

```bash
pulseaudio --start
```

## Using the Helper Script

A helper script automates PulseAudio management:

```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```

This script will:

1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)

## Command-Line Arguments

### Required Arguments

- `--onnx PATH`: Path to the ONNX model file

### Audio Configuration

- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size between analysis windows in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size in samples (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit

### Model Configuration

- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)

### Histogram Detection Parameters

These control how DOA peaks are detected from the model logits:

- `--K INT`: Number of azimuth bins (default: 72; should match the model)
- `--tau FLOAT`: Softmax temperature for the histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)

### Event Gate Parameters

These control when detections are considered valid (filtering noise):

- `--level-delta-on-db FLOAT`: Level increase threshold to open the gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close the gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep the gate open after a detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to
open gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)

### Onset Detection Parameters

- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)

## Example with Custom Parameters

```bash
python onnx_stream_microphone.py \
    --onnx doa_model.onnx \
    --device-index 9 \
    --window-ms 400 \
    --hop-ms 100 \
    --K 72 \
    --max-sources 2 \
    --tau 0.8 \
    --smooth-k 1 \
    --min-peak-height 0.08 \
    --min-window-mass 0.16 \
    --min-sep-deg 22.5 \
    --min-active-ratio 0.15 \
    --level-delta-on-db 4.0 \
    --level-delta-off-db 1.5 \
    --level-min-dbfs -55.0 \
    --level-ema-alpha 0.05 \
    --event-hold-ms 320 \
    --event-refractory-ms 200 \
    --min-R-clip 0.30 \
    --onset-alpha 0.05
```

## Understanding the Output

### Console Output

Each line shows:

```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```

- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from the background estimate (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees

### Visual Output

A polar plot window shows:

- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)

### Azimuth Convention

- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)

## How It Works

### 1. Audio Processing Pipeline

```
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
```
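As a concrete illustration of the channel-selection step above, the following is a minimal sketch of deinterleaving a raw 6-channel int16 buffer (as PyAudio delivers it) and picking channels `[1, 4, 3, 2]`. The function name and the float scaling are illustrative; the script's internals may differ.

```python
import numpy as np

def select_channels(raw_bytes: bytes, n_channels: int = 6,
                    keep=(1, 4, 3, 2)) -> np.ndarray:
    # Interleaved int16 samples: frame0_ch0, ..., frame0_ch5, frame1_ch0, ...
    samples = np.frombuffer(raw_bytes, dtype=np.int16)
    frames = samples.reshape(-1, n_channels)          # (n_frames, 6)
    audio = frames[:, list(keep)].astype(np.float32)  # (n_frames, 4)
    return audio / 32768.0                            # scale to [-1, 1)

# Example: 1600 frames of silence (one 100 ms hop at 16 kHz)
raw = np.zeros(1600 * 6, dtype=np.int16).tobytes()
print(select_channels(raw).shape)  # (1600, 4)
```

The resulting `(n_frames, 4)` float array is what the feature-extraction stage below consumes.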
### 2. Feature Extraction

For each analysis window:

- Compute the STFT for all 4 channels
- Extract magnitude, phase, cosine, and sine components
- Result: `(T_frames, 12_features, F_freq_bins)`

### 3. Model Inference

- Batch-process features through the ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame gets K scores (logits), one per azimuth bin

### 4. Histogram Aggregation

- Apply softmax with temperature `tau` to the logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram

### 5. Peak Detection

- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks

### 6. Event Gating

- Track the audio level with an exponential moving average
- Open the gate when:
  - the level increases by `level_delta_on_db`, OR
  - valid peaks are detected AND R_clip > `min_R_clip`
- Close the gate when the level drops and no valid peaks remain
- Apply hold and refractory periods to prevent flickering

## Troubleshooting

### "Invalid number of channels" Error

**Problem**: The device reports 0 channels or PyAudio can't open it.

**Solution**:

1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`

Or use the helper script `run_onnx_stream.sh`.
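If the error persists, it can help to inspect what PyAudio actually reports for each device. The sketch below is illustrative: the name filter mirrors the "ReSpeaker / Seeed / 2886" hint from Quick Start, and the commented PyAudio calls are the assumed usage pattern (they require working audio hardware).

```python
import re

def looks_like_respeaker(info: dict) -> bool:
    # Match the device-name hints from Quick Start and require enough
    # input channels for the 6-channel array.
    name = info.get("name", "")
    return (bool(re.search(r"respeaker|seeed|2886", name, re.IGNORECASE))
            and info.get("maxInputChannels", 0) >= 6)

# Assumed usage with PyAudio (needs an open ALSA device):
#   import pyaudio
#   p = pyaudio.PyAudio()
#   for i in range(p.get_device_count()):
#       info = p.get_device_info_by_index(i)
#       if looks_like_respeaker(info):
#           print(i, info["name"], info["maxInputChannels"])
```

If `maxInputChannels` shows 0 for the ReSpeaker entry, PulseAudio is almost certainly still holding the device.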
### No Audio Detected

- Check microphone connections
- Verify the device index with `--list-devices`
- Check audio levels (they should be above `level_min_dbfs`)
- Lower `level_delta_on_db` to make the gate more sensitive

### GPU Not Used

- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model loading message

### Model Mismatch Errors

- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify that config.yaml matches the training configuration

### Poor DOA Accuracy

- Increase `--window-ms` for longer, more stable analysis windows
- Adjust the `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning

## Performance Tips

- **GPU Inference**: Use `onnxruntime-gpu` for a 5-10x speedup
- **Window Size**: Larger windows (400 ms) are more stable but add latency
- **Hop Size**: Smaller hops (50 ms) are more responsive but cost more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage

## Stopping the Script

Press **Ctrl+C** to stop the stream. The script will:

- Close the audio stream
- Close the visualization window
- Clean up resources

## Integration

To use this in your own code, see `onnx_doa_inference.py`, which provides a standalone inference class that can be integrated into other projects.
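If you reimplement the post-processing yourself, the histogram-aggregation and peak-detection steps from "How It Works" reduce to a few lines. The sketch below uses the default `tau`, `min_peak_height`, `min_sep_deg`, and `max_sources`; it is a simplified stand-in (a greedy top-bin search instead of a local-maximum scan, with R_clip weighting, window-mass checks, and parabolic refinement omitted), not the script's exact logic.

```python
import numpy as np

def doa_histogram(logits: np.ndarray, tau: float = 0.8, smooth_k: int = 1) -> np.ndarray:
    # Temperature softmax per frame, then average over frames -> (K,) histogram.
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    hist = p.mean(axis=0)
    if smooth_k > 1:
        # Circular moving-average smoothing: pad with wrapped bins, then crop.
        kernel = np.ones(smooth_k) / smooth_k
        padded = np.r_[hist[-smooth_k:], hist, hist[:smooth_k]]
        hist = np.convolve(padded, kernel, mode="same")[smooth_k:-smooth_k]
    return hist

def find_peaks(hist, min_height=0.10, min_sep_deg=20.0, max_sources=3):
    K = len(hist)
    min_sep = int(round(min_sep_deg / (360.0 / K)))
    peaks = []
    for b in np.argsort(hist)[::-1]:           # bins by descending mass
        if hist[b] < min_height:
            break
        # Enforce circular separation from already-accepted peaks.
        if all(min(abs(b - p), K - abs(b - p)) >= min_sep for p in peaks):
            peaks.append(int(b))
        if len(peaks) == max_sources:
            break
    return [p * 360.0 / K for p in peaks]      # bin index -> azimuth degrees

# Demo: synthetic logits where every frame points at bin 9 of 72 (45 deg)
demo = np.zeros((5, 72))
demo[:, 9] = 10.0
print(find_peaks(doa_histogram(demo)))  # expect a single peak at 45 deg
```

This is enough to sanity-check a model's logits offline before wiring up the full streaming and gating pipeline.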