# ONNX Real-Time DOA Streaming
Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real time and displays the detected sound source directions.
## Overview
The script performs the following steps:
1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to get 4 channels
3. **Feature Extraction**: Computes STFT features (magnitude, phase, cosine, sine) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real-time
## Prerequisites
### Hardware
- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone) with the following geometry; the snippet after this list recovers the angles from the coordinates:

  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```
- **NVIDIA GPU** (optional, for faster inference)
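As a quick sanity check, the mic angles can be recovered from the coordinates with `atan2` (a minimal sketch; the positions are copied from the YAML above and assumed to be in metres):

```python
import math

# Mic positions copied from the YAML above (assumed to be metres).
positions = [(0.0277, 0.0), (0.0, 0.0277), (-0.0277, 0.0), (0.0, -0.0277)]

for i, (x, y) in enumerate(positions):
    angle = math.degrees(math.atan2(y, x)) % 360  # 0° along +x, counter-clockwise
    print(f"Mic {i}: {angle:.0f}°, radius {math.hypot(x, y) * 100:.2f} cm")
```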
### Software Dependencies
Install the required packages:
```bash
conda activate doaEnv
pip install onnxruntime-gpu # For GPU inference
# OR
pip install onnxruntime # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```
### ONNX Model
You need a converted ONNX model file. If you haven't converted your PyTorch model yet:
```bash
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```
## Quick Start
### 1. List Available Audio Devices
First, find your ReSpeaker device index:
```bash
python onnx_stream_microphone.py --list-devices
```
Look for a device whose name contains "ReSpeaker", "Seeed", or "2886", and note its index.
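Under the hood, device enumeration comes down to a few PyAudio calls. A minimal sketch of roughly what `--list-devices` does (the script's actual output format may differ):

```python
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info["maxInputChannels"] > 0:  # show input-capable devices only
        print(f"{i}: {info['name']} ({info['maxInputChannels']} input channels)")
pa.terminate()
```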
### 2. Stop PulseAudio (Required)
On Linux, PulseAudio often holds the ALSA device exclusively, so PyAudio cannot open it. Stop it temporarily:
```bash
pulseaudio --kill
```
**Note**: You can use the helper script `run_onnx_stream.sh`, which automates this (see below).
### 3. Run the Streaming Script
Basic usage:
```bash
python onnx_stream_microphone.py \
--onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
--device-index 9
```
### 4. Restart PulseAudio (After Stopping)
After you're done, restart PulseAudio:
```bash
pulseaudio --start
```
## Using the Helper Script
A helper script automates PulseAudio management:
```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```
This script will:
1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)
## Command-Line Arguments
### Required Arguments
- `--onnx PATH`: Path to the ONNX model file
### Audio Configuration
- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size (overlap) in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit
### Model Configuration
- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)
### Histogram Detection Parameters
These control how DOA peaks are detected from the model logits:
- `--K INT`: Number of azimuth bins (default: 72; must match the model's K)
- `--tau FLOAT`: Softmax temperature for histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)
### Event Gate Parameters
These control when detections are considered valid (filtering noise):
- `--level-delta-on-db FLOAT`: Level increase threshold to open gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep gate open after detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to open gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)
### Onset Detection Parameters
- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)
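For reference, an EMA-driven z-score over spectral flux could be implemented as below (a sketch; it assumes the `FLUXz` value in the console output is such a z-score, which is an interpretation rather than the script's actual code):

```python
import math

class FluxTracker:
    """Track spectral-flux mean/variance with an EMA and emit a z-score."""
    def __init__(self, alpha: float = 0.05):  # mirrors --onset-alpha
        self.alpha, self.mean, self.var = alpha, 0.0, 1e-6

    def update(self, flux: float) -> float:
        a = self.alpha
        self.mean += a * (flux - self.mean)
        self.var += a * ((flux - self.mean) ** 2 - self.var)
        return (flux - self.mean) / math.sqrt(self.var)
```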
## Example with Custom Parameters
```bash
python onnx_stream_microphone.py \
--onnx doa_model.onnx \
--device-index 9 \
--window-ms 400 \
--hop-ms 100 \
--K 72 \
--max-sources 2 \
--tau 0.8 \
--smooth-k 1 \
--min-peak-height 0.08 \
--min-window-mass 0.16 \
--min-sep-deg 22.5 \
--min-active-ratio 0.15 \
--level-delta-on-db 4.0 \
--level-delta-off-db 1.5 \
--level-min-dbfs -55.0 \
--level-ema-alpha 0.05 \
--event-hold-ms 320 \
--event-refractory-ms 200 \
--min-R-clip 0.30 \
--onset-alpha 0.05
```
## Understanding the Output
### Console Output
Each line shows:
```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```
- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from background (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees
### Visual Output
A polar plot window shows:
- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)
### Azimuth Convention
- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)
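Matplotlib's polar axes default to 0° on the right increasing counter-clockwise, so rendering this convention needs two adjustments. A minimal sketch (the detections are made-up values; this is not the script's drawing code):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_theta_zero_location("N")  # put 0° at the top (front of the array)
ax.set_theta_direction(-1)       # clockwise, so 90° lands on the right

# Example detections as (azimuth°, confidence) pairs (made-up values).
for az, conf in [(45, 0.8), (180, 0.4)]:
    ax.plot([np.radians(az)] * 2, [0, 1], color="green", linewidth=1 + 4 * conf)

ax.set_yticklabels([])
plt.show()
```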
## How It Works
### 1. Audio Processing Pipeline
```
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
```
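With the interleaved PyAudio buffer reshaped to samples × channels, this step is a single NumPy indexing operation (the buffer below is a placeholder for one captured window):

```python
import numpy as np

frames = np.zeros((3200, 6), dtype=np.int16)  # placeholder: 200 ms at 16 kHz, 6 ch
audio4 = frames[:, [1, 4, 3, 2]]              # select and reorder the 4 mic channels
```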
### 2. Feature Extraction
For each analysis window:
- Compute STFT for all 4 channels
- Extract the magnitude and the cosine and sine of the phase for each channel (4 channels × 3 = 12 features)
- Result: `(T_frames, 12_features, F_freq_bins)`
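A minimal PyTorch sketch of this step (FFT size, hop, and window are placeholders, not necessarily the script's settings):

```python
import numpy as np
import torch

def stft_features(audio4: np.ndarray, n_fft: int = 512, hop: int = 256) -> np.ndarray:
    """Magnitude plus cos/sin of phase per channel -> (T_frames, 12, F_bins)."""
    x = torch.from_numpy(np.ascontiguousarray(audio4.T, dtype=np.float32))  # (4, N)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)                                  # (4, F, T)
    mag, phase = spec.abs(), spec.angle()
    feats = torch.cat([mag, torch.cos(phase), torch.sin(phase)])            # (12, F, T)
    return feats.permute(2, 0, 1).numpy()                                   # (T, 12, F)
```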
### 3. Model Inference
- Batch process features through ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame has K probability scores for different azimuth angles
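A minimal onnxruntime sketch of this step (the input name and feature shape are placeholders; the 25-frame batch matches the internal batch size noted under Performance Tips):

```python
import numpy as np
import onnxruntime as ort

# onnxruntime falls back to CPU automatically if the CUDA provider is missing.
sess = ort.InferenceSession(
    "doa_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

feats = np.zeros((25, 12, 257), dtype=np.float32)  # placeholder (batch, feat, freq)
input_name = sess.get_inputs()[0].name
logits = sess.run(None, {input_name: feats})[0]    # (25, K) per-frame logits
```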
### 4. Histogram Aggregation
- Apply softmax with temperature `tau` to logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
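Put together, aggregation might look like the following (a sketch matching the bullets above; the script's exact weighting and smoothing may differ):

```python
import numpy as np

def aggregate_histogram(logits: np.ndarray, frame_weights: np.ndarray,
                        tau: float = 0.8, smooth_k: int = 1) -> np.ndarray:
    """Temperature softmax per frame, weighted sum over frames, circular smoothing."""
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)            # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)            # (T, K) per-frame probabilities
    hist = (frame_weights[:, None] * p).sum(axis=0)
    hist /= max(hist.sum(), 1e-12)               # normalise to unit mass
    if smooth_k > 1:                             # circular moving average (odd k)
        pad = smooth_k // 2
        ext = np.concatenate([hist[-pad:], hist, hist[:pad]])
        hist = np.convolve(ext, np.ones(smooth_k) / smooth_k, mode="valid")
    return hist
```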
### 5. Peak Detection
- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks
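A sketch of circular peak picking with parabolic refinement (thresholds mirror the CLI defaults; the script's additional checks, such as window mass, are omitted):

```python
import numpy as np

def find_peaks_circular(hist: np.ndarray, min_height: float = 0.10,
                        min_sep_deg: float = 20.0, max_sources: int = 3):
    K, bin_deg = len(hist), 360.0 / len(hist)
    left, right = np.roll(hist, 1), np.roll(hist, -1)
    cand = np.where((hist >= left) & (hist > right) & (hist >= min_height))[0]
    cand = cand[np.argsort(hist[cand])[::-1]]    # strongest candidates first
    peaks = []
    for k in cand:
        seps = [min(abs(int(k) - p), K - abs(int(k) - p)) * bin_deg for p in peaks]
        if len(peaks) < max_sources and all(s >= min_sep_deg for s in seps):
            peaks.append(int(k))
    angles = []
    for k in peaks:                              # parabolic sub-bin refinement
        y0, y1, y2 = hist[(k - 1) % K], hist[k], hist[(k + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.5 * (y0 - y2) / denom if abs(denom) > 1e-12 else 0.0
        angles.append(((k + delta) * bin_deg) % 360.0)
    return angles
```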
### 6. Event Gating
- Track audio level with exponential moving average
- Open the gate when:
  - the level increases by `level_delta_on_db`, OR
  - valid peaks are detected AND R_clip > `min_R_clip`
- Close the gate when the level drops and no valid peaks remain
- Apply hold and refractory periods to prevent flickering (see the sketch below)
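A simplified version of this state machine (a sketch; parameter names mirror the CLI flags, but the script's actual gating logic may combine these signals differently):

```python
import time

class EventGate:
    """Hysteresis gate over an EMA-tracked background level (illustrative)."""
    def __init__(self, on_db=2.5, off_db=1.0, min_dbfs=-60.0,
                 alpha=0.05, hold_ms=300, refractory_ms=120, min_r_clip=0.18):
        self.on_db, self.off_db, self.min_dbfs = on_db, off_db, min_dbfs
        self.alpha, self.min_r_clip = alpha, min_r_clip
        self.hold_s, self.refr_s = hold_ms / 1000.0, refractory_ms / 1000.0
        self.bg, self.is_open, self.t_change = None, False, 0.0

    def update(self, level_dbfs, r_clip, has_peaks):
        now = time.monotonic()
        if self.bg is None:
            self.bg = level_dbfs
        diff = level_dbfs - self.bg
        if not self.is_open:                     # track background while closed
            self.bg += self.alpha * (level_dbfs - self.bg)
        if now - self.t_change < self.refr_s:    # refractory: hold current state
            return self.is_open
        if not self.is_open:
            if level_dbfs > self.min_dbfs and (
                    diff >= self.on_db or (has_peaks and r_clip > self.min_r_clip)):
                self.is_open, self.t_change = True, now
        elif (now - self.t_change >= self.hold_s
              and diff <= self.off_db and not has_peaks):
            self.is_open, self.t_change = False, now
        return self.is_open
```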
## Troubleshooting
### "Invalid number of channels" Error
**Problem**: Device reports 0 channels or PyAudio can't open it.
**Solution**:
1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`
Or use the helper script `run_onnx_stream.sh`.
### No Audio Detected
- Check microphone connections
- Verify device index with `--list-devices`
- Check audio levels (should be above `level_min_dbfs`)
- Adjust `level_delta_on_db` to be more sensitive
### GPU Not Used
- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model loading message
### Model Mismatch Errors
- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify config.yaml matches training configuration
### Poor DOA Accuracy
- Increase `--window-ms` for longer analysis windows (more stable)
- Adjust `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning
## Performance Tips
- **GPU Inference**: Use `onnxruntime-gpu` for 5-10x speedup
- **Window Size**: Larger windows (400ms) = more stable but higher latency
- **Hop Size**: Smaller hops (50ms) = more responsive but more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage
## Stopping the Script
Press **Ctrl+C** to stop the stream. The script will:
- Close the audio stream
- Close the visualization window
- Clean up resources
## Integration
To use this in your own code, see `onnx_doa_inference.py` which provides a standalone inference class that can be integrated into other projects.