# ONNX Real-Time DOA Streaming
Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real-time and displays detected sound source directions.
## Overview
The script performs the following process:
1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to get 4 channels
3. **Feature Extraction**: Computes STFT features (magnitude plus phase encoded as cosine and sine) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real-time
## Prerequisites
### Hardware
- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone)
  Expected microphone geometry (positions in meters):
  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```
- **NVIDIA GPU** (optional, for faster inference)
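As a quick sanity check, the mic angles listed above fall directly out of the (x, y) coordinates via `atan2`:

```python
import math

# Mic positions (x, y) in meters, from the hardware section above
positions = [(0.0277, 0.0), (0.0, 0.0277), (-0.0277, 0.0), (0.0, -0.0277)]
angles = [math.degrees(math.atan2(y, x)) % 360 for x, y in positions]
print(angles)  # ≈ [0.0, 90.0, 180.0, 270.0]
```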
### Software Dependencies
Install the required packages:
```bash
conda activate doaEnv
pip install onnxruntime-gpu # For GPU inference
# OR
pip install onnxruntime # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```
### ONNX Model
You need a converted ONNX model file. If you haven't converted your PyTorch model yet:
```bash
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```
## Quick Start
### 1. List Available Audio Devices
First, find your ReSpeaker device index:
```bash
python onnx_stream_microphone.py --list-devices
```
Look for a device named "ReSpeaker" or "Seeed", or one whose name contains "2886", and note its device index.
### 2. Stop PulseAudio (Required)
On Linux, PulseAudio often locks the ALSA devices. You need to temporarily stop it:
```bash
pulseaudio --kill
```
**Note**: You can use the helper script `run_onnx_stream.sh` which automates this (see below).
### 3. Run the Streaming Script
Basic usage:
```bash
python onnx_stream_microphone.py \
--onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
--device-index 9
```
### 4. Restart PulseAudio
After you're done, restart PulseAudio:
```bash
pulseaudio --start
```
## Using the Helper Script
A helper script automates PulseAudio management:
```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```
This script will:
1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)
## Command-Line Arguments
### Required Arguments
- `--onnx PATH`: Path to the ONNX model file
### Audio Configuration
- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size (stride between consecutive windows) in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit
### Model Configuration
- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)
### Histogram Detection Parameters
These control how DOA peaks are detected from the model logits:
- `--K INT`: Number of azimuth bins (default: 72, should match model)
- `--tau FLOAT`: Softmax temperature for histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)
### Event Gate Parameters
These control when detections are considered valid (filtering noise):
- `--level-delta-on-db FLOAT`: Level increase threshold to open gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep gate open after detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to open gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)
### Onset Detection Parameters
- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)
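The flux EMA drives the `FLUXz` value shown in the console output. A minimal sketch of how such a z-score could be tracked (the function and state names here are illustrative, not the script's actual API):

```python
import numpy as np

def flux_zscore(prev_mag, mag, state, alpha=0.05):
    """Half-wave-rectified spectral flux, normalized by EMA statistics.

    state carries the running mean/variance of the flux between calls.
    """
    flux = np.sum(np.maximum(mag - prev_mag, 0.0))
    mean = state.get("mean", flux)
    var = state.get("var", 1.0)
    z = (flux - mean) / np.sqrt(var + 1e-8)
    # EMA update, corresponding to the --onset-alpha parameter
    state["mean"] = (1 - alpha) * mean + alpha * flux
    state["var"] = (1 - alpha) * var + alpha * (flux - mean) ** 2
    return z
```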
## Example with Custom Parameters
```bash
python onnx_stream_microphone.py \
--onnx doa_model.onnx \
--device-index 9 \
--window-ms 400 \
--hop-ms 100 \
--K 72 \
--max-sources 2 \
--tau 0.8 \
--smooth-k 1 \
--min-peak-height 0.08 \
--min-window-mass 0.16 \
--min-sep-deg 22.5 \
--min-active-ratio 0.15 \
--level-delta-on-db 4.0 \
--level-delta-off-db 1.5 \
--level-min-dbfs -55.0 \
--level-ema-alpha 0.05 \
--event-hold-ms 320 \
--event-refractory-ms 200 \
--min-R-clip 0.30 \
--onset-alpha 0.05
```
## Understanding the Output
### Console Output
Each line shows:
```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```
- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from background (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees
### Visual Output
A polar plot window shows:
- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)
### Azimuth Convention
- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)
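With the default K=72, each histogram bin spans 5°. Assuming bin 0 is centered on the 0° (front) direction, bin indices map to azimuths as:

```python
def bin_to_azimuth(k, K=72):
    """Center azimuth in degrees of histogram bin k; 0° is the array front."""
    return (k * 360.0 / K) % 360.0
```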
## How It Works
### 1. Audio Processing Pipeline
```
Microphone (6 ch) → Channel Selection [1,4,3,2] → 4-channel audio
```
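In code, the selection step amounts to deinterleaving the raw PCM buffer and indexing the channel axis (the int16 sample format and function name are assumptions about the ReSpeaker's output):

```python
import numpy as np

# The script keeps channels [1, 4, 3, 2], in that order
CHANNEL_ORDER = [1, 4, 3, 2]

def select_channels(raw, n_channels=6):
    """Deinterleave a raw PCM buffer and pick the 4 model channels."""
    pcm = np.frombuffer(raw, dtype=np.int16).reshape(-1, n_channels)
    audio = pcm.astype(np.float32) / 32768.0  # normalize to [-1, 1)
    return audio[:, CHANNEL_ORDER]            # shape: (samples, 4)
```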
### 2. Feature Extraction
For each analysis window:
- Compute the STFT for all 4 channels
- Extract magnitude and phase components, encoding phase as cosine and sine
- Result: `(T_frames, 12_features, F_freq_bins)` (4 channels × 3 components each)
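A sketch of the per-window feature computation; the FFT and hop sizes here are placeholders, since the real values come from `config.yaml`:

```python
import numpy as np

def stft_features(audio, n_fft=512, hop=256):
    """Per-channel STFT, stacked as [magnitude, cos(phase), sin(phase)].

    audio: (samples, 4). Output: (T_frames, 12, F_bins), i.e.
    4 channels x 3 components = 12 feature maps per frame.
    """
    win = np.hanning(n_fft)
    n_frames = 1 + (audio.shape[0] - n_fft) // hop
    feats = []
    for ch in range(audio.shape[1]):
        frames = np.stack([audio[i * hop:i * hop + n_fft, ch] * win
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=-1)  # (T, F)
        feats += [np.abs(spec), np.cos(np.angle(spec)), np.sin(np.angle(spec))]
    return np.stack(feats, axis=1)           # (T, 12, F)
```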
### 3. Model Inference
- Batch-process features through the ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame yields K scores, one per azimuth bin, which the softmax turns into probabilities
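Batched inference could look like the following; the session would be created with `onnxruntime.InferenceSession(path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"])`, and the input name and batching details are assumptions:

```python
import numpy as np

def run_model(session, input_name, features, batch_size=25):
    """Push per-frame features through the model in fixed-size batches
    (the script uses batch_size=25 internally)."""
    logits = []
    for i in range(0, len(features), batch_size):
        batch = features[i:i + batch_size].astype(np.float32)
        out = session.run(None, {input_name: batch})
        logits.append(out[0])
    return np.concatenate(logits)  # (T_frames, K_bins)
```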
### 4. Histogram Aggregation
- Apply softmax with temperature `tau` to logits
- Weight by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
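The aggregation step can be sketched as follows; the exact coherence weighting in the script may differ, and `r_clip` is assumed here to be one weight per frame:

```python
import numpy as np

def aggregate_histogram(logits, r_clip, tau=0.8, smooth_k=1):
    """Temperature softmax per frame, coherence-weighted sum over frames,
    then circular smoothing of the resulting azimuth histogram."""
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)      # (T, K) per-frame posteriors
    hist = (r_clip[:, None] * p).sum(axis=0)
    hist /= hist.sum() + 1e-12             # normalize to unit mass
    if smooth_k > 1:
        kernel = np.ones(smooth_k) / smooth_k
        # pad with wrap-around because azimuth is circular
        padded = np.concatenate([hist[-smooth_k:], hist, hist[:smooth_k]])
        hist = np.convolve(padded, kernel, mode="same")[smooth_k:-smooth_k]
    return hist
```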
### 5. Peak Detection
- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks
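A simplified sketch of circular peak picking with parabolic refinement (the real script additionally validates each peak's window mass):

```python
import numpy as np

def find_peaks_circular(hist, min_height=0.10, min_sep_deg=20.0, max_sources=3):
    """Greedy peak picking on a circular histogram; returns azimuths in degrees."""
    K = len(hist)
    min_sep = int(round(min_sep_deg * K / 360.0))
    order = np.argsort(hist)[::-1]  # bins by descending mass
    bins, angles = [], []
    for k in order:
        if hist[k] < min_height or len(bins) >= max_sources:
            break
        if hist[k] < hist[(k - 1) % K] or hist[k] < hist[(k + 1) % K]:
            continue  # not a local maximum
        dist = [min((k - p) % K, (p - k) % K) for p in bins]
        if any(d < min_sep for d in dist):
            continue  # too close to an already accepted peak
        # parabolic interpolation for sub-bin accuracy
        y0, y1, y2 = hist[(k - 1) % K], hist[k], hist[(k + 1) % K]
        denom = y0 - 2 * y1 + y2
        delta = 0.5 * (y0 - y2) / denom if abs(denom) > 1e-12 else 0.0
        bins.append(k)
        angles.append(((k + delta) * 360.0 / K) % 360.0)
    return angles
```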
### 6. Event Gating
- Track audio level with exponential moving average
- Open gate when:
- Level increases by `level_delta_on_db` OR
- Valid peaks detected AND R_clip > `min_R_clip`
- Close gate when level drops and no valid peaks
- Apply hold and refractory periods to prevent flickering
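The gating logic above can be sketched as a small state machine; the class name and the exact combination of conditions are illustrative, and the real script also folds in the onset flux:

```python
import time

class EventGate:
    """Hysteresis gate with hold and refractory periods (sketch)."""

    def __init__(self, on_db=2.5, off_db=1.0, hold_ms=300, refractory_ms=120,
                 min_r_clip=0.18, alpha=0.05):
        self.on_db, self.off_db = on_db, off_db
        self.hold = hold_ms / 1000.0
        self.refractory = refractory_ms / 1000.0
        self.min_r = min_r_clip
        self.alpha = alpha
        self.bg = None            # EMA of background level (dBFS)
        self.open = False
        self.last_change = -1e9
        self.opened_at = -1e9

    def update(self, level_db, r_clip, has_peaks, now=None):
        now = time.monotonic() if now is None else now
        if self.bg is None:
            self.bg = level_db
        diff = level_db - self.bg
        # only track the background while the gate is closed
        if not self.open:
            self.bg = (1 - self.alpha) * self.bg + self.alpha * level_db
        if now - self.last_change >= self.refractory:
            if not self.open and (diff > self.on_db
                                  or (has_peaks and r_clip > self.min_r)):
                self.open, self.last_change, self.opened_at = True, now, now
            elif (self.open and diff < self.off_db and not has_peaks
                  and now - self.opened_at >= self.hold):
                self.open, self.last_change = False, now
        return self.open
```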
## Troubleshooting
### "Invalid number of channels" Error
**Problem**: Device reports 0 channels or PyAudio can't open it.
**Solution**:
1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`
Or use the helper script `run_onnx_stream.sh`.
### No Audio Detected
- Check microphone connections
- Verify device index with `--list-devices`
- Check audio levels (should be above `level_min_dbfs`)
- Adjust `level_delta_on_db` to be more sensitive
### GPU Not Used
- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model loading message
### Model Mismatch Errors
- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify config.yaml matches training configuration
### Poor DOA Accuracy
- Increase `--window-ms` for longer analysis windows (more stable)
- Adjust `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check microphone array calibration and positioning
## Performance Tips
- **GPU Inference**: Use `onnxruntime-gpu` for 5-10x speedup
- **Window Size**: Larger windows (400ms) = more stable but higher latency
- **Hop Size**: Smaller hops (50ms) = more responsive but more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage
## Stopping the Script
Press **Ctrl+C** to stop the stream. The script will:
- Close the audio stream
- Close the visualization window
- Clean up resources
## Integration
To use this in your own code, see `onnx_doa_inference.py` which provides a standalone inference class that can be integrated into other projects.