# ONNX Real-Time DOA Streaming

Real-time Direction of Arrival (DOA) detection using an ONNX model with microphone streaming. This script processes audio from a multi-channel microphone array (ReSpeaker) in real time and displays the detected sound source directions.
## Overview

The script performs the following steps:

1. **Audio Capture**: Streams audio from a 6-channel microphone array (ReSpeaker)
2. **Channel Selection**: Selects and reorders channels `[1, 4, 3, 2]` to obtain 4 microphone channels
3. **Feature Extraction**: Computes STFT features (magnitude, cosine and sine of phase) from the audio
4. **ONNX Inference**: Runs the DOA model on GPU (CUDA) or CPU to get per-frame logits
5. **Histogram Aggregation**: Aggregates logits into a circular histogram of azimuth angles
6. **Peak Detection**: Finds peaks in the histogram to identify sound source directions
7. **Event Gating**: Filters detections based on audio level changes and coherence
8. **Visualization**: Displays detected directions on a polar plot in real time
## Prerequisites

### Hardware

- **ReSpeaker 6-Mic Array** (or compatible multi-channel microphone) with the following geometry:

  ```yaml
  microphone:
    positions:
      - [0.0277, 0.0]   # Mic 0: 0°
      - [0.0, 0.0277]   # Mic 1: 90°
      - [-0.0277, 0.0]  # Mic 2: 180°
      - [0.0, -0.0277]  # Mic 3: 270°
  ```

- **NVIDIA GPU** (optional, for faster inference)
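As a sanity check on the geometry, the mic coordinates can be converted back to angles and radii with a few lines of NumPy (`positions` is copied from the YAML above):

```python
import numpy as np

# Mic coordinates copied from the YAML geometry above (meters, x-y plane)
positions = np.array([
    [0.0277, 0.0],    # Mic 0
    [0.0, 0.0277],    # Mic 1
    [-0.0277, 0.0],   # Mic 2
    [0.0, -0.0277],   # Mic 3
])

# Recover each mic's azimuth (counter-clockwise from +x) and radius
angles_deg = np.degrees(np.arctan2(positions[:, 1], positions[:, 0])) % 360
radius_m = np.linalg.norm(positions, axis=1)
# angles_deg ≈ [0, 90, 180, 270]; radius_m ≈ 0.0277 for every mic
```

All four mics sit on a 2.77 cm radius circle at 90° spacing, matching the comments in the config.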
### Software Dependencies

Install the required packages:

```bash
conda activate doaEnv
pip install onnxruntime-gpu  # For GPU inference
# OR
pip install onnxruntime     # For CPU-only inference
pip install pyaudio numpy matplotlib torch pyyaml
```
### ONNX Model

You need a converted ONNX model file. If you haven't converted your PyTorch model yet:

```bash
python convert_to_onnx.py --checkpoint models/basic/2025-11-06_22-37-00-6a5fbc92/last.pt --output models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx
```
## Quick Start

### 1. List Available Audio Devices

First, find your ReSpeaker device index:

```bash
python onnx_stream_microphone.py --list-devices
```

Look for a device whose name contains "ReSpeaker", "Seeed", or "2886", and note its index.
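If you script the device search, a small matcher for those name hints can help (an illustrative helper, not part of `onnx_stream_microphone.py`):

```python
def looks_like_respeaker(device_name: str) -> bool:
    """Case-insensitive check for the ReSpeaker name hints above."""
    name = device_name.lower()
    return any(key in name for key in ("respeaker", "seeed", "2886"))
```

You would call this on each device's `name` field while iterating over PyAudio's device infos.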
### 2. Stop PulseAudio (Required)

On Linux, PulseAudio often locks the ALSA devices. You need to stop it temporarily:

```bash
pulseaudio --kill
```

**Note**: The helper script `run_onnx_stream.sh` automates this (see below).
### 3. Run the Streaming Script

Basic usage:

```bash
python onnx_stream_microphone.py \
    --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx \
    --device-index 9
```

### 4. Restart PulseAudio (After Stopping)

When you're done, restart PulseAudio:

```bash
pulseaudio --start
```
## Using the Helper Script

A helper script automates PulseAudio management:

```bash
chmod +x run_onnx_stream.sh
./run_onnx_stream.sh --onnx models/basic/2025-11-06_22-37-00-6a5fbc92/last.onnx --device-index 9
```

This script will:

1. Stop PulseAudio
2. Run the streaming script
3. Restart PulseAudio when you exit (Ctrl+C)
## Command-Line Arguments

### Required Arguments

- `--onnx PATH`: Path to the ONNX model file

### Audio Configuration

- `--device-index INT`: Audio device index (use `--list-devices` to find it)
- `--sample-rate INT`: Sample rate in Hz (default: 16000)
- `--window-ms INT`: Analysis window length in milliseconds (default: 200)
- `--hop-ms INT`: Hop size (stride between consecutive windows) in milliseconds (default: 100)
- `--chunk-size INT`: Audio buffer chunk size (default: 1600)
- `--cpu-only`: Use CPU only (disable GPU inference)
- `--list-devices`: List all available audio input devices and exit

### Model Configuration

- `--config PATH`: Path to config.yaml (default: `configs/train.yaml`)

### Histogram Detection Parameters

These control how DOA peaks are detected from the model logits:

- `--K INT`: Number of azimuth bins (default: 72; must match the model)
- `--tau FLOAT`: Softmax temperature for the histogram (default: 0.8)
- `--smooth-k INT`: Histogram smoothing kernel size (default: 1)
- `--min-peak-height FLOAT`: Minimum peak height threshold (default: 0.10)
- `--min-window-mass FLOAT`: Minimum window mass for peak validation (default: 0.24)
- `--min-sep-deg FLOAT`: Minimum angular separation between peaks in degrees (default: 20.0)
- `--min-active-ratio FLOAT`: Minimum active frame ratio (default: 0.20)
- `--max-sources INT`: Maximum number of sources to detect (default: 3)

### Event Gate Parameters

These control when detections are considered valid (filtering noise):

- `--level-delta-on-db FLOAT`: Level increase threshold to open the gate (default: 2.5)
- `--level-delta-off-db FLOAT`: Level decrease threshold to close the gate (default: 1.0)
- `--level-min-dbfs FLOAT`: Minimum audio level in dBFS (default: -60.0)
- `--level-ema-alpha FLOAT`: Exponential moving average alpha for level tracking (default: 0.05)
- `--event-hold-ms INT`: Minimum time to keep the gate open after a detection (default: 300)
- `--min-R-clip FLOAT`: Minimum R_clip (coherence measure) to open the gate (default: 0.18)
- `--event-refractory-ms INT`: Minimum time between gate state changes (default: 120)

### Onset Detection Parameters

- `--onset-alpha FLOAT`: EMA alpha for spectral flux tracking (default: 0.05)
## Example with Custom Parameters

```bash
python onnx_stream_microphone.py \
    --onnx doa_model.onnx \
    --device-index 9 \
    --window-ms 400 \
    --hop-ms 100 \
    --K 72 \
    --max-sources 2 \
    --tau 0.8 \
    --smooth-k 1 \
    --min-peak-height 0.08 \
    --min-window-mass 0.16 \
    --min-sep-deg 22.5 \
    --min-active-ratio 0.15 \
    --level-delta-on-db 4.0 \
    --level-delta-off-db 1.5 \
    --level-min-dbfs -55.0 \
    --level-ema-alpha 0.05 \
    --event-hold-ms 320 \
    --event-refractory-ms 200 \
    --min-R-clip 0.30 \
    --onset-alpha 0.05
```
## Understanding the Output

### Console Output

Each line shows:

```
[ 12.34s] LVL= -45.2 dBFS diff=+3.5 | FLUXz=2.10 COH=0.75 | GATE=OPEN | MODEL= 12.3ms HIST= 2.1ms | DOA(R=0.45, n=2) [45°, 180°]
```

- `[time]`: Elapsed time in seconds
- `LVL`: Audio level in dBFS
- `diff`: Level difference from the background level (dB)
- `FLUXz`: Spectral flux z-score (onset detection)
- `COH`: Inter-microphone coherence
- `GATE`: Gate state (OPEN/CLOSED)
- `MODEL`: Model inference time (ms)
- `HIST`: Histogram processing time (ms)
- `DOA(R=..., n=...)`: R_clip value and number of detected peaks
- `[angles]`: Detected azimuth angles in degrees
### Visual Output

A polar plot window shows:

- **Green lines**: Detected sound source directions
- **Line thickness**: Proportional to the confidence score
- **Angle labels**: Azimuth in degrees (0° = North/front)

### Azimuth Convention

- **0°** = North (front of microphone)
- **90°** = East (right)
- **180°** = South (back)
- **270°** = West (left)
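In code, this compass-style convention (0° at the front, increasing clockwise) maps to x-y coordinates with +y pointing front and +x pointing right; a minimal sketch:

```python
import math

def azimuth_to_xy(azimuth_deg: float) -> tuple[float, float]:
    """Map the compass azimuth above (0° = front, 90° = right)
    to (x, y), with +y pointing front and +x pointing right."""
    a = math.radians(azimuth_deg)
    return (math.sin(a), math.cos(a))
```

For example, `azimuth_to_xy(0.0)` points straight ahead along +y, and `azimuth_to_xy(90.0)` points right along +x.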
## How It Works

### 1. Audio Processing Pipeline

```
Microphone (6 ch) → Channel Selection [1, 4, 3, 2] → 4-channel audio
```
### 2. Feature Extraction

For each analysis window:

- Compute the STFT for all 4 channels
- Extract the magnitude and the cosine and sine of the phase per channel
- Result: `(T_frames, 12_features, F_freq_bins)` (4 channels × 3 components)
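The feature layout can be sketched with NumPy as follows; `n_fft` and `hop` here are illustrative values, not necessarily the ones the script derives from its config:

```python
import numpy as np

def extract_features(audio: np.ndarray, n_fft: int = 512, hop: int = 160) -> np.ndarray:
    """STFT feature sketch for a (4, n_samples) audio window.

    Per channel we keep magnitude, cos(phase), and sin(phase),
    giving 4 channels x 3 components = 12 features per frame.
    """
    chans = []
    window = np.hanning(n_fft)
    for ch in audio:
        # Frame the signal and apply the analysis window
        n_frames = 1 + (len(ch) - n_fft) // hop
        frames = np.stack([ch[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=-1)   # (T, F) complex spectrum
        phase = np.angle(spec)
        chans.extend([np.abs(spec), np.cos(phase), np.sin(phase)])
    # Stack to (T_frames, 12, F_freq_bins)
    return np.stack(chans, axis=1)
```

A 200 ms window at 16 kHz (3200 samples, 4 channels) yields a `(17, 12, 257)` array with these settings.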
### 3. Model Inference

- Batch-process features through the ONNX model
- Output: `(T_frames, K_bins)` logits per frame
- Each frame gets K scores, one per azimuth bin
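The batching can be sketched independently of the model. Here `infer` is a stand-in for an onnxruntime session call (e.g. `lambda x: sess.run(None, {"input": x})[0]`, with the input name assumed), and `batch_size=25` mirrors the internal batch size noted under Performance Tips:

```python
import numpy as np

def run_batched(infer, feats: np.ndarray, batch_size: int = 25) -> np.ndarray:
    """Run (T, 12, F) frame features through a model in fixed-size batches.

    `infer` maps a (B, 12, F) batch to (B, K) logits; the outputs are
    concatenated back into a single (T, K) array.
    """
    outs = []
    for start in range(0, len(feats), batch_size):
        outs.append(infer(feats[start:start + batch_size]))
    return np.concatenate(outs, axis=0)
```

Batching keeps GPU utilization high without waiting for all frames before the first inference call.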
### 4. Histogram Aggregation

- Apply softmax with temperature `tau` to the logits
- Weight frames by circular coherence (R_clip)
- Aggregate across all frames into a single histogram
- Smooth the histogram
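These four steps can be condensed into one function; this is a sketch of the procedure, not the script's exact code, and the optional `weights` argument stands in for the per-frame coherence weighting:

```python
import numpy as np

def aggregate_histogram(logits, tau=0.8, smooth_k=1, weights=None):
    """Aggregate per-frame logits (T, K) into one circular histogram (K,)."""
    z = logits / tau
    z -= z.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)      # per-frame softmax
    if weights is None:
        weights = np.ones(len(logits))     # e.g. per-frame coherence
    hist = (weights[:, None] * p).sum(axis=0)
    hist /= hist.sum()
    if smooth_k > 0:
        # Pad circularly so the smoothing wraps around 0°/360°
        kernel = np.ones(2 * smooth_k + 1) / (2 * smooth_k + 1)
        padded = np.concatenate([hist[-smooth_k:], hist, hist[:smooth_k]])
        hist = np.convolve(padded, kernel, mode="valid")
    return hist
```

The circular padding matters: a source near 0° should contribute mass to bins on both sides of the wrap-around.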
### 5. Peak Detection

- Find local maxima in the histogram
- Filter by minimum height, separation, and window mass
- Refine peak positions using parabolic interpolation
- Return up to `max_sources` peaks
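A minimal version of this selection (omitting the window-mass and active-ratio checks for brevity; defaults match the CLI values) might look like:

```python
import numpy as np

def find_peaks_circular(hist, min_height=0.10, min_sep_deg=20.0, max_sources=3):
    """Pick up to max_sources local maxima from a circular histogram (K,)."""
    K = len(hist)
    bin_deg = 360.0 / K
    # Local maxima with circular neighbours
    is_peak = (hist > np.roll(hist, 1)) & (hist >= np.roll(hist, -1))
    cands = [i for i in np.argsort(hist)[::-1]
             if is_peak[i] and hist[i] >= min_height]
    peaks = []
    for i in cands:
        # Parabolic refinement from the three bins around the peak
        y0, y1, y2 = hist[(i - 1) % K], hist[i], hist[(i + 1) % K]
        denom = y0 - 2 * y1 + y2
        offset = 0.5 * (y0 - y2) / denom if abs(denom) > 1e-12 else 0.0
        angle = ((i + offset) * bin_deg) % 360.0
        # Enforce a minimum circular separation from already-kept peaks
        sep = lambda a, b: min(abs(a - b), 360.0 - abs(a - b))
        if all(sep(angle, p) >= min_sep_deg for p in peaks):
            peaks.append(angle)
        if len(peaks) == max_sources:
            break
    return peaks
```

The parabolic step fits a parabola through a peak bin and its two neighbours, giving sub-bin (here sub-5°) angular resolution.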
### 6. Event Gating

- Track the audio level with an exponential moving average
- Open the gate when:
  - the level increases by `level_delta_on_db`, OR
  - valid peaks are detected AND R_clip > `min_R_clip`
- Close the gate when the level drops and there are no valid peaks
- Apply hold and refractory periods to prevent flickering
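The level-based part of the gate reduces to a hysteresis comparator around an EMA background; a sketch under the CLI defaults, with the hold/refractory timing and peak/coherence conditions omitted for brevity:

```python
class LevelGate:
    """Hysteresis gate on the audio level in dBFS (level-tracking sketch)."""

    def __init__(self, delta_on_db=2.5, delta_off_db=1.0, alpha=0.05):
        self.delta_on = delta_on_db
        self.delta_off = delta_off_db
        self.alpha = alpha          # EMA coefficient for the background
        self.background = None
        self.open = False

    def update(self, level_dbfs: float) -> bool:
        if self.background is None:
            self.background = level_dbfs
        diff = level_dbfs - self.background
        # Hysteresis: a higher threshold to open than to stay open
        if self.open:
            self.open = diff > self.delta_off
        else:
            self.open = diff > self.delta_on
        # Track the background only while closed, so loud events
        # do not drag the reference level upward
        if not self.open:
            self.background += self.alpha * (level_dbfs - self.background)
        return self.open
```

The asymmetric on/off thresholds are what keep the gate from flickering on levels that hover near a single threshold.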
## Troubleshooting

### "Invalid number of channels" Error

**Problem**: The device reports 0 channels or PyAudio can't open it.

**Solution**:

1. Stop PulseAudio: `pulseaudio --kill`
2. Run the script
3. Restart PulseAudio: `pulseaudio --start`

Or use the helper script `run_onnx_stream.sh`.
### No Audio Detected

- Check the microphone connections
- Verify the device index with `--list-devices`
- Check the audio levels (they should be above `level_min_dbfs`)
- Lower `level_delta_on_db` to make detection more sensitive
### GPU Not Used

- Verify CUDA is available: `python -c "import torch; print(torch.cuda.is_available())"`
- Install `onnxruntime-gpu` instead of `onnxruntime`
- Check that CUDA providers are listed in the model-loading message
### Model Mismatch Errors

- Ensure `--K` matches the model's K value (usually 72)
- Check that the ONNX model was exported with the correct input shape
- Verify that config.yaml matches the training configuration
### Poor DOA Accuracy

- Increase `--window-ms` for longer, more stable analysis windows
- Adjust the `--min-peak-height` and `--min-window-mass` thresholds
- Tune `--tau` (lower = sharper peaks, higher = smoother)
- Check the microphone array calibration and positioning
## Performance Tips

- **GPU Inference**: Use `onnxruntime-gpu` for a 5-10x speedup
- **Window Size**: Larger windows (400 ms) are more stable but add latency
- **Hop Size**: Smaller hops (50 ms) are more responsive but cost more computation
- **Batch Size**: The script uses batch_size=25 internally for efficient GPU usage
## Stopping the Script

Press **Ctrl+C** to stop the stream. The script will:

- Close the audio stream
- Close the visualization window
- Clean up resources

## Integration

To use this in your own code, see `onnx_doa_inference.py`, which provides a standalone inference class that can be integrated into other projects.