---
title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit
---

# Multi-Talker Audio Source Separation

Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:

- speaker count
- direction of arrival (DoA)
- gender estimate from F0
- per-speaker transcription
- talker-of-interest (ToI) selection

## Quick Start

```bash
# install dependencies
uv sync

# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica
```

ASR is optional. To enable Whisper transcription:

```bash
uv sync --extra asr
```

## CLI

```bash
uv run python main.py [options]
```

Core options:

- `-a, --approach {ica,frankenstein,ica_deeplearning}`
- `-o, --output `
- `-w, --whisper-model {tiny,base,small,medium,large}`
- `--hf-token ` (only relevant for `ica_deeplearning`)
- `-v, --verbose`

Example runs:

```bash
uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl
```

## Approaches (Current Status)

- `ica`
  - FastICA separation with fixed 4 sources
  - DoA from ICA mixing matrix
  - ToI uses weighted scoring (front/language/energy/gender)
- `frankenstein`
  - FastICA separation with fixed 4 sources
  - language-priority ToI policy (strong English bonus)
  - Does not use DoA for final ToI decision
- `ica_deeplearning`
  - pass 1: PCA + ICA (source count from variance threshold)
  - pass 2 deep stage is currently simplified/placeholder in code
  - useful as experimental variant, not full deep overlap resolution yet

## Output

Each run writes to the chosen output directory:

```text
/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav   (or fewer/more for ica_deeplearning)
  output.wav
  results.json
```

`results.json` contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.

## Microphone Geometry

Channel mapping expected by the pipeline:

- channel 0: Left Front (LF)
- channel 1: Left Rear (LR)
- channel 2: Right Front (RF)
- channel 3: Right Rear (RR)

Orientation convention:

- `0°` front
- `90°` right
- `180°` rear
- `270°` left

## Project Layout

```text
.
  main.py
  approaches/
  pipeline_modules/
  scripts/
  tests/
  archive_solution/
  data/
  output/
  docs/
```

- `main.py`: canonical entrypoint
- `approaches/`: pipeline variants
- `pipeline_modules/`: shared logic (audio loading, DoA, gender, ASR, ToI)
- `scripts/`: utility/analysis scripts not required for the main run
- `tests/`: lightweight validation scripts

## Utility Commands

Run benchmark across all approaches:

```bash
uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results
```

Run lightweight checks:

```bash
uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py
```

## Notes and Limitations

- Input must be a 4-channel WAV file.
- If Whisper is not installed, transcription is skipped (pipeline still runs).
- F0-based gender can return `unknown` when voiced frames are insufficient.
- `ica_deeplearning` pass 2 is not fully implemented yet.

## Docs

- `docs/pipeline-details.md`
- `docs/lazy-imports-fix.md`
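
## Example: Checking an Input File Before a Run

The pipeline expects a 4-channel WAV in LF, LR, RF, RR channel order and reports directions using the `0°` front / `90°` right / `180°` rear / `270°` left convention. The snippet below is a minimal sketch of a pre-flight check, not code from this repository. The helper names `check_four_channel_wav` and `describe_direction` are hypothetical, and snapping an angle to the nearest of the four named directions is an assumption made here for illustration; the pipeline's own loader and DoA code may behave differently.

```python
import wave

# Channel order taken from the Microphone Geometry section.
CHANNEL_LABELS = ("LF", "LR", "RF", "RR")

# Named directions at 0, 90, 180 and 270 degrees, in that order.
DIRECTION_NAMES = ("front", "right", "rear", "left")


def check_four_channel_wav(path):
    """Return (sample_rate, n_frames) for a 4-channel WAV, else raise ValueError.

    Hypothetical helper: it only inspects the WAV header and does not
    verify that the channels really follow the LF, LR, RF, RR order.
    """
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()
        if n_channels != len(CHANNEL_LABELS):
            raise ValueError(
                f"expected {len(CHANNEL_LABELS)} channels, found {n_channels}"
            )
        return wav.getframerate(), wav.getnframes()


def describe_direction(angle_deg):
    """Snap an angle in degrees to the nearest of front/right/rear/left.

    Assumption for illustration: each name covers a 90-degree sector
    centred on 0, 90, 180 or 270 degrees.
    """
    sector = int(((angle_deg % 360) + 45) // 90) % 4
    return DIRECTION_NAMES[sector]
```

A call such as `check_four_channel_wav("data/mixture.wav")` could be placed ahead of a `main.py` run to fail early on mono or stereo files. The standard-library `wave` module reads uncompressed PCM only, so files in other encodings raise `wave.Error` rather than `ValueError`.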