metadata
title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit
Multi-Talker Audio Source Separation
Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:
- speaker count
- direction of arrival (DoA)
- gender estimate from F0
- per-speaker transcription
- talker-of-interest (ToI) selection
Quick Start
# install dependencies
uv sync
# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica
ASR is optional. To enable Whisper transcription:
uv sync --extra asr
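Because ASR is an optional extra, the pipeline has to tolerate a missing Whisper install. The following is a minimal sketch of that pattern, not the project's actual code. It assumes the extra provides the `whisper` module from the openai-whisper package; `asr_available` and `transcribe_or_skip` are hypothetical helper names.

```python
import importlib.util


def asr_available(module_name: str = "whisper") -> bool:
    """Return True if the optional ASR module can be imported."""
    # Assumption: the ASR extra installs a top-level module called "whisper".
    return importlib.util.find_spec(module_name) is not None


def transcribe_or_skip(wav_path, model_name: str = "base"):
    """Transcribe a file if Whisper is installed, otherwise return None."""
    if not asr_available():
        return None  # transcription is skipped, the rest of the run continues
    import whisper  # imported lazily so the base install never needs it

    model = whisper.load_model(model_name)
    return model.transcribe(str(wav_path))["text"]


print("ASR available:", asr_available())
```

Deferring the import to the point of use keeps startup fast and keeps the base install free of the heavier ASR dependencies.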
CLI
uv run python main.py <input_wav> [options]
Core options:
- -a, --approach {ica,frankenstein,ica_deeplearning}
- -o, --output <dir>
- -w, --whisper-model {tiny,base,small,medium,large}
- --hf-token <token> (only relevant for ica_deeplearning)
- -v, --verbose
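For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented flags, not a copy of `main.py`; the defaults for `--output` and `--whisper-model` are assumptions, and only the `ica` default for `--approach` comes from the Quick Start.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented CLI options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_wav", help="path to the 4-channel input WAV")
    parser.add_argument(
        "-a", "--approach",
        choices=["ica", "frankenstein", "ica_deeplearning"],
        default="ica",  # the Quick Start describes ica as the default approach
    )
    # Assumption: the real default output directory may differ.
    parser.add_argument("-o", "--output", metavar="<dir>", default="output")
    parser.add_argument(
        "-w", "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",  # assumption: the real default is not documented here
    )
    parser.add_argument("--hf-token", metavar="<token>", default=None)
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["data/mixture.wav", "--approach", "frankenstein", "-o", "output/frankenstein"]
)
print(args.approach, args.output, args.verbose)
```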
Example runs:
uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl
Approaches (Current Status)
ica
- FastICA separation with fixed 4 sources
- DoA from ICA mixing matrix
- ToI uses weighted scoring (front/language/energy/gender)
frankenstein
- FastICA separation with fixed 4 sources
- language-priority ToI policy (strong English bonus)
- does not use DoA for final ToI decision
ica_deeplearning
- pass 1: PCA + ICA (source count from variance threshold)
- pass 2: the deep-learning stage is currently a simplified placeholder in the code
- useful as an experimental variant; it does not yet perform full deep overlap resolution
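To make the weighted ToI scoring of the `ica` approach more concrete, here is a small sketch of how the four terms (front/language/energy/gender) might be combined. Everything specific in it is an assumption for illustration: the `Source` fields, the weight values, and the way each term is computed are hypothetical and may not match the repository.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    doa_deg: float  # 0 = front, 90 = right, 180 = rear, 270 = left
    language: str   # e.g. "en"
    energy: float   # relative energy, assumed to be scaled to [0, 1]
    gender: str     # assumed labels: "male", "female" or "unknown"


# Hypothetical weights; the real pipeline may use different values.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


def frontness(doa_deg: float) -> float:
    """Return 1.0 for a source straight ahead and 0.0 for one directly behind."""
    offset = abs((doa_deg + 180.0) % 360.0 - 180.0)  # distance from 0 degrees
    return 1.0 - offset / 180.0


def toi_score(src: Source, target_language: str = "en") -> float:
    """Weighted sum of the four terms (each term lies in [0, 1])."""
    terms = {
        "front": frontness(src.doa_deg),
        "language": 1.0 if src.language == target_language else 0.0,
        "energy": max(0.0, min(1.0, src.energy)),
        # Assumption: a known gender estimate counts as evidence of voiced speech.
        "gender": 0.0 if src.gender == "unknown" else 1.0,
    }
    return sum(WEIGHTS[key] * value for key, value in terms.items())


def select_toi(sources: list[Source]) -> Source:
    """Pick the source with the highest score as the talker of interest."""
    return max(sources, key=toi_score)


candidates = [
    Source("source_1", doa_deg=0.0, language="en", energy=0.5, gender="female"),
    Source("source_2", doa_deg=180.0, language="de", energy=1.0, gender="unknown"),
]
print(select_toi(candidates).name)
```

Under this framing, a language-priority policy like the one in `frankenstein` would correspond to dropping the `front` term and giving the language term a dominant weight.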
Output
Each run writes to the chosen output directory:
<output_dir>/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav (or fewer/more for ica_deeplearning)
  output.wav
  results.json
results.json contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
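A downstream script would typically load `results.json` and look up the selected talker. The sketch below shows one way to do that. The JSON schema is not documented in this README, so the key names used here (`sources`, `id`, `is_toi`, `transcript`) are hypothetical and must be adapted to the actual file; the demo writes its own sample file to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def load_toi(results_path):
    """Return the entry flagged as talker of interest, or None if none is flagged."""
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    # Assumption: sources are stored in a list under "sources" with an "is_toi" flag.
    return next((src for src in data["sources"] if src.get("is_toi")), None)


# Illustrative sample only; not real pipeline output.
sample = {
    "sources": [
        {"id": "source_1", "direction": 90, "transcript": "", "is_toi": False},
        {"id": "source_2", "direction": 0, "transcript": "hello", "is_toi": True},
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    toi = load_toi(path)

print(toi["id"], toi["transcript"])
```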
Microphone Geometry
Channel mapping expected by the pipeline:
- channel 0: Left Front (LF)
- channel 1: Left Rear (LR)
- channel 2: Right Front (RF)
- channel 3: Right Rear (RR)
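Since the channel mapping is fixed, it can be useful to check a recording before running the pipeline. The following sketch uses only the standard-library `wave` module and assumes a plain PCM WAV; `check_channels` and `write_silent_wav` are hypothetical helpers, not functions from this repository.

```python
import tempfile
import wave
from pathlib import Path

# Channel order expected by the pipeline (index -> microphone).
CHANNEL_NAMES = ("LF", "LR", "RF", "RR")


def check_channels(path) -> dict:
    """Return the index-to-microphone mapping, or raise if the file is not 4-channel."""
    with wave.open(str(path), "rb") as wav:
        n_channels = wav.getnchannels()
    if n_channels != len(CHANNEL_NAMES):
        raise ValueError(f"expected {len(CHANNEL_NAMES)} channels, got {n_channels}")
    return dict(enumerate(CHANNEL_NAMES))


def write_silent_wav(path, channels: int, frames: int = 160) -> None:
    """Write a short silent 16-bit PCM file (used only for the demonstration)."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(16000)  # assumption: the sample rate is arbitrary here
        wav.writeframes(b"\x00" * (channels * 2 * frames))


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.wav"
    write_silent_wav(demo, channels=4)
    mapping = check_channels(demo)

print(mapping)
```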
Orientation convention:
- 0°: front
- 90°: right
- 180°: rear
- 270°: left
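The orientation convention can be turned into a small helper that maps any DoA angle to the nearest of the four labels. This is an illustrative sketch; splitting the circle into four 90° sectors centred on 0°, 90°, 180° and 270° is an assumption, and `direction_label` is a hypothetical name.

```python
LABELS = ("front", "right", "rear", "left")  # sectors centred on 0, 90, 180, 270 degrees


def direction_label(angle_deg: float) -> str:
    """Map an angle in degrees (any range) to the nearest cardinal label."""
    wrapped = angle_deg % 360.0                # normalise to [0, 360)
    sector = int((wrapped + 45.0) // 90.0) % 4  # shift so each sector is centred
    return LABELS[sector]


for angle in (0, 100, 180, -90, 350):
    print(angle, direction_label(angle))
```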
Project Layout
.
main.py
approaches/
pipeline_modules/
scripts/
tests/
archive_solution/
data/
output/
docs/
- main.py: canonical entrypoint
- approaches/: pipeline variants
- pipeline_modules/: shared logic (audio loading, DoA, gender, ASR, ToI)
- scripts/: utility/analysis scripts not required for the main run
- tests/: lightweight validation scripts
Utility Commands
Run benchmark across all approaches:
uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results
Run lightweight checks:
uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py
Notes and Limitations
- Input must be a 4-channel WAV file.
- If Whisper is not installed, transcription is skipped (pipeline still runs).
- F0-based gender can return unknown when voiced frames are insufficient.
- ica_deeplearning pass 2 is not fully implemented yet.
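The `unknown` case for F0-based gender can be illustrated with a short sketch. It assumes per-frame F0 values are already available (with 0 or `None` marking unvoiced frames) and classifies by the median F0 of the voiced frames. The minimum frame count, the 165 Hz threshold, and the `male`/`female` labels are assumptions for illustration, not values taken from the pipeline.

```python
from statistics import median


def gender_from_f0(f0_hz, min_voiced_frames: int = 20, threshold_hz: float = 165.0) -> str:
    """Estimate gender from per-frame F0 values, or return "unknown"."""
    # Keep only voiced frames; unvoiced frames are assumed to be 0 or None.
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if len(voiced) < min_voiced_frames:
        return "unknown"  # too little voiced speech for a stable estimate
    # Hypothetical decision rule: compare the median F0 against a fixed threshold.
    return "male" if median(voiced) < threshold_hz else "female"


print(gender_from_f0([120.0] * 30))               # enough voiced frames, low F0
print(gender_from_f0([220.0] * 30))               # enough voiced frames, high F0
print(gender_from_f0([120.0] * 5 + [0.0] * 100))  # mostly unvoiced
```

Using the median rather than the mean makes the estimate less sensitive to octave errors in individual frames.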
Docs
- docs/pipeline-details.md
- docs/lazy-imports-fix.md