title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit

Multi-Talker Audio Source Separation

Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:

  • speaker count
  • direction of arrival (DoA)
  • gender estimate from F0
  • per-speaker transcription
  • talker-of-interest (ToI) selection

Quick Start

# install dependencies
uv sync

# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica

ASR is optional. To enable Whisper transcription:

uv sync --extra asr
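
Because ASR is an optional extra, code that transcribes has to cope with Whisper being absent. The sketch below shows one way to do that with a lazy import. It assumes the extra installs the `openai-whisper` package (module name `whisper`); the helper names `asr_available` and `transcribe_or_skip` are hypothetical and may not match the project's own code.

```python
import importlib.util
from typing import Optional


def asr_available(module_name: str = "whisper") -> bool:
    """Return True if the optional ASR module can be imported."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


def transcribe_or_skip(wav_path: str, model_size: str = "base") -> Optional[str]:
    """Transcribe wav_path with Whisper, or return None if Whisper is missing."""
    if not asr_available():
        return None  # transcription skipped; the caller can carry on without it
    import whisper  # imported lazily so the base install never needs it

    model = whisper.load_model(model_size)
    return model.transcribe(wav_path)["text"]
```

Importing inside the function keeps start-up fast and avoids an ImportError for users who only ran `uv sync`.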

CLI

uv run python main.py <input_wav> [options]

Core options:

  • -a, --approach {ica,frankenstein,ica_deeplearning}
  • -o, --output <dir>
  • -w, --whisper-model {tiny,base,small,medium,large}
  • --hf-token <token> (only relevant for ica_deeplearning)
  • -v, --verbose
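
For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch, not the parser in `main.py`: only `ica` is documented as the default approach, so every other default here (output directory, Whisper model size) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented CLI options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_wav", help="path to the 4-channel input WAV")
    parser.add_argument(
        "-a", "--approach",
        choices=["ica", "frankenstein", "ica_deeplearning"],
        default="ica",  # documented default approach
    )
    parser.add_argument("-o", "--output", metavar="DIR", default="output")  # assumed default
    parser.add_argument(
        "-w", "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default="base",  # assumed default
    )
    parser.add_argument("--hf-token", default=None,
                        help="only relevant for ica_deeplearning")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```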

Example runs:

uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl

Approaches (Current Status)

  • ica

    • FastICA separation with fixed 4 sources
    • DoA from ICA mixing matrix
    • ToI uses weighted scoring (front/language/energy/gender)
  • frankenstein

    • FastICA separation with fixed 4 sources
    • language-priority ToI policy (strong English bonus)
    • does not use DoA for the final ToI decision
  • ica_deeplearning

    • pass 1: PCA + ICA (source count from variance threshold)
    • pass 2: the deep-learning stage is currently a simplified placeholder in the code
    • useful as an experimental variant; it does not yet perform full deep overlap resolution
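
To illustrate the weighted ToI scoring mentioned for the `ica` approach, here is a minimal sketch that combines front, language, energy, and gender terms into one score. The weights, the `Source` fields, and the target language are all assumptions chosen for illustration; the values used by the project are not stated here.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed weights, for illustration only.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


@dataclass
class Source:
    name: str
    doa_deg: float   # 0 = front, angles in degrees
    language: str    # e.g. "en"
    energy: float    # assumed to be normalised to 0..1
    gender: str      # "male", "female" or "unknown"


def frontness(doa_deg: float) -> float:
    """1.0 for a source straight ahead, falling linearly to 0.0 at the rear."""
    diff = abs((doa_deg + 180.0) % 360.0 - 180.0)  # angular distance from 0 degrees
    return 1.0 - diff / 180.0


def toi_score(src: Source, target_language: str = "en",
              target_gender: Optional[str] = None) -> float:
    score = WEIGHTS["front"] * frontness(src.doa_deg)
    score += WEIGHTS["language"] * (1.0 if src.language == target_language else 0.0)
    score += WEIGHTS["energy"] * max(0.0, min(1.0, src.energy))
    if target_gender is not None and src.gender == target_gender:
        score += WEIGHTS["gender"]
    return score


def select_toi(sources: List[Source]) -> Source:
    """Pick the source with the highest score as talker of interest."""
    return max(sources, key=toi_score)
```

A language-priority policy like the one described for `frankenstein` could be approximated in this sketch by making the language weight dominate and setting the front weight to zero.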

Output

Each run writes to the chosen output directory:

<output_dir>/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav (or fewer/more for ica_deeplearning)
  output.wav
  results.json

results.json contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
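
A downstream script might read `results.json` and pick out the selected talker along these lines. The schema is an assumption: the sketch supposes a top-level `sources` list whose entries carry a boolean `is_toi` flag, so the key names need to be adjusted to match the actual file.

```python
import json
from pathlib import Path
from typing import Optional


def load_results(output_dir: str) -> dict:
    """Load results.json from a run's output directory."""
    path = Path(output_dir) / "results.json"
    return json.loads(path.read_text(encoding="utf-8"))


def find_toi(results: dict) -> Optional[dict]:
    """Return the entry marked as talker of interest, or None if none is marked.

    Assumed schema: {"sources": [{"is_toi": bool, ...}, ...]}
    """
    for source in results.get("sources", []):
        if source.get("is_toi"):
            return source
    return None
```

For example, `find_toi(load_results("output/ica"))` would return the metadata entry for the selected talker under these assumptions.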

Microphone Geometry

Channel mapping expected by the pipeline:

  • channel 0: Left Front (LF)
  • channel 1: Left Rear (LR)
  • channel 2: Right Front (RF)
  • channel 3: Right Rear (RR)
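
Since the pipeline depends on this channel order, it can help to verify the channel count before running. The sketch below uses the standard-library `wave` module, which reads PCM WAV files only; the helper name `check_channel_count` is hypothetical and the project's own loader may work differently.

```python
import wave

# Channel index -> microphone position, as listed above.
CHANNEL_NAMES = ("LF", "LR", "RF", "RR")


def check_channel_count(wav_file) -> dict:
    """Check that a WAV (path string or binary file object) has 4 channels.

    Returns the channel-index-to-name mapping, or raises ValueError.
    """
    with wave.open(wav_file, "rb") as wav:
        n_channels = wav.getnchannels()
    if n_channels != len(CHANNEL_NAMES):
        raise ValueError(
            f"expected {len(CHANNEL_NAMES)} channels, found {n_channels}"
        )
    return dict(enumerate(CHANNEL_NAMES))
```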

Orientation convention:

  • 0° front
  • 90° right
  • 180° rear
  • 270° left
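
Under this convention (0° at the front, angles increasing clockwise), a DoA angle can be mapped to a coarse label as sketched below. The 90° sector width centred on each cardinal direction is an assumption made for illustration.

```python
def direction_label(angle_deg: float) -> str:
    """Map a DoA angle in degrees to "front", "right", "rear" or "left"."""
    labels = ("front", "right", "rear", "left")
    # Normalise to [0, 360), shift by half a sector, then bucket into 90-degree sectors.
    sector = int(((angle_deg % 360.0) + 45.0) // 90.0) % 4
    return labels[sector]
```

With these sectors, 350° and 10° both count as front, and negative angles such as -90° wrap around to left.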

Project Layout

.
  main.py
  approaches/
  pipeline_modules/
  scripts/
  tests/
  archive_solution/
  data/
  output/
  docs/

  • main.py: canonical entrypoint
  • approaches/: pipeline variants
  • pipeline_modules/: shared logic (audio loading, DoA, gender, ASR, ToI)
  • scripts/: utility/analysis scripts not required for the main run
  • tests/: lightweight validation scripts

Utility Commands

Run benchmark across all approaches:

uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results

Run lightweight checks:

uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py

Notes and Limitations

  • Input must be a 4-channel WAV file.
  • If Whisper is not installed, transcription is skipped (pipeline still runs).
  • F0-based gender can return unknown when voiced frames are insufficient.
  • ica_deeplearning pass 2 is not fully implemented yet.
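
The `unknown` outcome for F0-based gender can be pictured with a small sketch: take per-frame F0 estimates, keep the voiced frames, and refuse to decide when there are too few. The 165 Hz boundary and the minimum of 10 voiced frames are assumed values for illustration, not the project's settings, and unvoiced frames are assumed to be reported as 0.

```python
from statistics import median
from typing import Sequence


def gender_from_f0(f0_hz: Sequence[float], min_voiced: int = 10,
                   threshold_hz: float = 165.0) -> str:
    """Estimate gender from per-frame F0 values (0 marks an unvoiced frame)."""
    voiced = [f for f in f0_hz if f > 0.0]
    if len(voiced) < min_voiced:
        return "unknown"  # not enough voiced frames for a stable estimate
    # The median is less sensitive to octave errors than the mean.
    return "male" if median(voiced) < threshold_hz else "female"
```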

Docs

  • docs/pipeline-details.md
  • docs/lazy-imports-fix.md