
title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit

Multi-Talker Audio Source Separation

Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:

speaker count
direction of arrival (DoA)
gender estimate from fundamental frequency (F0)
per-speaker transcription
talker-of-interest (ToI) selection

Quick Start

# install dependencies
uv sync

# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica

ASR is optional. To enable Whisper transcription:

uv sync --extra asr

CLI

uv run python main.py <input_wav> [options]

Core options:

-a, --approach {ica,frankenstein,ica_deeplearning}
-o, --output <dir>
-w, --whisper-model {tiny,base,small,medium,large}
--hf-token <token> (only relevant for ica_deeplearning)
-v, --verbose
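
The options above map naturally onto `argparse`. The following is a minimal sketch of such a parser, not the actual contents of main.py. The `ica` default follows the Quick Start; the default output directory and default Whisper model are assumptions.

```python
import argparse

APPROACHES = ["ica", "frankenstein", "ica_deeplearning"]
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_wav", help="path to the 4-channel input WAV")
    parser.add_argument("-a", "--approach", choices=APPROACHES, default="ica")
    # The defaults for --output and --whisper-model are assumptions.
    parser.add_argument("-o", "--output", default="output")
    parser.add_argument("-w", "--whisper-model", choices=WHISPER_MODELS, default="base")
    parser.add_argument("--hf-token", default=None,
                        help="only relevant for ica_deeplearning")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(
    ["data/mixture.wav", "--approach", "frankenstein", "-o", "output/frankenstein"]
)
print(args.approach, args.output)
```

Using `choices` makes the parser reject unknown approaches or model sizes before any audio is loaded.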

Example runs:

uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl

Approaches (Current Status)

ica
- FastICA separation with fixed 4 sources
- DoA from ICA mixing matrix
- ToI uses weighted scoring (front/language/energy/gender)
frankenstein
- FastICA separation with fixed 4 sources
- language-priority ToI policy (strong English bonus)
- does not use DoA for the final ToI decision
ica_deeplearning
- pass 1: PCA + ICA (source count from variance threshold)
- pass 2: the deep-learning stage is currently a simplified placeholder in the code
- useful as an experimental variant; it does not yet perform full deep overlap resolution
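
The weighted ToI scoring used by the ica approach can be illustrated as follows. This is a sketch under stated assumptions, not the pipeline's code: the weights, the `Source` fields, the linear "frontness" measure, and the preferred language and gender parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical weights; the values used by the pipeline are not documented here.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


@dataclass
class Source:
    name: str
    direction_deg: float  # 0 = front, 90 = right, 180 = rear, 270 = left
    language: str
    energy: float         # relative energy, expected in [0, 1]
    gender: str           # "male", "female", or "unknown"


def frontness(direction_deg: float) -> float:
    """Return 1.0 straight ahead, falling linearly to 0.0 directly behind."""
    offset = abs((direction_deg + 180.0) % 360.0 - 180.0)  # angular distance from 0 deg
    return 1.0 - offset / 180.0


def toi_score(src: Source, preferred_language: str = "en",
              preferred_gender: Optional[str] = None) -> float:
    """Combine the four cues into a single weighted score."""
    language_match = 1.0 if src.language == preferred_language else 0.0
    gender_match = 1.0 if preferred_gender and src.gender == preferred_gender else 0.0
    energy = min(max(src.energy, 0.0), 1.0)  # clamp to [0, 1]
    return (WEIGHTS["front"] * frontness(src.direction_deg)
            + WEIGHTS["language"] * language_match
            + WEIGHTS["energy"] * energy
            + WEIGHTS["gender"] * gender_match)


def select_toi(sources: List[Source]) -> Source:
    """Pick the source with the highest score."""
    return max(sources, key=toi_score)
```

A language-priority policy such as the one described for frankenstein could be approximated in this sketch by raising the language weight far above the others and setting the front weight to zero.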

Output

Each run writes to the chosen output directory:

<output_dir>/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav (or fewer/more for ica_deeplearning)
  output.wav
  results.json

results.json contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
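
A short sketch of reading results.json and locating the selected talker is shown below. The exact schema is not documented here, so the top-level `sources` list and the `is_toi` key are assumed names chosen for illustration; adjust them to match the real file.

```python
import json
from pathlib import Path
from typing import Optional


def load_results(output_dir: str) -> dict:
    """Load results.json from a run's output directory."""
    with open(Path(output_dir) / "results.json", encoding="utf-8") as fh:
        return json.load(fh)


def find_toi(results: dict) -> Optional[dict]:
    """Return the first entry flagged as talker of interest, or None."""
    # "sources" and "is_toi" are assumed key names, not confirmed by the project.
    for source in results.get("sources", []):
        if source.get("is_toi"):
            return source
    return None
```

For example, `find_toi(load_results("output/ica"))` would return the metadata of the selected source, or `None` if no entry carries the marker.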

Microphone Geometry

Channel mapping expected by the pipeline:

channel 0: Left Front (LF)
channel 1: Left Rear (LR)
channel 2: Right Front (RF)
channel 3: Right Rear (RR)
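
Before running the pipeline it can help to confirm that a recording really has four channels. The check below is a minimal sketch using the standard-library `wave` module; the function name and the `CHANNEL_NAMES` tuple are illustrative and not part of the project. Note that `wave` handles PCM files, and other encodings may raise `wave.Error`.

```python
import wave

# Channel order documented above: index 0..3.
CHANNEL_NAMES = ("LF", "LR", "RF", "RR")


def check_channel_count(path: str, expected: int = len(CHANNEL_NAMES)) -> int:
    """Return the channel count of a PCM WAV file; raise ValueError on mismatch."""
    with wave.open(path, "rb") as wav:
        n_channels = wav.getnchannels()
    if n_channels != expected:
        raise ValueError(f"expected {expected} channels, found {n_channels}")
    return n_channels
```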

Orientation convention:

0° front
90° right
180° rear
270° left
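
To make the convention concrete, the helper below maps an arbitrary angle to the nearest of the four labels. It is an illustrative sketch: the function name and the choice of 90-degree sectors centered on each label are assumptions, not project code.

```python
def direction_label(angle_deg: float) -> str:
    """Map an angle in degrees to 'front', 'right', 'rear', or 'left'."""
    labels = ("front", "right", "rear", "left")
    # Shift by 45 degrees so each 90-degree sector is centered on its label.
    # The final % 4 guards against floating-point rounding at the wrap-around.
    sector = int(((angle_deg + 45.0) % 360.0) // 90.0) % 4
    return labels[sector]
```

Angles outside 0 to 360 are wrapped, so `direction_label(-90)` and `direction_label(270)` both yield the left sector.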

Project Layout

.
  main.py
  approaches/
  pipeline_modules/
  scripts/
  tests/
  archive_solution/
  data/
  output/
  docs/

main.py: canonical entrypoint
approaches/: pipeline variants
pipeline_modules/: shared logic (audio loading, DoA, gender, ASR, ToI)
scripts/: utility/analysis scripts not required for the main run
tests/: lightweight validation scripts

Utility Commands

Run benchmark across all approaches:

uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results

Run lightweight checks:

uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py

Notes and Limitations

Input must be a 4-channel WAV file.
If Whisper is not installed, transcription is skipped (pipeline still runs).
F0-based gender can return unknown when voiced frames are insufficient.
ica_deeplearning pass 2 is not fully implemented yet.
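
The "unknown" outcome for F0-based gender can be illustrated with a small sketch. It assumes a per-frame F0 track in Hz where unvoiced frames are encoded as 0; the 165 Hz boundary, the minimum of 20 voiced frames, and the label strings are hypothetical values, not the pipeline's settings.

```python
from statistics import median
from typing import Sequence


def gender_from_f0(f0_hz: Sequence[float], min_voiced_frames: int = 20,
                   threshold_hz: float = 165.0) -> str:
    """Estimate gender from an F0 track, or return 'unknown' if too little is voiced."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 marks an unvoiced frame
    if len(voiced) < min_voiced_frames:
        return "unknown"
    # The median is less sensitive to octave errors than the mean.
    return "female" if median(voiced) >= threshold_hz else "male"
```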

Docs

docs/pipeline-details.md
docs/lazy-imports-fix.md