---
title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit
---

# Multi-Talker Audio Source Separation

Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:

- speaker count
- direction of arrival (DoA)
- gender estimate from F0
- per-speaker transcription
- talker-of-interest (ToI) selection

## Quick Start

```bash
# install dependencies
uv sync

# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica
```

ASR is optional. To enable Whisper transcription:

```bash
uv sync --extra asr
```

## CLI

```bash
uv run python main.py <input_wav> [options]
```

Core options:

- `-a, --approach {ica,frankenstein,ica_deeplearning}`
- `-o, --output <dir>`
- `-w, --whisper-model {tiny,base,small,medium,large}`
- `--hf-token <token>` (only relevant for `ica_deeplearning`)
- `-v, --verbose`
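
For orientation, the options above could be declared with `argparse` roughly as follows. This is a sketch and not the actual contents of `main.py`: the `build_parser` helper and the `None` defaults are assumptions, and only the `ica` default is taken from the Quick Start.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("input_wav", help="path to the 4-channel WAV file")
    parser.add_argument(
        "-a", "--approach",
        choices=["ica", "frankenstein", "ica_deeplearning"],
        default="ica",  # Quick Start describes ica as the default approach
    )
    parser.add_argument("-o", "--output", metavar="DIR", default=None)  # default is an assumption
    parser.add_argument(
        "-w", "--whisper-model",
        choices=["tiny", "base", "small", "medium", "large"],
        default=None,  # default is an assumption
    )
    parser.add_argument("--hf-token", default=None)  # only relevant for ica_deeplearning
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["data/mixture.wav", "-a", "frankenstein", "-o", "output/frankenstein"]
)
print(args.approach, args.output, args.verbose)
```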

Example runs:

```bash
uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl
```

## Approaches (Current Status)

- `ica`
  - FastICA separation with fixed 4 sources
  - DoA from ICA mixing matrix
  - ToI uses weighted scoring (front/language/energy/gender)

- `frankenstein`
  - FastICA separation with fixed 4 sources
  - language-priority ToI policy (strong English bonus)
  - does not use DoA for the final ToI decision

- `ica_deeplearning`
  - pass 1: PCA + ICA (source count from variance threshold)
  - pass 2: the deep-learning stage is currently a simplified placeholder in the code
  - useful as an experimental variant; it does not yet perform full deep-learning overlap resolution
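
The weighted ToI scoring used by `ica` (front/language/energy/gender) can be pictured as a weighted sum of per-source terms. The sketch below is illustrative only: the weights, the `Source` fields, and the `frontness` and `toi_score` helpers are all assumptions, not the values or names used in the project.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights for illustration; the project's real values may differ.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


@dataclass
class Source:
    name: str
    doa_deg: float  # 0 = front, 90 = right, 180 = rear, 270 = left
    language: str   # e.g. "en"
    energy: float   # assumed already normalized to [0, 1]
    gender: str     # "male", "female", or "unknown"


def frontness(doa_deg: float) -> float:
    """Map a DoA to [0, 1]: 1.0 straight ahead, 0.0 directly behind."""
    offset = abs((doa_deg + 180.0) % 360.0 - 180.0)  # angular distance from 0 degrees
    return 1.0 - offset / 180.0


def toi_score(src: Source, language: str = "en", gender: Optional[str] = None) -> float:
    """Weighted sum of the four terms; the gender term is zero if no preference is given."""
    terms = {
        "front": frontness(src.doa_deg),
        "language": 1.0 if src.language == language else 0.0,
        "energy": min(max(src.energy, 0.0), 1.0),
        "gender": 1.0 if gender is not None and src.gender == gender else 0.0,
    }
    return sum(WEIGHTS[key] * value for key, value in terms.items())


sources = [
    Source("source_1", doa_deg=10.0, language="en", energy=0.6, gender="female"),
    Source("source_2", doa_deg=170.0, language="de", energy=0.9, gender="male"),
]
toi = max(sources, key=toi_score)  # the highest-scoring source becomes the ToI
print(toi.name)
```

Under this sketch, a language-priority policy like the one in `frankenstein` would correspond to making the language term dominate and dropping the front term.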

## Output

Each run writes to the chosen output directory:

```text
<output_dir>/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav (or fewer/more for ica_deeplearning)
  output.wav
  results.json
```

`results.json` contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
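
A downstream script might pick out the ToI entry along the following lines. The key names (`sources`, `file`, `is_toi`, and so on) are assumptions made for this sketch; check an actual `results.json` for the real schema.

```python
import json

# Hypothetical results.json content; the real schema may use different keys.
EXAMPLE = """
{"sources": [
  {"file": "source_1.wav", "direction": 15.0, "gender": "female",
   "language": "en", "transcript": "hello there", "score": 0.81, "is_toi": true},
  {"file": "source_2.wav", "direction": 200.0, "gender": "unknown",
   "language": "de", "transcript": "", "score": 0.22, "is_toi": false}
]}
"""


def find_toi(results: dict):
    """Return the entry flagged as talker of interest, or None if there is none."""
    for entry in results.get("sources", []):
        if entry.get("is_toi"):
            return entry
    return None


results = json.loads(EXAMPLE)
toi = find_toi(results)
print(toi["file"], toi["transcript"])
```

For a real run, replace `json.loads(EXAMPLE)` with `json.load` on the opened `<output_dir>/results.json`.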

## Microphone Geometry

Channel mapping expected by the pipeline:

- channel 0: Left Front (LF)
- channel 1: Left Rear (LR)
- channel 2: Right Front (RF)
- channel 3: Right Rear (RR)
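
To check that a recording matches this mapping before running the pipeline, something like the following standard-library sketch could be used. It assumes 16-bit PCM, and `split_channels` is a hypothetical helper, not part of `pipeline_modules/`.

```python
import os
import struct
import tempfile
import wave

CHANNEL_NAMES = ("LF", "LR", "RF", "RR")  # channel order 0..3 as listed above


def split_channels(path: str) -> dict:
    """Read a 4-channel, 16-bit PCM WAV and return one sample list per microphone."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != len(CHANNEL_NAMES):
            raise ValueError(f"expected 4 channels, got {wav.getnchannels()}")
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    # WAV samples are little-endian and interleaved frame by frame.
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return {name: list(samples[i::4]) for i, name in enumerate(CHANNEL_NAMES)}


# Demo: write a tiny synthetic 4-channel file (two frames), then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as out:
        out.setnchannels(4)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(struct.pack("<8h", 1, 2, 3, 4, 5, 6, 7, 8))
    channels = split_channels(demo_path)

print(channels)
```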

Orientation convention:

- `0°` front
- `90°` right
- `180°` rear
- `270°` left
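
When reading DoA values, it can help to fold an angle into this convention and name the nearest direction. The helper below is a sketch; the 90° sector width and the `sector` function are assumptions for illustration, not something the pipeline is known to do.

```python
LABELS = ("front", "right", "rear", "left")  # centered on 0, 90, 180, 270 degrees


def sector(angle_deg: float) -> str:
    """Return the nearest direction label, using 90-degree-wide sectors."""
    shifted = (angle_deg + 45.0) % 360.0  # shift so each sector starts at a multiple of 90
    return LABELS[int(shifted // 90.0)]


for angle in (0, 90, 180, 270, 350, -90):
    print(angle, sector(angle))
```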

## Project Layout

```text
.
  main.py
  approaches/
  pipeline_modules/
  scripts/
  tests/
  archive_solution/
  data/
  output/
  docs/
```

- `main.py`: canonical entrypoint
- `approaches/`: pipeline variants
- `pipeline_modules/`: shared logic (audio loading, DoA, gender, ASR, ToI)
- `scripts/`: utility/analysis scripts not required for the main run
- `tests/`: lightweight validation scripts

## Utility Commands

Run the benchmark across all approaches:

```bash
uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results
```

Run lightweight checks:

```bash
uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py
```

## Notes and Limitations

- Input must be a 4-channel WAV file.
- If Whisper is not installed, transcription is skipped (pipeline still runs).
- F0-based gender can return `unknown` when voiced frames are insufficient.
- `ica_deeplearning` pass 2 is not fully implemented yet.
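
The `unknown` outcome for F0-based gender can be illustrated with a small sketch. The frame-count minimum, the 165 Hz boundary, the use of the median, and the `gender_from_f0` helper are all assumptions here; the project's actual thresholds and estimator may differ.

```python
from statistics import median

MIN_VOICED_FRAMES = 20   # hypothetical minimum for a reliable estimate
F0_BOUNDARY_HZ = 165.0   # hypothetical boundary between typical male and female F0


def gender_from_f0(f0_track) -> str:
    """Classify from per-frame F0 values in Hz; 0 or None marks an unvoiced frame."""
    voiced = [f for f in f0_track if f is not None and f > 0]
    if len(voiced) < MIN_VOICED_FRAMES:
        return "unknown"  # too little voiced speech to decide
    return "male" if median(voiced) < F0_BOUNDARY_HZ else "female"


print(gender_from_f0([0.0] * 100 + [120.0] * 5))  # mostly unvoiced frames
print(gender_from_f0([120.0] * 30))
print(gender_from_f0([210.0] * 30))
```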

## Docs

- `docs/pipeline-details.md`
- `docs/lazy-imports-fix.md`