---
title: Hearing Visualized
emoji: 🎧
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
license: mit
---
# Multi-Talker Audio Source Separation
Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:
- speaker count
- direction of arrival (DoA)
- gender estimate from F0
- per-speaker transcription
- talker-of-interest (ToI) selection
## Quick Start
```bash
# install dependencies
uv sync
# run default approach (ica)
uv run python main.py data/mixture.wav --approach ica --output output/ica
```
ASR is optional. To enable Whisper transcription:
```bash
uv sync --extra asr
```
## CLI
```bash
uv run python main.py <input_wav> [options]
```
Core options:
- `-a, --approach {ica,frankenstein,ica_deeplearning}`
- `-o, --output <dir>`
- `-w, --whisper-model {tiny,base,small,medium,large}`
- `--hf-token <token>` (only relevant for `ica_deeplearning`)
- `-v, --verbose`
Example runs:
```bash
uv run python main.py data/mixture.wav --approach ica --output output/ica
uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl
```
## Approaches (Current Status)
- `ica`
  - FastICA separation with a fixed count of 4 sources
  - DoA derived from the ICA mixing matrix
  - ToI selected by weighted scoring (front/language/energy/gender)
- `frankenstein`
  - FastICA separation with a fixed count of 4 sources
  - language-priority ToI policy (strong English bonus)
  - does not use DoA for the final ToI decision
- `ica_deeplearning`
  - pass 1: PCA + ICA (source count from a variance threshold)
  - pass 2: the deep stage is currently a simplified placeholder in the code
  - useful as an experimental variant; it does not yet resolve overlapping speech with a deep model
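As an illustration of the weighted ToI scoring listed for `ica`, here is a minimal sketch. The weights, the `Source` fields, and the function names are assumptions made for this example and are not taken from the repository code.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration only; the real values live in the code.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


@dataclass
class Source:
    name: str
    doa_deg: float   # direction of arrival, 0 = front
    language: str    # detected language code, e.g. "en"
    energy: float    # relative energy, expected in [0, 1]
    gender: str      # "male", "female", or "unknown"


def frontness(doa_deg):
    """Return 1.0 for a source straight ahead and 0.0 for one directly behind."""
    offset = abs((doa_deg + 180.0) % 360.0 - 180.0)  # angular distance from 0 deg
    return 1.0 - offset / 180.0


def toi_score(src, target_language="en", target_gender=None):
    """Combine the four cues into one weighted score."""
    parts = {
        "front": frontness(src.doa_deg),
        "language": 1.0 if src.language == target_language else 0.0,
        "energy": min(max(src.energy, 0.0), 1.0),  # clamp to [0, 1]
        "gender": 1.0 if target_gender and src.gender == target_gender else 0.0,
    }
    return sum(WEIGHTS[key] * value for key, value in parts.items())


def select_toi(sources, **kwargs):
    """Pick the source with the highest score."""
    return max(sources, key=lambda s: toi_score(s, **kwargs))
```

With these assumed weights, a quieter English talker in front outranks a louder English talker behind the listener. A language-priority policy like the one in `frankenstein` could be approximated by raising the `language` weight and setting the `front` weight to zero.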
## Output
Each run writes to the chosen output directory:
```text
<output_dir>/
  source_1.wav
  source_2.wav
  source_3.wav
  source_4.wav   (or fewer/more for ica_deeplearning)
  output.wav
  results.json
```
`results.json` contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
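A short sketch of how the ToI entry might be read back from `results.json`. The exact schema is not documented here, so the top-level `sources` list and the `is_toi` key are assumptions; adjust them to match the file your run produces.

```python
import json
from pathlib import Path


def load_toi(results_path):
    """Return the first source entry flagged as talker-of-interest, or None."""
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    # Assumed layout: {"sources": [{"is_toi": true, ...}, ...]}
    for entry in data.get("sources", []):
        if entry.get("is_toi"):
            return entry
    return None
```

For example, `load_toi("output/ica/results.json")` would return the selected source's metadata dictionary if the file follows the assumed layout.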
## Microphone Geometry
Channel mapping expected by the pipeline:
- channel 0: Left Front (LF)
- channel 1: Left Rear (LR)
- channel 2: Right Front (RF)
- channel 3: Right Rear (RR)
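Before running the pipeline, it can help to confirm that a recording really has four channels. The sketch below uses only the standard-library `wave` module, which reads plain PCM WAV files; the function name and the returned fields are illustrative and not part of the project.

```python
import wave

# Channel order from the mapping above.
EXPECTED_CHANNELS = ("LF", "LR", "RF", "RR")


def check_four_channel_wav(path):
    """Raise ValueError unless the file is a WAV with exactly four channels."""
    with wave.open(str(path), "rb") as wav:
        n_channels = wav.getnchannels()
        if n_channels != len(EXPECTED_CHANNELS):
            raise ValueError(
                f"expected {len(EXPECTED_CHANNELS)} channels, found {n_channels}"
            )
        return {"sample_rate": wav.getframerate(), "frames": wav.getnframes()}
```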
Orientation convention:
- `0°` front
- `90°` right
- `180°` rear
- `270°` left
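To make the convention concrete, this small helper maps any angle to the nearest of the four named directions. It is a sketch for interpreting DoA values; the function names and the 90° sector width are choices made for this example, not something the pipeline is known to expose.

```python
def normalize_angle(angle_deg):
    """Wrap an angle into the range [0, 360)."""
    return angle_deg % 360.0


def sector_label(angle_deg):
    """Map an angle to the nearest of front/right/rear/left (clockwise from front)."""
    labels = ("front", "right", "rear", "left")
    # Shift by half a sector so each label covers +/- 45 degrees around its centre.
    index = int((normalize_angle(angle_deg) + 45.0) // 90.0) % 4
    return labels[index]
```

Under this mapping, `sector_label(350)` is `"front"` and `sector_label(-90)` is `"left"`.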
## Project Layout
```text
.
  main.py
  approaches/
  pipeline_modules/
  scripts/
  tests/
  archive_solution/
  data/
  output/
  docs/
```
- `main.py`: canonical entrypoint
- `approaches/`: pipeline variants
- `pipeline_modules/`: shared logic (audio loading, DoA, gender, ASR, ToI)
- `scripts/`: utility/analysis scripts not required for the main run
- `tests/`: lightweight validation scripts
## Utility Commands
Run the benchmark across all approaches:
```bash
uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results
```
Run lightweight checks:
```bash
uv run python tests/test_lazy_loading.py
uv run python tests/validate_outputs.py
uv run python tests/validate_audio.py
```
## Notes and Limitations
- Input must be a 4-channel WAV file.
- If Whisper is not installed, transcription is skipped (pipeline still runs).
- F0-based gender can return `unknown` when voiced frames are insufficient.
- `ica_deeplearning` pass 2 is not fully implemented yet.
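The `unknown` result for F0-based gender can be pictured with the sketch below. The 165 Hz boundary, the minimum of 10 voiced frames, and the function name are assumptions chosen for illustration; the values used by the pipeline may differ.

```python
from statistics import median


def gender_from_f0(f0_hz, min_voiced_frames=10, threshold_hz=165.0):
    """Classify a speaker from per-frame F0 estimates (0 or None = unvoiced)."""
    # Keep only voiced frames, i.e. frames with a positive pitch estimate.
    voiced = [f for f in f0_hz if f and f > 0]
    if len(voiced) < min_voiced_frames:
        return "unknown"  # too little voiced speech to decide
    # The median is less sensitive to octave errors than the mean.
    return "male" if median(voiced) < threshold_hz else "female"
```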
## Docs
- `docs/pipeline-details.md`
- `docs/lazy-imports-fix.md`