	---
	title: Hearing Visualized
	emoji: 🎧
	colorFrom: purple
	colorTo: pink
	sdk: docker
	pinned: false
	license: mit
	---

	# Multi-Talker Audio Source Separation

	Pipeline for analyzing a 4-channel hearing-aid recording and extracting per-speaker information:

	- speaker count
	- direction of arrival (DoA)
	- gender estimate from F0
	- per-speaker transcription
	- talker-of-interest (ToI) selection

	## Quick Start

	```bash
	# install dependencies
	uv sync

	# run default approach (ica)
	uv run python main.py data/mixture.wav --approach ica --output output/ica
	```

	ASR is optional. To enable Whisper transcription, install the `asr` extra:

	```bash
	uv sync --extra asr
	```

	## CLI

	```bash
	uv run python main.py <input_wav> [options]
	```

	Core options:

	- `-a, --approach {ica,frankenstein,ica_deeplearning}`
	- `-o, --output <dir>`
	- `-w, --whisper-model {tiny,base,small,medium,large}`
	- `--hf-token <token>` (only relevant for `ica_deeplearning`)
	- `-v, --verbose`

	Example runs:

	```bash
	uv run python main.py data/mixture.wav --approach ica --output output/ica
	uv run python main.py data/mixture.wav --approach frankenstein --output output/frankenstein
	uv run python main.py data/mixture.wav --approach ica_deeplearning --output output/ica_dl
	```

	## Approaches (Current Status)

	- `ica`
	  - FastICA separation with a fixed count of 4 sources
	  - DoA from the ICA mixing matrix
	  - ToI uses weighted scoring (front/language/energy/gender)

	- `frankenstein`
	  - FastICA separation with a fixed count of 4 sources
	  - language-priority ToI policy (strong English bonus)
	  - does not use DoA for the final ToI decision

	- `ica_deeplearning`
	  - pass 1: PCA + ICA (source count from a variance threshold)
	  - pass 2: the deep stage is currently a simplified placeholder in the code
	  - useful as an experimental variant; full deep-learning overlap resolution is not implemented yet
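
Two ideas from the list above lend themselves to a short illustration: choosing a source count from a PCA variance threshold, and ranking sources with a weighted ToI score. The following is a minimal sketch under stated assumptions, not the repository's implementation. The function names, the weights, the 0.95 default threshold, and the dictionary keys (`doa_deg`, `language`, `relative_energy`, `gender`) are all hypothetical.

```python
import math

# Hypothetical weights; the values used by the repository are not documented here.
WEIGHTS = {"front": 0.4, "language": 0.3, "energy": 0.2, "gender": 0.1}


def count_sources_by_variance(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive value")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


def frontness(doa_deg):
    """Map a DoA angle to [0, 1]: 1.0 at 0 degrees (front), 0.0 at 180 (rear)."""
    return (1.0 + math.cos(math.radians(doa_deg))) / 2.0


def toi_score(source, target_language="en", target_gender=None, weights=WEIGHTS):
    """Weighted sum of per-cue scores, each cue scaled to [0, 1]."""
    cues = {
        "front": frontness(source["doa_deg"]),
        "language": 1.0 if source["language"] == target_language else 0.0,
        # Assumed to be normalised to [0, 1] before scoring.
        "energy": source["relative_energy"],
        "gender": 1.0 if target_gender and source["gender"] == target_gender else 0.0,
    }
    return sum(weights[name] * cues[name] for name in weights)
```

With these illustrative weights, a frontal English speaker outranks a louder speaker behind the listener, which matches the intent of a front/language-weighted policy. A language-priority policy such as the one described for `frankenstein` could be approximated by raising the `language` weight and setting `front` to zero.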

	## Output

	Each run writes to the chosen output directory:

	```text
	<output_dir>/
	  source_1.wav
	  source_2.wav
	  source_3.wav
	  source_4.wav (or fewer/more for ica_deeplearning)
	  output.wav
	  results.json
	```

	`results.json` contains per-source metadata such as direction, energy, gender, language, transcript, selection score, and ToI marker.
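
A downstream script might pick the ToI entry out of `results.json` roughly as sketched below. The schema is not spelled out in this README, so the top-level `sources` list and the `is_toi` key are hypothetical placeholders; adjust them to match the file the pipeline actually writes.

```python
import json
from pathlib import Path


def load_toi(results_path):
    """Return the source entry flagged as talker of interest, or None."""
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    # Assumed layout: {"sources": [{"id": ..., "is_toi": bool, ...}, ...]}
    sources = data.get("sources", [])
    flagged = [entry for entry in sources if entry.get("is_toi")]
    return flagged[0] if flagged else None
```

Returning `None` rather than raising keeps the helper usable on runs where no source was selected.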

	## Microphone Geometry

	Channel mapping expected by the pipeline:

	- channel 0: Left Front (LF)
	- channel 1: Left Rear (LR)
	- channel 2: Right Front (RF)
	- channel 3: Right Rear (RR)

	Orientation convention:

	- `0°` front
	- `90°` right
	- `180°` rear
	- `270°` left
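
The channel mapping and angle convention above can be captured in a few lines. This is an illustrative sketch only: the names `CHANNELS`, `normalize_angle`, and `sector` are hypothetical, and the 90°-wide sectors centred on each cardinal direction are an assumption rather than something the pipeline is known to use.

```python
# Channel index -> microphone label, as listed in the README.
CHANNELS = {0: "LF", 1: "LR", 2: "RF", 3: "RR"}


def normalize_angle(deg):
    """Wrap any angle in degrees into the range [0, 360)."""
    return deg % 360.0


def sector(deg):
    """Name the nearest cardinal direction: 0 front, 90 right, 180 rear, 270 left."""
    labels = ["front", "right", "rear", "left"]
    # Shift by half a sector so each label covers +/- 45 degrees around its centre.
    index = int((normalize_angle(deg) + 45.0) // 90.0) % 4
    return labels[index]
```

Under this convention, angles increase clockwise when viewed from above, so `sector(-90)` and `sector(270)` both resolve to `left`.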

	## Project Layout

	```text
	.
	  main.py
	  approaches/
	  pipeline_modules/
	  scripts/
	  tests/
	  archive_solution/
	  data/
	  output/
	  docs/

	- `main.py`: canonical entrypoint
	- `approaches/`: pipeline variants
	- `pipeline_modules/`: shared logic (audio loading, DoA, gender, ASR, ToI)
	- `scripts/`: utility/analysis scripts not required for the main run
	- `tests/`: lightweight validation scripts

	## Utility Commands

	Run the benchmark across all approaches:

	```bash
	uv run python scripts/benchmark.py --data-dir data --output-dir benchmark_results
	```

	Run lightweight checks:

	```bash
	uv run python tests/test_lazy_loading.py
	uv run python tests/validate_outputs.py
	uv run python tests/validate_audio.py
	```

	## Notes and Limitations

	- Input must be a 4-channel WAV file.
	- If Whisper is not installed, transcription is skipped (pipeline still runs).
	- F0-based gender estimation can return `unknown` when there are too few voiced frames.
	- `ica_deeplearning` pass 2 is not fully implemented yet.
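
Because the input must be a 4-channel WAV file, a quick pre-flight check can save a failed run. The sketch below uses only the standard-library `wave` module. The function name `check_four_channel_wav` is hypothetical and is not part of this repository. Note that `wave` reads plain PCM files, so multichannel files written with an extensible header may be rejected on some Python versions even though they are valid.

```python
import wave

EXPECTED_CHANNELS = 4


def check_four_channel_wav(path):
    """Raise ValueError unless `path` is a readable PCM WAV with exactly 4 channels."""
    try:
        with wave.open(str(path), "rb") as wav:
            channels = wav.getnchannels()
            rate = wav.getframerate()
            frames = wav.getnframes()
    except wave.Error as exc:
        raise ValueError(f"{path}: not a readable PCM WAV file ({exc})") from exc
    if channels != EXPECTED_CHANNELS:
        raise ValueError(
            f"{path}: expected {EXPECTED_CHANNELS} channels, got {channels}"
        )
    return {"channels": channels, "sample_rate": rate, "duration_s": frames / rate}
```

Calling `check_four_channel_wav("data/mixture.wav")` before `main.py` would surface a mono or stereo input as a clear error rather than a failure deeper in the pipeline.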

	## Docs

	- `docs/pipeline-details.md`
	- `docs/lazy-imports-fix.md`