	---
	license: mit
	tags:
	- audio
	- speech
	- caarma
	- speaker augmentation
	- distance-estimation
	- self-supervised
	- speaker-analysis
	language:
	- en
	---

	# HowFar-Caarma

	A HuBERT-based model for distance estimation from speech: it predicts the
	physical distance between a speaker and a microphone from the audio signal.

	Backbone: `facebook/hubert-large-ls960-ft` with a classification head trained
	on GAN-augmented data using the CAARMA framework.
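As a rough illustration of the "backbone plus classification head" layout, the sketch below wraps a HuBERT encoder with mean pooling and a linear layer. The pooling strategy, the class name `DistanceClassifier`, and the number of distance classes (4) are assumptions made for illustration; the actual head is defined in `inference.py` and may differ. A tiny randomly initialised config is used so the example runs without downloading weights.

```python
import torch
from torch import nn
from transformers import HubertConfig, HubertModel


class DistanceClassifier(nn.Module):
    """HuBERT encoder + mean pooling + linear head (illustrative layout only)."""

    def __init__(self, backbone: HubertModel, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden = self.backbone(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)  # average over time -> (batch, hidden)
        return self.head(pooled)  # (batch, num_classes)


# Tiny random config so the sketch runs offline. For the real backbone,
# HubertModel.from_pretrained("facebook/hubert-large-ls960-ft") would take its place.
tiny_config = HubertConfig(
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(16, 16),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
# num_classes=4 is a hypothetical value, not taken from the checkpoint.
model = DistanceClassifier(HubertModel(tiny_config), num_classes=4).eval()

with torch.no_grad():
    logits = model(torch.randn(2, 16000))  # two one-second clips of noise
print(logits.shape)
```

The output has one row per clip and one column per distance class; an `argmax` over the last dimension would give the predicted class index.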

	## Files

	- `epoch18_val_acc7997.ckpt`: PyTorch Lightning checkpoint (epoch 18, validation accuracy 79.97%)
	- `inference.py`: minimal model loader and embedding-extraction script

	## Usage

	```bash
	pip install torch torchaudio transformers pytorch-lightning huggingface_hub

	# Download the checkpoint and the inference script
	huggingface-cli download MassaBaali/HowFar-Caarma epoch18_val_acc7997.ckpt inference.py --local-dir .

	# Run inference
	python inference.py --ckpt epoch18_val_acc7997.ckpt --audio sample.wav
	```

	Or load it directly in Python:

	```python
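	import torch
	from inference import load_model, extract_embedding

	# Fall back to CPU when no GPU is available
	device = "cuda" if torch.cuda.is_available() else "cpu"

	model = load_model("epoch18_val_acc7997.ckpt", device=device)
	embedding = extract_embedding(model, "sample.wav", device=device)
	print(embedding.shape)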
	```
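Since the model expects 16 kHz mono input, it can help to check a file's header before extracting an embedding. The helper below is a minimal sketch using only the standard-library `wave` module, so it handles uncompressed PCM WAV files only. The function name `check_wav_format` is hypothetical and is not part of `inference.py`.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, as stated in this model card
EXPECTED_CHANNELS = 1   # mono


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; an empty list means the header matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Calling `check_wav_format("sample.wav")` returns `[]` for a conforming file. If it reports problems, the audio would need to be downmixed and resampled first, for example with `torchaudio.functional.resample`.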

	## Notes

	- Expects 16 kHz mono audio.
	- The checkpoint was trained with PyTorch Lightning; `strict=False` is used on
	load to tolerate minor state-dict key differences.
	- This is the raw Lightning checkpoint rather than a `transformers`-native
	format, so standard `AutoModel.from_pretrained` will not work.
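To make the `strict=False` note concrete, the sketch below imitates a Lightning-style checkpoint (weights stored under the `"state_dict"` key) with a toy network and shows how to see which keys were skipped on load. The toy `nn.Sequential` model and the extra key `loss_fn.weight` are stand-ins invented for illustration; the real loading logic is in `inference.py` and may differ.

```python
import torch
from torch import nn

# Stand-in for the real network; the actual architecture lives in inference.py.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

# Mimic a Lightning checkpoint layout: weights sit under "state_dict".
checkpoint = {"state_dict": dict(model.state_dict())}
# Hypothetical training-only entry that the inference model does not have.
checkpoint["state_dict"]["loss_fn.weight"] = torch.ones(2)
# Imitate a parameter that is absent from the file.
del checkpoint["state_dict"]["2.bias"]

# With a real file this would be something like:
#   checkpoint = torch.load("epoch18_val_acc7997.ckpt", map_location="cpu")
# (recent PyTorch releases default to weights_only=True, which may reject
# checkpoints holding non-tensor objects; only relax it for files you trust).
result = model.load_state_dict(checkpoint["state_dict"], strict=False)

# strict=False skips mismatched keys silently, so report them explicitly.
print("missing:", result.missing_keys)
print("unexpected:", result.unexpected_keys)
```

Inspecting `missing_keys` after a non-strict load is a cheap way to confirm that only incidental entries were skipped, not backbone or head weights.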

	## Citation

	If you use this model, please cite:

	```bibtex
	@article{baali2025caarma,
	title={CAARMA: Class augmentation with adversarial mixup regularization},
	author={Baali, Massa and Li, Xiang and Chen, Hao and Hannan, Syed Abdul and Singh, Rita and Raj, Bhiksha},
	journal={Findings of the Association for Computational Linguistics: EMNLP},
	volume={2025},
	pages={9732--9742},
	year={2025}
	}
	```