---
license: mit
tags:
- audio
- speech
- caarma
- speaker augmentation
- distance-estimation
- self-supervised
- speaker-analysis
language:
- en
---

# HowFar-Caarma

A HuBERT-based model for **distance estimation from speech**: it predicts the
physical distance between a speaker and a microphone from the audio signal.

Backbone: `facebook/hubert-large-ls960-ft` with a classification head trained
on GAN-augmented data using the CAARMA framework.

## Files

- `epoch18_val_acc7997.ckpt`: PyTorch Lightning checkpoint (epoch 18, validation accuracy 79.97%)
- `inference.py`: minimal loader and embedding extraction script

## Usage

```bash
pip install torch torchaudio transformers pytorch-lightning huggingface_hub

# Download the checkpoint and the inference script
huggingface-cli download MassaBaali/HowFar-Caarma epoch18_val_acc7997.ckpt inference.py --local-dir .

# Run inference
python inference.py --ckpt epoch18_val_acc7997.ckpt --audio sample.wav
```

Or load it directly in Python, from the directory that contains `inference.py`:

```python
import torch

from inference import load_model, extract_embedding

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = load_model("epoch18_val_acc7997.ckpt", device=device)
embedding = extract_embedding(model, "sample.wav", device=device)
print(embedding.shape)
```

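The model expects 16 kHz mono input, so it can be worth checking a file before passing it to `extract_embedding`. The sketch below uses only Python's standard `wave` module. `describe_wav` and `is_model_ready` are hypothetical helper names, not part of `inference.py`, and the check only covers uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16_000   # Hz, the sample rate the model expects
EXPECTED_CHANNELS = 1    # mono


def describe_wav(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def is_model_ready(path):
    """Return True if the file is already 16 kHz mono."""
    rate, channels, _ = describe_wav(path)
    return rate == EXPECTED_RATE and channels == EXPECTED_CHANNELS
```

A file that fails this check would need to be resampled and downmixed first, for example with `torchaudio` or `ffmpeg`. Whether `inference.py` already does this internally is not stated here.
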
## Notes

- Expects 16 kHz mono audio.
- The checkpoint was trained with PyTorch Lightning; `strict=False` is used on
  load to tolerate minor state-dict key differences.
- This is the raw Lightning checkpoint rather than a `transformers`-native
  format, so the standard `AutoModel.from_pretrained` call will not work.

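Loading with `strict=False` skips mismatched keys instead of raising an error, so it is worth inspecting what was skipped. The following is a minimal sketch of that pattern using a toy `nn.Linear` in place of the real network. `load_lenient` and the fake checkpoint are hypothetical and only illustrate the PyTorch behaviour; they are not the code in `inference.py`.

```python
import torch
from torch import nn


def load_lenient(model, checkpoint):
    """Load weights with strict=False and return the key mismatches."""
    # Lightning checkpoints keep the weights under the "state_dict" key.
    state_dict = checkpoint.get("state_dict", checkpoint)
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Toy stand-in for the real model (assumption, for illustration only).
toy = nn.Linear(4, 2)
fake_ckpt = {
    "state_dict": {
        "weight": torch.zeros(2, 4),   # matches toy.weight
        "extra.bias": torch.zeros(2),  # has no counterpart in the toy model
    }
}

missing, unexpected = load_lenient(toy, fake_ckpt)
print("missing:", missing)
print("unexpected:", unexpected)
```

Any parameter reported as missing keeps its initial value rather than a trained one, so a long list of missing keys usually means the checkpoint and the model definition do not match.
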
## Citation

If you use this model, please cite:

```bibtex
@article{baali2025caarma,
  title={CAARMA: Class augmentation with adversarial mixup regularization},
  author={Baali, Massa and Li, Xiang and Chen, Hao and Hannan, Syed Abdul and Singh, Rita and Raj, Bhiksha},
  journal={Findings of the Association for Computational Linguistics: EMNLP},
  volume={2025},
  pages={9732--9742},
  year={2025}
}
```