---
license: apache-2.0
tags:
- audio
- audio-classification
- audio-captioning
- onnx
- executorch
- mobile
- arm
language:
- en
pipeline_tag: audio-classification
base_model:
- wsntxxn/effb2-trm-audiocaps-captioning
- sentence-transformers/all-MiniLM-L6-v2
---

# Audio Caption and Categorizer Models

## Model Description

This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of:

1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.

2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity.

### Export Formats
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- **Categorizer**: ExecuTorch (`.pte`) format with quantization
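
For orientation, the `.pte` files come out of the standard ExecuTorch lowering flow. Below is a minimal sketch with a stand-in module, assuming a recent `executorch` release; it omits the dynamic-quantization pass, and the repository's actual export scripts are listed under Project Structure:

```python
import torch
from torch.export import export
from executorch.exir import to_edge

# Stand-in module; the real decoder/categorizer are wrapped in the export scripts.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

example_inputs = (torch.randn(1, 80000),)
exported = export(Toy().eval(), example_inputs)  # trace with torch.export
edge = to_edge(exported)                         # lower to the Edge dialect
et_program = edge.to_executorch()                # lower to an ExecuTorch program
with open("toy_model.pte", "wb") as f:
    f.write(et_program.buffer)                   # serialized flatbuffer
```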

### Key Features
- 5-second audio input at 16kHz
- Preprocessing baked into ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions
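
Because the spectrogram transforms live inside the ONNX encoder, the only client-side preparation is getting the waveform into the expected shape: mono, 16kHz, exactly 80,000 samples. A minimal sketch using `torchaudio` (the file name is a placeholder, and zero-padding/truncation is one reasonable policy, not necessarily the repository's):

```python
import torch
import torchaudio

TARGET_SR = 16000
TARGET_LEN = 5 * TARGET_SR  # 80,000 samples

waveform, sr = torchaudio.load("sample_audio.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)       # downmix to mono
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
# Zero-pad or truncate to exactly 5 seconds
if waveform.shape[1] < TARGET_LEN:
    waveform = torch.nn.functional.pad(waveform, (0, TARGET_LEN - waveform.shape[1]))
else:
    waveform = waveform[:, :TARGET_LEN]
audio = waveform.numpy()  # (1, 80000) float32, ready for the ONNX encoder
```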

## Usage

### Quick Start

Generate a caption for an audio file:

```bash
# Activate environment
source .venv/bin/activate

# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
```

### Python Example

```python
import numpy as np
import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1])
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
```
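
To complete the pipeline, the caption can be matched against the precomputed category embeddings. The sketch below continues the example above; the `.pte` file name, the categorizer's forward signature, and the JSON layout are assumptions based on the usual sentence-transformers mean-pooling recipe (see `generate_category_embeddings.py` for the repository's actual implementation):

```python
import json

# Hypothetical export file name; adjust to the actual .pte in this repo
categorizer = _load_for_executorch("sentence-transformers-embbedings/all_minilm_l6_v2.pte")
st_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Embed the caption: mean-pool the token embeddings, then L2-normalize
enc = st_tokenizer(caption, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=128)
token_emb = categorizer.forward((enc["input_ids"], enc["attention_mask"]))[0]
mask = enc["attention_mask"].unsqueeze(-1)
emb = (token_emb * mask).sum(1) / mask.sum(1)
emb = torch.nn.functional.normalize(emb, dim=-1).squeeze(0).numpy()

# Cosine similarity against the precomputed category embeddings
with open("sentence-transformers-embbedings/category_embeddings.json") as f:
    cats = json.load(f)  # assumed layout: {category name: embedding vector}
best = max(cats, key=lambda name: float(np.dot(emb, np.array(cats[name]))))
print(f"Best matching category: {best}")
```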

## Training Details

### Base Models

This repository does **not train any models**; it exports pre-trained models to optimized formats:

| Component | Base Model | Training Dataset | Parameters |
|-----------|------------|------------------|------------|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |

### Export Configuration

**Audio Captioning**:
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- **Input**: Raw audio waveform (16kHz, 5 seconds)
- **Encoder**: ONNX opset 17 with dynamic axes
- **Decoder**: ExecuTorch with dynamic quantization (int8)
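
For reference, this corresponds to the following standalone `torchaudio` transform chain (a sketch of the equivalent computation, not the export code itself; exact `power`/`stype` settings may differ):

```python
import torch
import torchaudio

# Equivalent of the preprocessing baked into the ONNX encoder
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,
    win_length=512,
    hop_length=160,
    n_mels=64,
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 80000)    # 5 s of 16kHz audio
log_mel = to_db(melspec(waveform))  # (1, 64, 501): 80000 / 160 + 1 frames
```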

**Categorizer**:
- **Tokenizer**: BERT WordPiece, from all-MiniLM-L6-v2 (max length: 128)
- **Export**: ExecuTorch with dynamic quantization
- **Categories**: 50+ predefined audio event categories

## Project Structure

```
.
β”œβ”€β”€ audio-caption/
β”‚   β”œβ”€β”€ export_encoder_preprocess_onnx.py  # Export ONNX encoder
β”‚   β”œβ”€β”€ export_decoder_executorch.py       # Export ExecuTorch decoder
β”‚   β”œβ”€β”€ generate_caption_hybrid.py         # Inference pipeline
β”‚   β”œβ”€β”€ effb2_encoder_preprocess.onnx      # Exported encoder
β”‚   └── effb2_decoder_5sec.pte             # Exported decoder
β”‚
β”œβ”€β”€ sentence-transformers-embbedings/
β”‚   β”œβ”€β”€ export_sentence_transformers_executorch.py
β”‚   β”œβ”€β”€ generate_category_embeddings.py
β”‚   └── category_embeddings.json
β”‚
└── categories.json                         # Category definitions
```

## Setup

### Prerequisites

```bash
# Install uv package manager
pip install uv

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r pyproject.toml
```

### Configuration

Create a `.env` file:

```ini
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here

# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
```

### Export Models

```bash
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py

# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py

# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
```

## License

Apache License 2.0

## Citations

### Audio Captioning Model

```bibtex
@inproceedings{xu2024efficient,
  title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
  author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
  booktitle={Interspeech 2024},
  year={2024},
  doi={10.48550/arXiv.2407.14329},
  url={https://arxiv.org/abs/2407.14329}
}
```

### Sentence Transformer

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```