---
license: apache-2.0
tags:
- audio
- audio-classification
- audio-captioning
- onnx
- executorch
- mobile
- arm
language:
- en
pipeline_tag: audio-classification
base_model:
- wsntxxn/effb2-trm-audiocaps-captioning
- sentence-transformers/all-MiniLM-L6-v2
---

# Audio Caption and Categorizer Models

## Model Description

This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of:

1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.

2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity.

### Export Formats
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- **Categorizer**: ExecuTorch (`.pte`) format with quantization
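
For orientation, the `.pte` files come out of the standard ExecuTorch lowering flow. Below is a minimal sketch with a stand-in module, assuming a recent `executorch` release; it omits the dynamic-quantization pass, and the repository's actual export scripts are listed under Project Structure:

```python
import torch
from torch.export import export
from executorch.exir import to_edge

# Stand-in module; the real decoder/categorizer are wrapped in the export scripts.
class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

example_inputs = (torch.randn(1, 80000),)
exported = export(Toy().eval(), example_inputs)  # trace with torch.export
edge = to_edge(exported)                         # lower to the Edge dialect
et_program = edge.to_executorch()                # lower to an ExecuTorch program
with open("toy_model.pte", "wb") as f:
    f.write(et_program.buffer)                   # serialized flatbuffer
```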

### Key Features
- 5-second audio input at 16kHz
- Preprocessing baked into ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions
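
Because the spectrogram transforms live inside the ONNX encoder, the only client-side preparation is getting the waveform into the expected shape: mono, 16kHz, exactly 80,000 samples. A minimal sketch using `torchaudio` (the file name is a placeholder, and zero-padding/truncation is one reasonable policy, not necessarily the repository's):

```python
import torch
import torchaudio

TARGET_SR = 16000
TARGET_LEN = 5 * TARGET_SR  # 80,000 samples

waveform, sr = torchaudio.load("sample_audio.wav")  # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)       # downmix to mono
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
# Zero-pad or truncate to exactly 5 seconds
if waveform.shape[1] < TARGET_LEN:
    waveform = torch.nn.functional.pad(waveform, (0, TARGET_LEN - waveform.shape[1]))
else:
    waveform = waveform[:, :TARGET_LEN]
audio = waveform.numpy()  # (1, 80000) float32, ready for the ONNX encoder
```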

## Usage

### Quick Start

Generate a caption for an audio file:

```bash
# Activate environment
source .venv/bin/activate

# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
```

### Python Example

```python
import numpy as np
import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1])
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
```
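
To complete the pipeline, the caption can be matched against the precomputed category embeddings. The sketch below continues the example above; the `.pte` file name, the categorizer's forward signature, and the JSON layout are assumptions based on the usual sentence-transformers mean-pooling recipe (see `generate_category_embeddings.py` for the repository's actual implementation):

```python
import json

# Hypothetical export file name; adjust to the actual .pte in this repo
categorizer = _load_for_executorch("sentence-transformers-embbedings/all_minilm_l6_v2.pte")
st_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Embed the caption: mean-pool the token embeddings, then L2-normalize
enc = st_tokenizer(caption, return_tensors="pt", padding="max_length",
                   truncation=True, max_length=128)
token_emb = categorizer.forward((enc["input_ids"], enc["attention_mask"]))[0]
mask = enc["attention_mask"].unsqueeze(-1)
emb = (token_emb * mask).sum(1) / mask.sum(1)
emb = torch.nn.functional.normalize(emb, dim=-1).squeeze(0).numpy()

# Cosine similarity against the precomputed category embeddings
with open("sentence-transformers-embbedings/category_embeddings.json") as f:
    cats = json.load(f)  # assumed layout: {category name: embedding vector}
best = max(cats, key=lambda name: float(np.dot(emb, np.array(cats[name]))))
print(f"Best matching category: {best}")
```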

## Training Details

### Base Models

This repository does **not train any models**; it exports pre-trained models to optimized formats:

| Component | Base Model | Training Dataset | Parameters |
|-----------|------------|------------------|------------|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |

### Export Configuration

**Audio Captioning**:
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- **Input**: Raw audio waveform (16kHz, 5 seconds)
- **Encoder**: ONNX opset 17 with dynamic axes
- **Decoder**: ExecuTorch with dynamic quantization (int8)
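
For reference, this corresponds to the following standalone `torchaudio` transform chain (a sketch of the equivalent computation, not the export code itself; exact `power`/`stype` settings may differ):

```python
import torch
import torchaudio

# Equivalent of the preprocessing baked into the ONNX encoder
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=512,
    win_length=512,
    hop_length=160,
    n_mels=64,
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 80000)    # 5 s of 16kHz audio
log_mel = to_db(melspec(waveform))  # (1, 64, 501): 80000 / 160 + 1 frames
```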

**Categorizer**:
- **Tokenizer**: BERT WordPiece, from all-MiniLM-L6-v2 (max length: 128)
- **Export**: ExecuTorch with dynamic quantization
- **Categories**: 50+ predefined audio event categories

## Project Structure

```
.
β”œβ”€β”€ audio-caption/
β”‚   β”œβ”€β”€ export_encoder_preprocess_onnx.py  # Export ONNX encoder
β”‚   β”œβ”€β”€ export_decoder_executorch.py       # Export ExecuTorch decoder
β”‚   β”œβ”€β”€ generate_caption_hybrid.py         # Inference pipeline
β”‚   β”œβ”€β”€ effb2_encoder_preprocess.onnx      # Exported encoder
β”‚   └── effb2_decoder_5sec.pte             # Exported decoder
β”‚
β”œβ”€β”€ sentence-transformers-embbedings/
β”‚   β”œβ”€β”€ export_sentence_transformers_executorch.py
β”‚   β”œβ”€β”€ generate_category_embeddings.py
β”‚   └── category_embeddings.json
β”‚
└── categories.json                         # Category definitions
```

## Setup

### Prerequisites

```bash
# Install uv package manager
pip install uv

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r pyproject.toml
```

### Configuration

Create a `.env` file:

```ini
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here

# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
```

### Export Models

```bash
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py

# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py

# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
```

## License

Apache License 2.0

## Citations

### Audio Captioning Model

```bibtex
@inproceedings{xu2024efficient,
  title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
  author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
  booktitle={Interspeech 2024},
  year={2024},
  doi={10.48550/arXiv.2407.14329},
  url={https://arxiv.org/abs/2407.14329}
}
```

### Sentence Transformer

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```