---
license: apache-2.0
tags:
- audio
- audio-classification
- audio-captioning
- onnx
- executorch
- mobile
- arm
language:
- en
pipeline_tag: audio-classification
base_model:
- wsntxxn/effb2-trm-audiocaps-captioning
- sentence-transformers/all-MiniLM-L6-v2
---
# Audio Caption and Categorizer Models
## Model Description
This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of:
1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.
2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity.
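The semantic-similarity matching in step 2 can be sketched as cosine similarity between sentence embeddings. The category names and tiny 4-dimensional vectors below are illustrative placeholders, not the repository's actual `categories.json` or MiniLM's 384-dimensional output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def categorize(caption_emb: np.ndarray, category_embs: dict) -> str:
    """Return the category whose embedding is most similar to the caption's."""
    return max(category_embs,
               key=lambda name: cosine_similarity(caption_emb, category_embs[name]))

# Toy embeddings standing in for real MiniLM sentence embeddings
categories = {
    "dog_bark": np.array([0.9, 0.1, 0.0, 0.0]),
    "siren":    np.array([0.0, 0.0, 0.9, 0.2]),
}
caption_emb = np.array([0.8, 0.2, 0.1, 0.0])  # e.g. embedding of "a dog is barking"
print(categorize(caption_emb, categories))  # -> dog_bark
```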
### Export Formats
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- **Categorizer**: ExecuTorch (`.pte`) format with quantization
### Key Features
- 5-second audio input at 16kHz
- Preprocessing baked into ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions
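The baked-in preprocessing follows the usual STFT → mel filterbank → decibel chain. A minimal numpy sketch of the amplitude-to-dB step and the frame-count arithmetic, assuming no center padding (the actual ONNX export's framing may differ):

```python
import numpy as np

SAMPLE_RATE = 16_000
N_FFT, HOP, N_SAMPLES = 512, 160, 5 * SAMPLE_RATE  # 5 s of 16 kHz audio

# STFT frame count without center padding: 1 + (N - n_fft) // hop
n_frames = 1 + (N_SAMPLES - N_FFT) // HOP
print(n_frames)  # 497 frames for 80,000 samples

def amplitude_to_db(power: np.ndarray, top_db: float = 80.0) -> np.ndarray:
    """Convert a power spectrogram to decibels, clamped to a top_db range."""
    db = 10.0 * np.log10(np.maximum(power, 1e-10))
    return np.maximum(db, db.max() - top_db)

mel_power = np.abs(np.random.randn(64, n_frames)) ** 2  # fake n_mels=64 mel power
mel_db = amplitude_to_db(mel_power)
print(mel_db.shape)  # (64, 497)
```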
## Usage
### Quick Start
Generate a caption for an audio file:
```bash
# Activate environment
source .venv/bin/activate
# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
```
### Python Example
```python
import numpy as np
import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples); replace with real audio
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1]),
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
```
## Training Details
### Base Models
This repository does **not train any models**; it exports pre-trained models to optimized formats:
| Component | Base Model | Training Dataset | Parameters |
|-----------|------------|------------------|------------|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |
### Export Configuration
**Audio Captioning**:
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- **Input**: Raw audio waveform (16kHz, 5 seconds)
- **Encoder**: ONNX opset 17 with dynamic axes
- **Decoder**: ExecuTorch with dynamic quantization (int8)
**Categorizer**:
- **Tokenizer**: RoBERTa-based (max length: 128)
- **Export**: ExecuTorch with dynamic quantization
- **Categories**: 50+ predefined audio event categories
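Dynamic int8 quantization, as used for the decoder and categorizer exports, rescales each weight tensor to 8-bit integers at export time while activations stay in float. A minimal numpy sketch of a symmetric per-tensor scheme (the actual ExecuTorch quantization flow differs in detail):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: returns (int8 weights, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 weights and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(128, 64).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # 0.25: int8 weights are 4x smaller than float32
# Round-to-nearest error is bounded by half a quantization step
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)
```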
## Project Structure
```
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py   # Export ONNX encoder
│   ├── export_decoder_executorch.py        # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py          # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx       # Exported encoder
│   └── effb2_decoder_5sec.pte              # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions
```
## Setup
### Prerequisites
```bash
# Install uv package manager
pip install uv
# Create environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -r pyproject.toml
```
### Configuration
Create a `.env` file:
```ini
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here
# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
```
### Export Models
```bash
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py
# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py
# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
```
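The exact format of `category_embeddings.json` is not documented here; a plausible, hypothetical structure is L2-normalized vectors keyed by category name, which makes cosine similarity at inference time a plain dot product. A sketch with random stand-in embeddings (real ones would come from the exported MiniLM model):

```python
import json
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    return v / np.linalg.norm(v)

# Hypothetical category names and random 384-dim stand-in embeddings
rng = np.random.default_rng(0)
embeddings = {name: l2_normalize(rng.standard_normal(384)).tolist()
              for name in ["dog_bark", "siren", "speech"]}

with open("category_embeddings.json", "w") as f:
    json.dump(embeddings, f)

with open("category_embeddings.json") as f:
    loaded = json.load(f)
print(len(loaded["siren"]))  # 384
```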
## License
Apache License 2.0
## Citations
### Audio Captioning Model
```bibtex
@inproceedings{xu2024efficient,
title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
booktitle={Interspeech 2024},
year={2024},
doi={10.48550/arXiv.2407.14329},
url={https://arxiv.org/abs/2407.14329}
}
```
### Sentence Transformer
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
``` |