---
license: apache-2.0
tags:
- audio
- audio-classification
- audio-captioning
- onnx
- executorch
- mobile
- arm
language:
- en
pipeline_tag: audio-classification
base_model:
- wsntxxn/effb2-trm-audiocaps-captioning
- sentence-transformers/all-MiniLM-L6-v2
---
# Audio Caption and Categorizer Models
## Model Description
This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of:
1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.
2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity.
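To make the second stage concrete, here is a host-side sketch of caption-to-category matching by cosine similarity. The categories and caption below are made up, and the original `all-MiniLM-L6-v2` checkpoint stands in for the on-device `.pte` export:
```python
from sentence_transformers import SentenceTransformer, util

# Illustrative only: on-device, the quantized .pte export of this model is used instead.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

categories = ["dog barking", "rain", "car engine", "speech"]  # example categories
caption = "a dog barks while rain falls"                      # output of the captioning stage

cat_emb = model.encode(categories, convert_to_tensor=True)
cap_emb = model.encode(caption, convert_to_tensor=True)

scores = util.cos_sim(cap_emb, cat_emb)[0]  # cosine similarity against each category
best = int(scores.argmax())
print(categories[best], float(scores[best]))
```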
### Export Formats
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- **Categorizer**: ExecuTorch (`.pte`) format with quantization
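For context, the typical ExecuTorch export path is `torch.export` → Edge dialect → serialized `.pte` program. The sketch below uses a stand-in module and omits quantization, whose configuration depends on the chosen quantizer; the repository's own export scripts are the authoritative reference:
```python
import torch
from executorch.exir import to_edge

class TinyDecoder(torch.nn.Module):
    """Stand-in module; the real export scripts handle the actual decoder."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return self.proj(x)

model = TinyDecoder().eval()
example_inputs = (torch.randn(1, 10, 64),)

# torch.export -> Edge dialect -> ExecuTorch program, then serialize to .pte
program = to_edge(torch.export.export(model, example_inputs)).to_executorch()
with open("tiny_decoder.pte", "wb") as f:
    f.write(program.buffer)
```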
### Key Features
- 5-second audio input at 16kHz
- Preprocessing baked into ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions
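Since the STFT/mel/dB chain lives inside the ONNX graph, the encoder consumes raw waveforms directly. A quick host-side check confirms this (the input/output names below follow the usage example and should be verified against the actual export):
```python
import onnxruntime as ort

sess = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
for i in sess.get_inputs():
    print("input:", i.name, i.shape, i.type)    # expect a raw waveform, e.g. "audio" [1, 80000]
for o in sess.get_outputs():
    print("output:", o.name, o.shape, o.type)   # expect encoder features, e.g. "attn_emb"
```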
## Usage
### Quick Start
Generate a caption for an audio file:
```bash
# Activate environment
source .venv/bin/activate
# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
```
### Python Example
```python
import numpy as np
import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode: preprocessing (STFT, mel, dB) runs inside the ONNX graph
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search)
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1]),
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
```
## Training Details
### Base Models
This repository does **not train any models**; it exports the following pre-trained models to optimized formats:
| Component | Base Model | Training Dataset | Parameters |
|-----------|------------|------------------|------------|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |
### Export Configuration
**Audio Captioning**:
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- **Input**: Raw audio waveform (16kHz, 5 seconds)
- **Encoder**: ONNX opset 17 with dynamic axes
- **Decoder**: ExecuTorch with dynamic quantization (int8)
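As a reference for what the baked-in preprocessing computes, here is a torchaudio sketch with the parameters above (the 16 kHz sample rate comes from the input spec; transform defaults such as `center=True` are assumed):
```python
import torch
import torchaudio

# Rebuild of the encoder's preprocessing chain from the listed parameters.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=512, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 80000)   # 5 seconds of 16kHz audio
log_mel = to_db(mel(waveform))     # -> (1, 64, 501) log-mel frames
print(log_mel.shape)
```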
**Categorizer**:
- **Tokenizer**: RoBERTa-based (max length: 128)
- **Export**: ExecuTorch with dynamic quantization
- **Categories**: 50+ predefined audio event categories
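A minimal sketch of how the category embeddings might be precomputed (the JSON layout and the use of the original checkpoint here are assumptions; `generate_category_embeddings.py` is the authoritative script):
```python
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

with open("categories.json") as f:
    categories = json.load(f)      # assumed: a list of category strings

embeddings = model.encode(categories, normalize_embeddings=True)
with open("sentence-transformers-embbedings/category_embeddings.json", "w") as f:
    json.dump({c: e.tolist() for c, e in zip(categories, embeddings)}, f)
```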
## Project Structure
```
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py   # Export ONNX encoder
│   ├── export_decoder_executorch.py        # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py          # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx       # Exported encoder
│   └── effb2_decoder_5sec.pte              # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions
```
## Setup
### Prerequisites
```bash
# Install uv package manager
pip install uv
# Create environment
uv venv
source .venv/bin/activate
# Install dependencies
uv pip install -r pyproject.toml
```
### Configuration
Create a `.env` file:
```ini
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here
# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
```
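If the scripts load this file via `python-dotenv` (an assumption; check their imports), the pattern is:
```python
import os
from dotenv import load_dotenv

load_dotenv()                          # reads .env from the working directory
token = os.environ.get("HF_TOKEN")     # used for gated model downloads
```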
### Export Models
```bash
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py
# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py
# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
```
## License
Apache License 2.0
## Citations
### Audio Captioning Model
```bibtex
@inproceedings{xu2024efficient,
title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
booktitle={Interspeech 2024},
year={2024},
doi={10.48550/arXiv.2407.14329},
url={https://arxiv.org/abs/2407.14329}
}
```
### Sentence Transformer
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```