|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- audio |
|
|
- audio-classification |
|
|
- audio-captioning |
|
|
- onnx |
|
|
- executorch |
|
|
- mobile |
|
|
- arm |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: audio-classification |
|
|
base_model: |
|
|
- wsntxxn/effb2-trm-audiocaps-captioning |
|
|
- sentence-transformers/all-MiniLM-L6-v2 |
|
|
--- |
|
|
|
|
|
# Audio Caption and Categorizer Models |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of: |
|
|
|
|
|
1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events. |
|
|
|
|
|
2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity. |
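Category matching works by embedding the generated caption and comparing it against precomputed category embeddings; the highest-scoring category wins. A minimal sketch of that comparison (the names here are illustrative; the actual logic lives in the repository scripts):

```python
import numpy as np

def categorize(caption_emb: np.ndarray, category_embs: np.ndarray, names: list[str]) -> str:
    """Return the category whose embedding is most cosine-similar to the caption's."""
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    category_embs = category_embs / np.linalg.norm(category_embs, axis=1, keepdims=True)
    return names[int(np.argmax(category_embs @ caption_emb))]
```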
|
|
|
|
|
### Export Formats |
|
|
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB) |
|
|
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size |
|
|
- **Categorizer**: ExecuTorch (`.pte`) format with dynamic quantization
|
|
|
|
|
### Key Features |
|
|
- 5-second audio input at 16kHz |
|
|
- Preprocessing baked into ONNX encoder (no external audio processing needed) |
|
|
- Optimized for mobile inference with quantization |
|
|
- Complete end-to-end pipeline from raw audio to categorized captions |
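Because preprocessing lives inside the ONNX graph, the encoder's only input is the raw waveform. You can confirm the interface by inspecting the session (a quick check, assuming the exported file is present):

```python
import onnxruntime as ort

session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)   # expect a single raw-audio input
for out in session.get_outputs():
    print(out.name, out.shape, out.type)   # includes the "attn_emb" embedding output
```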
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
Generate a caption for an audio file: |
|
|
|
|
|
```bash |
|
|
# Activate environment |
|
|
source .venv/bin/activate |
|
|
|
|
|
# Generate caption |
|
|
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav |
|
|
``` |
|
|
|
|
|
### Python Example |
|
|
|
|
|
```python |
|
|
import onnxruntime as ort
import numpy as np
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode: preprocessing is baked into the ONNX graph, so raw audio goes straight in
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search): start from BOS, append the argmax token until EOS
generated = [tokenizer.bos_token_id]
for _ in range(30):  # cap caption length at 30 tokens
    logits = decoder.forward((
        torch.tensor([generated]),             # tokens generated so far
        torch.tensor(attn_emb),                # encoder output embeddings
        torch.tensor([attn_emb.shape[1] - 1])  # encoder output length
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
|
|
``` |
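The example above feeds random noise. To caption a real file, load it as 16kHz mono and pad or truncate to 80000 samples; a sketch using torchaudio (any loader that yields float32 PCM at 16kHz works):

```python
import numpy as np
import torch
import torchaudio

wav, sr = torchaudio.load("sample_audio.wav")      # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
if wav.shape[1] < 80000:                           # pad short clips with silence
    wav = torch.nn.functional.pad(wav, (0, 80000 - wav.shape[1]))
audio = wav[:, :80000].numpy().astype(np.float32)  # truncate long clips to 5 seconds
```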
|
|
|
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Base Models |
|
|
|
|
|
This repository does **not train any models**; it exports pre-trained models to optimized formats:
|
|
|
|
|
| Component | Base Model | Training Dataset | Parameters | |
|
|
|-----------|------------|------------------|------------| |
|
|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M | |
|
|
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M | |
|
|
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M | |
|
|
|
|
|
### Export Configuration |
|
|
|
|
|
**Audio Captioning**: |
|
|
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512` |
|
|
- **Input**: Raw audio waveform (16kHz, 5 seconds) |
|
|
- **Encoder**: ONNX opset 17 with dynamic axes |
|
|
- **Decoder**: ExecuTorch with dynamic quantization (int8) |
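For reference, these settings correspond to the following torchaudio transforms, which is roughly what the export bakes into the ONNX graph (a sketch for intuition, not the export script itself):

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=512, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 80000)    # 5 seconds of audio at 16kHz
features = to_db(mel(waveform))     # log-mel features, shape (1, 64, 501)
```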
|
|
|
|
|
**Categorizer**: |
|
|
- **Tokenizer**: BERT-based WordPiece (max length: 128)
|
|
- **Export**: ExecuTorch with dynamic quantization |
|
|
- **Categories**: 50+ predefined audio event categories |
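`generate_category_embeddings.py` precomputes one embedding per category so no category text needs encoding at inference time. A minimal equivalent using the reference (non-exported) model, assuming `categories.json` holds a flat list of category names (the output schema below is illustrative):

```python
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
with open("categories.json") as f:
    categories = json.load(f)   # assumed: a flat list like ["dog barking", ...]

# L2-normalized embeddings turn cosine similarity into a plain dot product later.
embeddings = model.encode(categories, normalize_embeddings=True)

with open("sentence-transformers-embbedings/category_embeddings.json", "w") as f:
    json.dump({"categories": categories, "embeddings": embeddings.tolist()}, f)
```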
|
|
|
|
|
## Project Structure |
|
|
|
|
|
```
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py   # Export ONNX encoder
│   ├── export_decoder_executorch.py        # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py          # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx       # Exported encoder
│   └── effb2_decoder_5sec.pte              # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions
```
|
|
|
|
|
## Setup |
|
|
|
|
|
### Prerequisites |
|
|
|
|
|
```bash |
|
|
# Install uv package manager |
|
|
pip install uv |
|
|
|
|
|
# Create environment |
|
|
uv venv |
|
|
source .venv/bin/activate |
|
|
|
|
|
# Install dependencies |
|
|
uv pip install -r pyproject.toml |
|
|
``` |
|
|
|
|
|
### Configuration |
|
|
|
|
|
Create a `.env` file: |
|
|
|
|
|
```ini |
|
|
# Hugging Face Token (for gated models) |
|
|
HF_TOKEN=your_token_here |
|
|
|
|
|
# Optional: Custom cache directory |
|
|
# HF_HOME=./.cache/huggingface |
|
|
``` |
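The export scripts are assumed here to pick these values up from the environment; with `python-dotenv` installed, the equivalent in your own code is (a sketch):

```python
import os
from dotenv import load_dotenv

load_dotenv()                   # reads .env from the working directory
token = os.getenv("HF_TOKEN")   # None if unset; only needed for gated models
```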
|
|
|
|
|
### Export Models |
|
|
|
|
|
```bash |
|
|
# Export audio captioning models |
|
|
python audio-caption/export_encoder_preprocess_onnx.py |
|
|
python audio-caption/export_decoder_executorch.py |
|
|
|
|
|
# Export categorization model |
|
|
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py |
|
|
|
|
|
# Generate category embeddings |
|
|
python sentence-transformers-embbedings/generate_category_embeddings.py |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache License 2.0 |
|
|
|
|
|
## Citations |
|
|
|
|
|
### Audio Captioning Model |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{xu2024efficient, |
|
|
title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation}, |
|
|
author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.}, |
|
|
booktitle={Interspeech 2024}, |
|
|
year={2024}, |
|
|
doi={10.48550/arXiv.2407.14329}, |
|
|
url={https://arxiv.org/abs/2407.14329} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Sentence Transformer |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{reimers-2019-sentence-bert, |
|
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
|
year = "2019", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://arxiv.org/abs/1908.10084", |
|
|
} |
|
|
``` |