---
license: apache-2.0
tags:
- audio
- audio-classification
- audio-captioning
- onnx
- executorch
- mobile
- arm
language:
- en
pipeline_tag: audio-classification
base_model:
- wsntxxn/effb2-trm-audiocaps-captioning
- sentence-transformers/all-MiniLM-L6-v2
---

# Audio Caption and Categorizer Models

## Model Description

This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of:

1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events.
2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity.

### Export Formats

- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB)
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size
- **Categorizer**: ExecuTorch (`.pte`) format with quantization

### Key Features

- 5-second audio input at 16kHz
- Preprocessing baked into the ONNX encoder (no external audio processing needed)
- Optimized for mobile inference with quantization
- Complete end-to-end pipeline from raw audio to categorized captions

## Usage

### Quick Start

Generate a caption for an audio file:

```bash
# Activate environment
source .venv/bin/activate

# Generate caption
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav
```

### Python Example

```python
import numpy as np
import onnxruntime as ort
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode (the ONNX graph includes the mel-spectrogram front end)
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search): feed the growing token sequence back in each step
generated = [tokenizer.bos_token_id]
for _ in range(30):
    logits = decoder.forward((
        torch.tensor([generated]),
        torch.tensor(attn_emb),
        torch.tensor([attn_emb.shape[1] - 1])
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
```
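### Preparing Audio Input

The example above feeds random noise; real input must be mono float32 audio at 16kHz with exactly 80000 samples. Below is a minimal loading sketch, assuming `torchaudio` is available. The `load_audio` helper and its zero-pad/trim policy are illustrative assumptions, not taken from the repository's scripts.

```python
# Input-preparation sketch (assumption: torchaudio installed; load_audio
# and the pad/trim policy are illustrative, not part of this repo).
import torch
import torchaudio

def load_audio(path: str, target_sr: int = 16000, num_samples: int = 80000):
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    if waveform.shape[1] < num_samples:            # zero-pad short clips...
        waveform = torch.nn.functional.pad(
            waveform, (0, num_samples - waveform.shape[1]))
    return waveform[:, :num_samples].numpy()       # ...trim long ones: (1, 80000)

audio = load_audio("sample_audio.wav")  # use in place of the random input above
```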
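### Categorization Example

The second stage scores the generated caption against each predefined category by cosine similarity of sentence embeddings. The sketch below uses the original `sentence-transformers` model to show the matching logic; on device, the ExecuTorch `.pte` export and the precomputed `category_embeddings.json` take its place. The category strings here are placeholders, not the contents of `categories.json`.

```python
# Semantic matching sketch: caption -> best category via cosine similarity.
# Uses the upstream sentence-transformers model; the exported .pte plus
# category_embeddings.json replace this on device. Categories are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

categories = ["dog barking", "rain falling", "car engine", "human speech"]
caption = "a dog barks while rain falls in the background"

# Encode into 384-dim embeddings, L2-normalized so that cosine
# similarity reduces to a dot product
cat_emb = model.encode(categories, normalize_embeddings=True)
cap_emb = model.encode(caption, normalize_embeddings=True)

scores = cat_emb @ cap_emb
best = int(np.argmax(scores))
print(f"category: {categories[best]} (score={scores[best]:.3f})")
```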
## Training Details

### Base Models

This repository does **not train models** but exports pre-trained models to optimized formats:

| Component | Base Model | Training Dataset | Parameters |
|-----------|------------|------------------|------------|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M |
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M |
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M |

### Export Configuration

**Audio Captioning**:

- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512`
- **Input**: Raw audio waveform (16kHz, 5 seconds)
- **Encoder**: ONNX opset 17 with dynamic axes
- **Decoder**: ExecuTorch with dynamic quantization (int8)

**Categorizer**:

- **Tokenizer**: RoBERTa-based (max length: 128)
- **Export**: ExecuTorch with dynamic quantization
- **Categories**: 50+ predefined audio event categories

## Project Structure

```
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py   # Export ONNX encoder
│   ├── export_decoder_executorch.py        # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py          # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx       # Exported encoder
│   └── effb2_decoder_5sec.pte              # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions
```

## Setup

### Prerequisites

```bash
# Install uv package manager
pip install uv

# Create environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r pyproject.toml
```

### Configuration

Create a `.env` file:

```ini
# Hugging Face Token (for gated models)
HF_TOKEN=your_token_here

# Optional: Custom cache directory
# HF_HOME=./.cache/huggingface
```

### Export Models

```bash
# Export audio captioning models
python audio-caption/export_encoder_preprocess_onnx.py
python audio-caption/export_decoder_executorch.py

# Export categorization model
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py

# Generate category embeddings
python sentence-transformers-embbedings/generate_category_embeddings.py
```

## License

Apache License 2.0

## Citations

### Audio Captioning Model

```bibtex
@inproceedings{xu2024efficient,
  title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation},
  author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.},
  booktitle={Interspeech 2024},
  year={2024},
  doi={10.48550/arXiv.2407.14329},
  url={https://arxiv.org/abs/2407.14329}
}
```

### Sentence Transformer

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```