|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- audio |
|
|
- audio-classification |
|
|
- audio-captioning |
|
|
- onnx |
|
|
- executorch |
|
|
- mobile |
|
|
- arm |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: audio-classification |
|
|
base_model: |
|
|
- wsntxxn/effb2-trm-audiocaps-captioning |
|
|
- sentence-transformers/all-MiniLM-L6-v2 |
|
|
--- |
|
|
|
|
|
# Audio Caption and Categorizer Models |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This repository provides **optimized exports** of audio captioning and categorization models for **ARM-based mobile deployment**. The pipeline consists of: |
|
|
|
|
|
1. **Audio Captioning**: Uses [`wsntxxn/effb2-trm-audiocaps-captioning`](https://huggingface.co/wsntxxn/effb2-trm-audiocaps-captioning) (EfficientNet-B2 encoder + Transformer decoder) to generate natural language descriptions of audio events. |
|
|
|
|
|
2. **Audio Categorization**: Uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to match generated captions to predefined sound categories via semantic similarity. |
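Category matching works by embedding the generated caption and comparing it against precomputed category embeddings; the highest-scoring category wins. A minimal sketch of that comparison (the names here are illustrative; the actual logic lives in the repository scripts):

```python
import numpy as np

def categorize(caption_emb: np.ndarray, category_embs: np.ndarray, names: list[str]) -> str:
    """Return the category whose embedding is most cosine-similar to the caption's."""
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    category_embs = category_embs / np.linalg.norm(category_embs, axis=1, keepdims=True)
    return names[int(np.argmax(category_embs @ caption_emb))]
```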
|
|
|
|
|
### Export Formats |
|
|
- **Encoder**: ONNX format with integrated preprocessing (STFT, MelSpectrogram, AmplitudeToDB) |
|
|
- **Decoder**: ExecuTorch (`.pte`) format with dynamic quantization for reduced model size |
|
|
- **Categorizer**: ExecuTorch (`.pte`) format with dynamic quantization
|
|
|
|
|
### Key Features |
|
|
- 5-second audio input at 16kHz |
|
|
- Preprocessing baked into ONNX encoder (no external audio processing needed) |
|
|
- Optimized for mobile inference with quantization |
|
|
- Complete end-to-end pipeline from raw audio to categorized captions |
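Because preprocessing lives inside the ONNX graph, the encoder's only input is the raw waveform. You can confirm the interface by inspecting the session (a quick check, assuming the exported file is present):

```python
import onnxruntime as ort

session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)   # expect a single raw-audio input
for out in session.get_outputs():
    print(out.name, out.shape, out.type)   # includes the "attn_emb" embedding output
```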
|
|
|
|
|
## Usage |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
Generate a caption for an audio file: |
|
|
|
|
|
```bash |
|
|
# Activate environment |
|
|
source .venv/bin/activate |
|
|
|
|
|
# Generate caption |
|
|
python audio-caption/generate_caption_hybrid.py --audio sample_audio.wav |
|
|
``` |
|
|
|
|
|
### Python Example |
|
|
|
|
|
```python |
|
|
import onnxruntime as ort
import numpy as np
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch
from transformers import AutoTokenizer

# Load models
encoder_session = ort.InferenceSession("audio-caption/effb2_encoder_preprocess.onnx")
decoder = _load_for_executorch("audio-caption/effb2_decoder_5sec.pte")
tokenizer = AutoTokenizer.from_pretrained("wsntxxn/audiocaps-simple-tokenizer", trust_remote_code=True)

# Process audio (16kHz, 5 seconds = 80000 samples)
audio = np.random.randn(1, 80000).astype(np.float32)

# Encode: preprocessing is baked into the ONNX graph, so raw audio goes straight in
attn_emb = encoder_session.run(["attn_emb"], {"audio": audio})[0]

# Decode (greedy search): start from BOS, append the argmax token until EOS
generated = [tokenizer.bos_token_id]
for _ in range(30):  # cap caption length at 30 tokens
    logits = decoder.forward((
        torch.tensor([generated]),             # tokens generated so far
        torch.tensor(attn_emb),                # encoder output embeddings
        torch.tensor([attn_emb.shape[1] - 1])  # encoder output length
    ))[0]
    next_token = int(torch.argmax(logits[0, -1, :]))
    generated.append(next_token)
    if next_token == tokenizer.eos_token_id:
        break

caption = tokenizer.decode(generated, skip_special_tokens=True)
print(caption)
|
|
``` |
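The example above feeds random noise. To caption a real file, load it as 16kHz mono and pad or truncate to 80000 samples; a sketch using torchaudio (any loader that yields float32 PCM at 16kHz works):

```python
import numpy as np
import torch
import torchaudio

wav, sr = torchaudio.load("sample_audio.wav")      # (channels, samples)
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
if wav.shape[1] < 80000:                           # pad short clips with silence
    wav = torch.nn.functional.pad(wav, (0, 80000 - wav.shape[1]))
audio = wav[:, :80000].numpy().astype(np.float32)  # truncate long clips to 5 seconds
```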
|
|
|
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Base Models |
|
|
|
|
|
This repository does **not train any models**; it exports pre-trained models to optimized formats:
|
|
|
|
|
| Component | Base Model | Training Dataset | Parameters | |
|
|
|-----------|------------|------------------|------------| |
|
|
| Audio Encoder | EfficientNet-B2 | AudioCaps | ~7.7M | |
|
|
| Caption Decoder | Transformer (2 layers) | AudioCaps | ~4.3M | |
|
|
| Categorizer | all-MiniLM-L6-v2 | 1B+ sentence pairs | ~22.7M | |
|
|
|
|
|
### Export Configuration |
|
|
|
|
|
**Audio Captioning**: |
|
|
- **Preprocessing**: `n_mels=64`, `n_fft=512`, `hop_length=160`, `win_length=512` |
|
|
- **Input**: Raw audio waveform (16kHz, 5 seconds) |
|
|
- **Encoder**: ONNX opset 17 with dynamic axes |
|
|
- **Decoder**: ExecuTorch with dynamic quantization (int8) |
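For reference, these settings correspond to the following torchaudio transforms, which is roughly what the export bakes into the ONNX graph (a sketch for intuition, not the export script itself):

```python
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=512, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

waveform = torch.randn(1, 80000)    # 5 seconds of audio at 16kHz
features = to_db(mel(waveform))     # log-mel features, shape (1, 64, 501)
```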
|
|
|
|
|
**Categorizer**: |
|
|
- **Tokenizer**: BERT-based WordPiece (max length: 128)
|
|
- **Export**: ExecuTorch with dynamic quantization |
|
|
- **Categories**: 50+ predefined audio event categories |
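`generate_category_embeddings.py` precomputes one embedding per category so no category text needs encoding at inference time. A minimal equivalent using the reference (non-exported) model, assuming `categories.json` holds a flat list of category names (the output schema below is illustrative):

```python
import json
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
with open("categories.json") as f:
    categories = json.load(f)   # assumed: a flat list like ["dog barking", ...]

# L2-normalized embeddings turn cosine similarity into a plain dot product later.
embeddings = model.encode(categories, normalize_embeddings=True)

with open("sentence-transformers-embbedings/category_embeddings.json", "w") as f:
    json.dump({"categories": categories, "embeddings": embeddings.tolist()}, f)
```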
|
|
|
|
|
## Project Structure |
|
|
|
|
|
```
.
├── audio-caption/
│   ├── export_encoder_preprocess_onnx.py   # Export ONNX encoder
│   ├── export_decoder_executorch.py        # Export ExecuTorch decoder
│   ├── generate_caption_hybrid.py          # Inference pipeline
│   ├── effb2_encoder_preprocess.onnx       # Exported encoder
│   └── effb2_decoder_5sec.pte              # Exported decoder
│
├── sentence-transformers-embbedings/
│   ├── export_sentence_transformers_executorch.py
│   ├── generate_category_embeddings.py
│   └── category_embeddings.json
│
└── categories.json                         # Category definitions
```
|
|
|
|
|
## Setup |
|
|
|
|
|
### Prerequisites |
|
|
|
|
|
```bash |
|
|
# Install uv package manager |
|
|
pip install uv |
|
|
|
|
|
# Create environment |
|
|
uv venv |
|
|
source .venv/bin/activate |
|
|
|
|
|
# Install dependencies |
|
|
uv pip install -r pyproject.toml |
|
|
``` |
|
|
|
|
|
### Configuration |
|
|
|
|
|
Create a `.env` file: |
|
|
|
|
|
```ini |
|
|
# Hugging Face Token (for gated models) |
|
|
HF_TOKEN=your_token_here |
|
|
|
|
|
# Optional: Custom cache directory |
|
|
# HF_HOME=./.cache/huggingface |
|
|
``` |
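The export scripts are assumed here to pick these values up from the environment; with `python-dotenv` installed, the equivalent in your own code is (a sketch):

```python
import os
from dotenv import load_dotenv

load_dotenv()                   # reads .env from the working directory
token = os.getenv("HF_TOKEN")   # None if unset; only needed for gated models
```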
|
|
|
|
|
### Export Models |
|
|
|
|
|
```bash |
|
|
# Export audio captioning models |
|
|
python audio-caption/export_encoder_preprocess_onnx.py |
|
|
python audio-caption/export_decoder_executorch.py |
|
|
|
|
|
# Export categorization model |
|
|
python sentence-transformers-embbedings/export_sentence_transformers_executorch.py |
|
|
|
|
|
# Generate category embeddings |
|
|
python sentence-transformers-embbedings/generate_category_embeddings.py |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
Apache License 2.0 |
|
|
|
|
|
## Citations |
|
|
|
|
|
### Audio Captioning Model |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{xu2024efficient, |
|
|
title={Efficient Audio Captioning with Encoder-Level Knowledge Distillation}, |
|
|
author={Xu, Xuenan and Liu, Haohe and Wu, Mengyue and Wang, Wenwu and Plumbley, Mark D.}, |
|
|
booktitle={Interspeech 2024}, |
|
|
year={2024}, |
|
|
doi={10.48550/arXiv.2407.14329}, |
|
|
url={https://arxiv.org/abs/2407.14329} |
|
|
} |
|
|
``` |
|
|
|
|
|
### Sentence Transformer |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{reimers-2019-sentence-bert, |
|
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
|
year = "2019", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://arxiv.org/abs/1908.10084", |
|
|
} |
|
|
``` |