---
language:
- en
- multilingual
license: apache-2.0
tags:
- onnx
- audio
- automatic-speech-recognition
- phoneme-recognition
- wav2vec2
base_model: facebook/wav2vec2-lv-60-espeak-cv-ft
---
# Wav2Vec2-LV-60-Espeak-CV-FT (ONNX)

This is an **ONNX export** of the [facebook/wav2vec2-lv-60-espeak-cv-ft](https://huggingface.co/facebook/wav2vec2-lv-60-espeak-cv-ft) model.
It is designed for client-side inference in the **UltrClick ContentPro** application to perform forced alignment of lyrics to audio.
## Model Details

- **Original Model**: `facebook/wav2vec2-lv-60-espeak-cv-ft`
- **Format**: ONNX (Open Neural Network Exchange)
- **Precision**: FP16 (Float16)
- **Output**: IPA phoneme logits (vocabulary size 392)
- **Sample Rate**: 16 kHz
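
The original model is a CTC model, so the per-frame phoneme logits are usually turned into a phoneme sequence by taking the best token per frame, merging repeats, and dropping the blank token. The following is a minimal sketch of that greedy decode in plain Python. It assumes the CTC blank is token id 0, which should be checked against the vocabulary of the original model; the names `BLANK_ID` and `greedy_ctc_decode` are illustrative only.

```python
from typing import List, Optional, Sequence

# Assumption: the CTC blank token has id 0. Confirm this against the
# vocabulary file of the original model before relying on it.
BLANK_ID = 0


def greedy_ctc_decode(
    logits: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Greedy CTC decode for one utterance.

    `logits` holds one row of vocabulary scores per frame. The function
    picks the best token per frame, merges consecutive repeats, and
    removes blanks, returning the remaining token ids.
    """
    decoded: List[int] = []
    previous: Optional[int] = None
    for frame in logits:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


if __name__ == "__main__":
    # Toy example with a 3-token vocabulary instead of 392.
    toy = [
        [0.9, 0.0, 0.1],  # blank
        [0.1, 0.0, 0.9],  # token 2
        [0.1, 0.0, 0.9],  # token 2 again (merged with the previous frame)
        [0.9, 0.0, 0.1],  # blank separates the two occurrences of token 2
        [0.1, 0.0, 0.9],  # token 2
        [0.1, 0.9, 0.0],  # token 1
    ]
    print(greedy_ctc_decode(toy))  # [2, 2, 1]
```

Mapping the resulting ids to IPA symbols requires the token-to-id vocabulary that ships with the original model. Forced alignment typically uses the full frame-level scores rather than only this greedy path.
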
## Usage

This model is intended to be used with ONNX Runtime (e.g., via `ort` in Rust or `onnxruntime` in Python).

### Input

- **Name**: `audio`
- **Shape**: `[batch_size, samples]`
- **Type**: Float32 tensor
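
As an illustration of the input layout, the sketch below turns a mono waveform into a Float32 batch of shape `[1, samples]` using NumPy. It assumes the audio has already been resampled to 16 kHz. The zero-mean, unit-variance normalization mirrors what the Hugging Face feature extractor for Wav2Vec2 does by default; whether this export expects normalized input is an assumption to verify, so it can be switched off. `prepare_audio` is a hypothetical helper name.

```python
import numpy as np

TARGET_SAMPLE_RATE = 16_000  # the model expects 16 kHz audio


def prepare_audio(pcm: np.ndarray, normalize: bool = True) -> np.ndarray:
    """Convert a mono waveform into a float32 array of shape [1, samples]."""
    if pcm.ndim != 1:
        raise ValueError("expected a mono, 1-D waveform")

    if np.issubdtype(pcm.dtype, np.signedinteger):
        # Scale signed integer PCM (e.g. int16) into [-1.0, 1.0).
        scale = float(np.iinfo(pcm.dtype).max) + 1.0
        audio = pcm.astype(np.float32) / np.float32(scale)
    else:
        audio = pcm.astype(np.float32)

    if normalize:
        # Assumption: the export expects zero-mean, unit-variance input,
        # as the original model's preprocessing does by default.
        audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)

    # Add the batch dimension: [samples] -> [1, samples].
    return audio[np.newaxis, :].astype(np.float32)


if __name__ == "__main__":
    one_second = np.zeros(TARGET_SAMPLE_RATE, dtype=np.int16)
    batch = prepare_audio(one_second, normalize=False)
    print(batch.shape, batch.dtype)  # (1, 16000) float32
```
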
### Output

- **Name**: `logits`
- **Shape**: `[batch_size, frames, 392]` (392 is the vocabulary size)
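
For alignment work it helps to know how `frames` relates to `samples`. The sketch below computes the expected frame count and approximate frame start times from the standard Wav2Vec2 feature-encoder convolutions (seven layers with a total stride of 320 samples, i.e. 20 ms at 16 kHz), assuming the export leaves that architecture unchanged. It also shows one way to call the model with the Python `onnxruntime` package using the tensor names listed above. The function names and the model filename are hypothetical.

```python
from typing import List

# (kernel, stride) of the Wav2Vec2 feature-encoder convolutions.
# Assumption: the ONNX export keeps the standard architecture.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]
SAMPLE_RATE = 16_000


def expected_frames(num_samples: int) -> int:
    """Number of output frames for an input of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1
    return max(length, 0)


def frame_start_times(num_frames: int) -> List[float]:
    """Approximate start time in seconds of each output frame."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride  # total stride: 320 samples
    return [i * hop / SAMPLE_RATE for i in range(num_frames)]


def run_model(model_path: str, audio):
    """Run the model on a float32 array of shape [batch_size, samples].

    Returns the `logits` array of shape [batch_size, frames, 392].
    """
    # Imported here so the helpers above work without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (logits,) = session.run(["logits"], {"audio": audio})
    return logits


if __name__ == "__main__":
    print(expected_frames(SAMPLE_RATE))  # 49 frames for one second of audio
    print(frame_start_times(3))
```

With a local copy of the model, a call such as `run_model("model.onnx", batch)` (the filename is a placeholder) should return an array whose second dimension matches `expected_frames(batch.shape[1])` if the architecture assumption holds.
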
## License

This model is a derivative of the original `facebook/wav2vec2-lv-60-espeak-cv-ft` model and retains the **Apache 2.0** license.