Wav2Vec2 Base 960h ONNX

ONNX export of facebook/wav2vec2-base-960h for asr.js.

Model details

  • Architecture: Wav2Vec2 (convolutional feature extractor + transformer encoder + CTC head)
  • Parameters: ~95M
  • Input: Raw 16 kHz mono PCM waveform [1, samples] float32
  • Output: CTC logits [1, frames, 32] float32
  • Output stride: 320 samples (50 fps at 16 kHz)
  • Vocabulary: 32 character-level tokens, including the CTC blank
  • Languages: English
  • ONNX opset: 18
  • External data: Yes (360 MB model weights in model.onnx.data)
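
The output shape and stride listed above are enough to turn raw logits into text with timestamps. The following is a minimal sketch of greedy CTC decoding, not the asr.js implementation. It assumes the logits arrive as a flat row-major Float32Array of shape [frames, 32], that the blank token has id 0, and that "|" marks word boundaries; greedyCtcDecode and its parameters are hypothetical names chosen for illustration.

```javascript
const SAMPLE_RATE = 16000; // Hz, from "Input" above
const STRIDE = 320;        // samples per output frame, from "Output stride" above

// Greedy CTC decoding: take the argmax per frame, collapse consecutive
// repeats, then drop blanks.
// Assumptions: `logits` is row-major [frames, vocabSize]; blank id defaults to 0.
function greedyCtcDecode(logits, frames, vocabSize, idToToken, blankId = 0) {
  const tokens = [];
  let prev = -1;
  for (let t = 0; t < frames; t++) {
    // Argmax over the vocabulary axis for frame t.
    let best = 0;
    let bestVal = -Infinity;
    for (let v = 0; v < vocabSize; v++) {
      const val = logits[t * vocabSize + v];
      if (val > bestVal) {
        bestVal = val;
        best = v;
      }
    }
    // Emit a token only when it differs from the previous frame and is not blank.
    if (best !== prev && best !== blankId) {
      tokens.push({
        token: idToToken[best],
        startSec: (t * STRIDE) / SAMPLE_RATE, // each frame covers 20 ms
      });
    }
    prev = best;
  }
  // Assumed convention: "|" is the word delimiter in the vocabulary.
  const text = tokens
    .map((x) => (x.token === '|' ? ' ' : x.token))
    .join('')
    .trim();
  return { text, tokens };
}
```

Here idToToken would be built by inverting the token-to-id mapping in vocab.json. Greedy decoding ignores any language model, so it is a baseline rather than the best achievable transcript.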

Usage with asr.js

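// Assumption: createWav2Vec2AlignerFromLogits is exported by the same package;
// adjust the import path if it lives elsewhere.
import { loadSpeechModel, createWav2Vec2AlignerFromLogits } from '@asrjs/speech-recognition';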

const model = await loadSpeechModel('facebook/wav2vec2-base-960h', {
  source: {
    kind: 'huggingface',
    repoId: 'ysdede/wav2vec2-base-960h-onnx',
    modelFilename: 'model.onnx',
    modelDataFilename: 'model.onnx.data',
    tokenizerFilename: 'vocab.json',
  },
});

// Speech recognition. `audio` is expected to be 16 kHz mono float32 PCM (see Input above).
const transcript = await model.transcribe(audio);

// Forced alignment (WhisperX-style)
const logits = await model.session.executor.extractLogits(audio);
const aligner = createWav2Vec2AlignerFromLogits(logits);
const alignment = aligner.align({ transcript: 'your transcript' });
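
The `audio` value in the snippet above has to match the input format listed under Model details: mono float32 samples at 16 kHz. As a rough sketch, assuming the source audio is 16-bit PCM that is already sampled at 16 kHz, the conversion could look like the following. The helper names pcm16ToFloat32 and downmixToMono are hypothetical and are not part of asr.js; resampling is out of scope here.

```javascript
// Convert 16-bit signed PCM samples to float32 in the range [-1, 1).
function pcm16ToFloat32(pcm) {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 32768;
  }
  return out;
}

// Average interleaved channels (e.g. L R L R ...) into a single mono channel.
function downmixToMono(interleaved, channels) {
  const frames = Math.floor(interleaved.length / channels);
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    let sum = 0;
    for (let c = 0; c < channels; c++) {
      sum += interleaved[i * channels + c];
    }
    mono[i] = sum / channels;
  }
  return mono;
}
```

For a stereo 16-bit recording, `downmixToMono(pcm16ToFloat32(samples), 2)` would yield a Float32Array in the expected layout. Whether asr.js applies further normalization internally is not stated in this card.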

Files

  • model.onnx — ONNX model graph (1.8 MB)
  • model.onnx.data — External weight data (360 MB)
  • config.json — Original HuggingFace model config
  • vocab.json — Character vocabulary with CTC blank token