How to use suryaumapathy2812/voxlm-2b with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="suryaumapathy2812/voxlm-2b", trust_remote_code=True)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("suryaumapathy2812/voxlm-2b", trust_remote_code=True, dtype="auto")

VoxLM is a modular Speech-to-Text system that combines audio encoders with Large Language Models for intelligent transcription with word-level timestamps and confidence scores.
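VoxLM-2B combines a frozen Whisper encoder (244M params), a trainable audio projection with downsampling, a Qwen2-1.5B LLM with LoRA adapters, and an alignment module that produces timestamps.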
Total parameters: ~1.86B (230M trainable via LoRA)
| Dataset | WER | EOS Generation Rate |
|---|---|---|
| LibriSpeech dev-clean | 6.69% | 99.9% |
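For readers unfamiliar with the metric in the table, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below illustrates that definition only; it is not the evaluation script behind the reported number. The function name `word_error_rate` is hypothetical, and it assumes whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of two reference words.
print(word_error_rate("hello world", "hello word"))  # 0.5
```

Published WER figures usually apply casing and punctuation normalization before scoring, so a raw comparison like this one may give a higher value on the same transcripts.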
# 1. Clone the repository
git clone https://github.com/suryaumapathy2812/voxlm.git
cd voxlm
# 2. Install dependencies
pip install -e .
# Or: uv sync
# 3. Download model
pip install huggingface_hub
python -c "from huggingface_hub import hf_hub_download; hf_hub_download('suryaumapathy2812/voxlm-2b', 'model.pt', local_dir='./models/voxlm-2b/')"
# 4. Run inference
import torch
import soundfile as sf
from src.model import VoxLM
# Load model
checkpoint = torch.load("models/voxlm-2b/model.pt", map_location="cuda", weights_only=False)
model = VoxLM(checkpoint["config"]).to("cuda")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()
# Load audio (the model expects 16 kHz input)
audio, sr = sf.read("audio.wav")
if sr != 16000:
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz")
audio_tensor = torch.from_numpy(audio).float().to("cuda")
# Transcribe
with torch.no_grad():
    result = model.transcribe(audio_tensor)
print(result["text"])
# Output: "hello world"
print(result["words"])
# Output: [
# {"word": "hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
# {"word": "world", "start": 0.5, "end": 1.0, "confidence": 0.95}
# ]
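The word-level output can be post-processed with plain Python. The sketch below reuses the sample values from the example output above and shows two possible uses: flagging words below a confidence threshold and rendering a simple timeline. The helper names and the `0.97` threshold are hypothetical choices for illustration, not part of VoxLM.

```python
# Sample data copied from the example output above.
words = [
    {"word": "hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
    {"word": "world", "start": 0.5, "end": 1.0, "confidence": 0.95},
]


def flag_uncertain(words, threshold=0.97):
    """Return the words whose confidence falls below `threshold`."""
    return [w["word"] for w in words if w["confidence"] < threshold]


def to_timeline(words):
    """Render one 'start-end word' string per word, times in seconds."""
    return [f"{w['start']:.2f}-{w['end']:.2f} {w['word']}" for w in words]


print(flag_uncertain(words))  # ['world']
print(to_timeline(words))     # ['0.00-0.50 hello', '0.50-1.00 world']
```

A suitable threshold depends on how the confidence scores are calibrated, which the example output alone does not tell us.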
# After HuggingFace integration is complete:
from transformers import AutoModel
import torch
import soundfile as sf
# Load model directly from HuggingFace
model = AutoModel.from_pretrained(
    "suryaumapathy2812/voxlm-2b",
    trust_remote_code=True,
).to("cuda")
# Load audio and transcribe
audio, sr = sf.read("audio.wav")
result = model.generate(torch.from_numpy(audio).float().to("cuda"))
print(result["text"])
print(result["words"])
Audio Input (16kHz)
         │
         ▼
┌─────────────────┐
│ Whisper Encoder │  (frozen)
│  (244M params)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Audio Projection│  (trainable)
│  + Downsampling │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Qwen2-1.5B LLM  │  (LoRA adapters)
│                 │
└────────┬────────┘
         │
         ├────────────────────┐
         ▼                    ▼
┌─────────────────┐  ┌─────────────────┐
│   Text Output   │  │ Alignment Module│
│ (transcription) │  │  (timestamps)   │
└─────────────────┘  └─────────────────┘
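The diagram names an "Audio Projection + Downsampling" stage but does not say how the downsampling is done. One common approach in speech-LLM adapters is frame stacking: concatenate every few consecutive encoder frames into one wider frame, then map it to the LLM embedding width with a learned linear layer. The sketch below shows only the stacking step on plain Python lists, as an assumption about what such a stage might look like. The function name `stack_frames`, the factor of 2, and the toy feature size are all hypothetical.

```python
def stack_frames(frames, factor):
    """Concatenate every `factor` consecutive frames into one wider frame.

    Trailing frames that do not fill a complete group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    stacked = []
    for start in range(0, usable, factor):
        merged = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        stacked.append(merged)
    return stacked


# Toy input: 6 encoder frames with 2 features each (illustrative sizes only).
frames = [[float(t), float(t) + 0.5] for t in range(6)]
stacked = stack_frames(frames, factor=2)
print(len(stacked), len(stacked[0]))  # 3 4
```

Stacking by a factor of k shortens the audio sequence the LLM has to attend over by k while widening each frame by the same factor, which is why a projection layer normally follows it.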
@misc{voxlm2026,
  title={VoxLM: Modular Speech-to-Text with LLM Intelligence},
  author={Surya Umapathy},
  year={2026},
  url={https://github.com/suryaumapathy2812/voxlm}
}
Apache 2.0