diffusion-gemma-asr-small

Audio-native, multilingual speech recognition that transcribes through DiffusionGemma's own discrete-diffusion decoder, rather than an autoregressive or external ASR decoder. Audio is projected directly into the Gemma embedding space, and the transcript is produced by parallel diffusion denoising (~8–16 steps). Decoding cost is therefore set by the number of denoising steps, not the length of the transcript, which gives faster-than-real-time throughput.
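To make the cost claim concrete, here is a minimal sketch that counts sequential decoder passes under a simplified model: an autoregressive decoder needs one pass per generated token, while a diffusion decoder needs one pass per denoising step regardless of transcript length. The function name `decoder_passes` is hypothetical, and the sketch ignores the fact that a single diffusion pass over the whole canvas costs more than a single autoregressive step.

```python
from typing import Optional


def decoder_passes(transcript_tokens: int, diffusion_steps: Optional[int] = None) -> int:
    """Count sequential decoder forward passes under a simplified cost model.

    Autoregressive decoding: one pass per generated token.
    Diffusion decoding: one pass per denoising step, independent of length.
    """
    if diffusion_steps is None:
        return transcript_tokens
    return diffusion_steps


# Compare the two regimes for short, medium, and canvas-length transcripts.
for n_tokens in (24, 96, 192):
    ar = decoder_passes(n_tokens)
    diff = decoder_passes(n_tokens, diffusion_steps=16)
    print(f"{n_tokens:>3} tokens: autoregressive={ar:>3} passes, diffusion={diff} passes")
```

The autoregressive count grows with the transcript while the diffusion count stays fixed at the step budget.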

This repo ships the trained adapter only (projector + LoRA, ~42M params, about 0.16% of the model). The frozen 26B DiffusionGemma backbone and the frozen whisper-small encoder load from their own repos.
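As a quick sanity check on those figures, the arithmetic can be sketched as below. The 4-bytes-per-parameter value is an assumption (fp32 storage); under it, the estimated checkpoint size lands close to the ~170 MB download noted in the usage snippet. The helper names are illustrative only.

```python
ADAPTER_PARAMS = 42e6    # projector + LoRA, from the description above
BACKBONE_PARAMS = 26e9   # frozen DiffusionGemma backbone


def adapter_fraction(adapter: float, backbone: float) -> float:
    """Share of the backbone's parameter count that the adapter represents."""
    return adapter / backbone


def size_mb(params: float, bytes_per_param: int = 4) -> float:
    """Approximate checkpoint size in MB, assuming fp32 (4 bytes per parameter)."""
    return params * bytes_per_param / 1e6


print(f"adapter share: {adapter_fraction(ADAPTER_PARAMS, BACKBONE_PARAMS):.2%}")
print(f"estimated size: {size_mb(ADAPTER_PARAMS):.0f} MB")
```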

How it works

raw audio ─► whisper-small encoder (frozen) ─► projector (trained, ~19M)
          ─► scatter into <audio> token slots of DiffusionGemma's encoder
          ─► DiffusionGemma decoder denoises a 192-token canvas (bidirectional, cross-attends audio)
          ─► transcript
  • Backbone: google/diffusiongemma-26B-A4B-it, frozen, with small LoRA adapters on encoder/decoder attention.
  • Audio frontend: openai/whisper-small encoder, used as a frozen feature extractor (NOT a decoder).
  • Grounding: trained with three losses, namely uniform-diffusion (the generator), an AR auxiliary, and a CTC loss on the projector via the frozen lm_head (the key unlock that makes the audio embeddings transcript-predictive).
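The "scatter into <audio> token slots" step in the diagram can be illustrated with a toy sketch. This is not the code in model.py: it uses plain Python lists instead of tensors, and `AUDIO_TOKEN_ID`, `scatter_audio`, and the 2-dimensional embeddings are hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import List

AUDIO_TOKEN_ID = -1  # hypothetical placeholder id standing in for the <audio> token


def scatter_audio(
    token_ids: List[int],
    text_embeds: List[List[float]],
    audio_embeds: List[List[float]],
) -> List[List[float]]:
    """Return a copy of text_embeds with audio frames written into the <audio> slots."""
    slots = [i for i, tok in enumerate(token_ids) if tok == AUDIO_TOKEN_ID]
    if len(slots) != len(audio_embeds):
        raise ValueError(f"{len(slots)} audio slots but {len(audio_embeds)} audio frames")
    mixed = [row[:] for row in text_embeds]   # copy so the input is left untouched
    for slot, frame in zip(slots, audio_embeds):
        mixed[slot] = list(frame)             # overwrite the placeholder embedding
    return mixed


# Toy prompt: one text token, two <audio> slots, one text token.
token_ids = [7, AUDIO_TOKEN_ID, AUDIO_TOKEN_ID, 9]
text_embeds = [[0.0, 0.0] for _ in token_ids]
audio_embeds = [[1.0, 2.0], [3.0, 4.0]]       # projector outputs, one per slot

mixed = scatter_audio(token_ids, text_embeds, audio_embeds)
print(mixed)
```

In a tensor implementation the same idea is usually expressed with a boolean mask over the token ids; the list version above only shows the bookkeeping.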

Usage

Install

pip install torch peft soundfile librosa huggingface_hub \
  "transformers @ git+https://github.com/huggingface/transformers.git"   # DiffusionGemma support

Transcribe in Python

import sys, soundfile as sf
from huggingface_hub import snapshot_download

repo = snapshot_download("interfaze-ai/diffusion-gemma-asr-small")   # this adapter (~170 MB)
sys.path.insert(0, repo)
from inference import load, transcribe                       # bundled in this repo

# Loads frozen DiffusionGemma-26B + whisper-small + this adapter (downloads bases on first run).
model, tok, fe = load(f"{repo}/diffusion_asr_small.pt", device="cuda")

wav, sr = sf.read("audio.wav", dtype="float32")   # 16 kHz mono float32 (inference.py resamples if needed)
print(transcribe(wav, model, tok, fe, max_steps=16))

Or from the command line

python inference.py audio.wav        # run inside the downloaded repo dir

Long audio is split at silence (the encoder has a 30 s window, like Whisper). max_steps trades speed for accuracy: 8 is near-best and fastest, 16 is the default.
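For readers curious what splitting at silence can look like, here is a minimal energy-based sketch. It is not the segmentation logic in inference.py, which may differ; `split_at_silence` and its parameters are hypothetical. The idea is to look in the second half of each 30 s window for the quietest short frame and cut there, so every chunk stays within the encoder limit.

```python
from typing import List, Sequence


def split_at_silence(
    samples: Sequence[float], sr: int, max_s: float = 30.0, frame_s: float = 0.02
) -> List[Sequence[float]]:
    """Greedily cut audio into chunks of at most max_s seconds at low-energy frames."""
    max_len = int(max_s * sr)
    frame = max(1, int(frame_s * sr))
    chunks = []
    start, n = 0, len(samples)
    while n - start > max_len:
        # Search the second half of the window for the quietest frame.
        lo = start + max_len // 2
        hi = start + max_len - frame
        best_cut, best_energy = start + max_len, float("inf")  # hard-cut fallback
        for pos in range(lo, hi + 1, frame):
            energy = sum(x * x for x in samples[pos:pos + frame])
            if energy < best_energy:
                best_cut, best_energy = pos + frame // 2, energy
        chunks.append(samples[start:best_cut])
        start = best_cut
    chunks.append(samples[start:])
    return chunks


# Toy signal at 100 Hz with a 1 s window: loud everywhere except two short pauses.
samples = [1.0] * 250
for i in list(range(80, 90)) + list(range(170, 180)):
    samples[i] = 0.0

chunks = split_at_silence(samples, sr=100, max_s=1.0)
print([len(c) for c in chunks])
```

Each chunk could then be passed to transcribe separately and the results joined in order.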

Languages & accuracy

Trained on FLEURS (6 languages) + LibriSpeech (en) + VoxPopuli (en/de/fr/es). WER/CER are Whisper-normalized (Open-ASR / Artificial-Analysis convention), 16 diffusion steps:

benchmark                    metric  score
LibriSpeech test-clean (en)  WER     6.6%
FLEURS English               WER     15.7%
VoxPopuli English            WER     18.5%
FLEURS Hindi                 CER     15.8%
FLEURS Mandarin              CER     29.6%
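WER and CER are both edit-distance ratios: the Levenshtein distance between reference and hypothesis, divided by the reference length in words or characters. The sketch below shows the computation. Its `normalize` function is a deliberately simplified stand-in (lowercasing and ASCII punctuation removal), not the Whisper normalizer used for the scores above, and removing spaces before computing CER is an assumption, since conventions vary.

```python
import string
from typing import Sequence

# Strip ASCII punctuation but keep apostrophes so contractions survive.
_PUNCT_TABLE = str.maketrans({c: " " for c in string.punctuation if c != "'"})


def normalize(text: str) -> str:
    """Simplified normalizer (NOT the Whisper normalizer): lowercase, drop punctuation."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) via a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words, hyp_words = normalize(ref).split(), normalize(hyp).split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))


def cer(ref: str, hyp: str) -> float:
    """Character error rate, computed here with spaces removed (an assumption)."""
    r = normalize(ref).replace(" ", "")
    h = normalize(hyp).replace(" ", "")
    return edit_distance(r, h) / max(1, len(r))


print(f"WER: {wer('The cat sat on the mat.', 'the cat sat on a mat'):.3f}")
print(f"CER: {cer('abc', 'abd'):.3f}")
```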

Among diffusion / non-autoregressive ASR models it leads (6.6% WER on LibriSpeech vs Whisfusion's 8.3%, with a smaller encoder). It trails autoregressive Whisper, a gap attributed to training data (~219 h seen) rather than to the architecture.

Files

  • diffusion_asr_small.pt: trained adapter ({"projector": ..., "lora": ...})
  • model.py, audio.py: model definition (self-contained)
  • inference.py: runnable example (load + segment + transcribe)
  • requirements.txt

Requirements / licensing

  • Needs transformers from main (DiffusionGemma support) + torch, peft.
  • Base models load from their own repos under their licenses: google/diffusiongemma-26B-A4B-it (Gemma terms) and openai/whisper-small (MIT).
  • This adapter: Apache-2.0.

Limitations

  • Per-segment window is ≤30 s (encoder limit); long audio is chunked at silence, same as Whisper.
  • Mandarin is the weakest language; more data is the lever.