NOT A FINAL MODEL: TRAINING STILL IN PROGRESS

Nemotron Speech Streaming Spanish 0.6B

A streaming Spanish ASR model (627M parameters) based on NVIDIA's Nemotron architecture, fine-tuned from the English streaming checkpoint.

Model Details

Architecture: Fast Conformer RNNT (streaming)
Parameters: 627M
Base Model: nvidia/nemotron-speech-streaming-en-0.6b
Teacher Model: nvidia/parakeet-tdt-0.6b-v3
Language: Spanish (es)
Streaming Chunk: 1.12 s (14 frames × 80 ms)
Framework: NeMo

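The chunk size follows directly from the encoder's 80 ms output frame stride. A quick sanity check of the numbers above (a sketch, not part of the model API):

```python
# Streaming chunk arithmetic for the 1.12 s configuration.
FRAME_STRIDE_S = 0.08   # encoder output frame stride (80 ms)
CHUNK_FRAMES = 14       # 1 current frame + 13 right-context frames
SAMPLE_RATE = 16_000    # model expects 16 kHz audio

chunk_seconds = CHUNK_FRAMES * FRAME_STRIDE_S      # ~1.12 s
chunk_samples = int(chunk_seconds * SAMPLE_RATE)   # 17920 samples per chunk

print(chunk_seconds, chunk_samples)
```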
Performance

FLEURS Spanish (test): 8.44% WER

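WER is the word-level edit distance between hypothesis and reference, normalized by the reference length. A minimal self-contained sketch for illustration (the 8.44% above was not scored with this helper; NeMo's evaluation scripts or a library such as jiwer are typical):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] = edit distance between ref[:i] and hyp[:j] (single-row DP).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev_diag + cost)   # substitution / match
            prev_diag = cur
    return dp[-1] / max(len(ref), 1)
```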
Training Data

  • Multilingual LibriSpeech (MLS) — Spanish subset (~918h, read audiobook speech)
  • Common Voice — Spanish validated split (crowd-sourced read speech)
  • VoxPopuli — Spanish subset (European Parliament speech)
  • TEDx Spanish (ciempiess/tedx_spanish) — used in the refinement stage

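NeMo training and fine-tuning consume JSON-lines manifests with one utterance per line. A sketch of the format (the file paths and transcripts here are hypothetical placeholders, not entries from the actual training data):

```python
import json

# Hypothetical manifest entries; a real manifest points at actual audio files.
entries = [
    {"audio_filepath": "mls_es/audio/0001.wav", "duration": 4.2,
     "text": "buenos días a todos"},
    {"audio_filepath": "common_voice_es/clip_17.wav", "duration": 3.1,
     "text": "gracias por su atención"},
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```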
Usage

Batch Inference

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("nemotron-speech-streaming-es-0.6b.nemo")
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])

Streaming Inference (1.12s chunks)

import torch
import numpy as np
import soundfile as sf
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
from nemo.core.classes.common import typecheck
from omegaconf import open_dict

typecheck.set_typecheck_enabled(False)

# Load model
model = nemo_asr.models.ASRModel.restore_from("nemotron-speech-streaming-es-0.6b.nemo")
model.eval()

with open_dict(model.cfg):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(model.cfg.decoding)

# Setup streaming with 1.12s chunks (right_context=13)
RIGHT_CONTEXT = 13
chunk_frames = 1 + RIGHT_CONTEXT  # 14 frames
model.encoder.setup_streaming_params(
    chunk_size=chunk_frames,
    shift_size=chunk_frames,
    left_chunks=70 // chunk_frames,
)

# Load audio
audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Audio must be 16kHz"

# Initialize streaming state
dev = next(model.parameters()).device
cache_ch, cache_t, cache_ch_len = model.encoder.get_initial_cache_state(
    batch_size=1, dtype=torch.float32, device=dev
)
prev_hyps = None

# Process audio in streaming chunks
CHUNK_SAMPLES = int(chunk_frames * 0.08 * sr)  # 1.12s = 17920 samples
for start in range(0, len(audio), CHUNK_SAMPLES):
    chunk = audio[start : start + CHUNK_SAMPLES]
    if len(chunk) < CHUNK_SAMPLES:
        chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))

    # A fresh buffer per chunk computes features for that chunk only; the
    # encoder caches below carry streaming context across chunk boundaries.
    buffer = CacheAwareStreamingAudioBuffer(model=model)
    buffer.append_audio(chunk)

    for chunk_audio, chunk_len in buffer:
        with torch.no_grad():
            result = model.conformer_stream_step(
                processed_signal=chunk_audio,
                processed_signal_length=chunk_len,
                cache_last_channel=cache_ch,
                cache_last_time=cache_t,
                cache_last_channel_len=cache_ch_len,
                previous_hypotheses=prev_hyps,
                return_transcription=True,
            )
        # conformer_stream_step returns a tuple: indices 2-4 hold the updated
        # encoder caches, index 5 the running best hypotheses.
        cache_ch, cache_t, cache_ch_len = result[2], result[3], result[4]
        prev_hyps = result[5]

        if prev_hyps and prev_hyps[0].text:
            print(prev_hyps[0].text)
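
The script above asserts 16 kHz input. If your audio has a different sample rate, resample it first. A minimal linear-interpolation sketch with NumPy (for illustration only; a production pipeline should prefer a filtered resampler such as torchaudio.functional.resample or soxr):

```python
import numpy as np

def resample_linear(audio: np.ndarray, sr_in: int, sr_out: int = 16_000) -> np.ndarray:
    """Naive linear-interpolation resampler (no anti-aliasing filter)."""
    if sr_in == sr_out:
        return audio
    n_out = int(round(len(audio) * sr_out / sr_in))
    # Positions of the output samples on the input sample axis.
    x_out = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(x_out, np.arange(len(audio)), audio).astype(np.float32)
```

Note that downsampling without a low-pass filter aliases high frequencies, which can degrade recognition; this sketch is only adequate for quick experiments.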

Citation

If you use this model, please cite:

@misc{nemotron-streaming-es,
  title={Nemotron Speech Streaming Spanish 0.6B},
  year={2026},
  url={https://huggingface.co/nenad1002/Nemotron-Streaming-ES-ES}
}