> **Note:** This is not a final model; training is still in progress.
# Nemotron Speech Streaming Spanish 0.6B
A streaming Spanish ASR model (627M parameters) based on NVIDIA's Nemotron architecture, fine-tuned from the English streaming checkpoint.
## Model Details
| Property | Value |
|---|---|
| Architecture | Fast Conformer RNNT (streaming) |
| Parameters | 627M |
| Base Model | nvidia/nemotron-speech-streaming-en-0.6b |
| Teacher Model | nvidia/parakeet-tdt-0.6b-v3 |
| Language | Spanish (es) |
| Streaming Chunk | 1.12 s (14 frames × 80 ms) |
| Framework | NeMo |
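The streaming chunk size follows directly from the encoder's 80 ms frame stride; a quick sanity check of the arithmetic (constants taken from the table above):

```python
# Chunk geometry for the streaming encoder
FRAME_STRIDE_S = 0.08   # 80 ms per encoder frame
CHUNK_FRAMES = 14       # 1 current frame + 13 right-context frames
SAMPLE_RATE = 16000     # the model expects 16 kHz audio

chunk_seconds = CHUNK_FRAMES * FRAME_STRIDE_S   # 1.12 s per streaming chunk
chunk_samples = int(chunk_seconds * SAMPLE_RATE)

print(chunk_samples)  # 17920 samples per chunk
```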
## Performance
| Benchmark | WER (%) |
|---|---|
| FLEURS Spanish (test) | 8.44 |
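For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal sketch of the metric (not the evaluation script used to produce the number above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if words match)
            ))
        prev = curr
    return prev[-1] / len(ref)

print(wer("hola como estas", "hola como esta"))  # 1 substitution / 3 words ≈ 0.333
```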
## Training Data
- Multilingual LibriSpeech (MLS) — Spanish subset (~918h, read audiobook speech)
- Common Voice — Spanish validated split (crowd-sourced read speech)
- VoxPopuli — Spanish subset (European Parliament speech)
- TEDx Spanish (ciempiess/tedx_spanish) — used for the refinement stage
## Usage
### Batch Inference
```python
import nemo.collections.asr as nemo_asr

# Load the .nemo checkpoint and transcribe a 16 kHz mono WAV file
model = nemo_asr.models.ASRModel.restore_from("nemotron-speech-streaming-es-0.6b.nemo")
transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```
### Streaming Inference (1.12 s chunks)
```python
import torch
import numpy as np
import soundfile as sf
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
from nemo.core.classes.common import typecheck
from omegaconf import open_dict

typecheck.set_typecheck_enabled(False)

# Load model
model = nemo_asr.models.ASRModel.restore_from("nemotron-speech-streaming-es-0.6b.nemo")
model.eval()
with open_dict(model.cfg):
    model.cfg.decoding.greedy.use_cuda_graph_decoder = False
model.change_decoding_strategy(model.cfg.decoding)

# Set up streaming with 1.12 s chunks (right_context=13)
RIGHT_CONTEXT = 13
chunk_frames = 1 + RIGHT_CONTEXT  # 14 frames
model.encoder.setup_streaming_params(
    chunk_size=chunk_frames,
    shift_size=chunk_frames,
    left_chunks=70 // chunk_frames,
)

# Load audio
audio, sr = sf.read("audio.wav", dtype="float32")
assert sr == 16000, "Audio must be 16 kHz"

# Initialize streaming state
dev = next(model.parameters()).device
cache_ch, cache_t, cache_ch_len = model.encoder.get_initial_cache_state(
    batch_size=1, dtype=torch.float32, device=dev
)
prev_hyps = None

# Process audio in streaming chunks
CHUNK_SAMPLES = int(chunk_frames * 0.08 * sr)  # 1.12 s = 17920 samples
for start in range(0, len(audio), CHUNK_SAMPLES):
    chunk = audio[start : start + CHUNK_SAMPLES]
    if len(chunk) < CHUNK_SAMPLES:
        chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
    buffer = CacheAwareStreamingAudioBuffer(model=model)
    buffer.append_audio(chunk)
    for chunk_audio, chunk_len in buffer:
        with torch.no_grad():
            result = model.conformer_stream_step(
                processed_signal=chunk_audio,
                processed_signal_length=chunk_len,
                cache_last_channel=cache_ch,
                cache_last_time=cache_t,
                cache_last_channel_len=cache_ch_len,
                previous_hypotheses=prev_hyps,
                return_transcription=True,
            )
        cache_ch, cache_t, cache_ch_len = result[2], result[3], result[4]
        prev_hyps = result[5]
        if prev_hyps and prev_hyps[0].text:
            print(prev_hyps[0].text)
```
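The script above asserts 16 kHz input. If your audio is at a different rate, resample it first. A rough linear-interpolation resampler in NumPy is sketched below; for real use, prefer a proper polyphase resampler such as `scipy.signal.resample_poly` or `torchaudio`'s resampling transforms:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Crude linear-interpolation resampling; fine for a quick demo, not production."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    # Express both sample grids in seconds and interpolate onto the output grid
    t_in = np.arange(len(audio)) / orig_sr
    t_out = np.arange(n_out) / target_sr
    return np.interp(t_out, t_in, audio).astype(np.float32)

# Usage before the streaming loop, e.g.:
#   audio, sr = sf.read("audio_44k.wav", dtype="float32")
#   audio, sr = resample_linear(audio, sr), 16000
```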
## Citation
If you use this model, please cite:
```bibtex
@misc{nemotron-streaming-es,
  title={Nemotron Speech Streaming Spanish 0.6B},
  year={2026},
  url={https://huggingface.co/nenad1002/Nemotron-Streaming-ES-ES}
}
```