ConMamba-small-es
Summary
ConMamba-small-es is an acoustic model for Automatic Speech Recognition (ASR) in Spanish. It is based on the ConMamba architecture, which uses a Mamba (State Space Model) encoder augmented with convolutions for efficient sequence processing.
Model Description
The ConMamba-small-es model implements the Convolution-augmented Mamba (ConMamba) architecture, an adaptation of State Space Models (SSMs) designed to improve performance and efficiency in speech recognition tasks by integrating convolutional layers.
This model was trained specifically for Spanish. The training corpus comprises 2788 hours of transcribed audio from television and radio broadcasts.
Intended Uses and Limitations
This model can be used for Automatic Speech Recognition (ASR) in Spanish. The model is intended to transcribe audio files in Spanish to plain text without punctuation.
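Because the model outputs plain text without punctuation, reference transcripts usually need the same treatment before they are compared with its output. The following is a minimal sketch of such a normalization step. The function name `normalize_transcript` is hypothetical, and lowercasing is an assumption that the model card does not confirm.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace.

    Accented letters and "ñ" are kept because they are letters, not punctuation.
    Lowercasing is an assumption; the model card only says "without punctuation".
    """
    kept = []
    for ch in text.lower():
        # Unicode categories starting with "P" cover punctuation such as ¿ ? ¡ ! , .
        if unicodedata.category(ch).startswith("P"):
            kept.append(" ")
        else:
            kept.append(ch)
    # split() with no arguments collapses any run of whitespace
    return " ".join("".join(kept).split())


print(normalize_transcript("¿Qué tal, señor López?"))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated or comma-joined words apart; whether that matches the tokenization used in training is not stated in this card.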
How to Get Started with the Model
Installation
The implementation of the ConMamba architecture often depends on specific libraries such as mamba-ssm and causal-conv1d. It is recommended to follow the installation steps from the original Mamba ASR repository:
- Create a virtual environment (mamba_asr, for example):
conda create --name mamba_asr python=3.9
conda activate mamba_asr
- Install dependencies:
git clone https://github.com/langtech-bsc/ConMamba_ASR
cd ConMamba_ASR
pip install -r requirements.txt
# Make sure that the versions of torch, torchaudio, causal-conv1d, and mamba-ssm are compatible with your hardware.
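After installation, a quick check that the key packages can be located in the active environment may save time before running inference. The sketch below assumes the usual import names (`mamba_ssm` and `causal_conv1d` for the pip packages mamba-ssm and causal-conv1d); `find_missing` is a hypothetical helper. It only confirms that the modules are present, not that their versions match your hardware.

```python
import importlib.util

# Import names assumed for the packages mentioned in the installation steps
# (the pip names mamba-ssm and causal-conv1d use underscores when imported).
REQUIRED_MODULES = ["torch", "torchaudio", "causal_conv1d", "mamba_ssm"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```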
For Inference
Inference is performed using the dedicated run_inference.py script provided within the repository.
- Define Paths: Set the paths for the repository, the input audio, and the specific configuration file for inference.
- Execute Inference: Run the script using the defined paths.
# Define your paths
REPO_PATH="/path/to/ConMamba_ASR"
AUDIO="/path/to/your/audio.wav"
HPARAMS="conmambamamba_debug_spanish_small_1k_unigram_inference.yaml" # Use your specific inference YAML
# Execute inference script
python "$REPO_PATH/run_inference.py" \
--hparams "$HPARAMS" \
--audio "$AUDIO"
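To transcribe several files, the same command can be driven from Python. This is a minimal sketch that assumes `run_inference.py` accepts exactly the `--hparams` and `--audio` arguments shown above and processes one file per call; `build_command` and `transcribe_folder` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(repo_path, hparams, audio):
    """Build the same command line as the shell example, as an argument list."""
    script = Path(repo_path) / "run_inference.py"
    return [sys.executable, str(script), "--hparams", str(hparams), "--audio", str(audio)]


def transcribe_folder(repo_path, hparams, audio_dir):
    """Run the inference script once per .wav file in audio_dir (hypothetical helper)."""
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        # check=True raises CalledProcessError if the script exits with an error
        subprocess.run(build_command(repo_path, hparams, wav), check=True)
```

Each call starts a new process and therefore reloads the model, so this approach is convenient for a handful of files rather than efficient for large batches.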
Dev Result - WER: 12.52
Training Details
Training data
The model was trained on a total of 2788 hours of audio, including:
- Fisher Spanish
- VoxPopuli
- Heroico
- LibriVox Spanish
- Wikipedia Spanish
- VoxForge Spanish
- Tele con ciencia
- Common Voice 17 (es)
Training Hyperparameters
- Training hours: 2788
- language: Spanish
- number_of_epochs: 110
- batch_size: 30
- ctc_weight: 0.6
- grad_accumulation_factor: 1
- max_grad_norm: 5.0
- loss_reduction: 'batchmean'
- sorting: random
- num_workers: 8
- precision: bf16
- avg_checkpoints: 10
- lr_adam: 0.001
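As a rough illustration of how two of these values are typically used, the sketch below assumes the common joint CTC/attention objective, in which `ctc_weight` interpolates between a CTC loss and an attention (sequence-to-sequence) loss. The model card does not state the exact formula, so treat it as an assumption; the function names are hypothetical.

```python
CTC_WEIGHT = 0.6        # ctc_weight from the list above
BATCH_SIZE = 30         # batch_size
GRAD_ACCUMULATION = 1   # grad_accumulation_factor


def joint_loss(ctc_loss, attention_loss, ctc_weight=CTC_WEIGHT):
    """Interpolate the two losses (assumed formula, not confirmed by the model card)."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


def effective_batch_size(batch_size=BATCH_SIZE, accumulation=GRAD_ACCUMULATION):
    """Number of examples contributing to each optimizer step on one device."""
    return batch_size * accumulation


print(joint_loss(2.0, 1.0))
print(effective_batch_size())
```

With a gradient accumulation factor of 1, every batch of 30 examples triggers its own optimizer step.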
Citation
If this model contributes to your research, please cite the work:
@inproceedings{zevallos2026conmambasmalles,
title={Evaluating High-Performance and Lightweight ASR Systems for Spanish},
author={Zevallos, Rodolfo},
organization={Barcelona Supercomputing Center},
year={2026}
}
Additional Information
Author
The model was trained in September 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Rodolfo Zevallos.
Contact
For further information, please send an email to bsc-lt@bsc.es.
Copyright
Copyright (c) 2026 by Language Technologies Laboratory, Barcelona Supercomputing Center.
License
Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.
The conversion of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.
Evaluation results
- Test WER on librispeech_es (test set, self-reported): 7.470
- Test WER on fleurs (test set, self-reported): 8.450
- Test WER on common voice v17 (test set, self-reported): 11.930
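For readers who want to compute a comparable score on their own data, word error rate (WER) is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain reimplementation of that standard definition, not the evaluation script used for the figures above. It assumes that scores are reported as percentages and that both strings are already normalized in the same way.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("el gato negro", "el gato"))
```

Corpus-level WER is normally obtained by summing errors and reference words over all utterances before dividing, rather than averaging per-utterance scores.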