ConMamba-small-es

Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Summary

ConMamba-small-es is an acoustic model for Automatic Speech Recognition (ASR) in Spanish. It is based on the ConMamba architecture, which uses a Mamba (State Space Model) encoder augmented with convolutions for efficient sequence processing.

Model Description

The ConMamba-small-es model implements the Convolution-augmented Mamba (ConMamba) architecture, an adaptation of State Space Models (SSMs) designed to improve performance and efficiency in speech recognition tasks by integrating convolutional layers.

This model was trained specifically for Spanish. The training corpus contains 2788 hours of transcribed audio from television and radio broadcasts.

Intended Uses and Limitations

This model is intended for Automatic Speech Recognition (ASR) in Spanish: it transcribes Spanish audio files to plain text without punctuation.
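Because the output carries no punctuation, reference transcripts usually need the same treatment before they are compared with it. The following is a minimal sketch of such a step. The function name `strip_punctuation` and the optional lowercasing are illustrative assumptions, not part of the model or its repository.

```python
import unicodedata


def strip_punctuation(text: str, lowercase: bool = False) -> str:
    """Replace Unicode punctuation with spaces and collapse whitespace.

    Accented letters and "ñ" are kept: they are letters, not punctuation.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # split() with no arguments collapses runs of whitespace.
    cleaned = " ".join(cleaned.split())
    return cleaned.lower() if lowercase else cleaned
```

For example, `strip_punctuation("¿Cómo estás, María?")` drops the question marks and the comma while keeping the accented letters. Note that hyphens are also treated as punctuation, so hyphenated compounds are split into separate words.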

How to Get Started with the Model

Installation

The ConMamba implementation depends on specific libraries such as mamba-ssm and causal-conv1d. It is recommended to follow the installation steps from the original Mamba ASR repository:

Create a virtual environment (mamba_asr, for example):

conda create --name mamba_asr python=3.9
conda activate mamba_asr

Install dependencies:

git clone https://github.com/langtech-bsc/ConMamba_ASR
cd ConMamba_ASR
pip install -r requirements.txt
# Make sure that the versions of torch, torchaudio, causal-conv1d, and mamba-ssm are compatible with your hardware.
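Before checking compatibility, it can help to see which versions actually ended up in the environment. The sketch below only reports installed versions and does not judge whether they are compatible. The package names are assumed to match the distribution names on PyPI, and `installed_versions` is a hypothetical helper, not something shipped with the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as assumed from PyPI; adjust if your setup differs.
PACKAGES = ("torch", "torchaudio", "causal-conv1d", "mamba-ssm")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Running this inside the activated environment lists each dependency with its version, which can then be compared against the CUDA and PyTorch versions available on your machine.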

For Inference

Inference is performed using the dedicated run_inference.py script provided within the repository.

Define Paths: Set the paths for the repository, the input audio, and the specific configuration file for inference.
Execute Inference: Run the script using the defined paths.

# Define your paths
REPO_PATH="/path/to/ConMamba_ASR"
AUDIO="/path/to/your/audio.wav"
HPARAMS="conmambamamba_debug_spanish_small_1k_unigram_inference.yaml" # Use your specific inference YAML

# Execute inference script
python "$REPO_PATH/run_inference.py" \
  --hparams "$HPARAMS" \
  --audio "$AUDIO"
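To transcribe several files, the same call can be repeated once per file. The sketch below only builds the command lines, using the `--hparams` and `--audio` flags shown above. It assumes the script handles one audio file per invocation. The helper names `build_inference_command` and `commands_for_directory` are hypothetical.

```python
import sys
from pathlib import Path


def build_inference_command(repo_path, hparams, audio):
    """Return the argument list for one run_inference.py call."""
    script = Path(repo_path) / "run_inference.py"
    return [
        sys.executable, str(script),
        "--hparams", str(hparams),
        "--audio", str(audio),
    ]


def commands_for_directory(repo_path, hparams, audio_dir):
    """Build one command per .wav file in audio_dir, in sorted order."""
    return [
        build_inference_command(repo_path, hparams, wav)
        for wav in sorted(Path(audio_dir).glob("*.wav"))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Since every call presumably reloads the model, this is convenient rather than fast.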

Dev Result - WER: 12.52
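WER figures like this one are conventionally percentages: the number of word substitutions, deletions and insertions divided by the number of reference words, times 100. The sketch below assumes that standard definition and plain whitespace tokenization. The scoring used to produce the reported number may apply extra text normalization, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

The function returns a fraction, so multiply by 100 to compare with the figure above. For a whole test set, the usual practice is to sum the edit counts and reference lengths across utterances rather than average per-utterance rates.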

Training Details

Training data

The model was trained on a total of 2788 hours of audio.

Training Hyperparameters

Training hours: 2788
language: Spanish
number_of_epochs: 110
batch_size: 30
ctc_weight: 0.6
grad_accumulation_factor: 1
max_grad_norm: 5.0
loss_reduction: 'batchmean'
sorting: random
num_workers: 8
precision: bf16
avg_checkpoints: 10
lr_adam: 0.001
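The ctc_weight entry suggests joint CTC/attention training. Under the common convention for such hybrid setups (assumed here, not confirmed by this card), the two losses are linearly interpolated, and batch_size with grad_accumulation_factor determines how many samples contribute to each optimizer step. A minimal sketch, with hypothetical function names:

```python
def joint_loss(ctc_loss: float, seq_loss: float, ctc_weight: float = 0.6) -> float:
    """Interpolate the CTC loss and the sequence-to-sequence loss."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * seq_loss


def samples_per_update(batch_size: int = 30, grad_accumulation_factor: int = 1) -> int:
    """Samples contributing to one optimizer step on a single device."""
    return batch_size * grad_accumulation_factor
```

With the values listed above, the CTC term would carry 60% of the total loss and each optimizer step would see 30 samples per device, since no gradient accumulation is used.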

Citation

If this model contributes to your research, please cite the work:

@inproceedings{zevallos2026conmambasmalles,
  title={Evaluating High-Performance and Lightweight ASR Systems for Spanish},
  author={Zevallos, Rodolfo},
  organization={Barcelona Supercomputing Center},
  year={2026}
}

Additional Information

Author

The model was trained in September 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Rodolfo Zevallos.

Contact

For further information, please send an email to bsc-lt@bsc.es.

Copyright

License

GPL-3.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.

The conversion of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.

Evaluation results

Test WER (self-reported, test set):

librispeech_es: 7.470
fleurs: 8.450
common voice v17: 11.930
