ORAGEN Models

This repository contains exported ORAGEN-based model weights for chimera-ml.

These checkpoints are used for age estimation and gender recognition from speech, face images, and combined audio-visual inputs. In the chimera-ml ORAGEN pipeline, the multimodal model operates on intermediate audio and visual features extracted from the unimodal branches.

Files

audio_model.pt — audio-only checkpoint used for speech-based age estimation and gender recognition.
image_model.pt — image-only checkpoint used for face-based feature extraction and prediction in the ORAGEN pipeline.
multimodal_model.pt — audio-visual checkpoint that combines audio and image features for multimodal prediction.
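
After downloading the repository, a quick check that all three files are present can save a confusing failure later. The following is a minimal sketch using only the standard library. The function name `locate_checkpoints` and the modality keys are illustrative and not part of chimera-ml, and the sketch makes no assumption about how the `.pt` files are serialized or loaded.

```python
from pathlib import Path

# File names come from the list above; the dictionary keys are illustrative labels.
CHECKPOINTS = {
    "audio": "audio_model.pt",
    "image": "image_model.pt",
    "multimodal": "multimodal_model.pt",
}


def locate_checkpoints(repo_dir):
    """Return ({modality: path} for files that exist, [missing file names])."""
    root = Path(repo_dir)
    found, missing = {}, []
    for modality, filename in CHECKPOINTS.items():
        path = root / filename
        if path.is_file():
            found[modality] = path
        else:
            missing.append(filename)
    return found, missing
```

Calling `locate_checkpoints(".")` from the repository root should return all three paths and an empty `missing` list.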

What They Predict

These models predict:

age (0-100)
gender (female, male)

The ORAGEN codebase also contains support for mask-related prediction in some model variants, but the exported multimodal configuration used here has include_mask: false.

Training Setup

According to the training configs in examples/oragen/configs:

Audio training uses facebook/wav2vec2-large-robust as the backbone.
The multimodal setup uses agender_multimodal_model_v3.
The visual branch acts as an image feature extractor in the fusion pipeline; the configs reference it together with ORAGEN visual weights based on nateraw/vit-age-classifier.
Training and inference use 16 kHz audio, split into 4 s windows with a 2 s shift.
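
To make the windowing concrete, the sketch below computes the sample ranges that 4 s windows with a 2 s shift would cover at 16 kHz. It is an illustration under stated assumptions, not the chimera-ml implementation: the function name `window_bounds` is hypothetical, and dropping a trailing segment shorter than one window is an assumption (the actual pipeline may pad or keep it).

```python
SAMPLE_RATE = 16_000   # Hz, from the training configs
WINDOW_SEC = 4.0       # window length in seconds
SHIFT_SEC = 2.0        # shift between consecutive window starts


def window_bounds(num_samples, sample_rate=SAMPLE_RATE,
                  window_sec=WINDOW_SEC, shift_sec=SHIFT_SEC):
    """Return (start, end) sample indices of each full window.

    Assumption: a trailing segment shorter than one window is dropped.
    """
    window = int(round(window_sec * sample_rate))
    shift = int(round(shift_sec * sample_rate))
    bounds = []
    start = 0
    while start + window <= num_samples:
        bounds.append((start, start + window))
        start += shift
    return bounds
```

With these settings each window holds 64,000 samples and consecutive windows overlap by half, so a 10 s clip (160,000 samples) yields four windows starting at 0, 2, 4, and 6 seconds.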

Datasets referenced by the configs:

Audio: AGENDER, CommonVoice, TIMIT
Image: LAGENDA, IMDB-Clean, AFEW
Multimodal: VoxCeleb2, BRAVE-MASKS

Per-Corpus Results

The training logs do not report raw accuracy directly. For gender prediction, the reported classification metrics are gen_precision, gen_uar, and gen_macro_f1. For age prediction, the reported regression metrics are age_mae and age_pcc.
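
For readers unfamiliar with these metrics, the sketch below implements the textbook definitions of UAR (unweighted average recall), macro F1, MAE, and PCC (Pearson correlation coefficient) in plain Python. The function names are illustrative, and the repository's own metric code may differ in averaging details or edge-case handling; `gen_precision` is not covered here.

```python
import math


def _per_class_counts(y_true, y_pred, label):
    """Count true positives, false positives, false negatives for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recall over classes in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = _per_class_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp, fp, fn = _per_class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

Because UAR and macro F1 weight each class equally, they are less sensitive to class imbalance than raw accuracy: with two female and four male samples where one female sample is misclassified, accuracy is 5/6 while UAR is 0.75.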

Results from the original paper

Audio Model

Corpus	Age MAE	Age PCC	Gender UAR, %	Gender Macro F1, %
AGENDER	10.60	0.83	87.17	86.25
CommonVoice	10.47	0.81	92.59	92.64
TIMIT	6.90	0.91	98.60	98.58
VoxCeleb2	9.91	0.60	90.00	88.71
BRAVE-MASKS (test)	11.89	0.64	86.22	85.18

Image Model

Corpus	Age MAE	Age PCC	Gender UAR, %	Gender Macro F1, %
LAGENDA	5.18	0.95	92.89	92.90
AFEW	5.62	0.82	95.16	94.98
IMDB-Clean (test)	5.47	0.84	98.37	98.26
VoxCeleb2	5.97	0.64	98.37	98.16
BRAVE-MASKS (test)	8.71	0.74	94.44	94.43

Multimodal Model (intermediate fusion)

Corpus	Age MAE	Age PCC	Gender UAR, %	Gender Macro F1, %
VoxCeleb2	5.68	0.66	99.11	99.02
BRAVE-MASKS (test)	8.73	0.74	94.95	94.89
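
VoxCeleb2 is the one corpus on which all three tables report age MAE at full scale, so it allows a direct comparison of the branches. The short sketch below computes the relative reduction in age MAE of the multimodal model against each unimodal model; the helper name `relative_reduction` is illustrative, and the numbers are copied from the tables above.

```python
# Age MAE on VoxCeleb2, copied from the three tables above.
VOXCELEB2_AGE_MAE = {"audio": 9.91, "image": 5.97, "multimodal": 5.68}


def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


for branch in ("audio", "image"):
    gain = relative_reduction(VOXCELEB2_AGE_MAE[branch],
                              VOXCELEB2_AGE_MAE["multimodal"])
    print(f"multimodal vs {branch}: {gain:.1f}% lower age MAE")
```

On these figures the multimodal model lowers age MAE by roughly 43% relative to the audio model and roughly 5% relative to the image model.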

Related Publications

Markitantov M., Ryumina E., Karpov A. Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention. Expert Systems with Applications, 2026, vol. 296, 127473. https://doi.org/10.1016/j.eswa.2025.127473

BibTeX:

@article{markitantov2026oragen,
  author = {Markitantov, Maxim and Ryumina, Elena and Karpov, Alexey},
  title = {Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention},
  journal = {Expert Systems with Applications},
  volume = {296},
  pages = {127473},
  year = {2026},
  month = jan,
  doi = {10.1016/j.eswa.2025.127473},
  url = {https://doi.org/10.1016/j.eswa.2025.127473}
}
