ORAGEN Models

This repository contains exported ORAGEN-based model weights for chimera-ml.

These checkpoints are used for age estimation and gender recognition from speech, face images, and combined audio-visual inputs. In the chimera-ml ORAGEN pipeline, the multimodal model operates on intermediate audio and visual features extracted from the unimodal branches.

Files

  • audio_model.pt: audio-only checkpoint used for speech-based age estimation and gender recognition.
  • image_model.pt: image-only checkpoint used for face-based feature extraction and prediction in the ORAGEN pipeline.
  • multimodal_model.pt: audio-visual checkpoint that combines audio and image features for multimodal prediction.

What They Predict

These models predict:

  • age (0-100)
  • gender (female, male)

The ORAGEN codebase also contains support for mask-related prediction in some model variants, but the exported multimodal configuration used here has include_mask: false.
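As an illustration of the two outputs listed above, the following is a minimal post-processing sketch. It assumes the model emits one raw age value already expressed in years and two gender logits; the function name `decode_prediction`, the index order of `GENDER_LABELS`, and the clamping step are hypothetical and are not taken from the ORAGEN code, which may scale or order its outputs differently.

```python
import math
from typing import Sequence, Tuple

# Assumption: index 0 is "female" and index 1 is "male".
# Check the ORAGEN configs for the real label order before relying on this.
GENDER_LABELS = ("female", "male")
AGE_MIN, AGE_MAX = 0.0, 100.0  # age range stated in this model card


def decode_prediction(age_raw: float,
                      gender_logits: Sequence[float]) -> Tuple[float, str, float]:
    """Turn raw outputs into (age, gender label, gender probability)."""
    # Clamp the regression output to the documented 0-100 range.
    age = min(max(age_raw, AGE_MIN), AGE_MAX)

    # Numerically stable softmax over the gender logits.
    peak = max(gender_logits)
    exps = [math.exp(x - peak) for x in gender_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pick the most probable class.
    best = max(range(len(probs)), key=probs.__getitem__)
    return age, GENDER_LABELS[best], probs[best]


if __name__ == "__main__":
    print(decode_prediction(104.2, [0.1, 2.3]))
```

If the real model returns a normalized age (for example in the range 0 to 1), the clamp would need to be replaced by the matching rescaling.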

Training Setup

According to the training configs in examples/oragen/configs:

  • Audio training uses facebook/wav2vec2-large-robust as the backbone.
  • The multimodal setup uses agender_multimodal_model_v3.
  • The visual branch serves as an image feature extractor in the fusion pipeline; the configs reference ORAGEN visual weights based on nateraw/vit-age-classifier.
  • Training and inference use 16 kHz audio, split into 4 s windows with a 2 s shift.
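The windowing described in the last item can be sketched as follows. This is a minimal illustration of 4 s windows with a 2 s shift at 16 kHz, not the chimera-ml implementation; the function name `window_bounds` is hypothetical, and keeping a shorter final window (rather than padding or dropping it) is an assumption.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Hz, as stated in the configs
WINDOW_SECONDS = 4.0   # window length
SHIFT_SECONDS = 2.0    # shift between consecutive windows


def window_bounds(num_samples: int,
                  sample_rate: int = SAMPLE_RATE,
                  window_s: float = WINDOW_SECONDS,
                  shift_s: float = SHIFT_SECONDS) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows.

    Assumption: a trailing remainder shorter than one window is kept as a
    final, shorter window. The real pipeline may pad or drop it instead.
    """
    if num_samples <= 0:
        return []
    win = int(round(window_s * sample_rate))
    hop = int(round(shift_s * sample_rate))

    bounds = []
    start = 0
    while True:
        end = min(start + win, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # reached the end of the signal
            break
        start += hop
    return bounds


if __name__ == "__main__":
    # A 10 s clip at 16 kHz has 160,000 samples.
    for start, end in window_bounds(160_000):
        print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

With these settings, consecutive windows overlap by 2 s, so a 10 s clip yields four full windows.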

Datasets referenced by the configs:

  • Audio: AGENDER, CommonVoice, TIMIT
  • Image: LAGENDA, IMDB-Clean, AFEW
  • Multimodal: VoxCeleb2, BRAVE-MASKS

Per-Corpus Results

The training logs do not report raw accuracy directly. For gender prediction, the reported classification metrics are gen_precision, gen_uar, and gen_macro_f1. For age prediction, the reported regression metrics are age_mae and age_pcc.
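For readers unfamiliar with these metrics, the sketch below shows the standard definitions of UAR (unweighted average recall, i.e. macro-averaged recall), macro F1, MAE, and PCC in plain Python. It assumes the logged values follow these textbook definitions; the function names are illustrative and are not the ones used in the chimera-ml code.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Pearson correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def _counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def uar(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted average recall: mean of per-class recall."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        tp, _, fn = _counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(labels)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Macro F1: mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    scores = []
    for label in labels:
        tp, fp, fn = _counts(y_true, y_pred, label)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(labels)


if __name__ == "__main__":
    truth = ["female", "female", "female", "male"]
    preds = ["female", "female", "male", "male"]
    print(uar(truth, preds), macro_f1(truth, preds))
```

Because UAR and macro F1 average over classes rather than samples, they are less sensitive to class imbalance than raw accuracy.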

Results from the original paper

Audio Model

| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
|---|---|---|---|---|
| AGENDER | 10.60 | 0.83 | 87.17 | 86.25 |
| CommonVoice | 10.47 | 0.81 | 92.59 | 92.64 |
| TIMIT | 6.90 | 0.91 | 98.60 | 98.58 |
| VoxCeleb2 | 9.91 | 0.60 | 90.00 | 88.71 |
| BRAVE-MASKS (test) | 11.89 | 0.64 | 86.22 | 85.18 |

Image Model

| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
|---|---|---|---|---|
| LAGENDA | 5.18 | 0.95 | 92.89 | 92.90 |
| AFEW | 5.62 | 0.82 | 95.16 | 94.98 |
| IMDB-Clean (test) | 5.47 | 0.84 | 98.37 | 98.26 |
| VoxCeleb2 | 5.97 | 0.64 | 98.37 | 98.16 |
| BRAVE-MASKS (test) | 8.71 | 0.74 | 94.44 | 94.43 |

Multimodal Model (intermediate fusion)

| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
|---|---|---|---|---|
| VoxCeleb2 | 5.68 | 0.66 | 99.11 | 99.02 |
| BRAVE-MASKS (test) | 8.73 | 0.74 | 94.95 | 94.89 |

Related Publications

Markitantov M., Ryumina E., Karpov A. Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention. // Expert Systems with Applications. 2026. vol. 296. ID 127473. https://doi.org/10.1016/j.eswa.2025.127473

BibTeX:

@article{markitantov2026oragen,
  author = {Markitantov, Maxim and Ryumina, Elena and Karpov, Alexey},
  title = {Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention},
  journal = {Expert Systems with Applications},
  volume = {296},
  pages = {127473},
  year = {2026},
  month = jan,
  doi = {10.1016/j.eswa.2025.127473},
  url = {https://doi.org/10.1016/j.eswa.2025.127473}
}