	---
	library_name: pytorch
	tags:
	- chimera-ml
	- oragen
	- pytorch
	- audio
	- image
	- multimodal
	- age-estimation
	- gender-recognition
	- wav2vec2
	- vit
	datasets:
	- AGENDER
	- CommonVoice
	- TIMIT
	- LAGENDA
	- IMDB-Clean
	- AFEW
	- VoxCeleb2
	- BRAVE-MASKS
	base_model:
	- facebook/wav2vec2-large-robust
	- nateraw/vit-age-classifier
	---

	# ORAGEN Models

	This repository contains exported ORAGEN-based model weights for [`chimera-ml`](https://github.com/markitantov/chimera_ml/).

	These checkpoints are used for age estimation and gender recognition from speech, face images, and combined audio-visual inputs. In the `chimera-ml` ORAGEN pipeline, the multimodal model operates on intermediate audio and visual features extracted from the unimodal branches.

	## Files

	- `audio_model.pt` — audio-only checkpoint used for speech-based age estimation and gender recognition.
	- `image_model.pt` — image-only checkpoint used for face-based feature extraction and prediction in the ORAGEN pipeline.
	- `multimodal_model.pt` — audio-visual checkpoint that combines audio and image features for multimodal prediction.
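
As a rough illustration of how the three files relate, the sketch below picks the checkpoints needed for a given set of inputs. The file names come from the list above. The helper `required_checkpoints` and its selection logic are hypothetical: they assume that, because the multimodal model consumes features from the unimodal branches, audio-visual inference needs all three files. This is not part of the `chimera-ml` API.

```python
# File names are taken from the list above; everything else is illustrative.
CHECKPOINTS = {
    "audio": "audio_model.pt",
    "image": "image_model.pt",
    "multimodal": "multimodal_model.pt",
}


def required_checkpoints(has_audio: bool, has_image: bool) -> list[str]:
    """Return the checkpoint files needed for the available modalities.

    Assumption: the multimodal model works on features produced by the
    unimodal branches, so audio-visual inference needs all three files.
    """
    if has_audio and has_image:
        return [CHECKPOINTS["audio"], CHECKPOINTS["image"], CHECKPOINTS["multimodal"]]
    if has_audio:
        return [CHECKPOINTS["audio"]]
    if has_image:
        return [CHECKPOINTS["image"]]
    raise ValueError("at least one modality (audio or image) is required")


print(required_checkpoints(has_audio=True, has_image=True))
```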

	## What They Predict

	These models predict:

	- age in years (0 to 100)
	- gender (`female`, `male`)

	The ORAGEN codebase also contains support for mask-related prediction in some model variants, but the exported multimodal configuration used here has `include_mask: false`.

	## Training Setup

	According to the training configs in `examples/oragen/configs`:

	- Audio training uses `facebook/wav2vec2-large-robust` as the backbone.
	- The multimodal setup uses `agender_multimodal_model_v3`.
	- The visual branch serves as an image feature extractor in the fusion pipeline; the configs reference it together with ORAGEN visual weights based on `nateraw/vit-age-classifier`.
	- Training and inference use `16 kHz` audio, split into `4s` windows with a `2s` shift.
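
The windowing in the last item can be sketched as follows. The sample rate, window length, and shift are the values stated above. The function `window_bounds` and its handling of short clips and trailing remainders are assumptions made for illustration; the actual pipeline may pad or keep partial windows instead.

```python
SAMPLE_RATE = 16_000  # Hz, as stated above
WINDOW_SEC = 4.0      # window length in seconds, as stated above
SHIFT_SEC = 2.0       # shift between window starts, as stated above


def window_bounds(num_samples, sample_rate=SAMPLE_RATE,
                  window_sec=WINDOW_SEC, shift_sec=SHIFT_SEC):
    """Return (start, end) sample indices of sliding windows over a clip.

    Assumptions (not confirmed by the source): a clip shorter than one
    window yields a single window covering the whole clip, and a trailing
    remainder shorter than a full window is dropped.
    """
    win = int(window_sec * sample_rate)
    hop = int(shift_sec * sample_rate)
    if num_samples <= 0:
        return []
    if num_samples <= win:
        return [(0, num_samples)]
    return [(start, start + win)
            for start in range(0, num_samples - win + 1, hop)]


# A 10 s clip at 16 kHz has 160000 samples.
for start, end in window_bounds(10 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.1f}s to {end / SAMPLE_RATE:.1f}s")
```

With a 2 s shift and 4 s windows, consecutive windows overlap by half, so per-window predictions would typically need to be aggregated per clip; how that aggregation is done is not specified here.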

	Datasets referenced by the configs:

	- Audio: `AGENDER`, `CommonVoice`, `TIMIT`
	- Image: `LAGENDA`, `IMDB-Clean`, `AFEW`
	- Multimodal: `VoxCeleb2`, `BRAVE-MASKS`

	## Per-Corpus Results

	The training logs do not report raw accuracy directly. For gender prediction, the reported classification metrics are `gen_precision`, `gen_uar`, and `gen_macro_f1`. For age prediction, the reported regression metrics are `age_mae` and `age_pcc`.
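
For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of the standard definitions: mean absolute error (MAE), Pearson correlation coefficient (PCC), unweighted average recall (UAR, the mean of per-class recalls), and macro F1. The function names and toy data are illustrative, and this is not the `chimera-ml` implementation. `gen_precision` is left out because the averaging it uses is not stated here.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted ages."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def _counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = _counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp, fp, fn = _counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
ages_true, ages_pred = [20, 30, 40, 50], [22, 28, 43, 47]
gen_true = ["female", "female", "female", "male"]
gen_pred = ["female", "female", "male", "male"]
print(mae(ages_true, ages_pred), pcc(ages_true, ages_pred))
print(uar(gen_true, gen_pred), macro_f1(gen_true, gen_pred))
```

UAR weights every class equally regardless of how many samples it has, which is why it can differ from raw accuracy on imbalanced corpora.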

	## Results from the Original Paper

	### Audio Model

	| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
	|--------|---------|---------|---------------|--------------------|
	| AGENDER | 10.60 | 0.83 | 87.17 | 86.25 |
	| CommonVoice | 10.47 | 0.81 | 92.59 | 92.64 |
	| TIMIT | 6.90 | 0.91 | 98.60 | 98.58 |
	| VoxCeleb2 | 9.91 | 0.60 | 90.00 | 88.71 |
	| BRAVE-MASKS (test) | 11.89 | 0.64 | 86.22 | 85.18 |

	### Image Model

	| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
	|--------|---------|---------|---------------|--------------------|
	| LAGENDA | 5.18 | 0.95 | 92.89 | 92.90 |
	| AFEW | 5.62 | 0.82 | 95.16 | 94.98 |
	| IMDB-Clean (test) | 5.47 | 0.84 | 98.37 | 98.26 |
	| VoxCeleb2 | 5.97 | 0.64 | 98.37 | 98.16 |
	| BRAVE-MASKS (test) | 8.71 | 0.74 | 94.44 | 94.43 |

	### Multimodal Model (intermediate fusion)

	| Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
	|--------|---------|---------|---------------|--------------------|
	| VoxCeleb2 | 5.68 | 0.66 | 99.11 | 99.02 |
	| BRAVE-MASKS (test) | 8.73 | 0.74 | 94.95 | 94.89 |
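
To compare the three models on the two corpora they share, the short sketch below computes the relative change in age MAE when moving from a unimodal model to the multimodal one. The MAE values are copied from the tables in this section; the helper `relative_mae_reduction` is a hypothetical name introduced for illustration.

```python
# Age MAE values copied from the Audio, Image, and Multimodal tables.
AGE_MAE = {
    "VoxCeleb2": {"audio": 9.91, "image": 5.97, "multimodal": 5.68},
    "BRAVE-MASKS (test)": {"audio": 11.89, "image": 8.71, "multimodal": 8.73},
}


def relative_mae_reduction(baseline, candidate):
    """Percent by which `candidate` lowers MAE relative to `baseline`.

    A negative value means the candidate has a higher (worse) MAE.
    """
    return 100.0 * (baseline - candidate) / baseline


for corpus, scores in AGE_MAE.items():
    for branch in ("audio", "image"):
        gain = relative_mae_reduction(scores[branch], scores["multimodal"])
        print(f"{corpus}: multimodal vs {branch}: {gain:+.1f}% age MAE")
```

On these numbers, the multimodal model lowers age MAE considerably relative to the audio model on both corpora, while its age MAE is close to that of the image model (and marginally higher on BRAVE-MASKS).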


	## Related Publications

	Markitantov M., Ryumina E., Karpov A. Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention. Expert Systems with Applications, 2026, vol. 296, article 127473. https://doi.org/10.1016/j.eswa.2025.127473

	BibTeX:

	```bibtex
	@article{markitantov2026oragen,
	author = {Markitantov, Maxim and Ryumina, Elena and Karpov, Alexey},
	title = {Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention},
	journal = {Expert Systems with Applications},
	volume = {296},
	pages = {127473},
	year = {2026},
	month = jan,
	doi = {10.1016/j.eswa.2025.127473},
	url = {https://doi.org/10.1016/j.eswa.2025.127473}
	}
	```