markitantov committed · Commit 22b73cd · 1 Parent(s): a84f7a9

Updated readme

Files changed (1)
  1. README.md +104 -12
README.md CHANGED
@@ -1,24 +1,116 @@
  # ORAGEN Models

- This directory contains exported ORAGEN-based model weights for [`chimera-ml`](https://github.com/markitantov/chimera_ml).

- ORAGEN is an audio-visual modeling approach designed for age estimation and gender recognition from human speech and face data. These models are intended for research and inference pipelines built on top of `chimera-ml`.

  ## Files

- - `audio_model.pt`: audio-only ORAGEN model for prediction from speech.
- - `image_model.pt`: image-only ORAGEN model for prediction from face images.
- - `multimodal_model.pt`: audio-visual ORAGEN model that combines both modalities.

  ## What They Predict

- These models are used to predict:
- - age
- - gender [female, male]

- In the broader ORAGEN setup, multimodal models may also be used for protective mask-related prediction, depending on the training configuration.

- ---
- license: apache-2.0
- ---
+ ---
+ library_name: pytorch
+ tags:
+ - chimera-ml
+ - oragen
+ - pytorch
+ - audio
+ - image
+ - multimodal
+ - age-estimation
+ - gender-recognition
+ - wav2vec2
+ - vit
+ datasets:
+ - AGENDER
+ - CommonVoice
+ - TIMIT
+ - LAGENDA
+ - IMDB-clean
+ - AFEW
+ - VoxCeleb2
+ - BRAVE-MASKS
+ base_model:
+ - facebook/wav2vec2-large-robust
+ - nateraw/vit-age-classifier
+ ---
+
  # ORAGEN Models

+ This repository contains exported ORAGEN-based model weights for [`chimera-ml`](https://github.com/markitantov/chimera_ml/).

+ These checkpoints are used for age estimation and gender recognition from speech, face images, and combined audio-visual inputs. In the `chimera-ml` ORAGEN pipeline, the multimodal model operates on intermediate audio and visual features extracted from the unimodal branches.

  ## Files

+ - `audio_model.pt`: audio-only checkpoint used for speech-based age estimation and gender recognition.
+ - `image_model.pt`: image-only checkpoint used for face-based feature extraction and prediction in the ORAGEN pipeline.
+ - `multimodal_model.pt`: audio-visual checkpoint that combines audio and image features for multimodal prediction.

  ## What They Predict

+ These models predict:

+ - age (0-100)
+ - gender (`female`, `male`)

+ The ORAGEN codebase also contains support for mask-related prediction in some model variants, but the exported multimodal configuration used here has `include_mask: false`.

+ ## Training Setup
+
+ According to the training configs in `examples/oragen/configs`:
+
+ - Audio training uses `facebook/wav2vec2-large-robust` as the backbone.
+ - The multimodal setup uses `agender_multimodal_model_v3`.
+ - The visual branch serves as an image feature extractor in the fusion pipeline; the configs reference it together with ORAGEN visual weights based on `nateraw/vit-age-classifier`.
+ - Training and inference use `16 kHz` audio split into `4s` windows with a `2s` shift.
+
+ Datasets referenced by the configs:
+
+ - Audio: `AGENDER`, `CommonVoice`, `TIMIT`
+ - Image: `LAGENDA`, `IMDB-Clean`, `AFEW`
+ - Multimodal: `VoxCeleb2`, `BRAVE-MASKS`
+
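The windowing stated in the training setup (16 kHz audio, 4 s windows, 2 s shift) can be sketched as below. This is a minimal illustration and not code from `chimera-ml`: the function name `window_bounds` is hypothetical, and the handling of short clips and of trailing samples that do not fill a full window is an assumption, since the README does not describe it.

```python
def window_bounds(num_samples, sample_rate=16_000, win_s=4.0, shift_s=2.0):
    """Return (start, end) sample indices for fixed-length sliding windows.

    Assumptions (not taken from the ORAGEN configs):
    - a clip no longer than one window yields a single window over the
      whole clip (any padding is left to the caller);
    - trailing samples that do not fill a complete window are dropped.
    """
    win = int(round(win_s * sample_rate))      # 4 s at 16 kHz -> 64000 samples
    shift = int(round(shift_s * sample_rate))  # 2 s at 16 kHz -> 32000 samples

    if num_samples <= win:
        return [(0, num_samples)]

    bounds = []
    start = 0
    while start + win <= num_samples:
        bounds.append((start, start + win))
        start += shift
    return bounds


if __name__ == "__main__":
    # A 10 s clip at 16 kHz has 160000 samples.
    for start, end in window_bounds(160_000):
        print(f"{start / 16_000:.1f}s - {end / 16_000:.1f}s")
```

With these assumptions, consecutive windows overlap by half their length, so each part of a long clip (apart from the edges) is seen by two windows.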
+ ## Per-Corpus Results
+
+ The training logs do not report raw accuracy directly. For gender prediction, the reported classification metrics are `gen_precision`, `gen_uar`, and `gen_macro_f1`. For age prediction, the reported regression metrics are `age_mae` and `age_pcc`.
+
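For readers unfamiliar with these metrics, the sketch below shows their standard textbook definitions in plain Python: unweighted average recall (UAR), macro F1, mean absolute error (MAE), and the Pearson correlation coefficient (PCC). It is an illustration under that assumption only; the function names are hypothetical, and the `chimera-ml` implementation behind `gen_uar`, `gen_macro_f1`, `age_mae`, and `age_pcc` may differ in detail.

```python
import math


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def macro_f1(y_true, y_pred):
    """Macro F1: the unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ages."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not ORAGEN results.
    gender_true = ["female", "female", "female", "male"]
    gender_pred = ["female", "female", "male", "male"]
    print(uar(gender_true, gender_pred), macro_f1(gender_true, gender_pred))
    print(mae([20, 30, 40], [22, 29, 43]), pcc([20, 30, 40], [22, 29, 43]))
```

UAR and macro F1 weight both gender classes equally regardless of how many samples each has, which is why they are commonly preferred over raw accuracy on imbalanced corpora. The tables report UAR and macro F1 as percentages, whereas these functions return fractions in [0, 1].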
+ ## Results from the original paper
+
+ ### Audio Model
+
+ | Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
+ |--------|---------|---------|---------------|--------------------|
+ | AGENDER | 10.60 | 0.83 | 87.17 | 86.25 |
+ | CommonVoice | 10.47 | 0.81 | 92.59 | 92.64 |
+ | TIMIT | 6.90 | 0.91 | 98.60 | 98.58 |
+ | VoxCeleb2 | 9.91 | 0.60 | 90.00 | 88.71 |
+ | BRAVE-MASKS (test) | 11.89 | 0.64 | 86.22 | 85.18 |
+
+ ### Image Model
+
+ | Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
+ |--------|---------|---------|---------------|--------------------|
+ | LAGENDA | 5.18 | 0.95 | 92.89 | 92.90 |
+ | AFEW | 5.62 | 0.82 | 95.16 | 94.98 |
+ | IMDB-Clean (test) | 5.47 | 0.84 | 98.37 | 98.26 |
+ | VoxCeleb2 | 5.97 | 0.64 | 98.37 | 98.16 |
+ | BRAVE-MASKS (test) | 8.71 | 0.74 | 94.44 | 94.43 |
+
+ ### Multimodal Model (intermediate fusion)
+
+ | Corpus | Age MAE | Age PCC | Gender UAR, % | Gender Macro F1, % |
+ |--------|---------|---------|---------------|--------------------|
+ | VoxCeleb2 | 5.68 | 0.66 | 99.11 | 99.02 |
+ | BRAVE-MASKS (test) | 8.73 | 0.74 | 94.95 | 94.89 |
+
+
+ ## Related publications
+
+ Markitantov M., Ryumina E., Karpov A. Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention. // Expert Systems with Applications. 2026. vol. 296. ID 127473. https://doi.org/10.1016/j.eswa.2025.127473
+
+ BibTeX:
+
+ ```bibtex
+ @article{markitantov2026oragen,
+ author = {Markitantov, Maxim and Ryumina, Elena and Karpov, Alexey},
+ title = {Audio-visual occlusion-robust gender recognition and age estimation approach based on multi-task cross-modal attention},
+ journal = {Expert Systems with Applications},
+ volume = {296},
+ pages = {127473},
+ year = {2026},
+ month = jan,
+ doi = {10.1016/j.eswa.2025.127473},
+ url = {https://doi.org/10.1016/j.eswa.2025.127473}
+ }
+ ```