# MeFEm: Medical Face Embedding Models

Vision Transformers pre-trained on face data for potential medical applications. Available in Small (MeFEm-S) and Base (MeFEm-B) sizes.

## Quick Start

```python
import torch
import timm

# Load model (MeFEm-S example)
model = timm.create_model(
    'vit_small_patch16_224',
    pretrained=False,
    num_classes=0,        # No classification head
    global_pool='token'   # Use CLS token (default)
)
# map_location='cpu' lets the checkpoint load on machines without a GPU
model.load_state_dict(torch.load('mefem-s.pt', map_location='cpu'))
model.eval()

# Forward pass
x = torch.randn(1, 3, 224, 224)  # Placeholder; replace with your preprocessed face image
with torch.no_grad():            # Inference only, no gradients needed
    embeddings = model(x)        # [1, 384] CLS token embeddings
```

## Model Details

- **Architecture**: ViT-Small/16 (384-dim) or ViT-Base/16 (768-dim) with CLS token
- **Training**: Modified I-JEPA on ~6.5M face images
- **Input**: Face crops with 2× expanded bounding boxes, 224×224 resolution
- **Output**: CLS token embeddings (`global_pool='token'`) or all tokens (`global_pool=''`)

## Usage Tips

```python
# For all tokens (CLS + patches):
model = timm.create_model('vit_small_patch16_224', num_classes=0, global_pool='')
# Load the checkpoint as in Quick Start before running inference
tokens = model(x)  # [1, 197, 384]

# For patch embeddings only:
tokens = model.forward_features(x)
patch_embeddings = tokens[:, 1:]  # [1, 196, 384], CLS token dropped
```

## Training Data

Face images from the FaceCaption-15M, AVSpeech, and SHFQ datasets (~6.5M total). Images were cropped with expanded (2×) face bounding boxes.

## Notes

- Optimized for face images with loose cropping
- Intended for representation learning and transfer to medical tasks
- Results may vary for non-face or tightly cropped images
- More information on training and metrics is available in the [paper](https://arxiv.org/pdf/2602.14672)

## License

CC BY 4.0. Please cite the paper if you use these models:

```
@misc{borets2026mefemmedicalfaceembedding,
  title={MeFEm: Medical Face Embedding model},
  author={Yury Borets and Stepan Botman},
  year={2026},
  eprint={2602.14672},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.14672},
}
```
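
## Appendix: Preparing Input Crops

The models expect face crops taken from 2× expanded bounding boxes, so a detector's tight box needs to be enlarged before cropping. The following is a minimal sketch of that step, not the authors' preprocessing code. It assumes that "2×" means both width and height are doubled about the box center, and that boxes running past the image edge are clamped rather than padded. The `expand_box` helper and its parameters are hypothetical names used for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def expand_box(
    box: Box, img_w: int, img_h: int, scale: float = 2.0
) -> Tuple[int, int, int, int]:
    """Scale a box about its center, then clamp it to the image bounds.

    Assumption: scale=2.0 doubles width and height (not area).
    Assumption: out-of-image regions are clamped, not padded.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0   # box center
    half_w = (x2 - x1) * scale / 2.0            # half of the expanded width
    half_h = (y2 - y1) * scale / 2.0            # half of the expanded height
    left = max(0, int(round(cx - half_w)))
    top = max(0, int(round(cy - half_h)))
    right = min(img_w, int(round(cx + half_w)))
    bottom = min(img_h, int(round(cy + half_h)))
    return left, top, right, bottom


# A 100x100 box centered at (150, 150) grows to 200x200 around the same center.
print(expand_box((100, 100, 200, 200), img_w=640, img_h=480))
```

The returned tuple uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so the crop can then be resized to 224×224. Clamping can produce non-square crops near image borders. The normalization statistics are not stated here, so check the paper before fixing a transform.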