---
license: apache-2.0
datasets:
- OpenFace-CQUPT/FaceCaption-15M
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: image-to-text
---
# About the Dataset
First download FaceCaption-15M from our Hugging Face repository, then apply for access to the original LAION-Face images by completing the required agreement (see the GitHub link below). Once approved, refer to the information on Hugging Face to obtain the corresponding image-text pairs.
**[25/06/09] 🤗 The original images are released after [completing the agreement](https://github.com/ddw2AIGROUP2CQUPT/Large-Scale-Multimodal-Face-Datasets).**
# Demonstration of Cross-modal Retrieval (FLIP-based model)
<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TGxEwHBbWZIbW67kG9jMH.mp4"></video>
# FLIP (Facial Language Image Pretraining)
This repository is the official implementation of [FaceCaption-15M](https://arxiv.org/abs/2407.08515).
# Updates:
**[24/07/20] The usage of FLIP has been released! [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**
**[24/07/17] The model named FLIP has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**
**Overview of FLIP architecture.**

**Fig. 1: (a) The same color represents shared parameters; "12x" stands for 12-layer transformer modules. (b), (c), and (d) The FLIP-based model applied to text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**
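For orientation, a dual-encoder model of this kind performs cross-modal retrieval by comparing normalized image and text embeddings. The sketch below is illustrative only; the embedding dimension and the way FLIP exposes its encoders are assumptions, not the released API.

```python
# Minimal sketch of CLIP/FLIP-style cross-modal retrieval.
# The 512-dim embeddings stand in for encoder outputs; this is not the exact FLIP interface.
import torch
import torch.nn.functional as F

def retrieve(image_embeds: torch.Tensor, text_embeds: torch.Tensor, top_k: int = 5):
    """Rank gallery images for each text query by cosine similarity."""
    image_embeds = F.normalize(image_embeds, dim=-1)  # (N_img, D)
    text_embeds = F.normalize(text_embeds, dim=-1)    # (N_txt, D)
    sims = text_embeds @ image_embeds.T               # (N_txt, N_img)
    return sims.topk(top_k, dim=-1).indices           # top-k image indices per query

# Example with random embeddings standing in for encoder outputs.
img = torch.randn(100, 512)
txt = torch.randn(8, 512)
print(retrieve(img, txt).shape)  # torch.Size([8, 5])
```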
## Training
Coming soon... (the training code is only meaningful once the dataset has been published).
```shell
python pretrain.py > log.log
```
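Until the training code is published, the objective can be assumed to follow the standard CLIP-style symmetric contrastive (InfoNCE) loss over matched image-text pairs. The sketch below illustrates that general recipe and is not the released FLIP implementation.

```python
# Sketch of a symmetric image-text contrastive (InfoNCE) loss, the standard
# objective for CLIP-style pretraining. This is an assumption about the general
# recipe, not the released FLIP training code.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature: float = 0.07):
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)              # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)            # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```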
## Pre-trained Models
We provide the following pretrained model weights:
- FLIP Base: [download](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)
- FLIP Large: coming soon
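The checkpoints can also be fetched programmatically. A minimal sketch with `huggingface_hub` is below; the repository id follows the download link above, while the exact file names under `ckpt/` are not listed here and should be checked on the Hub.

```python
# Sketch: download the FLIP Base checkpoint folder from the Hugging Face Hub.
# The repo id matches the download link above; verify the file names under
# ckpt/ on the Hub page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenFace-CQUPT/Facial-language-image-pretraining-model",
    allow_patterns=["ckpt/*"],  # fetch only the checkpoint directory
)
print(local_dir)
```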
## Datasets
Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).
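As a convenience, the dataset repository can be mirrored locally with `huggingface_hub`. This is only a sketch for fetching the files; pairing the captions with the original LAION-Face images follows the instructions in the dataset card, not anything shown here.

```python
# Sketch: mirror the FaceCaption-15M dataset repository locally.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="OpenFace-CQUPT/FaceCaption-15M",
    repo_type="dataset",
)
print(path)
```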
## Results
### Task1: Text-Image Retrieval
**Table 1:** Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.

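The linear-probe protocol used in Table 1 (frozen backbone, trainable linear layer) can be set up as in the sketch below; the encoder object and dimensions are placeholders rather than the exact FLIP modules.

```python
# Linear-probe sketch: freeze the pretrained image encoder and train only a
# linear head on top of its embeddings. The encoder and dimensions are
# placeholders, not the exact FLIP modules.
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, embed_dim: int, num_classes: int) -> nn.Module:
    for p in encoder.parameters():
        p.requires_grad = False              # backbone stays frozen
    head = nn.Linear(embed_dim, num_classes)  # only this layer is trained
    return nn.Sequential(encoder, head)
```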
### Task2: Facial Attributes Prediction
**Table 2:** Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.

### Task3: Sketch Less Facial Image Retrieval
**Table 3:** Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.


**Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo in the top-5 list from a partial sketch; our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at every stage.**
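The rank reported in Fig. 2 is the position of the paired photo in the similarity-sorted gallery for each partial sketch. A small sketch of computing that metric is below; the embedding shapes and pairing-by-index convention are assumptions for illustration.

```python
# Sketch: rank of the true-match photo for each (partial) sketch query.
# Rank 1 means the paired photo is the most similar item in the gallery.
import torch
import torch.nn.functional as F

def true_match_rank(sketch_embeds: torch.Tensor, photo_embeds: torch.Tensor) -> torch.Tensor:
    """Assumes sketch_embeds[i] is paired with photo_embeds[i]."""
    sims = F.normalize(sketch_embeds, dim=-1) @ F.normalize(photo_embeds, dim=-1).T
    order = sims.argsort(dim=-1, descending=True)    # gallery indices sorted by similarity
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return (order == targets).nonzero()[:, 1] + 1    # 1-based rank of the true match
```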
## Contacts
Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com
## Citation
```tex
@misc{dai202415mmultimodalfacialimagetext,
  title={15M Multimodal Facial Image-Text Dataset},
  author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
  year={2024},
  eprint={2407.08515},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.08515},
}
```