---
license: apache-2.0
datasets:
- OpenFace-CQUPT/FaceCaption-15M
language:
- zh
- en
metrics:
- accuracy
pipeline_tag: image-to-text
---

# About the Dataset

First download FaceCaption-15M from our Hugging Face page, then apply for access to the original LAION-Face images by completing the required agreement on GitHub. Once approved, follow the instructions on Hugging Face to obtain the corresponding image-text pairs.

**[25/06/09] 🤗 The original images are released: [complete the agreement](https://github.com/ddw2AIGROUP2CQUPT/Large-Scale-Multimodal-Face-Datasets)**

# Demonstration of Cross-modal Retrieval (FLIP-based model)

<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/TGxEwHBbWZIbW67kG9jMH.mp4"></video>
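The retrieval shown in the video follows the standard contrastive-pretraining recipe: encode each modality, L2-normalize the embeddings, and rank the gallery by cosine similarity. The sketch below illustrates that ranking step in plain Python; the embeddings are toy vectors standing in for the outputs of FLIP's image and text encoders, not actual model outputs.

```python
# Minimal sketch of cross-modal retrieval by cosine similarity.
# Toy embeddings only; a real pipeline would use the FLIP encoders.
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query, gallery, k=5):
    """Indices of the k gallery vectors most similar to the query.
    Cosine similarity = dot product of unit-normalized vectors."""
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)[:k]

# A text embedding retrieving among three candidate image embeddings.
text_emb = [1.0, 0.0]
image_embs = [[0.0, 1.0], [0.9, 0.1], [0.7, 0.7]]
print(retrieve(text_emb, image_embs, k=3))  # → [1, 2, 0], most similar first
```

Normalizing first means the dot product is exactly the cosine of the angle between embeddings, so the ranking ignores vector magnitude.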

# FLIP (Facial Language Image Pretraining)

This repository is the official implementation of [FaceCaption-15M]().

# Updates:

**[24/07/20] The usage of FLIP has been released! [OpenFace-CQUPT/FLIP-demo](https://huggingface.co/OpenFace-CQUPT/FLIP/tree/main/FLIP-demo)**  

**[24/07/17] The model named FLIP has been released! [OpenFace-CQUPT/FLIP](https://huggingface.co/OpenFace-CQUPT/FLIP)**  

**Overview of FLIP architecture.**

![image-20240318101027127](https://img.yutangli.net/img/202403181010116.png)

**Fig. 1: (a) The same color represents shared parameters; "12x" stands for 12-layer transformer modules. (b), (c), and (d): the FLIP-based model applied to text-image retrieval, facial attribute prediction, and sketch-less facial image retrieval, respectively.**

## Training

Coming soon... (The training code is only meaningful once the dataset it depends on has been published.)

```shell
# Launch pretraining, capturing both stdout and stderr to a log file
python pretrain.py > log.log 2>&1
```

## Pre-trained Models

We provide pretrained model weights:  
FLIP Base: [download here](https://huggingface.co/OpenFace-CQUPT/Facial-language-image-pretraining-model/tree/main/ckpt)  
FLIP Large: coming soon...

## Datasets

Download the FaceCaption-15M dataset from [here](https://huggingface.co/datasets/OpenFace-CQUPT/FaceCaption-15M).


## Results

### Task1: Text-Image Retrieval

**Table 1:** Comparison with other classical pretrained models. All pretrained model backbones are frozen, with only the linear layer being fine-tuned. † represents the model pretrained on the LAION-Face [86] dataset; * represents the model pretrained on the FaceCaption dataset constructed without using LLM text generation.

![](https://img.yutangli.net/img/202403181015142.png)
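The evaluation protocol in Table 1 is a linear probe: the pretrained backbone is frozen, so every sample reduces to a fixed feature vector, and only a linear layer on top is trained. The sketch below shows that idea as a tiny logistic-regression probe in plain Python; the "frozen features" and labels are illustrative toy data, not FLIP outputs.

```python
# Hedged sketch of a linear probe: train only a logistic-regression
# layer on frozen feature vectors (toy data, not real FLIP features).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights and bias for binary classification by gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Toy frozen features: the two classes separate along the first dimension.
feats = [[1.0, 0.2], [0.9, -0.1], [-1.0, 0.3], [-0.8, 0.0]]
labs = [1, 1, 0, 0]
w, b = train_linear_probe(feats, labs)
preds = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) for x in feats]
print(preds)  # → [1, 1, 0, 0]: the probe separates the toy classes
```

Because the backbone never updates, probe accuracy directly measures how linearly separable the pretrained features already are.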

### Task2: Facial Attributes Prediction

**Table 2:** Comparison with other classical models. † represents the model pre-trained on the original LAION-Face dataset.

![image-20240318101126897](https://img.yutangli.net/img/202403181011115.png)
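The accuracy metric used in Table 2 treats each face as a binary vector of attributes and averages correctness over every (sample, attribute) pair. A minimal sketch, with illustrative predictions rather than real model outputs:

```python
# Hedged sketch of multi-label attribute accuracy: mean agreement over
# all (sample, attribute) pairs. The 0/1 vectors below are toy values.
def attribute_accuracy(pred, true):
    """Fraction of (sample, attribute) pairs predicted correctly."""
    correct = sum(p == t
                  for row_p, row_t in zip(pred, true)
                  for p, t in zip(row_p, row_t))
    total = sum(len(row) for row in true)
    return correct / total

# Two faces, three binary attributes each (e.g. smiling, eyeglasses, hat).
pred = [[1, 0, 1], [0, 0, 1]]
true = [[1, 0, 0], [0, 1, 1]]
print(attribute_accuracy(pred, true))  # 4 of 6 pairs match → 0.666...
```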

### Task3: Sketch Less Facial Image Retrieval

**Table 3:** Comparative results with different baseline methods. † represents the model pre-trained on the LAION-Face dataset.

![image-20240318101633671](https://img.yutangli.net/img/202403181016876.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/663f06e01cd68975883a353e/snd-9JBKJnRuZpm0Wp38f.png)

**Fig. 2: Demonstration of our FLIP-based model on the SLFIR task. Both methods can retrieve the target face photo into the top-5 list from a partial sketch, but our FLIP-based model achieves this with fewer strokes than the baseline. The number at the bottom denotes the rank of the paired (true-match) photo at every stage.**
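The per-stage number in Fig. 2 is the rank of the true-match photo: after each new stroke, the gallery is re-scored and the position of the paired photo is recorded. A minimal sketch of that statistic, using made-up similarity scores:

```python
# Sketch of the rank-at-each-stage statistic from the SLFIR demo.
# Scores are toy similarities of each gallery photo to the partial sketch.
def rank_of_match(scores, true_idx):
    """1-based rank of the true match: 1 + count of items scored higher."""
    return 1 + sum(s > scores[true_idx]
                   for i, s in enumerate(scores) if i != true_idx)

# Gallery scores after 1, 2, and 3 strokes; the true match is index 0.
stage_scores = [
    [0.2, 0.5, 0.4],   # early strokes: true match ranks low
    [0.45, 0.5, 0.3],
    [0.7, 0.5, 0.3],   # enough strokes: true match ranks first
]
print([rank_of_match(s, true_idx=0) for s in stage_scores])  # → [3, 2, 1]
```

A rank that drops to 1 with fewer strokes is exactly what Fig. 2 shows for the FLIP-based model versus the baseline.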

## Contacts
Email: 2018211556@stu.cqupt.edu.cn or dw_dai@163.com

## Citation
```tex
@misc{dai202415mmultimodalfacialimagetext,
      title={15M Multimodal Facial Image-Text Dataset}, 
      author={Dawei Dai and YuTang Li and YingGe Liu and Mingming Jia and Zhang YuanHui and Guoyin Wang},
      year={2024},
      eprint={2407.08515},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.08515}, 
}
```