---
license: mit
datasets:
- weizhiwang/unifilter_train_data
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- google/siglip-so400m-patch14-384
---
# UniFilter

Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](), accepted to the Findings of EMNLP 2025.

## Release
<!-- - [3/31/2025] 🔥 We released all pre-training data in webdataset format at [Open-Qwen2VL-Data](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data).
- [3/31/2025] 🔥 We released the technical report for [**Open-Qwen2VL**](https://arxiv.org/abs/2504.00595). -->
- [8/25/2025] 🔥 We released the UniFilter model at [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B). Backed by a strong 1.5B LLM backbone, UniFilter delivers fast quality-score inference together with high classification accuracy.

## Introduction
UniFilter is a unified multimodal data quality classifier for high-quality multimodal data filtering: it generates quality scores for both image-text caption data and interleaved document data. These scores can then be used to filter for high-quality data, which significantly strengthens the capabilities of pre-trained MLLMs.

This repo supports:
- synthetic data generation
- UniFilter training
- quality score generation with [UniFilter-Qwen2.5-1.5B](https://huggingface.co/weizhiwang/UniFilter-Qwen2.5-1.5B)

## Installation
If you only need quality score generation, install just the customized LLaVA package:

```Shell
conda create -n unifilter python=3.10
conda activate unifilter
pip install -e LLaVA
pip install flash-attn==2.5.2 --no-build-isolation
```

## Synthetic Data Generation for UniFilter Training
We instruct Claude-3 or Claude-3.5 to generate (multimodal data example, quality score) pairs across 4 designated quality levels. The synthetic data generation scripts are listed below, followed by a usage sketch:
- [claude_sonnet_caption_data_generation.py](data_prepare/caption_data_scripts/claude_sonnet_caption_data_generation.py)
- [claude_sonnet_interleaved_data_generation.py](data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py)
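
A minimal invocation sketch for these scripts: the entry points come from this repo, while `ANTHROPIC_API_KEY` is the standard credential variable read by the Anthropic client, and any script-specific flags (input/output paths, quality-level counts) are left to the scripts themselves.

```Shell
# Sketch only: script paths come from this repo; ANTHROPIC_API_KEY is the
# standard Anthropic client variable, and script-specific flags are omitted.
export ANTHROPIC_API_KEY=...  # credentials for Claude-3 / Claude-3.5
python data_prepare/caption_data_scripts/claude_sonnet_caption_data_generation.py
python data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py
```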

## Data Preparation for UniFilter Training
UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, containing both caption data and interleaved document data. The synthetic multimodal example-score pairs are available at [UniFilter-Post-Train-Data]().

## UniFilter Training
We develop the UniFilter training and scoring codebase on top of the [LLaVA-Unified]() repo, which is adapted from LLaVA with support for recent LLMs and vision encoders.
<!-- An additional [LlavaPhi3Classifier](LLaVA/llava/model/language_model/llava_phi3.py#235) class is customized as the model class for UniFilter. -->

UniFilter consists of three modules: a vision encoder, a visual projector, and an LLM backbone. Unlike an MLLM, the LLM backbone has no language modeling head; we replace it with a score generation head. These modules are specified with the following flags (see the sketch after this list):
- `--mm_projector_type`: the visual projector, e.g. `aapool_mlp`, an average-pooling projector that yields 144 tokens per image
- `--vision_tower`: the vision encoder, e.g. SigLIP-SO-400M at 384px resolution
- `--model_name_or_path`: the LLM backbone, e.g. Qwen2.5-0.5B-Instruct
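
For illustration only, here is a hedged sketch of where these flags sit in a LLaVA-style DeepSpeed launch. The entry point `llava/train/train_mem.py` is the standard LLaVA trainer, and the remaining arguments (data paths, DeepSpeed config, output dir) are defined in the actual scripts under `scripts/v1_5`, so treat this as a sketch rather than the repo's exact command.

```Shell
# Sketch only: a LLaVA-style launch showing the three module flags above.
# The full argument list (data, deepspeed config, output dir) lives in the
# training scripts under scripts/v1_5.
deepspeed llava/train/train_mem.py \
    --model_name_or_path Qwen/Qwen2.5-1.5B-Instruct \
    --vision_tower google/siglip-so400m-patch14-384 \
    --mm_projector_type aapool_mlp
```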

### Visual Projector Pre-Training (Stage 1)

Please download the 558K subset of the LLaVA-Pretrain caption dataset from [LLaVA-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain).

Training script with DeepSpeed ZeRO-2: [`pretrain.sh`](scripts/v1_5/pretrain.sh).
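
A minimal download sketch using `huggingface-cli` from `huggingface_hub`; the local directory is an assumption, so match it to whatever path [`pretrain.sh`](scripts/v1_5/pretrain.sh) expects.

```Shell
# Sketch: fetch the LLaVA-Pretrain caption subset; the local dir is an
# assumption and should match the data path used in pretrain.sh.
pip install -U huggingface_hub
huggingface-cli download liuhaotian/LLaVA-Pretrain \
    --repo-type dataset --local-dir data/LLaVA-Pretrain
bash scripts/v1_5/pretrain.sh
```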

### UniFilter Classifier Training (Stage 2)

Training script with DeepSpeed ZeRO-3: [`train_classifier.sh`](scripts/v1_5/train_classifier.sh).

The training script uploads metrics to wandb, and the best UniFilter checkpoint is saved according to the highest quality classification accuracy on the validation sets.
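
Since metrics are logged to wandb, a typical launch authenticates first; the `wandb login` step is an assumption based on the standard wandb CLI.

```Shell
# Sketch: authenticate with wandb (standard CLI), then launch Stage-2 training.
wandb login
bash scripts/v1_5/train_classifier.sh
```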

## Quality Score Generation

### Caption Data Quality Scoring
```Shell
python data_scoring/data_quality_classifier_caption_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/datacomp/medium_vanilla_filter \
    --gpu-id 0 \
    --batch-size 4 \
    --tars-per-gpu 256
```

### Interleaved Data Quality Scoring
```Shell
python data_scoring/data_quality_classifier_interleaved_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/OBELICS/obelics_webdataset \
    --gpu-id 0 \
    --batch-size 1 \
    --tars-per-gpu 128
```

Parameters to note (a usage sketch follows the list):
- `--gpu-id`: for large-scale score generation across multiple machines, the index of the current machine
- `--model-path`: path to the UniFilter model checkpoint
- `--tar-file-path`: path to the webdataset tars of image-text caption data or interleaved document data
- `--tars-per-gpu`: the number of webdataset tars assigned to a single GPU for inference
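
As a usage sketch, large-scale scoring can be sharded across machines by giving each machine its own index. How `--gpu-id` maps machine indices to tar shards is our assumption from the flag description above, so verify it against the scoring script before scaling out.

```Shell
# Sketch only: per-machine sharding for large-scale scoring. The mapping
# from --gpu-id (machine index) to the slice of tars each machine scores
# is an assumption based on the flag descriptions above.
MACHINE_INDEX=0   # set to 0..N-1, one value per machine
python data_scoring/data_quality_classifier_caption_scoring.py \
    --model-path weizhiwang/UniFilter-Qwen2.5-1.5B \
    --tar-file-path data/datacomp/medium_vanilla_filter \
    --gpu-id $MACHINE_INDEX \
    --batch-size 4 \
    --tars-per-gpu 256
```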

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): the codebase we built upon for UniFilter training.