The training data is sourced from two types of datasets:

- **Coarse semantic labels**: 8.5 million images paired with captions of varying quality, ranging from well-defined descriptions to noisy and less relevant text.

### 2. Data Filtering

To refine the coarse dataset, we propose a data filtering strategy built on a CLIP-based model, CLIP_Sem, which is pre-trained on high-quality captions so that only semantically accurate image-text pairs are retained. We compute a similarity score (SS) between the image and the text of each pair and discard captions with low similarity.
*Figure 1: Data Refinement Process of the CLIP-RS Dataset. Left: Workflow for filtering and refining low-quality captions. Right: Examples of low-quality captions and their refined versions.*
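
As a rough illustration of this filtering step, the sketch below scores image-caption pairs with an off-the-shelf CLIP checkpoint from Hugging Face `transformers` and drops pairs whose score falls below a threshold. The checkpoint name, the `SS_THRESHOLD` value, and the helper functions are assumptions for illustration only; the actual pipeline would load the CLIP_Sem weights described above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint for illustration; the actual pipeline would load the
# CLIP_Sem weights pre-trained on high-quality captions.
CHECKPOINT = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(CHECKPOINT).eval()
processor = CLIPProcessor.from_pretrained(CHECKPOINT)

SS_THRESHOLD = 0.25  # assumed cutoff; the README does not state the value used


@torch.no_grad()
def similarity_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity (SS) between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()


def filter_pairs(pairs):
    """Keep only the image-caption pairs whose SS clears the threshold."""
    return [(img, cap) for img, cap in pairs
            if similarity_score(img, cap) >= SS_THRESHOLD]
```

In practice the scoring would be batched over the 8.5 million pairs, but the per-pair form keeps the thresholding logic explicit.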