lusxvr committed on
Commit
07f5cb0
·
1 Parent(s): 717a796
Files changed (1)
  1. app/src/content/article.mdx +4 -3
app/src/content/article.mdx CHANGED
@@ -272,10 +272,11 @@ Compared against existing VLM training datasets, FineVision produces significant
272
  <HtmlEmbed src="against-baselines.html" desc="Average Rank of Models trained on different open source datasets." />
273
 
274
  ### How contaminated are the datasets?
275
- To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of 66 image-test datasets from the lmms-eval framework using the SSCD descriptor and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity above a threshold of 0.95, it is assumed to be a duplicate. While our tests with various thresholds show that this flags some samples that are not actual duplicates (especially when the image depicts similar but distinct content, like graphs or tables), we preferred to err on the side of caution. We open-source the deduplication pipeline here as well as the precomputed test-set embeddings here.
276
-
277
- <HtmlEmbed src="comparison.html" desc="desc" title="title"/>
278
 
 
 
 
279
 
280
  | Name | Samples | Contamination Rate | Performance Drop |
281
  |---------------|---------|--------------------|------------------|
 
272
  <HtmlEmbed src="against-baselines.html" desc="Average Rank of Models trained on different open source datasets." />
273
 
274
  ### How contaminated are the datasets?
275
+ To investigate data leakage from benchmarks into this dataset, we construct a deduplication pipeline based on the sample images. We embed the images of 66 image-test datasets from the lmms-eval framework using the SSCD descriptor and compute the cosine similarity between our samples and the test-set embeddings. Whenever a sample has a similarity above a threshold of 0.95, it is assumed to be a duplicate. While our tests with various thresholds show that this still flags more false positives than false negatives, we preferred to err on the side of caution. Below is an example of a correctly identified duplicate ("Photo"), a false positive with a similarity score above 0.95 ("Chart"), and a false negative with a similarity score below 0.95 ("Drawing"). We open-source the deduplication pipeline here as well as the precomputed test-set embeddings here.
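The thresholding step described above can be sketched as follows. This is a minimal illustration, not the released pipeline: `flag_duplicates` is a hypothetical helper, and the random 512-dimensional vectors stand in for real SSCD descriptors.

```python
import numpy as np

def flag_duplicates(sample_embs, test_embs, threshold=0.95):
    """Flag samples whose best cosine similarity to any test-set embedding exceeds the threshold."""
    # Normalize rows so the dot product equals cosine similarity.
    s = sample_embs / np.linalg.norm(sample_embs, axis=1, keepdims=True)
    t = test_embs / np.linalg.norm(test_embs, axis=1, keepdims=True)
    sims = s @ t.T              # (n_samples, n_test) similarity matrix
    max_sim = sims.max(axis=1)  # best match per sample
    return max_sim >= threshold

# Toy data: random stand-ins for SSCD descriptors.
rng = np.random.default_rng(0)
test = rng.normal(size=(4, 512))
samples = np.vstack([
    test[0] + 0.001 * rng.normal(size=512),  # near-duplicate of a test image
    rng.normal(size=(2, 512)),               # unrelated samples
])
print(flag_duplicates(samples, test))  # → [ True False False]
```

In practice the sample/test similarity matrix is too large to materialize at once, so the comparison is typically done in batches or with an approximate-nearest-neighbor index; the threshold logic stays the same.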
 
 
276
 
277
+ <Wide>
278
+ <HtmlEmbed src="comparison.html" desc="Examples of the Deduplication Pipeline."/>
279
+ </Wide>
280
 
281
  | Name | Samples | Contamination Rate | Performance Drop |
282
  |---------------|---------|--------------------|------------------|