Bug report: SigLIP 2 does not seem to handle the apostrophe (single quote) character correctly
I am using the SigLIP and SigLIP 2 models to compare similarity scores between texts.
I have a query input: query="The people are racing on horses",
and 3 text contexts:
contexts_text=[ "Horse racing is a popular sport", "The man is swimming", "Don't sleep in the car please" ],
I vary the 3rd context between "Don't sleep in the car please" and "Dont sleep in the car please" (the second without the apostrophe).
When computing similarity scores between the query and the text contexts in a batch, I should ALWAYS get the same score for a given context, regardless of the other contexts in the batch. For SigLIP 1, this works correctly.
I get similarity scores: Text similarities: [0.88092041015625, 0.7309052348136902, 0.726897656917572] for the first version (with apostrophe)
and Text similarities: [0.88092041015625, 0.7309052348136902, 0.726897656917572] for the second version (without apostrophe).
Since the scores are identical, I believe the SigLIP 1 tokenizer probably does some preprocessing on this character.
But for SigLIP 2, I get similarity scores: Text similarities: [0.9056751132011414, 0.8166967630386353, 0.9067648649215698] for the first version
and Text similarities: [0.698212206363678, 0.6713727712631226, 0.8529390096664429] for the second version.
So for SigLIP 2, the apostrophe changes the results for the entire batch: the similarity scores differ for all 3 contexts, not just the last one.
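One way to narrow this down is to encode each context on its own and compare the scores against the batched call. Below is a minimal sketch of that check. The two encoders, `toy_encode` and `toy_encode_leaky`, are hypothetical stand-ins invented for illustration (the leaky one pads to the longest text in the batch and lets the padding leak into the embedding); they are not the SigLIP 2 text tower and make no claim about what actually causes the behaviour above.

```python
import numpy as np

def cosine_similarities(query_vec, context_vecs):
    """Cosine similarity between one query vector and each row of context_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    c = context_vecs / np.linalg.norm(context_vecs, axis=1, keepdims=True)
    return c @ q

def batch_invariance_report(encode, query, contexts, atol=1e-5):
    """Compare scores from one batched call against one call per context.

    `encode` is any callable mapping list[str] -> array of shape (n, d).
    Returns (batched_scores, one_by_one_scores, is_batch_invariant).
    """
    q = encode([query])[0]
    batched = cosine_similarities(q, encode(contexts))
    alone = np.array([cosine_similarities(q, encode([t]))[0] for t in contexts])
    max_diff = float(np.abs(batched - alone).max())
    return batched, alone, max_diff <= atol

def toy_encode(texts, dim=8):
    """Hypothetical encoder: every text is embedded independently of the batch."""
    out = []
    for t in texts:
        v = np.zeros(dim)
        for i, ch in enumerate(t):
            v[(ord(ch) + i) % dim] += 1.0
        out.append(v)
    return np.array(out)

def toy_encode_leaky(texts, dim=8):
    """Hypothetical encoder whose output depends on the longest text in the batch."""
    max_len = max(len(t) for t in texts)
    vecs = toy_encode(texts, dim)
    # Padding length leaks into the embedding, so shorter texts change
    # whenever the longest text in the batch changes.
    vecs[:, 0] += np.array([max_len - len(t) for t in texts], dtype=float)
    return vecs

query = "The people are racing on horses"
contexts = [
    "Horse racing is a popular sport",
    "The man is swimming",
    "Don't sleep in the car please",
]

for name, enc in [("toy_encode", toy_encode), ("toy_encode_leaky", toy_encode_leaky)]:
    batched, alone, ok = batch_invariance_report(enc, query, contexts)
    print(name, "batch-invariant:", ok)
```

With a real model, `encode` would be a small wrapper around the tokenizer and the text encoder that returns a NumPy array. If the batched and one-by-one scores disagree there, the difference comes from batching (for example, how the batch is tokenized or padded) rather than from the similarity computation.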
I'm sure there is no bug in how I compute similarity scores, as I do the same for the CLIP, SigLIP, and SigLIP 2 models, and varying the prompts works as expected for all the other models. I believe there must be some sort of bug in the text tokenizer for SigLIP 2 (across all model patch sizes).
For reference, my CLIP scores (using 'ViT-B-32', 'laion2b_s34b_b79k') on these same prompts are: Text similarities: [0.8413525819778442, 0.5710945129394531, 0.5424197912216187]
and Text similarities: [0.8413525819778442, 0.5710945129394531, 0.5516183972358704]
Here only the third score changes, so I do believe SigLIP 1 probably does some preprocessing specifically for characters such as the apostrophe (').
I would love to see if other people have similar issues! Hopefully this helps some of you.