
visheratin posted an update almost 2 years ago
Yesterday, xAI announced Grok-1.5 Vision - https://x.ai/blog/grok-1.5v. But more importantly, they also released a new VLM benchmark dataset - RealWorldQA. The only problem was that they released it as a ZIP archive. I fixed that! Now you can use it in your evaluations as a regular HF dataset: visheratin/realworldqa
visheratin posted an update almost 2 years ago
Look at the beauty in the video: four different embeddings on the same map! In another community blog post, I explore how you can use Nomic Atlas to view and clean your dataset. You can check it out here - https://huggingface.co/blog/visheratin/nomic-data-cleaning
visheratin posted an update almost 2 years ago
Keep stacking cool stuff and getting better results! After I replaced the standard vision encoder with SigLIP, NLLB-CLIP got a 10% average performance improvement. Now I have also added matryoshka layers (https://arxiv.org/abs/2205.13147) to enable smaller embeddings and got another 6% performance boost! Plus, thanks to MRL, 4.5x smaller embeddings retain 90%+ of the quality.

The large model is finally SoTA for both multilingual image and multilingual text retrieval!

The models are available on the hub:
- visheratin/nllb-siglip-mrl-base
- visheratin/nllb-siglip-mrl-large
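A minimal sketch of how matryoshka-style embeddings are used at inference time: keep only a prefix of the full vector and re-normalize it to unit length (the idea from the MRL paper linked above). The 768-dimensional random vector and the 170-dimensional target here are illustrative assumptions, not the actual NLLB-SigLIP outputs.

```python
import numpy as np

# Illustrative full-size unit embedding (dimensions are assumptions,
# not the real NLLB-SigLIP embedding size).
rng = np.random.default_rng(0)
full = rng.normal(size=768)
full /= np.linalg.norm(full)

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    as matryoshka-trained models allow."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

# 170 dims is roughly 4.5x smaller than 768.
small = truncate_embedding(full, 170)
print(small.shape)  # → (170,)
```

Because MRL trains the model so that leading dimensions carry most of the information, the truncated vector can be used for retrieval directly, trading a small quality drop for much cheaper storage and faster similarity search.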
visheratin posted an update almost 2 years ago
VLMs have a resolution problem that prevents them from finding small details in large images. In my community blog post, I discuss ways to solve it and describe the details of the MC-LLaVA architecture - https://huggingface.co/blog/visheratin/vlm-resolution-curse

Check it out, and let me know what you think!
visheratin posted an update almost 2 years ago
Isn't it sad that VLMs don't have any inference parameters for the vision part? Well, MC-LLaVA now has two whole knobs you can use to make it find even the smallest details! I finally (almost) properly implemented multi-crop, and now you can control the number of crops and how many image tokens will be generated. The video shows how, by increasing the number of crops and tokens, my 3B model correctly identifies a 30x90 pixel logo in a 3200x3000 pixel image.
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines.
- Performance increased by 10%+ across all benchmarks.

The work is far from over, but it feels like good progress.

The model on the hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b
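The multi-crop idea above can be sketched as a simple grid tiling: split the large image into an n x n grid and encode each crop separately, so small details cover a larger fraction of their crop. The function name and tiling scheme below are illustrative assumptions, not MC-LLaVA's actual implementation.

```python
# Sketch of grid-based multi-crop for high-resolution VLM inputs.
# NOTE: crop_boxes is a hypothetical helper, not MC-LLaVA's real API.
def crop_boxes(width: int, height: int, n: int) -> list[tuple[int, int, int, int]]:
    """Split a width x height image into an n x n grid of
    (left, top, right, bottom) pixel boxes. Each crop would then be
    encoded by the vision tower into its own set of image tokens."""
    boxes = []
    for row in range(n):
        for col in range(n):
            left = width * col // n
            right = width * (col + 1) // n
            top = height * row // n
            bottom = height * (row + 1) // n
            boxes.append((left, top, right, bottom))
    return boxes

# A 3200x3000 image split into a 4x4 grid yields 16 crops of ~800x750 pixels,
# so a 30x90 pixel logo occupies a far larger share of its crop than of the full image.
boxes = crop_boxes(3200, 3000, 4)
print(len(boxes), boxes[0], boxes[-1])
# → 16 (0, 0, 800, 750) (2400, 2250, 3200, 3000)
```

Exposing n (the number of crops) and the per-crop token budget as inference parameters is exactly the kind of knob described in the post: more crops and more tokens cost compute but recover finer detail.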