
visheratin posted an update almost 2 years ago
Yesterday, xAI announced Grok-1.5 Vision - https://x.ai/blog/grok-1.5v. But more importantly, they also released a new VLM benchmark dataset - RealWorldQA. The only problem was that they released it as a ZIP archive. I fixed that! Now you can use it in your evaluations as a regular HF dataset: visheratin/realworldqa
visheratin posted an update almost 2 years ago
Look at the beauty in the video: four different embeddings on the same map! In another community blog post, I explore how you can use Nomic Atlas to view and clean your dataset. You can check it out here - https://huggingface.co/blog/visheratin/nomic-data-cleaning
visheratin posted an update almost 2 years ago
Keep stacking cool stuff and getting better results! After I replaced the standard vision encoder with SigLIP, NLLB-CLIP got a 10% average performance improvement. Now I have also added matryoshka layers (https://arxiv.org/abs/2205.13147) to enable smaller embeddings and got another 6% performance boost! Plus, thanks to MRL, 4.5x smaller embeddings retain 90%+ of the quality.

The large model is finally SoTA for both multilingual image and multilingual text retrieval!

The models are available on the hub:
- visheratin/nllb-siglip-mrl-base
- visheratin/nllb-siglip-mrl-large
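A minimal sketch of how matryoshka-style embeddings are used at inference time: keep only a prefix of the full vector and re-normalize it to unit length (the idea from the MRL paper linked above). The 768-dimensional random vector and the 170-dimensional target here are illustrative assumptions, not the actual NLLB-SigLIP outputs.

```python
import numpy as np

# Illustrative full-size unit embedding (dimensions are assumptions,
# not the real NLLB-SigLIP embedding size).
rng = np.random.default_rng(0)
full = rng.normal(size=768)
full /= np.linalg.norm(full)

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length,
    as matryoshka-trained models allow."""
    small = vec[:dim]
    return small / np.linalg.norm(small)

# 170 dims is roughly 4.5x smaller than 768.
small = truncate_embedding(full, 170)
print(small.shape)  # → (170,)
```

Because MRL trains the model so that leading dimensions carry most of the information, the truncated vector can be used for retrieval directly, trading a small quality drop for much cheaper storage and faster similarity search.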
visheratin posted an update almost 2 years ago
VLMs have a resolution problem that prevents them from finding small details in large images. In my community blog post, I discuss ways to solve it and describe the details of the MC-LLaVA architecture - https://huggingface.co/blog/visheratin/vlm-resolution-curse

Check it out, and let me know what you think!
visheratin posted an update almost 2 years ago
Isn't it sad that VLMs don't have any inference parameters for the vision part? Well, MC-LLaVA now has two whole knobs you can use to make it find even the smallest details! I finally (almost) properly implemented multi-crop, and now you can control the number of crops and how many image tokens will be generated. The video shows how, by increasing the number of crops and tokens, my 3B model correctly identifies a 30x90 pixel logo in a 3200x3000 pixel image.
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines.
- Performance increased by 10%+ across all benchmarks.

The work is far from over, but it feels like good progress.

The model on the hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b
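The multi-crop idea above can be sketched as a simple grid tiling: split the large image into an n x n grid and encode each crop separately, so small details cover a larger fraction of their crop. The function name and tiling scheme below are illustrative assumptions, not MC-LLaVA's actual implementation.

```python
# Sketch of grid-based multi-crop for high-resolution VLM inputs.
# NOTE: crop_boxes is a hypothetical helper, not MC-LLaVA's real API.
def crop_boxes(width: int, height: int, n: int) -> list[tuple[int, int, int, int]]:
    """Split a width x height image into an n x n grid of
    (left, top, right, bottom) pixel boxes. Each crop would then be
    encoded by the vision tower into its own set of image tokens."""
    boxes = []
    for row in range(n):
        for col in range(n):
            left = width * col // n
            right = width * (col + 1) // n
            top = height * row // n
            bottom = height * (row + 1) // n
            boxes.append((left, top, right, bottom))
    return boxes

# A 3200x3000 image split into a 4x4 grid yields 16 crops of ~800x750 pixels,
# so a 30x90 pixel logo occupies a far larger share of its crop than of the full image.
boxes = crop_boxes(3200, 3000, 4)
print(len(boxes), boxes[0], boxes[-1])
# → 16 (0, 0, 800, 750) (2400, 2250, 3200, 3000)
```

Exposing n (the number of crops) and the per-crop token budget as inference parameters is exactly the kind of knob described in the post: more crops and more tokens cost compute but recover finer detail.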