VLMs

non-profit
Activity Feed

AI & ML interests

None defined yet.

Recent Activity

andito  new activity about 2 hours ago
vlmbook/notebooks:Missing code
andito  updated a dataset about 3 hours ago
vlmbook/notebooks
merve  updated a dataset about 1 month ago
vlmbook/images
View all activity

andito 
in vlmbook/notebooks about 2 hours ago

Missing code

🔥 1
2
#1 opened 17 days ago by
mgat1
andito 
posted an update 8 months ago
view post
Post
2836
Finally, our new paper is out! "𝗙𝗶𝗻𝗲𝗩𝗶𝘀𝗶𝗼𝗻: 𝗢𝗽𝗲𝗻 𝗗𝗮𝘁𝗮 𝗜𝘀 𝗔𝗹𝗹 𝗬𝗼𝘂 𝗡𝗲𝗲𝗱"! 🥳
FineVision: Open Data Is All You Need (2510.17269)

If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible.
We wanted to change that.

FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.

In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources
🤗 decontaminating against 66 public benchmarks

My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!

🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉HuggingFaceM4/FineVision_full_shuffled

It’s ready to stream, so you can start training your own models right away:

from datasets import load_dataset
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))

A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
merve 
posted an update 9 months ago
view post
Post
13418
deepseek-ai/DeepSeek-OCR is out! 🔥 my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient per vision tokens/performance ratio
> covers 100 languages
  • 4 replies
·
merve 
posted an update 9 months ago
view post
Post
7063
large AI labs open-sourced a ton of models last week 🔥
here's few picks, find even more here merve/sep-16-releases-68d13ea4c547f02f95842f05 🤝
> IBM released a new Docling model with 258M params based on Granite (A2.0) 📝 ibm-granite/granite-docling-258M
> Xiaomi released 7B audio LM with base and instruct variants (MIT) XiaomiMiMo/mimo-audio-68cc7202692c27dae881cce0
> DecartAI released Lucy Edit, open Nano Banana 🍌 (NC) decart-ai/Lucy-Edit-Dev
> OpenGVLab released a family of agentic computer use models (3B/7B/32B) with the dataset 💻 OpenGVLab/scalecua-68c912cf56f7ff4c8e034003
> Meituan Longcat released thinking version of LongCat-Flash 💭 meituan-longcat/LongCat-Flash-Thinking
  • 2 replies
·
merve 
posted an update 10 months ago
view post
Post
3570
IBM just released small swiss army knife for the document models: granite-docling-258M on Hugging Face 🔥

> not only a document converter but also can do document question answering, understand multiple languages 🤯
> best part: released with Apache 2.0 license 👏 use it with your commercial projects!
> it supports transformers, vLLM and MLX from the get-go! 🤗
> built on SigLIP2 & granite-165M

model: ibm-granite/granite-docling-258M
demo: ibm-granite/granite-docling-258m-demo 💗
merve 
posted an update 10 months ago
merve 
in vlmbook/images 10 months ago

Upload invoice (1).png

1
#3 opened 10 months ago by
mervenoyan
merve 
posted an update 10 months ago
view post
Post
1094
fan-favorite vision LM Florence-2 is now officially supported in transformers 🤗

find all the models in
florence-community
org 🫡
merve 
posted an update 10 months ago