Today's content moderation systems give you a label: safe or unsafe. They don't tell you what triggered the decision, who is involved, or where in the image it happens. That opacity hurts auditing, breaks adaptation across platforms, and frustrates the human review that responsible deployment demands.
We built SenBen to fix this: the first large-scale scene graph benchmark designed specifically for sensitive content moderation:
- 13,999 annotated frames from 157 movies
- Visual Genome-style scene graphs with bounding boxes, attributes, and predicates (see the example record after this list)
- Affective state attributes (pain, fear, aggression, distress), so the model captures not just what is in the frame but what it means
- 16 safety tags across 5 categories, the broadest taxonomy of any dataset of this kind
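To make that concrete, here is a minimal sketch of what one annotated frame could look like. The field names, values, and layout are illustrative assumptions, not SenBen's actual schema:

```python
# Illustrative sketch of one SenBen-style annotation record.
# All field names and values are assumptions, not the dataset's real schema.
frame = {
    "movie_id": "m_0042",
    "frame_id": "f_001337",
    "objects": [
        {
            "id": 0,
            "label": "person",
            "bbox": [120, 64, 310, 420],         # [x1, y1, x2, y2] in pixels
            "attributes": ["kneeling", "fear"],  # includes affective states
        },
        {
            "id": 1,
            "label": "person",
            "bbox": [330, 50, 560, 440],
            "attributes": ["standing", "aggression"],
        },
    ],
    "relations": [
        # Visual Genome-style (subject, predicate, object) triples
        {"subject": 1, "predicate": "threatening", "object": 0},
    ],
    "safety_tags": ["violence", "intimidation"],  # drawn from the 16 tags / 5 categories
}
```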
A small model that beats much bigger ones:
We distilled a frontier VLM into a compact 241M-parameter student built on Florence-2.
On grounded scene graph metrics, the 241M student beats every evaluated VLM except Gemini, as well as every commercial safety API. It also wins on object detection and captioning across the entire model zoo. It runs at 733 ms per frame in 1.2 GB of VRAM, 7.6 times faster than the next-best local VLM, with zero per-frame cost. The whole benchmark, from dataset creation through all baseline evaluations, is reproducible for under $350.
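For scale, this is roughly what running a Florence-2-based model looks like with Hugging Face transformers. The SenBen student checkpoint isn't named here, so the public microsoft/Florence-2-base weights and its built-in `<OD>` (object detection) task stand in for the distilled model and its scene-graph prompt:

```python
# Minimal Florence-2 inference sketch, timing a single frame.
# "microsoft/Florence-2-base" stands in for the distilled SenBen student,
# and <OD> stands in for its actual scene-graph task prompt.
import time
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("frame.jpg")
inputs = processor(text="<OD>", images=image, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    generated = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
    )
elapsed = time.perf_counter() - start

raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task="<OD>", image_size=image.size)
print(f"{elapsed * 1000:.0f} ms per frame:", parsed)  # boxes + labels
```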
deepseek-ai/DeepSeek-OCR is out! 🔥 My take ⤵️
> pretty insane that it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-tokens-to-performance ratio
> covers ~100 languages
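On the "CLIP and SAM features concatenated" point, the general idea looks like the sketch below. This is the conceptual pattern, not DeepSeek-OCR's actual implementation, and the dimensions are made up:

```python
# Conceptual sketch of a dual vision encoder, not DeepSeek-OCR's code.
# The point: concatenate tokens from a semantic encoder (CLIP-style) and a
# segmentation encoder (SAM-style) channel-wise before projecting into the
# language model's embedding space.
import torch
import torch.nn as nn

class DualVisionProjector(nn.Module):
    def __init__(self, clip_dim: int = 1024, sam_dim: int = 256, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(clip_dim + sam_dim, lm_dim)

    def forward(self, clip_tokens: torch.Tensor, sam_tokens: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, num_tokens, dim). CLIP carries semantics,
        # SAM carries fine spatial structure; concatenation keeps both.
        fused = torch.cat([clip_tokens, sam_tokens], dim=-1)
        return self.proj(fused)

# Usage with dummy features
out = DualVisionProjector()(torch.randn(1, 256, 1024), torch.randn(1, 256, 256))
print(out.shape)  # torch.Size([1, 256, 2048])
```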
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face 🔥
> not just a document converter: it can also do document question answering and understands multiple languages 🤯
> best part: released under an Apache 2.0 license 👏 use it in your commercial projects!
> it supports transformers, vLLM, and MLX from the get-go! 🤗 (see the usage sketch below)
> built on SigLIP2 & granite-165M
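Here is a rough transformers usage sketch, assuming the model follows the standard image-text-to-text interface. The repo id and the prompt wording are assumptions based on the announcement, so check the model card before relying on them:

```python
# Hedged sketch of running granite-docling via transformers' generic
# image-text-to-text interface; repo id and prompt text are assumptions.
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ibm-granite/granite-docling-258M"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[Image.open("page.png")], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```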
- You can train a model in a language it has never seen by starting from the pretrained (PT) model; there's no need for large datasets.
- With the PT model, you can easily replicate the voice of any character you want. Just 1k samples are enough (see the staging sketch after this list).
- You can add emotion support with a small dataset.
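The post doesn't name the toolkit, but for most TTS fine-tuning stacks those ~1k voice samples boil down to (audio, transcript) pairs. Here is a hedged, LJSpeech-style staging sketch; the paths and layout are illustrative, not tied to any specific trainer:

```python
# Hedged sketch: stage ~1k voice-cloning samples as LJSpeech-style metadata
# ("file_id|transcript" per line), a format many TTS fine-tuning stacks accept.
# Directory layout and filenames are illustrative assumptions.
import csv
from pathlib import Path

wav_dir = Path("my_voice/wavs")  # one .wav per clip, matching .txt transcript
with open("my_voice/metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for wav in sorted(wav_dir.glob("*.wav")):
        transcript = wav.with_suffix(".txt").read_text(encoding="utf-8").strip()
        writer.writerow([wav.stem, transcript])
```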