---
base_model:
- OpenGVLab/InternVL3_5-1B-Instruct
language:
- en
license: mit
metrics:
- accuracy
tags:
- visual-reasoning
- fine-grained-vqa
- fine-grained-recognition
pipeline_tag: image-text-to-text
library_name: transformers
---

# Model Card for TWIN-InternVL3_5-1B
This repository contains the InternVL3.5-1B model post-trained on the TWIN dataset, as introduced in the paper [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592).

TWIN is a large-scale dataset of 561,000 image-pair queries designed to strengthen the perceptual abilities of Vision-Language Models (VLMs). Each query asks the model to decide whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. Fine-tuning on TWIN yields significant gains in fine-grained recognition across domains such as art, animals, plants, and landmarks.
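The checkpoint can be loaded through the standard InternVL chat interface in `transformers`. Below is a minimal sketch of a same-or-not query over two images; the repository id (`glab-caltech/TWIN-InternVL3_5-1B`), the single-tile preprocessing, and the prompt wording are illustrative assumptions, so please refer to the code repository for the exact inference setup.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

# Assumed repo id -- check the Hub for the exact name of this checkpoint.
MODEL_ID = "glab-caltech/TWIN-InternVL3_5-1B"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Minimal single-tile preprocessing (InternVL uses 448x448 tiles with ImageNet
# normalization); the official dynamic-tiling helper will give better results.
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

def load_image(path):
    # Returns a (1, 3, 448, 448) tensor for a single image tile.
    return transform(Image.open(path)).unsqueeze(0)

pixels_a = load_image("image_a.jpg").to(torch.bfloat16).cuda()
pixels_b = load_image("image_b.jpg").to(torch.bfloat16).cuda()
pixel_values = torch.cat([pixels_a, pixels_b], dim=0)
num_patches_list = [pixels_a.size(0), pixels_b.size(0)]

# Illustrative TWIN-style query: do the two images show the same object?
question = (
    "Image-1: <image>\nImage-2: <image>\n"
    "Do these two images depict the same object? Answer yes or no."
)
response = model.chat(
    tokenizer,
    pixel_values,
    question,
    dict(max_new_tokens=64, do_sample=False),
    num_patches_list=num_patches_list,
)
print(response)
```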
## Resources
- Project Page: https://glab-caltech.github.io/twin/
- Paper: [Same or Not? Enhancing Visual Perception in Vision-Language Models](https://arxiv.org/abs/2512.23592)
- Code Repository: https://github.com/damianomarsili/TWIN
- Dataset: glab-caltech/TWIN
- Benchmark Suite: glab-caltech/FGVQA
## Citation
If you use TWIN in your research, please consider citing the paper:
```bibtex
@misc{marsili2025notenhancingvisualperception,
    title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
    author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
    year={2025},
    eprint={2512.23592},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.23592},
}
```