
	---
	license: mit
	language:
	- en
	pipeline_tag: image-to-text
	tags:
	- image tagging, image captioning
	---

	# Recognize Anything & Tag2Text

	Model card for <a href="https://recognize-anything.github.io/">Recognize Anything: A Strong Image Tagging Model</a> and <a href="https://tag2text.github.io/">Tag2Text: Guiding Vision-Language Model via Image Tagging</a>.

	Recognition and localization are two foundational computer vision tasks.
	- The Segment Anything Model (SAM) excels at localization but falls short on recognition tasks.
	- The Recognize Anything Model (RAM) and Tag2Text exhibit exceptional recognition abilities, in terms of both accuracy and scope.

	| ![RAM.jpg](https://github.com/xinyu1205/Tag2Text/raw/main/images/localization_and_recognition.jpg) |
	|:--:|
	| <b>Figure from the official recognize-anything repository. Image source: https://recognize-anything.github.io/</b> |

	## TL;DR

	The authors of the [paper](https://arxiv.org/abs/2306.03514) write in the abstract:

	> We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. By leveraging large-scale image-text pairs for training instead of manual annotations, RAM introduces a new paradigm for image tagging. We evaluate the tagging capability of RAM on numerous benchmarks and observe an impressive zero-shot performance, which significantly outperforms CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and exhibits a competitive performance compared with the Google tagging API.


	## BibTeX and citation info

	```
	@article{zhang2023recognize,
	title={Recognize Anything: A Strong Image Tagging Model},
	author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
	journal={arXiv preprint arXiv:2306.03514},
	year={2023}
	}

	@article{huang2023tag2text,
	title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
	author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
	journal={arXiv preprint arXiv:2303.05657},
	year={2023}
	}
	```