	---
	tags:
	- clip
	- llm-jp-clip
	- japanese-clip
	library_name: open_clip
	pipeline_tag: zero-shot-image-classification
	license:
	- apache-2.0
	datasets:
	- llm-jp/relaion2B-en-research-safe-japanese-translation
	language:
	- ja
	---
	# Model Card for llm-jp-clip-vit-base-patch16

	# Model Details

	Japanese CLIP model trained with [OpenCLIP](https://github.com/mlfoundations/open_clip) on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation), a Japanese translation of [relaion2B-en-research-safe](https://huggingface.co/datasets/laion/relaion2B-en-research-safe), the English subset of ReLAION-5B. The translation was produced with [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it).

	The model has 248M parameters in total.

	# How to Use

	## Installation

	```bash
	$ pip install open_clip_torch
	```

	## Zero-shot Image Classification
	```python
	import open_clip
	import requests
	import torch
	from PIL import Image

	# Load the model, its image preprocessing transform, and the tokenizer from the Hugging Face Hub.
	model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
	tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

	# Download a sample image and turn it into a batch of size 1.
	url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
	image = Image.open(requests.get(url, stream=True).raw)
	image = preprocess(image).unsqueeze(0)

	# Candidate labels: "cat", "dog", "bird".
	text = tokenizer(["猫", "犬", "鳥"])

	with torch.no_grad(), torch.cuda.amp.autocast():
	    image_features = model.encode_image(image)
	    text_features = model.encode_text(text)
	    # L2-normalize so that the dot product below is a cosine similarity.
	    image_features /= image_features.norm(dim=-1, keepdim=True)
	    text_features /= text_features.norm(dim=-1, keepdim=True)

	    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

	print("Label probs:", text_probs)
	# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
	```

	References:
	- [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/en/open_clip), Hugging Face Docs
	- OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
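
The scoring step in the zero-shot example (normalize, take dot products, scale by 100, apply softmax) can be illustrated without downloading the model. The following is a minimal pure-Python sketch, not the library's implementation: the three-dimensional feature vectors are made-up placeholders, and the helper names `l2_normalize`, `softmax`, and `label_probs` are hypothetical. Real features come from `model.encode_image` and `model.encode_text`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, mirroring `features / features.norm()`."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Hypothetical features, chosen so that the image is closest to the first label.
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[1.0, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]
labels = ["猫", "犬", "鳥"]

probs = label_probs(image_vec, text_vecs)
best = labels[probs.index(max(probs))]
print(dict(zip(labels, probs)), best)
```

Because the cosine similarities are multiplied by 100 before the softmax, even modest differences in similarity produce a sharply peaked distribution, which is why the example output above is close to one-hot.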


	# Training Details

	## Model Architecture

	- Text Encoder: RoBERTa base with llm-jp-tokenizer
	- Image Encoder: ViT-B/16

	## Training Data

	This model is trained on [relaion2B-en-research-safe-japanese-translation](https://huggingface.co/datasets/llm-jp/relaion2B-en-research-safe-japanese-translation).
	Because only about 70% of the images could be downloaded, the resulting training set contains 1.45 billion samples. We trained on it for 9 epochs, so the model saw about 13 billion samples in total.
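
The figures in this section can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted above. The implied source size is an estimate derived from the approximate 70% download rate, not a figure stated in this card.

```python
# Figures quoted in the Training Data section.
downloaded_samples = 1.45e9   # image-text pairs that were successfully downloaded
download_success_rate = 0.70  # approximate fraction of images that could be fetched
epochs = 9

# Total samples seen during training: dataset size times number of epochs.
samples_seen = downloaded_samples * epochs

# Implied size of the source dataset before download failures (an estimate only).
implied_source_size = downloaded_samples / download_success_rate

print(f"samples seen: {samples_seen / 1e9:.2f}B")
print(f"implied source size: {implied_source_size / 1e9:.2f}B")
```

The product comes to about 13.05 billion samples, which the card rounds to 13 billion.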

	# Evaluation

	Evaluation Code: https://github.com/llm-jp/clip-eval

	Table: Performance of each model in zero-shot image classification and image-text retrieval tasks. Bold indicates first place, and _underline_ indicates second place.


	| Model | Params (M) | ImageNet | Recruit | CIFAR10 | CIFAR100 | Food101 | Caltech101 | XM3600 I → T | XM3600 T → I | Avg. |
	|-----------------------------|-------------|----------|---------|---------|----------|---------|------------|-------------|-------------|------|
	| Japanese CLIP | | | | | | | | | | |
	| [Rinna ViT-B/16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 196 | 50.6 | 39.9 | 90.7 | 64.0 | 53.2 | 84.6 | 53.8 | 54.0 | 61.4 |
	| [Rinna ViT-B/16 cloob](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 196 | 54.6 | 41.6 | 88.2 | 60.3 | 57.2 | 80.2 | 53.4 | 53.4 | 61.1 |
	| [LY ViT-B/16](https://huggingface.co/line-corporation/clip-japanese-base) | 196 | 52.0 | **83.8** | 96.3 | 76.7 | 73.9 | **88.4** | **76.9** | **78.0** | **78.3** |
	| [llm-jp-ViT-B/16](https://huggingface.co/llm-jp/llm-jp-clip-vit-base-patch16) | 248 | 54.2 | 59.4 | 91.8 | 69.2 | _82.2_ | 85.6 | 73.6 | 72.7 | 73.6 |
	| [StabilityAI ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414 | **62.4** | 70.5 | _97.6_ | **84.1** | 74.0 | 86.7 | 67.3 | 66.0 | 76.1 |
	| [llm-jp-ViT-L/14](https://huggingface.co/llm-jp/llm-jp-clip-vit-large-patch14) | 467 | _59.5_ | 62.9 | 96.4 | 77.0 | **88.2** | _87.8_ | 74.1 | _74.1_ | _77.5_ |
	| Multilingual CLIP | | | | | | | | | | |
	| [SigLIP B/16-256 multi](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 370 | 51.9 | 71.2 | 92.4 | 65.8 | 78.6 | 85.6 | 45.9 | 43.0 | 66.8 |
	| [jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2) | 865 | 35.8 | 48.1 | 95.1 | 58.3 | 52.0 | 69.4 | 67.3 | 66.4 | 61.6 |
	| [LAION ViT-H/14 multi](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193 | 53.0 | _74.5_ | **97.9** | _78.4_ | 74.3 | 85.1 | _75.0_ | 72.0 | 76.3 |
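
The Avg. column appears to be the unweighted mean of the eight task scores in each row. The card does not state this explicitly, so treat it as an assumption. A short sketch that recomputes it for the two llm-jp rows, with the scores copied from the table above:

```python
from statistics import mean

# Per-task scores copied from the table, in column order:
# ImageNet, Recruit, CIFAR10, CIFAR100, Food101, Caltech101, XM3600 I→T, XM3600 T→I.
scores = {
    "llm-jp-ViT-B/16": [54.2, 59.4, 91.8, 69.2, 82.2, 85.6, 73.6, 72.7],
    "llm-jp-ViT-L/14": [59.5, 62.9, 96.4, 77.0, 88.2, 87.8, 74.1, 74.1],
}

# Unweighted mean over the eight columns, rounded to one decimal like the table.
averages = {name: round(mean(values), 1) for name, values in scores.items()}
print(averages)
```

Both recomputed values match the Avg. entries reported for these rows (73.6 and 77.5).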


	# LICENSE
	[The Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)


	Please refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms), as the training data was translated using gemma-2-9b-it. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not fall under the category of a "model in order to cause that model to perform similarly to Gemma." We have therefore concluded that our model does not need to inherit the Gemma license.

	# Citation

	BibTeX:
	```
	@inproceedings{sugiura-etal-2025-developing,
	    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
	    author = "Sugiura, Issa and
	      Kurita, Shuhei and
	      Oda, Yusuke and
	      Kawahara, Daisuke and
	      Okazaki, Naoaki",
	    editor = "Ebrahimi, Abteen and
	      Haider, Samar and
	      Liu, Emmy and
	      Haider, Sammar and
	      Leonor Pacheco, Maria and
	      Wein, Shira",
	    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
	    month = apr,
	    year = "2025",
	    address = "Albuquerque, USA",
	    publisher = "Association for Computational Linguistics",
	    url = "https://aclanthology.org/2025.naacl-srw.15/",
	    pages = "162--170",
	    ISBN = "979-8-89176-192-6",
	    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
	}

	```