	---
	license: apache-2.0
	language:
	- ja
	base_model:
	- llm-jp/llm-jp-3-7.2b-instruct3
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- llava
	---

	## Model Details

	### Model Description

	This repository provides Asagi-8B, a large-scale Japanese Vision & Language Model (VLM).
	Asagi-8B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.

	A significant portion of the training data is synthesized using models such as the Japanese large language model ([CALM3-22B-Chat](https://huggingface.co/cyberagent/calm3-22b-chat)) and the English Vision & Language Model ([Phi3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct)).

Importantly, we do not synthesize training data with LLMs whose license terms restrict the use of their outputs (e.g., GPT-4).

	Note: Unlike the other models in the Asagi series, we skipped stage-1 training for Asagi-8B.


| Model components | Model / Architecture | Parameters |
|:---|:---|---:|
| Vision encoder | [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) | 428M |
| Projector | 2-layer MLP | 43M |
| LLM | [llm-jp-3-7.2b-instruct](https://huggingface.co/llm-jp/llm-jp-3-7.2b-instruct) | 7.2B |
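As a rough sanity check on the "8B" in the model name, the component sizes in the table can simply be summed. The sketch below does only that arithmetic. The `mlp_param_count` helper and the layer widths passed to it are illustrative assumptions (this card does not state the projector's dimensions); they are included only to make the "2-layer MLP, 43M" row more concrete.

```python
# Parameter counts as listed in the component table (rounded figures).
COMPONENT_PARAMS = {
    "vision_encoder": 428_000_000,
    "projector": 43_000_000,
    "llm": 7_200_000_000,
}


def total_params(components):
    """Sum the per-component parameter counts."""
    return sum(components.values())


def mlp_param_count(widths):
    """Weights plus biases of a fully connected stack with the given layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))


print(f"total: {total_params(COMPONENT_PARAMS) / 1e9:.2f}B")

# ASSUMPTION: 1152 -> 8192 -> 4096 is a hypothetical width choice,
# not a value read from the model config.
print(f"hypothetical projector: {mlp_param_count([1152, 8192, 4096]) / 1e6:.1f}M")
```

The listed components add up to roughly 7.67B parameters. The hypothetical widths are just one combination that lands near the 43M figure; the actual layer sizes should be read from the released checkpoint.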


	## Usage

	### Requirements

	```txt
	transformers==4.45.1
	accelerate==0.34.2
	torch==2.4.0
	torchvision==0.19.0
	```
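Before loading the model, it can help to confirm that the installed packages match these pins. The following is a minimal sketch using only the standard library; `installed_versions` and `find_mismatches` are hypothetical helper names, and ignoring local build suffixes such as `+cu121` is an assumption about how strictly the pins should be compared.

```python
from importlib import metadata

# Pins copied from the requirements list above.
REQUIRED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(required, installed):
    """List (package, required, installed) for every pin that is not met."""
    mismatches = []
    for name, wanted in required.items():
        have = installed.get(name)
        # ASSUMPTION: a local build suffix such as "+cu121" is ignored.
        base = have.split("+")[0] if have else None
        if base != wanted:
            mismatches.append((name, wanted, have))
    return mismatches


for name, wanted, have in find_mismatches(REQUIRED, installed_versions(REQUIRED)):
    print(f"{name}: expected {wanted}, found {have}")
```

Nothing is printed when every pin is satisfied.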

	### How to use
```python
import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-8B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

prompt = ("以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。\n\n"
          "### 指示:\n<image>\nこの画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。\n\n### 応答:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# replace the text inputs with the plain tokenizer output for the prompt
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# cast float32 tensors to the model dtype (bfloat16 here)
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# move everything to the model device, dropping token_type_ids
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text: この写真は東京・銀座の夜景を撮影したものです。銀座は東京を代表する繁華街の一つで、多くの高層ビルが立ち並び、ネオンサインや街灯の光で輝いています。特に目を引くのは、中央にそびえ立つ「銀座クリスタルビル」で、その特徴的な時計塔が印象的です。ビル群の間を行き交う車や人々の姿は、都会の活気を感じさせます。

```
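The prompt in the example follows a fixed instruction/response template with a single `<image>` slot, and the decoded output is cleaned by removing the echoed prompt. A small sketch of both steps as reusable functions follows; `build_prompt` and `strip_prompt` are hypothetical helper names, not part of the released code, and the template text is copied from the example above.

```python
# Header line copied from the prompt string in the usage example.
SYSTEM_LINE = "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"


def build_prompt(instruction):
    """Wrap an instruction in the instruction/response template with one image slot."""
    return f"{SYSTEM_LINE}\n\n### 指示:\n<image>\n{instruction}\n\n### 応答:\n"


def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from decoded output, mirroring the usage example."""
    echoed = prompt.replace("<image>", " ")
    return generated_text.replace(echoed, "")


prompt = build_prompt("この写真の内容について詳しく教えてください。")
print(prompt)
```

The string returned by `build_prompt` can be passed as `text=` to the processor in place of the hard-coded prompt.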

	### Example
	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg" width="350">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真はどこで撮影されたものか教えてください。また、画像の内容についても詳しく説明してください。 | この写真は東京・銀座の夜景を撮影したものです。銀座は東京を代表する繁華街の一つで、多くの高層ビルが立ち並び、ネオンサインや街灯の光で輝いています。特に目を引くのは、中央にそびえ立つ「銀座クリスタルビル」で、その特徴的な時計塔が印象的です。ビル群の間を行き交う車や人々の姿は、都会の活気を感じさせます。 |

	Note: This image was actually taken in Shibuya, not Ginza.


	### More Examples
	<details>
	<summary>Click to expand</summary>


	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/bridge.jpg" width="300">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の指示に詳細かつ具体的に答えてください。この写真の内容について詳しく教えてください。 | この写真は、東京のお台場地区にあるレインボーブリッジを撮影したものです。レインボーブリッジは、東京都港区と江東区を結ぶ重要な交通路であり、その特徴的な白い塔とケーブルが印象的です。背景には高層ビル群が立ち並び、東京の都市景観が広がっています。空は部分的に雲がかかっており、朝か夕方の時間帯を示唆しています。 |


	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/tower.jpg" width="300">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真について評価してください。 | この写真は、東京スカイツリーの壮大な景色を捉えており、赤と白の構造が空に向かってそびえ立っています。低い角度から撮影されているため、タワーがさらに威圧的に見えます。背景には白い雲が点在する澄んだ青い空が広がり、前景には緑豊かな木々と街灯が配置され、自然の要素が都市の風景に溶け込んでいます。全体として、人工物と自然が見事に調和した美しい写真です。 |


	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shrine.jpg" width="300">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の質問に詳細かつ具体的に答えてください。この写真の神社について、細かいところまで詳しく説明してください。 | この写真に写っている神社は、伝統的な日本の建築様式を反映しています。朱色の鳥居と白い壁のコントラストが美しく、神聖な雰囲気を醸し出しています。鳥居は神社の入り口に位置し、神域と俗界を分ける役割を果たしています。神社の屋根は瓦葺きで、緑色の瓦が使われており、これは日本の伝統的な建築様式の一つです。また、神社の周りには近代的な建物が建ち並んでおり、伝統と現代が共存する日本の風景を象徴しています。 |

	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/garden.jpg" width="300">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の指示に詳細かつ具体的に答えてください。これは日本庭園の中でも、どのような形式に分類される庭園ですか？また、その理由は何ですか？ | この庭園は「池泉回遊式庭園」と呼ばれる形式の庭園です。池泉回遊式庭園は、池泉や水景物のある庭園を指し、水生庭園や水のある庭園とも呼ばれます。この形式の庭園は、自然の美しさを最大限に引き出すために設計されており、池や水景物が庭園の中心的な要素となっています。 |

	<img src="https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/slope.jpg" width="200">

| Prompt | Output |
|:---|:---|
| この画像を見て、次の質問に詳細に答えてください。この画像の場所を舞台とした小説のあらすじを書いてください。 | この画像の場所を舞台とした小説の題名は『石段の秘密』。物語は、都会の喧騒を逃れた主人公が、古い石造りの階段を登るシーンから始まります。階段の周りには緑豊かな植物が生い茂り、歴史を感じさせる建物が立ち並んでいます。主人公は階段を登りながら、過去と未来を行き来する不思議な体験をし、最終的に階段の下に隠された秘密を明らかにします。 |

	</details>


	## Training Details

	### Training Data
Note: Asagi-8B skipped stage-1 training, so only the Stage2 column applies to this model.

| Dataset | Preprocess | Stage1 | Stage2 | Size |
|:--------------------------|:-------------:|:-------:|:------:|:---------:|
| ROIS (Ours) | Synthesized | ✓ | ✓ | 8.4M |
| Japanese image text pairs | Synthesized | ✓ | ✓ | 4.4M |
| Wikipedia | Synthesized | ✓ | ✓ | 2.5M |
| Open Images | Translated | ✓ | ✓ | 680K |
| DCI | Translated | ✓ | ✓ | 7K |
| CommonCatalog CC-BY | Translated | ✓ | ✓ | 3.5M |
| LLaVA-Pretrain-JA | | ✓ | ✓ | 550K |
| STAIR Captions | | ✓ | ✓ | 410K |
| Flickr-JP | | ✓ | ✓ | 160K |
| YJ Captions | | ✓ | ✓ | 130K |
| Japanese Pascal | | ✓ | ✓ | 5K |
| ArtBench | Synthesized | | ✓ | 100K |
| GQA | Translated | | ✓ | 1.9M |
| VQA v2 | Translated | | ✓ | 880K |
| A-OKVQA | Translated | | ✓ | 34K |
| OK-VQA | Translated | | ✓ | 18K |
| Japanese Visual Genome | Translated | | ✓ | 1.6M |
| PangeaInstruct | | | ✓ | 93K |

	Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project.
	The dataset consists of image and raw text pairs, which are used to synthesize the training data.
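To get a feel for the overall scale, the sizes in the training data table can be totalled per stage. The sketch below is plain arithmetic over the listed figures; `parse_size` is a hypothetical helper, and because the table values are rounded, the totals are approximate.

```python
# (dataset, used in stage 1, size) copied from the training data table.
# Every listed dataset is used in stage 2.
DATASETS = [
    ("ROIS (Ours)", True, "8.4M"),
    ("Japanese image text pairs", True, "4.4M"),
    ("Wikipedia", True, "2.5M"),
    ("Open Images", True, "680K"),
    ("DCI", True, "7K"),
    ("CommonCatalog CC-BY", True, "3.5M"),
    ("LLaVA-Pretrain-JA", True, "550K"),
    ("STAIR Captions", True, "410K"),
    ("Flickr-JP", True, "160K"),
    ("YJ Captions", True, "130K"),
    ("Japanese Pascal", True, "5K"),
    ("ArtBench", False, "100K"),
    ("GQA", False, "1.9M"),
    ("VQA v2", False, "880K"),
    ("A-OKVQA", False, "34K"),
    ("OK-VQA", False, "18K"),
    ("Japanese Visual Genome", False, "1.6M"),
    ("PangeaInstruct", False, "93K"),
]


def parse_size(text):
    """Convert strings like '680K' or '8.4M' to an integer count."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    return round(float(text[:-1]) * multipliers[text[-1]])


stage1_total = sum(parse_size(size) for _, in_stage1, size in DATASETS if in_stage1)
stage2_total = sum(parse_size(size) for _, _, size in DATASETS)

print(f"Stage1 column: ~{stage1_total / 1e6:.1f}M")
print(f"Stage2 column: ~{stage2_total / 1e6:.1f}M")
```

The Stage2 column sums to about 25.4M and the Stage1 column to about 20.7M.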

	## Evaluation

	We evaluated our model using Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500.
We used the eval-mm library for this evaluation.

	Here, models with "†" are not trained with GPT-generated data.
	Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

| Model | LM Size | Heron-Bench (LLM (%)) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | JA-VLM-Bench-In-the-Wild (LLM (/5.0)) | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM (/5.0)) |
|:-------------------------------|:--------:|:--------------------:|:---------------------------------:|:-----------------------------------:|:---------------------:|:-----------------------:|
| Japanese InstructBLIP Alpha† | 7B | 14.0 | 20.8 | 2.42 | - | - |
| Japanese Stable VLM† | 7B | 24.2 | 23.3 | 2.47 | - | - |
| LLaVA-CALM2-SigLIP† | 7B | 43.3 | 47.2 | 3.15 | 17.4 | 3.21 |
| Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 41.4 | 2.92 | 23.5 | 2.96 |
| VILA-jp | 13B | 57.2 | 52.3 | 3.69 | 16.2 | 3.62 |
| Asagi-2B† | 1.8B | 44.7 | 48.8 | 3.26 | 53.7 | 3.69 |
| Asagi-4B† | 3.7B | 49.3 | 49.6 | 3.38 | 55.6 | 3.78 |
| Asagi-8B† | 7.2B | 54.7 | 49.4 | <u>3.45</u> | 56.43 | <u>3.84</u> |
| Asagi-14B† | 13B | <u>55.8</u> | <u>50.8</u> | 3.44 | <u>56.8</u> | <u>3.84</u> |
| GPT-4o | - | 87.6 | 37.6 | 3.85 | 12.1 | 3.58 |
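For readers unfamiliar with the ROUGE-L columns, the metric is based on the longest common subsequence (LCS) between a reference answer and a model answer. The following is a minimal sketch of the usual LCS-based F1 formulation, not the eval-mm implementation; splitting Japanese text into characters with `list(...)` is an assumption made only for this illustration, and a real evaluation would use a proper tokenizer. The sketch returns a value in [0, 1], whereas the table figures appear to be on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two token lists, in [0, 1]."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# ASSUMPTION: character-level tokens, used here only for illustration.
reference = list("赤い鳥居")
candidate = list("赤い大きな鳥居")
print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")
```

Because ROUGE-L rewards token overlap with the reference, short answers that match the reference wording score well, which is one reason the ROUGE-L and LLM-judged columns can rank models differently.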



	## Risks and Limitations

	The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.


	## Model Card Authors

	Kohei Uehara