---
license: apache-2.0
datasets:
- Fancy-MLLM/R1-Onevision
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

## R1-Onevision

[\[📂 GitHub\]](https://github.com/Fancy-MLLM/R1-Onevision) [\[📝 Report\]](https://yangyi-vai.notion.site/r1-onevision?pvs=4)
[\[🤗 HF Dataset\]](https://huggingface.co/datasets/Fancy-MLLM/R1-onevision) [\[🤗 Reasoning Benchmark\]](https://huggingface.co/datasets/Fancy-MLLM/R1-OneVision-Bench) [\[🤗 HF Demo\]](https://huggingface.co/spaces/Fancy-MLLM/R1-OneVision)

## Model Overview

This is a multimodal large language model fine-tuned from Qwen2.5-VL on the **R1-Onevision** dataset. The model enhances vision-language understanding and reasoning, making it suitable for tasks such as visual reasoning and image understanding. With its robust multimodal reasoning ability, R1-Onevision serves as a powerful AI assistant capable of addressing a wide range of problem-solving challenges across different domains.

## Training Configuration and Curve
- Framework: Training uses the open-source **LLaMA-Factory** library, with **Qwen2.5-VL-Instruct** as the base model. The base model comes in three sizes: 3B, 7B, and 32B.
- Parameters: For efficiency, we cap image inputs at a resolution of 518 to save GPU memory. Training follows full-model SFT (Supervised Fine-Tuning) with a learning rate of 1e-5 for one epoch; with a per-device batch size of 1 and 16 gradient-accumulation steps, the effective batch size is 16 per device.

The training configuration is as follows:
```yaml
image_resolution: 518
cutoff_len: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-5

num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
flash_attn: fa2
```
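
Since training capped images at 518 pixels, you can apply a similar bound at inference time to save memory. Below is a minimal sketch using the Qwen2.5-VL processor's `min_pixels`/`max_pixels` options; the exact bounds are our assumptions, not part of the training recipe.

```python
from transformers import AutoProcessor

MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"

# Bound the visual token budget: images are resized so their pixel count
# stays within [min_pixels, max_pixels] before being split into patches.
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=256 * 28 * 28,  # assumed lower bound (256 visual tokens)
    max_pixels=518 * 518,      # assumed upper bound, mirroring the 518px training resolution
)
```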

Training loss curve:
<img src="https://cdn-uploads.huggingface.co/production/uploads/65af78bb3e82498d4c65ed2a/8BNyo-v68aFvab2kXxtt1.png"/>

## Usage

You can load the model with the Hugging Face `transformers` library. The example also uses the `qwen-vl-utils` helper package (install with `pip install qwen-vl-utils`):

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
import torch
from qwen_vl_utils import process_vision_info

# Load the processor and the model in bfloat16 on GPU
MODEL_ID = "Fancy-MLLM/R1-Onevision-7B"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda").eval()

# Build a single-turn multimodal message; replace <your image path> with a local path or URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<your image path>"},
            {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
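
Because R1-Onevision emits long chain-of-thought traces, it can be convenient to stream tokens as they are generated rather than wait for the full completion. A minimal sketch with `transformers`' `TextStreamer`, reusing `model`, `processor`, and `inputs` from the example above (the streamer settings are our choice, not part of the official example):

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they arrive, omitting the echoed prompt
# and special tokens so only the reasoning trace and final answer appear.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=4096)
```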

## Ongoing Work
1. **Rule-Based Reinforcement Learning (RL)**

   We are actively exploring the integration of rule-based systems into reinforcement learning to enhance the model's decision-making. This approach combines domain-specific rules with the learning process, aiming to improve the efficiency and safety of learning in complex environments.

2. **Training with General Data and Multimodal Reasoning CoT**

   We are expanding the training data by incorporating more general data alongside multimodal reasoning Chain-of-Thought (CoT) data, enabling the model to draw on a broader range of information and handle diverse reasoning tasks across domains.

3. **Incorporating Chinese Multimodal Reasoning CoT Data**

   We are also integrating Chinese multimodal reasoning CoT data into training. By adding this language-specific data, we aim to improve the model's reasoning in Chinese and expand its multilingual, multimodal reasoning proficiency.

4. **Release of the 3B Model**

   We are working on the release of a smaller, more efficient 3B model designed to balance performance and resource efficiency. It aims to deliver strong multimodal reasoning while remaining accessible in environments with limited computational resources, offering a compact alternative to the current 7B model.

## Institution
- Zhejiang University

## Model Contact
- xiaoxuanhe@zju.edu.cn
- panhongkun@zju.edu.cn
- yang-yi@zju.edu.cn