---
base_model: qwen2.5-vl
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- multimodal
- reasoning
- fine-tuned
- qwen
library_name: transformers
---
# DRIFT

DRIFT is a fine-tuned version of Qwen2.5-VL with enhanced reasoning capabilities, optimized for multimodal reasoning tasks.
The model is presented in the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).
Code and further details are available in the GitHub repository: https://github.com/WikiChao/DRIFT
## Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model_id = "ChaoHuangCS/DRIFT-VL-7B"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example usage with an image
image = Image.open("your_image.jpg")
prompt = "Analyze this image and explain your reasoning step by step."

# Format the input as a chat message with interleaved image and text
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply the chat template and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the newly generated answer is decoded
generated = outputs[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
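Note that the example uses the `qwen_vl_utils` helper package released alongside Qwen2-VL/Qwen2.5-VL (`pip install qwen-vl-utils`), which extracts the image and video inputs from the chat messages before they are passed to the processor.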
## Fine-tuning Details

This model was fine-tuned using:
- **Base Model**: Qwen2.5-VL
- **Merged Model**: DeepSeek-R1
- **Training Method**: Custom reasoning-focused fine-tuning (a rough sketch of the merging idea follows this list)
- **Dataset**: Multimodal reasoning datasets
- **Architecture**: Preserves the original Qwen2.5-VL architecture
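The exact injection procedure is defined in the paper and the GitHub repository. As a very rough illustration of the general weight-merging idea, the sketch below assumes the "reasoning direction" is a parameter-space delta between a reasoning-tuned text LLM and its base checkpoint, added to the matching language-model weights of the MLLM. The function name, the `alpha` scale, and the name/shape matching rule are illustrative assumptions, not the paper's actual recipe.

```python
import torch

def inject_reasoning_direction(mllm_sd, base_lm_sd, reasoning_lm_sd, alpha=0.5):
    """Hypothetical sketch: add alpha * (reasoning - base) to every MLLM
    parameter whose name and shape match the text-only checkpoints.
    `alpha` and the matching rule are illustrative, not from the paper."""
    merged = {}
    for name, weight in mllm_sd.items():
        if (
            name in base_lm_sd
            and name in reasoning_lm_sd
            and base_lm_sd[name].shape == weight.shape
        ):
            # The "reasoning direction" for this parameter
            direction = reasoning_lm_sd[name] - base_lm_sd[name]
            merged[name] = weight + alpha * direction.to(weight.dtype)
        else:
            # Vision tower and any non-matching weights stay untouched
            merged[name] = weight
    return merged
```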
## Performance

The model has been optimized for:
- Enhanced reasoning capabilities
- Better multimodal understanding
- Improved step-by-step thinking processes
- More accurate visual question answering

## Citation

If you use this model, please cite the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).
## License

This model is released under the MIT license.