---
base_model: qwen2.5-vl
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- multimodal
- reasoning
- fine-tuned
- qwen
library_name: transformers
---
# DRIFT
DRIFT-VL-7B is a version of Qwen2.5-VL fine-tuned for stronger multimodal reasoning.
The model is presented in the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).
The code and further details are available in the GitHub repository: https://github.com/WikiChao/DRIFT
## Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils
from PIL import Image
import torch

model_id = "ChaoHuangCS/DRIFT-VL-7B"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example usage with an image
image = Image.open("your_image.jpg")
prompt = "Analyze this image and explain your reasoning step by step."

# Format the input
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply the chat template and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt")
inputs = inputs.to(model.device)  # move tensors to the model's device

# Generate without tracking gradients
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Note: the decoded string includes the prompt as well as the generated answer
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
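For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so decoding `outputs[0]` directly repeats the prompt in the response. The sketch below shows one way to keep only the new tokens. The helper name `strip_prompt_tokens` is hypothetical and not part of the DRIFT code, and the token IDs are toy values standing in for `inputs.input_ids` and `outputs`.

```python
def strip_prompt_tokens(prompt_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence.

    Works on nested lists or on 2D tensors, since it relies only on
    len() and slicing of each row.
    """
    return [full[len(prompt):] for prompt, full in zip(prompt_ids, generated_ids)]


# Toy token IDs (assumed values, for illustration only)
prompt_ids = [[101, 7, 8, 9]]
generated_ids = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(prompt_ids, generated_ids)
print(new_tokens)
```

With the real tensors from the usage example, the trimmed sequences could then be decoded with `processor.batch_decode(strip_prompt_tokens(inputs.input_ids, outputs), skip_special_tokens=True)`.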
## Fine-tuning Details
This model was fine-tuned using:
- **Base Model**: Qwen2.5-VL
- **Merged Model**: DeepSeek-R1
- **Training Method**: Custom reasoning-focused fine-tuning
- **Dataset**: Multimodal reasoning datasets
- **Architecture**: Preserves original Qwen2.5-VL architecture
## Performance
The model has been optimized for:
- Enhanced reasoning capabilities
- Better multimodal understanding
- Improved step-by-step thinking processes
- More accurate visual question answering
## Citation
If you use this model, please cite our paper.
## License
This model is released under the MIT license.