---
base_model: qwen2.5-vl
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language-model
- multimodal
- reasoning
- fine-tuned
- qwen
library_name: transformers
---
# DRIFT

DRIFT is a fine-tuned version of Qwen2.5-VL with enhanced reasoning capabilities, optimized for multimodal reasoning tasks.
The model is presented in the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).
Code and further details are available in the GitHub repository: https://github.com/WikiChao/DRIFT
## Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model_id = "ChaoHuangCS/DRIFT-VL-7B"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example usage with an image
image = Image.open("your_image.jpg")
prompt = "Analyze this image and explain your reasoning step by step."

# Format the input as a chat message with interleaved image and text
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply the chat template and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens so only the newly generated answer is decoded
generated = outputs[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(response)
```
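Note that the example uses the `qwen_vl_utils` helper package released alongside Qwen2-VL/Qwen2.5-VL (`pip install qwen-vl-utils`), which extracts the image and video inputs from the chat messages before they are passed to the processor.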
## Fine-tuning Details

This model was fine-tuned using:
- **Base Model**: Qwen2.5-VL
- **Merged Model**: DeepSeek-R1
- **Training Method**: Custom reasoning-focused fine-tuning (a rough sketch of the merging idea follows this list)
- **Dataset**: Multimodal reasoning datasets
- **Architecture**: Preserves the original Qwen2.5-VL architecture
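The exact injection procedure is defined in the paper and the GitHub repository. As a very rough illustration of the general weight-merging idea, the sketch below assumes the "reasoning direction" is a parameter-space delta between a reasoning-tuned text LLM and its base checkpoint, added to the matching language-model weights of the MLLM. The function name, the `alpha` scale, and the name/shape matching rule are illustrative assumptions, not the paper's actual recipe.

```python
import torch

def inject_reasoning_direction(mllm_sd, base_lm_sd, reasoning_lm_sd, alpha=0.5):
    """Hypothetical sketch: add alpha * (reasoning - base) to every MLLM
    parameter whose name and shape match the text-only checkpoints.
    `alpha` and the matching rule are illustrative, not from the paper."""
    merged = {}
    for name, weight in mllm_sd.items():
        if (
            name in base_lm_sd
            and name in reasoning_lm_sd
            and base_lm_sd[name].shape == weight.shape
        ):
            # The "reasoning direction" for this parameter
            direction = reasoning_lm_sd[name] - base_lm_sd[name]
            merged[name] = weight + alpha * direction.to(weight.dtype)
        else:
            # Vision tower and any non-matching weights stay untouched
            merged[name] = weight
    return merged
```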
## Performance

The model has been optimized for:
- Enhanced reasoning capabilities
- Better multimodal understanding
- Improved step-by-step thinking processes
- More accurate visual question answering

## Citation

If you use this model, please cite the paper [Directional Reasoning Injection for Fine-Tuning MLLMs](https://huggingface.co/papers/2510.15050).
## License

This model is released under the MIT license.