---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- trl
- ocr
- vision-language
- reasoning
- grounded-visual-reasoning
- sft
- grpo
- thinking
- code
- thinking=1
---

# **Lumian-VLR-7B-Thinking**
> The **Lumian-VLR-7B-Thinking** model is an experimental, high-fidelity vision-language reasoning system designed for fine-grained multimodal understanding. Built on **Qwen2.5-VL-7B-Instruct**, it enhances image captioning, sampled video reasoning, and document comprehension through explicit grounded reasoning. The model produces structured reasoning traces aligned with visual coordinates, enabling explainable multimodal reasoning. It was trained via supervised fine-tuning (SFT) on visually grounded reasoning traces and further refined with GRPO reinforcement learning, yielding step-by-step chain-of-thought reasoning with strong visual grounding.

> [!NOTE]
> *Model Subfolder:* [Lumian-VLR-7B-Thinking (think-preview)](https://huggingface.co/prithivMLmods/Lumian-VLR-7B-Thinking/tree/main/think-preview)
>
> *Model Folder:* [Lumian-VLR-7B-Thinking (no-think-single-shot)](https://huggingface.co/prithivMLmods/Lumian-VLR-7B-Thinking/tree/main/)
## Quick Start with Transformers (think-preview) 🤗
```bash
pip install git+https://github.com/huggingface/transformers.git
```
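The examples below also import the `qwen_vl_utils` helper used in the official Qwen2.5-VL examples; if it is not already installed, it can typically be added from PyPI (package name assumed to be `qwen-vl-utils`):
```bash
pip install qwen-vl-utils
```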
```py
# Load Lumian-VLR-7B-Thinking (think-preview subfolder)
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "prithivMLmods/Lumian-VLR-7B-Thinking"
SUBFOLDER = "think-preview"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True, subfolder=SUBFOLDER)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    subfolder=SUBFOLDER,
    torch_dtype=torch.float16,
).to(device).eval()
```
## Key Enhancements
* **Visually-Grounded Reasoning and Thinking Traces**: Generates explicit reasoning traces tied to image regions and document structures for transparent and explainable outputs.
* **Advanced Image Captioning**: Produces detailed, grounded captions with reasoning steps for improved scene understanding.
* **Sampled Video Reasoning**: Handles long-duration videos with temporal reasoning for question answering and summarization.
* **Context-Aware Document Analysis**: Excels at structured and unstructured content extraction with visual grounding.
* **Fine-Grained Visual Grounding**: Accurately links reasoning steps to tables, charts, and graphical elements.
* **Reinforcement-Learned Thinking**: GRPO training incentivizes accurate, grounded reasoning with minimal hallucinations.

> [!TIP]
> Colab Demo: https://huggingface.co/prithivMLmods/Lumian-VLR-7B-Thinking/blob/main/think-preview/Lumian-VLR-7B-Thinking-Demo-Notebook/Lumian-VLR-7B-Thinking.ipynb
## Thinking Traces
The model outputs reasoning and answers in a structured format:
```
<think>
Step 1: Identify the main elements in the image and their positions.
Step 2: Analyze the relationships between objects and surrounding context.
Step 3: Derive the final answer based on spatial reasoning and visual cues.
</think>
<answer>
The image depicts a person holding an open book with highlighted sections on the left page.
</answer>
```
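Downstream code can separate the reasoning trace from the final answer. The following is a minimal sketch (a hypothetical `split_trace` helper, not part of the repository) that assumes the model emits both tags:
```python
import re

def split_trace(text: str):
    """Split a model response into its <think> trace and <answer> block."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else text.strip(),
    )

sample = "<think>Step 1: Identify the main elements.</think>\n<answer>A person holding an open book.</answer>"
trace, answer = split_trace(sample)
print(answer)  # -> A person holding an open book.
```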
## Quick Start with Transformers (single-shot)
```python
# Load the default (single-shot, no-think) checkpoint from the repository root.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Lumian-VLR-7B-Thinking", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Lumian-VLR-7B-Thinking")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with thinking traces."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
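For sampled video reasoning, the same pipeline accepts video messages. Below is a minimal sketch reusing the model and processor loaded above, with a placeholder video path and an assumed frame-sampling rate handled by `qwen_vl_utils`:
```python
# Sketch: video question answering. The path and fps value are placeholders;
# adjust them to your data and memory budget.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the key events in this video with thinking traces."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```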
## Intended Use
* Visual reasoning with grounded, step-by-step thinking traces.
* Explainable image captioning and sampled video reasoning.
* Multimodal document retrieval, extraction, and analytical interpretation.
* Transparent chain-of-thought reasoning for educational, research, and enterprise use.
* Multilingual reasoning and structured content extraction.
* Robotic and mobile vision-based automation with grounded decision-making.
## Limitations
* High memory requirements for long videos and large document batches.
* Degraded accuracy on extremely low-resolution or obscured visuals.
* Suboptimal for real-time inference on edge devices.
* Visual token configuration (e.g., the processor's pixel budget) strongly influences reasoning fidelity and memory use; see the sketch after this list.
* Occasional reasoning drift or partial grounding errors.
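One common lever for the visual token budget, assuming the standard Qwen2.5-VL processor options (`min_pixels`/`max_pixels`; the values below are illustrative, not tuned recommendations):
```python
# Sketch: bound the number of visual tokens per image via the processor's pixel budget.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Lumian-VLR-7B-Thinking",
    min_pixels=256 * 28 * 28,   # lower bound on the resolution fed to the vision encoder
    max_pixels=1024 * 28 * 28,  # upper bound; smaller values cut memory at some cost in detail
)
```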
## References
* **YaRN: Efficient Context Window Extension of Large Language Models**
* **Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**
* **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
* **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
* **Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning** |