Paper | Blog | Recipe Leaderboard | Benchmark (MMLBD-C)
OriOn-Qwen
OriOn-Qwen is a long-context visual document finetune of Qwen/Qwen3-VL-32B-Instruct, trained with LongPO for improved long-document QA and reasoning over PDFs. We apply task arithmetic to minimize degradation relative to the original model, as sketched below.
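Task arithmetic here means adding a scaled task vector (the element-wise difference between finetuned and base weights) back onto the base model. A minimal sketch of the idea, not our exact merge recipe (`alpha` is a hypothetical interpolation weight):

```python
# Task arithmetic over two state dicts of the same architecture:
# merged = base + alpha * (finetuned - base); alpha is a hypothetical knob.
import torch

def task_arithmetic_merge(
    base_sd: dict[str, torch.Tensor],
    finetuned_sd: dict[str, torch.Tensor],
    alpha: float = 0.5,
) -> dict[str, torch.Tensor]:
    # Scale the task vector and add it back onto the base weights.
    return {
        name: base_w + alpha * (finetuned_sd[name] - base_w)
        for name, base_w in base_sd.items()
    }
```

Choosing `alpha` below 1.0 trades some of the finetune's gains for less drift from the base model.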
Highlights
- Pareto-optimal performance on long-document VQA benchmarks, including MMLongBenchDoc and our corrected version MMLBD-C (+1.7 and +2.6 F1, respectively, vs. the base 32B Instruct model; -1.2 and +0.2 F1 vs. the SOTA Qwen3-VL-235B-A22B-Instruct).
- Short-to-long preference-optimized for long-document QA on challenging synthetic questions.
- Frontier long-context (LC) performance for both visual and text reasoning over PDFs and plain text.
- Drop-in Transformers and vLLM usage with `Qwen3VLForConditionalGeneration` + `AutoProcessor` (same API as the base model) and `vllm serve lightonai/OriOn-Qwen`.
- Same 256K context window as Qwen3-VL-32B-Instruct.
Related
- Checkpoint Leaderboard: lightonai/OriOn-Leaderboard includes extensive information for exploration and reproducibility of our training recipes.
- Best Mistral Checkpoint: lightonai/OriOn-Mistral improves Mistral's visual LC performance by 16.8% on MMLongBenchDoc and its text LC performance by 43.5% on HELMET, while extending the context length to 344K tokens and training on documents of up to 336 pages.
- Manually Corrected MMLongBenchDoc: lightonai/MMLBD-C improves upon MMLongBenchDoc by flagging inconsistencies between the question, answer and source document. We correct errors related to typos, poor grammar, incorrect question-document pairing and ambiguous phrasing.
Benchmarks
Scores (accuracy / task metric, higher is better).
The table below compares OriOn-Qwen to the base models and main checkpoints from our paper.
| Model / checkpoint | VA | LCA | MMLBD-C | MMLB 128K | SlideVQA | HELMET | LongBench v2 | DUDE |
|---|---|---|---|---|---|---|---|---|
| OriOn-Qwen (LongPO short-stage) | 94.6 | 93.1 | 56.4 | 75.6 | 75.5 | 62.9 | 42.0 | 56.0 |
| Qwen3-VL 32B (baseline) | 94.2 | 92.8 | 53.8 | 70.4 | 77.2 | 63.0 | 42.0 | 61.8 |
| Qwen3-VL 32B Plain Distill (short stage) | 92.5 | 92.5 | 57.3 | 73.8 | 66.8 | 65.7 | 44.0 | 54.8 |
| OriOn-Mistral (Plain Distill) | 84.9 | 83.0 | 47.4 | 65.7 | 71.2 | 53.1 | 38.0 | 54.0 |
| Mistral 3.1 Small (24B) | 80.2 | 76.7 | 41.4 | 66.4 | 67.8 | 37.0 | 39.0 | 52.8 |
Intended use
OriOn-Qwen is intended for:
- Long PDF / slide-deck QA and understanding: strong one-shot QA with the full document given to the model.
- Long-context text/visual reasoning: we show that visual LC training improves not only visual LC performance but text LC performance as well.
Training details (high level)
- Method: Preference optimization (LongPO); a sketch of the objective family follows this list.
- Teacher policy: Qwen3-VL-235B-A22B-Instruct generates answers for single-page and multi-page questions.
- Data: Long PDF documents up to 104 pages.
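LongPO-style training optimizes a DPO-family preference objective. A minimal, hedged sketch of that loss (not our training code; `beta` and the per-sequence log-probability inputs are assumptions for illustration):

```python
# DPO-style preference loss: raise the policy's margin for the chosen answer
# over the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_preference_loss(
    policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
    ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
    beta: float = 0.1,  # hypothetical temperature on the preference margin
) -> torch.Tensor:
    margins = beta * (
        (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    )
    return -F.logsigmoid(margins).mean()
```

In the short-to-long setup referenced in the Highlights, the chosen answer typically comes from the model's own short-context response and the rejected one from its long-context response over the same document.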
Serving
We recommend serving with vLLM (adjust for your setup):

```bash
vllm serve lightonai/OriOn-Qwen -tp 2 --quantization fp8
```
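The server exposes vLLM's OpenAI-compatible API. A minimal client query (the base URL assumes vLLM's default local port; the image and question are placeholders):

```python
# Query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="lightonai/OriOn-Qwen",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```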
Usage with Transformers
This is adapted directly from the official Qwen3-VL-32B-Instruct model card, with the model id swapped to lightonai/OriOn-Qwen.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "lightonai/OriOn-Qwen", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("lightonai/OriOn-Qwen")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```
Tip: if you're running multi-image/video inputs, Qwen recommends `flash_attention_2` for speed/memory.
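Since OriOn-Qwen targets long-document QA, a whole-PDF query is more representative than the single-image demo above. A hedged sketch building on the loaded `model` and `processor` (PyMuPDF, the 150 DPI rendering, `report.pdf`, and the question are illustrative assumptions; it also assumes a transformers version whose chat template accepts PIL images inline):

```python
# Rasterize every page of a PDF and ask one question over the full document.
# Assumes: pip install pymupdf; model/processor loaded as in the snippet above.
import fitz  # PyMuPDF; "report.pdf" is a hypothetical input file
from PIL import Image

pages = []
for page in fitz.open("report.pdf"):
    pix = page.get_pixmap(dpi=150)  # render the page to an RGB bitmap
    pages.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))

content = [{"type": "image", "image": img} for img in pages]
content.append({"type": "text", "text": "Summarize the key findings of this document."})
messages = [{"role": "user", "content": content}]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```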
Citation
If you use OriOn-Qwen or MMLBD-C in your work, please cite:
```bibtex
@misc{orion_longdoc_vlm_2026,
  title={How to Train Your Long-Context Visual Document Model},
  author={Austin Veselka},
  year={2026},
  eprint={2602.15257},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.15257},
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}

@misc{mmlbd,
  title={MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations},
  author={Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun},
  year={2024},
  eprint={2407.01523},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.01523},
}
```