
πŸ“„ Paper | πŸ“ Blog | πŸš€ Recipe Leaderboard | πŸ“Š Benchmark (MMLBD-C)

OriOn-Qwen

OriOn-Qwen is a long-context visual document finetune of Qwen/Qwen3-VL-32B-Instruct, trained with LongPO to improve long-document QA and reasoning over PDFs. We apply task arithmetic to minimize regression from the original model.


Highlights

  • Pareto-optimal performance on long-document VQA benchmarks, including MMLongBenchDoc and our corrected version MMLBD-C: +1.7 and +2.6 F1 respectively over the base 32B Instruct model, and -1.2 / +0.2 versus the much larger SOTA Qwen3-VL-235B-A22B-Instruct.
  • Short-to-long preference-optimized for long-document QA on challenging synthetic questions.
  • Frontier long-context (LC) performance for reasoning over both visual documents (PDFs) and text.
  • Drop-in Transformers and vLLM usage with Qwen3VLForConditionalGeneration + AutoProcessor (same API as the base model) and vllm serve lightonai/OriOn-Qwen.
  • Same 256K context window as Qwen3-VL-32B-Instruct.

Related

  • Checkpoint Leaderboard: lightonai/OriOn-Leaderboard includes extensive information for exploration and reproducibility of our training recipes.
  • Best Mistral Checkpoint: lightonai/OriOn-Mistral improves Mistral's visual LC performance by 16.8% on MMLongBenchDoc and its text LC performance by 43.5% on HELMET, while extending the context length to 344K tokens and training on documents of up to 336 pages.
  • Manually Corrected MMLongBenchDoc: lightonai/MMLBD-C improves upon MMLongBenchDoc by flagging inconsistencies between the question, answer, and source document. We correct errors related to typos, poor grammar, incorrect question-document pairing, and ambiguous phrasing.

Benchmarks

Scores (accuracy / task metric, higher is better).

The table below compares OriOn-Qwen to the base models and main checkpoints from our paper.

| Model / checkpoint | VA | LCA | MMLBD-C | MMLB 128K | SlideVQA | HELMET | LongBench v2 | DUDE |
|---|---|---|---|---|---|---|---|---|
| OriOn-Qwen (LongPO short-stage) | 94.6 | 93.1 | 56.4 | 75.6 | 75.5 | 62.9 | 42.0 | 56.0 |
| Qwen3-VL 32B (baseline) | 94.2 | 92.8 | 53.8 | 70.4 | 77.2 | 63.0 | 42.0 | 61.8 |
| Qwen3-VL 32B Plain Distill (short stage) | 92.5 | 92.5 | 57.3 | 73.8 | 66.8 | 65.7 | 44.0 | 54.8 |
| OriOn-Mistral (Plain Distill) | 84.9 | 83.0 | 47.4 | 65.7 | 71.2 | 53.1 | 38.0 | 54.0 |
| Mistral 3.1 Small (24B) | 80.2 | 76.7 | 41.4 | 66.4 | 67.8 | 37.0 | 39.0 | 52.8 |

Intended use

OriOn-Qwen is intended for:

  • Long PDF / slide-deck QA and understanding: strong one-shot QA with the full document given to the model.
  • Long-context text/visual reasoning: we show that visual LC training improves not only visual LC performance but text LC performance as well.

Training details (high level)

  • Method: Preference optimization (LongPO)
  • Teacher policy: Qwen3-VL-235B-A22B-Instruct, used to generate answers for single-page and multi-page questions.
  • Data: Long PDF documents up to 104 pages.
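LongPO-style preference optimization trains on (prompt, chosen, rejected) triples. A purely illustrative sketch of what one short-to-long record might look like; the field names and contents below are assumptions for exposition, not the released training schema:

```python
# Illustrative only: one preference record as it might look for LongPO-style
# training. Field names ("prompt", "chosen", "rejected") follow common DPO
# conventions and are NOT taken from the OriOn-Qwen release.
preference_example = {
    "prompt": {
        "pages": ["page_0.png", "page_1.png"],  # rendered PDF pages (hypothetical paths)
        "question": "Which quarter had the highest sales?",
    },
    # "chosen": a teacher answer generated from the relevant short context
    "chosen": "Q3, with the highest sales figure in the table on page 2.",
    # "rejected": a weaker answer produced over the full long context
    "rejected": "The report does not specify quarterly sales.",
}
```

The short-stage idea is that the teacher answers with only the relevant pages in view, and the policy is optimized to match that quality when given the whole document.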

Serving

We recommend serving with vLLM (adjust for your setup):

vllm serve lightonai/OriOn-Qwen -tp 2 --quantization fp8 
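Once served, vLLM exposes an OpenAI-compatible chat-completions API. A minimal sketch of how a multimodal request for this model could be assembled, assuming the default endpoint at http://localhost:8000/v1/chat/completions (the helper name and image URL are illustrative):

```python
import json

def build_chat_request(question: str, image_urls: list[str]) -> dict:
    """Build an OpenAI-style chat-completions payload with images + a question.

    Hypothetical helper for illustration; POST the result as JSON to the
    vLLM server's /v1/chat/completions endpoint.
    """
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    content.append({"type": "text", "text": question})
    return {
        "model": "lightonai/OriOn-Qwen",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 256,
    }

payload = build_chat_request(
    "What is the total revenue reported on this page?",
    ["https://example.com/page_1.png"],  # placeholder URL
)
body = json.dumps(payload)  # send this as the POST body
```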

Usage with Transformers

This is adapted directly from the official Qwen3-VL-32B-Instruct model card, with the model id swapped to lightonai/OriOn-Qwen.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "lightonai/OriOn-Qwen", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("lightonai/OriOn-Qwen")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)

Tip: if you're running multi-image or video inputs, Qwen recommends loading the model with attn_implementation="flash_attention_2" for better speed and memory usage.
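Since OriOn-Qwen targets long documents, the single-image message above generalizes to a multi-page layout by listing one image entry per rendered page before the question. A small sketch, assuming the PDF pages have already been rendered to image files (the file names are illustrative):

```python
# Sketch: build a multi-page chat message for long-document QA.
# Page paths are hypothetical; render your PDF to images first
# (e.g. with pdf2image or pypdfium2, not shown here).
def build_document_messages(page_paths: list[str], question: str) -> list[dict]:
    content = [{"type": "image", "image": p} for p in page_paths]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_document_messages(
    [f"page_{i}.png" for i in range(3)],
    "Summarize the key findings of this report.",
)
```

The resulting `messages` list plugs directly into `processor.apply_chat_template(...)` in the example above.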


Citation

If you use OriOn-Qwen or MMLBD-C in your work, please cite:

@misc{orion_longdoc_vlm_2026,
  title        = {How to Train Your Long-Context Visual Document Model},
  author       = {Austin Veselka},
  year         = {2026},
  eprint       = {2602.15257},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2602.15257}
}
@misc{qwen3technicalreport,
  title        = {Qwen3 Technical Report},
  author       = {Qwen Team},
  year         = {2025},
  eprint       = {2505.09388},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2505.09388}
}
@misc{mmlbd,
  title        = {MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations},
  author       = {Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun},
  year         = {2024},
  eprint       = {2407.01523},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2407.01523}
}
