Paper | Blog | Recipe Leaderboard | Benchmark (MMLBD-C)
OriOn-Qwen
OriOn-Qwen is a long-context visual document finetune of Qwen/Qwen3-VL-32B-Instruct, trained with LongPO for improved long-document QA and reasoning over PDFs. We apply task arithmetic to minimize degradation relative to the original model, as sketched below.
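Task arithmetic here means adding a scaled task vector (the element-wise difference between finetuned and base weights) back onto the base model. A minimal sketch of the idea, not our exact merge recipe (`alpha` is a hypothetical interpolation weight):

```python
# Task arithmetic over two state dicts of the same architecture:
# merged = base + alpha * (finetuned - base); alpha is a hypothetical knob.
import torch

def task_arithmetic_merge(
    base_sd: dict[str, torch.Tensor],
    finetuned_sd: dict[str, torch.Tensor],
    alpha: float = 0.5,
) -> dict[str, torch.Tensor]:
    # Scale the task vector and add it back onto the base weights.
    return {
        name: base_w + alpha * (finetuned_sd[name] - base_w)
        for name, base_w in base_sd.items()
    }
```

Choosing `alpha` below 1.0 trades some of the finetune's gains for less drift from the base model.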
Highlights
- Pareto-optimal performance on long-document VQA benchmarks, including MMLongBenchDoc and our corrected version MMLBD-C (+1.7 and +2.6 F1, respectively, vs. the base 32B Instruct model; -1.2 and +0.2 F1 vs. the SOTA Qwen3-VL-235B-A22B-Instruct).
- Short-to-long preference-optimized for long-document QA on challenging synthetic questions.
- Frontier long-context (LC) performance for both visual and text reasoning over PDFs and plain text.
- Drop-in Transformers and vLLM usage with `Qwen3VLForConditionalGeneration` + `AutoProcessor` (same API as the base model) and `vllm serve lightonai/OriOn-Qwen`.
- Same 256K context window as Qwen3-VL-32B-Instruct.
Related
- Checkpoint Leaderboard: lightonai/OriOn-Leaderboard includes extensive information for exploration and reproducibility of our training recipes.
- Best Mistral Checkpoint: lightonai/OriOn-Mistral improves Mistral's visual LC performance by 16.8% on MMLongBenchDoc and its text LC performance by 43.5% on HELMET, while extending the context length to 344K tokens and training on documents of up to 336 pages.
- Manually Corrected MMLongBenchDoc: lightonai/MMLBD-C improves upon MMLongBenchDoc by flagging inconsistencies between the question, answer and source document. We correct errors related to typos, poor grammar, incorrect question-document pairing and ambiguous phrasing.
Benchmarks
Scores (accuracy / task metric, higher is better).
The table below compares OriOn-Qwen to the base models and main checkpoints from our paper.
| Model / checkpoint | VA | LCA | MMLBD-C | MMLB 128K | SlideVQA | HELMET | LongBench v2 | DUDE |
|---|---|---|---|---|---|---|---|---|
| OriOn-Qwen (LongPO short-stage) | 94.6 | 93.1 | 56.4 | 75.6 | 75.5 | 62.9 | 42.0 | 56.0 |
| Qwen3-VL 32B (baseline) | 94.2 | 92.8 | 53.8 | 70.4 | 77.2 | 63.0 | 42.0 | 61.8 |
| Qwen3-VL 32B Plain Distill (short stage) | 92.5 | 92.5 | 57.3 | 73.8 | 66.8 | 65.7 | 44.0 | 54.8 |
| OriOn-Mistral (Plain Distill) | 84.9 | 83.0 | 47.4 | 65.7 | 71.2 | 53.1 | 38.0 | 54.0 |
| Mistral 3.1 Small (24B) | 80.2 | 76.7 | 41.4 | 66.4 | 67.8 | 37.0 | 39.0 | 52.8 |
Intended use
OriOn-Qwen is intended for:
- Long PDF / slide-deck QA and understanding: strong one-shot QA with the full document given to the model.
- Long-context text/visual reasoning: we show that visual LC training improves not only visual LC performance but text LC performance as well.
Training details (high level)
- Method: Preference optimization (LongPO); a sketch of the objective family follows this list.
- Teacher policy: Qwen3-VL-235B-A22B-Instruct generates answers for single-page and multi-page questions.
- Data: Long PDF documents up to 104 pages.
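LongPO-style training optimizes a DPO-family preference objective. A minimal, hedged sketch of that loss (not our training code; `beta` and the per-sequence log-probability inputs are assumptions for illustration):

```python
# DPO-style preference loss: raise the policy's margin for the chosen answer
# over the rejected one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_preference_loss(
    policy_chosen_lp: torch.Tensor, policy_rejected_lp: torch.Tensor,
    ref_chosen_lp: torch.Tensor, ref_rejected_lp: torch.Tensor,
    beta: float = 0.1,  # hypothetical temperature on the preference margin
) -> torch.Tensor:
    margins = beta * (
        (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    )
    return -F.logsigmoid(margins).mean()
```

In the short-to-long setup referenced in the Highlights, the chosen answer typically comes from the model's own short-context response and the rejected one from its long-context response over the same document.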
Serving
We recommend serving with vLLM (adjust for your setup):

```bash
vllm serve lightonai/OriOn-Qwen -tp 2 --quantization fp8
```
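The server exposes vLLM's OpenAI-compatible API. A minimal client query (the base URL assumes vLLM's default local port; the image and question are placeholders):

```python
# Query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="lightonai/OriOn-Qwen",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```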
Usage with Transformers
This is adapted directly from the official Qwen3-VL-32B-Instruct model card, with the model id swapped to lightonai/OriOn-Qwen.
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "lightonai/OriOn-Qwen", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("lightonai/OriOn-Qwen")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```
Tip: if you're running multi-image/video inputs, Qwen recommends `flash_attention_2` for speed/memory.
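Since OriOn-Qwen targets long-document QA, a whole-PDF query is more representative than the single-image demo above. A hedged sketch building on the loaded `model` and `processor` (PyMuPDF, the 150 DPI rendering, `report.pdf`, and the question are illustrative assumptions; it also assumes a transformers version whose chat template accepts PIL images inline):

```python
# Rasterize every page of a PDF and ask one question over the full document.
# Assumes: pip install pymupdf; model/processor loaded as in the snippet above.
import fitz  # PyMuPDF; "report.pdf" is a hypothetical input file
from PIL import Image

pages = []
for page in fitz.open("report.pdf"):
    pix = page.get_pixmap(dpi=150)  # render the page to an RGB bitmap
    pages.append(Image.frombytes("RGB", (pix.width, pix.height), pix.samples))

content = [{"type": "image", "image": img} for img in pages]
content.append({"type": "text", "text": "Summarize the key findings of this document."})
messages = [{"role": "user", "content": content}]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0])
```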
Citation
If you use OriOn-Qwen or MMLBD-C in your work, please cite:
```bibtex
@misc{orion_longdoc_vlm_2026,
  title={How to Train Your Long-Context Visual Document Model},
  author={Austin Veselka},
  year={2026},
  eprint={2602.15257},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.15257},
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}

@misc{mmlbd,
  title={MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations},
  author={Yubo Ma and Yuhang Zang and Liangyu Chen and Meiqi Chen and Yizhu Jiao and Xinze Li and Xinyuan Lu and Ziyu Liu and Yan Ma and Xiaoyi Dong and Pan Zhang and Liangming Pan and Yu-Gang Jiang and Jiaqi Wang and Yixin Cao and Aixin Sun},
  year={2024},
  eprint={2407.01523},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.01523},
}
```