URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding
Yongxin Shi, Jiapeng Wang, Zeyu Shan, Dezhi Peng, Zening Lin, Lianwen Jin

URaG

URaG (Unified Retrieval and Generation) is a simple yet effective framework that unifies retrieval and generation within a single model for efficient long document understanding. Equipped with a lightweight cross-modal retrieval module, URaG explicitly leverages the inherent evidence localization capabilities of multimodal large language models (MLLMs) to perform efficient, integrated retrieval.

URaG-3B is built on Qwen2.5-VL-3B.

Environment

conda create -n urag python=3.10
conda activate urag

Install torch & flash-attn:

# Recommended versions (not mandatory)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -U flash-attn==2.7.3 --no-build-isolation

Install other dependencies:

git clone https://github.com/shi-yx/URaG.git
cd ./URaG
pip install -r requirements.txt
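
After installation, it can be handy to confirm which versions actually ended up in the environment. The following is a minimal sketch, not part of the URaG repo: it compares installed packages against the recommended versions listed above using only the standard library. The helper name `check_versions` and the `RECOMMENDED` table are illustrative assumptions, and since the recommended versions are not mandatory, a mismatch is informational only.

```python
from importlib.metadata import PackageNotFoundError, version

# Recommended versions copied from the install commands above.
# The keys are pip distribution names.
RECOMMENDED = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "flash-attn": "2.7.3",
}


def check_versions(recommended):
    """Map each package to (installed_version_or_None, matches_recommended)."""
    report = {}
    for name, wanted in recommended.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Wheels from the PyTorch index may carry a local suffix such as "+cu124";
        # strip it before comparing.
        base = installed.split("+")[0] if installed else None
        report[name] = (installed, base == wanted)
    return report


for name, (installed, ok) in check_versions(RECOMMENDED).items():
    print(f"{name}: installed={installed} recommended={RECOMMENDED[name]} match={ok}")
```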

Inference

Example inference code is provided in the GitHub repo.

Train

Prepare Training Datasets

The dataset format is as follows (please refer to the GitHub repo for more details):

[
  {
    "id": "unique_id",
    "image": ["image_path1", "image_path2", ...],
    "conversations": [
      {"from": "human", "value": "query"},
      {"from": "gpt", "value": "answer"}
    ],
    "retrieval_labels": [0, 1, 0, ...]  # 1: evidence, 0: non-evidence
  },
  ...
]
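
Before launching training, a quick sanity check of each entry can catch formatting mistakes early. The sketch below is illustrative and not taken from the URaG repo. It rests on two assumptions: that `retrieval_labels` holds exactly one 0/1 label per path in `image`, and that conversation turns alternate between `human` and `gpt`. The function name `validate_entry` and the sample entry are hypothetical.

```python
def validate_entry(entry):
    """Return a list of problems found in one dataset entry (empty if none)."""
    problems = []
    for key in ("id", "image", "conversations", "retrieval_labels"):
        if key not in entry:
            problems.append(f"missing key: {key}")
    if problems:
        return problems

    images, labels = entry["image"], entry["retrieval_labels"]
    # Assumption: one retrieval label per image, in the same order.
    if len(images) != len(labels):
        problems.append(f"{len(images)} images but {len(labels)} retrieval labels")
    if any(label not in (0, 1) for label in labels):
        problems.append("retrieval_labels must contain only 0 or 1")

    roles = [turn.get("from") for turn in entry["conversations"]]
    # Assumption: turns alternate human, gpt, human, gpt, ...
    expected = ["human", "gpt"] * (len(roles) // 2)
    if not roles or roles != expected:
        problems.append(f"unexpected conversation roles: {roles}")
    return problems


# Hypothetical entry, made up for illustration only.
sample = {
    "id": "doc-0001",
    "image": ["page_1.png", "page_2.png", "page_3.png"],
    "conversations": [
        {"from": "human", "value": "What is the total revenue?"},
        {"from": "gpt", "value": "4.2 million"},
    ],
    "retrieval_labels": [0, 1, 0],
}

print(validate_entry(sample))  # prints []
```

To check a whole file, load it with `json.load` and call `validate_entry` on each element. Note that the real file must be strict JSON, so the `...` placeholders and the `#` comment in the schematic above cannot appear in it.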

Pretrain

cd ./URaG/code
sh scripts/pretrain.sh
# extract the parameters of the proj_layer
sh scripts/extract_projlayer.sh

Finetune

cd ./URaG/code
sh scripts/finetune.sh
# merge lora
sh scripts/merge_lora.sh

Citation

If you find our work helpful, please consider citing our paper.

@inproceedings{shi2026urag,
  title={URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding},
  author={Shi, Yongxin and Wang, Jiapeng and Shan, Zeyu and Peng, Dezhi and Lin, Zening and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}

Safetensors: model size 4B params, tensor type BF16

Paper for shiyx1/URaG-3B: URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding (2511.10552, published Nov 13, 2025)