RxnCaption-VL

Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning

English

Overview

RxnCaption-VL is a Qwen2.5-VL-7B-Instruct model fine-tuned for chemical reaction diagram parsing. Given a reaction diagram image annotated with a Bounding-box Index Visual Prompt (BIVP), that is, numbered bounding boxes drawn over the structures in the image, the model outputs structured JSON describing every reaction in the image.

CVPR 2026 | Paper | Code | Dataset

Key Results

Benchmark	Hard F1	Soft F1
RxnScribe-test	75.5	88.2
U-RxnDiagram-15k-test	55.5	67.6

Quick Start

With ms-swift (recommended)

pip install ms-swift

swift infer \
    --model           songjhPKU/RxnCaption-VL \
    --model_type      qwen2_5_vl \
    --infer_backend   pt \
    --val_dataset     your_eval.jsonl \
    --result_path     output.jsonl \
    --max_batch_size  1 \
    --max_new_tokens  16384
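
The command above reads its requests from your_eval.jsonl, whose schema is not spelled out here. The following is a minimal sketch of how such a file might be generated, assuming ms-swift's standard multimodal chat format (one JSON object per line with `messages` and `images` keys) and the same system prompt and user turn used under "Programmatic usage". The helper names `build_record` and `write_eval_file` are hypothetical and not part of ms-swift.

```python
import json
from pathlib import Path
from typing import Iterable

# User turn copied from the programmatic example; "<image>" marks the image slot.
USER_PROMPT = "<image> Now output your JSON format result:"


def build_record(image_path: str, system_prompt: str) -> dict:
    """Build one request in the (assumed) ms-swift chat-style dataset format."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": USER_PROMPT},
        ],
        "images": [image_path],
    }


def write_eval_file(image_paths: Iterable[str], system_prompt: str, out_path: str) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as handle:
        for image_path in image_paths:
            record = build_record(image_path, system_prompt)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `write_eval_file(["img_001.png", "img_002.png"], SYSTEM_PROMPT, "your_eval.jsonl")` would then produce a two-line file, one request per BIVP-annotated image (the image names are placeholders).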

Programmatic usage

# Only the names used below are imported.
from swift.llm import InferRequest, PtEngine, RequestConfig

SYSTEM_PROMPT = (
    "You are a chemistry expert. Analyze the provided image which contains "
    "chemical reactions. Your task is to identify all chemical structures and "
    "relevant text (like reagents, conditions, identifiers). Then, organize "
    "them into a complete chemical reaction equation. Output the result as a "
    "JSON list, where each item represents a single reaction. Each reaction "
    "must contain 'reactants', 'conditions', and 'products'. Each of these is "
    "a list of objects. An object can be a structure represented as "
    '{"structure": <index>}, text as {"text": "<content>"}, or an '
    'identifier as {"identifier": "<content>"}. The <index> corresponds '
    "to the numeric label of a structure in the image. Output only the JSON."
)

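# Load the fine-tuned checkpoint with the PyTorch backend.
engine = PtEngine("songjhPKU/RxnCaption-VL", model_type="qwen2_5_vl", max_batch_size=1)

# One request per BIVP-annotated image; "<image>" marks where the image is inserted.
req = InferRequest(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "<image> Now output your JSON format result:"},
    ],
    images=["path/to/annotated_image.png"],
)

# RequestConfig names the generation limit max_tokens; --max_new_tokens is the CLI flag.
cfg = RequestConfig(max_tokens=16384)
response = engine.infer([req], cfg)[0].choices[0].message.content
print(response)  # JSON string in the format described under "Output Format"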

Input Format

The model expects BIVP-annotated images: original reaction diagram images with blue bounding boxes and reading-order numeric labels drawn on top. These are generated by the MolYOLO + BIVP annotation pipeline. See the GitHub repo for the full pipeline.
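
To make the expected input concrete, here is a rough sketch of what a BIVP-style annotation step could look like. It is not the MolYOLO + BIVP pipeline from the repo. It assumes the structure boxes are already available as pixel coordinates from some detector, approximates reading order as top-to-bottom rows read left to right, and starts numbering at 1. The names `reading_order`, `draw_bivp`, and `row_tolerance` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def reading_order(boxes: List[Box], row_tolerance: float = 0.5) -> List[Box]:
    """Sort boxes top-to-bottom, then left-to-right within each row.

    Assumption: a box joins the current row when its vertical center is within
    row_tolerance * (height of the row's first box) of that first box's center.
    """
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center_y = (box[1] + box[3]) / 2
        if rows:
            first = rows[-1][0]
            first_center_y = (first[1] + first[3]) / 2
            if abs(center_y - first_center_y) < row_tolerance * (first[3] - first[1]):
                rows[-1].append(box)
                continue
        rows.append([box])
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def draw_bivp(image_path: str, boxes: List[Box], out_path: str) -> None:
    """Draw blue boxes with 1-based numeric labels and save the annotated image."""
    # Imported lazily so the ordering helper works without Pillow installed.
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for index, (x0, y0, x1, y1) in enumerate(reading_order(boxes), start=1):
        draw.rectangle([x0, y0, x1, y1], outline="blue", width=3)
        draw.text((x0 + 4, y0 + 4), str(index), fill="blue")
    image.save(out_path)
```

The label position, line width, and font here are arbitrary choices; images meant for the released model should come from the official pipeline so that they match what it saw during training.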

Output Format

[
  {
    "reactants":  [{"structure": 1}, {"text": "H₂O"}],
    "conditions": [{"text": "Δ, 2h"}],
    "products":   [{"structure": 2}]
  }
]
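
Downstream code will usually want to check this structure before using it. The sketch below parses a response string, validates it against the format shown above, and renders each reaction as a one-line summary. It assumes every object carries exactly one of the keys `structure`, `text`, or `identifier`, as described in the system prompt. The function names `parse_reactions` and `describe` are hypothetical.

```python
import json
from typing import Any, Dict, List

SECTIONS = ("reactants", "conditions", "products")
OBJECT_KEYS = ("structure", "text", "identifier")

Reaction = Dict[str, List[Dict[str, Any]]]


def parse_reactions(raw: str) -> List[Reaction]:
    """Parse the model's JSON string and validate it against the expected format."""
    reactions = json.loads(raw)
    if not isinstance(reactions, list):
        raise ValueError("top level must be a JSON list of reactions")
    for i, reaction in enumerate(reactions):
        if not isinstance(reaction, dict):
            raise ValueError(f"reaction {i}: expected a JSON object")
        for section in SECTIONS:
            items = reaction.get(section)
            if not isinstance(items, list):
                raise ValueError(f"reaction {i}: '{section}' must be a list")
            for item in items:
                # Assumption: exactly one key per object, drawn from OBJECT_KEYS.
                if not (isinstance(item, dict) and len(item) == 1
                        and next(iter(item)) in OBJECT_KEYS):
                    raise ValueError(f"reaction {i}: unexpected object {item!r}")
    return reactions


def describe(reaction: Reaction) -> str:
    """Render one reaction as text; structures appear as bracketed indices."""
    def side(items: List[Dict[str, Any]], sep: str) -> str:
        return sep.join(
            f"[{item['structure']}]" if "structure" in item
            else str(next(iter(item.values())))
            for item in items
        )

    reactants = side(reaction["reactants"], " + ")
    conditions = side(reaction["conditions"], "; ")
    products = side(reaction["products"], " + ")
    return f"{reactants} --({conditions})--> {products}"
```

Applied to the example above, `describe` yields `[1] + H₂O --(Δ, 2h)--> [2]`, where the bracketed numbers refer to the numeric labels drawn on the input image.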

Training Details

Base model: Qwen2.5-VL-7B-Instruct
Training type: Full fine-tuning (all parameters)
Training data: U-RxnDiagram-15k (~15,000 images, 4× augmented → ~59,000 samples)
Max sequence length: 16,384 tokens
Framework: ms-swift + DeepSpeed ZeRO-2

License

This model is released under the CC BY-NC 4.0 license.

Citation

@misc{song2026rxncaptionreformulatingreactiondiagram,
      title={RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning}, 
      author={Jiahe Song and Chuang Wang and Bowen Jiang and Yinfan Wang and Hao Zheng and Xingjian Wei and Chengjin Liu and Rui Nie and Junyuan Gao and Jiaxing Sun and Yubin Wang and Lijun Wu and Zhenhua Huang and Jiang Wu and Qian Yu and Conghui He},
      year={2026},
      eprint={2511.02384},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.02384}, 
}
