RxnCaption-VL
Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
English
Overview
RxnCaption-VL is a fine-tuned Qwen2.5-VL-7B-Instruct model for chemical reaction diagram parsing. Given a reaction diagram image annotated with indexed bounding boxes (BIVP, Bounding-box Index Visual Prompt), the model outputs structured JSON describing every reaction in the image.
CVPR 2026 | Paper | Code | Dataset
Key Results
| Benchmark | Hard F1 | Soft F1 |
|---|---|---|
| RxnScribe-test | 75.5 | 88.2 |
| U-RxnDiagram-15k-test | 55.5 | 67.6 |
Quick Start
With ms-swift (recommended)
pip install ms-swift
swift infer \
    --model songjhPKU/RxnCaption-VL \
    --model_type qwen2_5_vl \
    --infer_backend pt \
    --val_dataset your_eval.jsonl \
    --result_path output.jsonl \
    --max_batch_size 1 \
    --max_new_tokens 16384
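The command above reads its inputs from `your_eval.jsonl`, whose layout is not described here. The sketch below writes such a file under the assumption that it follows the ms-swift custom dataset convention of one JSON object per line with `messages` and `images` keys; that layout, and the helper names `build_record` and `write_eval_jsonl`, are assumptions for illustration, so check them against the ms-swift version you have installed.

```python
import json
from pathlib import Path

# User turn taken from the programmatic example in this model card.
USER_PROMPT = "<image> Now output your JSON format result:"


def build_record(image_path: str, system_prompt: str) -> dict:
    """Build one evaluation record (assumed ms-swift messages/images layout)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": USER_PROMPT},
        ],
        "images": [image_path],
    }


def write_eval_jsonl(image_paths, system_prompt: str, out_path: str) -> Path:
    """Write one JSON object per line, one line per BIVP-annotated image."""
    out = Path(out_path)
    with out.open("w", encoding="utf-8") as f:
        for path in image_paths:
            record = build_record(str(path), system_prompt)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out
```

A call such as `write_eval_jsonl(["path/to/annotated_image.png"], SYSTEM_PROMPT, "your_eval.jsonl")` would produce the file passed to `--val_dataset`, with `SYSTEM_PROMPT` being the prompt string shown under Programmatic usage.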
Programmatic usage
from swift.llm import InferRequest, PtEngine, RequestConfig

# System prompt that defines the task and the expected JSON schema.
SYSTEM_PROMPT = (
    "You are a chemistry expert. Analyze the provided image which contains "
    "chemical reactions. Your task is to identify all chemical structures and "
    "relevant text (like reagents, conditions, identifiers). Then, organize "
    "them into a complete chemical reaction equation. Output the result as a "
    "JSON list, where each item represents a single reaction. Each reaction "
    "must contain 'reactants', 'conditions', and 'products'. Each of these is "
    "a list of objects. An object can be a structure represented as "
    '{"structure": <index>}, text as {"text": "<content>"}, or an '
    'identifier as {"identifier": "<content>"}. The <index> corresponds '
    "to the numeric label of a structure in the image. Output only the JSON."
)

# Load the model with the PyTorch backend.
engine = PtEngine("songjhPKU/RxnCaption-VL", model_type="qwen2_5_vl", max_batch_size=1)

# One request per BIVP-annotated image; <image> marks where the image is inserted.
req = InferRequest(
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "<image> Now output your JSON format result:"},
    ],
    images=["path/to/annotated_image.png"],
)
cfg = RequestConfig(max_new_tokens=16384)

# infer() returns one response per request; take the text of the first choice.
response = engine.infer([req], cfg)[0].choices[0].message.content
print(response)
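Since the model returns its answer as a JSON string, downstream code will usually want to decode and sanity-check it before use. The following is a minimal sketch of such a check, based only on the schema stated in the system prompt; the function name `parse_reactions` is hypothetical and not part of the released code.

```python
import json

REQUIRED_KEYS = ("reactants", "conditions", "products")
ITEM_KEYS = {"structure", "text", "identifier"}


def parse_reactions(response: str) -> list:
    """Decode the model output and check it against the prompt's schema.

    Raises ValueError on malformed output (json.JSONDecodeError is a
    subclass of ValueError, so invalid JSON is covered as well).
    """
    reactions = json.loads(response)
    if not isinstance(reactions, list):
        raise ValueError("expected a JSON list of reactions")
    for idx, rxn in enumerate(reactions):
        if not isinstance(rxn, dict):
            raise ValueError(f"reaction {idx} is not an object")
        for key in REQUIRED_KEYS:
            items = rxn.get(key)
            if not isinstance(items, list):
                raise ValueError(f"reaction {idx} is missing list '{key}'")
            for item in items:
                # Each entry is an object with exactly one recognised key.
                if not (isinstance(item, dict) and len(item) == 1
                        and next(iter(item)) in ITEM_KEYS):
                    raise ValueError(f"reaction {idx}: bad entry in '{key}': {item!r}")
    return reactions
```

Calling `parse_reactions(response)` on the string printed above either returns the list of reaction dictionaries or raises `ValueError`, which makes truncated or malformed generations easy to detect and skip.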
Input Format
The model expects BIVP-annotated images: original reaction diagram images with blue bounding boxes and reading-order numeric labels drawn on top. These are generated by the MolYOLO + BIVP annotation pipeline. See the GitHub repo for the full pipeline.
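To make "reading-order numeric labels" concrete, the sketch below assigns 1-based indices to detected boxes using a simple row-then-column heuristic. This heuristic, the `row_tolerance` parameter, and the function name `assign_reading_order` are assumptions for illustration only; the actual MolYOLO + BIVP pipeline in the GitHub repo may order boxes differently and should be used to prepare real inputs.

```python
def assign_reading_order(boxes, row_tolerance=0.5):
    """Map 1-based labels to boxes, top-to-bottom then left-to-right.

    boxes: list of (x1, y1, x2, y2) tuples in pixel coordinates.
    Two boxes share a row when their vertical centres differ by at most
    row_tolerance times the taller of the two box heights (an assumed rule).
    """
    def center_y(box):
        return (box[1] + box[3]) / 2

    rows = []
    for box in sorted(boxes, key=center_y):
        if rows:
            ref = rows[-1][0]  # first box of the current row
            limit = row_tolerance * max(box[3] - box[1], ref[3] - ref[1])
            if abs(center_y(box) - center_y(ref)) <= limit:
                rows[-1].append(box)
                continue
        rows.append([box])

    # Within each row, order boxes by their left edge.
    ordered = [box for row in rows for box in sorted(row, key=lambda b: b[0])]
    return {label: box for label, box in enumerate(ordered, start=1)}
```

The returned labels are the numbers that would be drawn next to each blue box, and they are the values the model refers to in `{"structure": <index>}`.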
Output Format
[
  {
    "reactants": [{"structure": 1}, {"text": "H₂O"}],
    "conditions": [{"text": "Δ, 2h"}],
    "products": [{"structure": 2}]
  }
]
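For quick inspection, a parsed reaction can be rendered as a one-line equation in which structures appear as their bounding-box labels. This is a small illustrative sketch; the helper names and the arrow notation are hypothetical choices, not a format defined by the project.

```python
def _render(item: dict) -> str:
    """Render one entry: structures as [index], text and identifiers verbatim."""
    if "structure" in item:
        return f"[{item['structure']}]"
    if "identifier" in item:
        return str(item["identifier"])
    return str(item.get("text", ""))


def format_reaction(rxn: dict) -> str:
    """Turn one reaction object into a readable single-line equation."""
    reactants = " + ".join(_render(i) for i in rxn["reactants"])
    products = " + ".join(_render(i) for i in rxn["products"])
    conditions = "; ".join(_render(i) for i in rxn["conditions"])
    arrow = f" --({conditions})--> " if conditions else " --> "
    return reactants + arrow + products


example = {
    "reactants": [{"structure": 1}, {"text": "H₂O"}],
    "conditions": [{"text": "Δ, 2h"}],
    "products": [{"structure": 2}],
}
print(format_reaction(example))  # [1] + H₂O --(Δ, 2h)--> [2]
```

Here `[1]` and `[2]` refer to the numbered boxes in the annotated input image, so mapping them back to molecules requires the box coordinates from the annotation step.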
Training Details
- Base model: Qwen2.5-VL-7B-Instruct
- Training type: Full fine-tuning (all parameters)
- Training data: U-RxnDiagram-15k (~15,000 images, 4× augmented → ~59,000 samples)
- Max sequence length: 16,384 tokens
- Framework: ms-swift + DeepSpeed ZeRO-2
License
This model is released under the CC BY-NC 4.0 license.
Citation
@misc{song2026rxncaptionreformulatingreactiondiagram,
title={RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning},
author={Jiahe Song and Chuang Wang and Bowen Jiang and Yinfan Wang and Hao Zheng and Xingjian Wei and Chengjin Liu and Rui Nie and Junyuan Gao and Jiaxing Sun and Yubin Wang and Lijun Wu and Zhenhua Huang and Jiang Wu and Qian Yu and Conghui He},
year={2026},
eprint={2511.02384},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.02384},
}