---
license: apache-2.0
---

<p align="left">
中文 | <a href="https://huggingface.co/SCUT-DLVCLab/TongGu-VL-2B-Instruct/blob/main/README_en.md">English</a>
</p>

<div align="center">
<img src="./images/通古logo.png" width="400"/>
</div>

# TongGu (通古大模型)

## Introduction

TongGu-VL-2B-Instruct is a multimodal model for ancient Chinese books developed by the Deep Learning and Vision Computing Lab at South China University of Technology (SCUT-DLVCLab). It has strong multimodal understanding of ancient texts, supporting tasks such as text recognition in ancient books and the appreciation of calligraphy and paintings.

## Evaluation Results

TongGu-VL-2B-Instruct surpasses existing models on a wide range of multimodal ancient-text tasks. TongGu-VL will continue to be updated and will benefit from stronger base models in the future.

<div align="center">
<img src="./images/evaluation_table.png">
</div>

# Open-Source Release

## Model

[**TongGu-VL-2B-Instruct**](https://huggingface.co/SCUT-DLVCLab/TongGu-VL-2B-Instruct): a 2B multimodal model for ancient Chinese books, obtained by instruction tuning on 358K multimodal ancient-text samples. It supports text recognition, calligraphy appreciation, and related tasks.

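To fetch the weights for local use, one option is `snapshot_download` from the `huggingface_hub` package (a minimal sketch, not taken from the original card):

```python
# Download the full model repository into the local Hugging Face cache.
# Assumes huggingface_hub is installed (pip install huggingface_hub).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("SCUT-DLVCLab/TongGu-VL-2B-Instruct")
print(local_dir)  # local directory that now contains the model files
```
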
## Data

**CCS358k**: 358K multimodal instruction-tuning samples for ancient Chinese books, covering seven major scenarios including ancient texts, illustrations, and paintings.

The CCS358k dataset may be used for non-commercial research purposes only. Researchers or organizations who wish to use CCS358k should first fill out this [application form](https://github.com/SCUT-DLVCLab/TongGu-VL/blob/main/application-form/Application-Form-for-Using-CCS358K.docx) and send it to us by email. When submitting the form, please list or attach 1-2 papers you have published within the past six years to demonstrate that you (or your team) conduct research on ancient Chinese texts.
Once we receive and approve your application, we will provide the download link and decompression password.
All users must comply with all conditions of use; otherwise, the authorization will be revoked.

# News

- 2025/07/06: The TongGu paper was accepted by ACM MM 2025.

# Inference

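The example below assumes the `transformers` library and the `qwen-vl-utils` helper package are installed (for instance via `pip install transformers qwen-vl-utils`); the model card does not pin exact versions.
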
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the processor and model from the Hugging Face Hub.
model_id = "SCUT-DLVCLab/TongGu-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

def use_model(input_image, input_prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": input_image},
                {"type": "text", "text": input_prompt},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )

    # Extra OCR-guided inputs consumed by the model's custom remote code:
    # the text prompt followed by the vision placeholder tokens.
    guided_text = (
        messages[0]["content"][1]["text"]
        + "<|vision_start|><|image_pad|><|vision_end|>"
    )
    inputs_ocr = processor(
        text=[guided_text],
        images=image_inputs,
        videos=video_inputs,
        padding=False,
        return_tensors="pt",
    )
    inputs["input_ids_ocr"] = inputs_ocr["input_ids"]
    inputs["attention_mask_ocr"] = inputs_ocr["attention_mask"]
    inputs = inputs.to(model.device)

    # Inference: generate, then strip the prompt tokens from the output.
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,  # enable sampling so temperature/top_p/top_k take effect
        temperature=0.8,
        top_p=0.95,
        top_k=50,
    )
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]

image = "your_image.png"  # path or URL of the input image
prompt = "识别图中文字"  # "Recognize the text in the image"

print(use_model(image, prompt))
```
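
Other capabilities listed above can be exercised by changing the prompt. A hypothetical follow-up call, reusing `use_model()` from the snippet above (the exact prompt wording is an assumption, not taken from the model card):

```python
# Hypothetical prompt for calligraphy appreciation; adjust to your own image.
print(use_model("calligraphy_sample.png", "赏析这幅书法作品"))  # "Appreciate this calligraphy work"
```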

# Citation

```bibtex
@inproceedings{cao2025tonggu,
  title={TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies through Parameter Sensitivity-Guided Instruction Tuning},
  author={Cao, Jiahuan and Liu, Yang and Zhang, Peirong and Shi, Yongxin and Ding, Kai and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={11111--11120},
  year={2025}
}
```

# Disclaimer

After instruction tuning on large-scale data, TongGu-VL has strong multimodal understanding of ancient Chinese books, including text recognition and calligraphy appreciation. However, limited by its model size and the autoregressive generation paradigm, TongGu-VL may still produce misleading responses containing factual errors, or harmful content containing bias and discrimination. Please use it with caution and screen its outputs carefully, and do not spread harmful content generated by TongGu-VL on the internet. Anyone who disseminates such content is responsible for any adverse consequences.