<div align="center">
<img src="./images/通古logo.png" width="400"/>
</div>

# TongGu-VL Model

## Introduction

TongGu-VL-2B-Instruct is a multimodal model for classical Chinese literature developed by the Deep Learning and Visual Computing Laboratory at South China University of Technology (SCUT-DLVCLab). It possesses strong capabilities in understanding classical Chinese multimodal content, including tasks such as text recognition in ancient documents and appreciation of calligraphy and paintings.
## Evaluation Results

TongGu-VL-2B-Instruct surpasses existing models on a wide range of multimodal tasks involving classical Chinese literature. The TongGu-VL series will continue to be updated as more powerful base models become available.

<div align="center">
<img src="./images/evaluation_table.png">
</div>
# Released Resources

## Models

[**TongGu-VL-2B-Instruct**](https://huggingface.co/SCUT-DLVCLab/TongGu-VL-2B-Instruct): A 2B-parameter multimodal model for classical Chinese literature. It was instruction-tuned on 358K classical multimodal documents and supports tasks such as text recognition and calligraphy appreciation.
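If you prefer to fetch the weights ahead of time, the model can be pre-downloaded with the standard `huggingface_hub` API. This is a convenience sketch, not part of the official instructions; the Inference section below loads the model directly from the Hub by its repo id.

```python
# Optional convenience sketch: pre-download the TongGu-VL weights.
# Not part of the official instructions; the Inference section below
# loads the model directly from the Hub by its repo id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="SCUT-DLVCLab/TongGu-VL-2B-Instruct")
print(f"Model files downloaded to: {local_dir}")
```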
## Data

**CCS358k**: A dataset containing 358K multimodal classical Chinese instruction-tuning samples, covering 7 major scenarios including classical texts, illustrations, and paintings.

The CCS358k dataset is available for non-commercial research purposes only. Scholars or organizations wishing to use it must first complete this [application form](https://github.com/SCUT-DLVCLab/TongGu-VL/blob/main/application-form/Application-Form-for-Using-CCS358K.docx) and email it to us. When submitting the application, please list 1-2 papers published in the last 6 years to demonstrate your (or your team's) research experience in the field of classical literature.

After we receive and approve your application, we will provide you with the download link and extraction password. All users must comply with all usage terms; otherwise, authorization will be revoked.
# News

- 2025/07/06: The TongGu-VL paper was accepted at ACM MM 2025.
# Inference

```python
# Tested with transformers==4.48.2; also requires qwen-vl-utils
# (pip install qwen-vl-utils).
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "SCUT-DLVCLab/TongGu-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)


def use_model(input_image, input_prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": input_image},
                {"type": "text", "text": input_prompt},
            ],
        }
    ]
    # Prepare the standard chat inputs: apply the chat template and
    # extract the image features.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    # Build the auxiliary OCR-guided sequence expected by TongGu-VL's
    # custom modeling code: the raw prompt followed by the image
    # placeholder tokens.
    guided_text = input_prompt + "<|vision_start|><|image_pad|><|vision_end|>"
    inputs_ocr = processor(
        text=[guided_text],
        images=image_inputs,
        videos=video_inputs,
        padding=False,
        return_tensors="pt",
    )
    inputs["input_ids_ocr"] = inputs_ocr["input_ids"]
    inputs["attention_mask_ocr"] = inputs_ocr["attention_mask"]
    inputs = inputs.to(model.device)
    # Generate the output; do_sample=True is required for the
    # temperature/top_p/top_k settings to take effect.
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        top_k=50,
    )
    # Strip the prompt tokens and decode only the newly generated text.
    generated_ids_trimmed = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return output_text[0]


image = "path/to/your/image.jpg"  # local path, URL, or PIL image
prompt = "Identify the text in the image."
print(use_model(image, prompt))
```
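For text-recognition prompts, deterministic decoding is often preferable to sampling. The variant below reuses the `model` and `inputs` objects prepared inside `use_model` above; `do_sample=False` is our illustrative choice, not an officially recommended setting.

```python
# Greedy-decoding sketch for OCR-style prompts, reusing `model` and
# `inputs` from the snippet above. With do_sample=False, the sampling
# parameters (temperature/top_p/top_k) are unnecessary.
generated_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
```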
# Citation

```
@inproceedings{cao2025tonggu,
  title={TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies through Parameter Sensitivity-Guided Instruction Tuning},
  author={Cao, Jiahuan and Liu, Yang and Zhang, Peirong and Shi, Yongxin and Ding, Kai and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={11111--11120},
  year={2025}
}
```
# Disclaimer

After extensive instruction tuning on large-scale data, TongGu-VL has developed strong multimodal understanding capabilities for classical Chinese literature, such as text recognition and calligraphy appreciation. However, due to limitations in model scale, the autoregressive generation paradigm, and other factors, TongGu-VL may still produce misleading responses that contain factual errors, or harmful content exhibiting bias or discrimination. Please use it with caution and exercise critical judgment. Do not disseminate harmful content generated by TongGu-VL on the internet. Any adverse consequences arising from such dissemination are the sole responsibility of the disseminator.