
TongGu-VL Model

Introduction

TongGu-VL-2B-Instruct is a multimodal model for classical Chinese literature developed by the Deep Learning and Visual Computing Laboratory at South China University of Technology (SCUT-DLVCLab). It possesses strong capabilities in understanding classical Chinese multimodal content, including tasks such as text recognition in ancient documents and appreciation of calligraphy and paintings.

Evaluation Results

TongGu-VL-2B-Instruct outperforms existing models on many multimodal tasks involving classical Chinese literature. The TongGu-VL series will continue to be updated, building on more powerful base models.

Released Resources

Models

TongGu-VL-2B-Instruct: A 2B-parameter multimodal model for classical Chinese literature. It was instruction-tuned on 358K multimodal classical-Chinese samples and supports tasks such as text recognition in ancient documents and calligraphy appreciation.

Data

CCS358k: A dataset containing 358K multimodal classical Chinese instruction-tuning samples, covering 7 major scenarios including classical texts, illustrations, and paintings.

The CCS358k dataset is only available for non-commercial research purposes. Scholars or organizations wishing to use the CCS358k dataset must first complete this application form and email it to us. When submitting the application, please include or list 1-2 papers published in the last 6 years to demonstrate your (or your team’s) research experience in the field of classical literature. After we receive and approve your application, we will provide you with the download link and extraction password. All users must comply with all usage terms; otherwise, authorization will be revoked.

News

  • 2025/07/06 The TongGu paper was accepted at ACM MM 2025.

Inference

# transformers == 4.48.2

import torch
from transformers import AutoProcessor
from transformers import AutoModelForCausalLM
from qwen_vl_utils import process_vision_info


model_id = "SCUT-DLVCLab/TongGu-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)

def use_model(input_image, input_prompt):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": input_image,
                },
                {"type": "text", "text": input_prompt},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    
    # Build the OCR-guided input: the text prompt followed by the image placeholder
    # tokens. The model's custom forward (loaded via trust_remote_code) expects these
    # as input_ids_ocr / attention_mask_ocr.
    guided_text = messages[0]["content"][1]["text"] + '<|vision_start|><|image_pad|><|vision_end|>'
    inputs_ocr = processor(text=[guided_text], images=image_inputs, videos=video_inputs, padding=False, return_tensors="pt")
    inputs["input_ids_ocr"] = inputs_ocr["input_ids"]
    inputs["attention_mask_ocr"] = inputs_ocr["attention_mask"]
    inputs = inputs.to("cuda")

    # Inference: Generation of the output
    generated_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.8, top_p=0.95, top_k=50)
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    
    return output_text[0]

image = "you image here"
prompt = "Identify the text in the image."

print(use_model(image, prompt))
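The `generated_ids_trimmed` slicing in `use_model` works because `model.generate()` returns the prompt tokens followed by the newly generated tokens; dropping the first `len(input_ids)` entries leaves only the model's answer. A minimal sketch with hypothetical token ids (plain Python lists, no model required):

```python
# generate() output = prompt tokens + newly generated tokens,
# so slicing off the prompt length keeps only the new tokens.
input_ids = [[101, 7592, 2088, 102]]                        # hypothetical prompt ids
generated_ids = [[101, 7592, 2088, 102, 999, 1000, 1001]]   # prompt + new tokens

trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]
print(trimmed)  # [[999, 1000, 1001]]
```

This is why the decoded output contains only the response text and not an echo of the prompt.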

Citation

@inproceedings{cao2025tonggu,
  title={TongGu-VL: Advancing Visual-Language Understanding in Chinese Classical Studies through Parameter Sensitivity-Guided Instruction Tuning},
  author={Cao, Jiahuan and Liu, Yang and Zhang, Peirong and Shi, Yongxin and Ding, Kai and Jin, Lianwen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={11111--11120},
  year={2025}
}

Disclaimer

After extensive instruction tuning on large-scale data, TongGu-VL has developed strong multimodal understanding capabilities for classical Chinese literature, such as text recognition and calligraphy appreciation. However, due to limitations in model scale, the autoregressive generation paradigm, and other factors, TongGu-VL may still generate misleading responses containing factual errors, or harmful, biased, or discriminatory content. Please use it with caution and exercise critical judgment. Do not disseminate harmful content generated by TongGu-VL on the internet. Any adverse consequences arising from such dissemination are the sole responsibility of the disseminator.