---
license: mit
tags:
- multimodal
- llama-factory
- qwen
- traditional-chinese-medicine
- tcm
- herb-recognition
- medical
- visual-question-answering
- vqa
base_model: Qwen/Qwen2.5-VL
---

# TCM-VisResolve: Multimodal Recognition of Dried Herbs and Clinical MCQ Answering in TCM

**TCM-VisResolve (TCM-VR)** is a domain-specific multimodal large language model (MLLM) designed for Traditional Chinese Medicine (TCM).

Fine-tuned from **Qwen2.5-VL**, the model is engineered to bridge the gap between visual recognition and symbolic reasoning in TCM. It excels at two primary tasks:

1. **Recognizing images of dried medicinal herbs.**
2. **Answering clinical-style multiple-choice questions (MCQs).**

This model was trained using the **LLaMA Factory** framework.

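As a hedged sketch only, a supervised fine-tuning run in LLaMA Factory is usually driven by a YAML config; the field names below follow LLaMA Factory's example configs, but every value, the dataset name, and the `finetuning_type` are illustrative assumptions, not the published TCM-VR recipe:

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-VL   # base checkpoint named in this card

### method
stage: sft
do_train: true
finetuning_type: full                 # assumption: the card does not state LoRA vs. full

### dataset
dataset: tcm_herb_mcq                 # hypothetical name for the 220k-image / 220k-MCQ set
template: qwen2_vl

### output
output_dir: saves/tcm-visresolve
```

Such a config is launched with `llamafactory-cli train <config>.yaml`.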
## 👨‍💻 Authors and Affiliations

* **Wudao Yang**
  * Affiliation: CS Dept., School of Mathematics & CS, Yunnan Minzu University, Kunming, China
  * Email: wudaoyang@ymu.edu.cn
* **Zhiqiang Yu** (corresponding author)
  * Affiliation: CS Dept., School of Mathematics & CS, Yunnan Minzu University, Kunming, China
  * Email: yzqyt@ymu.edu.cn
* **Chee Seng Chan** (corresponding author)
  * Affiliation: AI Dept., Faculty of CS & IT (FSKTM), Universiti Malaya, Kuala Lumpur, Malaysia
  * Email: cs.chan@um.edu.my

## 🚀 Key Features

* **Domain-Specific Expertise:** Fine-tuned on a large dataset of 220,000 herb images (163 classes) and 220,000 clinical MCQs.
* **High Accuracy:** Achieves **96.7% accuracy** on a held-out test suite of TCM-related MCQs, significantly outperforming general-purpose models such as GPT-4o and Gemini.
* **Robust & Reliable:** Incorporates a **Cross-Transformation Memory Mechanism (CTMM)** to prevent overfitting and answer-position bias, forcing the model to reason about content rather than memorize surface patterns.

## 🛠️ Training Procedure

### Base Model
TCM-VR uses **Qwen2.5-VL** as its vision-language backbone.

### Dataset
The model was fine-tuned on a specially curated dataset:
* **Images:** 220,000 real-world images of dried and processed herbs across 163 categories.
* **Text:** 220,000 multiple-choice questions (880,000 answer options in total), structured in a vision-language JSON format.

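For reference, LLaMA Factory's multimodal datasets pair each image with a conversation record. A record in that style might look like the following; the field names follow LLaMA Factory's multimodal `sharegpt`-style format, but the exact schema used for TCM-VR is an assumption. (The user turn asks, roughly, "What is this Chinese medicine? Which of the following is not an indication of this herb?"; the assistant turn gives the herb name and the correct option.)

```json
[
  {
    "messages": [
      {
        "role": "user",
        "content": "<image>这是什么中药? 以下哪项不是该药材的适应症? A. ... B. ... C. ... D. ..."
      },
      {
        "role": "assistant",
        "content": "名称:干姜 ... 正确答案是 A"
      }
    ],
    "images": ["data/mcq_output/gangjiang_gangjiang_1577.jpg"]
  }
]
```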
### Cross-Transformation Memory Mechanism (CTMM)
To ensure the model learns to reason rather than memorize, the CTMM was applied during training. This mechanism enforces semantic consistency by:
1. **Paraphrasing prompts:** using varied linguistic structures for semantically identical questions.
2. **Shuffling answer orders:** randomizing the `A, B, C, D` options for each question to disrupt positional biases.

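The answer-order shuffling step can be sketched in a few lines; this is a minimal illustration of the idea, not the paper's implementation, and the function name is hypothetical:

```python
import random


def shuffle_options(question: str, options: dict, answer: str,
                    rng: random.Random) -> tuple:
    """Shuffle MCQ option texts across letters and remap the answer key.

    `options` maps letters ("A".."D") to option texts; `answer` is the
    letter of the correct option before shuffling.
    """
    letters = sorted(options)                 # ["A", "B", "C", "D"]
    texts = [options[letter] for letter in letters]
    rng.shuffle(texts)                        # permute the option texts in place
    shuffled = dict(zip(letters, texts))
    # The correct answer keeps its text but may land on a new letter.
    new_answer = next(l for l, t in shuffled.items() if t == options[answer])
    return question, shuffled, new_answer
```

Applied across many paraphrased copies of each question, this forces the model to match option *content* rather than a fixed answer position.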
## 📊 Performance
On a held-out test split, TCM-VR achieves **96.7% accuracy** on multimodal clinical MCQs. Case studies confirm that the model identifies the correct answer even when its position is shuffled, indicating genuine content-based reasoning rather than positional memorization.

## 💡 How to Use
You can use this model like other Qwen2.5-VL models for multimodal chat. The model expects an image plus a structured query, mirroring the training data shown in the paper (Figure 2). Note that Qwen2.5-VL checkpoints load via `Qwen2_5_VLForConditionalGeneration` and `AutoProcessor` (a recent `transformers` release is required), not the text-only `AutoModelForCausalLM`/`AutoTokenizer` pair.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the model and processor
# !! Replace "your-username/TCM-VisResolve" with your model's HF path !!
model_id = "your-username/TCM-VisResolve"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

# 1. Load your herb image
# Example using the 'Gan Jiang' (dried ginger) image from the paper
image_path = "path/to/your/herb_image.jpg"  # e.g., "data/mcq_output/gangjiang_gangjiang_1577.jpg"
image = Image.open(image_path)

# 2. Format your query: the MCQ text accompanies the image.
# This example is based on Figure 2 in the paper. Translation: "What is this
# Chinese medicine? Which of the following is NOT an indication of this herb?
# A. Indications: red, sore eyes; cough with rising qi... B. Indications:
# nourishes the liver, brightens the eyes... C. Indications: unstoppable
# nosebleed... D. Indications: back carbuncle..."
question = "这是什么中药? 以下哪项不是该药材的适应症? A. 主治:赤眼涩痛;咳嗽上气... B. 主治:补肝明目... C. 主治:鼻血不止... D. 主治:背痈..."

# Build the chat-formatted prompt; the processor handles both text and image
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# 3. Generate the response
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Strip the prompt tokens so only the newly generated answer remains
generated_ids = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

# Example output (based on Figure 2):
# "名称:干姜 拼音:gangjiang ... 功效:主治:赤眼涩痛... 正确答案是 A 选项内容:..."
# Translation: "Name: Gan Jiang  Pinyin: gangjiang ... Efficacy / indications:
# red, sore eyes... The correct answer is option A; option content: ..."
```