---
license: apache-2.0
---
# Diagram Formalizer
Model Structure: 

<p align="center">
  <img src="sample/diagram_formalizer.png" alt="Diagram Formalizer model structure" width="50%" height="auto">
</p>


- **Diagram Encoder**: [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)

- **Lightweight LLM**: [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)



## Quick Start
Before running the script, install the necessary dependencies:

```shell
pip install torch==2.4.0 transformers==4.40.0 accelerate pillow sentencepiece
```
You can use the following script to predict the ConsCDL and ImgCDL for a geometric diagram.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# set device
device = 'cuda'  # or 'cpu' if no GPU is available
torch.set_default_device(device)

# create model
model = AutoModelForCausalLM.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    use_fast=True,
    padding_side="right",
    trust_remote_code=True)

# text prompt
img_path = 'sample/4927.png'
prompt = 'Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.'
text = f'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\n{prompt}<|im_end|>\n<|im_start|>assistant\n'

def tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors=None):
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])

    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])

    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids
    
input_ids = tokenizer_image_token(text, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# load the image; sample images can be found in the sample folder
image = Image.open(img_path).convert('RGB')

image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# generate
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        temperature=None,
        top_p=None,
        top_k=None,
        num_beams=1,
        max_new_tokens=3500,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=None,
        use_cache=True
    )[0]


response = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
print(response)

```
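The `<image>` splicing performed by `tokenizer_image_token` can be illustrated in isolation. This is a toy sketch with made-up token ids and a hypothetical helper name, independent of any real tokenizer:

```python
# Toy illustration of the splicing done by tokenizer_image_token: the text
# chunks around each <image> are tokenized separately, then joined with the
# special image_token_index (-200) inserted between them. The model later
# replaces that placeholder id with the encoded image features.
def splice_image_token(chunks, image_token_index=-200):
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            out.append(image_token_index)  # placeholder between text chunks
        out.extend(chunk)
    return out

# "hello <image> world" tokenized as [[101, 102], [103, 104]]
print(splice_image_token([[101, 102], [103, 104]]))  # [101, 102, -200, 103, 104]
```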

Our model supports the following recognition instructions:
- Natural Language Description: 
    - Describe what you see in the figure.
    - Tell me what you observe in the image.
- Predicting ConsCDL only:
    - Based on the image, predict the construction_cdl.
    - Based on the image, predict the construction_cdl and calibrate it.
    - Based on the image, first describe what you see in the figure, then predict the construction_cdl.
    - Based on the image, first describe what you see in the figure, then predict the construction_cdl and calibrate it.
- Predicting ImgCDL only:
    - Based on the image, predict the image_cdl.
    - Based on the image, predict the image_cdl and calibrate it.
    - Based on the image, first describe what you see in the figure, then predict the image_cdl.
    - Based on the image, first describe what you see in the figure, then predict the image_cdl and calibrate it.
- Predicting construction_cdl and image_cdl simultaneously:
    - Based on the image, predict the construction_cdl and image_cdl.
    - Based on the image, first predict the construction_cdl and image_cdl and calibrate it.
    - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl.
    - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.
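Any of the instructions above can be dropped into the chat template used in the Quick Start script. A minimal sketch (the template string is copied from the script; the helper name `build_prompt` is illustrative, not part of the model's API):

```python
def build_prompt(instruction: str) -> str:
    # Qwen2-style chat template from the Quick Start script; <image> marks
    # where tokenizer_image_token splices in the image placeholder token.
    return (
        '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n'
        f'<|im_start|>user\n<image>\n{instruction}<|im_end|>\n'
        '<|im_start|>assistant\n'
    )

text = build_prompt('Based on the image, predict the construction_cdl and calibrate it.')
print(text)
```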


## Performance of Diagram Formalizer on formalgeo7k test set
| Model | ConsCdlAcc | ConsCdlPerfect | ImgCdlAcc | ImgCdlPerfect | BothPerfect |
|-------|------------|----------------|-----------|---------------|-------------|
| Diagram Formalizer | 90.25 | 72.29 | 92.88 | 84.38 | 65.05 |