---
license: apache-2.0
---

# Diagram Formalizer

Model Structure:
<p align="center">
  <img src="sample/diagram_formalizer.png" alt="Diagram Formalizer model structure" width="50%" height="auto">
</p>
- **Diagram Encoder**: [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Lightweight LLM**: [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
## Quick Start

Before running the script, install the necessary dependencies:

```shell
pip install torch==2.4.0 transformers==4.40.0 accelerate pillow sentencepiece
```

You can use the following script to predict the ConsCDL and ImgCDL for a geometric diagram.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# set device
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)

# create model
model = AutoModelForCausalLM.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    torch_dtype=torch.float16,  # use torch.float32 for CPU
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    use_fast=True,
    padding_side="right",
    trust_remote_code=True)

# text prompt
img_path = 'sample/4927.png'
prompt = 'Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.'
text = f'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\n{prompt}<|im_end|>\n<|im_start|>assistant\n'


def tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors=None):
    # Tokenize the prompt around the <image> placeholder and splice in the
    # special image token index at each split point.
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])

    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])

    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids


input_ids = tokenizer_image_token(text, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# image; sample images can be found in the images folder
image = Image.open(img_path).convert('RGB')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# generate
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        temperature=None,
        top_p=None,
        top_k=None,
        num_beams=1,
        max_new_tokens=3500,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=None,
        use_cache=True
    )[0]

response = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
print(response)
```
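The `tokenizer_image_token` helper above replaces the `<image>` placeholder with a special image token index (`-200`) inside the token-id sequence. Its splicing logic can be checked in isolation; the token ids below are made up purely for illustration:

```python
def insert_separator(chunks, sep):
    # Interleave sep between consecutive chunks: [a, b, c] -> [a, sep, b, sep, c]
    return [ele for sublist in zip(chunks, [sep] * len(chunks)) for ele in sublist][:-1]

# Toy token-id chunks, as if a prompt had been split on '<image>'
prompt_chunks = [[101, 7592], [2088, 102]]

input_ids = []
for x in insert_separator(prompt_chunks, [-200]):
    input_ids.extend(x)

print(input_ids)  # [101, 7592, -200, 2088, 102]
```

The image token index `-200` never collides with real vocabulary ids (which are non-negative), so the model can later locate where to inject the image features.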
Our model supports the following recognition instructions:

- Natural Language Description:
  - Describe what you see in the figure.
  - Tell me what you observe in the image.
- Predicting ConsCDL only:
  - Based on the image, predict the construction_cdl.
  - Based on the image, predict the construction_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and calibrate it.
- Predicting ImgCDL only:
  - Based on the image, predict the image_cdl.
  - Based on the image, predict the image_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the image_cdl.
  - Based on the image, first describe what you see in the figure, then predict the image_cdl and calibrate it.
- Predicting ConsCDL and ImgCDL simultaneously:
  - Based on the image, predict the construction_cdl and image_cdl.
  - Based on the image, predict the construction_cdl and image_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.
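Any of these instructions can be dropped into the chat template used by the Quick Start script. A small helper makes this explicit (the `build_prompt` function name is ours for illustration, not part of the model's API):

```python
def build_prompt(instruction):
    # Qwen2-style chat template matching the Quick Start script; the <image>
    # placeholder is later replaced by tokenizer_image_token.
    return (
        '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n'
        '<|im_start|>user\n<image>\n'
        f'{instruction}<|im_end|>\n'
        '<|im_start|>assistant\n'
    )

text = build_prompt('Based on the image, predict the construction_cdl and calibrate it.')
print(text)
```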
## Performance of Diagram Formalizer on the formalgeo7k test set

| Model | ConsCdlAcc | ConsCdlPerfect | ImgCdlAcc | ImgCdlPerfect | BothPerfect |
|-------|------------|----------------|-----------|---------------|-------------|
| Diagram Formalizer | 90.25 | 72.29 | 92.88 | 84.38 | 65.05 |