---
license: apache-2.0
---

# Diagram Formalizer

Model Structure:
<p align="center">
  <img src="sample/diagram_formalizer.png" alt="Diagram Formalizer model structure" width="50%" height="auto">
</p>
- **Diagram Encoder**: [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Lightweight LLM**: [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
## Quick Start

Before running the script, install the necessary dependencies:

```shell
pip install torch==2.4.0 transformers==4.40.0 accelerate pillow sentencepiece
```

You can use the following script to predict the ConsCDL and ImgCDL for a geometric diagram.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# set device
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)

# create model
model = AutoModelForCausalLM.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    torch_dtype=torch.float16,  # use torch.float32 for CPU
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'NaughtyDog97/DiagramFormalizer',
    use_fast=True,
    padding_side="right",
    trust_remote_code=True)

# text prompt
img_path = 'sample/4927.png'
prompt = 'Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.'
text = f'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\n{prompt}<|im_end|>\n<|im_start|>assistant\n'


def tokenizer_image_token(prompt, tokenizer, image_token_index, return_tensors=None):
    # Tokenize the prompt around the <image> placeholder and splice in the
    # special image token index at each split point.
    prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split('<image>')]

    def insert_separator(X, sep):
        return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]

    input_ids = []
    offset = 0
    if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(prompt_chunks[0][0])

    for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(x[offset:])

    if return_tensors is not None:
        if return_tensors == 'pt':
            return torch.tensor(input_ids, dtype=torch.long)
        raise ValueError(f'Unsupported tensor type: {return_tensors}')
    return input_ids


input_ids = tokenizer_image_token(text, tokenizer, -200, return_tensors='pt').unsqueeze(0).to(device)

# image; sample images can be found in the images folder
image = Image.open(img_path).convert('RGB')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)

# generate
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        temperature=None,
        top_p=None,
        top_k=None,
        num_beams=1,
        max_new_tokens=3500,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=None,
        use_cache=True
    )[0]

response = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()
print(response)
```
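The `tokenizer_image_token` helper above replaces the `<image>` placeholder with a special image token index (`-200`) inside the token-id sequence. Its splicing logic can be checked in isolation; the token ids below are made up purely for illustration:

```python
def insert_separator(chunks, sep):
    # Interleave sep between consecutive chunks: [a, b, c] -> [a, sep, b, sep, c]
    return [ele for sublist in zip(chunks, [sep] * len(chunks)) for ele in sublist][:-1]

# Toy token-id chunks, as if a prompt had been split on '<image>'
prompt_chunks = [[101, 7592], [2088, 102]]

input_ids = []
for x in insert_separator(prompt_chunks, [-200]):
    input_ids.extend(x)

print(input_ids)  # [101, 7592, -200, 2088, 102]
```

The image token index `-200` never collides with real vocabulary ids (which are non-negative), so the model can later locate where to inject the image features.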
Our model supports the following recognition instructions:

- Natural Language Description:
  - Describe what you see in the figure.
  - Tell me what you observe in the image.
- Predicting ConsCDL only:
  - Based on the image, predict the construction_cdl.
  - Based on the image, predict the construction_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and calibrate it.
- Predicting ImgCDL only:
  - Based on the image, predict the image_cdl.
  - Based on the image, predict the image_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the image_cdl.
  - Based on the image, first describe what you see in the figure, then predict the image_cdl and calibrate it.
- Predicting ConsCDL and ImgCDL simultaneously:
  - Based on the image, predict the construction_cdl and image_cdl.
  - Based on the image, predict the construction_cdl and image_cdl and calibrate it.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl.
  - Based on the image, first describe what you see in the figure, then predict the construction_cdl and image_cdl and calibrate it.
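Any of these instructions can be dropped into the chat template used by the Quick Start script. A small helper makes this explicit (the `build_prompt` function name is ours for illustration, not part of the model's API):

```python
def build_prompt(instruction):
    # Qwen2-style chat template matching the Quick Start script; the <image>
    # placeholder is later replaced by tokenizer_image_token.
    return (
        '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n'
        '<|im_start|>user\n<image>\n'
        f'{instruction}<|im_end|>\n'
        '<|im_start|>assistant\n'
    )

text = build_prompt('Based on the image, predict the construction_cdl and calibrate it.')
print(text)
```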
## Performance of Diagram Formalizer on the formalgeo7k test set

| Model | ConsCdlAcc | ConsCdlPerfect | ImgCdlAcc | ImgCdlPerfect | BothPerfect |
|-------|------------|----------------|-----------|---------------|-------------|
| Diagram Formalizer | 90.25 | 72.29 | 92.88 | 84.38 | 65.05 |