## Model Summary

BigDocs-Phi-3.5-instruct is a multi-modal model trained with BigDocs for document intelligence tasks. We use microsoft/Phi-3.5-vision-instruct as the base model and perform two stages of training:

1. Continual Pre-Training (CPT) with BigDocs-CPT, keeping the encoder and adapter trainable.
2. Fine-Tuning (FT) with DocDownstream-1.0, keeping the decoder and adapter trainable.
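The two-stage scheme above amounts to freezing different submodules at each stage. Below is a minimal PyTorch-style sketch of that parameter-freezing pattern using toy stand-in modules; the names `encoder`, `adapter`, and `decoder` are illustrative only, not the model's actual attribute names.

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for a vision-language model with three components."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)   # vision encoder
        self.adapter = nn.Linear(8, 8)   # vision-language adapter
        self.decoder = nn.Linear(8, 8)   # language decoder

def set_trainable(model, trainable_names):
    """Freeze all parameters, then unfreeze the named submodules."""
    for p in model.parameters():
        p.requires_grad = False
    for name in trainable_names:
        for p in getattr(model, name).parameters():
            p.requires_grad = True

model = ToyVLM()
# Stage 1 (CPT): encoder + adapter trainable, decoder frozen.
set_trainable(model, ["encoder", "adapter"])
# Stage 2 (FT): decoder + adapter trainable, encoder frozen.
set_trainable(model, ["decoder", "adapter"])
```

The adapter stays trainable throughout, while the encoder and decoder alternate, so each stage only updates the components relevant to its objective.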
## General Document Benchmarks

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them against base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]. BigDocs-trained models show consistent performance across tasks.
| **Model** | **DocVQA**<br>*VAL* | **InfoVQA**<br>*VAL* | **DeepForm**<br>*TEST* | **KLC**<br>*TEST* | **WTQ**<br>*TEST* | **TabFact**<br>*TEST* | **ChartQA**<br>*TEST* | **TextVQA**<br>*VAL* | **MMMU**<br>*VAL* | **DudeMini**<br>*TEST* | **SlideVQA-M**<br>*TEST* | **TableVQA**<br>*TEST* | **Avg. Score** |
|-----------|---------------------|----------------------|------------------------|-------------------|-------------------|-----------------------|-----------------------|----------------------|-------------------|------------------------|--------------------------|-------------------------|----------------|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
| **Phi3.5-Vision-4B + BigDocs (Ours)** | **87.05** | **70.05** | **70.97** | **37.45** | **51.21** | **81.24** | **81.56** | **68.72** | **45.00** | **36.15** | **32.47** | **67.77** | **60.80** |
| LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
| LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
| LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
| Llama-3.2-90B | 74.15* | 48.71 | 4.18 | 1.81 | 24.20 | 63.01 | 11.36* | 71.69 | 57.78 | 41.24 | 26.09 | 41.57 | 38.82 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 69.10 | 54.55 | 67.58 | 72.87 | 64.62 |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 64.78 | 35.11 | 0.00 | 81.27 | 50.73 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 58.22 | 48.15 | 52.05 | 80.43 | 57.05 |
| Qwen2-VL-72B | 96.50 | 84.50 | 30.45 | 24.78 | 55.63 | 0.00 | 88.30 | 85.50 | 64.50 | 35.87 | 2.15 | 74.23 | 58.40 |
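The **Avg. Score** column appears to be the unweighted mean of the twelve benchmark scores in each row. A quick check for the highlighted Phi3.5-Vision-4B + BigDocs row:

```python
# Scores for Phi3.5-Vision-4B + BigDocs (Ours), taken from the table above.
scores = [87.05, 70.05, 70.97, 37.45, 51.21, 81.24,
          81.56, 68.72, 45.00, 36.15, 32.47, 67.77]

# Unweighted mean across the 12 benchmarks, rounded to 2 decimals.
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 60.8, matching the table's 60.80
```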
### Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct.

Single image:
```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```
Multi-turn conversations:
```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
For multi-image usage, add multiple image placeholders at the front of the prompt; the `<|image_{i}|>` indices start from 1. An example prompt:
```
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
```
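In practice the processor's `apply_chat_template` produces this format for you; as an illustration only, the placeholder layout above can be generated with a small helper (`build_prompt` is hypothetical, not part of the model's API):

```python
def build_prompt(num_images: int, prompt: str) -> str:
    """Assemble a Phi-3.5-vision-style single-turn prompt with image
    placeholders <|image_1|> ... <|image_N|> prepended to the user turn."""
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{prompt}<|end|>\n<|assistant|>\n"

# Two-image example; matches the multi-image format shown above.
print(build_prompt(2, "Describe the figures."))
```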
### Loading the model locally

After obtaining the BigDocs-Phi-3.5-instruct checkpoint, users can run the following sample code for inference.
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation='flash_attention_2'
)

# For best performance, use num_crops=4 for multi-frame and num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4
)

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args
)

# Remove the input tokens from the generated sequence before decoding.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(response)
```