## Model Summary

BigDocs-Phi-3.5-instruct is a multi-modal model trained with BigDocs for document intelligence tasks. We use microsoft/Phi-3.5-vision-instruct as the base model and perform two stages of training:

1. Continual Pre-Training (CPT) with BigDocs-CPT, keeping the encoder and adapter trainable.
2. Fine-Tuning (FT) with DocDownstream-1.0, keeping the decoder and adapter trainable.
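The two-stage scheme above amounts to freezing different submodules at each stage. Below is a minimal PyTorch-style sketch of that parameter-freezing pattern using toy stand-in modules; the names `encoder`, `adapter`, and `decoder` are illustrative only, not the model's actual attribute names.

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for a vision-language model with three components."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)   # vision encoder
        self.adapter = nn.Linear(8, 8)   # vision-language adapter
        self.decoder = nn.Linear(8, 8)   # language decoder

def set_trainable(model, trainable_names):
    """Freeze all parameters, then unfreeze the named submodules."""
    for p in model.parameters():
        p.requires_grad = False
    for name in trainable_names:
        for p in getattr(model, name).parameters():
            p.requires_grad = True

model = ToyVLM()
# Stage 1 (CPT): encoder + adapter trainable, decoder frozen.
set_trainable(model, ["encoder", "adapter"])
# Stage 2 (FT): decoder + adapter trainable, encoder frozen.
set_trainable(model, ["decoder", "adapter"])
```

The adapter stays trainable throughout, while the encoder and decoder alternate, so each stage only updates the components relevant to its objective.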
## General Document Benchmarks

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them against base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]. BigDocs-trained models show consistent performance across tasks.
| **Model** | **DocVQA**<br>*VAL* | **InfoVQA**<br>*VAL* | **DeepForm**<br>*TEST* | **KLC**<br>*TEST* | **WTQ**<br>*TEST* | **TabFact**<br>*TEST* | **ChartQA**<br>*TEST* | **TextVQA**<br>*VAL* | **MMMU**<br>*VAL* | **DudeMini**<br>*TEST* | **SlideVQA-M**<br>*TEST* | **TableVQA**<br>*TEST* | **Avg. Score** |
|-----------|---------------------|----------------------|------------------------|-------------------|-------------------|-----------------------|-----------------------|----------------------|-------------------|------------------------|--------------------------|-------------------------|----------------|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
| **Phi3.5-Vision-4B + BigDocs (Ours)** | **87.05** | **70.05** | **70.97** | **37.45** | **51.21** | **81.24** | **81.56** | **68.72** | **45.00** | **36.15** | **32.47** | **67.77** | **60.80** |
| LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
| LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
| LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
| Llama-3.2-90B | 74.15* | 48.71 | 4.18 | 1.81 | 24.20 | 63.01 | 11.36* | 71.69 | 57.78 | 41.24 | 26.09 | 41.57 | 38.82 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 69.10 | 54.55 | 67.58 | 72.87 | 64.62 |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 64.78 | 35.11 | 0.00 | 81.27 | 50.73 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 58.22 | 48.15 | 52.05 | 80.43 | 57.05 |
| Qwen2-VL-72B | 96.50 | 84.50 | 30.45 | 24.78 | 55.63 | 0.00 | 88.30 | 85.50 | 64.50 | 35.87 | 2.15 | 74.23 | 58.40 |
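The **Avg. Score** column appears to be the unweighted mean of the twelve benchmark scores in each row. A quick check for the highlighted Phi3.5-Vision-4B + BigDocs row:

```python
# Scores for Phi3.5-Vision-4B + BigDocs (Ours), taken from the table above.
scores = [87.05, 70.05, 70.97, 37.45, 51.21, 81.24,
          81.56, 68.72, 45.00, 36.15, 32.47, 67.77]

# Unweighted mean across the 12 benchmarks, rounded to 2 decimals.
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 60.8, matching the table's 60.80
```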
### Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct.

Single image:
```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```
Multi-turn conversations:
```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
For multi-image usage, add multiple image placeholders at the front of the prompt; the `<|image_{i}|>` indices start from 1. An example prompt:
```
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
```
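In practice the processor's `apply_chat_template` produces this format for you; as an illustration only, the placeholder layout above can be generated with a small helper (`build_prompt` is hypothetical, not part of the model's API):

```python
def build_prompt(num_images: int, prompt: str) -> str:
    """Assemble a Phi-3.5-vision-style single-turn prompt with image
    placeholders <|image_1|> ... <|image_N|> prepended to the user turn."""
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}{prompt}<|end|>\n<|assistant|>\n"

# Two-image example; matches the multi-image format shown above.
print(build_prompt(2, "Describe the figures."))
```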
### Loading the model locally

After obtaining the BigDocs-Phi-3.5-instruct checkpoint, users can run the following sample code for inference.
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation='flash_attention_2'
)

# For best performance, use num_crops=4 for multi-frame and num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4
)

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args
)

# Remove the input tokens from the generated sequence before decoding.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(response)
```