|
|
--- |
|
|
base_model: sovitrath/Phi-3.5-vision-instruct |
|
|
library_name: peft |
|
|
--- |
|
|
|
|
|
# Model Card for Phi-3.5-Vision-Instruct-OCR
|
|
|
|
|
This is a Phi 3.5 Vision Instruct model fine-tuned specifically for receipt OCR.
|
|
|
|
|
It has been fine-tuned on the SROIEv2 dataset; the annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
The dataset is **[available on Kaggle](https://www.kaggle.com/datasets/sovitrath/receipt-ocr-input)**. |
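To fetch the dataset programmatically, here is a minimal sketch using the `kagglehub` package (an assumption on my part; it is not among the pinned requirements below):

```python
import kagglehub

# Downloads the dataset and returns the local directory path.
path = kagglehub.dataset_download('sovitrath/receipt-ocr-input')
print(path)
```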
|
|
|
|
|
## Model Details |
|
|
|
|
|
- The base model is **[sovitrath/Phi-3.5-vision-instruct](https://huggingface.co/sovitrath/Phi-3.5-vision-instruct)**.
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
The model was trained on a system with a 10GB RTX 3080 GPU, a 10th-generation i7 CPU, and 32GB of RAM.
|
|
|
|
|
### Framework versions |
|
|
|
|
|
``` |
|
|
torch==2.5.1 |
|
|
torchvision==0.20.1 |
|
|
torchaudio==2.5.1 |
|
|
flash-attn==2.7.2.post1 |
|
|
triton==3.1.0 |
|
|
transformers==4.51.3 |
|
|
accelerate==1.2.0 |
|
|
datasets==4.1.1 |
|
|
huggingface-hub==0.31.1 |
|
|
peft==0.15.2 |
|
|
trl==0.18.0 |
|
|
safetensors==0.4.5 |
|
|
sentencepiece==0.2.0 |
|
|
tiktoken==0.8.0 |
|
|
einops==0.8.0 |
|
|
opencv-python==4.10.0.84 |
|
|
pillow==10.2.0 |
|
|
numpy==2.2.0 |
|
|
scipy==1.14.1 |
|
|
tqdm==4.66.4 |
|
|
pandas==2.2.2 |
|
|
pyarrow==21.0.0 |
|
|
regex==2024.11.6 |
|
|
requests==2.32.3 |
|
|
python-dotenv==1.1.1 |
|
|
wandb==0.22.1 |
|
|
rich==13.9.4 |
|
|
jiwer==4.0.0 |
|
|
bitsandbytes==0.45.0 |
|
|
``` |
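To reproduce the environment, save the list above as `requirements.txt` and run `pip install -r requirements.txt`.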
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
```python
import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # Use `flash_attention_2` on Ampere GPUs and above, `eager` on older GPUs.
    # _attn_implementation='flash_attention_2',
    _attn_implementation='eager',
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

test_image = Image.open('../inference_data/image_1.jpeg').convert('RGB')

plt.figure(figsize=(9, 7))
plt.imshow(test_image)
plt.show()


def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = '<|image_1|>\n'
    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]

    # Prepare the text input by applying the chat template.
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Prepare the inputs for the model and move them to the target device.
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)

    # Generate text with the model.
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids.
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text.
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text.


output = test(model, processor, test_image)
print(output)
```
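Because this repository hosts PEFT adapter weights (see `library_name: peft` above), you can alternatively load the base model and attach the adapter explicitly. A minimal sketch assuming the weights load via the standard `peft` API:

```python
import torch

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model first, then attach the fine-tuned adapter on top.
base_model = AutoModelForCausalLM.from_pretrained(
    'sovitrath/Phi-3.5-vision-instruct',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='eager',
)
model = PeftModel.from_pretrained(base_model, 'sovitrath/Phi-3.5-Vision-Instruct-OCR')
```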
|
|
|
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model has been fine-tuned on the SROIEv2 dataset; the annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
* The model was fine-tuned for 1,200 steps; however, the released checkpoint corresponds to step 400, which gave the best loss.
* The text file annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* It is a LoRA adapter model (with DoRA enabled) trained via PEFT, using the configuration below.
|
|
|
|
|
**LoRA configuration:** |
|
|
|
|
|
```python
from peft import LoraConfig, get_peft_model

# Configure LoRA.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply PEFT model adaptation.
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters.
peft_model.print_trainable_parameters()
```
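With `r=8` and `lora_alpha=16`, the adapter scaling factor is `lora_alpha / r = 2`. Setting `use_dora=True` enables DoRA (weight-decomposed low-rank adaptation), which splits each adapted weight into magnitude and direction components and typically recovers accuracy closer to full fine-tuning at a small additional cost.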
|
|
|
|
|
**Trainer configuration:** |
|
|
|
|
|
```python
import transformers

output_dir = 'outputs'  # Placeholder; set this to your checkpoint directory.

# Configure the training arguments.
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    max_steps=1200,
    per_device_train_batch_size=1,  # Batch size must be 1 for Phi 3.5 Vision Instruct fine-tuning.
    per_device_eval_batch_size=1,   # Batch size must be 1 for Phi 3.5 Vision Instruct fine-tuning.
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    save_safetensors=True,
)
```
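For completeness, a sketch of wiring these arguments into a `transformers.Trainer`; `train_dataset`, `eval_dataset`, and `data_collator` are placeholders for the data pipeline, which is not part of this card:

```python
# Hypothetical wiring; the dataset and collator objects are placeholders.
trainer = transformers.Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()
```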
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The current best validation loss is **0.377421**. |
|
|
|
|
|
The character error rate (CER) on the test set is **0.355**; the Qwen2.5-VL 3B test annotations were used as ground truth.
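Given that `jiwer` is pinned in the framework versions above, the CER can be computed along these lines; `references` and `hypotheses` are placeholder lists:

```python
import jiwer

# references: ground-truth strings (the Qwen2.5-VL 3B test annotations).
# hypotheses: model outputs for the same test images.
cer = jiwer.cer(references, hypotheses)
print(f'CER: {cer:.3f}')
```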
|
|
|