---
base_model: sovitrath/Phi-3.5-vision-instruct
library_name: peft
---
# Model Card for Phi-3.5-Vision-Instruct-OCR
This is a fine-tuned Phi 3.5 Vision Instruct model specialized for receipt OCR.
It was fine-tuned on the SROIEv2 dataset, with annotations generated using Qwen2.5-3B VL.
The dataset is **[available on Kaggle](https://www.kaggle.com/datasets/sovitrath/receipt-ocr-input)**.
## Model Details
- The base model is **[sovitrath/Phi-3.5-vision-instruct](https://huggingface.co/sovitrath/Phi-3.5-vision-instruct)**.
## Technical Specifications
### Compute Infrastructure
The model was trained on a system with an RTX 3080 GPU (10GB VRAM), a 10th-generation Intel i7 CPU, and 32GB of RAM.
### Framework versions
```
torch==2.5.1
torchvision==0.20.1
torchaudio==2.5.1
flash-attn==2.7.2.post1
triton==3.1.0
transformers==4.51.3
accelerate==1.2.0
datasets==4.1.1
huggingface-hub==0.31.1
peft==0.15.2
trl==0.18.0
safetensors==0.4.5
sentencepiece==0.2.0
tiktoken==0.8.0
einops==0.8.0
opencv-python==4.10.0.84
pillow==10.2.0
numpy==2.2.0
scipy==1.14.1
tqdm==4.66.4
pandas==2.2.2
pyarrow==21.0.0
regex==2024.11.6
requests==2.32.3
python-dotenv==1.1.1
wandb==0.22.1
rich==13.9.4
jiwer==4.0.0
bitsandbytes==0.45.0
```
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # Use `flash_attention_2` on Ampere GPUs and above, and `eager` on older GPUs.
    _attn_implementation='eager',
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

test_image = Image.open('../inference_data/image_1.jpeg').convert('RGB')

plt.figure(figsize=(9, 7))
plt.imshow(test_image)
plt.show()

def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = '<|image_1|>\n'

    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]

    # Prepare the text input by applying the chat template.
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Prepare the inputs for the model.
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)  # Move inputs to the specified device.

    # Generate text with the model.
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids.
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text.
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text.

output = test(model, processor, test_image)
print(output)
```
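Since the pinned dependencies above include `bitsandbytes`, the model can optionally be loaded in 4-bit to reduce VRAM usage on smaller GPUs (e.g. the 10GB RTX 3080 used for training). This is a sketch of one possible quantized-loading setup, not the configuration used to produce the reported results:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization config; reduces the memory needed to hold the weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    'sovitrath/Phi-3.5-Vision-Instruct-OCR',
    device_map='auto',
    quantization_config=bnb_config,
    trust_remote_code=True,
    _attn_implementation='eager',
)
```

Note that 4-bit quantization trades a small amount of OCR accuracy for memory, so verify CER on your own data before relying on it.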
## Training Details
### Training Data
The model was fine-tuned on the SROIEv2 dataset; the annotations were generated using Qwen2.5-3B VL.
### Training Procedure
* The model was fine-tuned for 1200 steps. However, the uploaded checkpoint corresponds to the model saved at 400 steps, which gave the best validation loss.
* The text file annotations were generated using Qwen2.5-3B VL.
#### Training Hyperparameters
* It is a LoRA model.
**LoRA configuration:**
```python
from peft import LoraConfig, get_peft_model

# Configure LoRA.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply the PEFT model adaptation.
peft_model = get_peft_model(model, peft_config)

# Print the number of trainable parameters.
peft_model.print_trainable_parameters()
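For intuition on what `r=8` costs: each targeted linear layer of shape `(d_out, d_in)` gains two low-rank factors, `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`, so LoRA adds roughly `r * (d_in + d_out)` trainable parameters per layer (ignoring the extra magnitude vector that `use_dora=True` introduces). A small sketch with a hypothetical square projection; the exact Phi 3.5 layer shapes may differ:

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update as B @ A,
    # with A: (r, d_in) and B: (d_out, r).
    return r * d_in + d_out * r

# Hypothetical 3072 x 3072 projection with the card's rank r=8:
print(lora_param_count(3072, 3072, 8))  # 49152 extra trainable parameters
```

This is why LoRA fine-tuning of a multi-billion-parameter model fits on a 10GB GPU: only these small factors receive gradients.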
**Trainer configuration:**
```python
import transformers

output_dir = 'phi35_vision_ocr_output'  # Choose your own output directory.

# Configure the training arguments.
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    max_steps=1200,
    per_device_train_batch_size=1,  # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    per_device_eval_batch_size=1,  # Batch size MUST be 1 for Phi 3.5 Vision Instruct fine-tuning.
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    save_safetensors=True,
)
```
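Although the per-device batch size is fixed at 1, gradient accumulation means each optimizer step still averages gradients over several samples. A quick sketch of the resulting totals implied by the configuration above:

```python
per_device_batch = 1   # required to be 1 for this model
grad_accum = 4         # gradient_accumulation_steps
max_steps = 1200       # optimizer steps, not forward passes

# Each optimizer step accumulates gradients over this many samples.
effective_batch = per_device_batch * grad_accum

# Total training samples processed over the full run.
samples_seen = effective_batch * max_steps

print(effective_batch)  # 4
print(samples_seen)     # 4800
```

Since checkpoints are saved every 400 steps, the best checkpoint (at step 400) corresponds to roughly 1600 training samples seen.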
## Evaluation
The best validation loss achieved is **0.377421**.
The character error rate (CER) on the test set is **0.355**, computed using the Qwen2.5-3B VL test annotations as ground truth.
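CER is the character-level Levenshtein edit distance between prediction and reference, divided by the reference length; in practice the pinned `jiwer` package provides this as `jiwer.cer`. A self-contained pure-Python sketch of the metric (the example strings are hypothetical, not taken from the dataset):

```python
def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: Levenshtein edit distance / reference length.
    m, n = len(reference), len(hypothesis)
    dp = list(range(n + 1))  # Rolling DP row of edit distances.
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution/match
        # (prev carries the diagonal value into the next column)
    return dp[n] / m

# Hypothetical strings: one substituted character out of 10.
print(cer('TOTAL 1.20', 'TOTAL 1,20'))  # 0.1
```

A CER of 0.355 therefore means roughly one in three reference characters requires an edit, keeping in mind the "ground truth" here is itself model-generated.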