|
|
--- |
|
|
base_model: sovitrath/Phi-3.5-vision-instruct |
|
|
library_name: peft |
|
|
--- |
|
|
|
|
|
# Model Card for Phi-3.5-Vision-Instruct-OCR
|
|
|
|
|
This is a Phi 3.5 Vision Instruct model fine-tuned specifically for receipt OCR.
|
|
|
|
|
It has been fine-tuned on the SROIEv2 dataset; the annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
The dataset is **[available on Kaggle](https://www.kaggle.com/datasets/sovitrath/receipt-ocr-input)**. |
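To fetch the dataset programmatically, here is a minimal sketch using the `kagglehub` package (an assumption on my part; it is not among the pinned requirements below):

```python
import kagglehub

# Downloads the dataset and returns the local directory path.
path = kagglehub.dataset_download('sovitrath/receipt-ocr-input')
print(path)
```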
|
|
|
|
|
## Model Details |
|
|
|
|
|
- The base model is **[sovitrath/Phi-3.5-vision-instruct](https://huggingface.co/sovitrath/Phi-3.5-vision-instruct)**.
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
The model was trained on a system with a 10GB RTX 3080 GPU, a 10th-generation i7 CPU, and 32GB of RAM.
|
|
|
|
|
### Framework versions |
|
|
|
|
|
``` |
|
|
torch==2.5.1 |
|
|
torchvision==0.20.1 |
|
|
torchaudio==2.5.1 |
|
|
flash-attn==2.7.2.post1 |
|
|
triton==3.1.0 |
|
|
transformers==4.51.3 |
|
|
accelerate==1.2.0 |
|
|
datasets==4.1.1 |
|
|
huggingface-hub==0.31.1 |
|
|
peft==0.15.2 |
|
|
trl==0.18.0 |
|
|
safetensors==0.4.5 |
|
|
sentencepiece==0.2.0 |
|
|
tiktoken==0.8.0 |
|
|
einops==0.8.0 |
|
|
opencv-python==4.10.0.84 |
|
|
pillow==10.2.0 |
|
|
numpy==2.2.0 |
|
|
scipy==1.14.1 |
|
|
tqdm==4.66.4 |
|
|
pandas==2.2.2 |
|
|
pyarrow==21.0.0 |
|
|
regex==2024.11.6 |
|
|
requests==2.32.3 |
|
|
python-dotenv==1.1.1 |
|
|
wandb==0.22.1 |
|
|
rich==13.9.4 |
|
|
jiwer==4.0.0 |
|
|
bitsandbytes==0.45.0 |
|
|
``` |
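To reproduce the environment, save the list above as `requirements.txt` and run `pip install -r requirements.txt`.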
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use the code below to get started with the model. |
|
|
|
|
|
```python
import torch
import matplotlib.pyplot as plt

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = 'sovitrath/Phi-3.5-Vision-Instruct-OCR'

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # Use `flash_attention_2` on Ampere GPUs and above, `eager` on older GPUs.
    # _attn_implementation='flash_attention_2',
    _attn_implementation='eager',
)

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

test_image = Image.open('../inference_data/image_1.jpeg').convert('RGB')

plt.figure(figsize=(9, 7))
plt.imshow(test_image)
plt.show()


def test(model, processor, image, max_new_tokens=1024, device='cuda'):
    placeholder = '<|image_1|>\n'
    messages = [
        {
            'role': 'user',
            'content': placeholder + 'OCR this image accurately'
        },
    ]

    # Prepare the text input by applying the chat template.
    text_input = processor.tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if image.mode != 'RGB':
        image = image.convert('RGB')

    # Prepare the inputs for the model and move them to the target device.
    model_inputs = processor(
        text=text_input,
        images=[image],
        return_tensors='pt',
    ).to(device)

    # Generate text with the model.
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids.
    trimmed_generated_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # Decode the output text.
    output_text = processor.batch_decode(
        trimmed_generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text.


output = test(model, processor, test_image)
print(output)
```
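Because this repository hosts PEFT adapter weights (see `library_name: peft` above), you can alternatively load the base model and attach the adapter explicitly. A minimal sketch assuming the weights load via the standard `peft` API:

```python
import torch

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model first, then attach the fine-tuned adapter on top.
base_model = AutoModelForCausalLM.from_pretrained(
    'sovitrath/Phi-3.5-vision-instruct',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation='eager',
)
model = PeftModel.from_pretrained(base_model, 'sovitrath/Phi-3.5-Vision-Instruct-OCR')
```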
|
|
|
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model has been fine-tuned on the SROIEv2 dataset; the annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
* The model was fine-tuned for 1,200 steps; however, the released checkpoint corresponds to step 400, which gave the best loss.
* The text file annotations were generated using Qwen2.5-VL 3B.
|
|
|
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* It is a LoRA adapter model (with DoRA enabled) trained via PEFT, using the configuration below.
|
|
|
|
|
**LoRA configuration:** |
|
|
|
|
|
```python
from peft import LoraConfig, get_peft_model

# Configure LoRA.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=['down_proj', 'o_proj', 'k_proj', 'q_proj', 'gate_proj', 'up_proj', 'v_proj'],
    use_dora=True,
    init_lora_weights='gaussian'
)

# Apply PEFT model adaptation.
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters.
peft_model.print_trainable_parameters()
```
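With `r=8` and `lora_alpha=16`, the adapter scaling factor is `lora_alpha / r = 2`. Setting `use_dora=True` enables DoRA (weight-decomposed low-rank adaptation), which splits each adapted weight into magnitude and direction components and typically recovers accuracy closer to full fine-tuning at a small additional cost.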
|
|
|
|
|
**Trainer configuration:** |
|
|
|
|
|
```python
import transformers

output_dir = 'outputs'  # Placeholder; set this to your checkpoint directory.

# Configure the training arguments.
training_args = transformers.TrainingArguments(
    output_dir=output_dir,
    logging_dir=output_dir,
    max_steps=1200,
    per_device_train_batch_size=1,  # Batch size must be 1 for Phi 3.5 Vision Instruct fine-tuning.
    per_device_eval_batch_size=1,   # Batch size must be 1 for Phi 3.5 Vision Instruct fine-tuning.
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=400,
    eval_steps=400,
    save_steps=400,
    logging_strategy='steps',
    eval_strategy='steps',
    save_strategy='steps',
    save_total_limit=2,
    optim='adamw_torch_fused',
    bf16=True,
    report_to='wandb',
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataloader_num_workers=4,
    load_best_model_at_end=True,
    save_safetensors=True,
)
```
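For completeness, a sketch of wiring these arguments into a `transformers.Trainer`; `train_dataset`, `eval_dataset`, and `data_collator` are placeholders for the data pipeline, which is not part of this card:

```python
# Hypothetical wiring; the dataset and collator objects are placeholders.
trainer = transformers.Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()
```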
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The current best validation loss is **0.377421**. |
|
|
|
|
|
The character error rate (CER) on the test set is **0.355**; the Qwen2.5-VL 3B test annotations were used as ground truth.
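Given that `jiwer` is pinned in the framework versions above, the CER can be computed along these lines; `references` and `hypotheses` are placeholder lists:

```python
import jiwer

# references: ground-truth strings (the Qwen2.5-VL 3B test annotations).
# hypotheses: model outputs for the same test images.
cer = jiwer.cer(references, hypotheses)
print(f'CER: {cer:.3f}')
```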
|
|
|