|
|
---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- llama-factory
- lora
- generated_from_trainer
- Qwen
- Vl-model
- fine-tuning
- vision-model
- multi-modal
model-index:
- name: models
  results: []
datasets:
- naver-clova-ix/cord-v2
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: image-text-to-text
---
|
|
|
|
|
|
|
|
|
|
# Qwen Fine-Tuning Results

<img src="./results.png" alt="Sample Invoice" width="auto"/>
|
|
|
|
|
# models
|
|
|
|
|
This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) on the invoice_train dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0481
|
|
|
|
|
## Model description

The Qwen2-VL-2B model has been fine-tuned on OCR-rich invoice data from the CORD-v2 dataset, allowing it to recognize both the content and layout of invoices effectively. The model outputs structured information directly, enabling downstream processing or integration into accounting systems.
|
|
|
|
|
For each invoice image, the model identifies and extracts the following fields (an illustrative output example follows the list):
|
|
|
|
|
- Menu Items
- Item Prices
- Subtotal Price
- Total Price
- Tax Amount
- Cash Given
- Change Amount
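
Purely as an illustration of what this structured output can look like, here is a hypothetical parse in the CORD-v2 style (field names such as `menu`, `sub_total`, and `total` follow the CORD-v2 annotation schema; every value below is an invented placeholder, not model output):

```python
# Hypothetical example of the structured fields the model is meant to produce.
# All values are made-up placeholders; the exact schema follows the CORD-v2
# annotations used during fine-tuning.
example_output = {
    "menu": [
        {"nm": "ICED AMERICANO", "cnt": "2", "price": "9,000"},
        {"nm": "CHEESE CAKE", "cnt": "1", "price": "6,500"},
    ],
    "sub_total": {"subtotal_price": "15,500", "tax_price": "1,550"},
    "total": {"total_price": "17,050", "cashprice": "20,000", "changeprice": "2,950"},
}
```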
|
|
|
|
|
## More Info

- Base Model: Qwen2-VL-2B-Instruct, a 2B-parameter vision-language model from the Qwen2 family.
- Fine-Tuning: Supervised learning on OCR + structure pairs from the CORD-v2 dataset.
- Input: OCR-annotated invoice image data from the CORD-v2 dataset.
- Output: Structured extraction of key financial fields in JSON format.
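
As a rough sketch of how this LoRA adapter can be loaded on top of the base model for inference (assumptions: the adapter weights sit in a local `./models` directory, `qwen-vl-utils` is installed, and the prompt wording is only a placeholder, since the exact instruction used during training is not documented here):

```python
import torch
from peft import PeftModel
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Load the frozen base model, then attach the LoRA adapter on top of it.
base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "./models")  # placeholder adapter path
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice.png"},  # placeholder invoice image
        {"type": "text", "text": "Extract the invoice fields as JSON."},  # placeholder prompt
    ],
}]

# Standard Qwen2-VL preprocessing: chat template plus vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=512)

# Drop the prompt tokens and decode only the newly generated text.
generated = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```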
|
|
|
|
|
|
|
|
|
|
|
## Training and evaluation data

- Training Set: 800 samples, used to fine-tune the Qwen2-VL-2B model to extract key invoice components from OCR text and layout information.
- Evaluation Set: 100 samples, used to assess the model's ability to generalize and accurately extract fields such as menu items, prices, subtotal, tax, cash, and change from unseen invoices.
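
For reference, a minimal sketch of loading the dataset from the Hub (this assumes the standard `naver-clova-ix/cord-v2` layout with an `image` column and a JSON-string `ground_truth` column containing a `gt_parse` entry; the 800/100 sample counts above line up with the dataset's train/validation splits):

```python
import json
from datasets import load_dataset

# Download CORD-v2 from the Hugging Face Hub.
ds = load_dataset("naver-clova-ix/cord-v2")
print(ds)  # DatasetDict with train / validation / test splits

sample = ds["train"][0]
image = sample["image"]                          # PIL image of the receipt
annotation = json.loads(sample["ground_truth"])  # annotation stored as a JSON string
print(annotation["gt_parse"])                    # structured target fields (menu, sub_total, total, ...)
```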
|
|
|
|
|
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0
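
The run itself was driven by LLaMA-Factory, so the exact invocation is not reproduced here; purely as a sketch, the hyperparameters above map onto `transformers.TrainingArguments` roughly like this (the output directory name is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models",               # placeholder output directory
    learning_rate=1e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,     # effective total train batch size of 4
    num_train_epochs=3.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",               # AdamW with betas=(0.9, 0.999), eps=1e-8
    seed=42,
)
```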
|
|
|
|
|
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.0779        | 0.5   | 100  | 0.0685          |
| 0.0647        | 1.0   | 200  | 0.0511          |
| 0.0292        | 1.5   | 300  | 0.0500          |
| 0.028         | 2.0   | 400  | 0.0449          |
| 0.013         | 2.5   | 500  | 0.0488          |
| 0.0116        | 3.0   | 600  | 0.0481          |
|
|
|
|
|
|
|
|
### Framework versions

- PEFT 0.14.0
- Transformers 4.51.3
- PyTorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1