---
library_name: peft
license: apache-2.0
base_model:
- Qwen/Qwen2-VL-2B-Instruct
tags:
- llama-factory
- lora
- generated_from_trainer
- Qwen
- Vl-model
- fine-tuning
- vision-model
- multi-modal
model-index:
- name: models
  results: []
datasets:
- naver-clova-ix/cord-v2
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
pipeline_tag: image-text-to-text
---


# Qwen Fine Tuning Results 
<img src="./results.png" alt="Sample Invoice" width="auto"/>

# models

This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) on the invoice_train dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0481

## Model description

The Qwen2-VL 2B model has been fine-tuned on OCR-rich invoice data from the CORD-v2 dataset so that it recognizes both the content and the layout of invoices. The model outputs structured information directly, which enables downstream processing or integration into accounting systems.

For each invoice image, the model identifies and extracts the following fields:

- Menu Items

- Item Prices

- Subtotal Price

- Total Price

- Tax Amount

- Cash Given

- Change Amount

## More Info
- Base Model: Qwen2-VL-2B-Instruct, an instruction-tuned vision-language model.

- Fine-Tuning: Supervised learning on OCR + structure pairs from the CORD-v2 dataset.

- Input: OCR-annotated invoice image data from the CORD-v2 dataset.

- Output: Structured extraction of key financial fields in JSON format.
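
As a rough illustration of how the JSON output could be consumed downstream, the sketch below parses a response and checks that the amounts are arithmetically consistent. This card does not document the output schema, so the key names (`menu`, `subtotal_price`, `tax_amount`, `total_price`, `cash_given`, `change_amount`) and the sample values are hypothetical and should be adapted to what the model actually emits.

```python
import json
from decimal import Decimal

# Hypothetical response text: the key names and values are assumptions,
# not the schema documented by this model card.
SAMPLE_RESPONSE = """
{
  "menu": [
    {"name": "Iced Tea", "price": "12000"},
    {"name": "Fried Rice", "price": "30000"}
  ],
  "subtotal_price": "42000",
  "tax_amount": "4200",
  "total_price": "46200",
  "cash_given": "50000",
  "change_amount": "3800"
}
"""


def parse_receipt(text):
    """Parse the model's JSON text and convert every amount to Decimal."""
    raw = json.loads(text)
    return {
        "items": [(i["name"], Decimal(i["price"])) for i in raw.get("menu", [])],
        "subtotal": Decimal(raw["subtotal_price"]),
        "tax": Decimal(raw["tax_amount"]),
        "total": Decimal(raw["total_price"]),
        "cash": Decimal(raw["cash_given"]),
        "change": Decimal(raw["change_amount"]),
    }


def consistency_issues(receipt):
    """Return a list of arithmetic checks that the extracted fields fail."""
    issues = []
    if sum(price for _, price in receipt["items"]) != receipt["subtotal"]:
        issues.append("item prices do not sum to subtotal")
    if receipt["subtotal"] + receipt["tax"] != receipt["total"]:
        issues.append("total != subtotal + tax")
    if receipt["cash"] - receipt["total"] != receipt["change"]:
        issues.append("change != cash - total")
    return issues


receipt = parse_receipt(SAMPLE_RESPONSE)
print(consistency_issues(receipt))
```

Checks like these are only a sanity filter: real receipts may include discounts or service charges, in which case the simple sums above would need extra terms.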



## Training and evaluation data

- Training Set: 800 samples, used to fine-tune the Qwen2-VL 2B model to extract key invoice components from OCR text and layout information.

- Evaluation Set: 100 samples, used to assess the model's ability to generalize and accurately extract fields such as menu items, prices, subtotal, tax, cash, and change from unseen invoices.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 3.0
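
The hyperparameters above imply a step count and a learning-rate curve that can be worked out by hand. The sketch below does that arithmetic, assuming a single device (which is what `total_train_batch_size: 4` = 1 × 4 suggests) and the usual shape of a cosine schedule with linear warmup. It is an approximation for illustration, not the Trainer's own implementation.

```python
import math

NUM_SAMPLES = 800      # training set size stated in this card
PER_DEVICE_BATCH = 1   # train_batch_size
GRAD_ACCUM = 4         # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: one GPU, consistent with a total batch of 4
NUM_EPOCHS = 3
BASE_LR = 1e-4
WARMUP_RATIO = 0.1

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = NUM_SAMPLES // effective_batch
total_steps = steps_per_epoch * NUM_EPOCHS
warmup_steps = round(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Approximate learning rate at an optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
for step in (0, 30, 60, 330, 600):
    print(step, lr_at(step))
```

Under these assumptions one epoch is 200 optimizer steps and the full run is 600, which agrees with the step and epoch columns reported in the training results.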

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 0.0779        | 0.5   | 100  | 0.0685          |
| 0.0647        | 1.0   | 200  | 0.0511          |
| 0.0292        | 1.5   | 300  | 0.0500          |
| 0.028         | 2.0   | 400  | 0.0449          |
| 0.013         | 2.5   | 500  | 0.0488          |
| 0.0116        | 3.0   | 600  | 0.0481          |
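
To read the table at a glance, the short sketch below copies its rows into Python and picks the checkpoint with the lowest validation loss. The variable names are illustrative only; nothing here comes from the training code.

```python
# (epoch, step, training_loss, validation_loss), copied from the table above
LOG = [
    (0.5, 100, 0.0779, 0.0685),
    (1.0, 200, 0.0647, 0.0511),
    (1.5, 300, 0.0292, 0.0500),
    (2.0, 400, 0.0280, 0.0449),
    (2.5, 500, 0.0130, 0.0488),
    (3.0, 600, 0.0116, 0.0481),
]

# Checkpoint with the lowest validation loss
best = min(LOG, key=lambda row: row[3])
final = LOG[-1]

print(f"best step: {best[1]} (epoch {best[0]}), validation loss {best[3]}")
print(f"final step: {final[1]}, validation loss {final[3]}")
print(f"final minus best: {final[3] - best[3]:.4f}")
```

Validation loss is lowest at step 400 (epoch 2.0) while training loss keeps falling afterwards, which may indicate mild overfitting in the last epoch. The reported evaluation loss of 0.0481 corresponds to the final step.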


### Framework versions

- PEFT 0.14.0
- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1