TomasFAV committed (verified)
Commit fd4a120 · Parent(s): dc3ebf0

Update README.md

Files changed (1): README.md (+86 −20)
@@ -4,40 +4,103 @@ license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- - name: BERTInvoiceCzechR
  results: []
---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # BERTInvoiceCzechR

- This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) on an unknown dataset.
It achieves the following results on the evaluation set:
- - Loss: 0.3291
- - Precision: 0.5188
- - Recall: 0.6917
- - F1: 0.5929
- - Accuracy: 0.9335
## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

## Training procedure

@@ -54,6 +117,8 @@ The following hyperparameters were used during training:
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
@@ -69,10 +134,11 @@ The following hyperparameters were used during training:
| 0.3757 | 9.0 | 783 | 0.3604 | 0.4906 | 0.6858 | 0.5720 | 0.9279 |
| 0.3757 | 10.0 | 870 | 0.3515 | 0.5011 | 0.6944 | 0.5821 | 0.9296 |

- ### Framework versions

- - Transformers 5.0.0
- - Pytorch 2.10.0+cu128
- - Datasets 4.0.0
- - Tokenizers 0.22.2
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
+ - invoice-processing
+ - information-extraction
+ - czech-language
+ - synthetic-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
+ - name: BERTInvoiceCzechR-V0
  results: []
---

+ # BERTInvoiceCzechR (V0 Synthetic Templates Only)

+ This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
+ - Loss: 0.3291
+ - Precision: 0.5188
+ - Recall: 0.6917
+ - F1: 0.5929
+ - Accuracy: 0.9335
+
+ ---
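As a quick consistency check on the numbers above: F1 is the harmonic mean of precision and recall, and the reported value matches the reported precision and recall.

```python
# F1 is the harmonic mean of precision and recall; the values reported
# above (P = 0.5188, R = 0.6917) reproduce the reported F1 of 0.5929.
precision, recall = 0.5188, 0.6917
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.5929
```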

## Model description

+ BERTInvoiceCzechR (V0) is the baseline model in a multi-stage experimental pipeline focused on invoice understanding.

+ The model performs token-level classification to extract structured fields from invoice text, such as:
+ - supplier
+ - customer
+ - invoice number
+ - bank details
+ - totals
+ - dates

+ This version (V0) is trained **exclusively on synthetically generated invoices created from predefined templates**, without any layout randomization or real-world data.
+
+ ---
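The card does not spell out how per-token predictions are mapped back to field values. A minimal sketch, assuming BIO-style tags (the label names below are invented for illustration, not the model's published label set):

```python
# Hypothetical post-processing: grouping per-token BIO predictions into
# field values. Label names are assumptions made for this example.

def group_bio(tokens, labels):
    """Merge (token, BIO-label) pairs into {field: [span texts]}."""
    fields = {}
    current_field, current_tokens = None, []

    def flush():
        nonlocal current_field, current_tokens
        if current_field is not None:
            fields.setdefault(current_field, []).append(" ".join(current_tokens))
        current_field, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_field, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_field == label[2:]:
            current_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            flush()
    flush()
    return fields

tokens = ["Faktura", "č.", "2024-001", "ABC", "s.r.o."]
labels = ["O", "O", "B-INVOICE_NUMBER", "B-SUPPLIER", "I-SUPPLIER"]
print(group_bio(tokens, labels))
# {'INVOICE_NUMBER': ['2024-001'], 'SUPPLIER': ['ABC s.r.o.']}
```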

## Training data

+ The dataset consists purely of:
+
+ - synthetically generated invoices
+ - fixed template structures
+ - controlled field placement and formatting
+
+ Characteristics:
+ - consistent layout across samples
+ - fully controlled annotations
+ - no noise or OCR artifacts
+ - no real invoice data
+ - added synthetic image augmentations
+
+ This dataset represents the **simplest training scenario** in the pipeline and serves as a baseline for comparison with more complex data variants.
+
+ ---
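The generator itself is not published with this card. As a rough illustration only, template-based synthesis with fully controlled annotations might look like the sketch below; the template, field names, and value pools are all hypothetical.

```python
import random

# Hypothetical sketch of template-based synthesis: a fixed layout is filled
# with randomized field values, and the token labels come directly from the
# template, giving fully controlled annotations with no OCR noise.
# Every field name, template slot, and value pool here is invented for
# illustration; none are taken from the actual training pipeline.

TEMPLATE = [
    ("Dodavatel:", "O"), ("{supplier}", "SUPPLIER"),
    ("Faktura", "O"), ("č.", "O"), ("{invoice_number}", "INVOICE_NUMBER"),
    ("Celkem:", "O"), ("{total}", "TOTAL"),
]

def make_sample(rng: random.Random):
    """Return one synthetic (tokens, labels) pair from the fixed template."""
    values = {
        "supplier": rng.choice(["ABC s.r.o.", "Novák a.s."]),
        "invoice_number": f"{rng.randint(2020, 2025)}-{rng.randint(1, 999):03d}",
        "total": f"{rng.randint(100, 99999)},00",
    }
    tokens = [slot.format(**values) for slot, _ in TEMPLATE]
    labels = [label for _, label in TEMPLATE]
    return tokens, labels

tokens, labels = make_sample(random.Random(0))
print(list(zip(tokens, labels)))
```

Because the layout is identical across samples, only the field values vary, which is exactly why this V0 setup serves as the simplest baseline.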

+ ## Role in the pipeline
+
+ This model corresponds to:
+
+ **V0 – Synthetic template-based dataset only**
+
+ It is used as:
+ - a baseline for evaluating the impact of:
+   - layout variability
+   - synthetic-real hybrid data
+   - real annotated invoices
+ - a reference point for measuring the generalization gap
+
+ ---

+ ## Intended uses
+
+ - Baseline model for document AI experiments
+ - Evaluation of synthetic data usefulness
+ - Comparison with more advanced dataset variants (V1–V3)
+ - Research in Czech invoice information extraction
+
+ ---

+ ## Limitations
+
+ - Strong dependency on the template structure
+ - May generalize poorly to:
+   - unseen layouts
+   - real-world invoices
+   - noisy OCR outputs
+ - Does not capture layout variability
+ - Trained only on clean synthetic data
+
+ ---

## Training procedure

- num_epochs: 10
- mixed_precision_training: Native AMP

+ ---
+
### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
| 0.3757 | 9.0 | 783 | 0.3604 | 0.4906 | 0.6858 | 0.5720 | 0.9279 |
| 0.3757 | 10.0 | 870 | 0.3515 | 0.5011 | 0.6944 | 0.5821 | 0.9296 |

+ ---

+ ## Framework versions

+ - Transformers 5.0.0
+ - PyTorch 2.10.0+cu128
+ - Datasets 4.0.0
+ - Tokenizers 0.22.2