TomasFAV committed
Commit 0c074b5 · verified · 1 Parent(s): 0dc3950

Update README.md

Files changed (1):
  1. README.md +84 -20

README.md CHANGED
@@ -1,41 +1,102 @@
  ---
  library_name: transformers
  tags:
  - generated_from_trainer
  metrics:
  - precision
  - recall
  - f1
  - accuracy
  model-index:
- - name: BERTInvoiceCzechRV1
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # BERTInvoiceCzechRV1

- This model was trained from scratch on an unknown dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.2295
- - Precision: 0.6594
- - Recall: 0.7309
- - F1: 0.6933
- - Accuracy: 0.9534

  ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

  ## Training procedure

@@ -52,6 +113,8 @@ The following hyperparameters were used during training:
  - num_epochs: 10
  - mixed_precision_training: Native AMP

  ### Training results

  | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
@@ -67,10 +130,11 @@ The following hyperparameters were used during training:
  | 0.0306 | 9.0 | 585 | 0.2853 | 0.6059 | 0.7208 | 0.6584 | 0.9455 |
  | 0.0306 | 10.0 | 650 | 0.2859 | 0.6054 | 0.7239 | 0.6594 | 0.9452 |

- ### Framework versions

- - Transformers 5.0.0
- - Pytorch 2.10.0+cu128
- - Datasets 4.0.0
- - Tokenizers 0.22.2
  ---
  library_name: transformers
+ license: apache-2.0
+ base_model: google-bert/bert-base-multilingual-cased
  tags:
  - generated_from_trainer
+ - invoice-processing
+ - information-extraction
+ - czech-language
+ - synthetic-data
+ - layout-augmentation
  metrics:
  - precision
  - recall
  - f1
  - accuracy
  model-index:
+ - name: BERTInvoiceCzechR-V1
  results: []
  ---

+ # BERTInvoiceCzechR (V1 Synthetic + Random Layout)

+ This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

  It achieves the following results on the evaluation set:
+ - Loss: 0.2295
+ - Precision: 0.6594
+ - Recall: 0.7309
+ - F1: 0.6933
+ - Accuracy: 0.9534
+
+ ---

  ## Model description

+ BERTInvoiceCzechR (V1) extends the baseline model (V0) by introducing layout variability into the training data.
+
+ The model performs token-level classification to extract structured invoice fields such as:
+ - supplier
+ - customer
+ - invoice number
+ - bank details
+ - totals
+ - dates
+
+ Compared to V0, this version is trained on synthetically generated invoices with **randomized layouts**, improving robustness to positional and structural variations.
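The token-level extraction described above could be used roughly as follows. This is a minimal sketch, not code from the repository: the repo id `TomasFAV/BERTInvoiceCzechRV1`, the BIO label names (`B-supplier`, `I-supplier`, …), and the `group_fields` helper are all assumptions; the card only states that the model does token-level classification.

```python
def group_fields(tagged_tokens):
    """Merge (token, BIO-label) pairs into (field, text) spans.

    The BIO scheme and field names here are assumptions for illustration.
    """
    fields = []
    current_label, current_tokens = None, []
    for token, label in tagged_tokens:
        if label.startswith("B-"):
            if current_label:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_label == label[2:]:
            current_tokens.append(token)
        else:
            if current_label:
                fields.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label:
        fields.append((current_label, " ".join(current_tokens)))
    return fields


if __name__ == "__main__":
    # Repo id is a guess based on the commit author and model name.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="TomasFAV/BERTInvoiceCzechRV1",
        aggregation_strategy="simple",
    )
    print(ner("Faktura c. 2024001, dodavatel: ACME s.r.o."))
```

For plain-BERT token classifiers, a grouping step like `group_fields` (or the pipeline's built-in `aggregation_strategy`) is what turns per-token labels back into whole invoice fields.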
+
+ ---
+
+ ## Training data
+
+ The dataset consists of:
+
+ - synthetically generated invoices based on templates
+ - additional variants with randomized layout structures
+
+ Key properties:
+ - variable positioning of fields
+ - layout perturbations (shifts, spacing, ordering)
+ - preserved semantic correctness of labels
+ - still fully synthetic (no real invoices)
+
+ This dataset introduces **layout diversity**, which is critical for generalization in document understanding tasks.
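The kind of layout randomization listed above (shifts, spacing, reordering, with labels left intact) can be sketched as a small augmentation function. The actual generator is not published; the field names and the specific perturbation ranges below are illustrative assumptions.

```python
import random


def randomize_layout(fields, rng):
    """Render (label, value) pairs as invoice lines with a randomized layout.

    Shuffles field ordering and injects random indentation and spacing,
    while the label/value content itself stays unchanged.
    """
    fields = list(fields)
    rng.shuffle(fields)                   # randomized field ordering
    lines = []
    for label, value in fields:
        indent = " " * rng.randint(0, 8)  # positional shift
        gap = " " * rng.randint(1, 4)     # spacing perturbation
        lines.append(f"{indent}{label}:{gap}{value}")
    return lines


rng = random.Random(0)
invoice = [
    ("Dodavatel", "ACME s.r.o."),
    ("Cislo faktury", "2024001"),
    ("Celkem", "12 100 Kc"),
]
print("\n".join(randomize_layout(invoice, rng)))
```

The key property, matching the card's description, is that only the presentation varies between variants; the token labels derived from the (label, value) pairs remain semantically correct.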
+
+ ---
+
+ ## Role in the pipeline
+
+ This model corresponds to:

+ **V1 Synthetic templates + randomized layouts**

+ It is used to:
+ - evaluate the impact of layout variability
+ - compare against:
+   - V0 (fixed templates)
+   - later stages with real data (V2, V3)
+ - measure improvements in generalization
+
+ ---

+ ## Intended uses
+
+ - Research in layout-aware NLP without explicit layout models
+ - Benchmarking robustness to structural variation
+ - Intermediate baseline for synthetic data pipelines
+ - Czech invoice information extraction
+
+ ---

+ ## Limitations
+
+ - Still trained only on synthetic data
+ - No exposure to real-world noise (OCR errors, distortions)
+ - Layout variation is artificial and may not fully reflect real documents
+ - Does not leverage explicit spatial features (pure BERT)
+
+ ---
  ## Training procedure

  - num_epochs: 10
  - mixed_precision_training: Native AMP

+ ---
+
  ### Training results

  | Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
  | 0.0306 | 9.0 | 585 | 0.2853 | 0.6059 | 0.7208 | 0.6584 | 0.9455 |
  | 0.0306 | 10.0 | 650 | 0.2859 | 0.6054 | 0.7239 | 0.6594 | 0.9452 |

+ ---

+ ## Framework versions

+ - Transformers 5.0.0
+ - PyTorch 2.10.0+cu128
+ - Datasets 4.0.0
+ - Tokenizers 0.22.2