TomasFAV committed bc7777b (verified; parent: 10c850f)

Update README.md

Files changed (1): README.md (+92 −20)
---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
- hybrid-data
- real-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V3
  results: []
---

# BERTInvoiceCzechR (V3 Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.0630
- Precision: 0.8620
- Recall: 0.9072
- F1: 0.8840
- Accuracy: 0.9830
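
For context, precision, recall, and F1 for this kind of tagging task are typically computed entity-wise (seqeval-style) from true-positive, false-positive, and false-negative counts. A quick sketch of the arithmetic, with hypothetical counts rather than the actual evaluation tallies:

```python
# Hypothetical entity counts, chosen only to illustrate the formulas.
tp, fp, fn = 862, 138, 88  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # correct among predicted entities
recall = tp / (tp + fn)                             # correct among gold entities
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(precision, 4), round(recall, 4), round(f1, 4))  # → 0.862 0.9074 0.8841
```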

---

## Model description

BERTInvoiceCzechR (V3) is the final model in a multi-stage training pipeline designed for invoice understanding.

The model performs token-level classification to extract structured invoice fields:
- supplier
- customer
- invoice number
- bank details
- totals
- dates

This version combines synthetic data, layout augmentation, hybrid data, and **real annotated invoices**, resulting in the highest performance across all variants.
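
Token-level extraction of this kind usually emits BIO tags that a post-processing step merges back into field spans. A minimal, self-contained sketch of that step (the tag names and example tokens below are hypothetical illustrations, not the model's actual label set):

```python
def merge_bio(tokens, tags):
    """Merge BIO-tagged tokens into (field, text) spans.

    Tag names are hypothetical; the model's real label set may differ.
    """
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], token]           # open a new span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += " " + token            # extend the open span
        else:                                     # "O" or inconsistent I- closes it
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

tokens = ["Dodavatel", ":", "ACME", "s.r.o.", ",", "celkem", "12500", "Kč"]
tags   = ["O", "O", "B-SUPPLIER", "I-SUPPLIER", "O", "O", "B-TOTAL", "O"]
print(merge_bio(tokens, tags))  # → [('SUPPLIER', 'ACME s.r.o.'), ('TOTAL', '12500')]
```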

---

## Training data

The dataset used in this stage is a combination of:

1. **Synthetic template-based invoices (V0)**
2. **Synthetic invoices with randomized layouts (V1)**
3. **Hybrid invoices with real layouts and synthetic content (V2)**
4. **Real annotated invoices**

### Real data fine-tuning

The final stage introduces:
- real invoice documents
- manually or semi-automatically annotated fields
- natural linguistic variability
- real formatting inconsistencies

This allows the model to:
- adapt to real-world distributions
- learn domain-specific patterns
- improve robustness and generalization
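
One common way to realize such a multi-source mix is to sample training examples from the corpora with fixed weights. A self-contained sketch (the source names, weights, and toy examples are illustrative assumptions, not the card's actual recipe):

```python
import random

def mix_sources(sources, weights, n, seed=0):
    """Draw n examples from several corpora with fixed sampling weights."""
    rng = random.Random(seed)
    names = list(sources)
    mixed = []
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]  # pick a corpus
        mixed.append(rng.choice(sources[name]))             # pick an example
    return mixed

corpora = {
    "synthetic_v0": ["s0_a", "s0_b"],
    "random_layout_v1": ["s1_a", "s1_b"],
    "hybrid_v2": ["s2_a"],
    "real": ["r_a", "r_b", "r_c"],
}
# Hypothetical 3x upweighting of real invoices relative to each synthetic source.
batch = mix_sources(corpora, weights=[1, 1, 1, 3], n=8)
print(len(batch))  # → 8
```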
73
+
74
+ ---
75
 
76
+ ## Role in the pipeline
77
 
78
+ This model corresponds to:
79
 
80
+ **V3 Full pipeline (synthetic + hybrid + real data fine-tuning)**
81
 
82
+ It represents:
83
+ - the final production-ready model
84
+ - the culmination of the proposed data generation strategy
85
+ - the best-performing configuration in the experimental setup
86
+
87
+ ---
88
+
89
+ ## Intended uses
90
+
91
+ - Real-world invoice information extraction
92
+ - Document AI systems in production environments
93
+ - OCR post-processing pipelines
94
+ - Research benchmarking against synthetic-only approaches
95
+
96
+ ---
97
+
98
+ ## Limitations
99
+
100
+ - Performance depends on OCR quality (input text assumption)
101
+ - May still struggle with:
102
+ - highly unusual invoice formats
103
+ - extreme noise or low-resolution scans
104
+ - Requires tokenized text input (not end-to-end from images)
105
+ - Domain-specific (Czech invoices)
106
+
107
+ ---
108
 
109
  ## Training procedure
110
 
 
121
  - num_epochs: 10
122
  - mixed_precision_training: Native AMP
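
These settings map onto a Hugging Face `Trainer` configuration roughly like the following sketch. Only `num_train_epochs` and `fp16` come from this card; `output_dir` is a placeholder, the elided hyperparameters are not reproduced, and everything else keeps its defaults:

```python
from transformers import TrainingArguments

# Sketch only: num_train_epochs and fp16 reflect this card;
# output_dir is a placeholder name, other arguments stay at defaults.
args = TrainingArguments(
    output_dir="bert-invoice-czech-v3",
    num_train_epochs=10,
    fp16=True,  # "Native AMP" mixed-precision training
)
```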

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| …             | …     | …    | …               | …         | …      | …      | …        |
| No log        | 9.0   | 180  | 0.0644          | 0.8593    | 0.9083 | 0.8831 | 0.9828   |
| No log        | 10.0  | 200  | 0.0630          | 0.8620    | 0.9072 | 0.8840 | 0.9830   |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2