Update README.md

README.md (CHANGED)
---
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V0
  results: []
---

# BERTInvoiceCzechR (V0 – Synthetic Templates Only)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for the task of structured information extraction from Czech invoices.

It achieves the following results on the evaluation set:
- Loss: 0.3291
- Precision: 0.5188
- Recall: 0.6917
- F1: 0.5929
- Accuracy: 0.9335
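
The card does not state how these scores were produced; trainer-generated token-classification cards typically report entity-level precision, recall, and F1 via `seqeval`. A minimal sketch with hypothetical BIO labels:

```python
# Entity-level scoring as done by seqeval (assumed metric backend; the card
# does not name its implementation). The labels below are hypothetical.
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-SUPPLIER", "I-SUPPLIER", "O", "B-TOTAL", "O"]]
y_pred = [["B-SUPPLIER", "I-SUPPLIER", "O", "B-DATE", "O"]]

print(precision_score(y_true, y_pred))  # 0.5: one of two predicted entities matches
print(recall_score(y_true, y_pred))     # 0.5: one of two gold entities was found
print(f1_score(y_true, y_pred))         # 0.5
```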

---

## Model description

BERTInvoiceCzechR (V0) is the baseline model in a multi-stage experimental pipeline focused on invoice understanding.

The model performs token-level classification to extract structured fields from invoice text, such as:

- supplier
- customer
- invoice number
- bank details
- totals
- dates

This version (V0) is trained **exclusively on synthetically generated invoices created from predefined templates**, without any layout randomization or real-world data.
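
A minimal inference sketch, assuming the model is loaded as a standard token-classification checkpoint (the repository id and the field labels printed below are placeholders, not confirmed by this card):

```python
from transformers import pipeline

# Placeholder repo id; substitute the actual model path on the Hub.
extractor = pipeline(
    "token-classification",
    model="BERTInvoiceCzechR-V0",
    aggregation_strategy="simple",  # merge word pieces into whole field spans
)

text = "Faktura č. 2024-0031, dodavatel: Novák s.r.o., celkem k úhradě: 12 500 Kč"
for field in extractor(text):
    print(field["entity_group"], "->", field["word"], round(field["score"], 3))
```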

---

## Training data

The dataset consists purely of:

- synthetically generated invoices
- fixed template structures
- controlled field placement and formatting

Characteristics:

- consistent layout across samples
- fully controlled annotations
- no noise or OCR artifacts
- no real invoice data
- added synthetic image augmentations

This dataset represents the **simplest training scenario** in the pipeline and serves as a baseline for comparison with more complex data variants.
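
For illustration only, a toy sketch of what template-based generation with controlled field placement can look like; the actual templates, field inventory, and label schema used for this model are not published with the card:

```python
import random

# Hypothetical fixed template: (literal text or field name, BIO label or None).
TEMPLATE = [
    ("Dodavatel:", None), ("supplier", "SUPPLIER"),
    ("Faktura č.:", None), ("invoice_number", "INVOICE_NUMBER"),
    ("Celkem:", None), ("total", "TOTAL"),
]

VALUES = {
    "supplier": ["Novák s.r.o.", "ABC Trade a.s."],
    "invoice_number": ["2024-0031", "FV-558"],
    "total": ["12 500 Kč", "899 Kč"],
}

def generate_sample():
    """Fill the fixed template with random field values and emit BIO labels."""
    tokens, labels = [], []
    for text, label in TEMPLATE:
        words = (text if label is None else random.choice(VALUES[text])).split()
        tokens += words
        if label is None:
            labels += ["O"] * len(words)
        else:
            labels += [f"B-{label}"] + [f"I-{label}"] * (len(words) - 1)
    return tokens, labels

print(generate_sample())
```

Because every sample shares one layout, the annotations are exact by construction, which is what makes V0 the controlled baseline.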

---

## Role in the pipeline

This model corresponds to:

**V0 – Synthetic template-based dataset only**

It is used as:

- a baseline for evaluating the impact of:
  - layout variability
  - synthetic-real hybrid data
  - real annotated invoices
- a reference point for measuring the generalization gap

---

## Intended uses

- Baseline model for document AI experiments
- Evaluation of synthetic data usefulness
- Comparison with more advanced dataset variants (V1–V3)
- Research in Czech invoice information extraction

---

## Limitations

- Strong dependency on template structure
- May generalize poorly to:
  - unseen layouts
  - real-world invoices
  - noisy OCR outputs
- Does not capture layout variability
- Trained only on clean synthetic data

---

## Training procedure

The following hyperparameters were used during training:
- num_epochs: 10
- mixed_precision_training: Native AMP
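
As a rough sketch, the two listed settings map onto Transformers `TrainingArguments` as follows (everything not listed on the card, such as the learning rate and the label count, is a placeholder):

```python
from transformers import AutoModelForTokenClassification, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "google-bert/bert-base-multilingual-cased",
    num_labels=13,  # placeholder: depends on the actual BIO label set
)

args = TrainingArguments(
    output_dir="BERTInvoiceCzechR-V0",
    num_train_epochs=10,  # num_epochs: 10
    fp16=True,            # mixed_precision_training: Native AMP
)
# Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()
```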

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.3757        | 9.0   | 783  | 0.3604          | 0.4906    | 0.6858 | 0.5720 | 0.9279   |
| 0.3757        | 10.0  | 870  | 0.3515          | 0.5011    | 0.6944 | 0.5821 | 0.9296   |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2