Update README.md
---
library_name: transformers
license: apache-2.0
base_model: google-bert/bert-base-multilingual-cased
tags:
- generated_from_trainer
- invoice-processing
- information-extraction
- czech-language
- synthetic-data
- hybrid-data
- real-data
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: BERTInvoiceCzechR-V3
  results: []
---

# BERTInvoiceCzechR (V3 – Full Pipeline with Real Data Fine-Tuning)

This model is a fine-tuned version of [google-bert/bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) for structured information extraction from Czech invoices.

It achieves the following results on the evaluation set (a sketch of how such token-classification metrics are typically computed follows the list):
- Loss: 0.0630
- Precision: 0.8620
- Recall: 0.9072
- F1: 0.8840
- Accuracy: 0.9830
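
Precision, recall, and F1 for `generated_from_trainer` token-classification cards are usually entity-level seqeval scores, with accuracy computed per token. The snippet below is a hedged illustration of that convention, not the authors' evaluation script; the BIO labels are invented for the example.

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented BIO-tagged sequences purely for illustration; the model's real
# label set lives in its config.json.
y_true = [["B-SUPPLIER", "I-SUPPLIER", "O", "B-TOTAL", "O"]]
y_pred = [["B-SUPPLIER", "I-SUPPLIER", "O", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))  # entity-level precision
print("recall:   ", recall_score(y_true, y_pred))     # entity-level recall
print("f1:       ", f1_score(y_true, y_pred))         # entity-level F1
print("accuracy: ", accuracy_score(y_true, y_pred))   # token-level accuracy
```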

---

## Model description

BERTInvoiceCzechR (V3) is the final model in a multi-stage training pipeline designed for invoice understanding.

The model performs token-level classification to extract structured invoice fields (a minimal inference sketch follows the list):
- supplier
- customer
- invoice number
- bank details
- totals
- dates
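
Because this is a standard `transformers` token-classification model, it can be loaded through the usual pipeline API. The sketch below is illustrative rather than the authors' published usage code: the repository id and the example invoice text are assumptions, and the actual label names should be read from the model's `config.json`.

```python
from transformers import pipeline

# Repo id is an assumption based on the model name in this card.
nlp = pipeline(
    "token-classification",
    model="BERTInvoiceCzechR-V3",
    aggregation_strategy="simple",  # merge word pieces into word-level spans
)

# OCR text from a made-up Czech invoice.
text = "Faktura č. 2024001, Dodavatel: ACME s.r.o., Celkem k úhradě: 12 100 Kč"

for ent in nlp(text):
    # Each entry carries the predicted field, the matched span, and a score.
    print(ent["entity_group"], "->", ent["word"], f"({ent['score']:.3f})")
```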

This version combines synthetic data, layout augmentation, hybrid data, and **real annotated invoices**, resulting in the highest performance across all variants.

---

## Training data

The dataset used in this stage is a combination of the following sources (a hedged loading sketch follows the list):

1. **Synthetic template-based invoices (V0)**
2. **Synthetic invoices with randomized layouts (V1)**
3. **Hybrid invoices with real layouts and synthetic content (V2)**
4. **Real annotated invoices**
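
A minimal sketch of the combination step, assuming each stage is stored as token-labelled JSON lines; the file names are placeholders, since the card does not publish the datasets.

```python
from datasets import concatenate_datasets, load_dataset

# Placeholder file names for the four stages listed above.
stage_files = [
    "synthetic_v0.jsonl",         # 1. template-based synthetic invoices
    "synthetic_layouts_v1.jsonl", # 2. randomized-layout synthetic invoices
    "hybrid_v2.jsonl",            # 3. real layouts + synthetic content
    "real_annotated.jsonl",       # 4. real annotated invoices
]

parts = [load_dataset("json", data_files=f, split="train") for f in stage_files]
train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```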

### Real data fine-tuning

The final stage introduces:
- real invoice documents
- manually or semi-automatically annotated fields
- natural linguistic variability
- real formatting inconsistencies

This allows the model to:
- adapt to real-world distributions
- learn domain-specific patterns
- improve robustness and generalization

---

## Role in the pipeline

This model corresponds to:

**V3 – Full pipeline (synthetic + hybrid + real data fine-tuning)**

It represents:
- the final production-ready model
- the culmination of the proposed data generation strategy
- the best-performing configuration in the experimental setup

---

## Intended uses

- Real-world invoice information extraction
- Document AI systems in production environments
- OCR post-processing pipelines (a post-processing sketch follows the list)
- Research benchmarking against synthetic-only approaches
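
As an example of the post-processing step mentioned above, a small helper can collapse word-level pipeline predictions into one value per invoice field. This is a hypothetical sketch; the `entity_group` names depend on the model's actual label set.

```python
def to_record(entities):
    """Collapse word-level pipeline output into one string per field."""
    record = {}
    for ent in entities:
        field = ent["entity_group"]
        record[field] = (record.get(field, "") + " " + ent["word"]).strip()
    return record

# Input shape mirrors aggregation_strategy="simple" pipeline output.
preds = [
    {"entity_group": "invoice_number", "word": "2024001", "score": 0.99},
    {"entity_group": "total", "word": "12 100 Kč", "score": 0.97},
]
print(to_record(preds))  # {'invoice_number': '2024001', 'total': '12 100 Kč'}
```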

---

## Limitations

- Performance depends on OCR quality, since the model assumes correct input text
- May still struggle with:
  - highly unusual invoice formats
  - extreme noise or low-resolution scans
- Requires tokenized text input (not end-to-end from images); a hypothetical OCR front end is sketched after this list
- Domain-specific (Czech invoices)
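
Since the model consumes text rather than images, production use needs an OCR front end. The following is a hypothetical sketch using Tesseract via `pytesseract`; this repository ships no OCR component, and any engine that yields plain text would do.

```python
from PIL import Image
import pytesseract  # hypothetical choice of OCR engine, not part of this repo

# Czech traineddata ("ces") must be installed for Tesseract separately.
image = Image.open("invoice_scan.png")
ocr_text = pytesseract.image_to_string(image, lang="ces")

# ocr_text can then be passed to the token-classification pipeline shown
# in the usage sketch under "Model description".
```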

---

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- num_epochs: 10
- mixed_precision_training: Native AMP
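
For orientation, the two hyperparameters shown above map onto `transformers.TrainingArguments` roughly as follows; everything else in this sketch is an assumption, not a value taken from the card.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-invoice-czech-v3",  # placeholder path
    num_train_epochs=10,                 # num_epochs: 10
    fp16=True,                           # mixed_precision_training: Native AMP
)
```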

---

### Training results

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log        | 9.0   | 180  | 0.0644          | 0.8593    | 0.9083 | 0.8831 | 0.9828   |
| No log        | 10.0  | 200  | 0.0630          | 0.8620    | 0.9072 | 0.8840 | 0.9830   |

---

## Framework versions

- Transformers 5.0.0
- PyTorch 2.10.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2