---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---

# Arch-L3869-PageClassification

## Model Details

### Model Description

This is a **Greek text classification model** that assigns each document page to one of 18 classes. The model was trained in two phases:

1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model with Supervised Contrastive Learning (SCL) to produce better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss to handle class imbalance.

- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)

### Model Architecture

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Retained Layers:** [0, 2, 4, 6, 8, 11] (6 layers kept for efficiency; the remaining layers were pruned)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000

## Uses

### Direct Use

This model classifies document pages (text extracted via OCR) into one of 18 categories:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ **IMPORTANT:** This preprocessing MUST be applied to every text before inference, because the model was trained on text preprocessed this way.

```python
import re
import unicodedata
from typing import Optional  # `str | None` would fail on Python 3.9

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c
        for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
    1. Remove special symbols
    2. Collapse multiple dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```

### Inference Code Snippet (includes preprocessing + dummy strings)

```python
import json
import re
import unicodedata
from typing import Optional  # `str | None` would fail on Python 3.9

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c
        for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)


# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping (JSON keys are strings, e.g. "4" -> "AA_ID_Card")
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Multi-label sigmoid: per-class scores are independent and do not sum to 1
    probabilities = torch.sigmoid(logits)
    # Pick the highest-scoring class for each page
    predictions = probabilities.argmax(dim=1)

# Get labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```

### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```

## Training Details

### Training Data

- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

### Training Procedure

#### Phase 1: Contrastive Learning

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]

#### Phase 2: Classification

- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)

### Framework Versions

- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x

## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |

### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |

### Key Performance Highlights

- ✅ **Other class:** F1=0.95 on the majority class (571 of 1,336 test samples)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class)
- ✅ **Top-scoring classes:** AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all achieve 1.00 F1
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79, precision 0.71) - needs more training data

## Model Files

| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |

## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.
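
## Appendix: Checking the Aggregate Metrics

As a sanity check on the evaluation tables, the macro and weighted F1 scores can be recomputed from the per-class F1 and support values. The sketch below is illustrative and is not part of the original evaluation pipeline. It uses the two-decimal figures from the Per-Class Performance table, so it can only be expected to agree with the reported 0.92 (macro) and 0.94 (weighted) to within rounding error. The names `PER_CLASS`, `macro_f1`, and `weighted_f1` are hypothetical.

```python
# (F1, support) pairs copied from the Per-Class Performance table, in table order.
PER_CLASS = [
    (0.89, 9),    # AA_AADE_OTHER
    (1.00, 10),   # AA_Certificate_of_Current_Image
    (0.91, 27),   # AA_ENERGY
    (0.92, 39),   # AA_Employer's_Certificate/Payroll
    (1.00, 190),  # AA_ID_Card
    (0.89, 77),   # AA_INCOME_TAX_RETURN_-_E1
    (1.00, 8),    # AA_INCOME_TAX_RETURN_LEGAL
    (1.00, 7),    # AA_LEGAL_ENTITY_MINUTES
    (0.89, 8),    # AA_LEGAL_ENT_ARTICLES
    (0.79, 17),   # AA_LEGAL_ENT_CERTIFICATE
    (0.98, 26),   # AA_NEW_POLICE_IDENTITY_CARD
    (0.92, 30),   # AA_Natural_Person_Form
    (0.93, 74),   # AA_Pension_Certificate
    (0.97, 147),  # AA_Personal_Income_Tax_(FEP)
    (0.84, 9),    # AA_SOLEMN_DECLARATION
    (0.94, 65),   # AA_TELEPHONY
    (0.72, 22),   # BB_Other_Documents
    (0.95, 571),  # Other
]


def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows) / len(rows)


def weighted_f1(rows):
    """Mean of per-class F1 weighted by each class's test support."""
    total = sum(support for _, support in rows)
    return sum(f1 * support for f1, support in rows) / total


print(f"classes: {len(PER_CLASS)}, test samples: {sum(s for _, s in PER_CLASS)}")
print(f"macro F1:    {macro_f1(PER_CLASS):.3f}")
print(f"weighted F1: {weighted_f1(PER_CLASS):.3f}")
```

Macro F1 treats all 18 classes equally, so small classes such as BB_Other_Documents (22 samples, F1=0.72) pull it down noticeably. Weighted F1 is dominated by the Other class, which accounts for 571 of the 1,336 test samples.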
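
## Appendix: Reading the Sigmoid Confidence Scores

The inference snippet applies a sigmoid to each logit independently and then takes the argmax, so the reported confidence is a per-class score rather than a probability over all 18 classes, and the scores for one page need not sum to 1. The following is a minimal pure-Python sketch of ranking the top-k classes under that scheme. The helper `top_k_labels`, the three-entry `example_id2label` mapping, and the logit values are all hypothetical and chosen only for illustration; they are not the model's real label IDs or outputs.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def top_k_labels(
    logits: List[float], id2label: Dict[str, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k labels with the highest independent sigmoid scores."""
    scored = [(id2label[str(i)], sigmoid(z)) for i, z in enumerate(logits)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy mapping and logits (NOT the model's real IDs or outputs).
example_id2label = {"0": "AA_ID_Card", "1": "AA_TELEPHONY", "2": "Other"}
example_logits = [4.0, -3.0, 0.0]

ranked = top_k_labels(example_logits, example_id2label, k=2)
for label, score in ranked:
    print(f"{label}: {score:.4f}")

# Scores are independent, so their sum is not constrained to 1.
total_score = sum(sigmoid(z) for z in example_logits)
print(f"sum of scores: {total_score:.4f}")
```

Inspecting the runner-up score alongside the top prediction can help flag ambiguous pages, for example when two classes both score highly.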