---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---

# Arch-L3869-PageClassification

## Model Details

### Model Description

This is a **Greek text classification model** for categorizing document pages into 18 different classes. The model was trained using a two-phase approach:

1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model using Supervised Contrastive Learning (SCL) to create better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss for handling class imbalance.

- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)

### Model Architecture

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Kept Layers:** [0, 2, 4, 6, 8, 11] (6 of the base model's 12 encoder layers retained for efficiency; the rest were pruned)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000
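
Keeping six of the twelve encoder layers means the surviving layers end up renumbered 0 to 5 in the pruned model. The sketch below shows one way that renumbering could be expressed as a parameter-name remap. The helper `remap_layer_key` and the `encoder.layer.<i>.` key pattern (the naming used by Hugging Face BERT checkpoints) are assumptions for illustration, not the pruning script actually used for this model.

```python
import re
from typing import Optional

# Encoder layers kept from the 12-layer base model (taken from the list above).
KEPT_LAYERS = [0, 2, 4, 6, 8, 11]
# Old layer index -> new, contiguous index in the pruned model.
OLD_TO_NEW = {old: new for new, old in enumerate(KEPT_LAYERS)}

_LAYER_KEY = re.compile(r"^(.*encoder\.layer\.)(\d+)(\..*)$")


def remap_layer_key(key: str) -> Optional[str]:
    """Return the pruned-model name of a parameter, or None if its layer is dropped."""
    match = _LAYER_KEY.match(key)
    if match is None:
        return key  # embeddings, pooler, classifier head: name unchanged
    prefix, index, suffix = match.groups()
    old = int(index)
    if old not in OLD_TO_NEW:
        return None  # layer was pruned
    return f"{prefix}{OLD_TO_NEW[old]}{suffix}"


print(remap_layer_key("bert.encoder.layer.11.output.dense.weight"))
# bert.encoder.layer.5.output.dense.weight
print(remap_layer_key("bert.encoder.layer.3.output.dense.weight"))
# None
```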

## Uses

### Direct Use

This model classifies document pages (text extracted via OCR) into one of 18 categories:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ **IMPORTANT:** This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.

```python
from __future__ import annotations  # lets the `str | None` annotation below work on Python 3.9

import re
import unicodedata

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: str | None = None) -> str:
    """
    Main preprocessing function.

    Steps:
        1. Remove special symbols
        2. Collapse multiple dots into single dot
        3. Remove accents + lowercase
        4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)

    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```
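
To make the effect of the four steps concrete, here is a small worked example. It uses a condensed copy of the preprocessing above (same symbol pattern, same order of steps) so that it runs on its own; the sample string is made up for illustration.

```python
import re
import unicodedata

# Same pattern as SYMBOLS_TO_REMOVE above.
SYMBOLS = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def preprocess(text: str) -> str:
    """Condensed copy of preprocess_text, for illustration only."""
    text = re.sub(SYMBOLS, " ", text)          # 1. symbols -> spaces
    text = re.sub(r"\.{2,}", ". ", text)       # 2. collapse runs of dots
    text = "".join(                            # 3. strip accents, lowercase
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()
    return re.sub(r"\s+", " ", text).strip()   # 4. normalize whitespace


sample = "Δήλωση Φορολογίας: Σελίδα 1/3"
print(preprocess(sample))
# δηλωση φορολογιας σελιδα 1 3
```

Note that the colon and slash become spaces rather than disappearing, so `1/3` is split into two tokens.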

### Inference Code Snippet (includes preprocessing + dummy strings)

```python
from __future__ import annotations  # lets the `str | None` annotation below work on Python 3.9

import json
import re
import unicodedata
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: str | None = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)

# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt"
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits)  # independent per-class scores; they need not sum to 1
    predictions = probabilities.argmax(dim=1)  # highest-scoring class for each text

# Get labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```
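
Because the snippet applies a sigmoid instead of a softmax, the reported confidence is a per-class score, not a share of a distribution. The dependency-free sketch below illustrates the two consequences; the logits are hypothetical values chosen for illustration, not model output.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for one page over three classes (illustrative only).
logits = [2.0, -1.0, 0.5]
scores = [sigmoid(z) for z in logits]

top_by_logit = max(range(len(logits)), key=logits.__getitem__)
top_by_score = max(range(len(scores)), key=scores.__getitem__)

# Sigmoid is monotonic, so the predicted class is the same either way.
print(top_by_logit == top_by_score)  # True
# The scores are independent, so their sum is generally not 1.
print(sum(scores) > 1.0)             # True for these values
```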

### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```

## Training Details

### Training Data

- **Dataset:** Internal annotated document dataset
- **Train + Validation Samples:** ~6,600
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

### Training Procedure

#### Phase 1: Contrastive Learning
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]
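
For readers unfamiliar with the Phase 1 objective, the following is a minimal, dependency-free sketch of a supervised contrastive loss in its commonly published form: each anchor is pulled toward samples with the same label and pushed away from all others. The function name, the temperature of 0.1, and the toy embeddings are assumptions for illustration; the card does not state the temperature or the implementation used in training, which would normally be vectorized in PyTorch.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings: Sequence[Sequence[float]],
                labels: Sequence[int],
                temperature: float = 0.1) -> float:
    """Mean supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    # Temperature-scaled cosine similarities between all pairs.
    sim = [[sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)] for i in range(n)]
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchor without a same-class partner contributes nothing
        log_denom = math.log(sum(math.exp(sim[i][a]) for a in range(n) if a != i))
        total += -sum(sim[i][p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-D embeddings: two tight clusters.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(emb, [0, 0, 1, 1])     # labels match the clusters
misaligned = supcon_loss(emb, [0, 1, 0, 1])  # labels cut across the clusters
print(aligned < misaligned)  # True
```

The loss is low when same-class pages already sit close together in embedding space, which is the property Phase 1 tries to establish before the classification head is trained.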

#### Phase 2: Classification
- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)
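
The sketch below illustrates how an asymmetric loss with a focusing parameter of 4 down-weights easy negatives compared with plain binary cross-entropy. It follows the commonly published formulation with separate focusing parameters for positives and negatives plus a probability margin. Treating the card's `gamma=4` as the negative focusing parameter, and the values `gamma_pos=0.0` and `margin=0.05`, are assumptions; the card does not specify them.

```python
import math
from typing import Sequence


def asymmetric_loss(probs: Sequence[float],
                    targets: Sequence[int],
                    gamma_neg: float = 4.0,   # assumed to be the card's gamma=4
                    gamma_pos: float = 0.0,   # assumption
                    margin: float = 0.05,     # assumption
                    eps: float = 1e-8) -> float:
    """Sum of per-class asymmetric loss terms for one sample."""
    loss = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            loss -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            p_shift = max(p - margin, 0.0)  # shift: very easy negatives cost nothing
            loss -= p_shift ** gamma_neg * math.log(max(1.0 - p_shift, eps))
    return loss


def bce(probs: Sequence[float], targets: Sequence[int]) -> float:
    return -sum(math.log(p if y == 1 else 1.0 - p) for p, y in zip(probs, targets))


easy_negative = ([0.10], [0])
print(asymmetric_loss(*easy_negative) < bce(*easy_negative))  # True
print(asymmetric_loss([0.03], [0]))                           # 0.0 (below the margin)
```

With 18 classes and one true label per page, most per-class terms are negatives, so shrinking the easy ones keeps them from dominating the gradient.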

### Framework Versions

- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x

## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |

### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image_of_Entity | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Information_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |
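
As a sanity check on how the overall scores relate to this table, the snippet below recomputes macro and support-weighted F1 from the rounded per-class values. Because the inputs are already rounded to two decimals, the results should only be expected to match the reported 0.92 and 0.94 to within rounding.

```python
# (F1, support) pairs copied from the per-class table, in table order.
ROWS = [
    (0.89, 9), (1.00, 10), (0.91, 27), (0.92, 39), (1.00, 190), (0.89, 77),
    (1.00, 8), (1.00, 7), (0.89, 8), (0.79, 17), (0.98, 26), (0.92, 30),
    (0.93, 74), (0.97, 147), (0.84, 9), (0.94, 65), (0.72, 22), (0.95, 571),
]

total_support = sum(n for _, n in ROWS)
# Macro F1: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in ROWS) / len(ROWS)
# Weighted F1: each class counts in proportion to its test support.
weighted_f1 = sum(f1 * n for f1, n in ROWS) / total_support

print(total_support)         # 1336
print(f"{macro_f1:.3f}")     # 0.919
print(f"{weighted_f1:.3f}")  # 0.946
```

The gap between the two averages is small here, but the macro score is the one that reacts to the small classes: the seven classes with 10 or fewer test samples carry the same weight as the 571-sample Other class.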

### Key Performance Highlights

- ✅ **Other class:** F1=0.95 (excellent handling of the majority class)
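- ✅ **BB_Other_Documents:** F1=0.72 (recall 0.64); the lowest score in the table, but the best result among all trained models for this class
- ✅ **Perfect-F1 classes:** AA_ID_Card, AA_Certificate_of_Current_Image_of_Entity, AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS, and AA_LEGAL_ENTITY_MINUTES all reach F1=1.00 (the last three on 10 or fewer test samples each)
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79, precision 0.71); likely needs more training data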

## Model Files

| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |
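
One detail worth knowing when reading `id2label.json` directly: JSON object keys are always strings, so the mapping comes back keyed by `"0"` to `"17"`. That is why the inference snippet looks labels up with `str(pred.item())`. The two entries below are taken from the class table; the round trip stands in for writing and reading the file.

```python
import json

id2label = {0: "AA_AADE_OTHER", 17: "Other"}  # two entries from the class table
restored = json.loads(json.dumps(id2label))   # stand-in for saving and loading the file

print(restored)           # {'0': 'AA_AADE_OTHER', '17': 'Other'}
print(17 in restored)     # False: integer keys did not survive
print(restored[str(17)])  # Other
```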

## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.