---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---
# Arch-L3869-PageClassification
## Model Details
### Model Description
This is a **Greek text classification model** for categorizing document pages into 18 different classes. The model was trained using a two-phase approach:
1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model using Supervised Contrastive Learning (SCL) to create better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss for handling class imbalance.
- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)
### Model Architecture
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Kept Layers:** [0, 2, 4, 6, 8, 11] (6 of the 12 encoder layers retained for efficiency)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000
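Keeping layers [0, 2, 4, 6, 8, 11] amounts to selecting a subset of the encoder stack. The exact pruning code is internal; the following is a minimal torch-only sketch in which a stack of `nn.Linear` modules stands in for the 12 BERT encoder layers (with a real Hugging Face model you would reassign `model.bert.encoder.layer` the same way and also set `config.num_hidden_layers = 6`):

```python
import torch.nn as nn

KEEP_LAYERS = [0, 2, 4, 6, 8, 11]

def prune_encoder_layers(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    """Return a new ModuleList holding only the selected encoder layers, in order."""
    return nn.ModuleList(layers[i] for i in keep)

# Toy stand-in for a 12-layer BERT encoder stack (hidden size 768)
toy_encoder = nn.ModuleList(nn.Linear(768, 768) for _ in range(12))
pruned = prune_encoder_layers(toy_encoder, KEEP_LAYERS)
print(len(pruned))  # 6
```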
## Uses
### Direct Use
This model classifies document pages (text extracted via OCR) into one of 18 categories:
| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |
## How to Get Started with the Model
### Prerequisites
```bash
pip install transformers torch
```
### Preprocessing Function (Required!)
⚠️ **IMPORTANT:** This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.
```python
import re
import unicodedata
from typing import Optional

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.
    Steps:
    1. Remove special symbols
    2. Collapse multiple dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```
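As a quick sanity check, here is a compact standalone version of the same pipeline with a worked example (the sample string is illustrative, not taken from the training data):

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def preprocess_text(text: str) -> str:
    """Symbol removal, dot collapsing, accent stripping + lowercase, whitespace normalization."""
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn").lower()
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text("Καλημέρα... Κόσμε!!"))  # καλημερα. κοσμε
```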
### Inference Code Snippet (includes preprocessing and sample texts)
```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()
# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
id2label = json.load(f)
# Dummy texts (examples)
texts = [
"ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
"ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]
# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]
# Tokenize
inputs = tokenizer(
preprocessed_texts,
truncation=True,
padding="max_length",
max_length=512,
return_tensors="pt"
)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits)  # sigmoid, consistent with the Asymmetric Loss training objective
    predictions = probabilities.argmax(dim=1)
# Get labels
for i, pred in enumerate(predictions):
label = id2label[str(pred.item())]
confidence = probabilities[i][pred].item()
print(f"Text: {texts[i][:50]}...")
print(f"Prediction: {label} (confidence: {confidence:.4f})")
print()
```
### Expected Output
```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)
Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```
## Training Details
### Training Data
- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)
### Training Procedure
#### Phase 1: Contrastive Learning
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]
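The Phase 1 objective can be sketched as follows. The internal implementation is not published, so this follows the standard SupCon formulation (Khosla et al.); the temperature of 0.07 is an assumed default, not a documented hyperparameter:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised Contrastive Loss over a batch of sentence embeddings."""
    z = F.normalize(embeddings, dim=1)                  # unit-normalize embeddings
    sim = z @ z.T / temperature                         # pairwise scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True) # log-softmax over the other samples
    # Mean log-probability over each anchor's positives (same-class samples)
    sum_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    counts = pos_mask.sum(dim=1)
    loss_per_anchor = -sum_pos / counts.clamp(min=1)
    return loss_per_anchor[counts > 0].mean()

# Toy batch: 4 embeddings, two classes
emb = torch.randn(4, 8)
labels = torch.tensor([0, 0, 1, 1])
loss = supcon_loss(emb, labels)
```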
#### Phase 2: Classification
- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)
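The Asymmetric Loss used in Phase 2 can be sketched as below. The model card only states gamma=4, so this sketch assumes the common ASL setting of `gamma_neg=4`, `gamma_pos=0` and omits the optional probability-margin clipping; the exact internal variant may differ. Down-weighting easy negatives via the `p ** gamma_neg` factor is what helps with the class imbalance:

```python
import torch

def asymmetric_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gamma_neg: float = 4.0, gamma_pos: float = 0.0,
                    eps: float = 1e-8) -> torch.Tensor:
    """Asymmetric Loss (Ridnik et al.) on one-hot targets; focuses on hard negatives."""
    p = torch.sigmoid(logits)
    pos_term = targets * torch.log(p.clamp(min=eps)) * (1 - p) ** gamma_pos
    neg_term = (1 - targets) * torch.log((1 - p).clamp(min=eps)) * p ** gamma_neg
    return -(pos_term + neg_term).sum(dim=1).mean()

# Toy batch: 2 samples, 3 classes, single-label targets as one-hot rows
logits = torch.tensor([[3.0, -2.0, -1.0], [-1.0, 2.5, -2.0]])
targets = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = asymmetric_loss(logits, targets)
```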
### Framework Versions
- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x
## Evaluation Results
### Overall Metrics (Test Set: 1,336 samples)
| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |
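With "Other" making up ~43% of the test set, the gap between macro F1 (every class counts equally) and weighted F1 (classes weighted by support) matters. A small self-contained illustration of the two averages on a toy imbalanced set (the labels here are hypothetical):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro (unweighted mean) and weighted (support-weighted) averages."""
    classes = sorted(set(y_true))
    per_class, support = {}, Counter(y_true)
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted

# Majority class predicted well, rare class predicted poorly -> macro < weighted
y_true = ["other"] * 8 + ["rare"] * 2
y_pred = ["other"] * 8 + ["other", "rare"]
per_class, macro, weighted = f1_scores(y_true, y_pred)
```

A low score on a rare class (such as BB_Other_Documents above) therefore drags macro F1 down much more than weighted F1.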
### Per-Class Performance
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |
### Key Performance Highlights
- ✅ **Other class:** F1=0.95 (excellent handling of the majority class)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class)
- ✅ **High-confidence classes:** AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all achieve F1 = 1.00
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79) - needs more training data
## Model Files
| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |
## Model Card Authors
AI Services Team - Archeiothiki S.A.
## Model Card Contact
Internal use only.