---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---
# Arch-L3869-PageClassification
## Model Details
### Model Description
This is a **Greek text classification model** for categorizing document pages into 18 different classes. The model was trained using a two-phase approach:
1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model using Supervised Contrastive Learning (SCL) to create better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss for handling class imbalance.
- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)
### Model Architecture
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Kept Layers:** [0, 2, 4, 6, 8, 11] (6 of the 12 encoder layers retained for efficiency)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000
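Keeping layers [0, 2, 4, 6, 8, 11] amounts to selecting a subset of the encoder stack. The exact pruning code is internal; the following is a minimal torch-only sketch in which a stack of `nn.Linear` modules stands in for the 12 BERT encoder layers (with a real Hugging Face model you would reassign `model.bert.encoder.layer` the same way and also set `config.num_hidden_layers = 6`):

```python
import torch.nn as nn

KEEP_LAYERS = [0, 2, 4, 6, 8, 11]

def prune_encoder_layers(layers: nn.ModuleList, keep: list) -> nn.ModuleList:
    """Return a new ModuleList holding only the selected encoder layers, in order."""
    return nn.ModuleList(layers[i] for i in keep)

# Toy stand-in for a 12-layer BERT encoder stack (hidden size 768)
toy_encoder = nn.ModuleList(nn.Linear(768, 768) for _ in range(12))
pruned = prune_encoder_layers(toy_encoder, KEEP_LAYERS)
print(len(pruned))  # 6
```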
## Uses
### Direct Use
This model classifies document pages (text extracted via OCR) into one of 18 categories:
| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |
## How to Get Started with the Model
### Prerequisites
```bash
pip install transformers torch
```
### Preprocessing Function (Required!)
⚠️ **IMPORTANT:** This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.
```python
import re
import unicodedata
from typing import Optional

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.
    Steps:
    1. Remove special symbols
    2. Collapse multiple dots into a single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```
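As a quick sanity check, here is a compact standalone version of the same pipeline with a worked example (the sample string is illustrative, not taken from the training data):

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def preprocess_text(text: str) -> str:
    """Symbol removal, dot collapsing, accent stripping + lowercase, whitespace normalization."""
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn").lower()
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_text("Καλημέρα... Κόσμε!!"))  # καλημερα. κοσμε
```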
### Inference Code Snippet (includes preprocessing and sample texts)
```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"

def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()

def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
# Load model and tokenizer
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()
# Load label mapping
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
id2label = json.load(f)
# Dummy texts (examples)
texts = [
"ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
"ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]
# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]
# Tokenize
inputs = tokenizer(
preprocessed_texts,
truncation=True,
padding="max_length",
max_length=512,
return_tensors="pt"
)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.sigmoid(logits)  # sigmoid, consistent with the Asymmetric Loss training objective
    predictions = probabilities.argmax(dim=1)
# Get labels
for i, pred in enumerate(predictions):
label = id2label[str(pred.item())]
confidence = probabilities[i][pred].item()
print(f"Text: {texts[i][:50]}...")
print(f"Prediction: {label} (confidence: {confidence:.4f})")
print()
```
### Expected Output
```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)
Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```
## Training Details
### Training Data
- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)
### Training Procedure
#### Phase 1: Contrastive Learning
- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]
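The Phase 1 objective can be sketched as follows. The internal implementation is not published, so this follows the standard SupCon formulation (Khosla et al.); the temperature of 0.07 is an assumed default, not a documented hyperparameter:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.07) -> torch.Tensor:
    """Supervised Contrastive Loss over a batch of sentence embeddings."""
    z = F.normalize(embeddings, dim=1)                  # unit-normalize embeddings
    sim = z @ z.T / temperature                         # pairwise scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True) # log-softmax over the other samples
    # Mean log-probability over each anchor's positives (same-class samples)
    sum_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    counts = pos_mask.sum(dim=1)
    loss_per_anchor = -sum_pos / counts.clamp(min=1)
    return loss_per_anchor[counts > 0].mean()

# Toy batch: 4 embeddings, two classes
emb = torch.randn(4, 8)
labels = torch.tensor([0, 0, 1, 1])
loss = supcon_loss(emb, labels)
```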
#### Phase 2: Classification
- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)
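The Asymmetric Loss used in Phase 2 can be sketched as below. The model card only states gamma=4, so this sketch assumes the common ASL setting of `gamma_neg=4`, `gamma_pos=0` and omits the optional probability-margin clipping; the exact internal variant may differ. Down-weighting easy negatives via the `p ** gamma_neg` factor is what helps with the class imbalance:

```python
import torch

def asymmetric_loss(logits: torch.Tensor, targets: torch.Tensor,
                    gamma_neg: float = 4.0, gamma_pos: float = 0.0,
                    eps: float = 1e-8) -> torch.Tensor:
    """Asymmetric Loss (Ridnik et al.) on one-hot targets; focuses on hard negatives."""
    p = torch.sigmoid(logits)
    pos_term = targets * torch.log(p.clamp(min=eps)) * (1 - p) ** gamma_pos
    neg_term = (1 - targets) * torch.log((1 - p).clamp(min=eps)) * p ** gamma_neg
    return -(pos_term + neg_term).sum(dim=1).mean()

# Toy batch: 2 samples, 3 classes, single-label targets as one-hot rows
logits = torch.tensor([[3.0, -2.0, -1.0], [-1.0, 2.5, -2.0]])
targets = torch.tensor([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = asymmetric_loss(logits, targets)
```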
### Framework Versions
- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x
## Evaluation Results
### Overall Metrics (Test Set: 1,336 samples)
| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |
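With "Other" making up ~43% of the test set, the gap between macro F1 (every class counts equally) and weighted F1 (classes weighted by support) matters. A small self-contained illustration of the two averages on a toy imbalanced set (the labels here are hypothetical):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Per-class F1 plus macro (unweighted mean) and weighted (support-weighted) averages."""
    classes = sorted(set(y_true))
    per_class, support = {}, Counter(y_true)
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_class.values()) / len(classes)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, weighted

# Majority class predicted well, rare class predicted poorly -> macro < weighted
y_true = ["other"] * 8 + ["rare"] * 2
y_pred = ["other"] * 8 + ["other", "rare"]
per_class, macro, weighted = f1_scores(y_true, y_pred)
```

A low score on a rare class (such as BB_Other_Documents above) therefore drags macro F1 down much more than weighted F1.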
### Per-Class Performance
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_LEGAL | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |
### Key Performance Highlights
- ✅ **Other class:** F1=0.95 (excellent handling of the majority class)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class)
- ✅ **High-confidence classes:** AA_ID_Card, AA_Certificate_of_Current_Image, AA_INCOME_TAX_RETURN_LEGAL, and AA_LEGAL_ENTITY_MINUTES all achieve F1 = 1.00
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79) - needs more training data
## Model Files
| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |
## Model Card Authors
AI Services Team - Archeiothiki S.A.
## Model Card Contact
Internal use only.