---
license: other
language:
- el
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- bert
- greek
- document-classification
- page-classification
- nlp
- contrastive-learning
base_model: nlpaueb/bert-base-greek-uncased-v1
metrics:
- accuracy
- f1
---

# Arch-L3869-PageClassification

## Model Details

### Model Description

This is a **Greek text classification model** that assigns each document page to one of 18 classes. The model was trained in two phases:

1. **Phase 1 (Contrastive Learning):** Further pre-training of the base BERT model with Supervised Contrastive Learning (SCL) to produce better document embeddings.
2. **Phase 2 (Classification):** Fine-tuning with Asymmetric Loss to handle class imbalance.

- **Developed by:** Archeiothiki S.A. - AI Services Team
- **Model type:** BertForSequenceClassification
- **Language(s):** Greek (el)
- **Finetuned from model:** [nlpaueb/bert-base-greek-uncased-v1](https://huggingface.co/nlpaueb/bert-base-greek-uncased-v1)

### Model Architecture

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Layers Kept After Pruning:** [0, 2, 4, 6, 8, 11] (6 of the base model's 12 encoder layers, kept for efficiency)
- **Hidden Size:** 768
- **Attention Heads:** 12
- **Max Position Embeddings:** 512
- **Vocab Size:** 35,000
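
The card does not say how the layers were pruned. One common way to keep a subset of encoder layers is sketched below. The helper name `keep_encoder_layers` and the tiny randomly initialised configuration are assumptions made only so the example runs without downloading weights; the real base model has a hidden size of 768.

```python
import torch
from torch import nn
from transformers import BertConfig, BertModel

KEEP_LAYERS = [0, 2, 4, 6, 8, 11]  # layer indices listed in this card

# Tiny random 12-layer BERT (assumption: stands in for the real base model).
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=64,
)
model = BertModel(config)


def keep_encoder_layers(bert: BertModel, keep: list) -> BertModel:
    """Hypothetical helper: keep only the encoder layers whose indices are in `keep`."""
    bert.encoder.layer = nn.ModuleList([bert.encoder.layer[i] for i in keep])
    bert.config.num_hidden_layers = len(keep)  # keep the config in sync for saving
    return bert


pruned = keep_encoder_layers(model, KEEP_LAYERS)
pruned.eval()
with torch.no_grad():
    hidden = pruned(input_ids=torch.tensor([[1, 2, 3, 4]])).last_hidden_state

print(len(pruned.encoder.layer), tuple(hidden.shape))
```

Updating `num_hidden_layers` matters because the saved `config.json` must describe the 6-layer model for `from_pretrained` to rebuild it correctly.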

## Uses

### Direct Use

This model classifies document pages (text extracted via OCR) into one of 18 categories:

| ID | Class Label | Description |
|----|-------------|-------------|
| 0 | AA_AADE_OTHER | Other AADE documents |
| 1 | AA_Certificate_of_Current_Image_of_Entity | Business/Partnership Certificates |
| 2 | AA_ENERGY | Energy bills |
| 3 | AA_Employer's_Certificate/Payroll | Employment certificates |
| 4 | AA_ID_Card | Identity cards |
| 5 | AA_INCOME_TAX_RETURN_-_E1 | Income tax return (E1 form) |
| 6 | AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | Legal entity tax returns (N form) |
| 7 | AA_LEGAL_ENTITY_MINUTES | General Assembly/Board minutes |
| 8 | AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | Articles of association |
| 9 | AA_LEGAL_ENT_CERTIFICATE | Commercial Registry certificates |
| 10 | AA_NEW_POLICE_IDENTITY_CARD | New police ID cards |
| 11 | AA_Natural_Person_Information_Form | Ownership certificates |
| 12 | AA_Pension_Certificate | Pension certificates |
| 13 | AA_Personal_Income_Tax_(FEP) | Personal income tax (FEP) |
| 14 | AA_SOLEMN_DECLARATION | Solemn declarations |
| 15 | AA_TELEPHONY | Phone bills |
| 16 | BB_Other_Documents | Other identifiable documents |
| 17 | Other | Unclassified pages |

## How to Get Started with the Model

### Prerequisites

```bash
pip install transformers torch
```

### Preprocessing Function (Required!)

⚠️ **IMPORTANT:** This preprocessing MUST be applied to all texts before inference. The model was trained with this preprocessing.

```python
import re
import unicodedata
from typing import Optional

# Same symbols removed during training
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    """Remove accents and convert to lowercase."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


# Optional[str] rather than `str | None`, which fails at definition time on Python 3.9.
def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    """
    Main preprocessing function.

    Steps:
    1. Remove special symbols
    2. Collapse multiple dots into single dot
    3. Remove accents + lowercase
    4. Normalize whitespace
    """
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)
```
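
As a quick sanity check, the snippet below runs a condensed, behaviour-equivalent copy of the preprocessing on one sample string. The input text is made up for illustration; the symbol pattern is the one given in this card.

```python
import re
import unicodedata

SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def preprocess_text(text: str) -> str:
    """Condensed copy of the card's preprocessing, inlined for a standalone example."""
    text = re.sub(SYMBOLS_TO_REMOVE, " ", text)   # 1. symbols -> space
    text = re.sub(r"\.{2,}", ". ", text)          # 2. runs of dots -> single dot
    text = "".join(                               # 3. strip accents, lowercase
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()
    return re.sub(r"\s+", " ", text).strip()      # 4. normalize whitespace


raw = "Δελτίο  Ταυτότητας: Αθήνα, 2024..."       # made-up sample input
cleaned = preprocess_text(raw)
print(cleaned)  # δελτιο ταυτοτητας αθηνα 2024.
```

The colon and comma become spaces, the trailing dots collapse to a single dot, accents are stripped, and repeated whitespace is squeezed. Applying the function to its own output leaves the text unchanged.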

### Inference Code Snippet (includes preprocessing + dummy strings)

```python
import json
import re
import unicodedata
from typing import Optional

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Preprocessing (REQUIRED!)
SYMBOLS_TO_REMOVE = r"[`~!@#$%^&*()\-+=\[\]{\}/?><,\'\":;|»«§°·¦ʼ¬£€©΄´\\…\n]"


def strip_accents_and_lowercase(text: str) -> str:
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    ).lower()


def clean_text(text: str, symbols_to_remove: Optional[str] = None) -> str:
    if symbols_to_remove:
        text = re.sub(symbols_to_remove, " ", text)
    text = re.sub(r"\.{2,}", ". ", text)
    text = strip_accents_and_lowercase(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def preprocess_text(text: str) -> str:
    return clean_text(text, symbols_to_remove=SYMBOLS_TO_REMOVE)


# Load model and tokenizer (MODEL_PATH is a local directory holding the model files)
MODEL_PATH = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

# Load label mapping (keys are strings, e.g. "4")
with open(f"{MODEL_PATH}/id2label.json", "r", encoding="utf-8") as f:
    id2label = json.load(f)

# Dummy texts (examples)
texts = [
    "ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ",
    "ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024",
]

# Preprocess texts
preprocessed_texts = [preprocess_text(t) for t in texts]

# Tokenize
inputs = tokenizer(
    preprocessed_texts,
    truncation=True,
    padding="max_length",
    max_length=512,
    return_tensors="pt",
)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    # Per-class sigmoid: scores are independent and do not sum to 1
    probabilities = torch.sigmoid(logits)
    predictions = probabilities.argmax(dim=1)

# Get labels
for i, pred in enumerate(predictions):
    label = id2label[str(pred.item())]
    confidence = probabilities[i][pred].item()
    print(f"Text: {texts[i][:50]}...")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print()
```

### Expected Output

```
Text: ΔΕΛΤΙΟ ΑΣΤΥΝΟΜΙΚΗΣ ΤΑΥΤΟΤΗΤΑΣ ΠΑΠΑΔΟΠΟΥΛΟΣ ΙΩΑΝΝΗΣ...
Prediction: AA_ID_Card (confidence: 0.9842)

Text: ΕΝΤΥΠΟ Ε1 ΔΗΛΩΣΗ ΦΟΡΟΛΟΓΙΑΣ ΕΙΣΟΔΗΜΑΤΟΣ 2024...
Prediction: AA_INCOME_TAX_RETURN_-_E1 (confidence: 0.9567)
```
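
Because the scores come from a per-class sigmoid, they are independent of each other and need not sum to 1, so it can be useful to inspect the runners-up as well as the top class. The helper below is a hypothetical sketch, not part of the released code; `top_k_labels`, the four-class toy mapping, and the score values are all made up for illustration. In real use, one row of `probabilities` converted with `.tolist()` would take the place of `scores`.

```python
from typing import Dict, List, Tuple


def top_k_labels(
    scores: List[float], id2label: Dict[str, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Hypothetical helper: return the k highest-scoring (label, score) pairs."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(id2label[str(idx)], score) for idx, score in ranked[:k]]


# Toy mapping with string keys, as produced by json.load on id2label.json.
id2label = {"0": "AA_ID_Card", "1": "AA_ENERGY", "2": "AA_TELEPHONY", "3": "Other"}
scores = [0.91, 0.05, 0.40, 0.62]  # made-up sigmoid scores; note they sum to more than 1

print(top_k_labels(scores, id2label, k=2))
# [('AA_ID_Card', 0.91), ('Other', 0.62)]
```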

## Training Details

### Training Data

- **Dataset:** Internal annotated document dataset
- **Total Samples:** ~6,600 (train + validation)
- **Test Samples:** 1,336
- **Classes:** 18 (imbalanced distribution)
- **Largest Class:** Other (571 test samples, ~43%)
- **Smallest Class:** AA_LEGAL_ENTITY_MINUTES (7 test samples, ~0.5%)

### Training Procedure

#### Phase 1: Contrastive Learning

- **Base Model:** nlpaueb/bert-base-greek-uncased-v1
- **Loss Function:** Supervised Contrastive Loss (SCL)
- **Epochs:** 200
- **Learning Rate:** 2e-5
- **Batch Size:** 32
- **Layer Pruning:** Kept layers [0, 2, 4, 6, 8, 11]
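
For readers unfamiliar with Supervised Contrastive Loss, the following is a dependency-free reference sketch of the standard formulation: each anchor is pulled toward the other samples in the batch that share its label and pushed away from the rest. It is not the training code used for this model. The temperature of 0.1 and the toy 2-D embeddings are assumptions, since the card specifies neither the temperature nor which embedding was used.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    temperature: float = 0.1,  # assumption: not stated in the card
) -> float:
    """Mean supervised contrastive loss over anchors that have at least one positive."""
    z = [l2_normalize(v) for v in embeddings]
    per_anchor = []
    for i in range(len(z)):
        # Temperature-scaled cosine similarity to every other sample in the batch.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(len(z)) if a != i
        }
        positives = [a for a in sims if labels[a] == labels[i]]
        if not positives:
            continue  # an anchor without same-label partners contributes nothing
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        per_anchor.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy batch: two tight clusters in 2-D.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = supcon_loss(emb, [0, 0, 1, 1])  # labels agree with the clusters
bad = supcon_loss(emb, [0, 1, 0, 1])   # labels cut across the clusters
print(good < bad)  # True
```

The loss is low when same-label samples already sit close together, which is the property Phase 1 optimises before the classification head is trained.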

#### Phase 2: Classification

- **Base Model:** Output of Phase 1 (26_01_2026_15_00_12)
- **Loss Function:** Asymmetric Loss (gamma=4)
- **Epochs:** 50
- **Learning Rate:** 1e-4
- **Batch Size:** 32
- **Gradient Accumulation:** 2
- **Warmup Ratio:** 0.1
- **LR Scheduler:** Cosine
- **Oversampling:** BB_Other_Documents (x2)
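
Asymmetric Loss applies a sigmoid per class and down-weights easy negatives, which is why it suits imbalanced label sets. The sketch below is a plain-Python illustration of the usual formulation for one sample, not the training code. The card states only `gamma=4`; reading it as the focusing exponent for negatives (`gamma_neg`), together with `gamma_pos=0` and a probability margin of 0.05, is an assumption.

```python
import math
from typing import Sequence


def asymmetric_loss(
    probs: Sequence[float],    # per-class sigmoid outputs for one sample
    targets: Sequence[int],    # one-hot (or multi-hot) targets, 0 or 1
    gamma_neg: float = 4.0,    # assumption: the card's "gamma=4"
    gamma_pos: float = 0.0,    # assumption: not stated in the card
    margin: float = 0.05,      # assumption: not stated in the card
    eps: float = 1e-8,
) -> float:
    """Sum of per-class asymmetric loss terms for one sample."""
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            p_m = max(p - margin, 0.0)  # shift: very easy negatives are ignored
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


easy_negative = asymmetric_loss([0.10], [0])
plain_bce = -math.log(1.0 - 0.10)
print(easy_negative < plain_bce)  # True: the easy negative is heavily down-weighted
```

With `gamma_neg=0` and `margin=0` the negative term reduces to ordinary binary cross-entropy, which makes the effect of the two parameters easy to compare.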

### Framework Versions

- **Python:** 3.9.0
- **PyTorch:** 2.x
- **Transformers:** 4.38.2
- **Datasets:** 2.x

## Evaluation Results

### Overall Metrics (Test Set: 1,336 samples)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.94 |
| **Macro F1** | 0.92 |
| **Weighted F1** | 0.94 |

### Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| AA_AADE_OTHER | 0.89 | 0.89 | 0.89 | 9 |
| AA_Certificate_of_Current_Image_of_Entity | 1.00 | 1.00 | 1.00 | 10 |
| AA_ENERGY | 0.92 | 0.89 | 0.91 | 27 |
| AA_Employer's_Certificate/Payroll | 0.86 | 0.97 | 0.92 | 39 |
| AA_ID_Card | 1.00 | 0.99 | 1.00 | 190 |
| AA_INCOME_TAX_RETURN_-_E1 | 0.92 | 0.86 | 0.89 | 77 |
| AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS | 1.00 | 1.00 | 1.00 | 8 |
| AA_LEGAL_ENTITY_MINUTES | 1.00 | 1.00 | 1.00 | 7 |
| AA_LEGAL_ENT_ARTICLES_OF_ASSOCIATION | 0.80 | 1.00 | 0.89 | 8 |
| AA_LEGAL_ENT_CERTIFICATE | 0.71 | 0.88 | 0.79 | 17 |
| AA_NEW_POLICE_IDENTITY_CARD | 0.96 | 1.00 | 0.98 | 26 |
| AA_Natural_Person_Information_Form | 0.90 | 0.93 | 0.92 | 30 |
| AA_Pension_Certificate | 0.92 | 0.95 | 0.93 | 74 |
| AA_Personal_Income_Tax_(FEP) | 1.00 | 0.94 | 0.97 | 147 |
| AA_SOLEMN_DECLARATION | 0.80 | 0.89 | 0.84 | 9 |
| AA_TELEPHONY | 0.97 | 0.92 | 0.94 | 65 |
| **BB_Other_Documents** | **0.82** | **0.64** | **0.72** | 22 |
| **Other** | **0.94** | **0.95** | **0.95** | 571 |
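
The overall scores can be cross-checked against the per-class rows. The sketch below recomputes the macro and weighted F1 from the rounded F1 and support values in the table above, so small rounding differences from the reported figures are expected.

```python
# (F1, support) per class, copied from the per-class table in row order.
rows = [
    (0.89, 9), (1.00, 10), (0.91, 27), (0.92, 39), (1.00, 190), (0.89, 77),
    (1.00, 8), (1.00, 7), (0.89, 8), (0.79, 17), (0.98, 26), (0.92, 30),
    (0.93, 74), (0.97, 147), (0.84, 9), (0.94, 65), (0.72, 22), (0.95, 571),
]

total_support = sum(support for _, support in rows)
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)                       # unweighted mean
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support  # support-weighted mean

print(total_support)         # 1336
print(f"{macro_f1:.2f}")     # 0.92
print(f"{weighted_f1:.3f}")  # 0.946
```

The support column sums to the stated 1,336 test samples and the macro F1 matches the reported 0.92. The weighted value of about 0.946 is consistent with the reported 0.94 once the rounding of the per-class inputs is taken into account.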

### Key Performance Highlights

- ✅ **Other class:** F1=0.95 (strong performance on the majority class)
- ✅ **BB_Other_Documents:** F1=0.72 (best among all trained models for this rare class, though recall is only 0.64)
- ✅ **Top-performing classes:** AA_ID_Card, AA_Certificate_of_Current_Image_of_Entity, AA_INCOME_TAX_RETURN_OF_LEGAL_PERSONS, and AA_LEGAL_ENTITY_MINUTES all reach an F1 of 1.00 (rounded)
- ⚠️ **Lower performance:** AA_LEGAL_ENT_CERTIFICATE (F1=0.79, precision 0.71) is the weakest of the AA_ classes and likely needs more training data

## Model Files

| File | Description | Required |
|------|-------------|----------|
| `model.safetensors` | Model weights | ✅ Yes |
| `config.json` | Model architecture + id2label/label2id | ✅ Yes |
| `tokenizer.json` | Tokenizer | ✅ Yes |
| `tokenizer_config.json` | Tokenizer config | ✅ Yes |
| `vocab.txt` | Vocabulary | ✅ Yes |
| `special_tokens_map.json` | Special tokens | ✅ Yes |
| `id2label.json` | ID to label mapping | ✅ Yes |
| `label2id.json` | Label to ID mapping | ✅ Yes |
| `test_report.txt` | Classification report | Optional |

## Model Card Authors

AI Services Team - Archeiothiki S.A.

## Model Card Contact

Internal use only.