---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- electra
- fill-mask
- greek
- legal
- discriminator
- generator
base_model:
- google/electra-base-discriminator
---

# GEM-ELECTRA Legal: A Greek Legal Language Model

## Model Description

**GEM-ELECTRA Legal** is an improved ELECTRA-base model pre-trained from scratch on a 17 GB corpus of Greek legal, parliamentary, and governmental text. This second version incorporates refined training hyperparameters for better performance and stability. It is designed to handle the complex vocabulary and context of the legal domain in Greece and the EU.

This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The ELECTRA architecture pre-trains more efficiently than masked language models like BERT: a small generator proposes replacements for masked tokens and a discriminator learns to detect which tokens were replaced, so the model receives a learning signal from every token rather than only the masked ones.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with this model and its matching tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-electra-legal",
    tokenizer="novelcore/gem-electra-legal"
)

# Example from a legal context:
# "Mr. Mitsotakis <mask> that the government fully respects the
# decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Each prediction carries the filled-in sequence, the predicted token,
# and its score
predictions = fill_mask(text)
print(predictions)
```
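
Because this checkpoint is an ELECTRA discriminator, its replaced-token-detection head can also be queried directly. The sketch below assumes the uploaded weights include the pre-training head used by `ElectraForPreTraining`; the example sentence is just an illustration:

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
discriminator = ElectraForPreTraining.from_pretrained("novelcore/gem-electra-legal")

# "The decision was published in the Government Gazette."
inputs = tokenizer("Η απόφαση δημοσιεύθηκε στην Εφημερίδα της Κυβερνήσεως.", return_tensors="pt")
with torch.no_grad():
    logits = discriminator(**inputs).logits

# A positive logit means the discriminator flags that token as "replaced"
print((logits > 0).int())
```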

For downstream tasks, load the encoder with a task-specific head:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification: this attaches a freshly initialized
# classification head, so the model still needs fine-tuning on labeled data
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
model = AutoModelForSequenceClassification.from_pretrained(
    "novelcore/gem-electra-legal",
    num_labels=2,  # set to the number of classes in your task
)
```
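
For the NER use case mentioned in the model description, the analogous step loads a token-classification head; `num_labels=9` is a placeholder for your own tag set, and the head is randomly initialized until fine-tuned:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Minimal NER fine-tuning sketch; num_labels=9 is a placeholder
# (e.g., a BIO scheme with four entity types plus "O")
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/gem-electra-legal",
    num_labels=9,
)
```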

## Training Data

The model was pre-trained on a comprehensive 17 GB corpus of Greek text compiled from various legal and governmental sources. The corpus was cleaned, converted to UTF-8, and deduplicated before training to ensure high quality and diversity.

The composition of the training corpus is as follows:

| Corpus Source | Size (GB) | Context |
| :--- | :--- | :--- |
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
| Political Reports of the Supreme Court | 1.2 | Legal |
| EUR-Lex (Greek content) | 0.92 | Legal |
| Europarl (Greek content) | 0.38 | Legal / Parliamentary |
| Raptarchis Legal Dictionary | 0.35 | Legal |
| **Total** | **~16.75** | |

## Training Procedure

### Model Architecture

The model uses the ELECTRA architecture with the following configuration (a configuration sketch follows the list):

- **Discriminator Hidden Size**: 768
- **Discriminator Attention Heads**: 12
- **Discriminator Hidden Layers**: 12
- **Generator Size Fraction**: 0.25 (192 hidden size generator)
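
For reference, this geometry maps onto `transformers` `ElectraConfig` objects roughly as in the sketch below. Only the values named above come from this card; the embedding and feed-forward sizes, and the generator's head count, are standard ELECTRA-base assumptions:

```python
from transformers import ElectraConfig

# Discriminator: hidden size, heads, and layers from the list above;
# vocab_size from the tokenizer section; embedding and feed-forward
# sizes are standard ELECTRA-base values (an assumption here)
discriminator_config = ElectraConfig(
    vocab_size=50_264,
    embedding_size=768,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    intermediate_size=3072,
)

# Generator: 0.25 x 768 = 192 hidden units as stated; heads and
# feed-forward size scaled by the same fraction (assumptions), with
# embeddings kept at discriminator size since ELECTRA ties them
generator_config = ElectraConfig(
    vocab_size=50_264,
    embedding_size=768,
    hidden_size=192,
    num_attention_heads=3,   # assumed: 0.25 x 12
    num_hidden_layers=12,
    intermediate_size=768,   # assumed: 0.25 x 3072
)
```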

### Preprocessing

The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.
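
A tokenizer with these properties could be reproduced with the `tokenizers` library along the following lines; the corpus path and the special-token set are placeholders, not the project's actual training script:

```python
from tokenizers import ByteLevelBPETokenizer

# Sketch: train an uncased byte-level BPE tokenizer with the
# vocabulary size stated above; file paths are hypothetical
tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files=["greek_legal_corpus.txt"],  # placeholder corpus file
    vocab_size=50_264,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("gem-electra-legal-tokenizer")
```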

The data was then processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence.

### Pre-training

The model was pre-trained from scratch for **200,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. This second version uses improved hyperparameters for better convergence.

The key hyperparameters used were (see the sketch after this list):

- **Learning Rate**: 1e-4 with a linear warmup of 12,000 steps
- **Batch Size**: 3,840 effective (`per_device_train_batch_size: 60` × `gradient_accumulation_steps: 8` × 8 GPUs)
- **Optimizer**: AdamW with `beta1=0.9`, `beta2=0.98`, `epsilon=1e-6`
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 200,000
- **Warmup Steps**: 12,000
- **Generator Loss Weight**: 50.0
- **Discriminator Loss Weight**: 50.0
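
Expressed as Hugging Face `TrainingArguments`, this setup maps roughly onto the sketch below. The output directory is a placeholder, and the generator/discriminator loss weights are not `TrainingArguments` fields; they belong to the ELECTRA objective itself:

```python
from transformers import TrainingArguments

# Hedged sketch of the pre-training configuration listed above;
# dataset and ELECTRA-objective wiring are omitted
training_args = TrainingArguments(
    output_dir="gem-electra-legal-pretraining",  # placeholder path
    per_device_train_batch_size=60,
    gradient_accumulation_steps=8,  # 60 x 8 x 8 GPUs = 3,840 effective
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    max_steps=200_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,
)
```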

### Training Results

Training concluded with the following statistics:

- **Final Training Loss**: 0.0056
- **Final Evaluation Loss**: 0.0054
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 200,000