---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- roberta
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
base_model:
- roberta-base
---

# GEM-RoBERTa HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition

## Model Description

**GEM-RoBERTa HQ Legal** is a RoBERTa-base model pre-trained from scratch on a curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model employs a **quality-based data repetition strategy**: higher-quality legal sources are repeated several times during training to deepen the model's grasp of authoritative legal terminology and concepts.

The model was trained as part of a research project and is intended as a base for fine-tuning on downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the Greek legal domain. Following RoBERTa, pre-training uses dynamic masking and drops the Next Sentence Prediction (NSP) objective, relying entirely on Masked Language Modeling (MLM).

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the model and its tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-hq-legal",
    tokenizer="novelcore/gem-roberta-hq-legal"
)

# Example from a legal/parliamentary context. In English:
# "Mr. Mitsotakis <mask> that the government fully respects
# the decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get predictions for the masked token
predictions = fill_mask(text)
print(predictions)
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-hq-legal")
```
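
For token-level tasks such as the NER use case mentioned above, the same checkpoint can be loaded with a token-classification head. A minimal sketch; the `legal_ner_labels` set below is a hypothetical example, not part of this release:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set, for illustration only
legal_ner_labels = ["O", "B-COURT", "I-COURT", "B-LAW", "I-LAW"]

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-hq-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/gem-roberta-hq-legal",
    num_labels=len(legal_ner_labels),
)
```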

## Training Data

The model was pre-trained on a comprehensive corpus of Greek text compiled from various legal and governmental sources, with a **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to an effective 21.12GB through strategic repetition.

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |

### Rationale for Data Repetition

The quality-based repetition strategy increases the model's exposure to:

- **Premium legal terminology** from the Raptarchis Legal Dictionary (4x repetition)
- **High-quality judicial reasoning** from Supreme Court reports (3x repetition)
- **EU legal concepts** from Eur-Lex content (2x repetition)

This approach builds a stronger understanding of sophisticated legal language while maintaining exposure to the broader legal corpus. A minimal sketch of the oversampling idea follows.
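
In code, the strategy amounts to oversampling by duplication during data preparation. A hedged sketch; the file names and the `REPEAT_FACTORS` mapping are assumptions derived from the table above, not the project's released pipeline:

```python
# Quality-based repetition: duplicate each source in the training file
# list according to its quality tier. Paths are hypothetical placeholders.
REPEAT_FACTORS = {
    "raptarchis_dictionary.txt": 4,   # Best
    "supreme_court_reports.txt": 3,   # Medium-Best
    "eurlex_el.txt": 2,               # Medium
    "fek_gazette.txt": 1,             # Low
    "parliament_proceedings.txt": 1,  # Low-Medium
    "europarl_el.txt": 1,             # Low
}

training_files = []
for path, factor in REPEAT_FACTORS.items():
    training_files.extend([path] * factor)
```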

## Training Procedure

### Model Architecture

The model uses the RoBERTa-base architecture with the following configuration (a configuration sketch follows the list):

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Parameters**: ~125M
- **Max Position Embeddings**: 514
- **Vocabulary Size**: 50,264
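
For reference, the equivalent `transformers` configuration can be built as below; this is a sketch of the listed values, not the project's training script:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50_264,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~125M
```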

### Preprocessing

The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus, with a vocabulary of 50,264 tokens optimized for Greek legal terminology.
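
Training such a tokenizer with the `tokenizers` library might look like the sketch below; the corpus path is a placeholder and the special-token list follows RoBERTa conventions (both are assumptions, as the actual script is not published):

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["greek_legal_corpus.txt"],  # hypothetical corpus path
    vocab_size=50_264,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("gem-roberta-hq-legal-tokenizer")
```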

The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to preserve contextual coherence; higher-quality sources were repeated during this data preparation phase, as described above. A chunking sketch follows.
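
Chunking per document (rather than across the concatenated corpus) keeps every block inside a single document's context. An illustrative sketch only; the actual pipeline's handling of short final blocks is not published, so the tail-dropping choice here is an assumption:

```python
def chunk_document(token_ids: list[int], block_size: int = 512) -> list[list[int]]:
    """Split one document's token ids into fixed-size blocks.

    Blocks never cross document boundaries because each document
    is chunked independently.
    """
    blocks = [
        token_ids[i : i + block_size]
        for i in range(0, len(token_ids), block_size)
    ]
    # Assumption: drop a short final block; it could be padded instead.
    return [b for b in blocks if len(b) == block_size]
```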

### Pre-training

The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. Training took approximately **81 hours and 39 minutes** to complete.

The key hyperparameters (collected in a `TrainingArguments` sketch after this list) were:

- **Learning Rate**: 1.5e-4 with a linear warmup over 9,000 steps
- **Batch Size**: effective batch size of 2,688 (`per_device_train_batch_size: 48`, `gradient_accumulation_steps: 7`, 8 GPUs)
- **Optimizer**: AdamW with standard parameters
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 150,000
- **Warmup Steps**: 9,000
- **MLM Probability**: 0.15
- **Max Gradient Norm**: 1.0
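
A hedged reconstruction of these settings with the `transformers` `Trainer` API; `output_dir` is a placeholder and the collator wiring is an assumption, since the actual training script is not published:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-hq-legal")

# 15% masking, sampled dynamically per batch as in RoBERTa
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="gem-roberta-hq-legal",   # placeholder
    max_steps=150_000,
    learning_rate=1.5e-4,
    lr_scheduler_type="linear",
    warmup_steps=9_000,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=7,       # 48 * 7 * 8 GPUs = 2,688 effective
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
)
```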

### Training Results

The model achieved the following performance metrics:

- **Final Training Loss**: 0.617
- **Final Evaluation Loss**: 0.573035
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 150,000
- **Total Training Time**: ~81 hours 39 minutes
- **Train/Validation Split**: 95%/5%
- **Effective Training Data**: 21.12GB (with quality-based repetition)