---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- convbert
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
- convolution
base_model:
- convbert-base
---

# GEM-ConvBERT HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition

## Model Description

**GEM-ConvBERT HQ Legal** is a ConvBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model employs a **quality-based data repetition strategy**: higher-quality legal sources are repeated several times during training to deepen the model's command of premium legal terminology and concepts.

ConvBERT combines the strengths of BERT with span-based dynamic convolution, replacing some self-attention heads with more efficient convolutional layers. This hybrid architecture improves both efficiency and performance, and is particularly well suited to capturing local patterns in legal text while maintaining global context awareness.

This model was trained as part of a research project and is intended for fine-tuning on downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal domain.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the model and its matching tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-convbert-hq-legal",
    tokenizer="novelcore/gem-convbert-hq-legal"
)

# Example from a legal/parliamentary context; in English:
# "Mr. Mitsotakis <mask> that the government fully respects
# the decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get the top predictions for the masked token
predictions = fill_mask(text)
print(predictions)
```
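
If you prefer not to hard-code the mask token string, you can read it from the loaded tokenizer; a minimal sketch:

```python
# Build the prompt around the tokenizer's own mask token, so the
# string always matches what the model was actually trained with.
mask = fill_mask.tokenizer.mask_token
text = f"Ο κ. Μητσοτάκης {mask} ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
for pred in fill_mask(text, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```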

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-convbert-hq-legal")
```
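
Since NER is one of the target tasks named above, a token-classification head can be attached the same way. A minimal sketch, with a purely illustrative label set (the base checkpoint ships no fine-tuned labels):

```python
from transformers import AutoModelForTokenClassification

# Hypothetical label scheme for illustration only; a real fine-tuning
# run would take its labels from an annotated Greek legal NER dataset.
labels = ["O", "B-LEG_REF", "I-LEG_REF", "B-ORG", "I-ORG"]

ner_model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/gem-convbert-hq-legal",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```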

## Training Data

The model was pre-trained on a comprehensive corpus of Greek text compiled from various legal and governmental sources, with a **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to an effective 21.12GB through strategic repetition.

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |

### Rationale for Data Repetition

The quality-based repetition strategy enhances the model's exposure to:

- **Premium legal terminology** from the Raptarchis Legal Dictionary (4x repetition)
- **High-quality judicial reasoning** from Supreme Court reports (3x repetition)
- **EU legal concepts** from Eur-Lex content (2x repetition)

This approach ensures the model develops a stronger understanding of sophisticated legal language while maintaining exposure to the broader legal corpus.
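
The card does not publish the corpus-assembly script, but the repetition factors above are simple to apply when building the training file. A minimal sketch, assuming one plain-text file per source (all file names here are hypothetical):

```python
from pathlib import Path

# Repetition factors from the table above; file names are illustrative.
REPETITION = {
    "raptarchis_dictionary.txt": 4,
    "supreme_court_reports.txt": 3,
    "eurlex_el.txt": 2,
    "fek_gazette.txt": 1,
    "parliament_proceedings.txt": 1,
    "europarl_el.txt": 1,
}

with open("corpus_expanded.txt", "w", encoding="utf-8") as out:
    for name, factor in REPETITION.items():
        text = Path(name).read_text(encoding="utf-8")
        for _ in range(factor):
            # Higher-quality sources are written out more than once.
            out.write(text)
            out.write("\n")
```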

## Training Procedure

### Model Architecture

The model uses the ConvBERT-base architecture with the following configuration (see the sketch after the list):

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Intermediate Size**: 3072
- **Conv Kernel Size**: 9
- **Num Conv Groups**: 1
- **Parameters**: ~106M
- **Vocabulary Size**: 50,264
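
These values map directly onto `transformers.ConvBertConfig`; a minimal sketch, with all remaining fields left at library defaults (an assumption about the released checkpoint):

```python
from transformers import ConvBertConfig, ConvBertForMaskedLM

# Mirror the architecture listed above; unspecified fields
# (e.g. head_ratio, embedding_size) keep ConvBertConfig defaults.
config = ConvBertConfig(
    vocab_size=50264,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    conv_kernel_size=9,
    num_groups=1,
)
model = ConvBertForMaskedLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```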

### ConvBERT Architecture Advantages

ConvBERT's hybrid architecture provides several benefits for legal text processing:

- **Efficient Local Pattern Recognition**: Convolutional layers excel at capturing the local linguistic patterns common in legal terminology
- **Global Context Awareness**: Self-attention mechanisms maintain understanding of document-wide context
- **Computational Efficiency**: The mixed attention-convolution approach reduces computational complexity
- **Better Span Representation**: Dynamic convolution enhances understanding of legal entity spans and clause structures

### Preprocessing

The text was tokenized using a custom `WordPiece` tokenizer trained from scratch on the Greek legal corpus, with a vocabulary of 50,264 tokens optimized for Greek legal terminology.
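
The exact tokenizer-training setup is not published; below is a minimal sketch of training such a WordPiece tokenizer with the `tokenizers` library, assuming the corpus file from the repetition sketch above (the special-token names follow BERT conventions and are an assumption):

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Train a 50,264-token WordPiece vocabulary on the Greek legal corpus.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = WordPieceTrainer(
    vocab_size=50264,
    # Special-token names here are an assumption, not confirmed by the card.
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus_expanded.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```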

The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence. Higher-quality sources were strategically repeated during the data preparation phase, as described above.
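
A minimal sketch of this chunking step with the `datasets` library, assuming each line of the corpus file is one document (the real pipeline's document segmentation is not published):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
dataset = load_dataset("text", data_files={"train": "corpus_expanded.txt"})

def chunk_documents(batch):
    encoded = tokenizer(batch["text"], add_special_tokens=False)
    chunks = []
    for ids in encoded["input_ids"]:
        # Split each document into 512-token blocks; no block ever
        # crosses a document boundary.
        chunks.extend(ids[i:i + 512] for i in range(0, len(ids), 512))
    return {"input_ids": chunks}

lm_dataset = dataset["train"].map(
    chunk_documents, batched=True, remove_columns=["text"]
)
```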

### Pre-training

The model was pre-trained from scratch for **200,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. Training took approximately **45 hours and 32 minutes** to complete.

The key hyperparameters were (see the sketch after the list):

- **Learning Rate**: 2e-4 (0.0002) with a linear warmup of 12,000 steps
- **Batch Size**: Effective batch size of 768 (`per_device_train_batch_size: 32`, `gradient_accumulation_steps: 3`, 8 GPUs)
- **Optimizer**: AdamW with standard parameters
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 200,000
- **Warmup Steps**: 12,000
- **MLM Probability**: 0.15
- **Max Gradient Norm**: 1.0
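
These settings map naturally onto `transformers.TrainingArguments` plus the standard MLM data collator; a minimal sketch of the wiring, reusing the `model`, `tokenizer`, and `lm_dataset` objects from the sketches above (the project's actual training script is not published):

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking of 15% of input tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 90%/10% train/validation split, as reported under Training Results.
split = lm_dataset.train_test_split(test_size=0.1)

args = TrainingArguments(
    output_dir="gem-convbert-hq-legal",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    max_steps=200_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=3,  # 32 x 3 x 8 GPUs = 768 effective
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    eval_strategy="steps",  # "evaluation_strategy" on older transformers releases
    eval_steps=4_000,
    save_steps=4_000,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
)
trainer.train()
```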

### Training Results

The model achieved the following performance metrics:

- **Final Training Loss**: 0.6413
- **Final Evaluation Loss**: 0.604455
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 200,000
- **Total Training Time**: 45 hours 32 minutes
- **Train/Validation Split**: 90%/10%
- **Effective Training Data**: 21.12GB (with quality-based repetition)

### Training Infrastructure

The model was trained using distributed training with the following configuration:

- **Backend**: NCCL for efficient multi-GPU communication
- **Mixed Precision**: BFloat16 for improved training stability
- **Evaluation Frequency**: Every 4,000 steps
- **Checkpointing**: Every 4,000 steps
- **Logging**: Every 200 steps

## Key Innovations

### Quality-Based Data Repetition

This model introduces a novel **quality-based data repetition strategy** in which:

1. **Highest-quality sources** (legal dictionaries) are repeated 4x for maximum terminology exposure
2. **Medium-high-quality sources** (court reports) are repeated 3x for judicial reasoning patterns
3. **Medium-quality sources** (EU legal texts) are repeated 2x for regulatory language
4. **Lower-quality sources** are used once to maintain diversity

This approach yields roughly **26% more effective training data** (21.12GB vs. 16.75GB) while keeping the pipeline computationally efficient.

### ConvBERT Architecture for Legal Text

The ConvBERT architecture is particularly well suited to legal text processing:

- **Local Legal Pattern Recognition**: Convolutional layers efficiently capture recurring legal phrases and terminology patterns
- **Clause Structure Understanding**: Dynamic convolution helps the model capture legal document structures and clause relationships
- **Computational Efficiency**: The hybrid attention-convolution approach provides faster training and inference than pure-attention models
- **Enhanced Entity Recognition**: Better span-based representations improve legal entity and concept identification

### Training Efficiency

Training completed in **45 hours 32 minutes** while processing the expanded 21.12GB dataset, a pace that reflects ConvBERT's reduced computational complexity relative to pure-attention models of comparable size.