--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- el |
|
|
pipeline_tag: fill-mask |
|
|
library_name: transformers |
|
|
tags: |
|
|
- modernbert |
|
|
- fill-mask |
|
|
- greek |
|
|
- legal |
|
|
- masked-lm |
|
|
- data-repetition |
|
|
- flash-attention |
|
|
- stable-adamw |
|
|
base_model: |
|
|
- answerdotai/ModernBERT-base |
|
|
--- |
|
|
|
|
|
# GEM-ModernBERT HQ Legal: A Greek Legal Language Model with Advanced Optimization |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**GEM-ModernBERT HQ Legal** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. It combines ModernBERT's architectural innovations, notably **Flash Attention 2**, with the **StableAdamW optimizer**, a **1024-token context length**, and **memory optimization** techniques to improve performance on Greek legal document understanding tasks.
|
|
|
|
|
Building on our quality-based data repetition strategy, the model adopts ModernBERT's training methodology: a **30% masking probability**, **trapezoidal learning rate scheduling**, and **tuned batch sizing** for faster, more stable convergence. Its extended 1024-token context window lets it handle longer legal documents while remaining computationally efficient.
|
|
|
|
|
This model represents the culmination of our Greek legal language modeling research, combining domain expertise with the latest architectural advances in transformer-based language models. It has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the complex legal domain. |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
You can use this model directly with the `fill-mask` pipeline: |
|
|
|
|
|
```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-modernbert-hq-legal",
    tokenizer="novelcore/gem-modernbert-hq-legal",
)

# Example from a legal context ("According to Article 15 of the Constitution,
# the <mask> of human rights is a fundamental obligation of the state within
# the framework of the democratic polity.")
text = "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)
```
|
|
|
|
|
For downstream tasks: |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context.
# Note: the classification head is newly initialized and must be
# fine-tuned before the model produces meaningful labels.
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-modernbert-hq-legal")

# The model supports up to 1024 tokens, enough for longer legal documents
inputs = tokenizer("Παράδειγμα νομικού κειμένου...", truncation=True, max_length=1024, return_tensors="pt")
outputs = model(**inputs)
```
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was pre-trained on the same comprehensive corpus of Greek text used in our previous models, employing our proven **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to 21.12GB through strategic repetition, now processed with **1024-token sequences** for enhanced context understanding. |
|
|
|
|
|
### Quality-Based Data Repetition Strategy |
|
|
|
|
|
| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |
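
The repetition step itself is simple concatenation. The snippet below is a minimal sketch of how such a quality-weighted corpus could be assembled; the file names and the `REPETITION_FACTORS` mapping are illustrative, not the actual preprocessing code.

```python
# Illustrative sketch of quality-based data repetition (not the actual pipeline).
# Each source is written into the training corpus as many times as its quality
# tier warrants; the file paths below are hypothetical placeholders.
REPETITION_FACTORS = {
    "raptarchis_legal_dictionary.txt": 4,  # Best
    "supreme_court_reports.txt": 3,        # Medium-Best
    "eurlex_el.txt": 2,                    # Medium
    "fek_gazette.txt": 1,                  # Low
    "parliament_proceedings.txt": 1,       # Low-Medium
    "europarl_el.txt": 1,                  # Low
}

with open("training_corpus.txt", "w", encoding="utf-8") as out:
    for path, factor in REPETITION_FACTORS.items():
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for _ in range(factor):
            out.write(text)
```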
|
|
|
|
|
### Enhanced Context Processing |
|
|
|
|
|
With **1024-token sequences**, this model can process: |
|
|
- **Complete legal articles** without truncation |
|
|
- **Full court decisions** with extended reasoning |
|
|
- **Complex legislative texts** with multiple references |
|
|
- **Parliamentary debates** with comprehensive context |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
The model uses the ModernBERT-base architecture with the following configuration: |
|
|
|
|
|
- **Hidden Size**: 768 |
|
|
- **Attention Heads**: 12 |
|
|
- **Hidden Layers**: 12 |
|
|
- **Parameters**: ~139M |
|
|
- **Max Position Embeddings**: 1024 |
|
|
- **Vocabulary Size**: 50,373 |
|
|
- **Flash Attention 2**: Enabled |
|
|
- **Context Length**: 1024 tokens (2x the 512-token window of our previous models)
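
For reference, the configuration above maps onto `transformers`' `ModernBertConfig` roughly as follows. This is a sketch: only the fields listed above are set, remaining fields keep their ModernBERT-base defaults, and the released checkpoint ships its own config.

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# Sketch: the architecture described above, expressed as a ModernBertConfig.
# For illustration only; load the published checkpoint for the exact config.
config = ModernBertConfig(
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=1024,
    vocab_size=50373,
)
model = ModernBertForMaskedLM(config)
```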
|
|
|
|
|
### Key Architectural Advantages |
|
|
|
|
|
ModernBERT's innovations provide significant benefits for legal text processing: |
|
|
|
|
|
1. **Flash Attention 2**: Memory-efficient attention computation for longer sequences |
|
|
2. **Extended Context**: 1024-token sequences capture complete legal documents |
|
|
3. **StableAdamW Optimizer**: Enhanced training stability and convergence |
|
|
4. **Optimized MLM**: 30% masking probability for improved representation learning |
|
|
5. **Advanced Memory Management**: Optimized CUDA memory allocation for large batches |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
The text was processed into **1024-token chunks** using ModernBERT's tokenizer (vocabulary: 50,373 tokens), providing excellent coverage of Greek legal terminology while maintaining compatibility with the base architecture. |
|
|
|
|
|
Higher-quality sources were strategically repeated during the data preparation phase, with sequences now capturing much more context per training example. |
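
As an illustration, fixed-length chunking of this kind can be done with the tokenizer alone. The snippet below is a simplified sketch (the actual preprocessing pipeline is not published), assuming the released tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")

def chunk_text(text: str, chunk_size: int = 1024) -> list[list[int]]:
    """Split a document into fixed-size token chunks (simplified sketch)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i : i + chunk_size] for i in range(0, len(ids), chunk_size)]
```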
|
|
|
|
|
### Pre-training |
|
|
|
|
|
The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA H100 80GB GPUs using BFloat16 (`bf16`) mixed precision. Training took approximately **97 hours and 9 minutes** to complete.
|
|
|
|
|
#### Key Training Optimizations |
|
|
|
|
|
**Batch Size Optimization:** |
|
|
- **Per-device batch size**: 16 (optimized for H100 memory) |
|
|
- **Gradient accumulation steps**: 8 |
|
|
- **Effective batch size**: 1,024 (16 × 8 × 8 GPUs) |
|
|
- **Context length**: 1024 tokens per sequence |
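
In `transformers` terms, this batch configuration corresponds roughly to the following `TrainingArguments` (a sketch of the relevant fields only, not the full training script):

```python
from transformers import TrainingArguments

# Sketch of the batch-size-related settings; other arguments omitted.
# Effective batch size = 16 per device x 8 accumulation steps x 8 GPUs = 1,024.
training_args = TrainingArguments(
    output_dir="gem-modernbert-hq-legal",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    max_steps=150_000,
    bf16=True,
)
```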
|
|
|
|
|
**StableAdamW Configuration:** |
|
|
- **Learning Rate**: 0.0002 (conservative for stable convergence) |
|
|
- **Weight Decay**: 0.1 |
|
|
- **Adam Beta1**: 0.9 |
|
|
- **Adam Beta2**: 0.95 |
|
|
- **Adam Epsilon**: 1e-08 |
|
|
- **Gradient Clipping**: 1.0 |
|
|
- **Epsilon Mode**: element_wise |
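
A minimal sketch of the optimizer setup, assuming the `torch-optimi` package (one common StableAdamW implementation; the exact library used in training is not specified here) and a `model` such as the one built in the architecture sketch above:

```python
from optimi import StableAdamW  # pip install torch-optimi

# Sketch: StableAdamW with the hyperparameters listed above.
# Gradient clipping (1.0) and the element-wise epsilon mode are handled by
# the training loop / optimizer internals and are not shown here.
optimizer = StableAdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
```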
|
|
|
|
|
**Advanced Learning Rate Schedule:** |
|
|
- **Schedule Type**: Trapezoidal warmup followed by polynomial decay
|
|
- **Warmup Steps**: 9,000 |
|
|
- **Decay Power**: 0.5 (square-root decay) |
|
|
- **Max Steps**: 150,000 |
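
If one wanted to reproduce this schedule, `transformers`' built-in polynomial scheduler is a close match (a sketch; the exact schedule implementation used in training is not published). It continues the optimizer from the previous sketch:

```python
from transformers import get_polynomial_decay_schedule_with_warmup

# Sketch: 9,000 linear warmup steps, then power-0.5 (square-root) decay
# toward step 150,000.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=9_000,
    num_training_steps=150_000,
    power=0.5,
)
```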
|
|
|
|
|
**ModernBERT Specifications:** |
|
|
- **MLM Probability**: 0.30 (higher than traditional 15%) |
|
|
- **Max Sequence Length**: 1024 |
|
|
- **Flash Attention 2**: Enabled with optimizations |
|
|
- **Memory Optimization**: Advanced CUDA allocation strategies |
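
The 30% masking rate plugs directly into the standard masked-LM data collator; a minimal sketch:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")

# Mask 30% of tokens per sequence, versus the traditional 15%.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```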
|
|
|
|
|
### Training Results |
|
|
|
|
|
Training converged to the following final metrics:
|
|
|
|
|
- **Final Training Loss**: 0.7648 |
|
|
- **Final Evaluation Loss**: 0.7751 |
|
|
- **Training Infrastructure**: 8x NVIDIA H100 80GB GPUs |
|
|
- **Total Training Steps**: 150,000 |
|
|
- **Total Training Time**: 97 hours 9 minutes |
|
|
- **Train/Validation Split**: 90%/10% |
|
|
- **Effective Training Data**: 21.12GB (with quality-based repetition) |
|
|
- **Context Length**: 1024 tokens per sequence |
|
|
|
|
|
### Advanced Training Infrastructure |
|
|
|
|
|
The model was trained with cutting-edge optimizations: |
|
|
|
|
|
**Flash Attention 2 Optimizations:** |
|
|
```yaml |
|
|
FLASH_ATTENTION_FORCE_FP16: "0" # Use bfloat16 |
|
|
FLASH_ATTENTION_SKIP_RESHAPE: "1" # Skip unnecessary reshapes |
|
|
FLASH_ATTENTION_CAUSAL: "0" # Non-causal for BERT |
|
|
FORCE_FLASH_ATTENTION: "1" # Force Flash Attention usage |
|
|
``` |
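
None of this environment configuration is required at inference time. In current `transformers`, Flash Attention 2 can be requested at load time instead (a sketch; requires the `flash-attn` package and a supported NVIDIA GPU):

```python
import torch
from transformers import AutoModelForMaskedLM

# Request Flash Attention 2 when loading the checkpoint.
model = AutoModelForMaskedLM.from_pretrained(
    "novelcore/gem-modernbert-hq-legal",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```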
|
|
|
|
|
**Memory Optimization:** |
|
|
```yaml |
|
|
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256,roundup_power2_divisions:16,expandable_segments:True,garbage_collection_threshold:0.8" |
|
|
``` |
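
`PYTORCH_CUDA_ALLOC_CONF` is read by PyTorch's CUDA caching allocator, so to reproduce this allocator behavior the variable must be set before CUDA is initialized, for example:

```python
import os

# Must be set before torch initializes CUDA (ideally before `import torch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:256,"
    "roundup_power2_divisions:16,"
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8"
)

import torch  # noqa: E402
```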
|
|
|
|
|
**Distributed Training:** |
|
|
- **Backend**: NCCL with extended timeout configurations |
|
|
- **Mixed Precision**: BFloat16 for optimal H100 performance |
|
|
- **Evaluation Frequency**: Every 5,000 steps |
|
|
- **Checkpointing**: Every 5,000 steps |
|
|
- **Logging**: Every 250 steps |
|
|
|
|
|
## Key Innovations |
|
|
|
|
|
### ModernBERT Architecture Benefits |
|
|
|
|
|
1. **Extended Context Window**: 1024 tokens vs 512 in previous models |
|
|
2. **Flash Attention 2**: Memory-efficient attention for longer sequences |
|
|
3. **StableAdamW Optimizer**: Enhanced training stability and convergence |
|
|
4. **Higher MLM Probability**: 30% masking for improved representation learning |
|
|
5. **Trapezoidal LR Schedule**: Linear warmup followed by square-root decay
|
|
|
|
|
### Quality-Based Data Repetition |
|
|
|
|
|
Consistent with our previous models: |
|
|
|
|
|
1. **Highest quality sources** (legal dictionaries) repeated 4x |
|
|
2. **Medium-high quality sources** (court reports) repeated 3x |
|
|
3. **Medium quality sources** (EU legal texts) repeated 2x |
|
|
4. **Lower quality sources** used once for diversity |