---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- modernbert
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
- flash-attention
- stable-adamw
base_model:
- answerdotai/ModernBERT-base
---

# GEM-ModernBERT HQ Legal: A Greek Legal Language Model with Advanced Optimization

## Model Description

**GEM-ModernBERT HQ Legal** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model leverages ModernBERT's architectural innovations, including **Flash Attention 2**, the **StableAdamW optimizer**, a **1024-token context length**, and **advanced memory optimization** techniques, to deliver strong performance on Greek legal document understanding tasks.

Building upon our proven **quality-based data repetition strategy**, this model adopts ModernBERT's training methodology, with a **30% masking probability**, a **trapezoidal learning rate schedule**, and **optimized batch sizing** for improved convergence. It is specifically designed to handle longer legal documents through its extended 1024-token context window while maintaining computational efficiency via advanced optimization techniques.

This model represents the culmination of our Greek legal language modeling research, combining domain expertise with the latest architectural advances in transformer-based language models. It has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the complex legal domain.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-modernbert-hq-legal",
    tokenizer="novelcore/gem-modernbert-hq-legal"
)

# Example from a legal context with longer sequence support
text = "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-modernbert-hq-legal")

# The model supports up to 1024 tokens for longer legal documents
```

## Training Data

The model was pre-trained on the same comprehensive corpus of Greek text used in our previous models, employing our proven **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to 21.12GB through strategic repetition, now processed with **1024-token sequences** for enhanced context understanding.

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75 GB** | **-** | **-** | **21.12 GB** |
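
The effective sizes above follow directly from multiplying each original size by its repetition factor. The short sketch below (the dictionary layout and variable names are illustrative) reproduces the totals from the table:

```python
# Recompute the "Effective Size" column of the repetition table:
# effective = original size x repetition factor, summed per dataset.
corpus = {
    "Raptarchis Legal Dictionary": (0.35, 4),
    "Political Reports of the Supreme Court": (1.20, 3),
    "Eur-Lex (Greek Content)": (0.92, 2),
    "FEK - Greek Government Gazette": (11.00, 1),
    "Greek Parliament Proceedings": (2.90, 1),
    "Europarl (Greek Content)": (0.38, 1),
}

original_total = sum(size for size, _ in corpus.values())
effective_total = sum(size * rep for size, rep in corpus.values())

print(f"original:  {original_total:.2f} GB")   # 16.75 GB
print(f"effective: {effective_total:.2f} GB")  # 21.12 GB
```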

### Enhanced Context Processing

With **1024-token sequences**, this model can process:
- **Complete legal articles** without truncation
- **Full court decisions** with extended reasoning
- **Complex legislative texts** with multiple references
- **Parliamentary debates** with comprehensive context

## Training Procedure

### Model Architecture

The model uses the ModernBERT-base architecture with the following configuration:

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Parameters**: ~139M
- **Max Position Embeddings**: 1024
- **Vocabulary Size**: 50,373
- **Flash Attention 2**: Enabled
- **Context Length**: 1024 tokens (double the 512-token context of our previous models)

### Key Architectural Advantages

ModernBERT's innovations provide significant benefits for legal text processing:

1. **Flash Attention 2**: Memory-efficient attention computation for longer sequences
2. **Extended Context**: 1024-token sequences capture complete legal documents
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Optimized MLM**: 30% masking probability for improved representation learning
5. **Advanced Memory Management**: Optimized CUDA memory allocation for large batches

### Preprocessing

The text was processed into **1024-token chunks** using ModernBERT's tokenizer (vocabulary: 50,373 tokens), providing excellent coverage of Greek legal terminology while maintaining compatibility with the base architecture.

Higher-quality sources were strategically repeated during the data preparation phase, with sequences now capturing much more context per training example.
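
As an illustration of the chunking step (an assumption about the exact pipeline: tokenized text is concatenated and split into fixed 1024-token examples, with the final partial chunk dropped), the helper below shows the idea on stand-in token ids:

```python
# Hypothetical sketch of fixed-length chunking for MLM pre-training.
# Real preprocessing uses the ModernBERT tokenizer's ids; here we use
# a list of integers as a stand-in.

def chunk_token_ids(token_ids, chunk_len=1024):
    """Split a stream of token ids into full chunk_len-sized examples."""
    n_full = len(token_ids) // chunk_len  # drop the trailing partial chunk
    return [token_ids[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]

ids = list(range(2500))           # stand-in for a tokenized document stream
chunks = chunk_token_ids(ids)
print(len(chunks), len(chunks[0]))  # 2 1024
```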

### Pre-training

The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA H100 80GB GPUs, using BFloat16 (`bf16`) mixed-precision with advanced optimization techniques. The training took approximately **97 hours and 9 minutes** to complete.

#### Key Training Optimizations

**Batch Size Optimization:**
- **Per-device batch size**: 16 (optimized for H100 memory)
- **Gradient accumulation steps**: 8
- **Effective batch size**: 1,024 (16 × 8 × 8 GPUs)
- **Context length**: 1024 tokens per sequence
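
The effective batch size works out as per-device batch × gradient accumulation × GPU count, which at 1024 tokens per sequence means roughly one million tokens per optimizer step:

```python
# Effective batch size and tokens per optimizer step, from the
# configuration listed above.
per_device_batch = 16
grad_accum_steps = 8
num_gpus = 8
seq_len = 1024

effective_batch = per_device_batch * grad_accum_steps * num_gpus
tokens_per_step = effective_batch * seq_len

print(effective_batch)   # 1024 sequences per step
print(tokens_per_step)   # 1048576 tokens (~1M) per step
```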

**StableAdamW Configuration:**
- **Learning Rate**: 0.0002 (conservative for stable convergence)
- **Weight Decay**: 0.1
- **Adam Beta1**: 0.9
- **Adam Beta2**: 0.95
- **Adam Epsilon**: 1e-08
- **Gradient Clipping**: 1.0
- **Epsilon Mode**: element_wise
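
For intuition only, here is a heavily simplified, single-scalar sketch of a StableAdamW-style update with the hyperparameters above. The actual optimizer operates per tensor/element (per Wortsman et al.'s formulation) and is not the training implementation; the function and its scalar form are illustrative:

```python
import math

def stable_adamw_step(p, grad, m, v, t,
                      lr=2e-4, beta1=0.9, beta2=0.95,
                      eps=1e-8, weight_decay=0.1):
    """One illustrative StableAdamW-style update on a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # StableAdamW's key idea: shrink the learning rate when the
    # update RMS exceeds 1, stabilizing training.
    rms = abs(m_hat) / math.sqrt(max(v_hat, eps * eps))
    lr_t = lr / max(1.0, rms)
    # Decoupled weight decay, as in AdamW.
    p = p - lr_t * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

p, m, v = stable_adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)
```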

**Advanced Learning Rate Schedule:**
- **Schedule Type**: Polynomial decay with trapezoidal warmup
- **Warmup Steps**: 9,000
- **Decay Power**: 0.5 (square-root decay)
- **Max Steps**: 150,000
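
A minimal sketch of this schedule (an approximation: linear warmup to the peak over 9,000 steps, then square-root polynomial decay to zero at 150,000 steps; the exact trainer schedule, including any constant plateau of the trapezoid, may differ):

```python
def lr_at(step, peak_lr=2e-4, warmup=9_000, max_steps=150_000, power=0.5):
    """Illustrative learning rate: linear warmup, then sqrt decay."""
    if step < warmup:
        return peak_lr * step / warmup
    remaining = (max_steps - step) / (max_steps - warmup)
    return peak_lr * max(0.0, remaining) ** power

print(lr_at(4_500))    # halfway through warmup
print(lr_at(9_000))    # peak learning rate
print(lr_at(150_000))  # decayed to zero
```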

**ModernBERT Specifications:**
- **MLM Probability**: 0.30 (higher than traditional 15%)
- **Max Sequence Length**: 1024
- **Flash Attention 2**: Enabled with optimizations
- **Memory Optimization**: Advanced CUDA allocation strategies
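
To illustrate the 30% masking rate, here is a stripped-down MLM masking sketch (an assumption: standard BERT-style masking with the probability raised to 0.30; special-token handling and the 80/10/10 replacement split are omitted, and `MASK_ID` is a hypothetical token id):

```python
import random

MASK_ID = 4  # hypothetical mask-token id for illustration

def mask_tokens(token_ids, mlm_probability=0.30, seed=0):
    """Mask ~30% of positions; labels hold the original ids there."""
    rng = random.Random(seed)
    labels = [-100] * len(token_ids)  # -100 = position ignored by the loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok           # model must predict the original token
            masked[i] = MASK_ID
    return masked, labels

ids = list(range(100, 120))
masked, labels = mask_tokens(ids)
n_masked = sum(l != -100 for l in labels)
print(n_masked, "of", len(ids), "tokens masked")
```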

### Training Results

The final training run produced the following metrics:

- **Final Training Loss**: 0.7648
- **Final Evaluation Loss**: 0.7751
- **Training Infrastructure**: 8x NVIDIA H100 80GB GPUs
- **Total Training Steps**: 150,000
- **Total Training Time**: 97 hours 9 minutes
- **Train/Validation Split**: 90%/10%
- **Effective Training Data**: 21.12GB (with quality-based repetition)
- **Context Length**: 1024 tokens per sequence
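
A back-of-the-envelope calculation from these numbers gives the total token throughput of the run (sequences per step × tokens per sequence × steps):

```python
# Tokens processed over the full run, implied by the figures above.
steps = 150_000
effective_batch = 1_024   # sequences per optimizer step
seq_len = 1_024           # tokens per sequence

total_tokens = steps * effective_batch * seq_len
print(f"~{total_tokens / 1e9:.0f}B tokens processed")
```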

### Advanced Training Infrastructure

The model was trained with cutting-edge optimizations:

**Flash Attention 2 Optimizations:**
```yaml
FLASH_ATTENTION_FORCE_FP16: "0"        # Use bfloat16
FLASH_ATTENTION_SKIP_RESHAPE: "1"      # Skip unnecessary reshapes
FLASH_ATTENTION_CAUSAL: "0"            # Non-causal for BERT
FORCE_FLASH_ATTENTION: "1"             # Force Flash Attention usage
```

**Memory Optimization:**
```yaml
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256,roundup_power2_divisions:16,expandable_segments:True,garbage_collection_threshold:0.8"
```

**Distributed Training:**
- **Backend**: NCCL with extended timeout configurations
- **Mixed Precision**: BFloat16 for optimal H100 performance
- **Evaluation Frequency**: Every 5,000 steps
- **Checkpointing**: Every 5,000 steps
- **Logging**: Every 250 steps

## Key Innovations

### ModernBERT Architecture Benefits

1. **Extended Context Window**: 1024 tokens vs 512 in previous models
2. **Flash Attention 2**: Memory-efficient attention for longer sequences
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Higher MLM Probability**: 30% masking for improved representation learning
5. **Trapezoidal LR Schedule**: Optimized learning rate progression

### Quality-Based Data Repetition

Consistent with our previous models:

1. **Highest quality sources** (legal dictionaries) repeated 4x
2. **Medium-high quality sources** (court reports) repeated 3x  
3. **Medium quality sources** (EU legal texts) repeated 2x
4. **Lower quality sources** used once for diversity