--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- el |
|
|
pipeline_tag: fill-mask |
|
|
library_name: transformers |
|
|
tags: |
|
|
- modernbert |
|
|
- fill-mask |
|
|
- greek |
|
|
- legal |
|
|
- masked-lm |
|
|
- data-repetition |
|
|
- flash-attention |
|
|
- stable-adamw |
|
|
base_model: |
|
|
- answerdotai/ModernBERT-base |
|
|
--- |
|
|
|
|
|
# GEM-ModernBERT HQ Legal: A Greek Legal Language Model with Advanced Optimization |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**GEM-ModernBERT HQ Legal** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. It combines ModernBERT's architectural innovations, notably **Flash Attention 2**, with the **StableAdamW optimizer**, a **1024-token context length**, and **memory optimization** techniques to improve performance on Greek legal document understanding tasks.
|
|
|
|
|
Building on our quality-based data repetition strategy, the model adopts ModernBERT's training methodology: a **30% masking probability**, **trapezoidal learning rate scheduling**, and **tuned batch sizing** for faster, more stable convergence. Its extended 1024-token context window lets it handle longer legal documents while remaining computationally efficient.
|
|
|
|
|
This model represents the culmination of our Greek legal language modeling research, combining domain expertise with the latest architectural advances in transformer-based language models. It has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the complex legal domain. |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
You can use this model directly with the `fill-mask` pipeline: |
|
|
|
|
|
```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-modernbert-hq-legal",
    tokenizer="novelcore/gem-modernbert-hq-legal",
)

# Example from a legal context ("According to Article 15 of the Constitution,
# the <mask> of human rights is a fundamental obligation of the state within
# the framework of the democratic polity.")
text = "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)
```
|
|
|
|
|
For downstream tasks: |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context.
# Note: the classification head is newly initialized and must be
# fine-tuned before the model produces meaningful labels.
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-modernbert-hq-legal")

# The model supports up to 1024 tokens, enough for longer legal documents
inputs = tokenizer("Παράδειγμα νομικού κειμένου...", truncation=True, max_length=1024, return_tensors="pt")
outputs = model(**inputs)
```
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was pre-trained on the same comprehensive corpus of Greek text used in our previous models, employing our proven **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to 21.12GB through strategic repetition, now processed with **1024-token sequences** for enhanced context understanding. |
|
|
|
|
|
### Quality-Based Data Repetition Strategy |
|
|
|
|
|
| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |
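
The repetition step itself is simple concatenation. The snippet below is a minimal sketch of how such a quality-weighted corpus could be assembled; the file names and the `REPETITION_FACTORS` mapping are illustrative, not the actual preprocessing code.

```python
# Illustrative sketch of quality-based data repetition (not the actual pipeline).
# Each source is written into the training corpus as many times as its quality
# tier warrants; the file paths below are hypothetical placeholders.
REPETITION_FACTORS = {
    "raptarchis_legal_dictionary.txt": 4,  # Best
    "supreme_court_reports.txt": 3,        # Medium-Best
    "eurlex_el.txt": 2,                    # Medium
    "fek_gazette.txt": 1,                  # Low
    "parliament_proceedings.txt": 1,       # Low-Medium
    "europarl_el.txt": 1,                  # Low
}

with open("training_corpus.txt", "w", encoding="utf-8") as out:
    for path, factor in REPETITION_FACTORS.items():
        with open(path, encoding="utf-8") as f:
            text = f.read()
        for _ in range(factor):
            out.write(text)
```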
|
|
|
|
|
### Enhanced Context Processing |
|
|
|
|
|
With **1024-token sequences**, this model can process: |
|
|
- **Complete legal articles** without truncation |
|
|
- **Full court decisions** with extended reasoning |
|
|
- **Complex legislative texts** with multiple references |
|
|
- **Parliamentary debates** with comprehensive context |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
The model uses the ModernBERT-base architecture with the following configuration: |
|
|
|
|
|
- **Hidden Size**: 768 |
|
|
- **Attention Heads**: 12 |
|
|
- **Hidden Layers**: 12 |
|
|
- **Parameters**: ~139M |
|
|
- **Max Position Embeddings**: 1024 |
|
|
- **Vocabulary Size**: 50,373 |
|
|
- **Flash Attention 2**: Enabled |
|
|
- **Context Length**: 1024 tokens (2x the 512-token window of our previous models)
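
For reference, the configuration above maps onto `transformers`' `ModernBertConfig` roughly as follows. This is a sketch: only the fields listed above are set, remaining fields keep their ModernBERT-base defaults, and the released checkpoint ships its own config.

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# Sketch: the architecture described above, expressed as a ModernBertConfig.
# For illustration only; load the published checkpoint for the exact config.
config = ModernBertConfig(
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    max_position_embeddings=1024,
    vocab_size=50373,
)
model = ModernBertForMaskedLM(config)
```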
|
|
|
|
|
### Key Architectural Advantages |
|
|
|
|
|
ModernBERT's innovations provide significant benefits for legal text processing: |
|
|
|
|
|
1. **Flash Attention 2**: Memory-efficient attention computation for longer sequences |
|
|
2. **Extended Context**: 1024-token sequences capture complete legal documents |
|
|
3. **StableAdamW Optimizer**: Enhanced training stability and convergence |
|
|
4. **Optimized MLM**: 30% masking probability for improved representation learning |
|
|
5. **Advanced Memory Management**: Optimized CUDA memory allocation for large batches |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
The text was processed into **1024-token chunks** using ModernBERT's tokenizer (vocabulary: 50,373 tokens), providing excellent coverage of Greek legal terminology while maintaining compatibility with the base architecture. |
|
|
|
|
|
Higher-quality sources were strategically repeated during the data preparation phase, with sequences now capturing much more context per training example. |
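
As an illustration, fixed-length chunking of this kind can be done with the tokenizer alone. The snippet below is a simplified sketch (the actual preprocessing pipeline is not published), assuming the released tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")

def chunk_text(text: str, chunk_size: int = 1024) -> list[list[int]]:
    """Split a document into fixed-size token chunks (simplified sketch)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [ids[i : i + chunk_size] for i in range(0, len(ids), chunk_size)]
```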
|
|
|
|
|
### Pre-training |
|
|
|
|
|
The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA H100 80GB GPUs using BFloat16 (`bf16`) mixed precision. Training took approximately **97 hours and 9 minutes** to complete.
|
|
|
|
|
#### Key Training Optimizations |
|
|
|
|
|
**Batch Size Optimization:** |
|
|
- **Per-device batch size**: 16 (optimized for H100 memory) |
|
|
- **Gradient accumulation steps**: 8 |
|
|
- **Effective batch size**: 1,024 (16 × 8 × 8 GPUs) |
|
|
- **Context length**: 1024 tokens per sequence |
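
In `transformers` terms, this batch configuration corresponds roughly to the following `TrainingArguments` (a sketch of the relevant fields only, not the full training script):

```python
from transformers import TrainingArguments

# Sketch of the batch-size-related settings; other arguments omitted.
# Effective batch size = 16 per device x 8 accumulation steps x 8 GPUs = 1,024.
training_args = TrainingArguments(
    output_dir="gem-modernbert-hq-legal",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,
    max_steps=150_000,
    bf16=True,
)
```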
|
|
|
|
|
**StableAdamW Configuration:** |
|
|
- **Learning Rate**: 0.0002 (conservative for stable convergence) |
|
|
- **Weight Decay**: 0.1 |
|
|
- **Adam Beta1**: 0.9 |
|
|
- **Adam Beta2**: 0.95 |
|
|
- **Adam Epsilon**: 1e-08 |
|
|
- **Gradient Clipping**: 1.0 |
|
|
- **Epsilon Mode**: element_wise |
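
A minimal sketch of the optimizer setup, assuming the `torch-optimi` package (one common StableAdamW implementation; the exact library used in training is not specified here) and a `model` such as the one built in the architecture sketch above:

```python
from optimi import StableAdamW  # pip install torch-optimi

# Sketch: StableAdamW with the hyperparameters listed above.
# Gradient clipping (1.0) and the element-wise epsilon mode are handled by
# the training loop / optimizer internals and are not shown here.
optimizer = StableAdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
```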
|
|
|
|
|
**Advanced Learning Rate Schedule:** |
|
|
- **Schedule Type**: Trapezoidal warmup followed by polynomial decay
|
|
- **Warmup Steps**: 9,000 |
|
|
- **Decay Power**: 0.5 (square-root decay) |
|
|
- **Max Steps**: 150,000 |
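
If one wanted to reproduce this schedule, `transformers`' built-in polynomial scheduler is a close match (a sketch; the exact schedule implementation used in training is not published). It continues the optimizer from the previous sketch:

```python
from transformers import get_polynomial_decay_schedule_with_warmup

# Sketch: 9,000 linear warmup steps, then power-0.5 (square-root) decay
# toward step 150,000.
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=9_000,
    num_training_steps=150_000,
    power=0.5,
)
```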
|
|
|
|
|
**ModernBERT Specifications:** |
|
|
- **MLM Probability**: 0.30 (higher than traditional 15%) |
|
|
- **Max Sequence Length**: 1024 |
|
|
- **Flash Attention 2**: Enabled with optimizations |
|
|
- **Memory Optimization**: Advanced CUDA allocation strategies |
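
The 30% masking rate plugs directly into the standard masked-LM data collator; a minimal sketch:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")

# Mask 30% of tokens per sequence, versus the traditional 15%.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```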
|
|
|
|
|
### Training Results |
|
|
|
|
|
Training converged to the following final metrics:
|
|
|
|
|
- **Final Training Loss**: 0.7648 |
|
|
- **Final Evaluation Loss**: 0.7751 |
|
|
- **Training Infrastructure**: 8x NVIDIA H100 80GB GPUs |
|
|
- **Total Training Steps**: 150,000 |
|
|
- **Total Training Time**: 97 hours 9 minutes |
|
|
- **Train/Validation Split**: 90%/10% |
|
|
- **Effective Training Data**: 21.12GB (with quality-based repetition) |
|
|
- **Context Length**: 1024 tokens per sequence |
|
|
|
|
|
### Advanced Training Infrastructure |
|
|
|
|
|
The model was trained with cutting-edge optimizations: |
|
|
|
|
|
**Flash Attention 2 Optimizations:** |
|
|
```yaml |
|
|
FLASH_ATTENTION_FORCE_FP16: "0" # Use bfloat16 |
|
|
FLASH_ATTENTION_SKIP_RESHAPE: "1" # Skip unnecessary reshapes |
|
|
FLASH_ATTENTION_CAUSAL: "0" # Non-causal for BERT |
|
|
FORCE_FLASH_ATTENTION: "1" # Force Flash Attention usage |
|
|
``` |
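
None of this environment configuration is required at inference time. In current `transformers`, Flash Attention 2 can be requested at load time instead (a sketch; requires the `flash-attn` package and a supported NVIDIA GPU):

```python
import torch
from transformers import AutoModelForMaskedLM

# Request Flash Attention 2 when loading the checkpoint.
model = AutoModelForMaskedLM.from_pretrained(
    "novelcore/gem-modernbert-hq-legal",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```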
|
|
|
|
|
**Memory Optimization:** |
|
|
```yaml |
|
|
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256,roundup_power2_divisions:16,expandable_segments:True,garbage_collection_threshold:0.8" |
|
|
``` |
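
`PYTORCH_CUDA_ALLOC_CONF` is read by PyTorch's CUDA caching allocator, so to reproduce this allocator behavior the variable must be set before CUDA is initialized, for example:

```python
import os

# Must be set before torch initializes CUDA (ideally before `import torch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:256,"
    "roundup_power2_divisions:16,"
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8"
)

import torch  # noqa: E402
```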
|
|
|
|
|
**Distributed Training:** |
|
|
- **Backend**: NCCL with extended timeout configurations |
|
|
- **Mixed Precision**: BFloat16 for optimal H100 performance |
|
|
- **Evaluation Frequency**: Every 5,000 steps |
|
|
- **Checkpointing**: Every 5,000 steps |
|
|
- **Logging**: Every 250 steps |
|
|
|
|
|
## Key Innovations |
|
|
|
|
|
### ModernBERT Architecture Benefits |
|
|
|
|
|
1. **Extended Context Window**: 1024 tokens vs 512 in previous models |
|
|
2. **Flash Attention 2**: Memory-efficient attention for longer sequences |
|
|
3. **StableAdamW Optimizer**: Enhanced training stability and convergence |
|
|
4. **Higher MLM Probability**: 30% masking for improved representation learning |
|
|
5. **Trapezoidal LR Schedule**: Linear warmup followed by square-root decay
|
|
|
|
|
### Quality-Based Data Repetition |
|
|
|
|
|
Consistent with our previous models: |
|
|
|
|
|
1. **Highest quality sources** (legal dictionaries) repeated 4x |
|
|
2. **Medium-high quality sources** (court reports) repeated 3x |
|
|
3. **Medium quality sources** (EU legal texts) repeated 2x |
|
|
4. **Lower quality sources** used once for diversity |