|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- el |
|
|
pipeline_tag: fill-mask |
|
|
library_name: transformers |
|
|
tags: |
|
|
- longformer |
|
|
- fill-mask |
|
|
- greek |
|
|
- legal |
|
|
- long-document |
|
|
- attention |
|
|
base_model: |
|
|
- allenai/longformer-base-4096 |
|
|
--- |
|
|
|
|
|
# GEM-Longformer Legal: A Greek Legal Long-Document Language Model |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**GEM-Longformer Legal** is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This model is specifically designed to handle long legal documents with sequences up to 1024 tokens, utilizing the Longformer's efficient sparse attention mechanism to understand complex legal contexts and cross-references within extended documents. |
|
|
|
|
|
Built upon the Longformer architecture, this model excels at processing lengthy Greek legal texts while maintaining computational efficiency through its sliding window attention pattern combined with global attention on special tokens. It is optimized for downstream tasks such as long-document classification, legal document analysis, and information extraction from extended legal texts. |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
You can use this model directly with the `fill-mask` pipeline: |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the model |
|
|
fill_mask = pipeline( |
|
|
"fill-mask", |
|
|
model="novelcore/gem-longformer-legal", |
|
|
tokenizer="novelcore/gem-longformer-legal" |
|
|
) |
|
|
|
|
|
# Example from a legal context with longer sequence support |
|
|
text = """Σύμφωνα με τις διατάξεις του άρθρου 25 του Συντάγματος και των σχετικών νομοθετικών |
|
|
ρυθμίσεων, ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του |
|
|
Συμβουλίου της Επικρατείας σχετικά με τη νομιμότητα των διοικητικών πράξεων.""" |
|
|
|
|
|
# Get predictions |
|
|
predictions = fill_mask(text) |
|
|
print(predictions) |
|
|
``` |
|
|
|
|
|
For long document processing: |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
|
|
|
# For long legal document classification |
|
|
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-longformer-legal") |
|
|
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-longformer-legal") |
|
|
|
|
|
# Process documents up to 1024 tokens efficiently |
|
|
long_document = "..." # Your long legal document |
|
|
inputs = tokenizer(long_document, return_tensors="pt", max_length=1024, truncation=True) |
|
|
outputs = model(**inputs) |
|
|
``` |
|
|
|
|
|
## Training Data |
|
|
|
|
|
The model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated to ensure high quality and diversity before training. |
|
|
|
|
|
The composition of the training corpus is as follows: |
|
|
|
|
|
| Corpus Source | Size (GB) | Context | |
|
|
| :--- | :--- | :--- | |
|
|
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal | |
|
|
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary | |
|
|
| Political Reports of the Supreme Court | 1.2 | Legal | |
|
|
| Eur-Lex (Greek Content) | 0.92 | Legal | |
|
|
| Europarl (Greek Content) | 0.38 | Legal / Parliamentary | |
|
|
| Raptarchis Legal Dictionary | 0.35 | Legal | |
|
|
| **Total** | **~16.75 GB** | | |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
The model uses the Longformer architecture with the following configuration: |
|
|
|
|
|
- **Hidden Size**: 768 |
|
|
- **Attention Heads**: 12 |
|
|
- **Hidden Layers**: 12 |
|
|
- **Max Position Embeddings**: 1026 |
|
|
- **Attention Window Size**: 512 tokens |
|
|
- **Max Sequence Length**: 1024 tokens |
|
|
- **Vocabulary Size**: 50,264 tokens |
|
|
- **Type Vocab Size**: 1 |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens. |
|
|
|
|
|
The data was processed into sequences of up to 1024 tokens, taking advantage of the Longformer's ability to handle longer sequences efficiently through its sparse attention mechanism. |
|
|
|
|
|
### Pre-training |
|
|
|
|
|
The model was pre-trained from scratch for **120,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed-precision for stability and speed. The training utilized distributed computing with NCCL backend for efficient multi-GPU training. |
|
|
|
|
|
The key hyperparameters used were: |
|
|
|
|
|
- **Learning Rate**: 2e-5 (0.00002) with linear warmup of 7,000 steps |
|
|
- **Batch Size**: Effective batch size of 1,536 (`per_device_train_batch_size: 64`, `gradient_accumulation_steps: 3`) |
|
|
- **Optimizer**: AdamW with weight decay of 0.01 |
|
|
- **Max Gradient Norm**: 1.0 |
|
|
- **Max Sequence Length**: 1024 |
|
|
- **MLM Probability**: 0.15 |
|
|
- **Max Steps**: 120,000 |
|
|
- **Warmup Steps**: 7,000 |
|
|
- **Patience**: 4 (early stopping) |
|
|
- **Train/Validation Split**: 90%/10% |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Precision**: BFloat16 (bf16=True) |
|
|
- **Evaluation Steps**: 4,000 |
|
|
- **Save Steps**: 4,000 |
|
|
- **Logging Steps**: 200 |
|
|
- **Distributed Backend**: NCCL with optimized timeout settings |
|
|
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs |
|
|
|
|
|
### Training Results |
|
|
|
|
|
The model achieved the following performance metrics: |
|
|
|
|
|
- **Final Training Loss**: 1.2823 |
|
|
- **Final Evaluation Loss**: 1.1720 |
|
|
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs |
|
|
- **Training Duration**: 262:24:39 hours |
|
|
- **Total Training Steps**: 120,000 |
|
|
- **Distributed Training**: NCCL backend with enhanced stability settings |
|
|
- **Memory Optimization**: BFloat16 precision with gradient accumulation |
|
|
|
|
|
## Key Features |
|
|
|
|
|
### Long Document Processing |
|
|
- **Extended Context**: Processes sequences up to 1024 tokens efficiently |
|
|
- **Sparse Attention**: Uses sliding window attention (512 tokens) with global attention |
|
|
- **Memory Efficient**: Longformer's attention pattern scales linearly with sequence length |
|
|
|
|
|
### Legal Domain Specialization |
|
|
- **Legal Vocabulary**: Trained on comprehensive Greek legal corpus |
|
|
- **Document Structure**: Understands legal document formatting and cross-references |
|
|
- **Regulatory Text**: Optimized for governmental and parliamentary language |