Update README.md
README.md CHANGED
@@ -15,11 +15,11 @@ base_model:
 - allenai/longformer-base-4096
 ---
 
-#
 
 ## Model Description
 
-**
 
 Built upon the Longformer architecture, this model excels at processing lengthy Greek legal texts while maintaining computational efficiency through its sliding window attention pattern combined with global attention on special tokens. It is optimized for downstream tasks such as long-document classification, legal document analysis, and information extraction from extended legal texts.
@@ -33,8 +33,8 @@ from transformers import pipeline
 # Load the model
 fill_mask = pipeline(
     "fill-mask",
-    model="novelcore/
-    tokenizer="novelcore/
 )
 
 # Example from a legal context with longer sequence support
@@ -53,8 +53,8 @@ For long document processing:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
 # For long legal document classification
-tokenizer = AutoTokenizer.from_pretrained("novelcore/
-model = AutoModelForSequenceClassification.from_pretrained("novelcore/
 
 # Process documents up to 1024 tokens efficiently
 long_document = "..."  # Your long legal document
@@ -147,71 +147,4 @@ The model achieved the following performance metrics:
 ### Legal Domain Specialization
 - **Legal Vocabulary**: Trained on comprehensive Greek legal corpus
 - **Document Structure**: Understands legal document formatting and cross-references
-- **Regulatory Text**: Optimized for governmental and parliamentary language
-
-## Evaluation Results
-
-The model's performance was evaluated on various long-document legal tasks and compared against other Greek language models.
-
-*This section should be filled with your specific results. For example:*
-
-| Model | Long Doc Classification F1 | Legal NER F1 | Document Retrieval mAP |
-| :--- | :--- | :--- | :--- |
-| `AI-team-UoA/GreekLegalRoBERTa_v3` | `[Baseline Score]` | `[Baseline Score]` | `[Baseline Score]` |
-| `Themida-Longformer` (this model) | `[Your Score]` | `[Your Score]` | `[Your Score]` |
-
-## Intended Uses
-
-### Primary Use Cases
-- **Long Legal Document Analysis**: Classification and analysis of extended legal texts
-- **Legal Document Summarization**: Processing multi-page legal documents
-- **Cross-Reference Resolution**: Understanding connections within long legal texts
-- **Regulatory Compliance**: Analysis of comprehensive regulatory documents
-- **Legal Research**: Information extraction from lengthy legal corpora
-
-### Secondary Use Cases
-- **Parliamentary Proceedings Analysis**: Processing extended parliamentary debates
-- **Legal Contract Review**: Analysis of complex multi-section contracts
-- **Case Law Analysis**: Processing lengthy court decisions and opinions
-
-## Limitations and Bias
-
-- **Sequence Length**: Limited to 1024 tokens (longer than BERT but shorter than full Longformer capacity)
-- **Domain Specificity**: Optimized for the Greek legal domain; may not generalize well to other domains
-- **Computational Requirements**: Requires more memory than standard BERT models for long sequences
-- **Language Limitation**: Specifically trained for Greek-language legal texts
-- **Bias**: May reflect biases present in Greek legal and governmental texts
-- **Attention Patterns**: Global attention tokens need to be strategically placed for optimal performance
-
-## Technical Specifications
-
-- **Architecture**: Longformer with sparse attention
-- **Max Sequence Length**: 1024 tokens
-- **Attention Window**: 512 tokens
-- **Global Attention**: Configurable on special tokens
-- **Vocabulary**: Custom Greek legal BPE tokenizer (50,264 tokens)
-- **Training Precision**: BFloat16
-- **Distributed Training**: Multi-GPU NCCL backend
-
-## Model Card Authors
-
-[Your Name / Your Organization's Name]
-
-## Citation
-
-If you use this model in your research, please cite it as follows:
-
-```bibtex
-@misc{your_name_2025_themida_longformer,
-  author = {[Your Name/Organization]},
-  title = {Themida-Longformer: A Greek Legal Long-Document Language Model},
-  year = {2025},
-  publisher = {Hugging Face},
-  journal = {Hugging Face Hub},
-  howpublished = {\url{https://huggingface.co/novelcore/themida-longformer-legal-17G-1024}},
-}
-```
-
-## Acknowledgments
-
-We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized long-document language model for the Greek legal domain. Special acknowledgment to the Allen Institute for AI for developing the Longformer architecture that enables efficient long-sequence processing.
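The Technical Specifications listed above (sliding window of 512, 1024-token limit, 50,264-token vocabulary, BFloat16 training) correspond roughly to a Longformer configuration like the following sketch. The per-layer window list, layer count, and position-embedding offset are assumptions; the model's released `config.json` is authoritative.

```python
# Rough shape of the architecture hyperparameters listed above
# (values restate the model card; not the actual released config).
config = {
    "model_type": "longformer",
    "attention_window": [512] * 12,   # one sliding-window size per layer (12-layer base assumed)
    "max_position_embeddings": 1026,  # 1024 usable tokens + RoBERTa-style offset (assumption)
    "vocab_size": 50264,              # custom Greek legal BPE tokenizer
    "torch_dtype": "bfloat16",        # training precision reported above
}
print(config["attention_window"][0], config["vocab_size"])
```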
@@ -15,11 +15,11 @@ base_model:
 - allenai/longformer-base-4096
 ---
 
+# GEM-Longformer Legal: A Greek Legal Long-Document Language Model
 
 ## Model Description
 
+**GEM-Longformer Legal** is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This model is specifically designed to handle long legal documents with sequences up to 1024 tokens, utilizing the Longformer's efficient sparse attention mechanism to understand complex legal contexts and cross-references within extended documents.
 
 Built upon the Longformer architecture, this model excels at processing lengthy Greek legal texts while maintaining computational efficiency through its sliding window attention pattern combined with global attention on special tokens. It is optimized for downstream tasks such as long-document classification, legal document analysis, and information extraction from extended legal texts.
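At inference time, the sliding-window-plus-global-attention pattern described above is controlled through the `global_attention_mask` argument that Longformer models in `transformers` accept. A minimal sketch of building one (the helper name, toy token ids, and the choice of marking only the leading `<s>`/CLS token as global are illustrative assumptions):

```python
import torch

def cls_global_attention_mask(input_ids: torch.Tensor) -> torch.Tensor:
    # 0 -> local sliding-window attention, 1 -> global attention.
    # Longformer models in `transformers` take this tensor via the
    # `global_attention_mask` keyword of `forward`.
    mask = torch.zeros_like(input_ids)
    mask[:, 0] = 1  # give the leading <s>/CLS token a view of the whole document
    return mask

batch = torch.tensor([[0, 11, 12, 13, 2]])  # toy token ids
print(cls_global_attention_mask(batch).tolist())  # [[1, 0, 0, 0, 0]]
```

For tasks like QA, additional positions (e.g. question tokens) are often marked global as well; where to place them is task-dependent, as the card's limitations note points out.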
@@ -33,8 +33,8 @@ from transformers import pipeline
 # Load the model
 fill_mask = pipeline(
     "fill-mask",
+    model="novelcore/gem-longformer-legal",
+    tokenizer="novelcore/gem-longformer-legal"
 )
 
 # Example from a legal context with longer sequence support
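The fill-mask pipeline returns a list of candidate completions, each a dict with `score`, `token`, `token_str`, and `sequence` keys. Picking the best fill might look like this (the helper and the sample Greek candidates with their scores are illustrative, not real model output):

```python
def top_prediction(candidates):
    # fill-mask pipelines return one dict per candidate token,
    # already scored by the model; keep the highest-probability one.
    return max(candidates, key=lambda c: c["score"])["token_str"]

# Hypothetical output for a masked Greek legal sentence.
sample = [
    {"score": 0.62, "token": 1001, "token_str": "νόμος", "sequence": "…"},
    {"score": 0.21, "token": 1002, "token_str": "διάταγμα", "sequence": "…"},
]
print(top_prediction(sample))  # νόμος
```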
@@ -53,8 +53,8 @@ For long document processing:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
 # For long legal document classification
+tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-longformer-legal")
+model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-longformer-legal")
 
 # Process documents up to 1024 tokens efficiently
 long_document = "..."  # Your long legal document
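Documents longer than the 1024-token limit can be handled by splitting the token-id sequence into overlapping windows and aggregating per-window predictions. A sketch of such a splitter (the function name and the 256-token overlap are assumptions, not something the model card prescribes):

```python
def window_ids(ids, max_len=1024, overlap=256):
    # Return successive windows of at most `max_len` token ids,
    # each sharing `overlap` ids with its predecessor so that no
    # span of context is lost at a window boundary.
    if len(ids) <= max_len:
        return [ids]
    step = max_len - overlap
    return [ids[i:i + max_len] for i in range(0, len(ids) - overlap, step)]

windows = window_ids(list(range(2000)))
print(len(windows), len(windows[0]), windows[-1][-1])  # 3 1024 1999
```

Each window can then be passed through the model separately, with the per-window logits pooled (e.g. averaged) for a document-level prediction.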
@@ -147,71 +147,4 @@ The model achieved the following performance metrics:
 ### Legal Domain Specialization
 - **Legal Vocabulary**: Trained on comprehensive Greek legal corpus
 - **Document Structure**: Understands legal document formatting and cross-references
+- **Regulatory Text**: Optimized for governmental and parliamentary language