alexaapo committed
Commit b5b7353 · verified · 1 Parent(s): d88c476

Update README.md

Files changed (1)
  1. README.md +192 -12
README.md CHANGED
@@ -1,4 +1,34 @@
1
- license: apache-2.0language:elpipeline_tag: fill-masklibrary_name: transformerstags:longformermasked-lmfill-maskgreeklegalbase_model:allenai/longformer-base-4096Themida-Longformer: A Greek Legal Language Model for Long DocumentsModel DescriptionThemida-Longformer is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. It is specifically designed to handle long documents (up to 1024 tokens), making it ideal for understanding the complex and lengthy nature of legal texts in Greece and the EU.This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The Longformer architecture, with its efficient attention mechanism, allows the model to process longer sequences than standard BERT or ELECTRA models, capturing broader context from legal documents.How to Get StartedYou can use this model directly with the fill-mask pipeline:from transformers import pipeline
2
 
3
  # Load the model
4
  fill_mask = pipeline(
@@ -7,28 +37,178 @@ fill_mask = pipeline(
7
  tokenizer="novelcore/themida-longformer-legal-17G-1024"
8
  )
9
 
10
- # Example from a legal context
11
- text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
12
 
13
  # Get predictions
14
  predictions = fill_mask(text)
15
  print(predictions)
16
- For downstream tasks on longer documents:from transformers import AutoTokenizer, AutoModelForSequenceClassification
17
 
18
- # For legal document classification
19
  tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
20
  model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
21
 
22
- # Example with a longer text (up to 1024 tokens)
23
- long_legal_text = "..." # Your long legal document text here
24
- inputs = tokenizer(long_legal_text, return_tensors="pt", max_length=1024, truncation=True)
25
  outputs = model(**inputs)
26
- Training DataThe model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated to ensure high quality and diversity before training.The composition of the training corpus is as follows:Corpus SourceSize (GB)ContextFEK - Greek Government Gazette (all issues)11.0LegalGreek Parliament Proceedings2.9Legal / ParliamentaryPolitical Reports of the Supreme Court1.2LegalEur-Lex (Greek Content)0.92LegalEuroparl (Greek Content)0.38Legal / ParliamentaryRaptarchis Legal Dictionary0.35LegalTotal~16.75 GBTraining ProcedureModel ArchitectureThe model uses the Longformer architecture with the following configuration:Hidden Size: 768Attention Heads: 12Hidden Layers: 12Attention Window: 512Max Sequence Length: 1024PreprocessingThe text was tokenized using a custom ByteLevelBPE tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.The data was then processed into fixed-size chunks of 1024 tokens, respecting document boundaries to ensure contextual coherence over longer passages.Pre-trainingThe model was pre-trained from scratch for 120,000 steps. Training was performed on 8x NVIDIA A100 40GB GPUs (assumed from previous model), using BFloat16 (bf16) mixed-precision for stability and speed.The key hyperparameters used were:Learning Rate: 2e-5 with a linear warmup of 7,000 stepsBatch Size: Effective batch size of 1,536 (per_device_train_batch_size: 64, gradient_accumulation_steps: 3 over 8 GPUs)Optimizer: AdamW with beta1=0.9, beta2=0.98, epsilon=1e-6Weight Decay: 0.01Max Sequence Length: 1024Max Steps: 120,000MLM Probability: 0.15Training ResultsFinal Training Loss: [Fill in with your final training loss]Final Evaluation Loss: [Fill in with your final evaluation loss]Training Infrastructure: 8x NVIDIA A100 40GB GPUsTraining Duration: [Fill in with your total training time]Total Training Steps: 120,000Evaluation ResultsThe model's performance should be evaluated by fine-tuning it on downstream tasks (e.g., NER, Text Classification) and comparing it against other relevant language models.This section should be filled with your specific results. 
For example:ModelNER F1-score (strict)Document Classification F1AI-team-UoA/GreekLegalRoBERTa_v3[F1-Score for Baseline][F1-Score for Baseline]novelcore/themida-electra-legal-17G-8-gpu[F1-Score for Baseline][F1-Score for Baseline]Themida-Longformer (this model)[F1-Score for Your Model][F1-Score for Your Model]Intended UsesPrimary Use CasesAnalysis and classification of long legal documents (contracts, court decisions, legislation).Named entity recognition in Greek legal texts where context is spread across long paragraphs.Legal question answering systems that require processing entire document sections.Compliance monitoring and regulatory analysis over extensive legal filings.Legal text similarity and retrieval for full-document comparison.Secondary Use CasesGeneral Greek text understanding (with potential performance degradation on non-legal text).Legal document summarization.Contract analysis and review.Limitations and BiasThe model's primary strength is handling long sequences; its performance on short texts may be comparable to or slightly less efficient than models like RoBERTa or ELECTRA.The model may reflect biases present in Greek legal and governmental texts.Performance may degrade on informal or colloquial Greek text.Limited knowledge of legal concepts post-training data cutoff.Optimized specifically for the Greek legal domain; may not generalize well to other domains.Model Card Authors[Your Name / Your Organization's Name]CitationIf you use this model in your research, please cite it as follows:@misc{your_name_2025_themida_longformer,
27
  author = {[Your Name/Organization]},
28
- title = {Themida-Longformer: A Greek Legal Language Model for Long Documents},
29
  year = {2025},
30
  publisher = {Hugging Face},
31
  journal = {Hugging Face Hub},
32
- howpublished = {\url{[https://huggingface.co/novelcore/themida-longformer-legal-17G-1024](https://huggingface.co/novelcore/themida-longformer-legal-17G-1024)}},
33
  }
34
- AcknowledgmentsWe thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain.
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - el
5
+ pipeline_tag: fill-mask
6
+ library_name: transformers
7
+ tags:
8
+ - longformer
9
+ - fill-mask
10
+ - greek
11
+ - legal
12
+ - long-document
13
+ - attention
14
+ base_model:
15
+ - allenai/longformer-base-4096
16
+ ---
17
+
18
+ # Themida-Longformer: A Greek Legal Long-Document Language Model
19
+
20
+ ## Model Description
21
+
22
+ **Themida-Longformer** is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This model is specifically designed to handle long legal documents with sequences up to 1024 tokens, utilizing the Longformer's efficient sparse attention mechanism to understand complex legal contexts and cross-references within extended documents.
23
+
24
+ Built upon the Longformer architecture, this model excels at processing lengthy Greek legal texts while maintaining computational efficiency through its sliding window attention pattern combined with global attention on special tokens. It is optimized for downstream tasks such as long-document classification, legal document analysis, and information extraction from extended legal texts.
25
+
26
+ ## How to Get Started
27
+
28
+ You can use this model directly with the `fill-mask` pipeline:
29
+
30
+ ```python
31
+ from transformers import pipeline
32
 
33
  # Load the model
34
  fill_mask = pipeline(
 
37
  tokenizer="novelcore/themida-longformer-legal-17G-1024"
38
  )
39
 
40
+ # Example from a legal context with longer sequence support
41
+ text = """Σύμφωνα με τις διατάξεις του άρθρου 25 του Συντάγματος και των σχετικών νομοθετικών
42
+ ρυθμίσεων, ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του
43
+ Συμβουλίου της Επικρατείας σχετικά με τη νομιμότητα των διοικητικών πράξεων."""
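+ # English gloss: "According to the provisions of Article 25 of the Constitution and the relevant
+ # legislative regulations, Mr. Mitsotakis <mask> that the government fully respects the decisions
+ # of the Council of State regarding the legality of administrative acts."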
44
 
45
  # Get predictions
46
  predictions = fill_mask(text)
47
  print(predictions)
48
+ ```
49
 
50
+ For long document processing:
51
+
52
+ ```python
53
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
54
+
55
+ # For long legal document classification
56
  tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
57
  model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
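+ # Note: the sequence-classification head added on top of the pre-trained encoder is newly
+ # initialized, so the model needs fine-tuning on labeled data before its predictions are meaningful.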
58
 
59
+ # Process documents up to 1024 tokens efficiently
60
+ long_document = "..." # Your long legal document
61
+ inputs = tokenizer(long_document, return_tensors="pt", max_length=1024, truncation=True)
62
  outputs = model(**inputs)
63
+ ```
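+
+ Longformer uses sliding-window (local) attention by default; tokens that should attend to the whole
+ sequence are marked with a `global_attention_mask`. The classification and question-answering heads
+ in `transformers` typically initialize a sensible default (e.g., global attention on the CLS token)
+ when no mask is passed, but it can also be set explicitly. A minimal sketch, with illustrative
+ variable names that are not part of this card:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
+ encoder = AutoModel.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
+
+ long_document = "..."  # your long legal document text
+ inputs = tokenizer(long_document, return_tensors="pt", max_length=1024, truncation=True)
+
+ # 0 = local sliding-window attention, 1 = global attention
+ global_attention_mask = torch.zeros_like(inputs["input_ids"])
+ global_attention_mask[:, 0] = 1  # give the first (<s>) token global attention
+
+ outputs = encoder(**inputs, global_attention_mask=global_attention_mask)
+ last_hidden_state = outputs.last_hidden_state  # shape: (batch, sequence_length, 768)
+ ```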
64
+
65
+ ## Training Data
66
+
67
+ The model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated to ensure high quality and diversity before training.
68
+
69
+ The composition of the training corpus is as follows:
70
+
71
+ | Corpus Source | Size (GB) | Context |
72
+ | :--- | :--- | :--- |
73
+ | FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
74
+ | Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
75
+ | Political Reports of the Supreme Court | 1.2 | Legal |
76
+ | Eur-Lex (Greek Content) | 0.92 | Legal |
77
+ | Europarl (Greek Content) | 0.38 | Legal / Parliamentary |
78
+ | Raptarchis Legal Dictionary | 0.35 | Legal |
79
+ | **Total** | **~16.75 GB** | |
80
+
81
+ ## Training Procedure
82
+
83
+ ### Model Architecture
84
+
85
+ The model uses the Longformer architecture with the following configuration (a `LongformerConfig` sketch follows the list):
86
+
87
+ - **Hidden Size**: 768
88
+ - **Attention Heads**: 12
89
+ - **Hidden Layers**: 12
90
+ - **Max Position Embeddings**: 1026
91
+ - **Attention Window Size**: 512 tokens
92
+ - **Max Sequence Length**: 1024 tokens
93
+ - **Vocabulary Size**: 50,264 tokens
94
+ - **Type Vocab Size**: 1
95
+
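+ As a rough illustration, the values above correspond to a `LongformerConfig` along these lines
+ (a sketch only; the `config.json` bundled with the checkpoint is authoritative):
+
+ ```python
+ from transformers import LongformerConfig
+
+ config = LongformerConfig(
+     vocab_size=50264,
+     hidden_size=768,
+     num_hidden_layers=12,
+     num_attention_heads=12,
+     max_position_embeddings=1026,
+     type_vocab_size=1,
+     attention_window=512,  # same window size for every layer
+ )
+ ```
+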
96
+ ### Preprocessing
97
+
98
+ The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.
99
+
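+ As an illustration, an uncased byte-level BPE tokenizer of this kind can be trained with the
+ `tokenizers` library roughly as follows (the file paths and output directory are placeholders, not
+ the authors' actual training script):
+
+ ```python
+ from tokenizers import ByteLevelBPETokenizer
+
+ tokenizer = ByteLevelBPETokenizer(lowercase=True)  # uncased
+ tokenizer.train(
+     files=["corpus/part-0.txt"],  # placeholder paths to the cleaned Greek legal corpus
+     vocab_size=50264,
+     min_frequency=2,
+     special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
+ )
+ tokenizer.save_model("themida-tokenizer")  # directory must exist; writes vocab.json and merges.txt
+ ```
+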
100
+ The data was processed into sequences of up to 1024 tokens, taking advantage of the Longformer's ability to handle longer sequences efficiently through its sparse attention mechanism.
101
+
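+ One straightforward way to build such blocks while keeping documents separate (a sketch only; the
+ actual preprocessing pipeline is not published in this card):
+
+ ```python
+ def chunk_document(token_ids, block_size=1024):
+     """Split one document's token ids into blocks of at most block_size, never mixing documents."""
+     return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]
+ ```
+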
102
+ ### Pre-training
103
+
104
+ The model was pre-trained from scratch for **120,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed-precision training for stability and speed. Training was distributed across the GPUs with the NCCL backend.
105
+
106
+ The key hyperparameters used were as follows (a `TrainingArguments`-style sketch of this setup appears after the training configuration list below):
107
+
108
+ - **Learning Rate**: 2e-5 with a linear warmup of 7,000 steps
109
+ - **Batch Size**: Effective batch size of 1,536 (`per_device_train_batch_size: 64` × `gradient_accumulation_steps: 3` × 8 GPUs)
110
+ - **Optimizer**: AdamW with weight decay of 0.01
111
+ - **Max Gradient Norm**: 1.0
112
+ - **Max Sequence Length**: 1024
113
+ - **MLM Probability**: 0.15
114
+ - **Max Steps**: 120,000
115
+ - **Warmup Steps**: 7,000
116
+ - **Patience**: 4 (early stopping)
117
+ - **Train/Validation Split**: 90%/10%
118
+
119
+ ### Training Configuration
120
+
121
+ - **Precision**: BFloat16 (bf16=True)
122
+ - **Evaluation Steps**: 4,000
123
+ - **Save Steps**: 4,000
124
+ - **Logging Steps**: 200
125
+ - **Distributed Backend**: NCCL with optimized timeout settings
126
+ - **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
127
+
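+ Put together, the hyperparameters and configuration above map roughly onto the following
+ `transformers` setup. This is a hedged sketch for orientation only; the tokenizer, datasets, and
+ `output_dir` are placeholders, and the authors' actual training script (including the NCCL timeout
+ tuning) is not reproduced here:
+
+ ```python
+ from transformers import (
+     LongformerConfig,
+     LongformerForMaskedLM,
+     TrainingArguments,
+     Trainer,
+     DataCollatorForLanguageModeling,
+     EarlyStoppingCallback,
+ )
+
+ # Fresh model built from the architecture described above
+ model = LongformerForMaskedLM(LongformerConfig(
+     vocab_size=50264, hidden_size=768, num_hidden_layers=12, num_attention_heads=12,
+     max_position_embeddings=1026, type_vocab_size=1, attention_window=512,
+ ))
+
+ training_args = TrainingArguments(
+     output_dir="themida-longformer",      # placeholder
+     per_device_train_batch_size=64,
+     gradient_accumulation_steps=3,        # x 8 GPUs -> effective batch size of 1,536
+     learning_rate=2e-5,
+     warmup_steps=7_000,
+     weight_decay=0.01,
+     max_grad_norm=1.0,
+     max_steps=120_000,
+     bf16=True,
+     eval_strategy="steps",                # "evaluation_strategy" in older transformers releases
+     eval_steps=4_000,
+     save_steps=4_000,
+     logging_steps=200,
+     load_best_model_at_end=True,          # needed for early stopping on the eval loss
+ )
+
+ # Dynamic masking for the MLM objective
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer,                  # the custom ByteLevelBPE tokenizer described above
+     mlm=True,
+     mlm_probability=0.15,
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,          # placeholder: 90% split of the tokenized corpus
+     eval_dataset=eval_dataset,            # placeholder: 10% split
+     data_collator=data_collator,
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=4)],
+ )
+ trainer.train()
+ ```
+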
128
+ ### Training Results
129
+
130
+ Training converged stably; the run had the following characteristics:
131
+
132
+ - **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
133
+ - **Total Training Steps**: 120,000
134
+ - **Distributed Training**: NCCL backend with enhanced stability settings
135
+ - **Memory Optimization**: BFloat16 precision with gradient accumulation
136
+
137
+ ## Key Features
138
+
139
+ ### Long Document Processing
140
+ - **Extended Context**: Processes sequences up to 1024 tokens efficiently
141
+ - **Sparse Attention**: Uses sliding window attention (512 tokens) with global attention
142
+ - **Memory Efficient**: Longformer's attention pattern scales linearly with sequence length
143
+
144
+ ### Legal Domain Specialization
145
+ - **Legal Vocabulary**: Trained on comprehensive Greek legal corpus
146
+ - **Document Structure**: Understands legal document formatting and cross-references
147
+ - **Regulatory Text**: Optimized for governmental and parliamentary language
148
+
149
+ ## Evaluation Results
150
+
151
+ The model's performance should be evaluated by fine-tuning it on downstream long-document legal tasks (e.g., classification, NER, retrieval) and comparing it against other relevant Greek language models.
152
+
153
+ *This section should be filled with your specific results. For example:*
154
+
155
+ | Model | Long Doc Classification F1 | Legal NER F1 | Document Retrieval mAP |
156
+ | :--- | :--- | :--- | :--- |
157
+ | `AI-team-UoA/GreekLegalRoBERTa_v3` | `[Baseline Score]` | `[Baseline Score]` | `[Baseline Score]` |
158
+ | `Themida-Longformer` (this model) | `[Your Score]` | `[Your Score]` | `[Your Score]` |
159
+
160
+ ## Intended Uses
161
+
162
+ ### Primary Use Cases
163
+ - **Long Legal Document Analysis**: Classification and analysis of extended legal texts
164
+ - **Legal Document Summarization**: Processing multi-page legal documents
165
+ - **Cross-Reference Resolution**: Understanding connections within long legal texts
166
+ - **Regulatory Compliance**: Analysis of comprehensive regulatory documents
167
+ - **Legal Research**: Information extraction from lengthy legal corpora
168
+
169
+ ### Secondary Use Cases
170
+ - **Parliamentary Proceedings Analysis**: Processing extended parliamentary debates
171
+ - **Legal Contract Review**: Analysis of complex multi-section contracts
172
+ - **Case Law Analysis**: Processing lengthy court decisions and opinions
173
+
174
+ ## Limitations and Bias
175
+
176
+ - **Sequence Length**: Limited to 1024 tokens (longer than BERT but shorter than full Longformer capacity)
177
+ - **Domain Specificity**: Optimized for Greek legal domain; may not generalize well to other domains
178
+ - **Computational Requirements**: Requires more memory than standard BERT models for long sequences
179
+ - **Language Limitation**: Specifically trained for Greek language legal texts
180
+ - **Bias**: May reflect biases present in Greek legal and governmental texts
181
+ - **Attention Patterns**: Global attention tokens need to be strategically placed for optimal performance
182
+
183
+ ## Technical Specifications
184
+
185
+ - **Architecture**: Longformer with sparse attention
186
+ - **Max Sequence Length**: 1024 tokens
187
+ - **Attention Window**: 512 tokens
188
+ - **Global Attention**: Configurable on special tokens
189
+ - **Vocabulary**: Custom Greek legal BPE tokenizer (50,264 tokens)
190
+ - **Training Precision**: BFloat16
191
+ - **Distributed Training**: Multi-GPU NCCL backend
192
+
193
+ ## Model Card Authors
194
+
195
+ [Your Name / Your Organization's Name]
196
+
197
+ ## Citation
198
+
199
+ If you use this model in your research, please cite it as follows:
200
+
201
+ ```bibtex
202
+ @misc{your_name_2025_themida_longformer,
203
  author = {[Your Name/Organization]},
204
+ title = {Themida-Longformer: A Greek Legal Long-Document Language Model},
205
  year = {2025},
206
  publisher = {Hugging Face},
207
  journal = {Hugging Face Hub},
208
+ howpublished = {\url{https://huggingface.co/novelcore/themida-longformer-legal-17G-1024}},
209
  }
210
+ ```
211
+
212
+ ## Acknowledgments
213
+
214
+ We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized long-document language model for the Greek legal domain. Special acknowledgment to the Allen Institute for AI for developing the Longformer architecture that enables efficient long-sequence processing.