alexaapo committed (verified)
Commit a72dae3 · 1 Parent(s): cdf101f

Update README.md

Files changed (1):
  1. README.md +7 -74

README.md CHANGED
@@ -15,11 +15,11 @@ base_model:
  - allenai/longformer-base-4096
  ---
 
- # Themida-Longformer: A Greek Legal Long-Document Language Model
+ # GEM-Longformer Legal: A Greek Legal Long-Document Language Model
 
  ## Model Description
 
- **Themida-Longformer** is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This model is specifically designed to handle long legal documents with sequences up to 1024 tokens, utilizing the Longformer's efficient sparse attention mechanism to understand complex legal contexts and cross-references within extended documents.
+ **GEM-Longformer Legal** is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This model is specifically designed to handle long legal documents with sequences up to 1024 tokens, utilizing the Longformer's efficient sparse attention mechanism to understand complex legal contexts and cross-references within extended documents.
 
  Built upon the Longformer architecture, this model excels at processing lengthy Greek legal texts while maintaining computational efficiency through its sliding window attention pattern combined with global attention on special tokens. It is optimized for downstream tasks such as long-document classification, legal document analysis, and information extraction from extended legal texts.
 
@@ -33,8 +33,8 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
      "fill-mask",
-     model="novelcore/themida-longformer-legal-17G-1024",
-     tokenizer="novelcore/themida-longformer-legal-17G-1024"
+     model="novelcore/gem-longformer-legal",
+     tokenizer="novelcore/gem-longformer-legal"
  )
 
  # Example from a legal context with longer sequence support
@@ -53,8 +53,8 @@ For long document processing:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
  # For long legal document classification
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-longformer-legal")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-longformer-legal")
 
  # Process documents up to 1024 tokens efficiently
  long_document = "..." # Your long legal document
@@ -147,71 +147,4 @@ The model achieved the following performance metrics:
  ### Legal Domain Specialization
  - **Legal Vocabulary**: Trained on comprehensive Greek legal corpus
  - **Document Structure**: Understands legal document formatting and cross-references
- - **Regulatory Text**: Optimized for governmental and parliamentary language
-
- ## Evaluation Results
-
- The model's performance was evaluated on various long-document legal tasks and compared against other Greek language models.
-
- *This section should be filled with your specific results. For example:*
-
- | Model | Long Doc Classification F1 | Legal NER F1 | Document Retrieval mAP |
- | :--- | :--- | :--- | :--- |
- | `AI-team-UoA/GreekLegalRoBERTa_v3` | `[Baseline Score]` | `[Baseline Score]` | `[Baseline Score]` |
- | `Themida-Longformer` (this model) | `[Your Score]` | `[Your Score]` | `[Your Score]` |
-
- ## Intended Uses
-
- ### Primary Use Cases
- - **Long Legal Document Analysis**: Classification and analysis of extended legal texts
- - **Legal Document Summarization**: Processing multi-page legal documents
- - **Cross-Reference Resolution**: Understanding connections within long legal texts
- - **Regulatory Compliance**: Analysis of comprehensive regulatory documents
- - **Legal Research**: Information extraction from lengthy legal corpora
-
- ### Secondary Use Cases
- - **Parliamentary Proceedings Analysis**: Processing extended parliamentary debates
- - **Legal Contract Review**: Analysis of complex multi-section contracts
- - **Case Law Analysis**: Processing lengthy court decisions and opinions
-
- ## Limitations and Bias
-
- - **Sequence Length**: Limited to 1024 tokens (longer than BERT but shorter than full Longformer capacity)
- - **Domain Specificity**: Optimized for Greek legal domain; may not generalize well to other domains
- - **Computational Requirements**: Requires more memory than standard BERT models for long sequences
- - **Language Limitation**: Specifically trained for Greek language legal texts
- - **Bias**: May reflect biases present in Greek legal and governmental texts
- - **Attention Patterns**: Global attention tokens need to be strategically placed for optimal performance
-
- ## Technical Specifications
-
- - **Architecture**: Longformer with sparse attention
- - **Max Sequence Length**: 1024 tokens
- - **Attention Window**: 512 tokens
- - **Global Attention**: Configurable on special tokens
- - **Vocabulary**: Custom Greek legal BPE tokenizer (50,264 tokens)
- - **Training Precision**: BFloat16
- - **Distributed Training**: Multi-GPU NCCL backend
-
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_longformer,
-   author = {[Your Name/Organization]},
-   title = {Themida-Longformer: A Greek Legal Long-Document Language Model},
-   year = {2025},
-   publisher = {Hugging Face},
-   journal = {Hugging Face Hub},
-   howpublished = {\url{https://huggingface.co/novelcore/themida-longformer-legal-17G-1024}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized long-document language model for the Greek legal domain. Special acknowledgment to the Allen Institute for AI for developing the Longformer architecture that enables efficient long-sequence processing.
+ - **Regulatory Text**: Optimized for governmental and parliamentary language
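
The README text changed in this commit describes a sliding-window attention pattern (512-token window) combined with global attention on special tokens. As a self-contained illustration of that pattern — not the model's actual implementation, and with a function name of my own choosing — the following sketch computes which positions each token may attend to under a Longformer-style mask:

```python
def longformer_attention_mask(seq_len, window, global_positions=()):
    """Boolean mask: mask[i][j] is True when token i may attend to token j
    under a Longformer-style pattern: a symmetric sliding window of size
    `window`, plus full row and column attention for global positions
    (e.g. the <s> token, as the model card's global-attention note suggests)."""
    half = window // 2
    globals_ = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            # Local window, or either endpoint carries global attention
            if abs(i - j) <= half or i in globals_ or j in globals_:
                mask[i][j] = True
    return mask

# Tiny example: 8 tokens, window 4, global attention on position 0
m = longformer_attention_mask(8, 4, global_positions=[0])
print(m[5])  # token 5 sees position 0 (global) and positions 3..7 (window)
```

This is why the cost scales roughly linearly with sequence length: each row has at most `window + 1` local entries plus the handful of global columns, rather than `seq_len` entries as in full self-attention.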