---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- longformer
- masked-lm
- fill-mask
- greek
- legal
base_model: allenai/longformer-base-4096
---

# Themida-Longformer: A Greek Legal Language Model for Long Documents

## Model Description

Themida-Longformer is a Longformer-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. It is specifically designed to handle long documents (up to 1024 tokens), making it well suited to the complex and lengthy legal texts produced in Greece and the EU.

This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The Longformer architecture, with its efficient attention mechanism, allows the model to process longer sequences than standard BERT or ELECTRA models, capturing broader context from legal documents.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-longformer-legal-17G-1024",
    tokenizer="novelcore/themida-longformer-legal-17G-1024"
)

# Example from a legal context
# ("Mr. Mitsotakis <mask> that the government fully respects the decisions of the Council of State.")
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)
```
For downstream tasks on longer documents:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-longformer-legal-17G-1024")

# Example with a longer text (up to 1024 tokens)
long_legal_text = "..."  # Your long legal document text here
inputs = tokenizer(long_legal_text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model(**inputs)
```
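NER is also listed among the primary downstream tasks. The following is a minimal sketch of loading the checkpoint with a token-classification head for fine-tuning; the `LABELS` list is a hypothetical example, and no NER-fine-tuned weights are published with this model.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set, for illustration only; replace with the labels of your own NER dataset.
LABELS = ["O", "B-ORG", "I-ORG", "B-LEG-REF", "I-LEG-REF"]

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/themida-longformer-legal-17G-1024",
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

# The token-classification head above is randomly initialized; fine-tune it on an
# annotated Greek legal NER corpus before relying on its predictions.
```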
## Training Data

The model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated to ensure high quality and diversity before training.

The composition of the training corpus is as follows:

| Corpus Source | Size (GB) | Context |
| --- | --- | --- |
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
| Political Reports of the Supreme Court | 1.2 | Legal |
| Eur-Lex (Greek content) | 0.92 | Legal |
| Europarl (Greek content) | 0.38 | Legal / Parliamentary |
| Raptarchis Legal Dictionary | 0.35 | Legal |
| **Total** | **~16.75** | |

## Training Procedure

### Model Architecture

The model uses the Longformer architecture with the following configuration:

- Hidden size: 768
- Attention heads: 12
- Hidden layers: 12
- Attention window: 512
- Max sequence length: 1024

### Preprocessing

The text was tokenized using a custom ByteLevelBPE tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (it does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.

The data was then processed into fixed-size chunks of 1024 tokens, respecting document boundaries to preserve contextual coherence over longer passages.

### Pre-training

The model was pre-trained from scratch for 120,000 steps. Training was performed on 8x NVIDIA A100 40GB GPUs (assumed to match the previous model in this series), using BFloat16 (bf16) mixed precision for stability and speed.

The key hyperparameters used were:

- Learning rate: 2e-5 with a linear warmup of 7,000 steps
- Batch size: effective batch size of 1,536 (per_device_train_batch_size: 64, gradient_accumulation_steps: 3, over 8 GPUs)
- Optimizer: AdamW with beta1=0.9, beta2=0.98, epsilon=1e-6
- Weight decay: 0.01
- Max sequence length: 1024
- Max steps: 120,000
- MLM probability: 0.15
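For orientation, these hyperparameters map roughly onto the Hugging Face `Trainer` configuration sketched below. This is a reconstruction for illustration only, not the original training script; the `output_dir` value and the use of `DataCollatorForLanguageModeling` are assumptions.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-longformer-legal-17G-1024")

# Masked-language-modelling collator: 15% of tokens are masked, as listed above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Hyperparameters taken from the list above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="themida-longformer-pretraining",
    max_steps=120_000,
    learning_rate=2e-5,
    warmup_steps=7_000,
    per_device_train_batch_size=64,   # 64 x 3 accumulation x 8 GPUs = effective batch of 1,536
    gradient_accumulation_steps=3,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,
)
```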
### Training Results

- Final Training Loss: [Fill in with your final training loss]
- Final Evaluation Loss: [Fill in with your final evaluation loss]
- Training Infrastructure: 8x NVIDIA A100 40GB GPUs
- Training Duration: [Fill in with your total training time]
- Total Training Steps: 120,000

## Evaluation Results

The model's performance should be evaluated by fine-tuning it on downstream tasks (e.g., NER, Text Classification) and comparing it against other relevant language models. This section should be filled in with your specific results, for example:

| Model | NER F1-score (strict) | Document Classification F1 |
| --- | --- | --- |
| AI-team-UoA/GreekLegalRoBERTa_v3 | [F1-Score for Baseline] | [F1-Score for Baseline] |
| novelcore/themida-electra-legal-17G-8-gpu | [F1-Score for Baseline] | [F1-Score for Baseline] |
| Themida-Longformer (this model) | [F1-Score for Your Model] | [F1-Score for Your Model] |

## Intended Uses

### Primary Use Cases

- Analysis and classification of long legal documents (contracts, court decisions, legislation).
- Named entity recognition in Greek legal texts where context is spread across long paragraphs.
- Legal question answering systems that require processing entire document sections.
- Compliance monitoring and regulatory analysis over extensive legal filings.
- Legal text similarity and retrieval for full-document comparison.

### Secondary Use Cases

- General Greek text understanding (with potential performance degradation on non-legal text).
- Legal document summarization.
- Contract analysis and review.

## Limitations and Bias

- The model's primary strength is handling long sequences; on short texts it may be no more accurate, and somewhat less efficient, than models such as RoBERTa or ELECTRA.
- The model may reflect biases present in Greek legal and governmental texts.
- Performance may degrade on informal or colloquial Greek text.
- The model has limited knowledge of legal concepts introduced after the training data cutoff.
- It is optimized specifically for the Greek legal domain and may not generalize well to other domains.

## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_longformer,
  author = {[Your Name/Organization]},
  title = {Themida-Longformer: A Greek Legal Language Model for Long Documents},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/novelcore/themida-longformer-legal-17G-1024}},
}
```
## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain.