---
language:
- en
- fr
- es
- ar
license: apache-2.0
tags:
- ner
- relation-extraction
- legal
- multilingual
- roberta
- human-rights
- international-law
datasets:
- legal-documents
- human-rights-reports
widget:
- text: "The International Criminal Court issued a warrant for the general's arrest in connection with war crimes committed in the region."
- text: "Le Tribunal pénal international a émis un mandat d'arrêt contre le général pour crimes de guerre."
- text: "La Corte Penal Internacional emitió una orden de arresto contra el general por crímenes de guerra."
pipeline_tag: token-classification
---

# RoBERTa Joint NER+RE Model for Legal Text Analysis

## Model Description

This XLM-RoBERTa-based model is fine-tuned for **joint Named Entity Recognition (NER) and Relation Extraction (RE)** on legal text and human rights documentation. It identifies legal entities and the relationships between them in multilingual legal documents.

**Developed by:** Lemkin AI
**Model type:** XLM-RoBERTa Large for Token Classification
**Base model:** [Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)
**Language(s):** English, French, Spanish, Arabic
**License:** Apache 2.0

## Model Details

### Architecture

- **Base Model:** XLM-RoBERTa Large (multilingual)
- **Parameters:** 560M total parameters
- **Model Size:** 2.1GB
- **Task Heads:** Joint NER + RE classifier
- **Input Length:** 512 tokens maximum
- **Layers:** 24 transformer layers
- **Hidden Size:** 1024
- **Attention Heads:** 16

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LemkinAI/roberta-joint-ner-re")
model = AutoModelForTokenClassification.from_pretrained("LemkinAI/roberta-joint-ner-re")
model.eval()  # inference mode (disables dropout)

# Example text
text = "The International Criminal Court issued a warrant for the general's arrest."

# Tokenize (inputs longer than 512 tokens are truncated) and predict
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token
predictions = torch.argmax(outputs.logits, dim=-1)

# Process results: map token ids and label ids back to strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print every sub-word token that is not labelled "O" (outside any entity)
for token, label in zip(tokens, predicted_labels):
    if label != "O":
        print(f"{token}: {label}")
```

## Model Performance

- **Named Entity Recognition F1:** 0.92
- **Relation Extraction F1:** 0.87
- **Supported Languages:** English, French, Spanish, Arabic
- **Entity Types:** 71 specialized legal entity types
- **Relation Types:** 21 legal relation types

## Training Data

Trained on 85,000 annotated legal documents, including:

- International court decisions (ICC, ICJ, ECHR)
- Human rights reports and investigations
- Legal case documents and treaties
- Time period: 1990-2024

## Use Cases

- Legal document analysis and research
- Human rights violation documentation
- Evidence organization and structuring
- Academic legal NLP research
- Investigative journalism

## Citation

```bibtex
@misc{lemkin-roberta-ner-re-2025,
  title={RoBERTa Joint NER+RE Model for Legal Text Analysis},
  author={Lemkin AI Team},
  year={2025},
  url={https://huggingface.co/LemkinAI/roberta-joint-ner-re}
}
```
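
## Grouping Token Predictions into Entities

The snippet in the Usage section prints one label per sub-word token, so a single entity such as a court name appears as several pieces. The sketch below shows one possible way to merge those pieces into whole entity spans. It is a minimal illustration, not part of the released model: the helper name `group_entities` is hypothetical, the `ORG` label in the example is a placeholder, and the code assumes BIO-style labels (`B-TYPE`, `I-TYPE`, `O`), which should be checked against `model.config.id2label` before use. It relies on the XLM-RoBERTa tokenizer marking word starts with the SentencePiece character `▁` (U+2581).

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}  # XLM-RoBERTa special tokens
WORD_START = "\u2581"  # SentencePiece marker for the start of a word


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge sub-word tokens into (entity_text, entity_type) spans.

    ASSUMPTION: labels follow a BIO scheme ("B-TYPE", "I-TYPE", "O").
    Labels without a B-/I- prefix are treated as continuation labels.
    """
    entities: List[Tuple[str, str]] = []
    current_text, current_type = "", None

    def flush() -> None:
        # Store the span collected so far, then reset the buffer.
        nonlocal current_text, current_type
        if current_type is not None and current_text.strip():
            entities.append((current_text.strip(), current_type))
        current_text, current_type = "", None

    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS or label == "O":
            flush()
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix not in ("B", "I"):
            # No BIO prefix: use the whole label as the entity type.
            prefix, entity_type = "I", label
        # Start a new span on "B-" or whenever the entity type changes.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        # The word-start marker becomes a space; other pieces are appended as-is.
        current_text += token.replace(WORD_START, " ")
    flush()
    return entities


# Hand-written example input; the labels are illustrative, not model output.
words = "The International Criminal Court issued a warrant".split()
tokens = ["<s>"] + [WORD_START + w for w in words] + ["</s>"]
labels = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
print(group_entities(tokens, labels))
# [('International Criminal Court', 'ORG')]
```

With real model output, `tokens` and `predicted_labels` from the Usage snippet can be passed in directly. The `transformers` token-classification pipeline also offers an `aggregation_strategy` option that performs similar grouping; whether it suits this model's label set has not been verified here.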