---
language:
- en
- fr
- es
- ar
license: apache-2.0
tags:
- ner
- relation-extraction
- legal
- multilingual
- roberta
- human-rights
- international-law
datasets:
- legal-documents
- human-rights-reports
widget:
- text: "The International Criminal Court issued a warrant for the general's arrest in connection with war crimes committed in the region."
- text: "Le Tribunal pénal international a émis un mandat d'arrêt contre le général pour crimes de guerre."
- text: "La Corte Penal Internacional emitió una orden de arresto contra el general por crímenes de guerra."
pipeline_tag: token-classification
---

# RoBERTa Joint NER+RE Model for Legal Text Analysis

## Model Description

This XLM-RoBERTa-based model performs **joint Named Entity Recognition (NER) and Relation Extraction (RE)** and is fine-tuned for legal text analysis and human rights documentation. It is designed to identify legal entities and the relationships between them in multilingual legal documents.

- **Developed by:** Lemkin AI
- **Model type:** XLM-RoBERTa Large for Token Classification
- **Base model:** [Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)
- **Language(s):** English, French, Spanish, Arabic
- **License:** Apache 2.0

## Model Details

### Architecture
- **Base Model:** XLM-RoBERTa Large (multilingual)
- **Parameters:** 560M
- **Model Size:** 2.1 GB
- **Task Heads:** Joint NER + RE classifier
- **Input Length:** 512 tokens maximum
- **Layers:** 24 transformer layers
- **Hidden Size:** 1024
- **Attention Heads:** 16

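Because the input length is capped at 512 tokens, longer legal documents have to be split before inference. The following is a minimal sketch of overlapping windows over a list of token IDs. The function name `sliding_windows`, the overlap of 128 tokens, and the two positions reserved for special tokens are illustrative assumptions, not part of the released model.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_length: int = 512,
                    stride: int = 128, num_special: int = 2) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit the input limit.

    max_length:  model input limit (512 according to this card)
    stride:      tokens shared by consecutive windows (assumed value)
    num_special: positions reserved for special tokens (assumed to be 2)
    """
    window = max_length - num_special
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = window - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the document
        start += step
    return windows


ids = list(range(1200))  # stand-in for real token IDs
chunks = sliding_windows(ids)
print([len(c) for c in chunks])
```

Hugging Face fast tokenizers can produce comparable windows directly through the `return_overflowing_tokens=True` and `stride` arguments; the manual version above only makes the overlap arithmetic explicit.
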
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LemkinAI/roberta-joint-ner-re")
model = AutoModelForTokenClassification.from_pretrained("LemkinAI/roberta-joint-ner-re")

# Example text
text = "The International Criminal Court issued a warrant for the general's arrest."

# Tokenize (inputs longer than 512 tokens are truncated) and predict
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map each predicted class ID back to its label name
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

# Print every subword token that was not tagged as outside ("O")
for token, label in zip(tokens, predicted_labels):
    if label != "O":
        print(f"{token}: {label}")
```

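The usage snippet prints one label per subword token, so an entity that spans several tokens appears as several lines. The sketch below merges such tokens into entity spans. It assumes the labels follow a BIO scheme (`B-`/`I-` prefixes) and that word starts are marked with the SentencePiece `▁` character; neither is confirmed by this card, so check `model.config.id2label` first. The function name `group_entities`, the `ORG` label, and the hand-written token list are hypothetical.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO labels into (entity_text, entity_type) pairs."""
    entities = []
    current_pieces = []
    current_type = None

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_pieces, current_type
        if current_pieces:
            text = "".join(current_pieces).replace("▁", " ").strip()
            entities.append((text, current_type))
        current_pieces, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in ("<s>", "</s>", "<pad>") or label == "O":
            flush()
            continue
        prefix, _, ent_type = label.partition("-")
        # A B- tag or a change of type starts a new entity.
        if prefix == "B" or ent_type != current_type:
            flush()
            current_type = ent_type
        current_pieces.append(token)
    flush()
    return entities


# Hand-written example; real tokenization and label names may differ.
tokens = ["<s>", "▁The", "▁International", "▁Criminal", "▁Court",
          "▁issued", "▁a", "▁warrant", "</s>"]
labels = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
print(group_entities(tokens, labels))
```
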
## Model Performance

- **Named Entity Recognition F1:** 0.92
- **Relation Extraction F1:** 0.87
- **Supported Languages:** English, French, Spanish, Arabic
- **Entity Types:** 71 specialized legal entity types
- **Relation Types:** 21 legal relation types

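F1 is the harmonic mean of precision and recall, which is not the same quantity as accuracy. The card does not say how its scores were computed, so the following is only a sketch of one common convention: micro-averaged, exact-match, entity-level scoring. The span format, the function name `precision_recall_f1`, and the entity types in the example are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple

# (start_token, end_token, entity_type); this span format is an assumption
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Iterable[Span],
                        predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up spans: two of four predictions match the three gold entities.
gold = [(1, 3, "ORG"), (7, 8, "PER"), (12, 13, "CRIME")]
pred = [(1, 3, "ORG"), (7, 8, "ORG"), (12, 13, "CRIME"), (20, 21, "LOC")]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The same function applies to relation extraction if each item is a hashable tuple such as (head span, relation type, tail span).
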
## Training Data

Trained on 85,000 annotated legal documents dated from 1990 to 2024, including:
- International court decisions (ICC, ICJ, ECHR)
- Human rights reports and investigations
- Legal case documents and treaties

## Use Cases

- Legal document analysis and research
- Human rights violation documentation
- Evidence organization and structuring
- Academic legal NLP research
- Investigative journalism

## Citation

```bibtex
@misc{lemkin-roberta-ner-re-2025,
  title={RoBERTa Joint NER+RE Model for Legal Text Analysis},
  author={Lemkin AI Team},
  year={2025},
  url={https://huggingface.co/LemkinAI/roberta-joint-ner-re}
}
```