---
language: 
- en
- fr
- es
- ar
license: apache-2.0
tags:
- ner
- relation-extraction
- legal
- multilingual
- roberta
- human-rights
- international-law
datasets:
- legal-documents
- human-rights-reports
widget:
  - text: "The International Criminal Court issued a warrant for the general's arrest in connection with war crimes committed in the region."
  - text: "Le Tribunal pénal international a émis un mandat d'arrêt contre le général pour crimes de guerre."
  - text: "La Corte Penal Internacional emitió una orden de arresto contra el general por crímenes de guerra."
pipeline_tag: token-classification
---

# RoBERTa Joint NER+RE Model for Legal Text Analysis

## Model Description

This XLM-RoBERTa-based model performs **joint Named Entity Recognition (NER) and Relation Extraction (RE)** and is fine-tuned for legal text analysis and human rights documentation. It is designed to identify legal entities and the relationships between them in multilingual legal documents.

**Developed by:** Lemkin AI  
**Model type:** XLM-RoBERTa Large for Token Classification  
**Base model:** [Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)  
**Language(s):** English, French, Spanish, Arabic  
**License:** Apache 2.0  

## Model Details

### Architecture
- **Base Model:** XLM-RoBERTa Large (multilingual)
- **Parameters:** 560M total parameters
- **Model Size:** 2.1GB
- **Task Heads:** Joint NER + RE classifier
- **Input Length:** 512 tokens maximum
- **Layers:** 24 transformer layers
- **Hidden Size:** 1024
- **Attention Heads:** 16
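
Legal documents often run well past the 512-token input limit, so they usually need to be split before inference. The following is a minimal sketch of overlapping windows over a list of token ids. The function name `sliding_windows`, the 128-token overlap, and the two reserved positions for `<s>` and `</s>` are illustrative assumptions, not settings documented for this model.

```python
def sliding_windows(token_ids, max_len=512, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit the model limit.

    max_len      -- model input limit (512 per this card)
    stride       -- overlap between consecutive windows (assumed value)
    num_special  -- positions reserved for <s> and </s> (assumed value)
    Returns a list of (start_offset, window_ids) pairs.
    """
    body = max_len - num_special  # usable tokens per window
    if stride >= body:
        raise ValueError("stride must be smaller than the usable window size")
    step = body - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break  # this window reached the end of the document
        start += step
    return windows


# Hypothetical document of 1,200 token ids
doc_ids = list(range(1200))
windows = sliding_windows(doc_ids)
for start, ids in windows:
    print(start, len(ids))
```

The overlap means an entity that straddles a window boundary appears whole in at least one window, at the cost of having to deduplicate predictions in the overlapping region. Hugging Face tokenizers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride`.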

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LemkinAI/roberta-joint-ner-re")
model = AutoModelForTokenClassification.from_pretrained("LemkinAI/roberta-joint-ner-re")

# Example text
text = "The International Criminal Court issued a warrant for the general's arrest."

# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

for token, label in zip(tokens, predicted_labels):
    if label != "O":
        print(f"{token}: {label}")
```
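
The loop above prints one line per subword piece, which can be hard to read for multi-word entities. Below is a rough sketch of merging pieces into entity spans. It assumes the labels follow a BIO scheme (`B-` and `I-` prefixes), which this card does not confirm; the helper name `merge_bio_spans` and the `ORG` label in the example are hypothetical. The `▁` character is the SentencePiece word-start marker used by XLM-RoBERTa tokenizers.

```python
def merge_bio_spans(tokens, labels, special_tokens=("<s>", "</s>", "<pad>")):
    """Group subword tokens into (entity_type, text) spans.

    Assumes BIO-style labels such as "B-ORG" / "I-ORG" (an assumption,
    not confirmed by the model card). Labels without a prefix are
    treated as continuation labels of their own type.
    """
    spans = []
    current_type, current_pieces = None, []

    def flush():
        nonlocal current_type, current_pieces
        if current_type is not None:
            # "\u2581" marks the start of a word in SentencePiece output
            text = "".join(current_pieces).replace("\u2581", " ").strip()
            spans.append((current_type, text))
        current_type, current_pieces = None, []

    for token, label in zip(tokens, labels):
        if token in special_tokens or label == "O":
            flush()
            continue
        prefix, sep, entity_type = label.partition("-")
        if not sep:  # no BIO prefix present
            prefix, entity_type = "I", label
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_pieces.append(token)
    flush()
    return spans


# Hypothetical tokens and labels for illustration only
example_tokens = ["<s>", "▁The", "▁International", "▁Criminal", "▁Court", "▁issued", "</s>"]
example_labels = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]
print(merge_bio_spans(example_tokens, example_labels))
```

With the variables from the usage snippet, the call would be `merge_bio_spans(tokens, predicted_labels)`. The `token-classification` pipeline in `transformers` offers comparable grouping through its `aggregation_strategy` argument.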

## Model Performance

- **Named Entity Recognition F1:** 0.92
- **Relation Extraction F1:** 0.87
- **Supported Languages:** English, French, Spanish, Arabic
- **Entity Types:** 71 specialized legal entity types
- **Relation Types:** 21 legal relation types
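
F1 is the harmonic mean of precision and recall. The card does not say whether the reported scores are micro- or macro-averaged, or whether spans must match exactly, so the sketch below is only one plausible reading: exact-match, span-level scoring. The function name `span_f1`, the `(start, end, type)` span format, and the example labels are assumptions for illustration.

```python
def span_f1(gold, predicted):
    """Return (precision, recall, f1) for exact-match spans.

    Each span is a hashable tuple such as (start, end, entity_type).
    This is an illustrative scoring scheme, not necessarily the one
    used to produce the numbers reported in this card.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: two of three gold spans are recovered,
# plus two spurious predictions.
gold_spans = {(0, 3, "ORG"), (5, 6, "PER"), (8, 9, "CRIME")}
pred_spans = {(0, 3, "ORG"), (5, 6, "PER"), (7, 9, "CRIME"), (10, 11, "LOC")}
precision, recall, f1 = span_f1(gold_spans, pred_spans)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same function applies to relation extraction if each item is a `(head_span, relation_type, tail_span)` triple instead of an entity span.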

## Training Data

Trained on 85,000 annotated legal documents dating from 1990 to 2024, including:
- International court decisions (ICC, ICJ, ECHR)
- Human rights reports and investigations
- Legal case documents and treaties

## Use Cases

- Legal document analysis and research
- Human rights violation documentation
- Evidence organization and structuring
- Academic legal NLP research
- Investigative journalism

## Citation

```bibtex
@misc{lemkin-roberta-ner-re-2025,
  title={RoBERTa Joint NER+RE Model for Legal Text Analysis},
  author={Lemkin AI Team},
  year={2025},
  url={https://huggingface.co/LemkinAI/roberta-joint-ner-re}
}
```