---
language:
- en
- fr
- es
- ar
license: apache-2.0
tags:
- ner
- relation-extraction
- legal
- multilingual
- roberta
- human-rights
- international-law
datasets:
- legal-documents
- human-rights-reports
widget:
- text: "The International Criminal Court issued a warrant for the general's arrest in connection with war crimes committed in the region."
- text: "Le Tribunal pénal international a émis un mandat d'arrêt contre le général pour crimes de guerre."
- text: "La Corte Penal Internacional emitió una orden de arresto contra el general por crímenes de guerra."
pipeline_tag: token-classification
---
# RoBERTa Joint NER+RE Model for Legal Text Analysis
## Model Description
This XLM-RoBERTa-based model performs **joint Named Entity Recognition (NER) and Relation Extraction (RE)** and is fine-tuned specifically for legal text analysis and human rights documentation. It is designed to identify legal entities and the relationships between them in multilingual legal documents.
**Developed by:** Lemkin AI
**Model type:** XLM-RoBERTa Large for Token Classification
**Base model:** [Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)
**Language(s):** English, French, Spanish, Arabic
**License:** Apache 2.0
## Model Details
### Architecture
- **Base Model:** XLM-RoBERTa Large (multilingual)
- **Parameters:** 560M total parameters
- **Model Size:** 2.1GB
- **Task Heads:** Joint NER + RE classifier
- **Input Length:** 512 tokens maximum
- **Layers:** 24 transformer layers
- **Hidden Size:** 1024
- **Attention Heads:** 16
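
As a rough consistency check on the figures above, the parameter count can be estimated from the layer dimensions. This is a sketch only: the hidden size and layer count come from this card, while the vocabulary size, position count, and feed-forward width are assumed to match the standard XLM-RoBERTa Large configuration. The classification heads are left out because their size is not stated here.

```python
# Values stated in this model card
hidden = 1024
layers = 24

# Assumptions: standard XLM-RoBERTa Large configuration, not stated in this card
vocab_size = 250_002
max_positions = 514
intermediate = 4096


def linear(n_in, n_out):
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embeddings plus the embedding LayerNorm
embeddings = vocab_size * hidden + max_positions * hidden + hidden + layer_norm

per_layer = (
    4 * linear(hidden, hidden)        # query, key, value, attention output
    + layer_norm                      # LayerNorm after attention
    + linear(hidden, intermediate)    # feed-forward up-projection
    + linear(intermediate, hidden)    # feed-forward down-projection
    + layer_norm                      # LayerNorm after feed-forward
)

encoder_params = embeddings + layers * per_layer
fp32_bytes = encoder_params * 4  # 4 bytes per float32 parameter

print(f"{encoder_params / 1e6:.0f}M encoder parameters")
print(f"{fp32_bytes / 2**30:.2f} GiB at float32")
```

Under these assumptions the encoder alone comes to just under 560M parameters and roughly 2.1 GiB in float32, which is in line with the totals listed above.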
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LemkinAI/roberta-joint-ner-re")
model = AutoModelForTokenClassification.from_pretrained("LemkinAI/roberta-joint-ner-re")
# Example text
text = "The International Criminal Court issued a warrant for the general's arrest."
# Tokenize and predict
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
# Process results
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]
for token, label in zip(tokens, predicted_labels):
    if label != "O":
        print(f"{token}: {label}")
```
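
The snippet above prints one label per subword token. To read the output as whole entities, the pieces can be merged into spans. The helper below is a minimal sketch, not part of this model's API: `group_entities` is a hypothetical name, and it assumes SentencePiece tokens (word starts marked with `▁`) and BIO-style labels such as `B-ORG` / `I-ORG`. The label names in the example are illustrative only; check `model.config.id2label` for the actual scheme.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge subword tokens with BIO labels into (entity_text, entity_type) pairs.

    Assumes SentencePiece tokens and labels shaped like "B-TYPE" / "I-TYPE" / "O".
    """
    special = {"<s>", "</s>", "<pad>"}
    entities: List[Tuple[str, str]] = []
    current_pieces: List[str] = []
    current_type = None

    def flush():
        # Close the span being built, if any, and reset the state.
        nonlocal current_pieces, current_type
        if current_pieces:
            text = "".join(current_pieces).replace("▁", " ").strip()
            entities.append((text, current_type))
        current_pieces, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in special or label == "O":
            flush()
            continue
        prefix, _, ent_type = label.partition("-")
        if not ent_type:
            # Label has no BIO prefix: treat the whole label as the type.
            prefix, ent_type = "I", label
        if prefix == "B" or ent_type != current_type:
            flush()
            current_type = ent_type
        current_pieces.append(token)
    flush()
    return entities


# Illustrative input; "ORG" is a placeholder label, not a documented entity type.
example_tokens = ["<s>", "▁The", "▁International", "▁Criminal", "▁Court", "▁issued", "</s>"]
example_labels = ["O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]
print(group_entities(example_tokens, example_labels))
```

With the `tokens` and `predicted_labels` lists from the usage snippet, the call would be `group_entities(tokens, predicted_labels)`.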
## Model Performance
- **Named Entity Recognition F1:** 0.92
- **Relation Extraction F1:** 0.87
- **Supported Languages:** English, French, Spanish, Arabic
- **Entity Types:** 71 specialized legal entity types
- **Relation Types:** 21 legal relation types
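
F1 is the harmonic mean of precision and recall, so it is not the same quantity as accuracy. This card does not state how the scores above were computed. The sketch below shows one common convention, exact-match span-level F1, purely as an illustration. The function name `span_f1` and the example spans are hypothetical.

```python
def span_f1(gold, predicted):
    """Exact-match F1 over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative spans: one of two predictions matches one of two gold spans.
gold_spans = {(2, 5, "ORG"), (9, 10, "PER")}
predicted_spans = {(2, 5, "ORG"), (7, 8, "LOC")}
print(span_f1(gold_spans, predicted_spans))
```

Here precision and recall are both 0.5, so the F1 score is also 0.5.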
## Training Data
Trained on 85,000 annotated legal documents including:
- International court decisions (ICC, ICJ, ECHR)
- Human rights reports and investigations
- Legal case documents and treaties
- Documents dated 1990-2024
## Use Cases
- Legal document analysis and research
- Human rights violation documentation
- Evidence organization and structuring
- Academic legal NLP research
- Investigative journalism
## Citation
```bibtex
@misc{lemkin-roberta-ner-re-2025,
title={RoBERTa Joint NER+RE Model for Legal Text Analysis},
author={Lemkin AI Team},
year={2025},
url={https://huggingface.co/LemkinAI/roberta-joint-ner-re}
}
```