	---
	language:
	- en
	- fr
	- es
	- ar
	license: apache-2.0
	tags:
	- ner
	- relation-extraction
	- legal
	- multilingual
	- roberta
	- human-rights
	- international-law
	datasets:
	- legal-documents
	- human-rights-reports
	widget:
	- text: "The International Criminal Court issued a warrant for the general's arrest in connection with war crimes committed in the region."
	- text: "Le Tribunal pénal international a émis un mandat d'arrêt contre le général pour crimes de guerre."
	- text: "La Corte Penal Internacional emitió una orden de arresto contra el general por crímenes de guerra."
	pipeline_tag: token-classification
	---

	# RoBERTa Joint NER+RE Model for Legal Text Analysis

	## Model Description

	This XLM-RoBERTa-based model performs joint Named Entity Recognition (NER) and Relation Extraction (RE). It is fine-tuned for legal text analysis and human rights documentation, and is designed to identify legal entities and the relationships between them in multilingual legal documents.

	- Developed by: Lemkin AI
	- Model type: XLM-RoBERTa Large for Token Classification
	- Base model: [Davlan/xlm-roberta-large-ner-hrl](https://huggingface.co/Davlan/xlm-roberta-large-ner-hrl)
	- Language(s): English, French, Spanish, Arabic
	- License: Apache 2.0

	## Model Details

	### Architecture
	- Base Model: XLM-RoBERTa Large (multilingual)
	- Parameters: 560M total parameters
	- Model Size: 2.1GB
	- Task Heads: Joint NER + RE classifier
	- Input Length: 512 tokens maximum
	- Layers: 24 transformer layers
	- Hidden Size: 1024
	- Attention Heads: 16
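
Because the input is capped at 512 tokens, longer legal documents have to be split before inference. The sketch below shows one common approach, overlapping sliding windows over the token IDs. The function name `sliding_windows`, the default `max_len=510` (which leaves room for the two special tokens XLM-RoBERTa adds to a single sequence), and the default `stride=128` are illustrative assumptions, not settings documented for this model.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_len: int = 510, stride: int = 128) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that an entity cut
    at one window boundary appears whole in the neighbouring window.
    (Hypothetical helper; defaults are assumptions, not model settings.)
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - stride  # how far the window start advances each time
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return windows
```

Fast tokenizers in `transformers` can produce similar windows directly when called with `return_overflowing_tokens=True` and a `stride` value. Predictions in the overlapping regions then need to be reconciled, for example by keeping the label from the window in which the token is farther from the edge.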

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	# Load model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("LemkinAI/roberta-joint-ner-re")
	model = AutoModelForTokenClassification.from_pretrained("LemkinAI/roberta-joint-ner-re")

	# Example text
	text = "The International Criminal Court issued a warrant for the general's arrest."

	# Tokenize and predict
	inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
	with torch.no_grad():
	    outputs = model(**inputs)
	predictions = torch.argmax(outputs.logits, dim=-1)

	# Process results
	tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
	predicted_labels = [model.config.id2label[pred.item()] for pred in predictions[0]]

	# Print every token that was assigned a non-"O" (entity) label
	for token, label in zip(tokens, predicted_labels):
	    if label != "O":
	        print(f"{token}: {label}")
	```
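
The snippet above prints one label per subword token. To recover whole entities, the token-level labels usually need to be merged into spans. The following is a minimal sketch that assumes the model uses a `B-`/`I-` (BIO) labelling scheme and SentencePiece tokens with the `▁` word-start marker. This card does not document the label format, so check `model.config.id2label` first. The helper name `group_bio_spans` and the `ORG` label in the usage note are hypothetical.

```python
from typing import List, Tuple


def group_bio_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level BIO labels into (entity_text, entity_type) pairs.

    Assumes labels look like "B-TYPE" / "I-TYPE" / "O" (an assumption,
    not something this model card confirms).
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def flush() -> None:
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens:
            text = "".join(current_tokens).replace("\u2581", " ").strip()
            spans.append((text, current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        # Special tokens and "O" labels always end the current entity.
        if token in ("<s>", "</s>", "<pad>") or label == "O":
            flush()
            continue
        prefix, _, ent_type = label.partition("-")
        # A "B-" tag or a change of type starts a new entity.
        if prefix == "B" or ent_type != current_type:
            flush()
            current_type = ent_type
        current_tokens.append(token)

    flush()
    return spans
```

With the variables from the usage snippet, `group_bio_spans(tokens, predicted_labels)` would return a list of pairs such as `("International Criminal Court", "ORG")` if the model emitted those labels. The `transformers` token-classification pipeline offers comparable merging through its `aggregation_strategy` argument. Whether either approach exposes the relation-extraction output of this model is not stated in the card.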

	## Model Performance

	- Named Entity Recognition F1: 0.92
	- Relation Extraction F1: 0.87
	- Supported Languages: English, French, Spanish, Arabic
	- Entity Types: 71 specialized legal entity types
	- Relation Types: 21 legal relation types
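
F1 is the harmonic mean of precision and recall, which is not the same as accuracy. The card does not say how the reported scores were computed (micro or macro averaging, exact or partial span match), so the sketch below only illustrates one common convention: exact-match, micro-averaged F1 over entity spans. The `Span` tuple layout and the function name `span_f1` are assumptions made for this example.

```python
from typing import Set, Tuple

# (start_token, end_token, entity_type); layout chosen for illustration only
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for exact-match entity spans."""
    true_positives = len(gold & pred)  # spans matching in boundaries and type
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, with three gold spans and two predicted spans of which one matches exactly, precision is 1/2, recall is 1/3, and F1 is 0.4. The same function can be applied to relation triples by replacing the span tuples with (head, relation, tail) tuples.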

	## Training Data

	Trained on 85,000 annotated legal documents spanning the period 1990-2024, including:
	- International court decisions (ICC, ICJ, ECHR)
	- Human rights reports and investigations
	- Legal case documents and treaties

	## Use Cases

	- Legal document analysis and research
	- Human rights violation documentation
	- Evidence organization and structuring
	- Academic legal NLP research
	- Investigative journalism

	## Citation

	```bibtex
	@misc{lemkin-roberta-ner-re-2025,
	title={RoBERTa Joint NER+RE Model for Legal Text Analysis},
	author={Lemkin AI Team},
	year={2025},
	url={https://huggingface.co/LemkinAI/roberta-joint-ner-re}
	}
	```