---
library_name: transformers
datasets:
- coastalchp/ledgar
language:
- en
base_model:
- nlpaueb/legal-bert-base-uncased
pipeline_tag: text-classification
---

# LegalBERT Fine-Tuned on LEDGAR Dataset

This model is a fine-tuned version of **[LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased)** on the **LEDGAR** dataset for **legal clause classification**. It classifies a legal clause into one of **100 clause types** (e.g., confidentiality, termination, liability).

---

## Model Overview

- **Base Model:** `nlpaueb/legal-bert-base-uncased`
- **Task:** Multi-class clause classification
- **Dataset:** LEDGAR
- **Language:** English
- **Number of labels:** 100
- **Fine-tuning epochs:** 4
- **Batch size:** 32
- **Optimizer:** AdamW
- **Mixed precision (FP16):** Enabled when CUDA is available

---
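For orientation, the hyperparameters above imply the optimizer-step count below. This is a back-of-the-envelope sketch: the 60,000-sample train split is taken from the dataset table on this card, and a gradient-accumulation factor of 1 is an assumption, not something the card states.

```python
import math

# From the card: 60,000 training samples, batch size 32, 4 epochs.
# No gradient accumulation is assumed (not stated on the card).
train_samples = 60_000
batch_size = 32
epochs = 4

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 1875
print(total_steps)      # 7500
```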

## Dataset Details

| Split | Samples | Description |
|-------|---------|-------------|
| Train | 60,000 | Used for model fine-tuning |
| Eval | 10,000 | Used for validation during training |
| Test | 10,000 | Held-out test set for final evaluation |

- **Total samples:** 80,000
- **Number of labels:** 100
- **Text column:** `text` (the clause text)
- **Label column:** `label`

---
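The `label` column stores integer IDs, so mapping predictions back to clause-type names requires an `id2label` table. A minimal sketch of building one; the three names below are illustrative placeholders, not the actual LEDGAR label list (which has 100 entries):

```python
# Hypothetical clause-type names for illustration only;
# the real LEDGAR label list has 100 entries.
label_names = ["confidentiality", "termination", "liability"]

# Build the forward and reverse mappings between IDs and names.
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in id2label.items()}

print(id2label[1])           # termination (under this toy mapping)
print(label2id["liability"]) # 2
```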

## Evaluation Results (on Test Set)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.8678 |
| **Macro F1** | 0.7779 |
| **Macro Precision** | 0.7917 |
| **Macro Recall** | 0.7763 |
| **Evaluation Time** | 38.37 s |

---
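Macro F1 is the unweighted mean of per-class F1 scores, so rare clause types count as much as frequent ones. A self-contained sketch of the computation, run on toy 3-class data (the actual evaluation spans 100 classes):

```python
def macro_f1(y_true, y_pred, num_labels):
    # Compute per-class F1, then take an unweighted mean over classes
    # ("macro" averaging): every class contributes equally.
    f1s = []
    for c in range(num_labels):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        f1s.append(f1)
    return sum(f1s) / num_labels

# Toy 3-class check; per-class F1s are 0.5, 0.8, and 2/3.
score = macro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], 3)
print(round(score, 4))  # 0.6556
```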

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and its tokenizer from the Hub
model_name = "FENTECH/Legal-BERT-Clause-Classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example inference
text = "The contractor shall maintain confidentiality of all client information."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predicted_id = outputs.logits.argmax(dim=-1).item()
print("Predicted label ID:", predicted_id)
print("Predicted clause type:", model.config.id2label[predicted_id])
```
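The snippet above reports only the argmax; to attach a confidence score, the logits can be passed through a softmax first. A dependency-free sketch of that step, using toy 3-way logits (the real model outputs 100):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits standing in for outputs.logits[0].tolist()
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)

print(pred)                     # index of the most probable class
print(round(probs[pred], 3))    # its softmax confidence
```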