---
library_name: transformers
datasets:
- coastalchp/ledgar
language:
- en
base_model:
- nlpaueb/legal-bert-base-uncased
pipeline_tag: text-classification
---
# LegalBERT Fine-Tuned on LEDGAR Dataset
This model is a fine-tuned version of **[LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased)** on the **LEDGAR** dataset for **legal clause classification**.
It classifies legal clauses into one of **100 clause types** (e.g., confidentiality, termination, liability).
---
## Model Overview
- **Base Model:** `nlpaueb/legal-bert-base-uncased`
- **Task:** Multi-class clause classification
- **Dataset:** LEDGAR
- **Language:** English
- **Number of labels:** 100
- **Fine-tuning epochs:** 4
- **Batch size:** 32
- **Optimizer:** AdamW
- **Mixed Precision (FP16):** Enabled (when CUDA available)
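The hyperparameters above map naturally onto the `transformers` Trainer API. A minimal configuration sketch (the learning rate and output directory are assumptions for illustration, not values from the actual training script; AdamW is the Trainer's default optimizer):

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="legalbert-ledgar",       # assumption: any local path works
    num_train_epochs=4,                  # as listed above
    per_device_train_batch_size=32,      # as listed above
    fp16=torch.cuda.is_available(),      # mixed precision only when CUDA is available
    learning_rate=2e-5,                  # assumption: a typical BERT fine-tuning value
    evaluation_strategy="epoch",
)
```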
---
## Dataset Details
| Split | Samples | Description |
|-------|----------|-------------|
| Train | 60,000 | Used for model fine-tuning |
| Eval | 10,000 | Used for validation during training |
| Test | 10,000 | Held-out test set for final evaluation |
- **Total samples:** 80,000
- **Number of labels:** 100
- **Text column:** `text` (contains the clause text)
- **Label column:** `label`
---
## Evaluation Results (on Test Set)
| Metric | Score |
|---------|--------|
| **Accuracy** | 0.8678 |
| **Macro F1** | 0.7779 |
| **Macro Precision** | 0.7917 |
| **Macro Recall** | 0.7763 |
| **Evaluation Time** | 38.37 sec |
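Macro-averaged metrics weight all 100 classes equally, so rare clause types count as much as frequent ones; this is why Macro F1 (0.7779) sits below plain accuracy (0.8678). A minimal pure-Python sketch of how macro precision, recall, and F1 are computed, shown on toy data (not the actual evaluation script, which would typically use `sklearn.metrics`):

```python
from collections import defaultdict

def macro_scores(y_true, y_pred):
    """Compute macro precision, recall, and F1 over all observed classes."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    labels = set(y_true) | set(y_pred)
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example with 3 classes
p, r, f = macro_scores([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(f"macro P={p:.3f}  R={r:.3f}  F1={f:.3f}")
```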
---
## How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "FENTECH/Legal-BERT-Clause-Classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Example inference (BERT models accept at most 512 tokens)
text = "The contractor shall maintain confidentiality of all client information."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted label ID:", predicted_label)
```
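The logits returned above are unnormalized scores; applying a softmax turns them into per-class probabilities, which is useful if you want a confidence estimate alongside the predicted label ID. A stdlib-only sketch (the logit values below are made up for illustration; with the model above you would apply `torch.softmax(outputs.logits, dim=-1)` instead):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Pretend the model emitted these logits for 4 of the 100 classes
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted class {best} with confidence {probs[best]:.2f}")
```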