---
library_name: transformers
datasets:
- coastalchp/ledgar
language:
- en
base_model:
- nlpaueb/legal-bert-base-uncased
pipeline_tag: text-classification
---
# LegalBERT Fine-Tuned on LEDGAR Dataset

This model is a fine-tuned version of **[LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased)** on the **LEDGAR** dataset for **legal clause classification**.
It assigns each legal clause to one of **100 clause types** (e.g., confidentiality, termination, liability).

---

## Model Overview

- **Base Model:** `nlpaueb/legal-bert-base-uncased`
- **Task:** Multi-class clause classification
- **Dataset:** LEDGAR
- **Language:** English
- **Number of labels:** 100
- **Fine-tuning epochs:** 4
- **Batch size:** 32
- **Optimizer:** AdamW
- **Mixed Precision (FP16):** Enabled (when CUDA available)

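
The settings listed above could be expressed roughly as follows. This is a minimal sketch that assumes the `Trainer` API from `transformers` was used, which this card does not state. The output path is hypothetical, the batch size is assumed to be per device, and the learning rate, weight decay, and warmup are not given in the card, so the library defaults apply here.

```python
import torch
from transformers import TrainingArguments

# Sketch only: values marked "from the card" come from the Model Overview list;
# everything else is an assumption or a library default.
training_args = TrainingArguments(
    output_dir="legalbert-ledgar",     # hypothetical output path
    num_train_epochs=4,                # from the card
    per_device_train_batch_size=32,    # from the card (assumed to be per device)
    per_device_eval_batch_size=32,     # assumption: same batch size for evaluation
    optim="adamw_torch",               # AdamW from the card; PyTorch implementation assumed
    fp16=torch.cuda.is_available(),    # FP16 only when CUDA is available, as the card states
)
```
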

---

## Dataset Details

| Split | Samples | Description |
|-------|----------|-------------|
| Train | 60,000 | Used for model fine-tuning |
| Eval | 10,000 | Used for validation during training |
| Test | 10,000 | Held-out test set for final evaluation |

- **Total samples:** 80,000
- **Number of labels:** 100
- **Text column:** `text` (contains the clause text)
- **Label column:** `label`

---

## Evaluation Results (on Test Set)

| Metric | Score |
|---------|--------|
| **Accuracy** | 0.8678 |
| **Macro F1** | 0.7779 |
| **Macro Precision** | 0.7917 |
| **Macro Recall** | 0.7763 |
| **Evaluation Time** | 38.37 s |

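
The card does not say which tool produced these scores. As a minimal sketch, assuming the common convention that macro scores are the unweighted mean of the per-class scores, the hypothetical helper `macro_metrics` below shows how accuracy and the macro metrics relate. The toy labels are invented for illustration and are not LEDGAR data.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy example (invented): class 0 is frequent, class 1 is rare and always missed.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
scores = macro_metrics(y_true, y_pred)
print(scores)
```

Because every class counts equally in a macro average, a few poorly predicted rare classes can pull macro F1 well below accuracy. That may be one reason for the gap in the table above, although the card does not report per-class results.
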

---

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "FENTECH/Legal-BERT-Clause-Classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inference
text = "The contractor shall maintain confidentiality of all client information."
# Truncate to BERT's 512-token limit so long clauses do not raise an error
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed at inference time
    outputs = model(**inputs)

predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted label ID:", predicted_label)
# Label name from the model config (generic "LABEL_<id>" if no names were saved)
print("Predicted label:", model.config.id2label[predicted_label])
```
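
The snippet above returns only the single most likely label ID. To inspect how confident the model is, the logits can be turned into probabilities with a softmax and ranked. The following is a minimal sketch in plain Python; `top_k_labels` is a hypothetical helper, and the toy logits and label names are invented for illustration rather than taken from this model.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Convert raw logits to softmax probabilities and return the k most likely labels."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in ranked]


# Toy values for illustration only (not real model output or the real label mapping).
toy_logits = [0.2, 2.5, -1.0, 1.1]
toy_id2label = {0: "Amendments", 1: "Confidentiality", 2: "Notices", 3: "Terminations"}

top2 = top_k_labels(toy_logits, toy_id2label, k=2)
for name, prob in top2:
    print(f"{name}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` and `model.config.id2label` can be passed in place of the toy values.
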