HALE-LAB/Legal-Analytics


vartika0302 committed on Nov 5, 2024

Commit

8f4a777


1 Parent(s): 49bfb59

Update README.md


Files changed (1)

README.md +87 -3

README.md CHANGED

@@ -1,3 +1,87 @@
----
-license: mit
----

+---
+license: mit
+---
+# Legal Document Processing Models
+This repository contains two models for processing legal documents:
+1. **Rhetorical Role Segmentation** - A model designed to handle class imbalance using a class-weighted oversampling technique.
+2. **Summarization** - A hybrid model that combines extractive and abstractive summarization to retain key information in legal texts.
+Both models are saved in `.h5` format and can be loaded and used for inference as detailed below.
+---
+## Table of Contents
+1. [Project Overview](#project-overview)
+2. [Model Details](#model-details)
+    - [Rhetorical Role Segmentation](#rhetorical-role-segmentation)
+    - [Summarization](#summarization)
+3. [Requirements](#requirements)
+4. [Setup and Installation](#setup-and-installation)
+5. [Model Usage](#model-usage)
+6. [Evaluation Metrics](#evaluation-metrics)
+7. [License](#license)
+---
+## Project Overview
+This project addresses key challenges in processing legal documents, specifically:
+- **Rhetorical Role Segmentation**: Tackling class imbalance in datasets to enhance the model's ability to detect minority rhetorical roles in legal texts.
+- **Summarization**: Generating high-quality summaries using a hybrid extractive-abstractive approach to capture crucial legal entities and condense information.
+---
+## Model Details
+### Rhetorical Role Segmentation
+- **Objective**: Segment legal text based on rhetorical roles (e.g., Facts, Arguments, Rulings).
+- **Challenge**: Imbalanced class labels, with majority classes overshadowing minority classes.
+- **Solution**: A class-weighted oversampling method that calculates an oversampling rate based on the frequency distribution of class labels, ensuring adequate representation of minority classes.
+#### Model Architecture
+This model is trained to distinguish 13 class labels in legal text, utilizing class-weighted oversampling to prevent overfitting to the majority classes.
+#### Data Augmentation
+- A normalized frequency distribution of class labels was created, as visualized in `Figure 5: Unbalanced Dataset`.
+- Using this distribution, we applied a class-weighted oversampling rate to increase minority class samples proportionally.
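
The oversampling idea described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the function names `oversampling_rates` and `oversample`, and the choice of rate (majority-class frequency divided by each class's frequency), are assumptions made for the example. The README does not state the exact formula.

```python
import random
from collections import Counter


def oversampling_rates(labels):
    """Return an assumed per-class oversampling rate.

    The rate is the majority-class frequency divided by the class frequency.
    Because both frequencies share the same denominator (the dataset size),
    this equals the ratio of raw counts.
    """
    counts = Counter(labels)
    max_count = max(counts.values())
    return {label: max_count / n for label, n in counts.items()}


def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until each class reaches
    its target size (original size multiplied by its rate)."""
    rng = random.Random(seed)
    rates = oversampling_rates(labels)

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        target = round(len(items) * rates[label])
        # Keep every original sample, then draw extras with replacement.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extras)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

With this assumed rate, every class ends up with as many samples as the majority class. A gentler rate (for example, the square root of the ratio) would reduce duplication at the cost of leaving some imbalance in place.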
+---
+### Summarization
+**Objective**: Summarize legal documents by retaining essential legal entities and reducing information redundancy.
+**Approach**:
+1. **Extractive Step**: An NER model tailored for legal documents extracts sentences with named entities, producing an initial summary.
+2. **Abstractive Step**: A fine-tuned Longformer Encoder-Decoder (LED) model refines the extractive summary to create a concise output.
+#### Model Architecture
+- The extractive component utilizes a legal-specific Named Entity Recognition (NER) model.
+- The abstractive component uses a pre-trained LED model from Hugging Face, further fine-tuned on legal data.
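
The extractive step, keeping only sentences that contain named entities, can be sketched as below. The legal NER model itself is not specified in this README, so `find_entities` is a hypothetical regex-based stand-in used purely for illustration. The function name `extractive_summary` and the naive sentence splitter are likewise assumptions.

```python
import re

# Hypothetical placeholder for the legal NER model: it only recognises a few
# statute references and court names so that the example is self-contained.
ENTITY_PATTERN = re.compile(
    r"\b(?:Section|Article)\s+\d+\b|\b(?:Supreme|High)\s+Court\b"
)


def find_entities(sentence):
    """Return the entity strings found in a sentence (stand-in for real NER)."""
    return ENTITY_PATTERN.findall(sentence)


def extractive_summary(text, ner=find_entities):
    """Keep, in their original order, the sentences with at least one entity."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if ner(s)]
    return " ".join(kept)
```

In the pipeline described above, the resulting text would then be passed to the fine-tuned LED model for the abstractive step. Swapping `ner` for a call to an actual NER model leaves the rest of the function unchanged.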
+#### Evaluation
+Each component was evaluated using ROUGE metrics, with the hybrid approach demonstrating superior performance in terms of informativeness and conciseness.
+---
+## Requirements
+- Python 3.7 or later
+- TensorFlow
+- Keras
+- Hugging Face Transformers
+- `torch`, `rouge_score`, and other standard libraries
+Install the dependencies with:
+```bash
+pip install -r requirements.txt