Legal-Analytics / README.md

vartika0302

license: mit

Legal Document Processing Models

This repository contains two models for processing legal documents:

Rhetorical Role Segmentation - A model designed to handle class imbalance using a class-weighted oversampling technique.
Summarization - A hybrid model that combines extractive and abstractive summarization to retain key information in legal texts.

Both models are saved in .h5 format and can be loaded and used for inference as detailed below.

Table of Contents

Project Overview
Model Details
- Rhetorical Role Segmentation
- Summarization
Requirements
Setup and Installation
Model Usage
Evaluation Metrics
License

Project Overview

This project addresses key challenges in processing legal documents, specifically:

Rhetorical Role Segmentation: Tackling class imbalance in datasets to enhance the model's ability to detect minority rhetorical roles in legal texts.
Summarization: Generating high-quality summaries using a hybrid extractive-abstractive approach to capture crucial legal entities and condense information.

Model Details

Rhetorical Role Segmentation

Objective: Segment legal text based on rhetorical roles (e.g., Facts, Arguments, Rulings).

Challenge: Imbalanced class labels, with majority classes overshadowing minority classes.

Solution: A class-weighted oversampling method that calculates an oversampling rate based on the frequency distribution of class labels, ensuring adequate representation of minority classes.

Model Architecture

The model is trained to distinguish 13 rhetorical-role class labels in legal text, using class-weighted oversampling to reduce bias toward the majority classes.

Data Augmentation

A normalized frequency distribution of class labels was created, as visualized in Figure 5: Unbalanced Dataset.
Using this distribution, we applied a class-weighted oversampling rate to increase minority class samples proportionally.
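The oversampling step can be sketched as follows. This README does not give the exact rate formula, so the sketch assumes one plausible choice: each class is oversampled by the ratio of the largest normalized frequency to its own. The function names `oversampling_rates` and `oversample` are illustrative and are not part of the released code.

```python
import random
from collections import Counter


def oversampling_rates(labels):
    """Map each class to an oversampling rate (>= 1).

    Assumption: rate = max normalized frequency / class normalized frequency.
    The formula actually used for the released model may differ.
    """
    counts = Counter(labels)
    total = len(labels)
    freq = {cls: n / total for cls, n in counts.items()}  # normalized frequencies
    max_freq = max(freq.values())
    return {cls: max_freq / f for cls, f in freq.items()}


def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until each class reaches its target size."""
    rng = random.Random(seed)
    rates = oversampling_rates(labels)

    # Group samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        target = round(len(items) * rates[label])
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


if __name__ == "__main__":
    labels = ["Facts"] * 6 + ["Arguments"] * 3 + ["Ruling"] * 1
    sentences = [f"sentence {i}" for i in range(len(labels))]
    _, balanced = oversample(sentences, labels)
    print(Counter(balanced))
```

With this particular rate every class ends up as large as the majority class. A gentler rate (for example, the square root of the ratio) would raise minority counts without fully balancing the data.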

Summarization

Objective: Summarize legal documents by retaining essential legal entities and reducing information redundancy.

Approach:

Extractive Step: An NER model tailored for legal documents extracts sentences with named entities, producing an initial summary.
Abstractive Step: A fine-tuned Longformer Encoder-Decoder (LED) model refines the extractive summary to create a concise output.
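A minimal sketch of the extractive step is shown below. The legal NER model itself is not specified here, so `toy_legal_ner` is a hypothetical regex stand-in used only to make the example runnable; in practice it would be replaced by the real NER model's prediction call.

```python
import re

# Hypothetical stand-in for a legal NER model (assumption): it flags statute
# references, "X v. Y" case names, and two court names.
ENTITY_PATTERN = re.compile(
    r"(Section \d+|Article \d+|[A-Z][a-z]+ v\. [A-Z][a-z]+|Supreme Court|High Court)"
)


def toy_legal_ner(sentence):
    """Return the list of entity strings found in a sentence."""
    return ENTITY_PATTERN.findall(sentence)


def extractive_summary(sentences, ner=toy_legal_ner):
    """Keep, in their original order, the sentences in which the NER finds an entity."""
    return [s for s in sentences if ner(s)]


if __name__ == "__main__":
    document = [
        "The appellant filed a petition under Section 482.",
        "The hearing was adjourned twice.",
        "The High Court dismissed the petition.",
    ]
    for sentence in extractive_summary(document):
        print(sentence)
```

The output of `extractive_summary` would then be joined into a single string and passed to the abstractive step.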

Model Architecture

The extractive component uses a Named Entity Recognition (NER) model specific to legal text.
The abstractive component uses a pre-trained LED model from Hugging Face, further fine-tuned on legal data.
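For illustration, the abstractive step could look roughly like the sketch below, which uses the PyTorch LED classes from Hugging Face Transformers. It assumes the public `allenai/led-base-16384` checkpoint rather than the fine-tuned weights from this repository, and the function name `abstractive_summary` and the generation settings are illustrative choices.

```python
def abstractive_summary(text, checkpoint="allenai/led-base-16384", max_summary_tokens=256):
    """Summarize `text` with an LED checkpoint.

    Assumption: `checkpoint` is the public base model, not this repository's
    fine-tuned weights, so output quality on legal text will differ.
    """
    # Imported here so the heavy dependencies are only needed when the function runs.
    import torch
    from transformers import LEDForConditionalGeneration, LEDTokenizer

    tokenizer = LEDTokenizer.from_pretrained(checkpoint)
    model = LEDForConditionalGeneration.from_pretrained(checkpoint)

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

    # LED attends locally by default; give the first token global attention.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            max_length=max_summary_tokens,
            num_beams=4,
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling `abstractive_summary(extractive_text)` on the joined extractive sentences downloads the checkpoint on first use, so it needs network access and enough memory for the model.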

Evaluation

Each component was evaluated using ROUGE metrics, with the hybrid approach demonstrating superior performance in terms of informativeness and conciseness.
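To make the metric concrete, here is a simplified ROUGE-N computation in plain Python. It is a sketch only: it lowercases and splits on whitespace, without the stemming or tokenization rules of the `rouge_score` package listed in the requirements, so its numbers will not match that package exactly. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between reference and candidate."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # multiset intersection clips repeated n-grams
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    reference = "the court dismissed the appeal"
    candidate = "the court dismissed the case"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

Recall rewards covering the reference summary (informativeness), while precision penalizes extra material in the candidate (conciseness).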

Requirements

Python 3.7 or later
TensorFlow
Keras
Hugging Face Transformers
torch, rouge_score, and other standard libraries

Install the dependencies with:

pip install -r requirements.txt
