license: mit

Legal Document Processing Models

This repository contains two models for processing legal documents:

  1. Rhetorical Role Segmentation - A model designed to handle class imbalance using a class-weighted oversampling technique.
  2. Summarization - A hybrid model that combines extractive and abstractive summarization to retain key information in legal texts.

Both models are saved in .h5 format and can be loaded and used for inference as detailed below.


Table of Contents

  1. Project Overview
  2. Model Details
  3. Requirements
  4. Setup and Installation
  5. Model Usage
  6. Evaluation Metrics
  7. License

Project Overview

This project addresses key challenges in processing legal documents, specifically:

  • Rhetorical Role Segmentation: Tackling class imbalance in datasets to enhance the model's ability to detect minority rhetorical roles in legal texts.
  • Summarization: Generating high-quality summaries using a hybrid extractive-abstractive approach to capture crucial legal entities and condense information.

Model Details

Rhetorical Role Segmentation

Objective: Segment legal text based on rhetorical roles (e.g., Facts, Arguments, Rulings).

Challenge: Imbalanced class labels, with majority classes overshadowing minority classes.

Solution: A class-weighted oversampling method that calculates an oversampling rate based on the frequency distribution of class labels, ensuring adequate representation of minority classes.

Model Architecture

The model classifies legal text into 13 rhetorical role labels. It is trained with class-weighted oversampling so that its predictions are not biased toward the majority classes.

Data Augmentation

  • A normalized frequency distribution of class labels was created, as visualized in Figure 5: Unbalanced Dataset.
  • Using this distribution, we applied a class-weighted oversampling rate to increase minority class samples proportionally.
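
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's training code: the README does not give the exact rate formula, so the sketch assumes each class is oversampled by the ratio of the largest class frequency to its own frequency (which fully balances the classes). The function names `normalized_frequencies`, `oversampling_rates`, and `oversample` are hypothetical.

```python
import random
from collections import Counter


def normalized_frequencies(labels):
    """Return each label's share of the dataset (values sum to 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def oversampling_rates(freqs):
    """ASSUMED formula: rate = largest class frequency / this class's frequency."""
    top = max(freqs.values())
    return {label: top / freq for label, freq in freqs.items()}


def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until each class reaches its target size."""
    rng = random.Random(seed)
    rates = oversampling_rates(normalized_frequencies(labels))

    # Group the samples by class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        target = round(len(items) * rates[label])
        # Keep every original sample, then draw the extras with replacement.
        extras = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extras)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

Because the extras are exact duplicates, oversampling of this kind should be applied to the training split only, after the validation and test splits have been set aside.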

Summarization

Objective: Summarize legal documents by retaining essential legal entities and reducing information redundancy.

Approach:

  1. Extractive Step: An NER model tailored for legal documents selects the sentences that contain named entities, producing an initial summary.
  2. Abstractive Step: A fine-tuned Longformer Encoder-Decoder (LED) model refines the extractive summary to create a concise output.
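
The control flow of this two-step approach can be sketched as plain function composition. The sketch below makes several assumptions: the NER model and the LED model are passed in as callables (`find_entities` and `refine`, both hypothetical names), the sentence splitter is a naive regex that would mis-split legal abbreviations, and `toy_find_entities` is a stand-in pattern matcher rather than a real NER model.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text: str, find_entities: Callable[[str], list]) -> str:
    """Keep only the sentences in which the entity finder reports at least one entity."""
    kept = [s for s in split_sentences(text) if find_entities(s)]
    return " ".join(kept)


def hybrid_summary(text: str,
                   find_entities: Callable[[str], list],
                   refine: Callable[[str], str]) -> str:
    """Extractive draft first, then an abstractive model rewrites the draft."""
    draft = extractive_summary(text, find_entities)
    return refine(draft)


def toy_find_entities(sentence: str) -> list:
    """Stand-in for a legal NER model: matches a few hard-coded patterns."""
    return re.findall(r"Section \d+|Supreme Court|High Court", sentence)


document = (
    "The appellant filed a petition before the High Court. "
    "The weather was pleasant that day. "
    "The court relied on Section 302 of the Code."
)
draft = extractive_summary(document, toy_find_entities)
```

Here `draft` keeps the first and third sentences and drops the second, which mentions no entity. In the real pipeline `refine` would wrap the fine-tuned LED model.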

Model Architecture

  • The extractive component utilizes a legal-specific Named Entity Recognition (NER) model.
  • The abstractive component uses a pre-trained LED model from Hugging Face, further fine-tuned on legal data.
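
For orientation, this is roughly how an LED checkpoint is run for summarization with the Transformers library. It is a sketch under stated assumptions: `allenai/led-base-16384` is the public base checkpoint used here as a stand-in, since this section does not name the fine-tuned weights; the PyTorch classes are assumed because `torch` appears in the requirements; and the generation settings (`num_beams`, `max_summary_tokens`) are illustrative values, not the ones used in this project.

```python
def abstractive_refine(draft: str,
                       checkpoint: str = "allenai/led-base-16384",
                       max_summary_tokens: int = 256) -> str:
    """Rewrite an extractive draft with an LED model (stand-in checkpoint by default)."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import torch
    from transformers import LEDForConditionalGeneration, LEDTokenizer

    tokenizer = LEDTokenizer.from_pretrained(checkpoint)
    model = LEDForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()

    inputs = tokenizer(draft, return_tensors="pt", truncation=True, max_length=16384)

    # LED attends locally by default; for summarization the usual convention
    # is to give the first token global attention.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            max_length=max_summary_tokens,  # illustrative value
            num_beams=4,                    # illustrative value
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling the function downloads the checkpoint on first use. A base checkpoint that has not been fine-tuned for summarization will not produce useful legal summaries, so the default is only a placeholder for the project's own weights.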

Evaluation

Each component was evaluated with ROUGE metrics. The hybrid approach performed best, producing summaries that were both more informative and more concise.
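
To make the metric concrete, here is a simplified ROUGE-N computed as clipped n-gram overlap between a reference summary and a generated one. It is illustrative only: it lowercases and splits on whitespace with no stemming, so its numbers will not match the `rouge_score` package listed under Requirements, which is presumably what the reported evaluation used. The names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> dict:
    """Simplified ROUGE-N: precision, recall and F1 over clipped n-gram overlap."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)

    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & cand).values())

    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the court dismissed the appeal", "the court dismissed it")
```

In this example three of the four candidate unigrams appear in the five-word reference, giving a precision of 0.75 and a recall of 0.6. Recall rewards covering the reference content, while precision penalizes padding.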


Requirements

  • Python 3.7 or later
  • TensorFlow
  • Keras
  • Hugging Face Transformers
  • torch, rouge_score, and the other packages listed in requirements.txt

Install the dependencies with:

pip install -r requirements.txt