vartika0302 committed on
Commit 8f4a777 · verified · 1 Parent(s): 49bfb59

Update README.md

Files changed (1)
  1. README.md +87 -3
README.md CHANGED
@@ -1,3 +1,87 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ ---
+
+ # Legal Document Processing Models
+
+ This repository contains two models for processing legal documents:
+ 1. **Rhetorical Role Segmentation** - A model designed to handle class imbalance using a class-weighted oversampling technique.
+ 2. **Summarization** - A hybrid model that combines extractive and abstractive summarization to retain key information in legal texts.
+
+ Both models are saved in `.h5` format and can be loaded and used for inference as detailed below.
+
+ ---
+
+ ## Table of Contents
+
+ 1. [Project Overview](#project-overview)
+ 2. [Model Details](#model-details)
+    - [Rhetorical Role Segmentation](#rhetorical-role-segmentation)
+    - [Summarization](#summarization)
+ 3. [Requirements](#requirements)
+ 4. [Setup and Installation](#setup-and-installation)
+ 5. [Model Usage](#model-usage)
+ 6. [Evaluation Metrics](#evaluation-metrics)
+ 7. [License](#license)
+
+ ---
+
+ ## Project Overview
+
+ This project addresses key challenges in processing legal documents, specifically:
+ - **Rhetorical Role Segmentation**: Tackling class imbalance in datasets to enhance the model's ability to detect minority rhetorical roles in legal texts.
+ - **Summarization**: Generating high-quality summaries using a hybrid extractive-abstractive approach to capture crucial legal entities and condense information.
+
+ ---
+
+ ## Model Details
+
+ ### Rhetorical Role Segmentation
+
+ **Objective**: Segment legal text based on rhetorical roles (e.g., Facts, Arguments, Rulings).
+
+ **Challenge**: Imbalanced class labels, with majority classes overshadowing minority classes.
+
+ **Solution**: A class-weighted oversampling method that calculates an oversampling rate from the frequency distribution of class labels, ensuring adequate representation of minority classes.
+
+ #### Model Architecture
+
+ The model is trained to distinguish 13 class labels in legal text and uses class-weighted oversampling to reduce bias toward the majority classes.
+
+ #### Data Augmentation
+
+ - A normalized frequency distribution of class labels was created, as visualized in `Figure 5: Unbalanced Dataset`.
+ - Using this distribution, we applied a class-weighted oversampling rate to increase minority class samples proportionally.
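
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the README does not give the exact rate formula, so the sketch assumes each class's rate is the largest normalized frequency divided by that class's frequency. The function names `oversampling_rates` and `oversample` are hypothetical and not part of the repository.

```python
import random
from collections import Counter


def oversampling_rates(labels):
    """Return a per-class oversampling rate from normalized label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    freq = {c: n / total for c, n in counts.items()}  # normalized distribution
    max_freq = max(freq.values())
    # Assumed weighting (not from the README): rarer classes get higher rates,
    # and the most frequent class gets a rate of exactly 1.
    return {c: max_freq / f for c, f in freq.items()}


def oversample(samples, labels, seed=0):
    """Resample minority classes with replacement according to their rate."""
    rng = random.Random(seed)
    rates = oversampling_rates(labels)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    out_x, out_y = [], []
    for y, xs in by_class.items():
        target = round(len(xs) * rates[y])  # desired class size after resampling
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y
```

With this particular weighting every class ends up roughly as large as the majority class; a softer rate (for example, the square root of the ratio) would increase minority classes without fully balancing them.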
+
+ ---
+
+ ### Summarization
+
+ **Objective**: Summarize legal documents by retaining essential legal entities and reducing information redundancy.
+
+ **Approach**:
+ 1. **Extractive Step**: An NER model tailored for legal documents extracts sentences with named entities, producing an initial summary.
+ 2. **Abstractive Step**: A fine-tuned Longformer Encoder-Decoder (LED) model refines the extractive summary to create a concise output.
+
+ #### Model Architecture
+
+ - The extractive component uses a legal-specific Named Entity Recognition (NER) model.
+ - The abstractive component uses a pre-trained LED model from Hugging Face, further fine-tuned on legal data.
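
As a rough illustration of the extractive step, the sketch below keeps only the sentences in which an entity finder reports at least one entity. The legal NER model itself is not shown in this README, so `toy_entity_finder` is a hypothetical regex stand-in, and `split_sentences` and `extractive_summary` are likewise illustrative names rather than the repository's code.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_entity_finder(sentence: str) -> List[str]:
    """Hypothetical stand-in for a legal NER model (statute and court mentions only)."""
    return re.findall(r"Section \d+|Supreme Court|High Court", sentence)


def extractive_summary(
    text: str,
    find_entities: Callable[[str], List[str]] = toy_entity_finder,
) -> str:
    """Keep sentences with at least one detected entity, in their original order."""
    kept = [s for s in split_sentences(text) if find_entities(s)]
    return " ".join(kept)
```

In the pipeline described above, the string returned by `extractive_summary` would then be passed to the fine-tuned LED model for the abstractive step, which is not sketched here.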
+
+ #### Evaluation
+
+ Each component was evaluated with ROUGE metrics; the hybrid approach performed best in terms of informativeness and conciseness.
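
For readers unfamiliar with ROUGE, the following is a simplified ROUGE-1 sketch based on unigram overlap. It is an assumption-laden illustration: it lowercases and splits on whitespace, applies no stemming, and is not the scoring code used to produce the results described above. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: clipped unigram overlap between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # each token counted at most min(ref, cand) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall rewards a summary for covering the reference (informativeness), while precision penalizes extra material (conciseness), which is why both are usually reported alongside the F1 score.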
+
+ ---
+
+ ## Requirements
+
+ - Python 3.7 or later
+ - TensorFlow
+ - Keras
+ - Hugging Face Transformers
+ - `torch`, `rouge_score`, and other standard libraries
+
+ Install the dependencies with:
+ ```bash
+ pip install -r requirements.txt