Akshan Krithick committed · Commit 02c74cb · verified · 1 Parent(s): 2ad251b

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,149 @@
+
+ # TermsConditioned – RoBERTa large LEDGAR LoRA
+
+ This repository contains a LoRA adapter and classification head for a RoBERTa large encoder, fine-tuned on the LEDGAR subset of LexGLUE (100 contract clause families). The model is meant as a **clause-level triage engine** for Terms & Conditions paragraphs.
+
+ The goal is *intake triage*, not full contract review. The model supports:
+
+ - 100-way clause family classification
+ - A hand-picked **risk bucket** of especially important families (e.g. Arbitration, Indemnity, Modifications, Governing Laws, Waivers)
+ - Calibration and a **global operating point** chosen to cap false green-lights on those risky families
+
+ ## Base model
+
+ - Base encoder: `roberta-large`
+ - Adapter type: LoRA (PEFT)
+
+ ## Data
+
+ - Dataset: `coastalcph/lex_glue`, subset `ledgar`
+ - Number of labels: **100**
+ - Risk bucket (families treated as especially high-impact if misclassified):
+
+   `Amendments, Arbitration, Consent To Jurisdiction, Governing Laws, Indemnifications, Indemnity, Jurisdictions, Modifications, Remedies, Submission To Jurisdiction, Waiver Of Jury Trials, Waivers`
+
+ ## Training (high-level)
+
+ - Start from `roberta-large`
+ - Add classification head with 100 outputs
+ - Train with:
+   - Class-weighted loss (inverse-frequency weighting per label)
+   - Label smoothing (ϵ = 0.1)
+   - AdamW-style optimizer, with the base weights loaded in 8-bit (bitsandbytes)
+   - 5 epochs on LEDGAR train split
+   - bf16 on GPU
+
+ The LoRA adapter and classifier are the only trainable parts; the base encoder stays in 8-bit.
+
+ ## Validation metrics (LEDGAR validation split)
+
+ - Accuracy: **0.8152**
+ - Macro F1: ~0.74 (computed separately)
+ - ECE (uncalibrated): **0.115**
+ - ECE (after temperature scaling): **0.022**
+
+ Temperature scaling is fit on held-out validation logits.
+
+ ## Risk policy and operating point
+
+ A subset of families is treated as **risky** (false green-lights here are especially costly). For a clause whose true family is in the risk bucket, a false green-light means:
+
+ 1. The paragraph is *kept* by the triage policy (not abstained), and
+ 2. The predicted family is *not* in the risk bucket.
+
+ We sweep a global threshold τ on the maximum calibrated probability and choose τ* to enforce a cap on the false green rate among risky clauses.
+
+ - Temperature T*: **0.80**
+ - Threshold τ*: **0.68**
+ - Keep rate at τ* (all clauses): **0.725**
+ - False green rate at τ* (risky clauses only): **0.0495**
+ - Recall of risky clauses within the kept set at τ*: **0.950**
+
+ ## Governance slices
+
+ During evaluation, we compute slices by:
+
+ - Clause family (`true_name`)
+ - Length buckets (rough token-count quartiles)
+ - Character count and characters-per-token quartiles
+ - Phrase flags such as:
+   - `binding arbitration`
+   - `sole discretion`
+   - `to the maximum extent permitted by law`
+   - `including but not limited to`
+   - venue / jurisdiction phrases
+
+ Each slice tracks:
+
+ - `n` (number of paragraphs in slice)
+ - `kept_rate`
+ - `false_green_rate` (for risky clauses in that slice)
+ - `avg_conf` (average max calibrated probability)
+ - `harm_score` ≈ expected number of false green-lights in that slice
+
+ These tables are not shipped here, but can be recomputed from logits and the same flags.
+
+ ## Intended use
+
+ This model is intended as a **research and prototyping** component for:
+
+ - Clause family classification on LEDGAR
+ - Intake triage experiments on ToS / boilerplate contracts
+ - Governance-style audits of clause-level models
+
+ It does **not** replace a lawyer or compliance team. Use it only with human review.
+
+ ## How to load (Hub)
+
+ ```python
+ import json
+
+ import torch
+ from huggingface_hub import hf_hub_download
+ from peft import PeftModel
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ MODEL_ID = "snickerszz/termsconditioned-roberta-large-ledgar-lora"
+ BASE_MODEL = "roberta-large"
+
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+ # Load base encoder with a 100-way classification head
+ base_model = AutoModelForSequenceClassification.from_pretrained(
+     BASE_MODEL,
+     num_labels=100,
+ )
+
+ # Attach LoRA adapter (the adapter also stores the fine-tuned classifier head)
+ model = PeftModel.from_pretrained(base_model, MODEL_ID)
+ model.eval()
+
+ # Load meta for calibration + policy (file shipped in this repo)
+ meta_path = hf_hub_download(repo_id=MODEL_ID, filename="termsconditioned_meta.json")
+ with open(meta_path) as f:
+     meta = json.load(f)
+
+ T_star = meta["T_star"]
+ tau_star = meta["tau_star"]
+ id2label = {int(k): v for k, v in meta["id2label"].items()}
+ risky_families = set(meta["risky_families"])
+
+ def classify_paragraph(text: str):
+     enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=384)
+     with torch.no_grad():
+         out = model(**enc)
+     # Temperature-scale the logits before the softmax
+     logits = out.logits / T_star
+     probs = torch.softmax(logits, dim=-1)[0]
+     conf, pred_id = probs.max(dim=-1)
+     pred_id = int(pred_id)
+     family = id2label[pred_id]
+     conf = float(conf)
+
+     # Abstain below the threshold; otherwise flag predictions in the risk bucket
+     if conf < tau_star:
+         action = "Needs review"
+     else:
+         action = "Flag as risky" if family in risky_families else "Green-light"
+
+     return {"family": family, "confidence": conf, "action": action}
+ ```
+
+ ## Limitations
+
+ - Trained only on LEDGAR; generalization outside that domain is not guaranteed.
+ - Clause-level only; no document-wide reasoning.
+ - Calibration and policy thresholds are specific to this validation split; re-calibrate if you shift domains or data distributions.
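Reviewer sketch (not part of the commit): the "Risk policy and operating point" section of the README describes sweeping a global threshold τ and picking τ* so that the false green rate among risky clauses stays under a cap. A minimal way to express that sweep is shown below. The 5% cap, the 0.01 grid step, and the function names `false_green_rate` and `choose_tau` are assumptions made for illustration; the README does not state the exact cap or grid that was used.

```python
def false_green_rate(conf, pred_risky, true_risky, tau):
    """Share of truly risky clauses that are kept (conf >= tau) and predicted non-risky."""
    risky_total = sum(1 for t in true_risky if t)
    if risky_total == 0:
        return 0.0
    false_green = sum(
        1
        for c, p, t in zip(conf, pred_risky, true_risky)
        if t and c >= tau and not p
    )
    return false_green / risky_total


def choose_tau(conf, pred_risky, true_risky, cap=0.05, grid=None):
    """Return (tau, keep_rate) for the smallest grid tau whose false green rate is <= cap.

    Raising tau can only remove clauses from the kept set, so the false green
    rate never increases with tau; the smallest admissible tau therefore keeps
    the most clauses. Returns None if no grid value satisfies the cap.
    """
    if grid is None:
        grid = [i / 100 for i in range(100)]  # hypothetical 0.01-step grid
    for tau in grid:
        if false_green_rate(conf, pred_risky, true_risky, tau) <= cap:
            keep_rate = sum(1 for c in conf if c >= tau) / len(conf)
            return tau, keep_rate
    return None


# Toy data: max calibrated probability, whether the predicted family is in the
# risk bucket, and whether the true family is in the risk bucket.
conf = [0.9, 0.6, 0.4, 0.95]
pred_risky = [True, False, False, False]
true_risky = [True, True, True, False]
print(choose_tau(conf, pred_risky, true_risky))
```

On real data, `conf` would be the maximum temperature-scaled probability per validation paragraph, and the two boolean lists would come from looking up predicted and true families in `risky_families`.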
adapter_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": {
+     "base_model_class": "RobertaForSequenceClassification",
+     "parent_library": "transformers.models.roberta.modeling_roberta"
+   },
+   "base_model_name_or_path": "roberta-large",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": [
+     "classifier"
+   ],
+   "peft_type": "LORA",
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "output.dense",
+     "value",
+     "intermediate.dense",
+     "query",
+     "key"
+   ],
+   "task_type": null,
+   "use_dora": false,
+   "use_rslora": false
+   }
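Reviewer sketch (not part of the commit): the config above fixes the LoRA rank (`r = 16`), `lora_alpha = 32`, and the targeted modules, which is enough to estimate how many LoRA weights the adapter adds. The arithmetic below assumes the standard `roberta-large` dimensions (hidden size 1024, feed-forward size 4096, 24 layers) and assumes that the `output.dense` target matches both the attention output projection and the feed-forward output projection in each layer, since PEFT matches list-style targets by module-name suffix. The classifier head saved through `modules_to_save` is not counted here.

```python
# Values taken from adapter_config.json.
R = 16
LORA_ALPHA = 32

# roberta-large dimensions (assumed from the published architecture).
HIDDEN = 1024
INTERMEDIATE = 4096
NUM_LAYERS = 24


def lora_params(in_features, out_features, r=R):
    """LoRA adds A (r x in_features) and B (out_features x r) to one linear layer."""
    return r * (in_features + out_features)


# (in_features, out_features) of the linear layers matched in each encoder layer.
per_layer_shapes = {
    "attention.self.query": (HIDDEN, HIDDEN),
    "attention.self.key": (HIDDEN, HIDDEN),
    "attention.self.value": (HIDDEN, HIDDEN),
    "attention.output.dense": (HIDDEN, HIDDEN),
    "intermediate.dense": (HIDDEN, INTERMEDIATE),
    "output.dense": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o) for i, o in per_layer_shapes.values())
total_lora = per_layer * NUM_LAYERS

# With use_rslora = false, the low-rank update is scaled by lora_alpha / r.
scaling = LORA_ALPHA / R

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total_lora:,}")
print(f"LoRA scaling factor:       {scaling}")
```

Under these assumptions the adapter adds roughly 7.1 million LoRA weights, a small fraction of the base encoder.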
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7156f28bcdf34dac6461166a1c3328e900c16777d50caa690b39b9ee22a966ac
+ size 30658032
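Reviewer sketch (not part of the commit): the three lines above are a Git LFS pointer, where `oid` is the SHA-256 of the real file contents and `size` is its length in bytes. A downloaded copy of `adapter_model.safetensors` could be checked against the pointer roughly as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local file path is whatever location the file was downloaded to.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    p = Path(path)
    if p.stat().st_size != pointer["size"]:
        return False
    h = hashlib.new(pointer["algo"])
    with p.open("rb") as f:
        # Hash in chunks so large weight files are not read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == pointer["digest"]


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7156f28bcdf34dac6461166a1c3328e900c16777d50caa690b39b9ee22a966ac
size 30658032
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("adapter_model.safetensors", pointer)` on a local copy would then confirm that the download is complete and unmodified.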
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
termsconditioned_meta.json ADDED
@@ -0,0 +1,233 @@
+ {
+   "base_model": "roberta-large",
+   "num_labels": 100,
+   "id2label": {
+     "0": "Adjustments",
+     "1": "Agreements",
+     "2": "Amendments",
+     "3": "Anti-Corruption Laws",
+     "4": "Applicable Laws",
+     "5": "Approvals",
+     "6": "Arbitration",
+     "7": "Assignments",
+     "8": "Assigns",
+     "9": "Authority",
+     "10": "Authorizations",
+     "11": "Base Salary",
+     "12": "Benefits",
+     "13": "Binding Effects",
+     "14": "Books",
+     "15": "Brokers",
+     "16": "Capitalization",
+     "17": "Change In Control",
+     "18": "Closings",
+     "19": "Compliance With Laws",
+     "20": "Confidentiality",
+     "21": "Consent To Jurisdiction",
+     "22": "Consents",
+     "23": "Construction",
+     "24": "Cooperation",
+     "25": "Costs",
+     "26": "Counterparts",
+     "27": "Death",
+     "28": "Defined Terms",
+     "29": "Definitions",
+     "30": "Disability",
+     "31": "Disclosures",
+     "32": "Duties",
+     "33": "Effective Dates",
+     "34": "Effectiveness",
+     "35": "Employment",
+     "36": "Enforceability",
+     "37": "Enforcements",
+     "38": "Entire Agreements",
+     "39": "Erisa",
+     "40": "Existence",
+     "41": "Expenses",
+     "42": "Fees",
+     "43": "Financial Statements",
+     "44": "Forfeitures",
+     "45": "Further Assurances",
+     "46": "General",
+     "47": "Governing Laws",
+     "48": "Headings",
+     "49": "Indemnifications",
+     "50": "Indemnity",
+     "51": "Insurances",
+     "52": "Integration",
+     "53": "Intellectual Property",
+     "54": "Interests",
+     "55": "Interpretations",
+     "56": "Jurisdictions",
+     "57": "Liens",
+     "58": "Litigations",
+     "59": "Miscellaneous",
+     "60": "Modifications",
+     "61": "No Conflicts",
+     "62": "No Defaults",
+     "63": "No Waivers",
+     "64": "Non-Disparagement",
+     "65": "Notices",
+     "66": "Organizations",
+     "67": "Participations",
+     "68": "Payments",
+     "69": "Positions",
+     "70": "Powers",
+     "71": "Publicity",
+     "72": "Qualifications",
+     "73": "Records",
+     "74": "Releases",
+     "75": "Remedies",
+     "76": "Representations",
+     "77": "Sales",
+     "78": "Sanctions",
+     "79": "Severability",
+     "80": "Solvency",
+     "81": "Specific Performance",
+     "82": "Submission To Jurisdiction",
+     "83": "Subsidiaries",
+     "84": "Successors",
+     "85": "Survival",
+     "86": "Tax Withholdings",
+     "87": "Taxes",
+     "88": "Terminations",
+     "89": "Terms",
+     "90": "Titles",
+     "91": "Transactions With Affiliates",
+     "92": "Use Of Proceeds",
+     "93": "Vacations",
+     "94": "Venues",
+     "95": "Vesting",
+     "96": "Waiver Of Jury Trials",
+     "97": "Waivers",
+     "98": "Warranties",
+     "99": "Withholdings"
+   },
+   "label2id": {
+     "Adjustments": 0,
+     "Agreements": 1,
+     "Amendments": 2,
+     "Anti-Corruption Laws": 3,
+     "Applicable Laws": 4,
+     "Approvals": 5,
+     "Arbitration": 6,
+     "Assignments": 7,
+     "Assigns": 8,
+     "Authority": 9,
+     "Authorizations": 10,
+     "Base Salary": 11,
+     "Benefits": 12,
+     "Binding Effects": 13,
+     "Books": 14,
+     "Brokers": 15,
+     "Capitalization": 16,
+     "Change In Control": 17,
+     "Closings": 18,
+     "Compliance With Laws": 19,
+     "Confidentiality": 20,
+     "Consent To Jurisdiction": 21,
+     "Consents": 22,
+     "Construction": 23,
+     "Cooperation": 24,
+     "Costs": 25,
+     "Counterparts": 26,
+     "Death": 27,
+     "Defined Terms": 28,
+     "Definitions": 29,
+     "Disability": 30,
+     "Disclosures": 31,
+     "Duties": 32,
+     "Effective Dates": 33,
+     "Effectiveness": 34,
+     "Employment": 35,
+     "Enforceability": 36,
+     "Enforcements": 37,
+     "Entire Agreements": 38,
+     "Erisa": 39,
+     "Existence": 40,
+     "Expenses": 41,
+     "Fees": 42,
+     "Financial Statements": 43,
+     "Forfeitures": 44,
+     "Further Assurances": 45,
+     "General": 46,
+     "Governing Laws": 47,
+     "Headings": 48,
+     "Indemnifications": 49,
+     "Indemnity": 50,
+     "Insurances": 51,
+     "Integration": 52,
+     "Intellectual Property": 53,
+     "Interests": 54,
+     "Interpretations": 55,
+     "Jurisdictions": 56,
+     "Liens": 57,
+     "Litigations": 58,
+     "Miscellaneous": 59,
+     "Modifications": 60,
+     "No Conflicts": 61,
+     "No Defaults": 62,
+     "No Waivers": 63,
+     "Non-Disparagement": 64,
+     "Notices": 65,
+     "Organizations": 66,
+     "Participations": 67,
+     "Payments": 68,
+     "Positions": 69,
+     "Powers": 70,
+     "Publicity": 71,
+     "Qualifications": 72,
+     "Records": 73,
+     "Releases": 74,
+     "Remedies": 75,
+     "Representations": 76,
+     "Sales": 77,
+     "Sanctions": 78,
+     "Severability": 79,
+     "Solvency": 80,
+     "Specific Performance": 81,
+     "Submission To Jurisdiction": 82,
+     "Subsidiaries": 83,
+     "Successors": 84,
+     "Survival": 85,
+     "Tax Withholdings": 86,
+     "Taxes": 87,
+     "Terminations": 88,
+     "Terms": 89,
+     "Titles": 90,
+     "Transactions With Affiliates": 91,
+     "Use Of Proceeds": 92,
+     "Vacations": 93,
+     "Venues": 94,
+     "Vesting": 95,
+     "Waiver Of Jury Trials": 96,
+     "Waivers": 97,
+     "Warranties": 98,
+     "Withholdings": 99
+   },
+   "risky_families": [
+     "Amendments",
+     "Arbitration",
+     "Consent To Jurisdiction",
+     "Governing Laws",
+     "Indemnifications",
+     "Indemnity",
+     "Jurisdictions",
+     "Modifications",
+     "Remedies",
+     "Submission To Jurisdiction",
+     "Waiver Of Jury Trials",
+     "Waivers"
+   ],
+   "T_star": 0.8,
+   "tau_star": 0.6799999999999999,
+   "val_acc_raw": 0.8152,
+   "ECE_raw": 0.1154585200600326,
+   "ECE_cal": 0.0220565241061151,
+   "false_green_rate_at_tau": 0.04952076677316294,
+   "keep_rate_at_tau": 0.7251,
+   "risky_recall_kept_at_tau": 0.950479233226837,
+   "dataset": "coastalcph/lex_glue: ledgar",
+   "mode": "Encoder+LoRA 8-bit",
+   "hf_model_id": "snickerszz/termsconditioned-roberta-large-ledgar-lora"
+ }
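Reviewer sketch (not part of the commit): the meta file records a fitted temperature (`T_star`) together with ECE before and after calibration (`ECE_raw`, `ECE_cal`). One plausible way such numbers are produced is to pick the temperature that minimises negative log-likelihood on validation logits and then compute a binned expected calibration error. The grid search, the 15 equal-width bins, and all function names below are assumptions; the commit does not show the actual calibration code, which may have used a different optimiser or binning.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fit_temperature(logits, labels, grid):
    """Pick the grid temperature with the lowest mean negative log-likelihood."""
    def nll(t):
        losses = [-math.log(softmax(row, t)[y]) for row, y in zip(logits, labels)]
        return sum(losses) / len(losses)
    return min(grid, key=nll)


def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: bin-weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for c, ok in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += c
        hit_sums[b] += int(ok)
    return sum(
        (counts[b] / n) * abs(hit_sums[b] / counts[b] - conf_sums[b] / counts[b])
        for b in range(n_bins)
        if counts[b]
    )


# Toy example: an under-confident model benefits from a temperature below 1,
# which sharpens the softmax, in the same direction as T_star = 0.8.
toy_logits = [[1.0, 0.0]] * 10
toy_labels = [0] * 10
print(fit_temperature(toy_logits, toy_labels, grid=[0.5, 1.0, 2.0]))
```

A temperature below 1 is consistent with training under label smoothing, which tends to leave a model under-confident before calibration.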
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff