Upload folder using huggingface_hub

Files changed (9)

README.md +149 -0
adapter_config.json +37 -0
adapter_model.safetensors +3 -0
merges.txt +0 -0
special_tokens_map.json +15 -0
termsconditioned_meta.json +233 -0
tokenizer.json +0 -0
tokenizer_config.json +57 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,149 @@

+# TermsConditioned – RoBERTa large LEDGAR LoRA
+This repository contains a RoBERTa large encoder with a LoRA adapter fine-tuned on the LEDGAR subset of LexGLUE (100 contract clause families). The model is meant as a **clause-level triage engine** for Terms & Conditions paragraphs.
+The goal is *intake triage*, not full contract review. The model supports:
+- 100-way clause family classification
+- A hand-picked **risk bucket** of especially important families (e.g. Arbitration, Indemnity, Modifications, Governing Laws, Waivers)
+- Calibration and a **global operating point** chosen to cap false green-lights on those risky families
+## Base model
+- Base encoder: __roberta-large__
+- Adapter type: LoRA (PEFT)
+## Data
+- Dataset: `coastalcph/lex_glue`, subset `ledgar`
+- Number of labels: __100__
+- Risk bucket (families treated as especially high-impact if misclassified):
+`Amendments, Arbitration, Consent To Jurisdiction, Governing Laws, Indemnifications, Indemnity, Jurisdictions, Modifications, Remedies, Submission To Jurisdiction, Waiver Of Jury Trials, Waivers`
+## Training (high-level)
+- Start from `roberta-large`
+- Add classification head with 100 outputs
+- Train with:
+  - Class-weighted loss (inverse-frequency weighting per label)
+  - Label smoothing (ϵ = 0.1)
+  - AdamW-style optimizer on 8-bit base weights (bitsandbytes)
+  - 5 epochs on LEDGAR train split
+  - bf16 on GPU
+The LoRA adapter and classifier are the only trainable parts; the base encoder stays in 8-bit.
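The training loss itself is not included in this repository. As a rough illustration of the two loss ingredients listed above, here is a minimal NumPy sketch that combines inverse-frequency class weights with label smoothing. The function names, the weight normalization, and the choice to weight each example by the weight of its true class are assumptions made for illustration, not the author's training code.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class weights proportional to 1 / class frequency (assumed normalization)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # avoid division by zero for unseen classes
    return len(labels) / (num_classes * counts)

def smoothed_weighted_ce(logits, labels, weights, eps=0.1):
    """Cross-entropy against smoothed targets, averaged with true-class weights."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    k = logits.shape[1]
    # Smoothed target: eps spread uniformly over all classes, 1 - eps on the true class
    target = np.full_like(log_probs, eps / k)
    target[np.arange(len(labels)), labels] += 1.0 - eps
    per_example = -(target * log_probs).sum(axis=1)
    w = weights[labels]
    return float((w * per_example).sum() / w.sum())
```

With `eps = 0.1` this matches the smoothing value stated above; frameworks differ in how they combine class weights with smoothing, so results from a specific trainer may not match this sketch exactly.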
+## Validation metrics (LEDGAR validation split)
+- Accuracy: __0.8152__
+- Macro F1: ~0.74 (computed separately)
+- ECE (uncalibrated): __0.115__
+- ECE (after temperature scaling): __0.022__
+Temperature scaling is done on held-out validation logits.
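The calibration code is not part of this upload. The sketch below shows one common way to fit a temperature and measure ECE from saved logits, using NumPy only. The grid search over T, the negative log-likelihood objective, and the 15 equal-width confidence bins are assumptions; the source does not say how T* or ECE were actually computed.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits, T)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes NLL (assumed procedure)."""
    if grid is None:
        grid = np.arange(0.5, 3.01, 0.05)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin-weighted gap between accuracy and mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)
```

A typical use would be `T = fit_temperature(val_logits, val_labels)` followed by `expected_calibration_error(softmax(val_logits, T), val_labels)`, where `val_logits` and `val_labels` are hypothetical arrays of validation logits and integer labels.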
+## Risk policy and operating point
+A subset of families is treated as **risky** (false green-lights here are especially costly). For those families, a false green-light means:
+1. The paragraph is *kept* by the triage policy (not abstained), and
+2. The predicted family is *not* in the risk bucket.
+We sweep a global threshold τ on the maximum calibrated probability and choose τ* to enforce a cap on the false green rate among risky clauses.
+- Temperature T*: __0.80__
+- Threshold τ*: __0.68__
+- Keep rate at τ* (all clauses): __0.725__
+- False green rate at τ* (risky clauses only): __0.0495__
+- Recall of risky clauses within the kept set at τ*: __0.950__
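The sweep described above could look roughly like the following sketch. It assumes the false green rate is taken over all truly risky clauses, that the grid runs from 0 to 1 in steps of 0.01, and that τ* is the smallest threshold meeting the cap (which keeps the most clauses). The `cap` argument and the function names are hypothetical; the source states neither the cap value nor the selection rule.

```python
import numpy as np

def false_green_rate(conf, pred_risky, true_risky, tau):
    """Share of truly risky clauses that are kept and predicted as non-risky."""
    conf = np.asarray(conf, dtype=float)
    pred_risky = np.asarray(pred_risky, dtype=bool)
    true_risky = np.asarray(true_risky, dtype=bool)
    if not true_risky.any():
        return 0.0
    kept = conf >= tau
    false_green = kept & ~pred_risky & true_risky
    return float(false_green.sum() / true_risky.sum())

def choose_threshold(conf, pred_risky, true_risky, cap, grid=None):
    """Return the smallest tau on the grid whose false green rate is within the cap."""
    conf = np.asarray(conf, dtype=float)
    if grid is None:
        grid = np.round(np.arange(0.0, 1.0001, 0.01), 2)
    for tau in grid:  # ascending order: the first hit has the highest keep rate
        fg = false_green_rate(conf, pred_risky, true_risky, tau)
        if fg <= cap:
            return {
                "tau": float(tau),
                "false_green_rate": fg,
                "keep_rate": float((conf >= tau).mean()),
            }
    return None  # no threshold on the grid satisfies the cap
```

Here `conf` is the maximum calibrated probability per paragraph, and `pred_risky` / `true_risky` mark whether the predicted and true families fall in the risk bucket.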
+## Governance slices
+During evaluation, we compute slices by:
+- Clause family (`true_name`)
+- Length buckets (rough token-count quartiles)
+- Character count and characters-per-token quartiles
+- Phrase flags such as:
+  - `binding arbitration`
+  - `sole discretion`
+  - `to the maximum extent permitted by law`
+  - `including but not limited to`
+  - venue / jurisdiction phrases
+Each slice tracks:
+- `n` (number of paragraphs in slice)
+- `kept_rate`
+- `false_green_rate` (for risky clauses in that slice)
+- `avg_conf` (average max calibrated probability)
+- `harm_score` ≈ expected number of false green-lights in that slice
+These tables are not shipped here, but can be recomputed from logits and the same flags.
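As an illustration of how such a slice table might be recomputed, the sketch below builds a phrase-flag mask and computes the per-slice fields listed above. The helper names are hypothetical, and `harm_score` is taken here as the number of false green-lights observed in the slice, which is only one plausible reading of "expected number".

```python
import numpy as np

def phrase_mask(texts, phrase):
    """Boolean mask of paragraphs containing the phrase (case-insensitive)."""
    return np.array([phrase.lower() in t.lower() for t in texts], dtype=bool)

def slice_metrics(mask, conf, pred_risky, true_risky, tau):
    """Compute n, kept_rate, false_green_rate, avg_conf and harm_score for one slice."""
    mask = np.asarray(mask, dtype=bool)
    conf = np.asarray(conf, dtype=float)[mask]
    pred_risky = np.asarray(pred_risky, dtype=bool)[mask]
    true_risky = np.asarray(true_risky, dtype=bool)[mask]
    n = int(mask.sum())
    if n == 0:
        return {"n": 0, "kept_rate": 0.0, "false_green_rate": 0.0,
                "avg_conf": 0.0, "harm_score": 0.0}
    kept = conf >= tau
    false_green = kept & true_risky & ~pred_risky
    n_risky = int(true_risky.sum())
    fg_rate = float(false_green.sum() / n_risky) if n_risky else 0.0
    return {
        "n": n,
        "kept_rate": float(kept.mean()),
        "false_green_rate": fg_rate,
        "avg_conf": float(conf.mean()),
        "harm_score": fg_rate * n_risky,  # assumed: count of false greens in the slice
    }
```

For example, `slice_metrics(phrase_mask(texts, "sole discretion"), conf, pred_risky, true_risky, tau=0.68)` would give the row for the `sole discretion` flag, given per-paragraph arrays from an evaluation run.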
+## Intended use
+This model is intended as a **research and prototyping** component for:
+- Clause family classification on LEDGAR
+- Intake triage experiments on ToS / boilerplate contracts
+- Governance-style audits of clause-level models
+It does **not** replace a lawyer or compliance team. Use it only with human review.
+## How to load (Hub)
+```python
+import json
+import torch
+from huggingface_hub import hf_hub_download
+from peft import PeftModel
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+MODEL_ID = "snickerszz/termsconditioned-roberta-large-ledgar-lora"
+BASE_MODEL = "roberta-large"
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+# Load the base encoder with a 100-way classification head
+base_model = AutoModelForSequenceClassification.from_pretrained(
+    BASE_MODEL,
+    num_labels=100,
+)
+# Attach the LoRA adapter (the saved classifier head is loaded with it)
+model = PeftModel.from_pretrained(base_model, MODEL_ID)
+model.eval()
+# Download the calibration and policy metadata shipped in this repo
+meta_path = hf_hub_download(repo_id=MODEL_ID, filename="termsconditioned_meta.json")
+with open(meta_path) as f:
+    meta = json.load(f)
+T_star = meta["T_star"]
+tau_star = meta["tau_star"]
+id2label = {int(k): v for k, v in meta["id2label"].items()}
+risky_families = set(meta["risky_families"])
+def classify_paragraph(text: str):
+    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=384)
+    with torch.no_grad():
+        out = model(**enc)
+    logits = out.logits / T_star
+    probs = torch.softmax(logits, dim=-1)[0]
+    conf, pred_id = probs.max(dim=-1)
+    pred_id = int(pred_id)
+    family = id2label[pred_id]
+    conf = float(conf)
+    if conf < tau_star:
+        action = "Needs review"
+    else:
+        action = "Flag as risky" if family in risky_families else "Green-light"
+    return {"family": family, "confidence": conf, "action": action}
+```
+## Limitations
+- Trained only on LEDGAR; generalization outside that domain is not guaranteed.
+- Clause-level only; no document-wide reasoning.
+- Calibration and policy thresholds are specific to this validation split; re-calibrate if you shift domains or data distributions.

adapter_config.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": {
+    "base_model_class": "RobertaForSequenceClassification",
+    "parent_library": "transformers.models.roberta.modeling_roberta"
+  },
+  "base_model_name_or_path": "roberta-large",
+  "bias": "none",
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": [
+    "classifier"
+  ],
+  "peft_type": "LORA",
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "output.dense",
+    "value",
+    "intermediate.dense",
+    "query",
+    "key"
+  ],
+  "task_type": null,
+  "use_dora": false,
+  "use_rslora": false
+}

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7156f28bcdf34dac6461166a1c3328e900c16777d50caa690b39b9ee22a966ac
+size 30658032

merges.txt ADDED

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

termsconditioned_meta.json ADDED

	@@ -0,0 +1,233 @@

+{
+  "base_model": "roberta-large",
+  "num_labels": 100,
+  "id2label": {
+    "0": "Adjustments",
+    "1": "Agreements",
+    "2": "Amendments",
+    "3": "Anti-Corruption Laws",
+    "4": "Applicable Laws",
+    "5": "Approvals",
+    "6": "Arbitration",
+    "7": "Assignments",
+    "8": "Assigns",
+    "9": "Authority",
+    "10": "Authorizations",
+    "11": "Base Salary",
+    "12": "Benefits",
+    "13": "Binding Effects",
+    "14": "Books",
+    "15": "Brokers",
+    "16": "Capitalization",
+    "17": "Change In Control",
+    "18": "Closings",
+    "19": "Compliance With Laws",
+    "20": "Confidentiality",
+    "21": "Consent To Jurisdiction",
+    "22": "Consents",
+    "23": "Construction",
+    "24": "Cooperation",
+    "25": "Costs",
+    "26": "Counterparts",
+    "27": "Death",
+    "28": "Defined Terms",
+    "29": "Definitions",
+    "30": "Disability",
+    "31": "Disclosures",
+    "32": "Duties",
+    "33": "Effective Dates",
+    "34": "Effectiveness",
+    "35": "Employment",
+    "36": "Enforceability",
+    "37": "Enforcements",
+    "38": "Entire Agreements",
+    "39": "Erisa",
+    "40": "Existence",
+    "41": "Expenses",
+    "42": "Fees",
+    "43": "Financial Statements",
+    "44": "Forfeitures",
+    "45": "Further Assurances",
+    "46": "General",
+    "47": "Governing Laws",
+    "48": "Headings",
+    "49": "Indemnifications",
+    "50": "Indemnity",
+    "51": "Insurances",
+    "52": "Integration",
+    "53": "Intellectual Property",
+    "54": "Interests",
+    "55": "Interpretations",
+    "56": "Jurisdictions",
+    "57": "Liens",
+    "58": "Litigations",
+    "59": "Miscellaneous",
+    "60": "Modifications",
+    "61": "No Conflicts",
+    "62": "No Defaults",
+    "63": "No Waivers",
+    "64": "Non-Disparagement",
+    "65": "Notices",
+    "66": "Organizations",
+    "67": "Participations",
+    "68": "Payments",
+    "69": "Positions",
+    "70": "Powers",
+    "71": "Publicity",
+    "72": "Qualifications",
+    "73": "Records",
+    "74": "Releases",
+    "75": "Remedies",
+    "76": "Representations",
+    "77": "Sales",
+    "78": "Sanctions",
+    "79": "Severability",
+    "80": "Solvency",
+    "81": "Specific Performance",
+    "82": "Submission To Jurisdiction",
+    "83": "Subsidiaries",
+    "84": "Successors",
+    "85": "Survival",
+    "86": "Tax Withholdings",
+    "87": "Taxes",
+    "88": "Terminations",
+    "89": "Terms",
+    "90": "Titles",
+    "91": "Transactions With Affiliates",
+    "92": "Use Of Proceeds",
+    "93": "Vacations",
+    "94": "Venues",
+    "95": "Vesting",
+    "96": "Waiver Of Jury Trials",
+    "97": "Waivers",
+    "98": "Warranties",
+    "99": "Withholdings"
+  },
+  "label2id": {
+    "Adjustments": 0,
+    "Agreements": 1,
+    "Amendments": 2,
+    "Anti-Corruption Laws": 3,
+    "Applicable Laws": 4,
+    "Approvals": 5,
+    "Arbitration": 6,
+    "Assignments": 7,
+    "Assigns": 8,
+    "Authority": 9,
+    "Authorizations": 10,
+    "Base Salary": 11,
+    "Benefits": 12,
+    "Binding Effects": 13,
+    "Books": 14,
+    "Brokers": 15,
+    "Capitalization": 16,
+    "Change In Control": 17,
+    "Closings": 18,
+    "Compliance With Laws": 19,
+    "Confidentiality": 20,
+    "Consent To Jurisdiction": 21,
+    "Consents": 22,
+    "Construction": 23,
+    "Cooperation": 24,
+    "Costs": 25,
+    "Counterparts": 26,
+    "Death": 27,
+    "Defined Terms": 28,
+    "Definitions": 29,
+    "Disability": 30,
+    "Disclosures": 31,
+    "Duties": 32,
+    "Effective Dates": 33,
+    "Effectiveness": 34,
+    "Employment": 35,
+    "Enforceability": 36,
+    "Enforcements": 37,
+    "Entire Agreements": 38,
+    "Erisa": 39,
+    "Existence": 40,
+    "Expenses": 41,
+    "Fees": 42,
+    "Financial Statements": 43,
+    "Forfeitures": 44,
+    "Further Assurances": 45,
+    "General": 46,
+    "Governing Laws": 47,
+    "Headings": 48,
+    "Indemnifications": 49,
+    "Indemnity": 50,
+    "Insurances": 51,
+    "Integration": 52,
+    "Intellectual Property": 53,
+    "Interests": 54,
+    "Interpretations": 55,
+    "Jurisdictions": 56,
+    "Liens": 57,
+    "Litigations": 58,
+    "Miscellaneous": 59,
+    "Modifications": 60,
+    "No Conflicts": 61,
+    "No Defaults": 62,
+    "No Waivers": 63,
+    "Non-Disparagement": 64,
+    "Notices": 65,
+    "Organizations": 66,
+    "Participations": 67,
+    "Payments": 68,
+    "Positions": 69,
+    "Powers": 70,
+    "Publicity": 71,
+    "Qualifications": 72,
+    "Records": 73,
+    "Releases": 74,
+    "Remedies": 75,
+    "Representations": 76,
+    "Sales": 77,
+    "Sanctions": 78,
+    "Severability": 79,
+    "Solvency": 80,
+    "Specific Performance": 81,
+    "Submission To Jurisdiction": 82,
+    "Subsidiaries": 83,
+    "Successors": 84,
+    "Survival": 85,
+    "Tax Withholdings": 86,
+    "Taxes": 87,
+    "Terminations": 88,
+    "Terms": 89,
+    "Titles": 90,
+    "Transactions With Affiliates": 91,
+    "Use Of Proceeds": 92,
+    "Vacations": 93,
+    "Venues": 94,
+    "Vesting": 95,
+    "Waiver Of Jury Trials": 96,
+    "Waivers": 97,
+    "Warranties": 98,
+    "Withholdings": 99
+  },
+  "risky_families": [
+    "Amendments",
+    "Arbitration",
+    "Consent To Jurisdiction",
+    "Governing Laws",
+    "Indemnifications",
+    "Indemnity",
+    "Jurisdictions",
+    "Modifications",
+    "Remedies",
+    "Submission To Jurisdiction",
+    "Waiver Of Jury Trials",
+    "Waivers"
+  ],
+  "T_star": 0.8,
+  "tau_star": 0.6799999999999999,
+  "val_acc_raw": 0.8152,
+  "ECE_raw": 0.1154585200600326,
+  "ECE_cal": 0.0220565241061151,
+  "false_green_rate_at_tau": 0.04952076677316294,
+  "keep_rate_at_tau": 0.7251,
+  "risky_recall_kept_at_tau": 0.950479233226837,
+  "dataset": "coastalcph/lex_glue: ledgar",
+  "mode": "Encoder+LoRA 8-bit",
+  "hf_model_id": "snickerszz/termsconditioned-roberta-large-ledgar-lora"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50264": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": "<unk>"
+}

vocab.json ADDED

The diff for this file is too large to render. See raw diff