sik247/lexpt

@@ -1,62 +1,221 @@
 ---
-base_model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
-library_name: peft
-model_name: outputs_lexpt
 tags:
-- base_model:adapter:unsloth/gpt-oss-20b-unsloth-bnb-4bit
-- lora
 - sft
-- transformers
-- trl
-- unsloth
-licence: license
 ---
-# Model Card for outputs_lexpt
-This model is a fine-tuned version of [unsloth/gpt-oss-20b-unsloth-bnb-4bit](https://huggingface.co/unsloth/gpt-oss-20b-unsloth-bnb-4bit).
-It has been trained using [TRL](https://github.com/huggingface/trl).
-## Quick start
 ```python
-from transformers import pipeline
-question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
-generator = pipeline("text-generation", model="None", device="cuda")
-output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
-print(output["generated_text"])
 ```
-## Training procedure
-This model was trained with SFT.
-### Framework versions
-- PEFT 0.17.0
-- TRL: 0.21.0
-- Transformers: 4.56.0.dev0
-- Pytorch: 2.8.0
-- Datasets: 3.6.0
-- Tokenizers: 0.21.4
-## Citations
-Cite TRL as:
-```bibtex
-@misc{vonwerra2022trl,
-	title        = {{TRL: Transformer Reinforcement Learning}},
-	author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-	year         = 2020,
-	journal      = {GitHub repository},
-	publisher    = {GitHub},
-	howpublished = {\url{https://github.com/huggingface/trl}}
-}
-```

 ---
+pretty_name: LEXPT Law SFT (CAP subset)
+dataset_name: lexpt-law-sft
 tags:
+- legal
+- law
+- caselaw
 - sft
+- lora
+- chatml
+- instruction-tuning
+task_categories:
+- text-generation
+- question-answering
+- summarization
+language:
+- en
+license: cc-by-4.0
+size_categories:
+- 10K<n<100K
+source_datasets:
+- common-pile/caselaw_access_project
+datasets:
+- common-pile/caselaw_access_project
+base_model:
+- openai/gpt-oss-20b
+pipeline_tag: text-generation
+---
+# LEXPT Law SFT (CAP subset)
+## Dataset Summary
+**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:
+- Case skeleton extraction (posture, issues, holdings, standards, disposition)
+- Variance vs. constructive amendment analysis
+- Preservation/waiver and prejudice analysis
+- Habeas procedural-default framing (cause–prejudice; innocence gateway)
+- Evidence topics (authentication, co-conspirator statements under Fed. R. Evid. 801(d)(2)(E), Rule 403 balancing, juror aids)
+- IRAC drafting and advocacy point-headings (petitioner/state)
+- Bluebook formatting exercises
+The data are curated for **base+LoRA** legal assistants and are compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.
+---
+## Intended Use
+- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.
+- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
+- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.
 ---
+## Use Cases (15 task templates)
+1. **Core extraction (case skeleton)**
+   Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.
+2. **Variance vs. constructive amendment**
+   Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.
+3. **Preservation / waiver**
+   Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.
+4. **Prejudice analysis (variance)**
+   Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.
+5. **Habeas framing (procedural default)**
+   Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.
+6. **Standard of review**
+   State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.
+7. **Argument for petitioner/appellant**
+   Draft 4–8 concise advocacy points arguing that a discrepancy in the alleged means (e.g., knife → gun) violated the Sixth Amendment right to notice and was not harmless.
+8. **Argument for the state/appellee**
+   Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.
+9. **Record checklist**
+   List the record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).
+10. **Remedies**
+    State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.
+11. **Hypothetical preservation**
+    Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.
+12. **Notice pleading in informations**
+    Explain the factual specificity an information needs to give adequate notice; apply it to “assault with intent to kill” and assess whether the means alleged in the charging instrument (knife vs. gun) is material.
+13. **Jury-instruction angle**
+    Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.
+14. **Bluebook formatting**
+    Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.
+15. **One-page IRAC**
+    Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.
+---
+## Data Structure
+### Record Schema
+| Field          | Type   | Description                                                                                       |
+|----------------|--------|---------------------------------------------------------------------------------------------------|
+| `id`           | str    | Unique identifier (e.g., `ridgeway_habeas_0001`).                                                 |
+| `case_name`    | str    | Case caption (e.g., “Ridgeway v. Hutto”).                                                         |
+| `court`        | str    | Court (e.g., “8th Cir.”).                                                                         |
+| `year`         | int    | Decision year.                                                                                    |
+| `jurisdiction` | str    | “federal” or “state”.                                                                             |
+| `prompt_type`  | str    | One of the 15 task categories (see **Use Cases**).                                                |
+| `opinion_text` | str    | Public-domain opinion excerpt used as context.                                                    |
+| `messages`     | list   | ChatML-style messages: a list of `{"role": "...", "content": "..."}` objects, where `role` is `system`, `user`, or `assistant`. |
+| `source_ref`   | str    | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”).                       |
+### Example Record
+```json
+{
+  "id": "ridgeway_habeas_0001",
+  "case_name": "Ridgeway v. Hutto",
+  "court": "8th Cir.",
+  "year": 1973,
+  "jurisdiction": "federal",
+  "prompt_type": "core_extraction",
+  "opinion_text": "…public-domain opinion excerpt…",
+  "messages": [
+    {
+      "role": "system",
+      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
+    },
+    {
+      "role": "user",
+      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
+    },
+    {
+      "role": "assistant",
+      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
+    }
+  ],
+  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
+}
+```
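
The schema and example record above can be checked mechanically before upload. The following is a minimal sketch, not part of the dataset tooling: `SCHEMA`, `validate_record`, and the allowed-value sets are hypothetical names, and the type checks simply mirror the Record Schema table.

```python
# Expected field types, copied from the Record Schema table.
SCHEMA = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"wrong type for {field}: {type(value).__name__}")

    jurisdiction = record.get("jurisdiction")
    if isinstance(jurisdiction, str) and jurisdiction not in ALLOWED_JURISDICTIONS:
        problems.append("jurisdiction must be 'federal' or 'state'")

    messages = record.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] is not an object")
                continue
            role = msg.get("role")
            if not isinstance(role, str) or role not in ALLOWED_ROLES:
                problems.append(f"messages[{i}] has an unknown role")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}] content is not a string")
    return problems
```

Calling `validate_record(row)` on each row and collecting the non-empty results gives a quick report of malformed records; it does not check whether `prompt_type` is one of the 15 task names, since the card shows only `core_extraction` as a concrete value.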
+### Splits
+- `train`: update after upload
+- `validation`: update after upload
+- `test` (optional): update after upload
+> **Split policy:** Keep all tasks derived from the **same case** in a single split (train, validation, or test) to avoid leakage.
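
One way to follow the split policy is to assign splits per case rather than per record. The sketch below is an illustration under assumptions: `assign_split` and `split_records` are hypothetical helpers, the 80/10/10 proportions are placeholders, and grouping by `case_name` alone assumes captions are unique (combining it with `source_ref` would be safer if they are not).

```python
import hashlib


def assign_split(case_name: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case to a split deterministically, so all of its tasks land together."""
    # Ignore case and spacing differences in the caption.
    key = " ".join(case_name.lower().split())
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_records(records):
    """Group records into train/validation/test using the case-level assignment."""
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        splits[assign_split(rec["case_name"])].append(rec)
    return splits
```

Because the assignment depends only on the caption, re-running the split after adding new cases never moves an existing case to a different split.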
+---
+## How to Use
+### Load with 🤗 Datasets
+```python
+from datasets import load_dataset
+ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
+print(ds)
+print(ds["train"][0])
+```
+### Use with Chat Templates (Transformers)
 ```python
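+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base
+sample = ds["train"][0]["messages"]  # `ds` comes from the loading snippet above
+# Inference: drop the reference assistant turn, then append the generation prompt.
+prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)
+# Training text: keep every turn and do not append a generation prompt.
+text = tok.apply_chat_template(sample, add_generation_prompt=False, tokenize=False)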
 ```
+---
+## Curation & Construction
+- **Sources:** public-domain opinions (e.g., CAP).
+- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
+- **Annotation:** prompts and answers authored by contributors with legal knowledge; emphasis on **final-answer-only** style.
+- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
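
The preprocessing step can be approximated with the standard library. This is a rough sketch, not the pipeline actually used: `normalize_text` and `drop_near_duplicates` are hypothetical names, straightening curly quotes and collapsing all whitespace (including paragraph breaks) are assumed choices, and the 0.95 similarity threshold is an arbitrary placeholder.

```python
import re
from difflib import SequenceMatcher

# Assumed normalization: map curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Straighten curly quotes and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def drop_near_duplicates(snippets, threshold=0.95):
    """Keep the first of any group of snippets whose similarity meets the threshold."""
    kept = []
    for snippet in snippets:
        norm = normalize_text(snippet)
        # Pairwise comparison is quadratic; fine for small batches only.
        if all(SequenceMatcher(None, norm, prev).ratio() < threshold for prev in kept):
            kept.append(norm)
    return kept
```

For a corpus in the tens of thousands of records, the pairwise comparison would be too slow, and a hashing-based approach would be a more realistic choice.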
+---
+## Quality Control
+- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
+- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
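
The formatting-adherence check for IRAC answers could be automated along these lines. This is a sketch under assumptions: `has_irac_headers` is a hypothetical helper, and it assumes each header sits alone on its own line, optionally bold and optionally followed by a colon; other layouts (for example `## Issue`) would need a different pattern.

```python
import re

IRAC_HEADERS = ["Issue", "Rule", "Application", "Conclusion"]

# A header line is the bare word, optionally wrapped in ** and/or followed by a colon.
HEADER_LINE = re.compile(
    r"^[ \t]*(?:\*\*)?(Issue|Rule|Application|Conclusion):?(?:\*\*)?:?[ \t]*$",
    re.MULTILINE,
)


def has_irac_headers(answer: str) -> bool:
    """True if the four IRAC headers each appear once, on their own lines, in order."""
    found = [m.group(1) for m in HEADER_LINE.finditer(answer)]
    return found == IRAC_HEADERS
```

Requiring the header to fill the whole line keeps body sentences such as "Rule 403 was not violated" from being mistaken for a header.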
+---
+## Ethical Considerations & Limitations
+- **Not legal advice.** This dataset trains formatting and structure for legal analysis; always verify with primary sources.
+- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
+- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
+- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.
+---
+## Licensing
+- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
+- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
+- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.
+---
+## Citation
+If you use this dataset, please cite:
+```
+sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
+```
+And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.
+---
+## Maintainer
+- **Author/Maintainer:** `sik247`
+- Issues/requests: open a Discussion on the dataset page.
+---
+## Changelog
+- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions are planned for subsequent versions.