---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
source_datasets:
- common-pile/Caselaw_Access_Project
datasets:
- common-pile/caselaw_access_project
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
---

# LEXPT Law SFT (CAP subset)

## Dataset Summary
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:

- Case skeleton extraction (posture, issues, holdings, standards, disposition)  
- Variance vs. constructive amendment analysis  
- Preservation/waiver and prejudice analysis  
- Habeas procedural-default framing (cause–prejudice; innocence gateway)  
- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)  
- IRAC drafting and advocacy point-headings (petitioner/state)  
- Bluebook formatting exercises

The data are curated for **base+LoRA** legal assistants and are compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.

---

## Intended Use
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.  
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.  
- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.

---

## Use Cases (15 task templates)

1. **Core extraction (case skeleton)**  
   Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.

2. **Variance vs. constructive amendment**  
   Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.

3. **Preservation / waiver**  
   Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.

4. **Prejudice analysis (variance)**  
   Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.

5. **Habeas framing (procedural default)**  
   Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.

6. **Standard of review**  
   State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.

7. **Argument for petitioner/appellant**  
   Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.

8. **Argument for the state/appellee**  
   Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.

9. **Record checklist**  
   Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).

10. **Remedies**  
    State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.

11. **Hypothetical preservation**  
    Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.

12. **Notice pleading in informations**  
    Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.

13. **Jury-instruction angle**  
    Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.

14. **Bluebook formatting**  
    Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.

15. **One-page IRAC**  
    Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.

---

## Data Structure

### Record Schema
| Field          | Type   | Description                                                                                       |
|----------------|--------|---------------------------------------------------------------------------------------------------|
| `id`           | str    | Unique identifier (e.g., `ridgeway_habeas_0001`).                                                 |
| `case_name`    | str    | Case caption (e.g., “Ridgeway v. Hutto”).                                                         |
| `court`        | str    | Court (e.g., “8th Cir.”).                                                                         |
| `year`         | int    | Decision year.                                                                                    |
| `jurisdiction` | str    | “federal” or “state”.                                                                             |
| `prompt_type`  | str    | One of the 15 task categories (see **Use Cases**).                                                |
| `opinion_text` | str    | Public-domain opinion excerpt used as context.                                                    |
| `messages`     | list   | ChatML-style messages: `[{"role": "system"|"user"|"assistant", "content": "..."}]`.               |
| `source_ref`   | str    | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”).                       |

### Example Record
```json
{
  "id": "ridgeway_habeas_0001",
  "case_name": "Ridgeway v. Hutto",
  "court": "8th Cir.",
  "year": 1973,
  "jurisdiction": "federal",
  "prompt_type": "core_extraction",
  "opinion_text": "…public-domain opinion excerpt…",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
    },
    {
      "role": "user",
      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
    },
    {
      "role": "assistant",
      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
    }
  ],
  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
```
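
The schema above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset tooling: `validate_record`, `REQUIRED_FIELDS`, and `ALLOWED_ROLES` are hypothetical names introduced here, and the checks only cover what the schema table states (field presence, basic types, the two jurisdiction values, and the three message roles).

```python
# Hypothetical validator: field names and types are taken from the Record Schema table.
REQUIRED_FIELDS = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(rec: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in rec:
            problems.append(f"missing field: {name}")
        elif not isinstance(rec[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")

    jurisdiction = rec.get("jurisdiction")
    if isinstance(jurisdiction, str) and jurisdiction not in ALLOWED_JURISDICTIONS:
        problems.append("jurisdiction: expected 'federal' or 'state'")

    messages = rec.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}]: expected a dict")
                continue
            role = msg.get("role")
            if role not in ALLOWED_ROLES:
                problems.append(f"messages[{i}]: unexpected role {role!r}")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}]: content must be a string")
    return problems
```

Running it over every record and collecting the non-empty results gives a quick report of malformed rows.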

### Splits
- `train`: update after upload  
- `validation`: update after upload  
- `test` (optional): update after upload  

> **Split policy:** Do **not** split tasks for the **same case** across train/val/test to avoid leakage.
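
One way to enforce this policy is to assign splits per case rather than per record. The sketch below is an illustration under stated assumptions: the helper names `assign_split` and `split_by_case`, the 10%/10% validation/test fractions, and the choice of `case_name`, `court`, and `year` as the grouping key are all hypothetical and not taken from the dataset's own build process.

```python
import hashlib


def assign_split(case_key: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a case key to a split deterministically via a stable hash."""
    digest = hashlib.sha256(case_key.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_by_case(records):
    """Group records so every task for the same case lands in one split."""
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        # Assumed grouping key: caption + court + year, to reduce caption collisions.
        key = f'{rec["case_name"]}|{rec["court"]}|{rec["year"]}'
        splits[assign_split(key)].append(rec)
    return splits
```

Because the assignment depends only on the case key, adding new task templates for an existing case later keeps them in the same split.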

---

## How to Use

### Load with 🤗 Datasets
```python
from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
print(ds)
print(ds["train"][0])
```

### Use with Chat Templates (Transformers)
```python
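from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base

messages = ds["train"][0]["messages"]  # `ds` comes from the loading snippet above
# Training text: render the full conversation, including the assistant answer.
train_text = tok.apply_chat_template(messages, tokenize=False)
# Inference prompt: drop the reference answer so the model writes its own.
prompt = tok.apply_chat_template(messages[:-1], add_generation_prompt=True, tokenize=False)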
```

---

## Curation & Construction
- **Sources:** public-domain opinions (e.g., CAP).  
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).  
- **Annotation:** prompts and answers authored by legally knowledgeable contributors; emphasis on **final-answer-only** style.  
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
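
The normalization and de-duplication steps listed under **Preprocessing** could look roughly like the sketch below. It is not the pipeline actually used for this dataset: the function names, the mapping of curly quotes to straight quotes, the 5-word shingles, and the 0.8 Jaccard threshold are all assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed quote policy: map curly quotes to their straight equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Normalize Unicode form, quotes, and runs of whitespace."""
    text = unicodedata.normalize("NFC", text).translate(QUOTE_MAP)
    text = re.sub(r"[ \t]+", " ", text)      # collapse spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


def shingles(text: str, n: int = 5) -> set:
    """Return the set of n-word windows of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(snippets, threshold: float = 0.8):
    """Keep the first of any group of near-identical snippets."""
    kept, kept_shingles = [], []
    for snippet in snippets:
        sh = shingles(normalize_text(snippet))
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(snippet)
            kept_shingles.append(sh)
    return kept
```

The pairwise comparison is quadratic in the number of snippets, which is acceptable for a corpus in the tens of thousands only if snippets are first bucketed (for example, by case); a MinHash index would be the usual replacement at larger scale.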

---

## Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.  
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
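
The formatting-adherence spot check can be partly automated for the IRAC template. The sketch below is one possible check, not the procedure used by the maintainers. It assumes each of the four headers starts its own line, optionally wrapped in `**`, and is followed by a colon or the end of the line; `irac_headers_in_order` is a hypothetical helper name.

```python
import re

IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")

# A header is a line that starts with one of the four names, optionally bolded,
# followed by a colon or end of line (so "Rule 403 bars..." is not a header).
_HEADER_RE = re.compile(
    r"^[ \t]*(?:\*\*)?(Issue|Rule|Application|Conclusion)(?:\*\*)?[ \t]*(?::|$)",
    re.MULTILINE,
)


def irac_headers_in_order(answer: str) -> bool:
    """True if the answer has exactly the four IRAC headers, in order."""
    found = tuple(m.group(1) for m in _HEADER_RE.finditer(answer))
    return found == IRAC_HEADERS
```

Answers that fail the check are candidates for manual review rather than automatic rejection, since the assumed header layout may not match every record.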

---

## Ethical Considerations & Limitations
- **Not legal advice.** This dataset trains formatting and structure for legal analysis; always verify with primary sources.  
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.  
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.  
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

---

## Licensing
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).  
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.  
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.

---

## Citation
If you use this dataset, please cite:
```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```
And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.

---

## Maintainer
- **Author/Maintainer:** `sik247`  
- Issues/requests: open a Discussion on the dataset page.

---

## Changelog
- **v1.0:** Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions are planned for subsequent versions.