---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
source_datasets:
- common-pile/caselaw_access_project
datasets:
- common-pile/caselaw_access_project
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
---
# LEXPT Law SFT (CAP subset)
## Dataset Summary
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. Each record pairs an excerpt from a **public-domain judicial opinion** (drawn from sources such as the Caselaw Access Project, “CAP”) with a lawyer-authored prompt in a **chat-style instruction/response** exchange. The prompts target appellate and habeas skills:
- Case skeleton extraction (posture, issues, holdings, standards, disposition)
- Variance vs. constructive amendment analysis
- Preservation/waiver and prejudice analysis
- Habeas procedural-default framing (cause–prejudice; innocence gateway)
- Evidence topics (authentication, co-conspirator statements under Rule 801(d)(2)(E), Rule 403 balancing, juror aids)
- IRAC drafting and advocacy point-headings (petitioner/state)
- Bluebook formatting exercises
The data are curated for **base+LoRA** legal assistants and are compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.
---
## Intended Use
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
- Not intended for producing legal advice; this is a research/engineering dataset for improving structured legal outputs.
---
## Use Cases (15 task templates)
1. **Core extraction (case skeleton)**
Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.
2. **Variance vs. constructive amendment**
Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.
3. **Preservation / waiver**
Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.
4. **Prejudice analysis (variance)**
Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.
5. **Habeas framing (procedural default)**
Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.
6. **Standard of review**
State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.
7. **Argument for petitioner/appellant**
Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.
8. **Argument for the state/appellee**
Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.
9. **Record checklist**
Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).
10. **Remedies**
State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.
11. **Hypothetical preservation**
Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.
12. **Notice pleading in informations**
Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.
13. **Jury-instruction angle**
Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.
14. **Bluebook formatting**
Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.
15. **One-page IRAC**
Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.
---
## Data Structure
### Record Schema
| Field | Type | Description |
|----------------|--------|---------------------------------------------------------------------------------------------------|
| `id` | str | Unique identifier (e.g., `ridgeway_habeas_0001`). |
| `case_name` | str | Case caption (e.g., “Ridgeway v. Hutto”). |
| `court` | str | Court (e.g., “8th Cir.”). |
| `year` | int | Decision year. |
| `jurisdiction` | str | “federal” or “state”. |
| `prompt_type` | str | One of the 15 task categories (see **Use Cases**). |
| `opinion_text` | str | Public-domain opinion excerpt used as context. |
| `messages` | list | ChatML-style messages: `[{"role": "...", "content": "..."}]`, where `role` is one of `system`, `user`, or `assistant`. |
| `source_ref` | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |
### Example Record
```json
{
"id": "ridgeway_habeas_0001",
"case_name": "Ridgeway v. Hutto",
"court": "8th Cir.",
"year": 1973,
"jurisdiction": "federal",
"prompt_type": "core_extraction",
"opinion_text": "…public-domain opinion excerpt…",
"messages": [
{
"role": "system",
"content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
},
{
"role": "user",
"content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
},
{
"role": "assistant",
"content": "1) …\n2) …\n3) …\n4) …\n5) …"
}
],
"source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
```
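A quick structural check against the schema above can catch malformed records before training. The following is a minimal sketch and not part of the dataset's tooling: the `validate_record` helper and its messages are hypothetical, and it checks only field presence, Python types, the two documented `jurisdiction` values, and message roles.

```python
# Hypothetical helper: a minimal schema check for a single record.
EXPECTED_TYPES = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if record.get("jurisdiction") not in (None, "federal", "state"):
        problems.append("jurisdiction should be 'federal' or 'state'")
    messages = record.get("messages")
    for i, msg in enumerate(messages if isinstance(messages, list) else []):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}] has an invalid role")
        elif not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}] content should be str")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record in a single pass.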
### Splits
- `train`: update after upload
- `validation`: update after upload
- `test` (optional): update after upload
> **Split policy:** Do **not** split tasks for the **same case** across train/val/test to avoid leakage.
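
One way to honor the split policy is to assign whole cases, rather than individual records, to each split. The sketch below is illustrative only: `split_by_case` is a hypothetical helper, the fractions and seed are arbitrary defaults, and it assumes that all records for one case share the same `case_name` string (if captions vary, a citation taken from `source_ref` may be a safer key).

```python
import random
from collections import defaultdict


def split_by_case(records, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole cases to train/validation/test so no case spans two splits."""
    by_case = defaultdict(list)
    for rec in records:
        by_case[rec["case_name"]].append(rec)

    cases = sorted(by_case)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(cases)

    n_test = int(len(cases) * test_frac)
    n_val = int(len(cases) * val_frac)
    buckets = {
        "test": cases[:n_test],
        "validation": cases[n_test:n_test + n_val],
        "train": cases[n_test + n_val:],
    }
    # Expand each bucket of case names back into its records.
    return {
        name: [rec for case in names for rec in by_case[case]]
        for name, names in buckets.items()
    }
```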
---
## How to Use
### Load with 🤗 Datasets
```python
from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft") # replace with your repo id
print(ds)
print(ds["train"][0])
```
### Use with Chat Templates (Transformers)
```python
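from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base
sample = ds["train"][0]["messages"]
# Drop the reference assistant turn (assumed to be last) so the generation prompt opens a fresh reply.
prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)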
```
---
## Curation & Construction
- **Sources:** public-domain opinions (e.g., CAP).
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
- **Annotation:** prompts and answers authored by legally knowledgeable contributors; emphasis on **final-answer-only** style.
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
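
The whitespace/quote normalization and de-duplication steps above could be sketched roughly as follows. This is not the pipeline actually used to build the dataset: `normalize_text` and `dedupe_snippets` are hypothetical helpers, collapsing all whitespace also removes paragraph breaks, and "near-identical" is narrowed here to snippets that match after normalization and case-folding. A fuzzier notion of similarity would need a different technique.

```python
import re
import unicodedata

# Curly quotes mapped to their straight ASCII equivalents.
_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Normalize Unicode form, straighten quotes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_snippets(snippets):
    """Keep the first snippet seen for each normalized, case-folded form."""
    seen, kept = set(), []
    for snippet in snippets:
        key = normalize_text(snippet).casefold()
        if key not in seen:
            seen.add(key)
            kept.append(snippet)
    return kept
```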
---
## Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
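
Formatting adherence can be partly automated. As one example, the sketch below tests whether an IRAC answer contains the four headers **Issue**, **Rule**, **Application**, **Conclusion** exactly once each and in order. The `has_irac_headers` helper is hypothetical, and it assumes each header starts its own line, optionally bolded and optionally followed by a colon. It does not assess factual alignment with the opinion.

```python
import re

IRAC_HEADERS = ["Issue", "Rule", "Application", "Conclusion"]


def has_irac_headers(answer: str) -> bool:
    """True if each IRAC header starts a line, appears exactly once, and is in order."""
    positions = []
    for header in IRAC_HEADERS:
        # Accept plain or bold headers, followed by a colon or the end of the line.
        pattern = rf"^[ \t]*(?:\*\*)?{header}(?:\*\*)?[ \t]*(?::|$)"
        matches = [m.start() for m in re.finditer(pattern, answer, flags=re.MULTILINE)]
        if len(matches) != 1:
            return False
        positions.append(matches[0])
    return positions == sorted(positions)
```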
---
## Ethical Considerations & Limitations
- **Not legal advice.** This dataset teaches models the format and structure of legal analysis; always verify outputs against primary sources.
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.
---
## Licensing
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.
---
## Citation
If you use this dataset, please cite:
```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```
And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.
---
## Maintainer
- **Author/Maintainer:** `sik247`
- Issues/requests: open a Discussion on the dataset page.
---
## Changelog
- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Planned for later versions: updated split counts and additional jurisdictions.