---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
source_datasets:
- common-pile/caselaw_access_project
datasets:
- common-pile/caselaw_access_project
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
---

# LEXPT Law SFT (CAP subset)

## Dataset Summary
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:

- Case skeleton extraction (posture, issues, holdings, standards, disposition)  
- Variance vs. constructive amendment analysis  
- Preservation/waiver and prejudice analysis  
- Habeas procedural-default framing (cause–prejudice; innocence gateway)  
- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)  
- IRAC drafting and advocacy point-headings (petitioner/state)  
- Bluebook formatting exercises

The data are curated for **base+LoRA** legal assistants and are compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.

---

## Intended Use
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.  
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.  
- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.

---

## Use Cases (15 task templates)

1. **Core extraction (case skeleton)**  
   Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.

2. **Variance vs. constructive amendment**  
   Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.

3. **Preservation / waiver**  
   Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.

4. **Prejudice analysis (variance)**  
   Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.

5. **Habeas framing (procedural default)**  
   Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.

6. **Standard of review**  
   State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.

7. **Argument for petitioner/appellant**  
   Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.

8. **Argument for the state/appellee**  
   Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.

9. **Record checklist**  
   List the record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).

10. **Remedies**  
    State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.

11. **Hypothetical preservation**  
    Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.

12. **Notice pleading in informations**  
    Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.

13. **Jury-instruction angle**  
    Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.

14. **Bluebook formatting**  
    Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.

15. **One-page IRAC**  
    Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.

---

## Data Structure

### Record Schema
| Field          | Type   | Description                                                                                       |
|----------------|--------|---------------------------------------------------------------------------------------------------|
| `id`           | str    | Unique identifier (e.g., `ridgeway_habeas_0001`).                                                 |
| `case_name`    | str    | Case caption (e.g., “Ridgeway v. Hutto”).                                                         |
| `court`        | str    | Court (e.g., “8th Cir.”).                                                                         |
| `year`         | int    | Decision year.                                                                                    |
| `jurisdiction` | str    | “federal” or “state”.                                                                             |
| `prompt_type`  | str    | One of the 15 task categories (see **Use Cases**).                                                |
| `opinion_text` | str    | Public-domain opinion excerpt used as context.                                                    |
| `messages`     | list   | ChatML-style messages: `[{"role": "system"\|"user"\|"assistant", "content": "..."}]`.             |
| `source_ref`   | str    | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”).                       |

### Example Record
```json
{
  "id": "ridgeway_habeas_0001",
  "case_name": "Ridgeway v. Hutto",
  "court": "8th Cir.",
  "year": 1973,
  "jurisdiction": "federal",
  "prompt_type": "core_extraction",
  "opinion_text": "…public-domain opinion excerpt…",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
    },
    {
      "role": "user",
      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
    },
    {
      "role": "assistant",
      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
    }
  ],
  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
```
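
The schema above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset tooling: the function name `validate_record` and the rule that every record ends with an assistant turn are assumptions made for illustration, based on the example record.

```python
from typing import Any

# Field names and types taken from the Record Schema table.
REQUIRED_FIELDS: dict[str, type] = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems: list[str] = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if record.get("jurisdiction") not in ALLOWED_JURISDICTIONS:
        problems.append("jurisdiction must be 'federal' or 'state'")
    messages = record.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}]: unknown or missing role")
            elif not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}]: content must be a string")
        # Assumption: every SFT record ends with the assistant's reference answer.
        last = messages[-1] if messages else None
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append("last message should be an assistant turn")
    return problems
```

Running `validate_record` over each row and logging any non-empty result is usually enough to catch malformed rows early.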

### Splits
- `train`: update after upload  
- `validation`: update after upload  
- `test` (optional): update after upload  

> **Split policy:** Do **not** split tasks for the **same case** across train/val/test to avoid leakage.
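
One way to honor the split policy is to assign splits per case rather than per record. The sketch below hashes `case_name` so that every task for a case lands in the same split. The helper names (`assign_split`, `split_by_case`) and the 80/10/10 proportions are hypothetical; the card does not say how the released splits were built.

```python
import hashlib
from collections import defaultdict


def assign_split(case_name: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case to a split deterministically so all of its tasks stay together."""
    key = case_name.strip().lower().encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_case(records: list[dict]) -> dict[str, list[dict]]:
    """Group records into splits keyed by the hash bucket of their case name."""
    splits: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        splits[assign_split(record["case_name"])].append(record)
    return dict(splits)
```

Because the assignment depends only on the case caption, adding new records for an existing case never moves that case between splits. Captions that differ only in casing or surrounding whitespace are treated as the same case.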

---

## How to Use

### Load with 🤗 Datasets
```python
from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
print(ds)
print(ds["train"][0])
```

### Use with Chat Templates (Transformers)
```python
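from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base

messages = ds["train"][0]["messages"]  # `ds` comes from the loading snippet above
# Training text: render the whole conversation, assistant answer included.
train_text = tok.apply_chat_template(messages, tokenize=False)
# Inference prompt: drop the reference answer and open a fresh assistant turn.
prompt = tok.apply_chat_template(messages[:-1], add_generation_prompt=True, tokenize=False)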
```

---

## Curation & Construction
- **Sources:** public-domain opinions (e.g., CAP).  
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).  
- **Annotation:** prompts and answers authored by legally knowledgeable contributors; emphasis on **final-answer-only** style.  
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
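
The whitespace/quote normalization and near-duplicate removal steps listed above can be sketched with the standard library alone. This illustrates the idea and is not the pipeline actually used: the function names, the 0.95 similarity threshold, and the choice to collapse all whitespace (including paragraph breaks) are assumptions.

```python
import re
from difflib import SequenceMatcher

# Curly quotes mapped to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Straighten curly quotes and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def drop_near_duplicates(snippets: list[str], threshold: float = 0.95) -> list[str]:
    """Keep the first snippet of any group whose similarity ratio meets the threshold.

    Pairwise comparison is O(n^2): fine for a sketch, too slow for a large corpus.
    """
    kept: list[str] = []
    for snippet in snippets:
        candidate = normalize_text(snippet)
        if all(
            SequenceMatcher(None, candidate, prior, autojunk=False).ratio() < threshold
            for prior in kept
        ):
            kept.append(candidate)
    return kept
```

For tens of thousands of opinion excerpts, a hashing scheme such as MinHash would scale better than pairwise `SequenceMatcher` comparisons.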

---

## Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.  
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
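
The formatting-adherence check can be partly automated. As a minimal sketch, the function below tests whether an IRAC answer has the four headers, each exactly once and in order. The name `irac_headers_in_order` and the accepted header styles (optional Markdown marks, then a colon or end of line) are assumptions, since the card does not specify the exact header markup.

```python
import re

IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")


def irac_headers_in_order(answer: str) -> bool:
    """True if each IRAC header starts a line, exactly once, in the canonical order."""
    positions: list[int] = []
    for header in IRAC_HEADERS:
        # Header may be wrapped in '#' or '*' marks and must be followed by a colon
        # or the end of the line, so body text like "Rule 403 ..." is not counted.
        pattern = rf"^[#* \t]*{header}\**[ \t]*(?::|$)"
        matches = list(re.finditer(pattern, answer, flags=re.MULTILINE))
        if len(matches) != 1:
            return False
        positions.append(matches[0].start())
    return positions == sorted(positions)
```

Answers that fail the check are candidates for manual review, not automatic rejection, because a valid answer may use a header style this pattern does not anticipate.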

---

## Ethical Considerations & Limitations
- **Not legal advice.** This dataset trains models on the formatting and structure of legal analysis; always verify outputs against primary sources.  
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.  
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.  
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

---

## Licensing
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).  
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.  
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.

---

## Citation
If you use this dataset, please cite:
```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```
And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.

---

## Maintainer
- **Author/Maintainer:** `sik247`  
- Issues/requests: open a Discussion on the dataset page.

---

## Changelog
- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Planned for later versions: updated split counts and additional jurisdictions.