---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
source_datasets:
- common-pile/caselaw_access_project
datasets:
- common-pile/caselaw_access_project
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
---
# LEXPT Law SFT (CAP subset)
## Dataset Summary
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:
- Case skeleton extraction (posture, issues, holdings, standards, disposition)
- Variance vs. constructive amendment analysis
- Preservation/waiver and prejudice analysis
- Habeas procedural-default framing (cause–prejudice; innocence gateway)
- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)
- IRAC drafting and advocacy point-headings (petitioner/state)
- Bluebook formatting exercises
The data are curated for legal assistants built as a **base model plus LoRA adapter**, and every record is compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.
---
## Intended Use
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
- Not intended for producing legal advice; this is a research/engineering dataset for improving structured legal outputs.
---
## Use Cases (15 task templates)
1. **Core extraction (case skeleton)**
Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.
2. **Variance vs. constructive amendment**
Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.
3. **Preservation / waiver**
Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.
4. **Prejudice analysis (variance)**
Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.
5. **Habeas framing (procedural default)**
Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.
6. **Standard of review**
State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.
7. **Argument for petitioner/appellant**
Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.
8. **Argument for the state/appellee**
Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.
9. **Record checklist**
Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).
10. **Remedies**
State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.
11. **Hypothetical preservation**
Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.
12. **Notice pleading in informations**
Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.
13. **Jury-instruction angle**
Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.
14. **Bluebook formatting**
Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.
15. **One-page IRAC**
Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.
---
## Data Structure
### Record Schema
| Field | Type | Description |
|----------------|--------|---------------------------------------------------------------------------------------------------|
| `id` | str | Unique identifier (e.g., `ridgeway_habeas_0001`). |
| `case_name` | str | Case caption (e.g., “Ridgeway v. Hutto”). |
| `court` | str | Court (e.g., “8th Cir.”). |
| `year` | int | Decision year. |
| `jurisdiction` | str | “federal” or “state”. |
| `prompt_type` | str | One of the 15 task categories (see **Use Cases**). |
| `opinion_text` | str | Public-domain opinion excerpt used as context. |
| `messages` | list | ChatML-style messages: `[{"role": "system"\|"user"\|"assistant", "content": "..."}]`. |
| `source_ref` | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |
### Example Record
```json
{
"id": "ridgeway_habeas_0001",
"case_name": "Ridgeway v. Hutto",
"court": "8th Cir.",
"year": 1973,
"jurisdiction": "federal",
"prompt_type": "core_extraction",
"opinion_text": "…public-domain opinion excerpt…",
"messages": [
{
"role": "system",
"content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
},
{
"role": "user",
"content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
},
{
"role": "assistant",
"content": "1) …\n2) …\n3) …\n4) …\n5) …"
}
],
"source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
```
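A record can be checked against the schema above before training. The following is a minimal sketch, not part of the dataset tooling: the field names and types come from the Record Schema table, while `validate_record` and its error messages are hypothetical helpers introduced here for illustration.

```python
# Field names and types follow the Record Schema table.
# The helper itself is a hypothetical illustration, not shipped with the dataset.
EXPECTED_FIELDS = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    if record.get("jurisdiction") not in ALLOWED_JURISDICTIONS:
        problems.append("jurisdiction must be 'federal' or 'state'")
    for i, msg in enumerate(record.get("messages") or []):
        if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"messages[{i}] has an unexpected role")
        elif not isinstance(msg.get("content"), str):
            problems.append(f"messages[{i}] content should be a string")
    return problems
```

Calling `validate_record(ds["train"][0])` on a loaded record should return an empty list if the record matches the table.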
### Splits
- `train`: update after upload
- `validation`: update after upload
- `test` (optional): update after upload
> **Split policy:** Do **not** split tasks for the **same case** across train/val/test to avoid leakage.
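
One way to follow the split policy is to assign splits per case rather than per record. The sketch below is an illustration under stated assumptions: it treats a normalized `case_name` as the case key, uses an 80/10/10 ratio that this card does not specify, and the helpers `assign_split` and `split_records` are hypothetical names.

```python
import hashlib


def assign_split(case_name, val_pct=10, test_pct=10):
    """Map a case caption to a split deterministically.

    Assumption: the caption (lowercased, whitespace-collapsed) identifies the case.
    The 10/10 percentages are illustrative defaults, not the dataset's actual ratio.
    """
    key = " ".join(case_name.lower().split())
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_records(records):
    """Group records so every task for the same case lands in one split."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(record["case_name"])].append(record)
    return splits
```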
---
## How to Use
### Load with 🤗 Datasets
```python
from datasets import load_dataset
ds = load_dataset("sik247/lexpt-law-sft") # replace with your repo id
print(ds)
print(ds["train"][0])
```
### Use with Chat Templates (Transformers)
```python
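from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base
sample = ds["train"][0]["messages"]  # `ds` comes from the loading snippet above
# Training text: keep the assistant turn and add no generation prompt.
train_text = tok.apply_chat_template(sample, add_generation_prompt=False, tokenize=False)
# Inference prompt: drop the reference answer, then open a new assistant turn.
prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)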
```
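For evaluation on held-out opinions, it can help to separate each record's prompt from its reference answer. This is a minimal sketch that assumes every `messages` list ends with a single assistant turn, as in the example record; `split_prompt_completion` is a hypothetical helper name.

```python
def split_prompt_completion(messages):
    """Split a messages list into (prompt_messages, reference_answer).

    Assumption: the final message is the assistant's reference answer.
    """
    if not messages or messages[-1]["role"] != "assistant":
        raise ValueError("expected the final message to be the assistant answer")
    return messages[:-1], messages[-1]["content"]
```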
---
## Curation & Construction
- **Sources:** public-domain opinions (e.g., CAP).
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
- **Annotation:** prompts and answers authored by legally knowledgeable contributors, with emphasis on a **final-answer-only** style.
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
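
The preprocessing step can be approximated with the standard library. The sketch below is not the pipeline used to build the dataset: it assumes quote normalization means straightening curly quotes, it collapses all whitespace (including paragraph breaks) to single spaces, and the 0.95 similarity threshold and helper names are illustrative choices.

```python
import re
from difflib import SequenceMatcher

# Assumption: "normalize quotes" means mapping curly quotes to straight ones.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text):
    """Straighten curly quotes and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def dedupe_near_identical(snippets, threshold=0.95):
    """Keep the first snippet of each near-identical group.

    Pairwise comparison is O(n^2), so this suits small batches only.
    The threshold is an illustrative value, not the one used for this dataset.
    """
    kept = []
    for snippet in snippets:
        clean = normalize(snippet)
        is_new = all(
            SequenceMatcher(None, clean, seen, autojunk=False).ratio() < threshold
            for seen in kept
        )
        if is_new:
            kept.append(clean)
    return kept
```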
---
## Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
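
The formatting spot checks can be partly automated. The following sketch assumes IRAC headers each start their own line (optionally wrapped in Markdown emphasis or heading marks) and that list answers use the `1)` ... `5)` style of the example record; both helpers are hypothetical and do not check factual alignment.

```python
import re

IRAC_HEADERS = ["Issue", "Rule", "Application", "Conclusion"]


def has_irac_headers(answer):
    """True if the four IRAC headers each start a line, in order."""
    position = 0
    for header in IRAC_HEADERS:
        pattern = re.compile(rf"^[#*\s]*{header}\b", re.MULTILINE)
        match = pattern.search(answer, position)
        if match is None:
            return False
        position = match.end()
    return True


def has_numbered_items(answer, count=5):
    """True if lines begin with 1) through count) in sequence."""
    found = re.findall(r"^(\d+)\)", answer, flags=re.MULTILINE)
    return found == [str(n) for n in range(1, count + 1)]
```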
---
## Ethical Considerations & Limitations
- **Not legal advice.** This dataset teaches models the formatting and structure of legal analysis; always verify outputs against primary sources.
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.
---
## Licensing
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.
---
## Citation
If you use this dataset, please cite:
```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```
Please also acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.
---
## Maintainer
- **Author/Maintainer:** `sik247`
- Issues/requests: open a Discussion on the dataset page.
---
## Changelog
- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions are planned for subsequent versions.