---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K<n<100K
source_datasets:
- common-pile/Caselaw_Access_Project
datasets:
- common-pile/caselaw_access_project
base_model:
- openai/gpt-oss-20b
pipeline_tag: text-generation
---

# LEXPT Law SFT (CAP subset)

## Dataset Summary
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”), paired with lawyer-authored prompts that target appellate and habeas skills:

- Case skeleton extraction (posture, issues, holdings, standards, disposition)
- Variance vs. constructive amendment analysis
- Preservation/waiver and prejudice analysis
- Habeas procedural-default framing (cause–prejudice; innocence gateway)
- Evidence topics (authentication, co-conspirator statements under Rule 801(d)(2)(E), Rule 403, juror aids)
- IRAC drafting and advocacy point-headings (petitioner/state)
- Bluebook formatting exercises

The data are curated for **base+LoRA** legal assistants, and the `messages` field can be passed directly to `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.

---

## Intended Use
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.

---

## Use Cases (15 task templates)

1. **Core extraction (case skeleton)**
   Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, and (5) disposition from a provided opinion excerpt.

2. **Variance vs. constructive amendment**
   Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify the classification using the court’s analysis.

3. **Preservation / waiver**
   Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, specific grounds in any motion, request for continuance) and assess whether they occurred.

4. **Prejudice analysis (variance)**
   Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing prejudice or its absence.

5. **Habeas framing (procedural default)**
   Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.

6. **Standard of review**
   State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope of review.

7. **Argument for petitioner/appellant**
   Draft 4–8 concise advocacy points arguing that a means discrepancy (e.g., knife → gun) violated the Sixth Amendment right to notice and was not harmless.

8. **Argument for the state/appellee**
   Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.

9. **Record checklist**
   Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).

10. **Remedies**
    State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.

11. **Hypothetical preservation**
    Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.

12. **Notice pleading in informations**
    Explain the factual specificity required to satisfy notice; apply it to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.

13. **Jury-instruction angle**
    Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.

14. **Bluebook formatting**
    Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.

15. **One-page IRAC**
    Produce an IRAC with exact headers (**Issue**, **Rule**, **Application**, **Conclusion**) summarizing the variance/notice dispute and the court’s reasoning.

---

## Data Structure

### Record Schema
| Field          | Type | Description                                                                              |
|----------------|------|------------------------------------------------------------------------------------------|
| `id`           | str  | Unique identifier (e.g., `ridgeway_habeas_0001`).                                        |
| `case_name`    | str  | Case caption (e.g., “Ridgeway v. Hutto”).                                                |
| `court`        | str  | Court (e.g., “8th Cir.”).                                                                |
| `year`         | int  | Decision year.                                                                           |
| `jurisdiction` | str  | “federal” or “state”.                                                                    |
| `prompt_type`  | str  | One of the 15 task categories (see **Use Cases**).                                       |
| `opinion_text` | str  | Public-domain opinion excerpt used as context.                                           |
| `messages`     | list | ChatML-style messages: `[{"role": "system"\|"user"\|"assistant", "content": "..."}]`.    |
| `source_ref`   | str  | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”).              |
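
The schema above can be checked mechanically before training. The following is a minimal sketch, not part of the dataset's tooling: `validate_record`, `REQUIRED_FIELDS`, and the two allow-lists are hypothetical names, and the type mapping simply mirrors the table.

```python
from typing import Any

# Field names and types mirror the Record Schema table. These constants are
# illustrative assumptions and are not shipped with the dataset.
REQUIRED_FIELDS = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_JURISDICTIONS = {"federal", "state"}
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record passed."""
    problems = []
    # Presence and type check for every documented field.
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    # Value check for the two-valued jurisdiction field.
    jurisdiction = record.get("jurisdiction")
    if isinstance(jurisdiction, str) and jurisdiction not in ALLOWED_JURISDICTIONS:
        problems.append(f"jurisdiction: unexpected value {jurisdiction!r}")
    # Each message must be a dict with a known role and string content.
    messages = record.get("messages")
    if isinstance(messages, list):
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"messages[{index}]: expected dict")
            elif message.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{index}]: unexpected role {message.get('role')!r}")
            elif not isinstance(message.get("content"), str):
                problems.append(f"messages[{index}]: content must be str")
    return problems
```

Running such a check over every row and collecting the non-empty results gives a quick report of malformed records.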

### Example Record
```json
{
  "id": "ridgeway_habeas_0001",
  "case_name": "Ridgeway v. Hutto",
  "court": "8th Cir.",
  "year": 1973,
  "jurisdiction": "federal",
  "prompt_type": "core_extraction",
  "opinion_text": "…public-domain opinion excerpt…",
  "messages": [
    {
      "role": "system",
      "content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
    },
    {
      "role": "user",
      "content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
    },
    {
      "role": "assistant",
      "content": "1) …\n2) …\n3) …\n4) …\n5) …"
    }
  ],
  "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
}
```

### Splits
- `train`: update after upload
- `validation`: update after upload
- `test` (optional): update after upload

> **Split policy:** Keep all tasks derived from the **same case** in a single split (train, validation, or test) to avoid leakage.
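
One way to follow the split policy is to assign splits per case rather than per record. The sketch below assumes the `case_name` field identifies a case well enough to group on (two distinct cases sharing a caption would collide). `assign_split` and the 10/10 percentages are hypothetical; this is not the procedure used to build the released splits.

```python
import hashlib


def assign_split(case_name: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case caption to a split so every task for that case lands together."""
    # Normalize the caption so trivial spacing/case differences do not separate
    # records that belong to the same case.
    key = case_name.strip().lower().encode("utf-8")
    # A stable hash (unlike Python's built-in hash()) gives the same bucket on
    # every run and every machine.
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Because the assignment depends only on the caption, adding new task templates for an existing case never moves that case into a different split.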

---

## How to Use

### Load with 🤗 Datasets
```python
from datasets import load_dataset

ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
print(ds)
print(ds["train"][0])
```

### Use with Chat Templates (Transformers)
```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base

sample = ds["train"][0]["messages"]
# Drop the final assistant turn so the generation prompt follows the user message.
prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)
```

---

## Curation & Construction
- **Sources:** public-domain opinions (e.g., CAP).
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
- **Annotation:** prompts and answers authored by contributors with legal knowledge; emphasis on **final-answer-only** style.
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
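
The normalization and de-duplication steps can be sketched as follows. This is an illustrative approximation, not the pipeline that produced the dataset: it assumes "normalize quotes" means straightening curly quotes, it collapses all whitespace (including paragraph breaks) to single spaces, and it treats two snippets as near-identical only when they match after normalization and case-folding. `normalize_text` and `dedupe_snippets` are hypothetical names.

```python
import re

# Curly quotes mapped to their straight ASCII equivalents (an assumption about
# which direction the normalization runs).
QUOTE_MAP = str.maketrans({
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
})


def normalize_text(text: str) -> str:
    """Straighten curly quotes and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text.translate(QUOTE_MAP)).strip()


def dedupe_snippets(snippets: list[str]) -> list[str]:
    """Keep the first snippet seen for each normalized, case-folded key."""
    seen = set()
    kept = []
    for snippet in snippets:
        key = normalize_text(snippet).casefold()
        if key not in seen:
            seen.add(key)
            kept.append(snippet)
    return kept
```

A stricter notion of near-duplication (for example, shingle overlap) would catch more cases, at the cost of a similarity threshold that has to be tuned.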

---

## Quality Control
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”
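
Formatting adherence, item (ii) above, is the easiest of these checks to automate. As a rough sketch for the IRAC template, assuming each header starts its own line and may be wrapped in Markdown emphasis or heading markers, the check below confirms that the four headers appear exactly once each and in order. `irac_headers_in_order` is a hypothetical helper, and it is deliberately strict: a body line that happens to begin with the word "Rule" or "Issue" will make it fail.

```python
import re

IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")

# Match a header word at the start of a line, allowing leading "#", "*", or
# whitespace so that "## Issue" and "**Issue**" both count.
HEADER_RE = re.compile(r"^[#*\s]*(Issue|Rule|Application|Conclusion)\b", re.MULTILINE)


def irac_headers_in_order(answer: str) -> bool:
    """True if the line-initial headers are exactly Issue, Rule, Application, Conclusion."""
    return HEADER_RE.findall(answer) == list(IRAC_HEADERS)
```

Answers that fail the check can be routed to manual review rather than dropped, since the strictness noted above produces some false negatives.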

---

## Ethical Considerations & Limitations
- **Not legal advice.** This dataset trains formatting and structure for legal analysis; always verify with primary sources.
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

---

## Licensing
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.

---

## Citation
If you use this dataset, please cite:
```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```
Please also acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.

---

## Maintainer
- **Author/Maintainer:** `sik247`
- Issues/requests: open a Discussion on the dataset page.

---

## Changelog
- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions are to be added in subsequent versions.