	---
	pretty_name: LEXPT Law SFT (CAP subset)
	dataset_name: lexpt-law-sft
	tags:
	- legal
	- law
	- caselaw
	- sft
	- lora
	- chatml
	- instruction-tuning
	task_categories:
	- text-generation
	- question-answering
	- summarization
	language:
	- en
	license: cc-by-4.0
	size_categories:
	- 10K<n<100K
	source_datasets:
	- common-pile/caselaw_access_project
	datasets:
	- common-pile/caselaw_access_project
	base_model:
	- openai/gpt-oss-20b
	pipeline_tag: text-generation
	---

	# LEXPT Law SFT (CAP subset)

	## Dataset Summary
	LEXPT Law SFT is a supervised fine-tuning corpus for U.S. case-law analysis. Each record pairs an excerpt from a public-domain judicial opinion (e.g., from the Caselaw Access Project, “CAP”) with a lawyer-authored prompt and response in chat format. The prompts target appellate and habeas skills:

	- Case skeleton extraction (posture, issues, holdings, standards, disposition)
	- Variance vs. constructive amendment analysis
	- Preservation/waiver and prejudice analysis
	- Habeas procedural-default framing (cause–prejudice; innocence gateway)
	- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)
	- IRAC drafting and advocacy point-headings (petitioner/state)
	- Bluebook formatting exercises

	The data are curated for training legal assistants built as a base model plus a LoRA adapter, and the `messages` field is compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All opinion texts are public-domain; prompts/annotations are newly authored and released under CC-BY-4.0.

	---

	## Intended Use
	- Fine-tuning or LoRA-adapting general LLMs for opinion-grounded legal reasoning.
	- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
	- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.

	---

	## Use Cases (15 task templates)

	1. Core extraction (case skeleton)
	Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.

	2. Variance vs. constructive amendment
	Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.

	3. Preservation / waiver
	Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.

	4. Prejudice analysis (variance)
	Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.

	5. Habeas framing (procedural default)
	Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.

	6. Standard of review
	State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.

	7. Argument for petitioner/appellant
	Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.

	8. Argument for the state/appellee
	Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.

	9. Record checklist
	Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).

	10. Remedies
	State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.

	11. Hypothetical preservation
	Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.

	12. Notice pleading in informations
	Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.

	13. Jury-instruction angle
	Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.

	14. Bluebook formatting
	Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.

	15. One-page IRAC
	Produce an IRAC with exact headers—Issue, Rule, Application, Conclusion—summarizing the variance/notice dispute and the court’s reasoning.

	---

	## Data Structure

	### Record Schema
	| Field | Type | Description |
	|----------------|--------|---------------------------------------------------------------------------------------------------|
	| `id` | str | Unique identifier (e.g., `ridgeway_habeas_0001`). |
	| `case_name` | str | Case caption (e.g., “Ridgeway v. Hutto”). |
	| `court` | str | Court (e.g., “8th Cir.”). |
	| `year` | int | Decision year. |
	| `jurisdiction` | str | “federal” or “state”. |
	| `prompt_type` | str | One of the 15 task categories (see Use Cases). |
	| `opinion_text` | str | Public-domain opinion excerpt used as context. |
	| `messages` | list | ChatML-style messages: `[{"role": "system"\|"user"\|"assistant", "content": "..."}]`. |
	| `source_ref` | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |

	### Example Record
	```json
	{
	"id": "ridgeway_habeas_0001",
	"case_name": "Ridgeway v. Hutto",
	"court": "8th Cir.",
	"year": 1973,
	"jurisdiction": "federal",
	"prompt_type": "core_extraction",
	"opinion_text": "…public-domain opinion excerpt…",
	"messages": [
	{
	"role": "system",
	"content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
	},
	{
	"role": "user",
	"content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
	},
	{
	"role": "assistant",
	"content": "1) …\n2) …\n3) …\n4) …\n5) …"
	}
	],
	"source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
	}
	```
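
A minimal sketch of a schema check for records like the one above, assuming the field names and types from the Record Schema table. The helper `validate_record`, its error strings, and the rule that the last message is the assistant answer are illustrative assumptions, not part of any published tooling for this dataset.

```python
# Hypothetical helper: checks one record against the Record Schema table.
EXPECTED_TYPES = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(record):
    """Return a list of problems found in `record` (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}")
    if record.get("jurisdiction") not in ALLOWED_JURISDICTIONS:
        problems.append("jurisdiction must be 'federal' or 'state'")
    messages = record.get("messages")
    if isinstance(messages, list):
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict) or msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"message {i}: bad role")
            elif not isinstance(msg.get("content"), str):
                problems.append(f"message {i}: content must be a string")
        # Assumption: every record ends with the assistant's answer.
        last = messages[-1] if messages else None
        if not isinstance(last, dict) or last.get("role") != "assistant":
            problems.append("last message should be the assistant answer")
    return problems


# Placeholder record mirroring the example above (contents elided).
example = {
    "id": "ridgeway_habeas_0001",
    "case_name": "Ridgeway v. Hutto",
    "court": "8th Cir.",
    "year": 1973,
    "jurisdiction": "federal",
    "prompt_type": "core_extraction",
    "opinion_text": "...",
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
    "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)",
}
problems = validate_record(example)
print(problems or "record looks well-formed")
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record during a spot check.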

	### Splits
	- `train`: update after upload
	- `validation`: update after upload
	- `test` (optional): update after upload

	> Split policy: Keep all tasks derived from the same case in a single split (train, validation, or test) so that no opinion text leaks across splits.
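
One way to apply this policy is to assign splits per case rather than per record. The sketch below hashes `case_name` to a stable bucket, so every task for a case lands in the same split. The function name `assign_split`, the 80/10/10 proportions, the use of `case_name` as the grouping key, and the second record id are assumptions for illustration; the card does not state how the published splits are built.

```python
import hashlib


def assign_split(case_name, val_pct=10, test_pct=10):
    """Map a case to a split deterministically, so all its tasks stay together."""
    key = case_name.strip().lower().encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Illustrative records: two tasks built from the same opinion.
records = [
    {"id": "ridgeway_habeas_0001", "case_name": "Ridgeway v. Hutto"},
    {"id": "ridgeway_habeas_0002", "case_name": "Ridgeway v. Hutto"},  # hypothetical id
]
splits = {"train": [], "validation": [], "test": []}
for rec in records:
    splits[assign_split(rec["case_name"])].append(rec["id"])
print(splits)
```

Because the percentages apply to cases and not to records, the resulting record counts per split are only approximately proportional.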

	---

	## How to Use

	### Load with 🤗 Datasets
	```python
	from datasets import load_dataset
	ds = load_dataset("sik247/lexpt-law-sft") # replace with your repo id
	print(ds)
	print(ds["train"][0])
	```

	### Use with Chat Templates (Transformers)
	```python
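	from transformers import AutoTokenizer
	tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b") # or your base

	sample = ds["train"][0]["messages"] # `ds` comes from the loading snippet above
	# Training text: render the full conversation, including the assistant answer.
	train_text = tok.apply_chat_template(sample, tokenize=False)
	# Inference prompt: drop the final assistant turn, then open a new one for generation.
	prompt = tok.apply_chat_template(sample[:-1], add_generation_prompt=True, tokenize=False)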
	```

	---

	## Curation & Construction
	- Sources: public-domain opinions (e.g., CAP).
	- Selection: appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
	- Annotation: prompts and answers authored by legally knowledgeable contributors; emphasis on final-answer-only style.
	- Preprocessing: remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
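
The preprocessing steps above can be sketched roughly as follows. This is an illustration only: the card does not say which tools or thresholds were used, so the helper names, the `difflib` similarity measure, and the 0.95 threshold are assumptions. Collapsing all whitespace to single spaces is also a simplification that discards paragraph breaks.

```python
import re
from difflib import SequenceMatcher

# Curly quotes mapped to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text):
    """Straighten quotes and collapse runs of whitespace to single spaces."""
    text = text.translate(QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def drop_near_duplicates(snippets, threshold=0.95):
    """Keep a snippet only if it is not too similar to one already kept."""
    kept = []
    for snippet in snippets:
        clean = normalize(snippet)
        # Pairwise comparison is quadratic; fine for a sketch, slow at scale.
        if all(SequenceMatcher(None, clean, k).ratio() < threshold for k in kept):
            kept.append(clean)
    return kept


snippets = [
    "The  judgment is \u201caffirmed.\u201d",
    'The judgment is "affirmed."',
    "The petition for habeas relief is denied.",
]
unique = drop_near_duplicates(snippets)
print(unique)
```

For a corpus in the 10K to 100K range, a hashing-based method such as MinHash would likely scale better than pairwise comparison.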

	---

	## Quality Control
	- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
	- Where the opinion excerpt does not support an answer, assistant outputs state “Insufficient information” rather than inventing facts or citations.
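
A formatting-adherence check of the kind listed above could look like the sketch below, which tests an IRAC answer for the four exact headers in order. It assumes each header starts a line and is followed by a colon; the card only specifies the header names, so that layout and the helper name `has_irac_headers` are assumptions.

```python
import re

IRAC_HEADERS = ["Issue", "Rule", "Application", "Conclusion"]
# Case-sensitive on purpose: the task asks for the exact header names.
HEADER_RE = re.compile(r"^(Issue|Rule|Application|Conclusion)\s*:", flags=re.MULTILINE)


def has_irac_headers(answer):
    """True if each IRAC header starts a line exactly once, in the expected order."""
    return HEADER_RE.findall(answer) == IRAC_HEADERS


good = "Issue: ...\nRule: ...\nApplication: ...\nConclusion: ..."
bad = "Issue: ...\nRule: ...\nConclusion: ..."  # Application section missing
print(has_irac_headers(good), has_irac_headers(bad))
```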

	---

	## Ethical Considerations & Limitations
	- Not legal advice. This dataset trains formatting and structure for legal analysis; always verify with primary sources.
	- Coverage: U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
	- Model risk: Misstatements of doctrine or miscitation can occur; downstream users should validate.
	- Bias: Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

	---

	## Licensing
	- Opinion texts: Public domain (as supplied by CAP and similar sources).
	- Prompts & annotations: © 2025 sik247, released under CC-BY-4.0.
	- When redistributing, include attribution: “sik247 / LEXPT Law SFT (CAP subset)”.

	---

	## Citation
	If you use this dataset, please cite:
	```
	sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
	```
	And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.

	---

	## Maintainer
	- Author/Maintainer: `sik247`
	- Issues/requests: open a Discussion on the dataset page.

	---

	## Changelog
	- v1.0: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Planned for later versions: split counts and additional jurisdictions.