Update README.md
README.md
CHANGED
|
@@ -1,62 +1,221 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
model_name: outputs_lexpt
|
| 5 |
tags:
|
| 6 |
-
-
|
| 7 |
-
-
|
|
|
|
| 8 |
- sft
|
| 9 |
-
-
|
| 10 |
-
-
|
| 11 |
-
-
|
| 12 |
-
|
| 13 |
---
|
| 14 |
|
| 15 |
-
#
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
| 19 |
|
| 20 |
-
##
|
| 21 |
|
| 22 |
```python
|
| 23 |
-
from transformers import
|
|
|
|
| 24 |
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
|
| 28 |
-
print(output["generated_text"])
|
| 29 |
```
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
##
|
| 39 |
|
| 40 |
-
-
|
| 41 |
-
- TRL: 0.21.0
|
| 42 |
-
- Transformers: 4.56.0.dev0
|
| 43 |
-
- Pytorch: 2.8.0
|
| 44 |
-
- Datasets: 3.6.0
|
| 45 |
-
- Tokenizers: 0.21.4
|
| 46 |
|
| 47 |
-
##
|
| 48 |
|
|
|
|
| 49 |
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
}
|
| 62 |
-
```
|
|
|
|
| 1 |
---
|
| 2 |
+
pretty_name: LEXPT Law SFT (CAP subset)
|
| 3 |
+
dataset_name: lexpt-law-sft
|
|
|
|
| 4 |
tags:
|
| 5 |
+
- legal
|
| 6 |
+
- law
|
| 7 |
+
- caselaw
|
| 8 |
- sft
|
| 9 |
+
- lora
|
| 10 |
+
- chatml
|
| 11 |
+
- instruction-tuning
|
| 12 |
+
task_categories:
|
| 13 |
+
- text-generation
|
| 14 |
+
- question-answering
|
| 15 |
+
- summarization
|
| 16 |
+
language:
|
| 17 |
+
- en
|
| 18 |
+
license: cc-by-4.0
|
| 19 |
+
size_categories:
|
| 20 |
+
- 10K<n<100K
|
| 21 |
+
source_datasets:
|
| 22 |
+
- common-pile/caselaw_access_project
|
| 23 |
+
datasets:
|
| 24 |
+
- common-pile/caselaw_access_project
|
| 25 |
+
base_model:
|
| 26 |
+
- openai/gpt-oss-20b
|
| 27 |
+
pipeline_tag: text-generation
|
| 28 |
+
---
|
| 29 |
+
|
| 30 |
+
# LEXPT Law SFT (CAP subset)
|
| 31 |
+
|
| 32 |
+
## Dataset Summary
|
| 33 |
+
**LEXPT Law SFT** is a supervised fine-tuning corpus for **U.S. case-law analysis**. It provides **chat-style instruction/response** records derived from **public-domain judicial opinions** (e.g., the Caselaw Access Project, “CAP”) and lawyer-authored prompts targeting appellate/habeas skills:
|
| 34 |
+
|
| 35 |
+
- Case skeleton extraction (posture, issues, holdings, standards, disposition)
|
| 36 |
+
- Variance vs. constructive amendment analysis
|
| 37 |
+
- Preservation/waiver and prejudice analysis
|
| 38 |
+
- Habeas procedural-default framing (cause–prejudice; innocence gateway)
|
| 39 |
+
- Evidence topics (authentication, 801(d)(2)(E), Rule 403, juror aids)
|
| 40 |
+
- IRAC drafting and advocacy point-headings (petitioner/state)
|
| 41 |
+
- Bluebook formatting exercises
|
| 42 |
+
|
| 43 |
+
The data are curated for **base+LoRA** legal assistants and are compatible with `tokenizer.apply_chat_template(...)` (ChatML-style roles). All **opinion texts** are public-domain; **prompts/annotations** are newly authored and released under **CC-BY-4.0**.
|
| 44 |
+
|
| 45 |
+
---
|
| 46 |
+
|
| 47 |
+
## Intended Use
|
| 48 |
+
- Fine-tuning or LoRA-adapting general LLMs for **opinion-grounded legal reasoning**.
|
| 49 |
+
- Evaluation/benchmarking of structured appellate/habeas analysis on held-out opinions.
|
| 50 |
+
- Not for production of legal advice; this is a research/engineering dataset to improve structured legal outputs.
|
| 51 |
+
|
| 52 |
---
|
| 53 |
|
| 54 |
+
## Use Cases (15 task templates)
|
| 55 |
+
|
| 56 |
+
1. **Core extraction (case skeleton)**
|
| 57 |
+
Extract (1) procedural posture, (2) issues, (3) holdings (one line each), (4) standards of review, (5) disposition from a provided opinion excerpt.
|
| 58 |
+
|
| 59 |
+
2. **Variance vs. constructive amendment**
|
| 60 |
+
Define both doctrines, then classify the opinion’s problem (proof–pleading discrepancy vs. alteration of elements) and justify using the court’s analysis.
|
| 61 |
+
|
| 62 |
+
3. **Preservation / waiver**
|
| 63 |
+
Identify the exact trial steps necessary to preserve a fatal-variance claim (contemporaneous objection, motion grounds specificity, request for continuance) and assess whether they occurred.
|
| 64 |
|
| 65 |
+
4. **Prejudice analysis (variance)**
|
| 66 |
+
Evaluate whether variant proof (e.g., gun vs. knife) misled the defense, caused surprise, or impaired preparation; point to record facts showing (no) prejudice.
|
| 67 |
+
|
| 68 |
+
5. **Habeas framing (procedural default)**
|
| 69 |
+
Explain how a state-trial variance claim is reviewed on federal habeas when no contemporaneous objection was made; outline cause-and-prejudice / actual-innocence gateways if prompted.
|
| 70 |
+
|
| 71 |
+
6. **Standard of review**
|
| 72 |
+
State which standard(s) the court applied (de novo, abuse of discretion, harmless error) and why; explain how lack of preservation narrowed the scope.
|
| 73 |
+
|
| 74 |
+
7. **Argument for petitioner/appellant**
|
| 75 |
+
Draft 4–8 concise advocacy points that a means discrepancy (e.g., knife → gun) violated Sixth-Amendment notice and was not harmless.
|
| 76 |
+
|
| 77 |
+
8. **Argument for the state/appellee**
|
| 78 |
+
Draft 4–8 concise counterpoints on waiver (failure to object), lack of prejudice/surprise, alignment with defense theory, and adequacy of notice.
|
| 79 |
+
|
| 80 |
+
9. **Record checklist**
|
| 81 |
+
Bullet list of record items to pull for briefing (charging instrument; key witness testimony; objections or lack thereof; motions and grounds; any continuance requested; state appeal; federal habeas pleadings).
|
| 82 |
+
|
| 83 |
+
10. **Remedies**
|
| 84 |
+
State the proper remedies if a preserved fatal variance is found on direct appeal vs. habeas (reversal, new trial, or other relief), and when harmless error applies.
|
| 85 |
+
|
| 86 |
+
11. **Hypothetical preservation**
|
| 87 |
+
Re-analyze outcome/posture assuming defense counsel objected when variant proof emerged and sought a continuance; discuss how that affects prejudice and review.
|
| 88 |
+
|
| 89 |
+
12. **Notice pleading in informations**
|
| 90 |
+
Explain required factual specificity to satisfy notice; apply to “assault with intent to kill” and assess whether the instrument’s means (knife vs. gun) is material.
|
| 91 |
+
|
| 92 |
+
13. **Jury-instruction angle**
|
| 93 |
+
Propose a limiting/clarifying instruction to mitigate variance prejudice (e.g., confining the theory to the charged means) and analyze whether refusal would be reversible error.
|
| 94 |
+
|
| 95 |
+
14. **Bluebook formatting**
|
| 96 |
+
Provide full and short-form citations for the controlling decision(s) and the referenced state case; compose a citation string suitable for a brief’s argument section.
|
| 97 |
+
|
| 98 |
+
15. **One-page IRAC**
|
| 99 |
+
Produce an IRAC with exact headers—**Issue**, **Rule**, **Application**, **Conclusion**—summarizing the variance/notice dispute and the court’s reasoning.
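
The "exact headers" requirement in template 15 lends itself to an automated format check. The following is a minimal sketch, assuming headers appear at the start of a line and may be wrapped in Markdown emphasis or heading markers; the function name `has_irac_headers` is hypothetical and not part of the dataset tooling.

```python
import re

# Header names taken from the One-page IRAC template, in required order.
IRAC_HEADERS = ("Issue", "Rule", "Application", "Conclusion")


def has_irac_headers(answer: str) -> bool:
    """Return True if all four IRAC headers start a line and appear in order."""
    positions = []
    for header in IRAC_HEADERS:
        # Allow leading spaces, '*', '#', or '_' so that "**Issue**" and "## Issue" both match.
        match = re.search(rf"^[ \t*#_]*{re.escape(header)}\b", answer, flags=re.MULTILINE)
        if match is None:
            return False
        positions.append(match.start())
    # Each header must first appear after the previous one.
    return all(a < b for a, b in zip(positions, positions[1:]))


sample_answer = (
    "**Issue**\nWhether the variance was fatal.\n\n"
    "**Rule**\n...\n\n"
    "**Application**\n...\n\n"
    "**Conclusion**\n..."
)
print(has_irac_headers(sample_answer))
```

The check uses only the first occurrence of each header, so an answer whose body text happens to start a line with "Rule" before the real header could be misjudged; a stricter variant would require the emphasis markers.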
|
| 100 |
+
|
| 101 |
+
---
|
| 102 |
|
| 103 |
+
## Data Structure
|
| 104 |
|
| 105 |
+
### Record Schema
|
| 106 |
+
| Field | Type | Description |
|
| 107 |
+
|----------------|--------|---------------------------------------------------------------------------------------------------|
|
| 108 |
+
| `id` | str | Unique identifier (e.g., `ridgeway_habeas_0001`). |
|
| 109 |
+
| `case_name` | str | Case caption (e.g., “Ridgeway v. Hutto”). |
|
| 110 |
+
| `court` | str | Court (e.g., “8th Cir.”). |
|
| 111 |
+
| `year` | int | Decision year. |
|
| 112 |
+
| `jurisdiction` | str | “federal” or “state”. |
|
| 113 |
+
| `prompt_type` | str | One of the 15 task categories (see **Use Cases**). |
|
| 114 |
+
| `opinion_text` | str | Public-domain opinion excerpt used as context. |
|
| 115 |
+
| `messages` | list | ChatML-style messages: `[{"role": "system" \| "user" \| "assistant", "content": "..."}]`. |
|
| 116 |
+
| `source_ref` | str | Short provenance note (e.g., “CAP; citation: 474 F.2d 22 (8th Cir. 1973)”). |
|
| 117 |
+
|
| 118 |
+
### Example Record
|
| 119 |
+
```json
|
| 120 |
+
{
|
| 121 |
+
"id": "ridgeway_habeas_0001",
|
| 122 |
+
"case_name": "Ridgeway v. Hutto",
|
| 123 |
+
"court": "8th Cir.",
|
| 124 |
+
"year": 1973,
|
| 125 |
+
"jurisdiction": "federal",
|
| 126 |
+
"prompt_type": "core_extraction",
|
| 127 |
+
"opinion_text": "…public-domain opinion excerpt…",
|
| 128 |
+
"messages": [
|
| 129 |
+
{
|
| 130 |
+
"role": "system",
|
| 131 |
+
"content": "You are a legal analysis assistant. Return ONLY the final answer. No prefaces or meta-commentary."
|
| 132 |
+
},
|
| 133 |
+
{
|
| 134 |
+
"role": "user",
|
| 135 |
+
"content": "From the opinion text, list: (1) procedural posture, (2) issues, (3) holdings, (4) standards of review, (5) disposition.\n\nOPINION TEXT:\n…"
|
| 136 |
+
},
|
| 137 |
+
{
|
| 138 |
+
"role": "assistant",
|
| 139 |
+
"content": "1) …\n2) …\n3) …\n4) …\n5) …"
|
| 140 |
+
}
|
| 141 |
+
],
|
| 142 |
+
"source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
|
| 143 |
+
}
|
| 144 |
+
```
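
A record can be checked against the schema table before training. The sketch below is one possible validator, assuming the field names and types listed in the table; the helper `validate_record` and the constant names are illustrative, not part of any published tooling for this dataset.

```python
from typing import Any

# Field names and Python types mirror the Record Schema table.
REQUIRED_FIELDS = {
    "id": str,
    "case_name": str,
    "court": str,
    "year": int,
    "jurisdiction": str,
    "prompt_type": str,
    "opinion_text": str,
    "messages": list,
    "source_ref": str,
}
ALLOWED_ROLES = {"system", "user", "assistant"}
ALLOWED_JURISDICTIONS = {"federal", "state"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems: list[str] = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")

    jurisdiction = record.get("jurisdiction")
    if isinstance(jurisdiction, str) and jurisdiction not in ALLOWED_JURISDICTIONS:
        problems.append(f"jurisdiction: unexpected value {jurisdiction!r}")

    messages = record.get("messages")
    if isinstance(messages, list):
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"messages[{index}]: expected a dict")
                continue
            if message.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{index}]: unexpected role {message.get('role')!r}")
            if not isinstance(message.get("content"), str):
                problems.append(f"messages[{index}]: content should be a string")
    return problems


example = {
    "id": "ridgeway_habeas_0001",
    "case_name": "Ridgeway v. Hutto",
    "court": "8th Cir.",
    "year": 1973,
    "jurisdiction": "federal",
    "prompt_type": "core_extraction",
    "opinion_text": "...",
    "messages": [{"role": "user", "content": "..."}],
    "source_ref": "CAP; citation: 474 F.2d 22 (8th Cir. 1973)",
}
print(validate_record(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole split in one pass.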
|
| 145 |
+
|
| 146 |
+
### Splits
|
| 147 |
+
- `train`: update after upload
|
| 148 |
+
- `validation`: update after upload
|
| 149 |
+
- `test` (optional): update after upload
|
| 150 |
+
|
| 151 |
+
> **Split policy:** Do **not** split tasks for the **same case** across train/val/test to avoid leakage.
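
One way to honor this policy is to assign splits per case rather than per record. The sketch below hashes a case key so that every task derived from the same case lands in the same split. It assumes `case_name` is a stable grouping key, and the 10% validation and test fractions are placeholders rather than the dataset's actual proportions; `assign_split` and `split_by_case` are hypothetical helper names.

```python
import hashlib


def assign_split(case_key: str, val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Map a case key to a split deterministically, independent of record order."""
    digest = hashlib.sha256(case_key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform-ish value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_by_case(records: list[dict], key: str = "case_name") -> dict[str, list[dict]]:
    """Group records into train/validation/test so that no case spans two splits."""
    splits: dict[str, list[dict]] = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(record[key])].append(record)
    return splits
```

Because assignment happens at the case level, the realized split sizes only approximate the requested fractions, especially when a few cases contribute many tasks.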
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## How to Use
|
| 156 |
+
|
| 157 |
+
### Load with 🤗 Datasets
|
| 158 |
+
```python
|
| 159 |
+
from datasets import load_dataset
|
| 160 |
+
ds = load_dataset("sik247/lexpt-law-sft") # replace with your repo id
|
| 161 |
+
print(ds)
|
| 162 |
+
print(ds["train"][0])
|
| 163 |
+
```
|
| 164 |
+
|
| 165 |
+
### Use with Chat Templates (Transformers)
|
| 166 |
```python
|
| 167 |
+
from transformers import AutoTokenizer
|
| 168 |
+
tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b") # or your base
|
| 169 |
|
| 170 |
+
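# `ds` comes from the loading snippet above; drop the final assistant turn so the prompt ends at the generation prompt
sample = ds["train"][0]["messages"][:-1]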
|
| 171 |
+
prompt = tok.apply_chat_template(sample, add_generation_prompt=True, tokenize=False)
|
| 172 |
```
|
| 173 |
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## Curation & Construction
|
| 177 |
+
- **Sources:** public-domain opinions (e.g., CAP).
|
| 178 |
+
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
|
| 179 |
+
- **Annotation:** prompts and answers authored by legally knowledgeable contributors; emphasis on **final-answer-only** style.
|
| 180 |
+
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.
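
The preprocessing step can be illustrated roughly as follows. This is a sketch under stated assumptions, not the pipeline actually used: it maps curly quotes to straight ones, collapses all whitespace runs to single spaces, and treats snippets as duplicates only when they match after normalization and case folding, which is a simpler stand-in for true near-duplicate detection. The names `normalize_text` and `dedupe_snippets` are hypothetical.

```python
import re

# Assumed normalization direction: curly quotes become straight quotes.
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_text(text: str) -> str:
    """Straighten quotes and collapse every whitespace run to one space."""
    text = text.translate(QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_snippets(snippets: list[str]) -> list[str]:
    """Keep the first occurrence of each snippet, comparing normalized, case-folded text."""
    seen: set[str] = set()
    kept: list[str] = []
    for snippet in snippets:
        key = normalize_text(snippet).casefold()
        if key not in seen:
            seen.add(key)
            kept.append(snippet)
    return kept
```

Collapsing whitespace also removes paragraph breaks, so a pipeline that needs to preserve opinion structure would normalize within paragraphs instead.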
|
| 181 |
|
| 182 |
+
---
|
| 183 |
|
| 184 |
+
## Quality Control
|
| 185 |
+
- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
|
| 186 |
+
- Where the excerpt does not support an answer, assistant outputs avoid invented facts or citations and respond “Insufficient information.”
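
A spot check for invented citations could look like the sketch below. It assumes citations follow a simple "volume reporter page" shape with a single-token reporter (for example `474 F.2d 22`), so it will miss multi-word reporters and may flag unrelated number patterns. The regex and the `unsupported_citations` helper are illustrative assumptions, and the second citation in the example is made up for demonstration.

```python
import re

# Rough pattern: volume, single-token reporter starting with a capital letter, page.
CITATION_RE = re.compile(r"\b\d{1,4} [A-Z][A-Za-z0-9.]* \d{1,5}\b")


def unsupported_citations(answer: str, context: str) -> list[str]:
    """Return citations found in the answer that never appear in the context text."""
    found = CITATION_RE.findall(answer)
    # dict.fromkeys removes repeats while preserving first-seen order.
    return [cite for cite in dict.fromkeys(found) if cite not in context]


context = "CAP; citation: 474 F.2d 22 (8th Cir. 1973)"
answer = "See 474 F.2d 22; cf. 123 F.3d 456."  # second citation is hypothetical
print(unsupported_citations(answer, context))  # prints ['123 F.3d 456']
```

In practice the context would be the record's `opinion_text` joined with its `source_ref`, and any flagged citation would go to a human reviewer rather than being rejected automatically.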
|
| 187 |
|
| 188 |
+
---
|
| 189 |
|
| 190 |
+
## Ethical Considerations & Limitations
|
| 191 |
+
- **Not legal advice.** This dataset trains formatting and structure for legal analysis; always verify with primary sources.
|
| 192 |
+
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
|
| 193 |
+
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
|
| 194 |
+
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.
|
| 195 |
|
| 196 |
+
---
|
| 197 |
|
| 198 |
+
## Licensing
|
| 199 |
+
- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
|
| 200 |
+
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
|
| 201 |
+
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.
|
| 202 |
|
| 203 |
+
---
|
| 204 |
|
| 205 |
+
## Citation
|
| 206 |
+
If you use this dataset, please cite:
|
| 207 |
+
```
|
| 208 |
+
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
|
| 209 |
+
```
|
| 210 |
+
And acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.
|
| 211 |
|
| 212 |
+
---
|
| 213 |
+
|
| 214 |
+
## Maintainer
|
| 215 |
+
- **Author/Maintainer:** `sik247`
|
| 216 |
+
- Issues/requests: open a Discussion on the dataset page.
|
| 217 |
+
|
| 218 |
+
---
|
| 219 |
+
|
| 220 |
+
## Changelog
|
| 221 |
+
- **v1.0**: Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Split counts and additional jurisdictions will be added in subsequent versions.