---
pretty_name: LEXPT Law SFT (CAP subset)
dataset_name: lexpt-law-sft
tags:
- legal
- law
- caselaw
- sft
- lora
- chatml
- instruction-tuning
task_categories:
- text-generation
- question-answering
- summarization
language:
- en
license: cc-by-4.0
size_categories:
- 10K
---

**Split policy:** Do **not** split tasks for the **same case** across train/val/test; keeping each case in a single split avoids leakage.

---

## How to Use

### Load with 🤗 Datasets

```python
from datasets import load_dataset

ds = load_dataset("sik247/lexpt-law-sft")  # replace with your repo id
print(ds)
print(ds["train"][0])
```

### Use with Chat Templates (Transformers)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/gpt-oss-20b")  # or your base model

# `ds` is the dataset loaded in the previous example
sample = ds["train"][0]["messages"]
prompt = tok.apply_chat_template(sample, add_generation_prompt=True, tokenize=False)
```

---

## Curation & Construction

- **Sources:** public-domain opinions (e.g., CAP).
- **Selection:** appellate/habeas cases and issues suited for structured outputs (lists, checklists, IRAC).
- **Annotation:** prompts and answers authored by contributors with legal knowledge; emphasis on **final-answer-only** style.
- **Preprocessing:** remove site boilerplate; normalize whitespace/quotes; ensure consistent role formatting; de-duplicate near-identical snippets.

---

## Quality Control

- Spot checks for: (i) factual alignment with the opinion excerpt, (ii) formatting adherence (lists/IRAC), (iii) concise, jurisdiction-aware language.
- Where uncertainty exists, assistant outputs avoid invented facts/citations and prefer “Insufficient information.”

---

## Ethical Considerations & Limitations

- **Not legal advice.** This dataset trains formatting and structure for legal analysis; always verify with primary sources.
- **Coverage:** U.S. appellate caselaw; not exhaustive across jurisdictions or dates.
- **Model risk:** Misstatements of doctrine or miscitation can occur; downstream users should validate.
- **Bias:** Judicial texts may reflect historical or jurisdictional bias; outputs may inherit such patterns.

---

## Licensing

- **Opinion texts:** Public domain (as supplied by CAP and similar sources).
- **Prompts & annotations:** © 2025 sik247, released under **CC-BY-4.0**.
- When redistributing, include attribution: *“sik247 / LEXPT Law SFT (CAP subset)”*.

---

## Citation

If you use this dataset, please cite:

```
sik247. LEXPT Law SFT (CAP subset). 2025. Hugging Face Dataset.
```

Please also acknowledge the public-domain opinion sources (e.g., CAP) per their attribution guidance.

---

## Maintainer

- **Author/Maintainer:** `sik247`
- Issues/requests: open a Discussion on the dataset page.

---

## Changelog

- **v1.0:** Initial release with CAP-based opinion excerpts, 15 task templates, and ChatML records. Planned for subsequent versions: updated counts and additional jurisdictions.
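
---

## Example: Case-Level Split Assignment

The split policy stated earlier (all tasks for the same case stay in one split) could be implemented roughly as below. This is a minimal sketch, not the procedure used to build this dataset. The `case_id` field, the helper names `assign_split` and `split_by_case`, and the 80/10/10 proportions are assumptions for illustration, since the record schema is not described in this card.

```python
import hashlib
from collections import defaultdict


def assign_split(case_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a case identifier to a split name deterministically."""
    # Hashing the case id means every task derived from that case
    # lands in the same bucket, regardless of record order.
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_case(records, key="case_id"):
    """Group records into splits so that no case spans two splits."""
    splits = defaultdict(list)
    for rec in records:
        splits[assign_split(rec[key])].append(rec)
    return dict(splits)


# Hypothetical records; `case_id` and `task` are assumed field names.
records = [
    {"case_id": "case-001", "task": "summary"},
    {"case_id": "case-001", "task": "irac"},
    {"case_id": "case-002", "task": "summary"},
    {"case_id": "case-002", "task": "checklist"},
]
splits = split_by_case(records)

# Leakage check: each case id must appear in exactly one split.
seen = {}
for name, rows in splits.items():
    for row in rows:
        assert seen.setdefault(row["case_id"], name) == name
```

Hashing the identifier, rather than shuffling records, keeps the assignment stable when new cases are added later. The resulting proportions are only approximate, especially for small collections.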