esherialabs/lexchat-Llama-3.1-8B-GRPO — Kenya Court Submissions Generator

A Kenya-first, RL-tuned Llama-3.1-8B checkpoint (GRPO on our SFT base) engineered for court-ready submissions. Rewarded for spine discipline (BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought), side alignment, jurisdiction purity, and wordband control—so drafts land clean and on-spec. Purpose-built to democratize legal drafting capacity for NGOs, legal aid clinics, community justice centers, and paralegal programs under advocate supervision.


1) Model Summary

  • Task: Draft structured submissions from facts + issues + side + posture for Kenyan courts.
  • Output spine: BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought.
  • RL objective: Reward the policy for structure, side correctness, Kenya-only doctrine/lexicon, and wordband discipline; lightly reward doctrine coverage keyed to the prompted issues.

2) Intended Use & Audience

Who this is for

  • NGOs / CBOs, legal-aid clinics, community justice centers.
  • Paralegals and law students working under advocate supervision.
  • Civic-tech / research teams building access-to-justice tooling.

Primary use cases

  • First-pass drafting (injunctions, JR, land, employment, constitutional).
  • Counter-position generation (e.g., Respondent reply) for client education.
  • Issue-spotting & pedagogy with a Kenya-specific doctrinal spine.

Out of scope

  • Direct consumer legal advice without a licensed advocate.
  • Non-Kenyan jurisdictions or filings without human review.
  • Automated mass-filing or claim-spam.

3) Training Data for RL

  • Task pool: Prompts synthesized from esherialabs/kenya-court-submissions-qa (15k Q/A SFT set), plus posture/side variants and short issue packs (rule keys per remedy: Giella/Nguruman tests, proportionality, JR standards, indefeasibility, etc.).
  • Targets: No single “gold text.” Rewards are programmatic (structure/stance/jurisdiction/length/cite-format) with light doctrine-coverage checks sourced from curated rulepacks.
  • Privacy & provenance: No PII; users remain responsible for compliance with the Kenya Data Protection Act, 2019.

4) RL Training Procedure (GRPO)

  • Base policy: our Kenya-legal SFT checkpoint on Lexchat-Llama-3.1-8B-Instruct.
  • Parameter-efficient RL: LoRA updates only; the 4-bit base stays frozen.

  • Loading: 4-bit NF4 (double-quant), bf16 compute; gradient checkpointing.
  • LoRA: r=32, alpha=16, dropout=0.05 on attention + MLP projections.
  • GRPO config: prompt budget 256 tokens, completion window 768 tokens; num_generations=6 per prompt; batch size ≈1 (memory-bounded); cosine LR with warmup; gradient clip 0.1.
  • Optimizer: memory-friendly AdamW (paged) suitable for low-VRAM loops.
  • Hardware profile: 1× NVIDIA A10 (24 GB) with ~196 GiB host RAM; 4-bit loading plus LoRA keeps rollout latency in check.
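The bullets above map roughly onto the trl / peft / bitsandbytes configuration objects. A hedged sketch only: parameter names follow recent library releases and can vary by version, and `output_dir`, `warmup_ratio`, and the paged-AdamW variant are assumptions not stated in this card.

```python
# Illustrative config sketch for the GRPO run described above.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import GRPOConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 with double quantization
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP
    task_type="CAUSAL_LM",
)
grpo = GRPOConfig(
    output_dir="grpo-lexchat",            # hypothetical path
    max_prompt_length=256,
    max_completion_length=768,
    num_generations=6,
    per_device_train_batch_size=6,        # assumption: TRL counts completions,
                                          # so 6 = one prompt group per step
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                    # assumption: warmup fraction not stated
    max_grad_norm=0.1,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)
```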

Why GRPO here? Each prompt spawns 6 completions; relative advantages are computed within the group, so winners are pushed up while underperformers are suppressed—an implicit self-competition that works well with tiny batches and mixed, programmatic rewards.
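That group-relative baseline can be sketched in a few lines (illustrative only: the reward values are made up, and the real loop adds policy-gradient machinery and a KL penalty on top of these advantages):

```python
# Sketch of GRPO's within-group advantage for one prompt's rollouts.
import statistics

def group_relative_advantages(rewards):
    """Centre each reward on the group mean and scale by the group std,
    so completions compete only against their siblings."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mean) / std for r in rewards]

rewards = [3.5, 2.0, 4.0, 2.0, 3.0, 1.5]  # one group of 6 completions
advs = group_relative_advantages(rewards)  # winners > 0, laggards < 0
```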


5) Reward Stack

All heads are summed. A “perfectly formatted, on-side, on-jurisdiction, on-band” answer tops out near 4.0. Partial structure still earns partial credit to avoid collapse.

  1. Section-Order Rewards (0 → 1.5)

    • soft_spine_reward (+0.5): all five section headers present in order (case-insensitive; tolerates spacing).
    • strict_spine_reward (+0.5): headers present with newline discipline and bullet semantics where required (• in Application).
    • header_quality_reward (+0.5): BLUF starts with a clear, one-paragraph thesis; Relief ends with concrete prayers.
  2. Side-Alignment Reward (0 → 0.5)

    • Checks stance markers (“Applicant/Respondent/Plaintiff/Defendant/Appellant” etc.) against the instructed side; penalizes self-contradictions (“we submit” vs. opposing relief).
  3. Jurisdiction-Purity Reward (0 → 0.5)

    • Penalizes foreign law unless flagged persuasive; rewards Kenya-centric lexicon and neutral citation style (no links).
    • Lightweight denylist for non-Kenya reporters; allowlist for common Kenya tests (e.g., Giella, Nguruman, proportionality, Article 47, Anarita Karimi / Mumo Matemu standards).
  4. Wordband Reward (0 → 0.5)

    • Pays out when total tokens land within a configured operational band (e.g., ~450–700 words for typical submissions).
  5. Cite-Format Reward (0 → 0.5)

    • Encourages case names/neutral citations; discourages URLs; rewards consistent formatting (no pin-cite hallucinations).
  6. Doctrine-Coverage Reward (0 → 0.5)

    • Matches issue-keyed rulepacks (e.g., injunction → Giella/Nguruman, balance of convenience, irreparable harm); fuzzy match with tolerance for paraphrase.

No separate judge model. All verifiers are deterministic string/AST checks and rulepack lookups. This keeps the loop single-model and hardware-friendly.
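Two of the heads above can be sketched as deterministic checks (a minimal illustration: function names mirror the card, but the partial-credit rule and band defaults are simplified assumptions):

```python
# Sketches of the soft spine and wordband reward heads.
import re

SPINE = ["BLUF", "Governing Rules", "Controlling Holdings",
         "Application", "Relief Sought"]

def soft_spine_reward(text, credit=0.5):
    """Up to +0.5 when all five headers appear in order
    (case-insensitive, whitespace-tolerant); partial credit
    per in-order header to avoid reward collapse."""
    pos, found = 0, 0
    for header in SPINE:
        pattern = re.compile(r"\s+".join(map(re.escape, header.split())),
                             re.IGNORECASE)
        m = pattern.search(text, pos)
        if m:
            found += 1
            pos = m.end()  # later headers must appear after this one
    return credit * found / len(SPINE)

def wordband_reward(text, lo=450, hi=700, credit=0.5):
    """+0.5 when the word count lands inside the operational band."""
    return credit if lo <= len(text.split()) <= hi else 0.0
```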


6) Verifier Mechanics

  • Deterministic parsers, not regex soup: string/section parsers and small ASTs for the five-part spine; resilient to whitespace.
  • Progressive strictness: soft → strict checks to provide dense early reward while the policy learns layout, then newline and bullet fidelity.
  • Realtime logging: per-head scores + sampled completions are logged for fast failure analysis (e.g., off-side or foreign-law creep).
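The strict end of that soft → strict ladder might look like this (a sketch: the real verifier builds a small AST, and confining `•` bullets strictly to the Application section is a simplified reading of the rule):

```python
# Sketch of a strict spine check: headers on their own lines, in order,
# with bullets allowed only inside the Application section.
def strict_spine_check(text):
    headers = ["BLUF", "Governing Rules", "Controlling Holdings",
               "Application", "Relief Sought"]
    lines = [ln.strip() for ln in text.splitlines()]
    idx = []
    for h in headers:
        if h not in lines:          # header must sit on its own line
            return False
        idx.append(lines.index(h))
    if idx != sorted(idx):          # headers must appear in spine order
        return False
    app, relief = idx[3], idx[4]
    for i, ln in enumerate(lines):  # bullets only between Application
        if ln.startswith("•") and not (app < i < relief):
            return False
    return True
```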

7) Small-Model Tactics

  • LoRA on the highest-leverage submodules (attn/MLP) only.
  • Conservative LR schedule + clip=0.1 to tame gradient variance.
  • 6-way sampling at batch size ~1 so GRPO’s relative baseline still works under tight VRAM.
  • 4-bit loading to keep rollouts snappy despite multiple verifier passes.

8) Prompting & Inference

Recommended chat format

<|system|>
You are a Kenya court submissions generator. Output MUST follow:
BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought.
Kenya authorities only unless marked persuasive.
<|user|>
Draft from the Applicant’s perspective in a Judicial Review (interlocutory stay).
Facts: [salient facts].
Issues: (1) Article 47 procedural fairness (2) Threshold for stay pending JR.
Constraints: Max ~600 words. Submissions style. Bullets only in Application.
<|assistant|>

Python (merged checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "esherialabs/lexchat-Llama-3.1-8B-GRPO"  # replace with final repo id
tok = AutoTokenizer.from_pretrained(name, use_fast=True)
m = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

prompt = "<|system|>\n...see above...\n<|user|>\n...case brief...\n<|assistant|>\n"
ids = tok(prompt, return_tensors="pt").to(m.device)
out = m.generate(**ids, max_new_tokens=900, temperature=0.25, top_p=0.9, do_sample=True, eos_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids['input_ids'].shape[1]:], skip_special_tokens=True))

Inference policy

  • Temperature 0.2–0.35, top_p=0.9; max_new_tokens ≤ 900.
  • Post-gen linter: spine order, wordband, Kenya-only lexicon, no URLs.
  • Optionally pair with a cite-checker or an allow-listed retrieval layer.
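A minimal post-gen linter along those lines might look like this (a sketch: `FOREIGN_MARKERS` is a placeholder denylist, and the wordband defaults are this card's example values):

```python
# Sketch of a post-generation linter: spine order, wordband,
# no URLs, and a tiny foreign-reporter denylist.
import re

FOREIGN_MARKERS = ["[UKSC]", "U.S.", "All ER"]  # assumption: sample denylist

def lint_draft(text, lo=450, hi=700):
    headers = ["BLUF", "Governing Rules", "Controlling Holdings",
               "Application", "Relief Sought"]
    issues = []
    pos = 0
    for h in headers:                 # each header must follow the last
        i = text.find(h, pos)
        if i < 0:
            issues.append(f"missing/misordered section: {h}")
        else:
            pos = i + len(h)
    words = len(text.split())
    if not lo <= words <= hi:
        issues.append(f"wordband: {words} words (target {lo}-{hi})")
    if re.search(r"https?://", text):
        issues.append("URL present")
    for marker in FOREIGN_MARKERS:    # foreign law must be flagged persuasive
        if marker in text and "persuasive" not in text.lower():
            issues.append(f"foreign authority not flagged: {marker}")
    return issues
```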

9) Evaluation Protocol

Automated

  • Structure pass (%) — five sections, order, newline/bullets.
  • Side correctness (%) — matches instructed side.
  • Jurisdiction purity (%) — Kenya-only unless flagged persuasive.
  • Doctrine coverage (%) — hits issue-keyed rule anchors.
  • Wordband adherence (%) — within operational target.
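Turning per-draft check results into those pass-rate percentages is a small aggregation (a sketch; the metric names here are illustrative):

```python
# Sketch: aggregate boolean per-draft checks into pass-rate percentages.
def aggregate_metrics(results):
    """`results` is a list of dicts mapping metric name -> pass/fail."""
    totals = {}
    for row in results:
        for metric, passed in row.items():
            hit, n = totals.get(metric, (0, 0))
            totals[metric] = (hit + bool(passed), n + 1)
    return {m: round(100.0 * hit / n, 1) for m, (hit, n) in totals.items()}

batch = [
    {"structure": True, "side": True, "jurisdiction": True},
    {"structure": True, "side": False, "jurisdiction": True},
]
scores = aggregate_metrics(batch)  # structure 100.0, side 50.0
```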

Human review

  • Persuasiveness, clarity, and fact fidelity.
  • Filing readiness (minor edits only).
  • Track red-lines to continuously enrich rulepacks/rewards.

10) Safety, Risks, and Limitations

  • Not legal advice. Advocate supervision is mandatory.
  • Evolving law: Doctrine shifts; keep retrieval/cite-bases fresh.
  • Reward hacking risk: Programmatic rewards can be gamed; we audit samples and rotate rulepacks.
  • Bias/coverage gaps: Report misses via Issues; we will iterate.

11) Deployment Blueprint (NGOs & Clinics)

  • Intake templates for facts/issues/side/posture/dates.
  • Generation policy (fixed temperature, max tokens).
  • Automated linter (structure/wordband/jurisdiction).
  • Advocate review checkpoint before filing.
  • Audit & privacy: scrub PII; log drafts/finals; comply with Kenya DPA 2019.

12) Reproducibility

  • Base: esherialabs/lexchat-Llama-3.1-8B-Instruct (SFT on Kenya submissions spine).
  • RL (GRPO): prompt 256, completion 768, num_generations=6; LoRA r=32, α=16, dropout=0.05; 4-bit NF4; bf16 compute; paged AdamW; cosine LR; warmup; clip=0.1; gradient checkpointing.
  • Hardware: single-GPU, 4-bit friendly (A10/4090/L40S-class).
  • Code patterns: single-model loop; deterministic verifiers; no judge model.

13) Example Prompts (NGO scenarios)

Legal aid clinic — eviction

You are drafting from the Applicant’s perspective in a Kenyan Constitutional Petition (interim conservatory orders).
Facts: Community clinic evicted without notice from county-owned premises after serving low-income patients.
Issues: (1) Article 47 procedural fairness (2) Article 40 property/use rights vs public interest (3) Conservatory order threshold.
Constraints: BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought. Kenya authorities only. Max ~600 words.

Paralegal — employment dispute

You are drafting from the Respondent’s perspective in an Employment & Labour Court appeal.
Facts: Termination for cause with documented warnings; claimant alleges unfair termination.
Issues: (1) Fair procedure under Employment Act (2) Burden of proof and remedies.
Constraints: [same as above]

14) Responsible AI & Community Governance

  • Human-in-the-loop: Always route outputs to a supervising advocate before filing.
  • Community input: NGOs and clinics are invited to open issues with examples where the model under-serves marginalized groups or misses doctrine.

15) Versions

  • v2.0.0-GRPO (2025-10-29): RL-tuned on Kenya legal SFT checkpoint; structure/side/jurisdiction rewards; improved spine discipline.
  • v1.0.0-SFT: initial Kenya submissions generator (15k pairs).

16) Citation

Esheria Labs (2025). LexChat 8B — Kenya Court Submissions Generator (GRPO).
https://huggingface.co/esherialabs/lexchat-Llama-3.1-8B-GRPO
Trained on: https://huggingface.co/datasets/esherialabs/kenya-court-submissions-qa

17) Contact

