esherialabs/lexchat-Llama-3.1-8B-GRPO — Kenya Court Submissions Generator

A Kenya-first, RL-tuned Llama-3.1-8B checkpoint (GRPO on our SFT base) engineered for court-ready submissions. Rewarded for spine discipline (BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought), side alignment, jurisdiction purity, and wordband control—so drafts land clean and on-spec. Purpose-built to democratize legal drafting capacity for NGOs, legal aid clinics, community justice centers, and paralegal programs under advocate supervision.


1) Model Summary

  • Task: Draft structured submissions from facts + issues + side + posture for Kenyan courts.
  • Output spine: BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought.
  • RL objective: Reward the policy for structure, side correctness, Kenya-only doctrine/lexicon, and wordband discipline; lightly reward doctrine coverage keyed to the prompted issues.

2) Intended Use & Audience

Who this is for

  • NGOs / CBOs, legal-aid clinics, community justice centers.
  • Paralegals and law students working under advocate supervision.
  • Civic-tech / research teams building access-to-justice tooling.

Primary use cases

  • First-pass drafting (injunctions, JR, land, employment, constitutional).
  • Counter-position generation (e.g., Respondent reply) for client education.
  • Issue-spotting & pedagogy with a Kenya-specific doctrinal spine.

Out of scope

  • Direct consumer legal advice without a licensed advocate.
  • Non-Kenyan jurisdictions or filings without human review.
  • Automated mass-filing or claim-spam.

3) Training Data for RL

  • Task pool: Prompts synthesized from esherialabs/kenya-court-submissions-qa (15k Q/A SFT set), plus posture/side variants and short issue packs (rule keys per remedy: Giella/Nguruman tests, proportionality, JR standards, indefeasibility, etc.).
  • Targets: No single “gold text.” Rewards are programmatic (structure/stance/jurisdiction/length/cite-format) with light doctrine-coverage checks sourced from curated rulepacks.
  • Privacy & provenance: No PII; users remain responsible for compliance with the Kenya Data Protection Act, 2019.

4) RL Training Procedure (GRPO)

  • Base policy: our Kenya-legal SFT checkpoint on Lexchat-Llama-3.1-8B-Instruct.
  • Parameter-efficient RL: LoRA updates only; the 4-bit base stays frozen.

  • Loading: 4-bit NF4 (double-quant), bf16 compute; gradient checkpointing.
  • LoRA: r=32, alpha=16, dropout=0.05 on attention + MLP projections.
  • GRPO config: prompt budget 256 tokens, completion window 768 tokens; num_generations=6 per prompt; batch size ≈1 (memory-bounded); cosine LR with warmup; gradient clip 0.1.
  • Optimizer: memory-friendly AdamW (paged) suitable for low-VRAM loops.
  • Hardware profile: 1× NVIDIA A10 (24 GB) with ~196 GiB host RAM; 4-bit loading plus LoRA keeps rollout latency in check.
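The bullets above map roughly onto the trl / peft / bitsandbytes configuration objects. A hedged sketch only: parameter names follow recent library releases and can vary by version, and `output_dir`, `warmup_ratio`, and the paged-AdamW variant are assumptions not stated in this card.

```python
# Illustrative config sketch for the GRPO run described above.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import GRPOConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 with double quantization
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # attention + MLP
    task_type="CAUSAL_LM",
)
grpo = GRPOConfig(
    output_dir="grpo-lexchat",            # hypothetical path
    max_prompt_length=256,
    max_completion_length=768,
    num_generations=6,
    per_device_train_batch_size=6,        # assumption: TRL counts completions,
                                          # so 6 = one prompt group per step
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                    # assumption: warmup fraction not stated
    max_grad_norm=0.1,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
)
```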

Why GRPO here? Each prompt spawns 6 completions; relative advantages are computed within the group, so winners are pushed up while underperformers are suppressed—an implicit self-competition that works well with tiny batches and mixed, programmatic rewards.
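That group-relative baseline can be sketched in a few lines (illustrative only: the reward values are made up, and the real loop adds policy-gradient machinery and a KL penalty on top of these advantages):

```python
# Sketch of GRPO's within-group advantage for one prompt's rollouts.
import statistics

def group_relative_advantages(rewards):
    """Centre each reward on the group mean and scale by the group std,
    so completions compete only against their siblings."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard zero-variance groups
    return [(r - mean) / std for r in rewards]

rewards = [3.5, 2.0, 4.0, 2.0, 3.0, 1.5]  # one group of 6 completions
advs = group_relative_advantages(rewards)  # winners > 0, laggards < 0
```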


5) Reward Stack

All heads are summed. A “perfectly formatted, on-side, on-jurisdiction, on-band” answer tops out near 4.0. Partial structure still earns partial credit to avoid collapse.

  1. Section-Order Rewards (0 → 1.5)

    • soft_spine_reward (+0.5): all five section headers present in order (case-insensitive; tolerates spacing).
    • strict_spine_reward (+0.5): headers present with newline discipline and bullet semantics where required (• in Application).
    • header_quality_reward (+0.5): BLUF starts with a clear, one-paragraph thesis; Relief ends with concrete prayers.
  2. Side-Alignment Reward (0 → 0.5)

    • Checks stance markers (“Applicant/Respondent/Plaintiff/Defendant/Appellant” etc.) against the instructed side; penalizes self-contradictions (“we submit” vs. opposing relief).
  3. Jurisdiction-Purity Reward (0 → 0.5)

    • Penalizes foreign law unless flagged persuasive; rewards Kenya-centric lexicon and neutral citation style (no links).
    • Lightweight denylist for non-Kenya reporters; allowlist for common Kenya tests (e.g., Giella, Nguruman, proportionality, Article 47, Anarita Karimi / Mumo Matemu standards).
  4. Wordband Reward (0 → 0.5)

    • Pays out when total tokens land within a configured operational band (e.g., ~450–700 words for typical submissions).
  5. Cite-Format Reward (0 → 0.5)

    • Encourages case names/neutral citations; discourages URLs; rewards consistent formatting (no pin-cite hallucinations).
  6. Doctrine-Coverage Reward (0 → 0.5)

    • Matches issue-keyed rulepacks (e.g., injunction → Giella/Nguruman, balance of convenience, irreparable harm); fuzzy match with tolerance for paraphrase.

No separate judge model. All verifiers are deterministic string/AST checks and rulepack lookups. This keeps the loop single-model and hardware-friendly.
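Two of the heads above can be sketched as deterministic checks (a minimal illustration: function names mirror the card, but the partial-credit rule and band defaults are simplified assumptions):

```python
# Sketches of the soft spine and wordband reward heads.
import re

SPINE = ["BLUF", "Governing Rules", "Controlling Holdings",
         "Application", "Relief Sought"]

def soft_spine_reward(text, credit=0.5):
    """Up to +0.5 when all five headers appear in order
    (case-insensitive, whitespace-tolerant); partial credit
    per in-order header to avoid reward collapse."""
    pos, found = 0, 0
    for header in SPINE:
        pattern = re.compile(r"\s+".join(map(re.escape, header.split())),
                             re.IGNORECASE)
        m = pattern.search(text, pos)
        if m:
            found += 1
            pos = m.end()  # later headers must appear after this one
    return credit * found / len(SPINE)

def wordband_reward(text, lo=450, hi=700, credit=0.5):
    """+0.5 when the word count lands inside the operational band."""
    return credit if lo <= len(text.split()) <= hi else 0.0
```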


6) Verifier Mechanics

  • Deterministic parsers, not regex soup: string/section parsers and small ASTs for the five-part spine; resilient to whitespace.
  • Progressive strictness: soft → strict checks to provide dense early reward while the policy learns layout, then newline and bullet fidelity.
  • Realtime logging: per-head scores + sampled completions are logged for fast failure analysis (e.g., off-side or foreign-law creep).
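The strict end of that soft → strict ladder might look like this (a sketch: the real verifier builds a small AST, and confining `•` bullets strictly to the Application section is a simplified reading of the rule):

```python
# Sketch of a strict spine check: headers on their own lines, in order,
# with bullets allowed only inside the Application section.
def strict_spine_check(text):
    headers = ["BLUF", "Governing Rules", "Controlling Holdings",
               "Application", "Relief Sought"]
    lines = [ln.strip() for ln in text.splitlines()]
    idx = []
    for h in headers:
        if h not in lines:          # header must sit on its own line
            return False
        idx.append(lines.index(h))
    if idx != sorted(idx):          # headers must appear in spine order
        return False
    app, relief = idx[3], idx[4]
    for i, ln in enumerate(lines):  # bullets only between Application
        if ln.startswith("•") and not (app < i < relief):
            return False
    return True
```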

7) Small-Model Tactics

  • LoRA on the highest-leverage submodules (attn/MLP) only.
  • Conservative LR schedule + clip=0.1 to tame gradient variance.
  • 6-way sampling at batch size ~1 so GRPO’s relative baseline still works under tight VRAM.
  • 4-bit loading to keep rollouts snappy despite multiple verifier passes.

8) Prompting & Inference

Recommended chat format

<|system|>
You are a Kenya court submissions generator. Output MUST follow:
BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought.
Kenya authorities only unless marked persuasive.
<|user|>
Draft from the Applicant’s perspective in a Judicial Review (interlocutory stay).
Facts: [salient facts].
Issues: (1) Article 47 procedural fairness (2) Threshold for stay pending JR.
Constraints: Max ~600 words. Submissions style. Bullets only in Application.
<|assistant|>

Python (merged checkpoint)

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "esherialabs/lexchat-Llama-3.1-8B-GRPO"  # replace with final repo id
tok = AutoTokenizer.from_pretrained(name, use_fast=True)
m = AutoModelForCausalLM.from_pretrained(name, torch_dtype="auto", device_map="auto")

prompt = "<|system|>\n...see above...\n<|user|>\n...case brief...\n<|assistant|>\n"
ids = tok(prompt, return_tensors="pt").to(m.device)
out = m.generate(**ids, max_new_tokens=900, temperature=0.25, top_p=0.9, do_sample=True, eos_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids['input_ids'].shape[1]:], skip_special_tokens=True))

Inference policy

  • Temperature 0.2–0.35, top_p=0.9; max_new_tokens ≤ 900.
  • Post-gen linter: spine order, wordband, Kenya-only lexicon, no URLs.
  • Optionally pair with a cite-checker or an allow-listed retrieval layer.
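A minimal post-gen linter along those lines might look like this (a sketch: `FOREIGN_MARKERS` is a placeholder denylist, and the wordband defaults are this card's example values):

```python
# Sketch of a post-generation linter: spine order, wordband,
# no URLs, and a tiny foreign-reporter denylist.
import re

FOREIGN_MARKERS = ["[UKSC]", "U.S.", "All ER"]  # assumption: sample denylist

def lint_draft(text, lo=450, hi=700):
    headers = ["BLUF", "Governing Rules", "Controlling Holdings",
               "Application", "Relief Sought"]
    issues = []
    pos = 0
    for h in headers:                 # each header must follow the last
        i = text.find(h, pos)
        if i < 0:
            issues.append(f"missing/misordered section: {h}")
        else:
            pos = i + len(h)
    words = len(text.split())
    if not lo <= words <= hi:
        issues.append(f"wordband: {words} words (target {lo}-{hi})")
    if re.search(r"https?://", text):
        issues.append("URL present")
    for marker in FOREIGN_MARKERS:    # foreign law must be flagged persuasive
        if marker in text and "persuasive" not in text.lower():
            issues.append(f"foreign authority not flagged: {marker}")
    return issues
```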

9) Evaluation Protocol

Automated

  • Structure pass (%) — five sections, order, newline/bullets.
  • Side correctness (%) — matches instructed side.
  • Jurisdiction purity (%) — Kenya-only unless flagged persuasive.
  • Doctrine coverage (%) — hits issue-keyed rule anchors.
  • Wordband adherence (%) — within operational target.
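Turning per-draft check results into those pass-rate percentages is a small aggregation (a sketch; the metric names here are illustrative):

```python
# Sketch: aggregate boolean per-draft checks into pass-rate percentages.
def aggregate_metrics(results):
    """`results` is a list of dicts mapping metric name -> pass/fail."""
    totals = {}
    for row in results:
        for metric, passed in row.items():
            hit, n = totals.get(metric, (0, 0))
            totals[metric] = (hit + bool(passed), n + 1)
    return {m: round(100.0 * hit / n, 1) for m, (hit, n) in totals.items()}

batch = [
    {"structure": True, "side": True, "jurisdiction": True},
    {"structure": True, "side": False, "jurisdiction": True},
]
scores = aggregate_metrics(batch)  # structure 100.0, side 50.0
```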

Human review

  • Persuasiveness, clarity, and fact fidelity.
  • Filing readiness (minor edits only).
  • Track red-lines to continuously enrich rulepacks/rewards.

10) Safety, Risks, and Limitations

  • Not legal advice. Advocate supervision is mandatory.
  • Evolving law: Doctrine shifts; keep retrieval/cite-bases fresh.
  • Reward hacking risk: Programmatic rewards can be gamed; we audit samples and rotate rulepacks.
  • Bias/coverage gaps: Report misses via Issues; we will iterate.

11) Deployment Blueprint (NGOs & Clinics)

  • Intake templates for facts/issues/side/posture/dates.
  • Generation policy (fixed temperature, max tokens).
  • Automated linter (structure/wordband/jurisdiction).
  • Advocate review checkpoint before filing.
  • Audit & privacy: scrub PII; log drafts/finals; comply with Kenya DPA 2019.

12) Reproducibility

  • Base: esherialabs/lexchat-Llama-3.1-8B-Instruct (SFT on Kenya submissions spine).
  • RL (GRPO): prompt 256, completion 768, num_generations=6; LoRA r=32, α=16, dropout=0.05; 4-bit NF4; bf16 compute; paged AdamW; cosine LR; warmup; clip=0.1; gradient checkpointing.
  • Hardware: single-GPU, 4-bit friendly (A10/4090/L40S-class).
  • Code patterns: single-model loop; deterministic verifiers; no judge model.

13) Example Prompts (NGO scenarios)

Legal aid clinic — eviction

You are drafting from the Applicant’s perspective in a Kenyan Constitutional Petition (interim conservatory orders).
Facts: Community clinic evicted without notice from county-owned premises after serving low-income patients.
Issues: (1) Article 47 procedural fairness (2) Article 40 property/use rights vs public interest (3) Conservatory order threshold.
Constraints: BLUF → Governing Rules → Controlling Holdings → Application → Relief Sought. Kenya authorities only. Max ~600 words.

Paralegal — employment dispute

You are drafting from the Respondent’s perspective in an Employment & Labour Court appeal.
Facts: Termination for cause with documented warnings; claimant alleges unfair termination.
Issues: (1) Fair procedure under Employment Act (2) Burden of proof and remedies.
Constraints: [same as above]

14) Responsible AI & Community Governance

  • Human-in-the-loop: Always route outputs to a supervising advocate before filing.
  • Community input: NGOs and clinics are invited to open issues with examples where the model under-serves marginalized groups or misses doctrine.

15) Versions

  • v2.0.0-GRPO (2025-10-29): RL-tuned on Kenya legal SFT checkpoint; structure/side/jurisdiction rewards; improved spine discipline.
  • v1.0.0-SFT: initial Kenya submissions generator (15k pairs).

16) Citation

Esheria Labs (2025). LexChat 8B — Kenya Court Submissions Generator (GRPO).
https://huggingface.co/esherialabs/lexchat-Llama-3.1-8B-GRPO
Trained on: https://huggingface.co/datasets/esherialabs/kenya-court-submissions-qa

17) Contact

