a11y-public-coder

Open-source accessibility coding assistant for the public sector. WCAG 2.2 Level AA conformance, Drupal 11, PHP 8.3, Drush 12, Python 3.12, and Playwright (TypeScript) with both axe-core and Siteimprove Alfa.

Version 0.9.0 · License MIT · Released 2026-05-17

HuggingFace — weights · Install via Ollamaollama pull rockypod/public-a11y-coder · GitHub — exam, dataset, training pipeline

Quick reference

4B 14B
Base model Qwen/Qwen3-4B Qwen/Qwen3-14B
Quantization Q4_K_M GGUF (~2.5 GB) Q4_K_M GGUF (~9 GB)
Recommended use Demo, non-technical explanation, portable inference Daily-driver technical work, OpenWebUI deployment
Exam score 73.3% (22.0/30) 76.7% (23.0/30)
Lift vs base +28.3% (from 45.0% baseline) +23.4% (from 53.3% baseline)
Ollama tag rockypod/public-a11y-coder:4b rockypod/public-a11y-coder:14b

What's in this repo

Path Description
exam/a11y-30q.md Full 30-question evaluation exam with rubric
exam/run_exam.py Exam runner — collects responses and scores
exam/baselines/ Pre-training baseline grades (qwen3:4b, 8b, 14b)
exam/trained/ Post-training grades per model size
dataset/tier*.jsonl Full training corpus — 1,930 pairs across 18 tiers
train.py Full training pipeline (consolidate → LoRA → merge → GGUF)
Modelfile Ollama Modelfile (4B production ChatML template)

Large artifacts (checkpoints, merged HF weights, GGUF) are not in this repo — download GGUFs from the HuggingFace model page or pull via Ollama.

Intended use

a11y-public-coder is designed for use by government agencies, public-sector developers, accessibility professionals, and any developer maintaining Drupal 11 sites that must meet WCAG 2.2 Level AA. The model produces:

  • Drupal 11 module, theme, and Twig code that follows accessibility best practices
  • Drush 12 CLI commands and custom command authoring
  • Python 3.12 utility scripts for accessibility-aware file operations (PDF text layer detection, alt audit, heading hierarchy)
  • Playwright (TypeScript) test scaffolds using @axe-core/playwright and @siteimprove/alfa-playwright
  • WCAG 2.2 AA explanations cited by success criterion number, suitable for both developers and non-technical content editors

The 4B variant is optimized for portable demonstrations (runs comfortably in a Windows 11 VM with 8 GB allocation) and explanation-first responses. The 14B variant is the primary daily-driver, targeted at OpenWebUI deployment on agency or homelab hardware.

Privacy-first training

This model was trained under explicit privacy constraints documented in the dataset card and verifiable in the public training corpus:

  • No PII in any training entry — no real names, addresses, emails, phone numbers, case numbers, or social security numbers
  • No real URLs, hostnames, or production domain names — all examples use example.gov, gov.example.org, or agency.example placeholders
  • No scraped production agency content — every training example was authored from publicly available official documentation: drupal.org, php.net, docs.python.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, drush.org
  • Full dataset, exam questions, and per-question grading results are public — see the a11y-public-coder-dataset repository

The privacy-first training approach minimizes the risk of memorized PII surfacing in outputs, but does not eliminate standard LLM safety considerations.

Security and compliance — agency responsibility

a11y-public-coder is designed for self-hosted deployment. Deploying organizations remain responsible for their own security and privacy posture:

  • Self-host: run via Ollama, vLLM, llama.cpp, or similar so that no prompts leave your network
  • No certifications: this model has not been independently certified against NIST 800-53, FedRAMP, CJIS, HIPAA, FERPA, or state-specific frameworks. Agencies must independently validate fitness for their compliance context
  • No sensitive data in prompts: do not paste citizen PII, case numbers, or other sensitive content. The model is a code/audit assistant, not a data-handling system
  • Output review: model output is a suggestion, not authoritative. Human review is required before deployment
  • Access controls and audit logging are the operator's responsibility

Training methodology

a11y-public-coder was trained using the CRAFTED℠ (Continuous Retrieval-Augmented Fine-Tuning, Evaluate, Deploy) pipeline:

  1. Source corpus assembly — 1,930 training pairs generated from official documentation (drupal.org, drush.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, php.net, docs.python.org) by a local teacher model (qwen3:30b via Ollama)

  2. CRAFTED℠ correction stream — Every generated entry was reviewed against domain-specific failure-mode filters (e.g. the WCAG 2.5.8 = 24×24 vs 2.5.5 = 44×44 contamination check, the Drupal 7/8/9 → Drupal 11 API leakage check, the Python-vs-TypeScript Playwright fallback check). 1,925 of 1,930 entries passed auto-acceptance with rule-based filters; 5 were manually corrected; 2 additional issues were flagged by a Drupal-specific D7/D8 API validator and corrected. Final corrected entries: 1,930/1,930.

  3. Fine-tuning — Unsloth + LoRA (r=16, alpha=16, no dropout) on NVIDIA RTX 3090 Ti, 4 epochs at learning rate 2e-4 with cosine schedule. The 4B run reweights demo_friendly entries by 1.5× and downsamples entries with len(assistant) > 1800 chars by 0.7× to favor explanation-leaning content; the 14B run uses the full distribution without reweighting.

  4. Conversion — GGUF via pinned llama.cpp commit 57819b8d4 with --outtype f16, quantized to Q4_K_M for serving. Modelfile uses ChatML template override for tokenizer consistency.

The full pipeline is reproducible from the training scripts in this repository.

Dataset

The training corpus is 1,930 high-quality instruction-response pairs across 18 tiers, fully open and downloadable from the dataset repository or from dataset/ in this repo:

Tier Domain Entries
1 Drupal 11 core fundamentals 100
2 Drupal 11 contrib stack (Webform, Paragraphs, Views, Pathauto, Metatag) 100
3 Drupal 11 Twig 3 templating 100
4 Drupal 11 custom modules 100
5 Drupal 11 accessibility patterns 100
6 Drupal-flavored PHP 8.3 100
7 Drush 12 CLI usage 100
8 Drush 12 custom command authoring 100
9 Python 3.12 folder/file utilities 100
10 Python 3.12 file conversion 100
11 Python 3.12 accessibility-aware utilities 100
12 Playwright (TypeScript) fundamentals 100
13 Playwright + @axe-core/playwright 140
14 Playwright + @siteimprove/alfa-playwright 130
15 WCAG 2.2 AA — pre-2.2 carryover SCs 80
16 WCAG 2.2-new success criteria (9 new SCs) 140
17 Negative-example / contamination correction pairs 140
18 End-to-end multi-domain scenarios 100
Total 1,930

Evaluation

Models are evaluated against a 30-question exam covering all training domains, scored Full (1.0) / Partial (0.5) / Fail (0.0) per question, max 30.0 points. The exam is published in full, including grading rubrics: see exam/a11y-30q.md.

Pre-training baselines and post-training results are published in exam/, with per-question grades:

Summary

Model Total Percentage
qwen3:4b baseline 13.5/30 45.0%
qwen3:8b baseline 17.0/30 56.7%
qwen3:14b baseline 16.0/30 53.3%
a11y-public-coder:4b (trained) 22.0/30 73.3%
a11y-public-coder:14b (trained) 23.0/30 76.7%

Per-domain results — 4B trained vs baseline qwen3:4b

Domain Baseline Trained 4B Lift
Drupal 11 2.0/8 (25%) 6.0/8 (75%) +4.0
PHP 8.3 0.5/2 (25%) 1.0/2 (50%) +0.5
Drush 12 2.0/3 (67%) 1.5/3 (50%) -0.5 ⬇
Python 3.12 2.5/4 (63%) 4.0/4 (100%) +1.5
Playwright + axe-core 0.5/3 (17%) 2.0/3 (67%) +1.5
Playwright + Alfa 0.5/2 (25%) 1.5/2 (75%) +1.0
WCAG 2.2 AA (carryover) 3.0/4 (75%) 3.0/4 (75%) 0
WCAG 2.2-new ⭐ 1.5/3 (50%) 2.0/3 (67%) +0.5
Negative/contamination gate 1.0/1 (100%) 1.0/1 (100%) 0 ✓
Total 13.5/30 (45.0%) 22.0/30 (73.3%) +8.5 (+28.3%)

Per-domain results — 14B trained vs baseline qwen3:14b

Domain Baseline Trained 14B Lift
Drupal 11 3.0/8 (37.5%) 6.5/8 (81.3%) +3.5
PHP 8.3 1.5/2 (75.0%) 1.5/2 (75.0%) 0
Drush 12 1.5/3 (50.0%) 1.0/3 (33.3%) -0.5 ⬇
Python 3.12 2.5/4 (62.5%) 3.5/4 (87.5%) +1.0
Playwright + axe-core 1.0/3 (33.3%) 2.5/3 (83.3%) +1.5
Playwright + Alfa 1.0/2 (50.0%) 2.0/2 (100%) +1.0
WCAG 2.2 AA (carryover) 3.0/4 (75.0%) 3.0/4 (75.0%) 0
WCAG 2.2-new ⭐ 1.5/3 (50.0%) 2.0/3 (66.7%) +0.5
Negative/contamination gate 1.0/1 (100%) 1.0/1 (100%) 0 ✓
Total 16.0/30 (53.3%) 23.0/30 (76.7%) +7.0 (+23.4%)

Running the exam yourself

# Against a trained model already loaded in Ollama
python exam/run_exam.py --model rockypod/public-a11y-coder:4b --output exam/trained/4b

# Score after filling grades.json
python exam/run_exam.py --score exam/trained/4b

Grading is manual (Full/Partial/Fail per rubric in exam/a11y-30q.md).

Reproducing training

# On a CUDA GPU server with the Unsloth venv installed
nohup env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True TORCHDYNAMO_DISABLE=1 \
  python train.py --size 4b > logs/train_4b.log 2>&1 &

TORCHDYNAMO_DISABLE=1 is required — Qwen3 + Unsloth triggers Triton JIT compilation which fails on CUDA driver/toolkit version mismatches common on Rocky Linux GPU hosts.

Usage

Ollama (local)

ollama run rockypod/public-a11y-coder:14b
# or for the portable demo model:
ollama run rockypod/public-a11y-coder:4b

OpenWebUI

Add the model under Settings → Models → Ollama, point to your Ollama endpoint (default http://localhost:11434), select rockypod/public-a11y-coder:14b from the model list.

HuggingFace Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rockypod/a11y-public-coder-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Drupal 11 Twig snippet for an accessible image field with a skip-link-friendly heading structure."}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Known limitations

The v0.9.0 release ships with documented gaps to be addressed in v1.0:

  1. Drush flag accuracy — The 4B variant occasionally fabricates non-existent command flags (e.g. inventing --target or --exclude on commands where those flags do not exist). This is a training data quality issue traced to tier 7 generation; v1.0 will include a Drush command-reference validator before retraining.

  2. Contrast ratio computation — Small models cannot reliably compute color contrast ratios from arbitrary hex pairs. The model correctly identifies SC 1.4.3 (Contrast — Minimum) and can recall specific examples that appear in training (#767676 on white = 4.48:1), but does not generalize to compute ratios for novel inputs. Recommend pairing with a deterministic contrast checker.

  3. WCAG 2.2-new exception coverage — SC 2.5.8 (Target Size — Minimum) has five distinct exception cases (offset, essential, inline, user-agent-controlled, equivalent). The 4B reliably outputs the headline 24×24 CSS pixels AA threshold but covers only one of the five exception cases consistently. v1.0 will expand tier 16 with dedicated entries per exception type.

  4. SC-to-SC discrimination — The 4B occasionally confuses related success criteria (e.g. cites SC 2.1.1 + 2.1.2 for a missing button role where 4.1.2 is the primary criterion). v1.0 will add SC-discrimination pair entries to tier 17.

  5. Drupal 11 vs Drupal 10 distinction — While the dataset targets Drupal 11 exclusively, the underlying base model has substantial Drupal 7/8/9 pretraining priors. The contamination gate (tier 17 negative examples) holds at 100% on the exam, but in long-form generation some D7-era patterns may surface. Always validate generated Drupal code against the actual D11 API.

Recommended use cases

Strong fit:

  • Generating Drupal 11 module scaffolds with accessibility baked in
  • Writing Playwright + axe-core / Alfa test files for agency sites
  • Drafting Python utility scripts for accessibility audits (PDF text layer detection, alt text auditing, heading hierarchy)
  • Explaining WCAG 2.2 success criteria to non-technical content editors
  • Drush 12 natural-language to command translation (with verification)

Use with caution:

  • Contrast ratio calculations (verify with a deterministic checker)
  • Drush command flags (verify against drush help <command>)
  • Drupal 8/9 maintenance (this model is Drupal 11-targeted)

Not designed for:

  • General-purpose coding outside the trained domains
  • Production-critical accessibility certification without human review
  • Handling sensitive citizen data in prompts

Roadmap

Version Target Focus
v0.9.0 shipped Initial release, baselines published, ship gate intentionally below 80% with documented limitations
v0.9.5 ~6 weeks Drush flag validation pass, contrast hex-pair expansion, SC 2.5.8 exception coverage
v1.0.0 ~10 weeks All v0.9.0 limitations addressed, ≥85% on the 30Q exam

The CRAFTED℠ methodology means each version uses real-world exam failures and user-reported issues as the correction stream for the next training cycle. The v1.0 release will include an expanded 60-question exam.

Reproducibility

This release is reproducible end-to-end from the public artifacts:

Citation

@misc{a11y-public-coder-v0.9.0,
  author       = {RockyPod},
  title        = {a11y-public-coder: An open-source accessibility coding assistant for the public sector},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rockypod/a11y-public-coder-4b}},
}

Acknowledgments

License

MIT. See LICENSE for full text. Free for any use including commercial, including by government agencies.

Downloads last month
40
GGUF
Model size
15B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for rockypod/a11y-public-coder

Finetuned
Qwen/Qwen3-14B
Quantized
(178)
this model

Evaluation results