---
license: mit
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - accessibility
  - wcag
  - wcag-2.2
  - drupal
  - drupal-11
  - php
  - drush
  - python
  - playwright
  - axe-core
  - siteimprove-alfa
  - government
  - public-sector
  - code
base_model:
  - Qwen/Qwen3-4B
  - Qwen/Qwen3-14B
datasets:
  - rockypod/a11y-public-coder-dataset
model-index:
  - name: a11y-public-coder
    results:
      - task:
          type: text-generation
          name: WCAG 2.2 AA Accessibility Coding Exam
        metrics:
          - type: accuracy
            name: 30-question exam (4B)
            value: 73.3
          - type: accuracy
            name: 30-question exam (14B)
            value: 76.7
---

# a11y-public-coder

**Open-source accessibility coding assistant for the public sector.** WCAG 2.2 Level AA conformance, Drupal 11, PHP 8.3, Drush 12, Python 3.12, and Playwright (TypeScript) with both axe-core and Siteimprove Alfa.

> Version **0.9.0** · License **MIT** · Released 2026-05-17

**[HuggingFace — weights](https://huggingface.co/rockypod/a11y-public-coder-4b)** ·
**[Install via Ollama](https://ollama.com/rockypod/public-a11y-coder)** — `ollama pull rockypod/public-a11y-coder` ·
**[GitHub — exam, dataset, training pipeline](https://github.com/rockypod/public_a11y_coder)**

## Quick reference

| | 4B | 14B |
|---|---|---|
| **Base model** | `Qwen/Qwen3-4B` | `Qwen/Qwen3-14B` |
| **Quantization** | Q4_K_M GGUF (~2.5 GB) | Q4_K_M GGUF (~9 GB) |
| **Recommended use** | Demo, non-technical explanation, portable inference | Daily-driver technical work, OpenWebUI deployment |
| **Exam score** | **73.3%** (22.0/30) | **76.7%** (23.0/30) |
| **Lift vs base** | **+28.3 percentage points** (from 45.0% baseline) | **+23.3 percentage points** (from 53.3% baseline) |
| **Ollama tag** | `rockypod/public-a11y-coder:4b` | `rockypod/public-a11y-coder:14b` |

## What's in this repo

| Path | Description |
|---|---|
| `exam/a11y-30q.md` | Full 30-question evaluation exam with rubric |
| `exam/run_exam.py` | Exam runner — collects responses and scores |
| `exam/baselines/` | Baseline grades before fine-tuning (qwen3:4b, 8b, 14b) |
| `exam/trained/` | Grades after fine-tuning, per model size |
| `dataset/tier*.jsonl` | Full training corpus — 1,930 pairs across 18 tiers |
| `train.py` | Full training pipeline (consolidate → LoRA → merge → GGUF) |
| `Modelfile` | Ollama Modelfile (4B production ChatML template) |

Large artifacts (checkpoints, merged HF weights, GGUF) are not in this repo — download GGUFs from the HuggingFace model page or pull via Ollama.

## Intended use

`a11y-public-coder` is designed for use by government agencies, public-sector developers, accessibility professionals, and any developer maintaining Drupal 11 sites that must meet WCAG 2.2 Level AA. The model produces:

- Drupal 11 module, theme, and Twig code that follows accessibility best practices
- Drush 12 CLI commands and custom command authoring
- Python 3.12 utility scripts for accessibility-aware file operations (PDF text-layer detection, alt-text auditing, heading-hierarchy checks)
- Playwright (TypeScript) test scaffolds using `@axe-core/playwright` and `@siteimprove/alfa-playwright`
- WCAG 2.2 AA explanations cited by success criterion number, suitable for both developers and non-technical content editors

The 4B variant is optimized for portable demonstrations (it runs comfortably in a Windows 11 VM with an 8 GB allocation) and explanation-first responses. The 14B variant is the primary daily driver, targeted at OpenWebUI deployment on agency or homelab hardware.

## Privacy-first training

This model was trained under explicit privacy constraints documented in the dataset card and verifiable in the public training corpus:

- **No PII** in any training entry — no real names, addresses, emails, phone numbers, case numbers, or social security numbers
- **No real URLs, hostnames, or production domain names** — all examples use `example.gov`, `gov.example.org`, or `agency.example` placeholders
- **No scraped production agency content** — every training example was authored from publicly available official documentation: drupal.org, php.net, docs.python.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, drush.org
- **Full dataset, exam questions, and per-question grading results are public** — see the [`a11y-public-coder-dataset`](https://huggingface.co/datasets/rockypod/a11y-public-coder-dataset) repository

The privacy-first training approach minimizes the risk of memorized PII surfacing in outputs, but does not eliminate standard LLM safety considerations.
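
One way to spot-check the placeholder-only URL rule against the public corpus is a small scan like the one below. It is a minimal sketch, not the project's own validator: the function names, the regex, and the decision to scan raw JSONL text rather than parsed fields are all assumptions. The allow-list combines the three placeholder domains with the documentation sources named above.

```python
import re
from pathlib import Path

# Placeholder domains named in the privacy rules above.
PLACEHOLDER_HOSTS = {"example.gov", "gov.example.org", "agency.example"}
# Official documentation sources named above (assumed acceptable when cited).
DOC_HOSTS = {
    "drupal.org", "php.net", "docs.python.org", "playwright.dev",
    "alfa.siteimprove.com", "w3.org", "drush.org",
}
# Only hostnames inside http(s) URLs are checked, to avoid matching file names.
URL_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def normalize(host: str) -> str:
    """Lowercase the host and drop stray dots picked up from sentence punctuation."""
    return host.lower().strip(".")


def is_allowed(host: str) -> bool:
    """True if host equals, or is a subdomain of, an allow-listed domain."""
    host = normalize(host)
    allowed = PLACEHOLDER_HOSTS | DOC_HOSTS
    return any(host == d or host.endswith("." + d) for d in allowed)


def unexpected_hosts(text: str) -> set[str]:
    """Return URL hostnames in the text that are not on the allow-list."""
    return {normalize(h) for h in URL_HOST.findall(text) if not is_allowed(h)}


def scan_corpus(dataset_dir: str) -> dict[str, set[str]]:
    """Scan every tier*.jsonl file as raw text and report unexpected hosts per file."""
    report = {}
    for path in sorted(Path(dataset_dir).glob("tier*.jsonl")):
        hosts = unexpected_hosts(path.read_text(encoding="utf-8"))
        if hosts:
            report[path.name] = hosts
    return report
```

Running `scan_corpus("dataset")` from a checkout returns an empty dictionary when every URL host is on the allow-list. Hosts such as `localhost` would be reported, so treat hits as items to review rather than confirmed violations.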

## Security and compliance — agency responsibility

`a11y-public-coder` is designed for self-hosted deployment. Deploying organizations remain responsible for their own security and privacy posture:

- **Self-host**: run via Ollama, vLLM, llama.cpp, or similar so that no prompts leave your network
- **No certifications**: this model has not been independently certified against NIST 800-53, FedRAMP, CJIS, HIPAA, FERPA, or state-specific frameworks. Agencies must independently validate fitness for their compliance context
- **No sensitive data in prompts**: do not paste citizen PII, case numbers, or other sensitive content. The model is a code/audit assistant, not a data-handling system
- **Output review**: model output is a suggestion, not authoritative. Human review is required before deployment
- **Access controls and audit logging** are the operator's responsibility

## Training methodology

`a11y-public-coder` was trained using the **CRAFTED℠ (Continuous Retrieval-Augmented Fine-Tuning, Evaluate, Deploy)** pipeline:

1. **Source corpus assembly** — 1,930 training pairs generated from official documentation (drupal.org, drush.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, php.net, docs.python.org) by a local teacher model (`qwen3:30b` via Ollama)

2. **CRAFTED℠ correction stream** — Every generated entry was reviewed against domain-specific failure-mode filters (e.g. the WCAG 2.5.8 = 24×24 vs 2.5.5 = 44×44 contamination check, the Drupal 7/8/9 → Drupal 11 API leakage check, the Python-vs-TypeScript Playwright fallback check). 1,925 of 1,930 entries passed auto-acceptance with rule-based filters; 5 were manually corrected; 2 additional issues were flagged by a Drupal-specific D7/D8 API validator and corrected. Final corrected entries: 1,930/1,930.

3. **Fine-tuning** — Unsloth + LoRA (r=16, alpha=16, no dropout) on NVIDIA RTX 3090 Ti, 4 epochs at learning rate 2e-4 with cosine schedule. The 4B run reweights `demo_friendly` entries by 1.5× and downsamples entries with `len(assistant) > 1800 chars` by 0.7× to favor explanation-leaning content; the 14B run uses the full distribution without reweighting.

4. **Conversion** — GGUF via pinned `llama.cpp` commit `57819b8d4` with `--outtype f16`, quantized to Q4_K_M for serving. Modelfile uses ChatML template override for tokenizer consistency.

The full pipeline is reproducible from the [training scripts](https://github.com/rockypod/public_a11y_coder) in this repository.
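
The 4B reweighting in step 3 can be pictured as a per-entry sampling weight. The sketch below is an illustration only, not code from `train.py`: the `demo_friendly` and `assistant` field names, the choice to multiply the two factors when both apply, and the use of weighted sampling with replacement are assumptions.

```python
import random

DEMO_BOOST = 1.5        # multiplier for demo_friendly entries (from step 3)
LONG_PENALTY = 0.7      # multiplier for long assistant responses (from step 3)
LONG_THRESHOLD = 1800   # characters


def sample_weight(entry: dict) -> float:
    """Sampling weight for one training pair (hypothetical schema)."""
    weight = 1.0
    if entry.get("demo_friendly", False):
        weight *= DEMO_BOOST
    if len(entry.get("assistant", "")) > LONG_THRESHOLD:
        weight *= LONG_PENALTY
    return weight


def resample(entries: list[dict], k: int, seed: int = 0) -> list[dict]:
    """Draw k entries with replacement, in proportion to their weights."""
    rng = random.Random(seed)
    weights = [sample_weight(e) for e in entries]
    return rng.choices(entries, weights=weights, k=k)


entries = [
    {"assistant": "short answer", "demo_friendly": True},
    {"assistant": "x" * 2000, "demo_friendly": False},
    {"assistant": "x" * 2000, "demo_friendly": True},
    {"assistant": "plain", "demo_friendly": False},
]
print([round(sample_weight(e), 2) for e in entries])  # [1.5, 0.7, 1.05, 1.0]
```

Under these assumptions the 14B run corresponds to every entry having weight 1.0.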

## Dataset

The training corpus is **1,930 high-quality instruction-response pairs across 18 tiers**, fully open and downloadable from the [dataset repository](https://huggingface.co/datasets/rockypod/a11y-public-coder-dataset) or from `dataset/` in this repo:

| Tier | Domain | Entries |
|---|---|---|
| 1 | Drupal 11 core fundamentals | 100 |
| 2 | Drupal 11 contrib stack (Webform, Paragraphs, Views, Pathauto, Metatag) | 100 |
| 3 | Drupal 11 Twig 3 templating | 100 |
| 4 | Drupal 11 custom modules | 100 |
| 5 | Drupal 11 accessibility patterns | 100 |
| 6 | Drupal-flavored PHP 8.3 | 100 |
| 7 | Drush 12 CLI usage | 100 |
| 8 | Drush 12 custom command authoring | 100 |
| 9 | Python 3.12 folder/file utilities | 100 |
| 10 | Python 3.12 file conversion | 100 |
| 11 | Python 3.12 accessibility-aware utilities | 100 |
| 12 | Playwright (TypeScript) fundamentals | 100 |
| 13 | Playwright + `@axe-core/playwright` | 140 |
| 14 | Playwright + `@siteimprove/alfa-playwright` | 130 |
| 15 | WCAG 2.2 AA — pre-2.2 carryover SCs | 80 |
| 16 | WCAG 2.2-new success criteria (9 new SCs) | 140 |
| 17 | Negative-example / contamination correction pairs | 140 |
| 18 | End-to-end multi-domain scenarios | 100 |
| **Total** | | **1,930** |

## Evaluation

Models are evaluated against a 30-question exam covering all training domains, scored **Full (1.0) / Partial (0.5) / Fail (0.0)** per question, max 30.0 points. The exam is **published in full**, including grading rubrics: see [`exam/a11y-30q.md`](exam/a11y-30q.md).
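
The scoring arithmetic is simple enough to sketch. The example below assumes grades are stored as a mapping from question ID to one of `full`, `partial`, or `fail`; that layout and the function name are illustrative and may not match the `grades.json` format that `exam/run_exam.py` expects.

```python
GRADE_POINTS = {"full": 1.0, "partial": 0.5, "fail": 0.0}


def score_exam(grades: dict[str, str]) -> tuple[float, float]:
    """Return (total points, percentage of the maximum) for a graded exam."""
    if not grades:
        raise ValueError("no grades to score")
    total = sum(GRADE_POINTS[g.lower()] for g in grades.values())
    percentage = 100.0 * total / len(grades)
    return total, percentage


# Hypothetical grade sheet: 20 Full, 4 Partial, 6 Fail over 30 questions.
# This is one of many distributions that totals 22.0, not the published breakdown.
grades = {f"Q{i}": "full" for i in range(1, 21)}
grades.update({f"Q{i}": "partial" for i in range(21, 25)})
grades.update({f"Q{i}": "fail" for i in range(25, 31)})

total, pct = score_exam(grades)
print(f"{total}/{len(grades)} = {pct:.1f}%")  # 22.0/30 = 73.3%
```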

**Baseline results (before fine-tuning) and fine-tuned results** are published in `exam/`, with per-question grades:

### Summary

| Model | Total | Percentage |
|---|---|---|
| `qwen3:4b` baseline | 13.5/30 | 45.0% |
| `qwen3:8b` baseline | 17.0/30 | 56.7% |
| `qwen3:14b` baseline | 16.0/30 | 53.3% |
| **`a11y-public-coder:4b` (trained)** | **22.0/30** | **73.3%** |
| **`a11y-public-coder:14b` (trained)** | **23.0/30** | **76.7%** |

### Per-domain results — 4B trained vs baseline `qwen3:4b`

| Domain | Baseline | Trained 4B | Lift |
|---|---|---|---|
| Drupal 11 | 2.0/8 (25.0%) | 6.0/8 (75.0%) | **+4.0** ⬆ |
| PHP 8.3 | 0.5/2 (25.0%) | 1.0/2 (50.0%) | +0.5 |
| Drush 12 | 2.0/3 (66.7%) | 1.5/3 (50.0%) | -0.5 ⬇ |
| Python 3.12 | 2.5/4 (62.5%) | 4.0/4 (100%) | **+1.5** ✓ |
| Playwright + axe-core | 0.5/3 (16.7%) | 2.0/3 (66.7%) | **+1.5** ⬆ |
| Playwright + Alfa | 0.5/2 (25.0%) | 1.5/2 (75.0%) | **+1.0** ⬆ |
| WCAG 2.2 AA (carryover) | 3.0/4 (75.0%) | 3.0/4 (75.0%) | 0 |
| WCAG 2.2-new ⭐ | 1.5/3 (50.0%) | 2.0/3 (66.7%) | +0.5 |
| Negative/contamination gate | 1.0/1 (100%) | 1.0/1 (100%) | 0 ✓ |
| **Total** | **13.5/30 (45.0%)** | **22.0/30 (73.3%)** | **+8.5 (+28.3 percentage points)** |

### Per-domain results — 14B trained vs baseline `qwen3:14b`

| Domain | Baseline | Trained 14B | Lift |
|---|---|---|---|
| Drupal 11 | 3.0/8 (37.5%) | 6.5/8 (81.3%) | **+3.5** ⬆ |
| PHP 8.3 | 1.5/2 (75.0%) | 1.5/2 (75.0%) | 0 |
| Drush 12 | 1.5/3 (50.0%) | 1.0/3 (33.3%) | -0.5 ⬇ |
| Python 3.12 | 2.5/4 (62.5%) | 3.5/4 (87.5%) | **+1.0** ⬆ |
| Playwright + axe-core | 1.0/3 (33.3%) | 2.5/3 (83.3%) | **+1.5** ⬆ |
| Playwright + Alfa | 1.0/2 (50.0%) | 2.0/2 (100%) | **+1.0** ✓ |
| WCAG 2.2 AA (carryover) | 3.0/4 (75.0%) | 3.0/4 (75.0%) | 0 |
| WCAG 2.2-new ⭐ | 1.5/3 (50.0%) | 2.0/3 (66.7%) | +0.5 |
| Negative/contamination gate | 1.0/1 (100%) | 1.0/1 (100%) | 0 ✓ |
| **Total** | **16.0/30 (53.3%)** | **23.0/30 (76.7%)** | **+7.0 (+23.3 percentage points)** |
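
The lift figures are differences in percentage points, not relative gains. As a cross-check, the sketch below re-adds the 14B per-domain rows from the table above and derives both numbers; the data structure and function name are illustrative. Computed from the unrounded totals, the 14B lift is 7.0/30, or 23.3 percentage points; subtracting the already-rounded percentages (76.7 and 53.3) gives 23.4 instead.

```python
# (domain, max points, baseline, trained), copied from the 14B table above.
ROWS_14B = [
    ("Drupal 11", 8, 3.0, 6.5),
    ("PHP 8.3", 2, 1.5, 1.5),
    ("Drush 12", 3, 1.5, 1.0),
    ("Python 3.12", 4, 2.5, 3.5),
    ("Playwright + axe-core", 3, 1.0, 2.5),
    ("Playwright + Alfa", 2, 1.0, 2.0),
    ("WCAG 2.2 AA (carryover)", 4, 3.0, 3.0),
    ("WCAG 2.2-new", 3, 1.5, 2.0),
    ("Negative/contamination gate", 1, 1.0, 1.0),
]


def summarize(rows):
    """Sum per-domain scores and compute lift in points and percentage points."""
    max_points = sum(r[1] for r in rows)
    baseline = sum(r[2] for r in rows)
    trained = sum(r[3] for r in rows)
    return {
        "max": max_points,
        "baseline": baseline,
        "trained": trained,
        "lift_points": trained - baseline,
        # Difference between the two percentages (percentage points).
        "lift_pp": 100 * trained / max_points - 100 * baseline / max_points,
        # Gain relative to the baseline score, a different and larger number.
        "relative_gain_pct": 100 * (trained - baseline) / baseline,
    }


summary = summarize(ROWS_14B)
print(summary["baseline"], summary["trained"], summary["lift_points"])  # 16.0 23.0 7.0
print(f"{summary['lift_pp']:.1f} pp, {summary['relative_gain_pct']:.2f}% relative")
```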

## Running the exam yourself

```bash
# Against a trained model already loaded in Ollama
python exam/run_exam.py --model rockypod/public-a11y-coder:4b --output exam/trained/4b

# Score after filling grades.json
python exam/run_exam.py --score exam/trained/4b
```

Grading is manual (Full/Partial/Fail per rubric in `exam/a11y-30q.md`).

## Reproducing training

```bash
# On a CUDA GPU server with the Unsloth venv installed
mkdir -p logs  # the log redirect below fails if this directory is missing
nohup env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True TORCHDYNAMO_DISABLE=1 \
  python train.py --size 4b > logs/train_4b.log 2>&1 &
```

`TORCHDYNAMO_DISABLE=1` is required: Qwen3 with Unsloth triggers Triton JIT compilation, which fails when the CUDA driver and toolkit versions are mismatched, as is common on Rocky Linux GPU hosts.

## Usage

### Ollama (local)

```bash
ollama run rockypod/public-a11y-coder:14b
# or for the portable demo model:
ollama run rockypod/public-a11y-coder:4b
```
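
For scripted use, the same model can be called through Ollama's HTTP API. The sketch below uses only the standard library and targets the `/api/chat` endpoint on the default address. The helper names are illustrative, and it assumes the model has already been pulled and the Ollama server is running.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama endpoint


def build_chat_request(model: str, prompt: str,
                       base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(model, prompt, base_url)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["message"]["content"]
```

Calling `chat("rockypod/public-a11y-coder:4b", "Which WCAG 2.2 success criterion covers minimum target size?")` returns the reply as a string. As with any model output, review it before acting on it.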

### OpenWebUI

Add the model under Settings → Models → Ollama: point OpenWebUI to your Ollama endpoint (default `http://localhost:11434`), then select `rockypod/public-a11y-coder:14b` from the model list.

### HuggingFace Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rockypod/a11y-public-coder-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Drupal 11 Twig snippet for an accessible image field with a skip-link-friendly heading structure."}
]
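# add_generation_prompt=True appends the assistant turn header so the model answers
# the question; return_dict=True also returns the attention mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))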
```

## Known limitations

The v0.9.0 release ships with documented gaps to be addressed in v1.0:

1. **Drush flag accuracy** — The 4B variant occasionally fabricates non-existent command flags (e.g. inventing `--target` or `--exclude` on commands where those flags do not exist). This is a training data quality issue traced to tier 7 generation; v1.0 will include a Drush command-reference validator before retraining.

2. **Contrast ratio computation** — Small models cannot reliably compute color contrast ratios from arbitrary hex pairs. The model correctly identifies SC 1.4.3 Contrast (Minimum) and can recall specific examples that appear in training (`#767676` on white = 4.54:1), but it does not generalize to novel inputs. Pair it with a deterministic contrast checker.

3. **WCAG 2.2-new exception coverage** — SC 2.5.8 Target Size (Minimum) has five distinct exceptions (spacing, equivalent, inline, user agent control, essential). The 4B reliably outputs the headline `24×24 CSS pixels` AA threshold but consistently covers only one of the five exceptions. v1.0 will expand tier 16 with dedicated entries per exception type.

4. **SC-to-SC discrimination** — The 4B occasionally confuses related success criteria (e.g. cites SC 2.1.1 + 2.1.2 for a missing button role where 4.1.2 is the primary criterion). v1.0 will add SC-discrimination pair entries to tier 17.

5. **Drupal 11 vs earlier Drupal versions** — While the dataset targets Drupal 11 exclusively, the underlying base model has substantial Drupal 7/8/9 pretraining priors. The contamination gate (tier 17 negative examples) holds at 100% on the exam, but in long-form generation some D7-era patterns may surface. Always validate generated Drupal code against the actual D11 API.
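
The contrast gap in item 2 is easy to close with a deterministic helper. The sketch below implements the relative-luminance and contrast-ratio formulas from the WCAG definitions; the function names are illustrative, and it handles only six-digit hex colors.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light."""
    c = channel / 255
    # 0.04045 is the sRGB threshold; older WCAG text uses 0.03928, which
    # gives the same result for every 8-bit channel value.
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color, from 0.0 (black) to 1.0 (white)."""
    h = hex_color.lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected a six-digit hex color, got {hex_color!r}")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical colors) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """SC 1.4.3 thresholds: 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)


print(f"{contrast_ratio('#767676', '#ffffff'):.2f}:1")  # 4.54:1
print(f"{contrast_ratio('#777777', '#ffffff'):.2f}:1")  # 4.48:1
```

The two grays differ by a single step per channel yet fall on opposite sides of the 4.5:1 threshold, which is why recall of memorized examples is not a substitute for computing the ratio.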

## Recommended use cases

**Strong fit:**
- Generating Drupal 11 module scaffolds with accessibility baked in
- Writing Playwright + axe-core / Alfa test files for agency sites
- Drafting Python utility scripts for accessibility audits (PDF text layer detection, alt text auditing, heading hierarchy)
- Explaining WCAG 2.2 success criteria to non-technical content editors
- Drush 12 natural-language to command translation (with verification)

**Use with caution:**
- Contrast ratio calculations (verify with a deterministic checker)
- Drush command flags (verify against `drush help <command>`)
- Drupal 8/9 maintenance (this model is Drupal 11-targeted)
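
The flag check can be partly automated by comparing a generated command against help text from the installed Drush. The sketch below is one hypothetical approach: it extracts long options from the command and reports any that do not appear in help output you supply, for example text captured from `drush help <command>`. The function names, the `example:list` command, and the sample help text are invented for illustration and are not real Drush output.

```python
import re
import shlex

LONG_FLAG = re.compile(r"--([A-Za-z0-9][A-Za-z0-9-]*)")


def long_flags(command: str) -> set[str]:
    """Extract long option names (without the leading dashes) from a shell command."""
    flags = set()
    for token in shlex.split(command):
        match = LONG_FLAG.match(token)  # match() anchors at the start of the token
        if match:
            flags.add(match.group(1))
    return flags


def documented_flags(help_text: str) -> set[str]:
    """Collect every --option name that appears anywhere in the help text."""
    return set(LONG_FLAG.findall(help_text))


def unknown_flags(command: str, help_text: str) -> set[str]:
    """Flags used in the command that the help text never mentions."""
    return long_flags(command) - documented_flags(help_text)


# Invented help text for illustration only; capture the real text from your install.
SAMPLE_HELP = """
Options:
  --format[=FORMAT]  Output format.
  --fields=FIELDS    Fields to show.
"""
print(unknown_flags("drush example:list --format=json --target=prod", SAMPLE_HELP))  # {'target'}
```

A flag reported here is only a candidate for review: global options may be documented outside a command's own help text, so an empty result is a useful signal but not proof that the command is correct.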

**Not designed for:**
- General-purpose coding outside the trained domains
- Production-critical accessibility certification without human review
- Handling sensitive citizen data in prompts

## Roadmap

| Version | Target | Focus |
|---|---|---|
| **v0.9.0** | **shipped** | Initial release, baselines published, ship gate intentionally below 80% with documented limitations |
| v0.9.5 | ~6 weeks | Drush flag validation pass, contrast hex-pair expansion, SC 2.5.8 exception coverage |
| v1.0.0 | ~10 weeks | All v0.9.0 limitations addressed, ≥85% on the 30Q exam |

The CRAFTED℠ methodology means each version uses real-world exam failures and user-reported issues as the correction stream for the next training cycle. The v1.0 release will include an expanded 60-question exam.

## Reproducibility

This release is reproducible end-to-end from the public artifacts:

- **Dataset:** [`rockypod/a11y-public-coder-dataset`](https://huggingface.co/datasets/rockypod/a11y-public-coder-dataset) or `dataset/` in this repo
- **Training pipeline:** [`train.py`](train.py) in this repo
- **Evaluation exam:** [`exam/a11y-30q.md`](exam/a11y-30q.md)
- **Exam runner:** [`exam/run_exam.py`](exam/run_exam.py)
- **Per-question grading results:** [`exam/baselines/`](exam/baselines/) and [`exam/trained/`](exam/trained/)

## Citation

```bibtex
@misc{a11y-public-coder-v0.9.0,
  author       = {RockyPod},
  title        = {a11y-public-coder: An open-source accessibility coding assistant for the public sector},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rockypod/a11y-public-coder-4b}},
}
```

## Acknowledgments

- Base models: `Qwen3-4B` and `Qwen3-14B` from the [Qwen team](https://github.com/QwenLM), released as open weights under the Apache 2.0 license
- Accessibility tooling: [Deque axe-core](https://github.com/dequelabs/axe-core), [Siteimprove Alfa](https://github.com/Siteimprove/alfa)
- Web standards: [W3C WAI](https://www.w3.org/WAI/) for the WCAG 2.2 specification and Understanding documents
- Training infrastructure: [Unsloth](https://github.com/unslothai/unsloth), [llama.cpp](https://github.com/ggerganov/llama.cpp), [Ollama](https://ollama.com/)

## License

MIT. See [LICENSE](LICENSE) for full text. Free for any use, including commercial use and use by government agencies.