license: mit
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- accessibility
- wcag
- wcag-2.2
- drupal
- drupal-11
- php
- drush
- python
- playwright
- axe-core
- siteimprove-alfa
- government
- public-sector
- code
base_model:
- Qwen/Qwen3-4B
- Qwen/Qwen3-14B
datasets:
- rockypod/a11y-public-coder-dataset
model-index:
- name: a11y-public-coder
  results:
  - task:
      type: text-generation
      name: WCAG 2.2 AA Accessibility Coding Exam
    metrics:
    - type: accuracy
      name: 30-question exam (4B)
      value: 73.3
    - type: accuracy
      name: 30-question exam (14B)
      value: 76.7
a11y-public-coder
Open-source accessibility coding assistant for the public sector. WCAG 2.2 Level AA conformance, Drupal 11, PHP 8.3, Drush 12, Python 3.12, and Playwright (TypeScript) with both axe-core and Siteimprove Alfa.
Version 0.9.0 · License MIT · Released 2026-05-17
HuggingFace: weights ·
Install via Ollama: `ollama pull rockypod/public-a11y-coder` ·
GitHub: exam, dataset, training pipeline
Quick reference
| | 4B | 14B |
|---|---|---|
| Base model | `Qwen/Qwen3-4B` | `Qwen/Qwen3-14B` |
| Quantization | Q4_K_M GGUF (~2.5 GB) | Q4_K_M GGUF (~9 GB) |
| Recommended use | Demo, non-technical explanation, portable inference | Daily-driver technical work, OpenWebUI deployment |
| Exam score | 73.3% (22.0/30) | 76.7% (23.0/30) |
| Lift vs base | +28.3 percentage points (from 45.0% baseline) | +23.3 percentage points (from 53.3% baseline) |
| Ollama tag | `rockypod/public-a11y-coder:4b` | `rockypod/public-a11y-coder:14b` |
What's in this repo
| Path | Description |
|---|---|
| `exam/a11y-30q.md` | Full 30-question evaluation exam with rubric |
| `exam/run_exam.py` | Exam runner: collects responses and scores |
| `exam/baselines/` | Pre-training baseline grades (qwen3:4b, 8b, 14b) |
| `exam/trained/` | Post-training grades per model size |
| `dataset/tier*.jsonl` | Full training corpus: 1,930 pairs across 18 tiers |
| `train.py` | Full training pipeline (consolidate → LoRA → merge → GGUF) |
| `Modelfile` | Ollama Modelfile (4B production ChatML template) |
Large artifacts (checkpoints, merged HF weights, GGUF) are not in this repo β download GGUFs from the HuggingFace model page or pull via Ollama.
Intended use
a11y-public-coder is designed for use by government agencies, public-sector developers, accessibility professionals, and any developer maintaining Drupal 11 sites that must meet WCAG 2.2 Level AA. The model produces:
- Drupal 11 module, theme, and Twig code that follows accessibility best practices
- Drush 12 CLI commands and custom command authoring
- Python 3.12 utility scripts for accessibility-aware file operations (PDF text layer detection, alt audit, heading hierarchy)
- Playwright (TypeScript) test scaffolds using `@axe-core/playwright` and `@siteimprove/alfa-playwright`
- WCAG 2.2 AA explanations cited by success criterion number, suitable for both developers and non-technical content editors
The 4B variant is optimized for portable demonstrations (runs comfortably in a Windows 11 VM with 8 GB allocation) and explanation-first responses. The 14B variant is the primary daily-driver, targeted at OpenWebUI deployment on agency or homelab hardware.
Privacy-first training
This model was trained under explicit privacy constraints documented in the dataset card and verifiable in the public training corpus:
- No PII in any training entry: no real names, addresses, emails, phone numbers, case numbers, or social security numbers
- No real URLs, hostnames, or production domain names: all examples use `example.gov`, `gov.example.org`, or `agency.example` placeholders
- No scraped production agency content: every training example was authored from publicly available official documentation (drupal.org, php.net, docs.python.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, drush.org)
- Full dataset, exam questions, and per-question grading results are public: see the `a11y-public-coder-dataset` repository
The privacy-first training approach minimizes the risk of memorized PII surfacing in outputs, but does not eliminate standard LLM safety considerations.
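The constraints above lend themselves to a mechanical spot check. The following is a minimal sketch, not the project's own validator: it scans raw text for email addresses and URL hosts outside the placeholder domains listed above, plus SSN-shaped numbers. The regular expressions and the function name `find_privacy_violations` are assumptions made for illustration, and references to official documentation hosts would need their own allow-list in practice.

```python
import re

# Placeholder domains named in the privacy constraints above.
ALLOWED_SUFFIXES = ("example.gov", "gov.example.org", "agency.example")

# Illustrative patterns only; a real audit would need broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")
URL_RE = re.compile(r"https?://([\w-]+(?:\.[\w-]+)+)")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def is_placeholder(host: str) -> bool:
    """True if the host is a placeholder domain or a subdomain of one."""
    host = host.lower()
    return any(host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES)


def find_privacy_violations(text: str) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    for match in EMAIL_RE.finditer(text):
        if not is_placeholder(match.group(1)):
            findings.append(f"email: {match.group(0)}")
    for match in URL_RE.finditer(text):
        if not is_placeholder(match.group(1)):
            findings.append(f"url host: {match.group(1)}")
    for match in SSN_RE.finditer(text):
        findings.append(f"ssn-like: {match.group(0)}")
    return findings
```

Running this over each line of the public corpus gives a quick first pass; an empty result does not prove the absence of PII.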
Security and compliance: agency responsibility
a11y-public-coder is designed for self-hosted deployment. Deploying organizations remain responsible for their own security and privacy posture:
- Self-host: run via Ollama, vLLM, llama.cpp, or similar so that no prompts leave your network
- No certifications: this model has not been independently certified against NIST 800-53, FedRAMP, CJIS, HIPAA, FERPA, or state-specific frameworks. Agencies must independently validate fitness for their compliance context
- No sensitive data in prompts: do not paste citizen PII, case numbers, or other sensitive content. The model is a code/audit assistant, not a data-handling system
- Output review: model output is a suggestion, not authoritative. Human review is required before deployment
- Access controls and audit logging are the operator's responsibility
Training methodology
a11y-public-coder was trained using the CRAFTED (Continuous Retrieval-Augmented Fine-Tuning, Evaluate, Deploy) pipeline:
1. Source corpus assembly: 1,930 training pairs generated from official documentation (drupal.org, drush.org, playwright.dev, alfa.siteimprove.com, w3.org/WAI/WCAG22/, php.net, docs.python.org) by a local teacher model (`qwen3:30b` via Ollama).
2. CRAFTED correction stream: Every generated entry was reviewed against domain-specific failure-mode filters (e.g. the WCAG 2.5.8 = 24×24 vs 2.5.5 = 44×44 contamination check, the Drupal 7/8/9 → Drupal 11 API leakage check, the Python-vs-TypeScript Playwright fallback check). 1,925 of 1,930 entries passed auto-acceptance with rule-based filters; 5 were manually corrected; 2 additional issues were flagged by a Drupal-specific D7/D8 API validator and corrected. Final corrected entries: 1,930/1,930.
3. Fine-tuning: Unsloth + LoRA (r=16, alpha=16, no dropout) on NVIDIA RTX 3090 Ti, 4 epochs at learning rate 2e-4 with cosine schedule. The 4B run reweights `demo_friendly` entries by 1.5× and downsamples entries with `len(assistant) > 1800 chars` by 0.7× to favor explanation-leaning content; the 14B run uses the full distribution without reweighting.
4. Conversion: GGUF via pinned `llama.cpp` commit `57819b8d4` with `--outtype f16`, quantized to Q4_K_M for serving. Modelfile uses ChatML template override for tokenizer consistency.
The full pipeline is reproducible from the training scripts in this repository.
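Two of the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `train.py`: the regular expressions, the entry fields (`assistant`, `demo_friendly`), and both function names are hypothetical. The first function flags sentences that pair SC 2.5.8 with 44×44 or SC 2.5.5 with 24×24. The second computes a per-entry sampling weight for the 4B run, assuming the two factors multiply when both apply (the description above does not say how they combine).

```python
import re

# Correct pairings in WCAG 2.2: SC 2.5.8 (AA) = 24x24, SC 2.5.5 (AAA) = 44x44.
EXPECTED_SIZE = {"2.5.8": "24", "2.5.5": "44"}
SIZE_RE = re.compile(r"\b(24|44)\s*(?:x|×|by)\s*(?:24|44)\b", re.IGNORECASE)
SC_RE = re.compile(r"\b2\.5\.[58]\b")


def has_target_size_contamination(text: str) -> bool:
    """Flag sentences that pair a target-size SC with the other SC's pixel value."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        criteria = set(SC_RE.findall(sentence))
        sizes = {m.group(1) for m in SIZE_RE.finditer(sentence)}
        # Only judge unambiguous sentences: exactly one SC and one size.
        if len(criteria) == 1 and len(sizes) == 1:
            if EXPECTED_SIZE[criteria.pop()] != sizes.pop():
                return True
    return False


def sample_weight_4b(entry: dict) -> float:
    """Hypothetical sampling weight for the 4B run (factors assumed multiplicative)."""
    weight = 1.0
    if entry.get("demo_friendly"):
        weight *= 1.5  # upweight explanation-first entries
    if len(entry.get("assistant", "")) > 1800:
        weight *= 0.7  # downsample long responses
    return weight
```

A sentence-level check like this is deliberately conservative: it ignores sentences that mention both criteria or both sizes, so it complements manual review rather than replacing it.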
Dataset
The training corpus is 1,930 high-quality instruction-response pairs across 18 tiers, fully open and downloadable from the dataset repository or from dataset/ in this repo:
| Tier | Domain | Entries |
|---|---|---|
| 1 | Drupal 11 core fundamentals | 100 |
| 2 | Drupal 11 contrib stack (Webform, Paragraphs, Views, Pathauto, Metatag) | 100 |
| 3 | Drupal 11 Twig 3 templating | 100 |
| 4 | Drupal 11 custom modules | 100 |
| 5 | Drupal 11 accessibility patterns | 100 |
| 6 | Drupal-flavored PHP 8.3 | 100 |
| 7 | Drush 12 CLI usage | 100 |
| 8 | Drush 12 custom command authoring | 100 |
| 9 | Python 3.12 folder/file utilities | 100 |
| 10 | Python 3.12 file conversion | 100 |
| 11 | Python 3.12 accessibility-aware utilities | 100 |
| 12 | Playwright (TypeScript) fundamentals | 100 |
| 13 | Playwright + `@axe-core/playwright` | 140 |
| 14 | Playwright + `@siteimprove/alfa-playwright` | 130 |
| 15 | WCAG 2.2 AA: pre-2.2 carryover SCs | 80 |
| 16 | WCAG 2.2-new success criteria (9 new SCs) | 140 |
| 17 | Negative-example / contamination correction pairs | 140 |
| 18 | End-to-end multi-domain scenarios | 100 |
| Total | | 1,930 |
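To check a local copy of the corpus against the table above, the per-tier counts can be tallied directly from the files. A minimal sketch, assuming each `dataset/tier*.jsonl` file holds one JSON object per non-empty line; the function name `count_entries` is illustrative and not part of the repository.

```python
import json
from pathlib import Path


def count_entries(dataset_dir: str) -> dict[str, int]:
    """Return {file name: number of valid JSON lines} for every tier file."""
    counts = {}
    for path in sorted(Path(dataset_dir).glob("tier*.jsonl")):
        count = 0
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed row
                    count += 1
        counts[path.name] = count
    return counts
```

If the local copy matches the table, `sum(count_entries("dataset").values())` comes to 1,930.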
Evaluation
Models are evaluated against a 30-question exam covering all training domains, scored Full (1.0) / Partial (0.5) / Fail (0.0) per question, max 30.0 points. The exam is published in full, including grading rubrics: see exam/a11y-30q.md.
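The scoring rule reduces to a weighted count. A minimal sketch, assuming grades are recorded as the strings `full`, `partial`, and `fail`; the actual layout of `grades.json` is not described here, so this representation is hypothetical.

```python
GRADE_POINTS = {"full": 1.0, "partial": 0.5, "fail": 0.0}


def score_exam(grades: list[str]) -> tuple[float, float]:
    """Return (total points, percentage of the maximum) for a list of grades."""
    total = sum(GRADE_POINTS[g.lower()] for g in grades)
    return total, round(100 * total / len(grades), 1)
```

For example, a hypothetical mix of 20 Full, 4 Partial, and 6 Fail grades totals 22.0 points, or 73.3%.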
Pre-training baselines and post-training results are published in exam/, with per-question grades:
Summary
| Model | Total | Percentage |
|---|---|---|
| `qwen3:4b` baseline | 13.5/30 | 45.0% |
| `qwen3:8b` baseline | 17.0/30 | 56.7% |
| `qwen3:14b` baseline | 16.0/30 | 53.3% |
| `a11y-public-coder:4b` (trained) | 22.0/30 | 73.3% |
| `a11y-public-coder:14b` (trained) | 23.0/30 | 76.7% |
Per-domain results: 4B trained vs baseline qwen3:4b
| Domain | Baseline | Trained 4B | Lift |
|---|---|---|---|
| Drupal 11 | 2.0/8 (25%) | 6.0/8 (75%) | +4.0 ⬆ |
| PHP 8.3 | 0.5/2 (25%) | 1.0/2 (50%) | +0.5 |
| Drush 12 | 2.0/3 (67%) | 1.5/3 (50%) | -0.5 ⬇ |
| Python 3.12 | 2.5/4 (63%) | 4.0/4 (100%) | +1.5 |
| Playwright + axe-core | 0.5/3 (17%) | 2.0/3 (67%) | +1.5 ⬆ |
| Playwright + Alfa | 0.5/2 (25%) | 1.5/2 (75%) | +1.0 ⬆ |
| WCAG 2.2 AA (carryover) | 3.0/4 (75%) | 3.0/4 (75%) | 0 |
| WCAG 2.2-new | 1.5/3 (50%) | 2.0/3 (67%) | +0.5 |
| Negative/contamination gate | 1.0/1 (100%) | 1.0/1 (100%) | 0 |
| Total | 13.5/30 (45.0%) | 22.0/30 (73.3%) | +8.5 (+28.3 pp) |
Per-domain results: 14B trained vs baseline qwen3:14b
| Domain | Baseline | Trained 14B | Lift |
|---|---|---|---|
| Drupal 11 | 3.0/8 (37.5%) | 6.5/8 (81.3%) | +3.5 ⬆ |
| PHP 8.3 | 1.5/2 (75.0%) | 1.5/2 (75.0%) | 0 |
| Drush 12 | 1.5/3 (50.0%) | 1.0/3 (33.3%) | -0.5 ⬇ |
| Python 3.12 | 2.5/4 (62.5%) | 3.5/4 (87.5%) | +1.0 ⬆ |
| Playwright + axe-core | 1.0/3 (33.3%) | 2.5/3 (83.3%) | +1.5 ⬆ |
| Playwright + Alfa | 1.0/2 (50.0%) | 2.0/2 (100%) | +1.0 |
| WCAG 2.2 AA (carryover) | 3.0/4 (75.0%) | 3.0/4 (75.0%) | 0 |
| WCAG 2.2-new | 1.5/3 (50.0%) | 2.0/3 (66.7%) | +0.5 |
| Negative/contamination gate | 1.0/1 (100%) | 1.0/1 (100%) | 0 |
| Total | 16.0/30 (53.3%) | 23.0/30 (76.7%) | +7.0 (+23.3 pp) |
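The totals in the 14B table follow from the per-domain rows. The sketch below recomputes them from the figures above; the tuple layout and the function name `summarize` are illustrative.

```python
# (domain, questions, baseline points, trained points) copied from the 14B table.
ROWS_14B = [
    ("Drupal 11", 8, 3.0, 6.5),
    ("PHP 8.3", 2, 1.5, 1.5),
    ("Drush 12", 3, 1.5, 1.0),
    ("Python 3.12", 4, 2.5, 3.5),
    ("Playwright + axe-core", 3, 1.0, 2.5),
    ("Playwright + Alfa", 2, 1.0, 2.0),
    ("WCAG 2.2 AA (carryover)", 4, 3.0, 3.0),
    ("WCAG 2.2-new", 3, 1.5, 2.0),
    ("Negative/contamination gate", 1, 1.0, 1.0),
]


def summarize(rows):
    """Return (questions, baseline, trained, lift in points, lift in percentage points)."""
    questions = sum(r[1] for r in rows)
    baseline = sum(r[2] for r in rows)
    trained = sum(r[3] for r in rows)
    lift_points = trained - baseline
    # Express the lift as a share of the maximum score.
    lift_pp = 100 * lift_points / questions
    return questions, baseline, trained, lift_points, lift_pp
```

`summarize(ROWS_14B)` gives 30 questions, 16.0 baseline points, 23.0 trained points, and a lift of 7.0 points, which is about 23.3 percentage points. Subtracting the rounded percentages directly (76.7 minus 53.3) gives 23.4 instead.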
Running the exam yourself
# Against a trained model already loaded in Ollama
python exam/run_exam.py --model rockypod/public-a11y-coder:4b --output exam/trained/4b
# Score after filling grades.json
python exam/run_exam.py --score exam/trained/4b
Grading is manual (Full/Partial/Fail per rubric in exam/a11y-30q.md).
Reproducing training
# On a CUDA GPU server with the Unsloth venv installed
nohup env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True TORCHDYNAMO_DISABLE=1 \
python train.py --size 4b > logs/train_4b.log 2>&1 &
`TORCHDYNAMO_DISABLE=1` is required: Qwen3 + Unsloth triggers Triton JIT compilation, which fails on the CUDA driver/toolkit version mismatches common on Rocky Linux GPU hosts.
Usage
Ollama (local)
ollama run rockypod/public-a11y-coder:14b
# or for the portable demo model:
ollama run rockypod/public-a11y-coder:4b
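For scripted use, the same local model can be reached over Ollama's HTTP API. A minimal sketch using only the standard library, assuming Ollama is listening on its default port 11434 and the model tag has already been pulled; the helper names `build_chat_request` and `ask` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local endpoint


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming chat request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as response:
        return json.loads(response.read())["message"]["content"]
```

With the server running, `ask("rockypod/public-a11y-coder:14b", "Explain SC 2.4.11 Focus Not Obscured (Minimum) for a content editor.")` returns the reply as a string.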
OpenWebUI
Add the model under Settings → Models → Ollama, point to your Ollama endpoint (default `http://localhost:11434`), and select `rockypod/public-a11y-coder:14b` from the model list.
HuggingFace Transformers
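from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "rockypod/a11y-public-coder-4b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "Write a Drupal 11 Twig snippet for an accessible image field with a skip-link-friendly heading structure."}
]
# add_generation_prompt=True appends the assistant header so the model answers
# instead of continuing the user turn; return_dict=True also yields the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))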
Known limitations
The v0.9.0 release ships with documented gaps to be addressed in v1.0:
1. Drush flag accuracy: The 4B variant occasionally fabricates non-existent command flags (e.g. inventing `--target` or `--exclude` on commands where those flags do not exist). This is a training data quality issue traced to tier 7 generation; v1.0 will include a Drush command-reference validator before retraining.
2. Contrast ratio computation: Small models cannot reliably compute color contrast ratios from arbitrary hex pairs. The model correctly identifies SC 1.4.3 Contrast (Minimum) and can recall specific examples that appear in training (`#767676` on white = 4.54:1), but it does not generalize to novel inputs. Pair it with a deterministic contrast checker.
3. WCAG 2.2-new exception coverage: SC 2.5.8 Target Size (Minimum) has five distinct exception cases (spacing, equivalent, inline, user agent control, essential). The 4B reliably outputs the headline `24×24 CSS pixels` AA threshold but covers only one of the five exception cases consistently. v1.0 will expand tier 16 with dedicated entries per exception type.
4. SC-to-SC discrimination: The 4B occasionally confuses related success criteria (e.g. it cites SC 2.1.1 + 2.1.2 for a missing button role where 4.1.2 is the primary criterion). v1.0 will add SC-discrimination pair entries to tier 17.
5. Drupal 11 vs Drupal 10 distinction: While the dataset targets Drupal 11 exclusively, the underlying base model has substantial Drupal 7/8/9 pretraining priors. The contamination gate (tier 17 negative examples) holds at 100% on the exam, but in long-form generation some D7-era patterns may surface. Always validate generated Drupal code against the actual D11 API.
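Because contrast ratios are deterministic, they are better computed than generated. The sketch below implements the WCAG relative-luminance and contrast-ratio formulas for six-digit hex colors. The function names are illustrative, and this is a minimal example rather than a replacement for an audited checker.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB' or 'RRGGBB'."""
    value = hex_color.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, always >= 1 regardless of order."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_sc_143(foreground: str, background: str, large_text: bool = False) -> bool:
    """SC 1.4.3 requires 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Under these formulas black on white evaluates to 21:1 and `#767676` on white to about 4.54:1, while the neighboring `#777777` falls just short at about 4.48:1.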
Recommended use cases
Strong fit:
- Generating Drupal 11 module scaffolds with accessibility baked in
- Writing Playwright + axe-core / Alfa test files for agency sites
- Drafting Python utility scripts for accessibility audits (PDF text layer detection, alt text auditing, heading hierarchy)
- Explaining WCAG 2.2 success criteria to non-technical content editors
- Drush 12 natural-language to command translation (with verification)
Use with caution:
- Contrast ratio calculations (verify with a deterministic checker)
- Drush command flags (verify against `drush help <command>`)
- Drupal 8/9 maintenance (this model is Drupal 11-targeted)
Not designed for:
- General-purpose coding outside the trained domains
- Production-critical accessibility certification without human review
- Handling sensitive citizen data in prompts
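One way to apply the Drush caution above is to compare the flags in a generated command with the options Drush itself reports. The sketch below covers only the comparison step and assumes the caller has already collected the valid option names, for example by reading `drush help <command>`. The function name and the option sets in the usage note are hypothetical.

```python
import shlex


def unknown_flags(command: str, valid_options: set[str]) -> list[str]:
    """Return long flags in `command` that are not in `valid_options`."""
    unknown = []
    for token in shlex.split(command):
        if token.startswith("--"):
            name = token[2:].split("=", 1)[0]  # drop any =value suffix
            if name and name not in valid_options:
                unknown.append("--" + name)
    return unknown
```

For instance, with a made-up option set `{"yes", "no"}`, the command `drush cache:rebuild --target=render --yes` is reported as containing the unknown flag `--target`.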
Roadmap
| Version | Target | Focus |
|---|---|---|
| v0.9.0 | shipped | Initial release, baselines published, ship gate intentionally below 80% with documented limitations |
| v0.9.5 | ~6 weeks | Drush flag validation pass, contrast hex-pair expansion, SC 2.5.8 exception coverage |
| v1.0.0 | ~10 weeks | All v0.9.0 limitations addressed, ≥85% on the 30Q exam |
The CRAFTED methodology means each version uses real-world exam failures and user-reported issues as the correction stream for the next training cycle. The v1.0 release will include an expanded 60-question exam.
Reproducibility
This release is reproducible end-to-end from the public artifacts:
- Dataset: `rockypod/a11y-public-coder-dataset` or `dataset/` in this repo
- Training pipeline: `train.py` in this repo
- Evaluation exam: `exam/a11y-30q.md`
- Exam runner: `exam/run_exam.py`
- Per-question grading results: `exam/baselines/` and `exam/trained/`
Citation
@misc{a11y-public-coder-v0.9.0,
author = {RockyPod},
title = {a11y-public-coder: An open-source accessibility coding assistant for the public sector},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/rockypod/a11y-public-coder-4b}},
}
Acknowledgments
- Base models: Qwen team. `Qwen3-4B` and `Qwen3-14B` are Apache-2.0-licensed open weights
- Accessibility tooling: Deque axe-core, Siteimprove Alfa
- Web standards: W3C WAI for the WCAG 2.2 specification and Understanding documents
- Training infrastructure: Unsloth, llama.cpp, Ollama
License
MIT. See LICENSE for full text. Free for any use, including commercial use and use by government agencies.