AI-Assisted Accessibility Remediation with Mandatory Verification Gates
Large language models can draft alt text, ARIA labels, and heading-structure corrections far faster than a human can type them — but a model’s output is a suggestion, never a verified fix. This guide is part of Automated Remediation & Accessibility Fixing Patterns, and its core thesis is simple: treat every AI-proposed change as untrusted input that must pass an automated re-scan and a human review before it is allowed to merge. The danger is not that a model is wrong occasionally; it is that a model is confidently wrong and an unguarded pipeline will ship that confidence straight to production.
Key implementation targets:
- Wrap any model-generated fix in a re-scan gate using axe-core before it can advance.
- Require a human approval step — AI proposals are never auto-merged.
- Validate ARIA correctness against WCAG 2.2 SC 4.1.2 and reject regressions that add new violations.
- Record provenance so every accepted suggestion is traceable to the prompt and model that produced it.
The Problem: Confident Output Is Not Verified Output
A model asked to label an icon button might return aria-label="Submit" for a control that actually deletes a record. It might invent alt text describing an image it never saw, or restructure headings in a way that reads plausibly but breaks the document outline. None of these failures is reliably caught by reading the diff alone. Rule-level regressions show up only when the page is re-measured against the same rules that flagged the original defect, and a wrong meaning shows up only when someone who knows what the control does looks at it. The only safe architecture is one where the model never touches the merge button: it produces a candidate, and an automated tool plus a human decide whether the candidate is real.
This matters more for accessibility than for most code, because the failure mode is silent. A broken aria-label does not throw an exception or turn a test red on its own — it simply misinforms an assistive-technology user. That is exactly why the re-scan gate and the human gate are non-negotiable.
Key Implementation Targets
The pipeline has four moving parts: a generation step that prompts a model with the violating node and its context, a constraint layer that rejects malformed output before it ever reaches a browser, a re-scan that proves the fix removes the original violation and introduces no new one, and a human review surface where a person approves or rejects with full context. Skipping any one of these turns “AI-assisted” into “AI-unsupervised.”
Prerequisites
1. Generate a Candidate, Never a Commit
The generation step must output structured data, not a patch applied in place. Prompt the model with the failing element, its accessible context, and an explicit instruction set, then capture the proposal as JSON for downstream validation.
// suggest.js: produces a candidate fix, applies nothing
import fs from "node:fs";
// Assumption: callModel is the project's own provider-agnostic wrapper.
// The module path below is a placeholder, not a real package.
import { callModel } from "./model.js";

async function suggestFix(violation) {
  // violation: { html, target, ruleId, contextText }
  const prompt = [
    "You propose a single accessibility fix. Output JSON only:",
    '{ "attribute": "aria-label", "value": "..." }',
    "Rules: <= 80 chars, no leading/trailing space, do not include the",
    'word "button" in an aria-label, language must match the page.',
    `Element: ${violation.html}`,
    `Nearby text: ${violation.contextText}`,
  ].join("\n");
  const res = await callModel(prompt);
  const candidate = JSON.parse(res.text); // throws on malformed output
  // Carry the rule and selector forward so later stages can match the fix
  candidate.ruleId = violation.ruleId;
  candidate.target = violation.target;
  return candidate; // returned, not written to source
}

export async function suggestAll(violations) {
  const out = [];
  // Sequential on purpose: one model call at a time
  for (const v of violations) out.push(await suggestFix(v));
  fs.writeFileSync("candidates.json", JSON.stringify(out, null, 2));
  return out;
}
The deeper validation of these accessible-name proposals — length, redundant-role wording, and WCAG 2.2 SC 2.5.3 label-in-name — is covered in Using LLMs to Suggest ARIA Labels Safely.
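The constraint layer named in the pipeline overview can be sketched as a pure function that re-checks, in code, the rules the prompt only asks for. This is a minimal sketch, not the validation from the linked guide: the function name validateCandidate and the attribute allowlist are assumptions, and the "language must match the page" rule is left out because it cannot be checked with a string test.

```javascript
// Assumption: only these attributes may be set by a candidate.
const ALLOWED_ATTRIBUTES = new Set(["aria-label", "alt"]);

// Returns a list of problems; an empty list means the candidate may proceed.
function validateCandidate(candidate) {
  if (candidate === null || typeof candidate !== "object") {
    return ["candidate is not an object"];
  }
  const problems = [];
  const { attribute, value } = candidate;
  if (!ALLOWED_ATTRIBUTES.has(attribute)) {
    problems.push(`attribute not allowed: ${attribute}`);
  }
  // Empty values are rejected here, including empty alt text, on the
  // assumption that "this image is decorative" is a decision for a person.
  if (typeof value !== "string" || value.length === 0) {
    problems.push("value must be a non-empty string");
    return problems;
  }
  if (value.length > 80) problems.push("value longer than 80 characters");
  if (value !== value.trim()) problems.push("leading or trailing whitespace");
  if (attribute === "aria-label" && /\bbutton\b/i.test(value)) {
    problems.push('aria-label contains the word "button"');
  }
  return problems;
}

console.log(validateCandidate({ attribute: "aria-label", value: "Delete record" }));
// prints []
console.log(validateCandidate({ attribute: "aria-label", value: " Submit button" }));
// prints [ 'leading or trailing whitespace', 'aria-label contains the word "button"' ]
```

Returning a list of problems instead of throwing lets the pipeline attach every reason to the rejected candidate, which is useful when the suggestion is sent back for another attempt.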
2. Apply Candidates to a Throwaway Build
Candidates are applied to an ephemeral working copy, never to the source branch directly. This lets the re-scan run against a real rendered DOM without committing anything a human has not seen.
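// apply.js: writes candidates into a scratch checkout only
import { JSDOM } from "jsdom";
import fs from "node:fs";

export function applyCandidates(html, candidates) {
  const dom = new JSDOM(html);
  const doc = dom.window.document;
  for (const c of candidates) {
    // axe-core reports node targets as arrays. Only a single plain selector
    // is handled here; iframe and shadow DOM targets are skipped.
    const [selector, ...nested] = [].concat(c.target);
    if (nested.length > 0 || typeof selector !== "string") continue;
    const el = doc.querySelector(selector);
    if (!el) continue; // stale selector: skip, do not guess
    el.setAttribute(c.attribute, c.value);
  }
  return dom.serialize();
}

// Read the built page, patch it in memory, write a separate scratch file
const html = fs.readFileSync("dist/index.html", "utf8");
const candidates = JSON.parse(fs.readFileSync("candidates.json", "utf8"));
fs.writeFileSync("dist/index.patched.html", applyCandidates(html, candidates));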
3. The Re-Scan Gate
This is the automated half of the safety contract. The patched output is re-scanned and compared to the pre-fix baseline. The fix is only allowed forward if the original violation is gone and the new violation count has not risen. The full CI implementation of this gate — including ARIA attribute-validity checks — lives in Validating AI-Generated ARIA Fixes in CI.
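// gate.js: fails the build unless the fix is a net improvement
import fs from "node:fs";
import { chromium } from "playwright";
import AxeBuilder from "@axe-core/playwright";

async function scan(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // axe tags are not cumulative, so the WCAG 2.1 tags are listed explicitly
    const { violations } = await new AxeBuilder({ page })
      .withTags(["wcag2a", "wcag2aa", "wcag21a", "wcag21aa", "wcag22aa"])
      .analyze();
    return violations;
  } finally {
    await browser.close(); // release the browser even if the scan throws
  }
}

// Failing-node count per axe rule ID
const countByRule = (violations) =>
  new Map(violations.map((v) => [v.id, v.nodes.length]));

const before = countByRule(await scan("http://localhost:3000/index.html"));
const after = countByRule(await scan("http://localhost:3000/index.patched.html"));
const candidates = JSON.parse(fs.readFileSync("candidates.json", "utf8"));
const targeted = new Set(candidates.map((c) => c.ruleId));

const failures = [];
// No rule may report more failing nodes than before (covers new rules too)
for (const [rule, n] of after) {
  const prev = before.get(rule) ?? 0;
  if (n > prev) failures.push(`${rule}: ${prev} -> ${n} (regression)`);
}
// Every rule a candidate targeted must actually improve
for (const rule of targeted) {
  const prev = before.get(rule) ?? 0;
  if ((after.get(rule) ?? 0) >= prev) failures.push(`${rule}: did not improve`);
}
if (failures.length > 0) {
  console.error(`AI fix rejected:\n${failures.join("\n")}`);
  process.exit(1); // reject and loop back to suggestion stage
}
console.log("Re-scan passed: targeted rules improved, no rule regressed");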
4. The Human Review Gate
The re-scan proves a fix is not worse; it cannot prove the label is correct. Only a human knows that the delete control should not say “Submit.” Surface each accepted candidate as a normal pull-request diff with the prompt, the model, and the before/after counts attached, and require a reviewer approval before the branch is mergeable. Never configure the AI’s service account as an auto-merge actor.
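One way to attach the prompt, the model, and the before/after counts to a pull request is to build a small provenance record per candidate and render it as a review comment. The sketch below is illustrative only: the function names, the field names, and the comment layout are assumptions, and posting the comment to a code host is left out.

```javascript
// Builds the provenance record for one candidate (field names are assumptions).
function buildProvenance({ candidate, prompt, model, beforeCount, afterCount }) {
  return {
    ruleId: candidate.ruleId,
    target: candidate.target,
    attribute: candidate.attribute,
    value: candidate.value,
    prompt, // stored verbatim so the suggestion can be audited later
    model,
    counts: { before: beforeCount, after: afterCount },
    createdAt: new Date().toISOString(),
    reviewedBy: null, // set only when a person approves
  };
}

// Renders the record as plain text for a reviewer to read next to the diff.
function toReviewComment(record) {
  return [
    `Rule: ${record.ruleId}`,
    `Target: ${JSON.stringify(record.target)}`,
    `Proposed: ${record.attribute}="${record.value}"`,
    `Model: ${record.model}`,
    `Violations before/after: ${record.counts.before}/${record.counts.after}`,
    "Prompt:",
    record.prompt,
  ].join("\n");
}

const record = buildProvenance({
  candidate: {
    ruleId: "button-name",
    target: ["#delete"],
    attribute: "aria-label",
    value: "Delete record",
  },
  prompt: "You propose a single accessibility fix. Output JSON only: ...",
  model: "example-model", // placeholder identifier, not a real model name
  beforeCount: 3,
  afterCount: 2,
});
console.log(toReviewComment(record));
```

Keeping reviewedBy empty until a person signs off makes an unreviewed record easy to detect if one ever reaches the main branch.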
Pipeline Integration
Run generation and the re-scan gate as a single CI job that emits candidates.json and a JUnit-style summary artifact. The gate’s process.exit(1) blocks the job; the human gate is a required reviewer on the branch. Upload the before/after axe JSON with actions/upload-artifact so reviewers see exactly what changed. Wire the job as a required check, following the patterns in Pull Request Gating & Branch Policies.
Troubleshooting & Flaky-Test Mitigation
Model output that fails JSON.parse should be retried with a stricter prompt, then surfaced as a failure rather than silently dropped. If the re-scan flickers between counts, the scan is probably running before hydration finishes: add await page.waitForLoadState('networkidle') before analyze(). Cache the pre-fix baseline scan per commit so the comparison is stable across reruns.
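The retry-then-fail behavior for malformed model output could look like the sketch below. It assumes callModel resolves to an object with a text property, as in the generation step; the function name suggestWithRetry, the default of two attempts, and the wording of the stricter instruction are all assumptions.

```javascript
// Retries with a stricter instruction when the reply is not valid JSON,
// then throws so the failure is visible instead of silently dropped.
async function suggestWithRetry(callModel, prompt, maxAttempts = 2) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text =
      attempt === 1
        ? prompt
        : `${prompt}\nReturn exactly one JSON object and nothing else.`;
    const res = await callModel(text);
    try {
      return JSON.parse(res.text);
    } catch (err) {
      lastError = err; // remember why parsing failed, then try again
    }
  }
  throw new Error(
    `Model output was not valid JSON after ${maxAttempts} attempts: ${lastError.message}`
  );
}
```

A caller would typically let the thrown error fail the CI job, so that a violation with no usable candidate is reported rather than missing from candidates.json without explanation.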
Common Pitfalls
- Auto-merging AI branches. The single most dangerous misconfiguration. AI suggestions must always pass a human gate.
- Trusting the diff over the re-scan. A plausible-looking diff can still add violations or carry a semantically wrong label; the re-scan catches the first, and only a human catches the second.
- Comparing total counts without per-rule detail. A fix can remove one violation while adding another of equal count — compare rule IDs, not just totals.
- Losing provenance. Without the originating prompt and model recorded, an incorrect accepted fix is impossible to audit later.
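The per-rule comparison from the pitfalls above can be sketched as a pure function over two axe result arrays, keyed by rule ID plus node target so that a swap of one violation for another is visible even when the totals match. The function names are assumptions, and the sketch assumes selectors stay stable between the two scans, which may not hold if a fix changes an attribute that axe uses to build the selector.

```javascript
// Flattens axe violations into "ruleId|target" keys, one per failing node.
function violationKeys(violations) {
  const keys = new Set();
  for (const v of violations) {
    for (const node of v.nodes) {
      keys.add(`${v.id}|${JSON.stringify(node.target)}`);
    }
  }
  return keys;
}

// Reports which failing nodes disappeared and which are new.
function diffViolations(before, after) {
  const b = violationKeys(before);
  const a = violationKeys(after);
  return {
    fixed: [...b].filter((k) => !a.has(k)),
    introduced: [...a].filter((k) => !b.has(k)),
  };
}

// Totals are 1 -> 1, yet one violation was swapped for another.
const beforeScan = [{ id: "button-name", nodes: [{ target: ["#delete"] }] }];
const afterScan = [{ id: "aria-valid-attr", nodes: [{ target: ["#delete"] }] }];
console.log(diffViolations(beforeScan, afterScan));
// prints { fixed: [ 'button-name|["#delete"]' ], introduced: [ 'aria-valid-attr|["#delete"]' ] }
```

A gate built on this would reject any result whose introduced list is non-empty, regardless of what the totals say.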
FAQ
Can I let the AI commit directly if the re-scan passes?
No. The re-scan only confirms that no automated rule regressed. Many accessibility requirements cannot be verified by automated rules at all, so a human must confirm the label’s meaning. Auto-commit defeats the entire safety model.
Which model should I use?
This pipeline is provider-agnostic by design — the callModel wrapper isolates the vendor. The gates matter far more than the model: a weaker model behind strong gates is safer than a strong model with none.
What if the model proposes a structurally large change like heading restructuring?
Constrain proposals to one attribute or one element per candidate. Large restructures should be split into many small candidates, each independently re-scanned and reviewed, so a single bad suggestion cannot hide inside a big diff.
Related
- Automated Remediation & Accessibility Fixing Patterns — the parent section covering all fixing strategies.
- Validating AI-Generated ARIA Fixes in CI — the runnable CI gate that enforces ARIA validity.
- Using LLMs to Suggest ARIA Labels Safely — constraining accessible-name output before review.