And, even more so: beware of companies who sell them.
I performed my first professional accessibility audit in 2006. Since that time, I’ve performed hundreds of accessibility audits for everything from small websites to massive standalone kiosks in Manhattan. I’ve done audits for small one-person e-commerce stores and massive software companies like Google, Adobe, Microsoft, and Salesforce. For the last decade, most of the audits I’ve done have been either in response to – or from the desire to avoid – a legal threat.
Due to the regulatory and legal scrutiny involved, the quality of an accessibility audit is of the utmost importance. An accessibility audit cannot be based on conjecture and must be performed by testers with a high level of technical knowledge using a clearly defined methodology that is rigorous, reliable, and repeatable. While there is a fair amount of subjectivity inherent to accessibility, a quality audit methodology should seek to reduce that subjectivity by codifying clear opinions into the test methodology. While it is possible to train AI models to adopt one’s opinions, this is not something we see happening among companies who deliver “AI Audits”.
Recently, we’ve started seeing accessibility consultants using LLMs to perform accessibility audits. This seems to be born of a growing tendency to assume that because large language models can produce polished writing, summarize technical information, and generate code that looks plausible, they are ready to do accessibility audits. There are several important reasons why using LLMs for auditing is a bad idea.
The technical reason
Using an LLM for accessibility auditing is, technically speaking, unpredictable. Asking an LLM to assess the accessibility of a specific page, for example, will not cause it to execute any pre-defined set of accessibility rules. Instead, it accesses the page in question using internal tools, then performs a web search for common accessibility barriers and inspects the page for those barriers. It is not doing a deep inspection of anything, much less the code.
ChatGPT says:
“What I’m testing from a URL alone is the page representation that ChatGPT’s web tools can retrieve and expose to me, not the browser’s live DOM, not JavaScript-executed state, and not a raw source file dump. OpenAI’s documentation says ChatGPT can search the web and use web results, but it does not document arbitrary DevTools-style DOM scripting against a live page.”

So, what you’re getting out-of-the-box with an LLM is a poor approximation – at best – of automatic testing. ChatGPT continues:
“I am not directly testing computed browser state such as:
- the live accessibility tree after JavaScript runs
- focus movement after interaction
- keyboard event handling in custom widgets
- computed accessible names after scripts mutate the DOM
- dynamic ARIA state changes
- shadow DOM behavior
- screen reader announcements in a real AT/browser stack”
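To make that gap concrete, here’s a contrived sketch (the markup and attribute names are invented for illustration): a check against the static, fetched HTML flags an icon button as unnamed, while the rendered page – after a script adds `aria-label` at runtime – is fine. A fetch-only tool can only ever see the first half of this story.

```python
from html.parser import HTMLParser

# Hypothetical markup: the server-delivered HTML ships an icon-only button
# with no accessible name; a script adds aria-label="Search" only at runtime.
STATIC_HTML = '<button class="icon-search"></button>'

class UnnamedButtonFinder(HTMLParser):
    """Counts <button> elements lacking a name-bearing attribute in static HTML.

    Simplified on purpose: real accessible-name computation also considers
    text content, aria-labelledby resolution, and more.
    """
    def __init__(self):
        super().__init__()
        self.unnamed_buttons = 0

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            attr_map = dict(attrs)
            if not (attr_map.get("aria-label") or attr_map.get("aria-labelledby")):
                self.unnamed_buttons += 1

finder = UnnamedButtonFinder()
finder.feed(STATIC_HTML)

# Static analysis of the fetched HTML reports a failure...
print(finder.unnamed_buttons)  # 1

# ...but the live DOM, which a fetch-only tool never observes, carries the
# name the script applied at runtime:
live_attrs = {"class": "icon-search", "aria-label": "Search"}
has_accessible_name = bool(live_attrs.get("aria-label"))
print(has_accessible_name)  # True
```

The same page produces opposite verdicts depending on whether you inspect the delivered HTML or the rendered result – which is exactly the state an LLM’s URL-fetching tools cannot reach.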
Agents, Skills, and MCP can’t fix this
Sophisticated users may point to the possibility of using agents, skills, or MCP (or, probably, all three) as a way to enhance the LLM’s capabilities. While agents and skills are merely pre-baked sets of instructions to the LLM, an MCP server can provide real power to the task. If you think of an MCP server as an API that the LLM can query to do specified tasks, then you can see how this could be genuinely useful for accessibility testing: an MCP server that exposes functions for automatic testing gives the model something concrete to call. But, in practice, doing automatic testing through an LLM is worse than doing automatic testing directly within your favorite tool, for reasons I’ll discuss below. Then, of course, there’s also the fact that there’s only so much that automatic testing can do.
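For a sense of what “functions for automatic testing” means in practice, here’s a sketch of one deterministic check such a tool might expose: the WCAG 2.x contrast-ratio formula. The formula itself is real (it’s defined precisely in the WCAG spec); the point is that it needs no LLM judgment at all, which is why running it inside your own tooling beats asking a model to decide whether to call it.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per WCAG 2.x."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb) -> float:
    """WCAG relative luminance: weighted sum of linearized R, G, B."""
    r, g, b = (_linearize(ch) for ch in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg) -> float:
    """WCAG contrast ratio: (L_lighter + 0.05) / (L_darker + 0.05)."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Black on white is the maximum possible ratio, 21:1.
print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))  # 21.0

# Gray #767676 on white clears the 4.5:1 AA threshold for normal text.
print(contrast_ratio((118, 118, 118), (255, 255, 255)) >= 4.5)  # True
```

A check like this is exact, repeatable, and auditable – three properties that vanish the moment an LLM is allowed to decide whether, and how, to run it.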
Recently some companies have begun marketing MCP servers. Some are even charging six-figure prices for them. I’m extremely familiar with MCP and have made multiple MCP servers myself: one for my own personal use, and one for Eventably. The reason MCP cannot overcome anything I’ve said above is that the LLM is left to choose for itself how to address the task it’s being asked to perform. It might use the MCP, it might use some other tool, or it might use the MCP and other tools together. The only reliable way to force an LLM to use the MCP – and nothing else – is to have the MCP server installed locally, have no other competing MCP servers installed locally, and disconnect the computer from the outside world. This will isolate the LLM in a way that leaves it no other choice but to directly use the MCP. Barring that, the LLM will decide for itself what tool to use, even if you tell it the exact MCP functionality it should use.
LLMs thrive on small tasks and die on large ones
In practice, LLMs tend to be very eager to accomplish a task successfully. When asked to perform a big task, they will first break it down into a series of smaller tasks and keep that list in memory (or a temporary file) to track what they’re doing and remember what they’ve done. Not long ago, losing track of tasks and forgetting conversation history was a bigger problem. These days, major LLMs seem to do a better job of remembering things, but that’s really a mirage, because the longer a conversation gets, the more likely it is that details get lost. And, if anything, a good accessibility audit requires details.
As the conversation grows longer, this tendency to lose track of details results in an inability to follow instructions consistently, even with an agent and MCP. This is not a minor inconvenience. It is one of the biggest operational reasons they are not suited to autonomous audit work. Give a model a detailed set of formatting requirements, a template, rules about where code belongs, how screenshots should be taken, and how issue descriptions should be structured, and it may comply for a while. Then it drifts. It leaves fields blank. It puts prose where code should be. It ignores requirements that were explicitly stated. It starts out appearing disciplined, then slowly unravels as the conversation gets longer and the volume increases.
A lack of opinion means a lack of consistency
Accessibility auditing is not a writing exercise. It is not a formatting exercise. It is not a matter of taking the output of automated tools, cleaning up the language, and turning it into a deliverable. It is expert analysis. It requires someone to inspect evidence, understand context, recognize when a supposed issue is not really an issue, determine what standard applies, and recommend a fix that actually addresses the problem without making something else worse. That combination of skepticism, technical accuracy, and situational judgment is exactly where LLMs still fail.
One of the most obvious problems is that they do not handle special circumstances well because they don’t have their own opinion. A thing that looks wrong in isolation may be perfectly acceptable once you understand the larger structure, the intended interaction, the state of the component, or the way assistive technologies actually encounter it. Human auditors know this because they have seen enough real interfaces to understand that context changes everything. LLMs, by contrast, have a strong tendency to flatten nuance into certainty. They see something that resembles a known problem and they proceed as if resemblance is proof.
Another major weakness is that LLMs hallucinate standards mappings with alarming ease. In ordinary content generation, a fabricated detail may be embarrassing. In an accessibility audit, it undermines the entire result. If a model assigns the wrong WCAG Success Criterion to a finding, or worse, gets the conformance level wrong, then it is not doing serious audit work. It is improvising. Calling something Level A when it is actually AAA is not a harmless typo. It changes the meaning of the finding, the compliance implications, and potentially the urgency with which a client responds. Accessibility professionals are expected to be precise about standards because that precision matters. Current LLMs are still far too willing to make that mapping up when they do not actually know it.
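One cheap guard against this failure mode is to never trust a model-claimed mapping at all: validate every Success Criterion and conformance level against a fixed lookup table before anything ships. A sketch – the table below is a small excerpt of real WCAG 2.1 levels; a production version would cover every criterion:

```python
# A few WCAG 2.1 Success Criteria and their actual conformance levels.
# (Excerpt only; a real table would include all criteria.)
WCAG_LEVELS = {
    "1.1.1": "A",    # Non-text Content
    "1.3.1": "A",    # Info and Relationships
    "1.4.3": "AA",   # Contrast (Minimum)
    "1.4.6": "AAA",  # Contrast (Enhanced)
    "2.1.1": "A",    # Keyboard
    "2.4.7": "AA",   # Focus Visible
    "4.1.2": "A",    # Name, Role, Value
}

def check_mapping(criterion: str, claimed_level: str) -> str:
    """Return 'ok', 'wrong-level', or 'unknown-criterion' for a claimed mapping."""
    actual = WCAG_LEVELS.get(criterion)
    if actual is None:
        return "unknown-criterion"
    return "ok" if actual == claimed_level else "wrong-level"

# A model calling Contrast (Minimum) “Level A” gets flagged, not shipped:
print(check_mapping("1.4.3", "A"))    # wrong-level
print(check_mapping("1.4.3", "AA"))   # ok
print(check_mapping("9.9.9", "A"))    # unknown-criterion
```

A guard like this doesn’t make the model smarter; it just refuses to let an improvised mapping into the deliverable.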
Worse, LLMs sometimes completely hallucinate issues. They invent content that has nothing to do with the issue at hand. This is where their fluency becomes actively deceptive. A human reviewer may see a neatly written issue description, a plausible explanation, and a code sample that looks technical enough to pass inspection at a glance. But once you compare it to the actual product that was tested, you notice odd inconsistencies. The description may not match the code or the issue itself may not exist. The rationale may be attached to the wrong pattern entirely. This is not just ordinary inaccuracy. It is counterfeit expertise. The system produces something that looks like competent work while quietly severing itself from the source evidence.
That lack of grounding is especially obvious when models are asked to provide code samples. In a real accessibility audit, “issue code” is not illustrative. It is evidentiary. It is supposed to come from the actual component, the real DOM, the actual markup under review. It is supposed to show the failing element and enough surrounding context to make the problem clear. LLMs are notoriously bad at respecting this boundary. Instead of extracting code from the real source, they generate snippets that look about right, or they repeat whatever some analyzer reported, or they fabricate something based on what they assume the code probably looks like. Once that happens, the rest of the issue becomes untrustworthy. If the evidence is synthetic, then the audit is no longer anchored in reality.
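One way to enforce that boundary mechanically is to refuse any finding whose quoted “issue code” cannot be located in the actual page source. A deliberately simplistic sketch – whitespace-normalized substring matching; a real implementation would also have to handle attribute order, entity encoding, and serialization of the live DOM:

```python
import re

def normalize(markup: str) -> str:
    """Collapse whitespace and case so formatting differences don't mask a match."""
    return re.sub(r"\s+", " ", markup).strip().lower()

def snippet_in_source(snippet: str, page_source: str) -> bool:
    """True only if the quoted 'issue code' actually appears in the page."""
    return normalize(snippet) in normalize(page_source)

# Hypothetical page source under review.
PAGE_SOURCE = """
<form>
  <input type="text" name="email">
  <button type="submit">Go</button>
</form>
"""

# Evidence extracted from the real page survives the check...
print(snippet_in_source('<input type="text" name="email">', PAGE_SOURCE))  # True

# ...while a fabricated, merely plausible snippet does not.
print(snippet_in_source('<input type="email" id="email" required>', PAGE_SOURCE))  # False
```

Any issue that fails this check is, by definition, not anchored in the product that was tested.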
The same thing happens with remediation guidance. There is a persistent fantasy that AI can not only identify accessibility issues but also fix them automatically. Sometimes it can produce something useful for highly mechanical problems. But the moment actual judgment is required, the quality drops fast. One of the most common failure modes is that the recommended code is identical to the failing code. Another is that the code includes a supposed fix that either does nothing or introduces a fresh accessibility problem. In other cases, the guidance becomes so generic that it is functionally useless. “Add an accessible name” is not remediation guidance if the hard part is determining what the accessible name should be, where it should come from, and how it should interact with the rest of the component. Real fixes depend on intent, context, semantics, and implementation details. Those are precisely the areas where LLMs are least dependable.
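The “fix is identical to the failing code” failure mode, at least, is trivially detectable. A minimal sketch, assuming you have both snippets as strings:

```python
import re

def _canonical(code: str) -> str:
    # Collapse whitespace so trivial reformatting doesn't disguise a no-op.
    return re.sub(r"\s+", " ", code).strip()

def is_noop_fix(failing_code: str, recommended_code: str) -> bool:
    """True when the 'fix' is just the failing code handed back."""
    return _canonical(failing_code) == _canonical(recommended_code)

failing = '<img src="chart.png">'

# Reformatting alone is not a fix:
print(is_noop_fix(failing, '<img  src="chart.png">'))  # True

# An actual change (here, adding alt text) passes the gate:
print(is_noop_fix(failing, '<img src="chart.png" alt="Q3 revenue by region">'))  # False
```

Note what this check cannot do: it can prove a recommendation changed nothing, but it cannot prove a changed recommendation is correct. That judgment still belongs to a human.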
This just isn’t a task for LLMs
I’m bullish on AI and I believe that human accessibility testers will be replaced by AI in the future. The fact that AI hasn’t replaced accessibility testers isn’t a skills issue but a data issue. There will be models in the future that can do this job. But that future is years away, even at the breakneck speed that AI is currently advancing.
The right model, at least for now, is human in the loop. Let automated tools identify possible issues. Let human experts verify those findings against the real code, the real DOM, and the real user experience. Let AI assist with packaging and process after the fact. But do not confuse that support role with expertise. An LLM can help an auditor move faster around the edges. It cannot yet replace the actual act of auditing by expert humans. Treat anyone who says otherwise with deep skepticism.
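As a sketch of that division of labor (the field names here are invented for illustration): automated findings land in a review queue, and nothing reaches the deliverable until a named human has verified it.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    description: str
    source: str                       # e.g. "automated-scan" or "manual-review"
    verified_by: Optional[str] = None # name of the human who confirmed it

def report_ready(findings):
    """Only human-verified findings make it into the deliverable."""
    return [f for f in findings if f.verified_by is not None]

queue = [
    Finding("Image missing alt text", "automated-scan", verified_by="auditor-1"),
    Finding("Possible contrast failure", "automated-scan"),  # awaiting review
]

print(len(report_ready(queue)))  # 1
```

The structural point is that verification is a required field, not an optional nicety: the pipeline itself makes unreviewed machine output impossible to ship.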