Agentic Manual Testing
Have an agent do the clicking, typing, and watching that a human QA tester used to do: start the server, visit the URL, try the flow, read the result, and report what broke.
Also known as: Agent-driven QA, Agentic end-to-end testing, Agent pair testing (when paired with a human observer).
Understand This First
- Test — the scripted, executable check this pattern complements rather than replaces.
- Verification Loop — the change-test-inspect-iterate cycle this pattern plugs into.
- Agent-Computer Interface (ACI) — the layer of tools (shell, browser driver, HTTP client) the agent needs for this work.
Context
You’re at the tactical level. The code compiles, the unit tests are green, and the linters are quiet. Someone still has to answer the question automated tests can’t: does the thing actually work end-to-end for a person using it? Historically that answer came from a human QA tester clicking through flows, or from a developer reluctantly doing the same at three in the morning before a release. In an agentic workflow, much of that clicking can be delegated to the agent: the same agent that wrote the code, or a dedicated testing agent sitting alongside it.
This matters most in the agentic era because agents produce changes faster than humans can regression-test them. If the only integration check is “a developer runs the app locally and pokes at it,” that check becomes the bottleneck the moment the agent’s output rate exceeds the developer’s patience.
Problem
Scripted tests cover the behaviors you wrote assertions for. Exploratory testing finds surprises, but it requires a skilled human’s attention. Between them sits a broad, dull band of work that neither kind of test covers well: the manual integration check. Does the signup form actually send the email? Does the file uploader show a progress bar and then a preview? Can you open the admin dashboard on a fresh database without a stack trace? Humans used to do these checks by rote. Nobody wants to script them: scripted versions are brittle and environment-dependent, and each individual check is too cheap to justify the maintenance. But skipping them ships broken software. How do you cover this middle band without hiring a QA team or writing another end-to-end test suite nobody will maintain?
Forces
- End-to-end tests are expensive to write, slow to run, and flaky enough that teams ignore failures.
- A human doing manual QA is fast and flexible, but the labor doesn’t scale with the rate at which agents change the code.
- Agents can now drive a browser, run a dev server, and read network logs; the capability is here, but the discipline for using it is new.
- Agent-written code is especially prone to plausible-but-wrong integration behavior: the API call looks right, returns 200, and silently discards the payload.
- Delegating QA to the same agent that wrote the code creates a conflict of interest; a second pair of eyes (human or another agent) is often needed.
Solution
Give the agent the tools and the charter to act as a manual tester. The kit is concrete: a way to start and stop the application (a shell tool that runs npm run dev or docker compose up), a way to make requests (curl or an HTTP client), a way to drive a browser (Playwright, a Chrome DevTools Protocol wrapper, or a browser MCP server), and a way to read what happened (stdout, network logs, screenshots). The charter is a short English paragraph that names what to test and how to decide if it passed: “Start the dev server. Visit /signup. Register with a new email and a 12-character password. Confirm that a success page appears and that the database contains the new user.”
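The first item in the kit, starting the application and waiting for it to be ready, is where many agent runs go wrong: the agent fires a request before the server is listening and misreports a failure. A minimal readiness-poll sketch (the command and port are placeholders; in practice the agent’s shell tool would run npm run dev or docker compose up):

```python
import subprocess
import sys
import time
import urllib.error
import urllib.request

def wait_for_ready(url, timeout=30.0, interval=0.5):
    """Poll the URL until the server answers, then return the HTTP status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status
        except (urllib.error.URLError, ConnectionError, OSError):
            time.sleep(interval)
    raise TimeoutError(f"server at {url} never became ready")

# Placeholder app: a stdlib static server stands in for the real dev server.
server = subprocess.Popen(
    [sys.executable, "-m", "http.server", "8765"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
)
try:
    status = wait_for_ready("http://127.0.0.1:8765/")
    print("server ready, status", status)
finally:
    server.terminate()
    server.wait()
```

The same poll-until-ready discipline applies whether the next step is a curl request or a full Playwright session.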
Then let the agent run the charter. The agent starts the server, waits for it to be ready, opens a browser or fires a request, observes the response, and writes a short report: what it tried, what it saw, and whether the expected outcome occurred. If anything fails, the agent includes the evidence: the error message, a screenshot, the failing request. The developer reads the report and decides what to do next.
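The report itself benefits from a fixed shape: what was tried, what was seen, pass or fail, and the evidence. A sketch of one possible shape (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass
class CharterReport:
    """One charter run's report. Field names are illustrative."""
    charter: str
    steps: list                                   # what the agent tried, in order
    observed: str                                 # what it saw
    passed: bool
    evidence: list = field(default_factory=list)  # error text, screenshot paths

    def to_markdown(self):
        lines = [f"## Charter: {self.charter}",
                 f"Result: {'PASS' if self.passed else 'FAIL'}",
                 "Steps:"]
        lines += [f"- {s}" for s in self.steps]
        lines.append(f"Observed: {self.observed}")
        if self.evidence:
            lines.append("Evidence:")
            lines += [f"- {e}" for e in self.evidence]
        return "\n".join(lines)

report = CharterReport(
    charter="Signup flow on /signup",
    steps=["started dev server", "visited /signup",
           "registered test@example.com with a 12-character password"],
    observed="success page rendered; new user row present in the database",
    passed=True,
)
print(report.to_markdown())
```

A fixed shape makes reports diffable across runs, which is what lets a human sample them quickly.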
A few habits keep the reports signal-rich rather than noisy:
- Fresh state. Start each session from a known state: a clean database, a fresh browser context, a default feature-flag configuration. Shared state between sessions makes every report suspect.
- Explicit success criteria. “Does the flow work?” is too vague. “Does clicking Create return the user to the dashboard within three seconds and display the new item at the top of the list?” is testable. Write criteria the agent can check.
- Human sampling. Read a random subset of the agent’s reports in full. Agents miss subtle problems: misaligned layouts, confusing copy, the wrong color on a danger button, a loading spinner that never disappears. Sampling catches both agent blind spots and drift in what the agent chooses to flag.
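The explicit-criteria habit becomes concrete when the criterion is a machine-checkable predicate. A sketch of the dashboard example above, with the browser actions stubbed out (all callables and names here are hypothetical stand-ins for the agent’s real tool calls):

```python
import time

def check_create_flow(click_create, fetch_dashboard_items,
                      new_item_title, budget_s=3.0):
    """Return (passed, reason) for: 'clicking Create returns the user to the
    dashboard within three seconds and shows the new item at the top.'"""
    start = time.monotonic()
    click_create()                       # stand-in for a browser click
    items = fetch_dashboard_items()      # stand-in for reading the dashboard
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        return False, f"took {elapsed:.1f}s (budget {budget_s}s)"
    if not items or items[0] != new_item_title:
        return False, f"new item not at top of list: {items[:3]}"
    return True, "ok"

# Simulated run with stubbed actions:
ok, reason = check_create_flow(
    click_create=lambda: None,
    fetch_dashboard_items=lambda: ["Quarterly report", "Older item"],
    new_item_title="Quarterly report",
)
print(ok, reason)
```

Writing the criterion this precisely is most of the work; whether a human or an agent then executes it is almost incidental.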
The goal is not to replace scripted tests. Anything the agent finds worth checking more than twice is a candidate for automation. Agentic manual testing is the staging area between “nobody has tried this yet” and “we have a test for this.”
How It Plays Out
A developer finishes a feature that adds a two-factor authentication flow. The unit tests pass. Instead of running the server and clicking through the flow herself, she writes a one-paragraph charter and hands it to the agent: start the server, register a new account with a real email, confirm the 2FA code arrives in the test inbox, enter the code, confirm the dashboard loads. The agent does exactly that, takes a screenshot at each step, and writes back that the flow works — except the 2FA code email is sent with the plaintext code in the subject line rather than the body. That’s a security bug she would have missed in unit testing, and a bug the agent notices because its charter said “confirm the code arrives” and the subject line was the easiest place to find it.
A small team ships a SaaS product built largely by an agent working from a spec. Before every release they run a smoke suite manually: ten flows that matter most (signup, login, billing, upgrade, downgrade, password reset, invite teammate, change plan, cancel, re-subscribe). The manual run used to take a human 90 minutes. Now they hand the same charter list to a second agent with browser access, Playwright, and a disposable database. The agent runs all ten flows in 12 minutes, flags two regressions (the upgrade flow double-charges the card; the cancel flow doesn’t send the confirmation email), and the team fixes both before the release.
Keep a file called qa-charters.md in the repo. Each charter is three or four sentences: the flow, the inputs, the expected outcome. When you add a feature, add a charter. When a bug ships and you catch it in QA, add a charter that would have caught it. Let the agent read and run the file on a schedule or before each release.
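One entry in such a file might look like this (the flow and details are illustrative):

```markdown
## Charter: Invite teammate
Flow: Settings → Team → Invite. Inputs: a fresh inbox address, role
"Editor". Expected: the invite email arrives within a minute, the
pending invite appears in the team list, and accepting the link lands
the invitee on the team dashboard. Added after bug #—: invites to
addresses with plus-aliases were silently dropped.
```

Three or four sentences is enough; the agent supplies the clicking.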
A developer debugging a reported issue can’t reproduce it locally. Rather than asking the reporter for more screenshots, he hands the agent a charter: reproduce the user’s scenario by clicking through these five specific steps, record the console, record the network tab, report what you see. The agent does the walkthrough in a scripted browser, captures the console error that doesn’t appear in the developer’s own browser (it’s a cache-related edge case), and the developer has the reproduction in minutes instead of days.
Consequences
Benefits. The bulk of routine integration QA stops being a bottleneck. Releases can ship faster without sacrificing the manual-check coverage that teams quietly depended on. Agents are tireless, will happily run the same 40-flow smoke suite every night, and produce artifacts (screenshots, logs, HAR files) a human tester often skips in the interest of time. The reports also surface issues that scripted tests miss: layout breakage after a CSS refactor, confusing error messages, and the class of bug that only appears when you actually look at the page.
Liabilities. The agent can report green on a flow a human would flag; it has no taste about visual design, copy, or UX smell. A second agent or a sampling human still has to close that gap. The agent also needs real tools and real access: a sandboxed environment, a browser driver, possibly test credentials. That infrastructure isn’t free. Flaky charters (ones that sometimes pass and sometimes fail for environmental reasons) train the team to ignore failures the same way flaky scripted tests do; keep charters deterministic or retire them. Finally, letting the agent test its own code is a well-known failure mode: it will happily write a charter that passes for the wrong reason. When the stakes are high, hand the charter to a different agent — or a human — than the one that wrote the code.
Related Patterns
- Complements: Test — agentic manual testing covers flows scripted tests don’t, and promotes successful checks into scripts over time.
- Complements: Exploratory Testing — exploratory testing is human-led hypothesis generation; agentic manual testing is agent-led execution of predefined flows. They cover different bands of the testing spectrum.
- Uses: Verification Loop — each charter run is an instance of the verification loop, with the “test” being a real end-to-end interaction.
- Uses: Agent-Computer Interface (ACI) — the tools the agent drives (shell, browser, HTTP client) are the ACI surface this pattern depends on.
- Uses: Tool — browser drivers, HTTP clients, and dev-server controllers are the specific tools this pattern requires.
- Uses: Sandbox — the environment the agent drives must be isolated from production; a QA agent with production credentials is its own accident waiting to happen.
- Refines: Happy Path — a charter is typically a happy-path scenario plus one or two boundary conditions.
- Feeds: Regression — flows the agent runs repeatedly become candidates for promotion into a scripted regression suite.
- Contrasts with: Code Review — code review reads the change; agentic manual testing runs the change.
- Contrasts with: Consumer-Driven Contract Testing — CDCT verifies the interface between services; agentic manual testing verifies the interface between the application and a user.
- Risks: AI Smell — an agent testing its own code can miss what it wrote wrong; a different agent or a sampling human closes the gap.
- Related: Eval — when the agent itself is the product (a chatbot, a coding assistant), evals play a similar role to charters: scripted scenarios with expected outcomes.
Sources
The manual-testing-with-a-robot idea has long roots. Record-and-playback browser tools like Selenium (2004) automated parts of the clicker’s job but required fragile scripts. The Chrome DevTools Protocol (2017) and Playwright (Microsoft, 2020) made it practical for any program, including a language model, to drive a real browser, capture screenshots, and inspect network traffic.
The specific practice of letting an agent interpret a plain-English charter, drive the tools itself, and write a report in response emerged from the agentic coding practitioner community in 2025 and 2026. The Model Context Protocol (Anthropic, late 2024) made browser-driving capabilities a portable agent skill, and browser-automation MCP servers quickly became standard parts of an agent’s toolkit. The charter-plus-agent approach was formalized in public writing and conference talks over the winter of 2025–2026, as teams realized that the biggest productivity gain wasn’t the code the agent wrote, but the manual QA work it could now do in parallel.
The pattern also draws on Cem Kaner and James Bach’s session-based testing tradition (see Exploratory Testing), which established the charter as the unit of structured-but-open-ended testing. Agentic manual testing differs in that the agent, not a human, executes the session, but the charter form and the debrief discipline are inherited directly.
Further Reading
- Playwright documentation — the de facto standard browser driver for agent-accessible end-to-end testing; the “codegen” and “trace viewer” tools are useful starting points.
- Model Context Protocol documentation — the standard by which agents acquire browser, shell, and HTTP tools in a portable way.
- Elisabeth Hendrickson, Explore It! (Pragmatic Bookshelf, 2013) — the charter form this pattern borrows from, written for human testers but directly applicable.