Web Agents That Can Actually See

The DOM is a lie

I lost a week to a button.

It was right there on the screen, a big blue Submit at the bottom of a form, and my agent swore it didn't exist. More prompting didn't help, because the agent wasn't looking at the screen. It was reading the DOM, and in the DOM that button lived three layers deep inside a shadow root, rendered by a component whose markup looked nothing like what I could plainly see.

That's the whole problem in one anecdote. Most web automation — from Selenium scripts to the newest AI agents — works by reading the DOM: parse the HTML, find the element, click it. Reasonable, right up until the HTML stops describing the page.

It stops constantly. The DOM is a structural representation, not a visual one. CSS can hide elements, reorder them, stack them behind overlays, or paint them in ways that have nothing to do with their place in the tree. A div set to display: none is invisible to you but present in the DOM. A button rendered on a <canvas> is visible to you but absent from it. Shadow DOM, iframes, content injected after load — the gap between what the markup says and what a person sees is wide, and every framework widens it.
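To make the shadow-root case concrete, here is a toy model in Python. It is not a real browser API: the Node class, the tag names, and both walkers are made up for illustration. The only real behavior it mirrors is that an ordinary selector query stops at shadow boundaries, so a search that stays in the regular tree never reaches a button that lives inside one.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """A toy DOM node (hypothetical); `shadow_root` models an attached shadow tree."""
    tag: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    shadow_root: Optional["Node"] = None


def light_dom(node: Node) -> Iterator[Node]:
    """Walk only the regular tree, the way a plain selector query does."""
    yield node
    for child in node.children:
        yield from light_dom(child)


def composed_dom(node: Node) -> Iterator[Node]:
    """Walk the regular tree and descend into every shadow root as well."""
    yield node
    if node.shadow_root is not None:
        yield from composed_dom(node.shadow_root)
    for child in node.children:
        yield from composed_dom(child)


def find_buttons(nodes: Iterator[Node]) -> list[str]:
    """Collect the text of every <button> the walker reaches."""
    return [n.text for n in nodes if n.tag == "button"]


# A Submit button behind three nested shadow roots (invented component names).
inner = Node("x-button", shadow_root=Node("button", text="Submit"))
middle = Node("x-form-footer", shadow_root=inner)
page = Node("body", children=[Node("x-form", shadow_root=middle)])

print(find_buttons(light_dom(page)))     # []
print(find_buttons(composed_dom(page)))  # ['Submit']
```

Piercing shadow roots fixes this one case, but it does nothing for canvas-drawn controls or hidden elements, which is why the rest of this post goes after pixels instead.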

When you use a website, you don't read the source. You look at the screen and figure out what to click. That's the capability we're building.

Vision-language models as the perception layer

The idea is simple to state. Screenshot the page, hand it to a vision-language model along with the task and the history of actions so far, and let the model decide what to do next. It sees the page the way you do — as rendered pixels — and reasons about what's clickable, what's relevant, and what moves it toward the goal.
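In code, the loop is about as small as it sounds. This is a minimal sketch, not our implementation: Action, run_agent, and the three callables are hypothetical names, and the "model" in the usage lines is a stub that clicks once and then stops.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    kind: str          # e.g. "click", "type", "done"
    target: str = ""   # free-form description of what to act on


# Hypothetical interfaces; none of these names come from a real library.
Screenshot = Callable[[], bytes]
Policy = Callable[[bytes, str, list[Action]], Action]
Executor = Callable[[Action], None]


def run_agent(task: str, screenshot: Screenshot, decide: Policy,
              execute: Executor, max_steps: int = 20) -> list[Action]:
    """Observe pixels, ask the model for one action, act, repeat."""
    history: list[Action] = []
    for _ in range(max_steps):
        pixels = screenshot()                   # the page as rendered right now
        action = decide(pixels, task, history)  # model sees pixels, task, history
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history


# Usage with stand-ins for the browser and the model.
def fake_decide(pixels: bytes, task: str, history: list[Action]) -> Action:
    return Action("done") if history else Action("click", "Submit button")


clicked: list[Action] = []
history = run_agent("submit the form", lambda: b"png-bytes",
                    fake_decide, clicked.append)
print([a.kind for a in history])  # ['click', 'done']
```

Everything hard lives inside decide and in how fast it returns, which is where the next paragraphs go.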

Simple to state, brutal to serve fast enough. We run these models through a custom engine that pairs FlashInfer's tensor-core kernels with tree-structured sparse attention; FP8 centroid scoring and top-k page selection pull per-step latency down to about 17 ms, roughly 1.36× faster than SGLang. That number is the point. A web agent that takes thirty seconds to decide what to click is a demo, not a tool. Latency is a constraint you commit to up front, not something you bolt on at the end.
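For readers who haven't met centroid scoring and top-k page selection, here is the idea stripped to pure Python. To be clear about what this is not: it is not our engine, it uses no FP8, no GPU kernels, and no tree structure, and every function name is invented for the sketch. It assumes the KV cache is split into fixed pages, scores each page by the dot product between the query and the page's mean key, and runs softmax attention over the keys in the best k pages only.

```python
import heapq
import math

Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def centroid(keys: list[Vector]) -> Vector:
    """Mean key vector of one KV-cache page."""
    n = len(keys)
    return [sum(col) / n for col in zip(*keys)]


def select_pages(query: Vector, pages: list[list[Vector]], k: int) -> list[int]:
    """Score each page by query . centroid and keep the k best page indices."""
    scores = [dot(query, centroid(page)) for page in pages]
    return sorted(heapq.nlargest(k, range(len(pages)), key=scores.__getitem__))


def sparse_attention_weights(query: Vector, pages: list[list[Vector]],
                             k: int) -> dict[tuple[int, int], float]:
    """Softmax over keys in the selected pages only; other pages are skipped."""
    chosen = select_pages(query, pages, k)
    logits = {(p, i): dot(query, key)
              for p in chosen for i, key in enumerate(pages[p])}
    peak = max(logits.values())  # subtract the max for numerical stability
    exp = {pos: math.exp(v - peak) for pos, v in logits.items()}
    total = sum(exp.values())
    return {pos: e / total for pos, e in exp.items()}


pages = [
    [[1.0, 0.0], [0.9, 0.1]],    # page 0: keys aligned with the query
    [[0.0, 1.0], [0.1, 0.9]],    # page 1: mostly orthogonal to it
    [[-1.0, 0.0], [-0.8, 0.0]],  # page 2: pointing away from it
]
query = [1.0, 0.0]
print(select_pages(query, pages, k=2))  # [0, 1]
```

The saving comes from the ratio: scoring costs one dot product per page rather than one per key, and the full attention pass touches only k pages. The risk is a page whose centroid looks dull but hides one key that matters.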

The rest is unglamorous serving work: KV-cache management across multi-turn interactions, batching across concurrent sessions, graceful degradation when the model is unsure. We spent more time on the infrastructure than on the model — which, for applied work, is the correct ratio.

What makes the web adversarial

Static benchmarks are easy. The web is not static.

A page loads asynchronously: the agent screenshots it, decides, and by the time it acts a modal has appeared or the layout has jumped. Cookie banners, notification prompts, CAPTCHAs, A/B variants — each reshapes a page into a configuration the agent has never seen. Forms validate on blur and surface errors that didn't exist a second ago.
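One cheap guard against acting on a stale screenshot is to look again right before acting. The sketch below is an assumption-heavy illustration, not what we ship: act_if_fresh and its callables are hypothetical, and comparing exact SHA-256 digests is far too strict for real pages, where a blinking cursor changes the pixels. A real check would compare regions or use a perceptual diff.

```python
import hashlib
from typing import Callable


def fingerprint(pixels: bytes) -> str:
    """Cheap identity for a screenshot; any byte change yields a new digest."""
    return hashlib.sha256(pixels).hexdigest()


def act_if_fresh(screenshot: Callable[[], bytes],
                 decide: Callable[[bytes], str],
                 execute: Callable[[str], None],
                 max_retries: int = 3) -> bool:
    """Decide on a screenshot, then act only if the page still looks the same."""
    for _ in range(max_retries):
        seen = screenshot()
        action = decide(seen)
        if fingerprint(screenshot()) == fingerprint(seen):
            execute(action)
            return True
        # The page changed while the model was deciding: observe again.
    return False


# Usage: a modal appears after the first screenshot, then the page settles.
frames = iter([b"form", b"form+modal", b"form+modal", b"form+modal"])
done: list[str] = []
ok = act_if_fresh(lambda: next(frames),
                  lambda px: f"act on {px.decode()}",
                  done.append)
print(ok, done)  # True ['act on form+modal']
```

The first decision, made against the bare form, is thrown away. The agent only acts once what it decided on matches what is actually on screen.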

State is the deeper problem. Real workflows — book a flight, file an expense report, configure a CI pipeline — stretch across dozens of transitions, any of which can fail or redirect. The agent has to notice when it's lost, back out, and try another route. That recovery is the thing people do without thinking and the thing that's genuinely hard to build.

So we built a benchmark around it. FullStackArena is 1,000+ tasks across dynamic web simulations — ride-sharing, finance, social, maps — aimed squarely at these failure modes. Not "can the agent finish on a clean page" but "can it recover when a popup interrupts a checkout," and "does it even notice when a page load failed silently." Robustness on the messy, adversarial web is the only metric that survives contact with reality.

Where this is going

The goal is open-source tooling for autonomous web navigation that works on real sites, not curated demos. Build an agent that reliably handles multi-step workflows across arbitrary websites and you automate an enormous class of work that today still requires a person at a browser, clicking.

We're not there. Models still stumble on long-horizon planning and loop when they hit states they don't recognize. But notice which failures those are. A year ago the wall was perception — the model simply couldn't tell what was on the page. That wall is gone. What's left is latency, recovery, and state: engineering problems, not capability problems. I'll take engineering problems every time.