State of Browser Use, May 2026
I’ve been playing with browser use lately. It’s an interesting problem, mostly because it isn’t like the other web things agents do.
Web search is solved. I covered the whole market in March. Web fetch is solved (Jina Reader, markit, Firecrawl). Both are stateless request/response calls that any agent can plug into. They’re a commodity.
Browser use is not. The job to be done is different and the infrastructure underneath is completely different. You need a real browser. You need tab and session management. You need cookies, local storage, and whatever goodies the page has decided to depend on. Without those, the agent can’t log into anything you care about, which means it can’t do anything you care about. A browser agent that can’t access your state is a demo, not a tool.
Which means: browser use is first and foremost a thing that runs on the user’s computer. There’s a market for running it remotely (Operator, Browser Use Cloud, every hosted Chrome provider in the second half of this post) but I’d bet the ROI on remote-only browser use is low for most users. The whole reason you want an agent driving a browser is so it can do your errands on your accounts. Anything that doesn’t have access to your accounts is solving a different problem.
This post is an overview of where browser use stands in May 2026: the two shapes the whole field reduces to, the cloud providers fighting over the runtime, benchmarks claiming 97% on tasks you wouldn’t trust an intern with, and a prompt injection war that nobody is winning.
I’m building browsemode, which is firmly on the tooling side of the taxonomy below, so treat this as opinionated and the rest as field notes.
Two shapes
Everything in this space is one of two things.
A browser agent. It’s the agent. You give it a goal, it drives Chrome end-to-end, you get a result back. The agent is the product. Most of these come bundled with a hosted-Chrome-as-a-service business attached - the OSS repo is the tooling, the money is in running the browser for you. Browser Use + Browser Use Cloud, Skyvern + Skyvern Cloud, Magnitude + app.magnitude.dev, OpenAI’s Operator, Google’s Project Mariner, Gemini 2.5 Computer Use. Different perception (DOM, screenshots, pixels), different reasoning, but the shape is the same: a self-contained agent product with a browser inside it.
Tooling for agents. Libraries that some other agent calls. Mostly OSS. Playwright itself, Stagehand (Playwright + AI escape hatches), Playwright MCP, Chrome DevTools MCP, Hyperbrowser’s HyperAgent, Notte. Your agent does the reasoning, the tooling does the driving. The vendor’s job is to give you a clean surface (named operations, action caching, accessibility snapshots) and stay out of the way.
A lot of the confusion in the space is that these two shapes overlap on the homepage. Stagehand is tooling but ships an agent primitive that turns it into a browser agent. Browser Use is a browser agent but exposes the underlying primitives so you can drop into tooling mode. The categories blur because everyone wants to sell into both buckets. But the underlying question is always the same: who owns the control loop. If it’s the vendor’s product, you have a browser agent. If it’s your agent, you have tooling.
Code mode is a flavor of the second one and it’s the bet I’m making. The programming model is closer to Playwright than to a Browser Use-style autonomous loop - you write code that drives the page - but with a twist on how the stage is framed. Scan the page into a typed catalog of named elements (signInButton, searchInput, issuesList), feed the catalog to the model, let the model write one short JavaScript block that runs in a sandbox: await page.signInButton.click(). One reasoning step, many actions, deterministic dispatch. The reason I like this framing is that it forces the agent to search the page (find the element name in the catalog) before it acts, and search-before-act is a tool design pattern that consistently produces better agent behavior - I wrote about it in tool design is all about the flow. ABP (Agent Browser Protocol) is a different take on the same instinct, freezing JS and rendering between steps so the agent acts on a stable world state. Browsemode is the SDK/CLI take. I haven’t seen the pattern productized anywhere else yet.
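The scan-then-script flow is easy to sketch. The names below (scanCatalog, CatalogEntry) are mine, not browsemode's or ABP's actual API; a real implementation would walk the accessibility tree over CDP rather than take a pre-scanned element list.

```typescript
// Sketch of the code-mode flow: scan interactive elements into a typed
// catalog of named handles, so the model's one JS block dispatches
// deterministically. Names here are illustrative, not a real API.

type CatalogEntry = { role: string; selector: string };
type Catalog = Record<string, CatalogEntry>;

// Turn a page scan into a catalog keyed by stable camelCase names,
// e.g. ("Sign in", button) -> signInButton.
function scanCatalog(
  elements: { role: string; label: string; selector: string }[],
): Catalog {
  const catalog: Catalog = {};
  for (const el of elements) {
    const words = el.label.toLowerCase().split(/\s+/).filter(Boolean);
    const name =
      words[0] +
      words.slice(1).map(w => w[0].toUpperCase() + w.slice(1)).join("") +
      el.role[0].toUpperCase() + el.role.slice(1);
    catalog[name] = { role: el.role, selector: el.selector };
  }
  return catalog;
}

const catalog = scanCatalog([
  { role: "button", label: "Sign in", selector: "#login" },
  { role: "input", label: "search", selector: "input[name=q]" },
]);
// The model only ever sees the names (signInButton, searchInput) and writes
// `await page.signInButton.click()`; the runtime resolves name -> selector
// without another reasoning step.
```

The point of the catalog is that the model has to find the name before it can act on it — the search-before-act pattern — while the name-to-selector resolution stays deterministic on the runtime side.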
The browser agents
Browser Use is where most people start. Python, MIT, originally Playwright but migrated to direct CDP with their own event-driven watchdog architecture in early 2026. ChatBrowserUse is the tuned default model with OpenAI/Anthropic/Google/Ollama wired up. The cloud version adds stealth, proxies, CAPTCHA, persistent filesystem.
Their March 2026 Auto-Research write-up is the most interesting evals work of the year. They handed Claude Code a CLI to their eval platform and let it loop for 20 cycles, tree-searching the prompt space. That’s how they got 97% on Online-Mind2Web. The takeaway isn’t the score. It’s that the eval problem gets a lot easier when you treat it as a coding agent task.
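The mechanic is simple once you see it as search: treat the prompt as the search space and keep whatever scores best. A toy sketch — Browser Use's actual harness is Claude Code looping against their eval CLI, whereas here mutate() and score() are stand-ins I made up:

```typescript
// Hill-climbing over prompt variants. In the real setup, mutate() is a
// coding agent proposing edits and score() is a full eval run; both are
// mocked here to show the loop shape only.
function optimizePrompt(
  seed: string,
  mutate: (p: string) => string[], // propose candidate prompts
  score: (p: string) => number,    // eval pass rate for a prompt
  cycles = 20,
): { prompt: string; best: number } {
  let prompt = seed;
  let best = score(seed);
  for (let i = 0; i < cycles; i++) {
    for (const candidate of mutate(prompt)) {
      const s = score(candidate);
      if (s > best) {
        best = s;        // keep improvements, discard the rest
        prompt = candidate;
      }
    }
  }
  return { prompt, best };
}
```

Tree search over candidates instead of greedy hill-climbing is a small change to the same loop; the expensive part is always score().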
Stagehand is the TypeScript answer and Browserbase’s home-court advantage. Stagehand on Browserbase, with the Model Gateway routing every provider through one key, is the cleanest production stack on the market right now. Action caching means subsequent runs decay toward Playwright-native speed. Skyvern’s comparison piece on it flagged a 6/10 failure rate on login + 2FA flows without external help, which is fair criticism of every framework in this list.
Skyvern is the one to reach for when DOM access fails. Vision-first, AGPL with a commercial license, native CAPTCHA + 2FA/TOTP + proxies, self-hostable in Docker or K8s. 64.4% on their own WebBench (5,750 tasks, 500+ sites), a harder number than WebVoyager. Reddit consensus for hard real-world flows (“procurement bot across Shopify and Walmart with multi-tab state and 2FA”) tends to land on Skyvern + Claude.
Magnitude is the vision-first TypeScript option. Apache-2.0, pixel coordinates, built-in test runner with visual assertions. 93.9% on WebVoyager, currently #4 on the leaderboard behind Jina (98.9%), Alumnium (98.6%), and Surfer 2 (97.1%). The clean pick if you want vision and don’t want Python.
The foundation-model variants - Operator, Project Mariner, Gemini 2.5 Computer Use - are all browser agents too. Same shape, more polish, less hackable, locked to one vendor’s model. Operator is the one most people have actually used.
The strong second tier (LaVague, Agent-E, Nanobrowser, Browserable, Surfer-H/Holo1, OpenManus, Agent-TARS) all exist and all have their pockets of fans. For most teams the choice is one of the big four.
The cloud infra layer
This is where the actual money is. Every framework above runs on someone’s Chrome, and the question of whose increasingly drives the architecture.
Browserbase is the production default. $20-99+/mo plus usage, $40M Series B in mid-2025 at $300M, 36M+ monthly sessions. Stagehand is theirs and they’re not subtle about it. Model Gateway gives you one key for OpenAI, Anthropic, Gemini Computer Use, with no markup - useful when you want to swap models in evals without rewiring auth.
Steel.dev is the OSS option. Apache-2.0, sub-1s session start, generous free tier. They run the public agent leaderboard, which has become the closest thing to a neutral source of truth.
Anchor Browser has the best login-handling reputation per Reddit consensus. $6M seed, component pricing ($0.05/browser hour + proxy + AI step charges).
Hyperbrowser focuses on stealth at scale. Sub-second startup, thousands of concurrent sessions, native CAPTCHA solving, fingerprint randomization, global IP rotation. HyperAgent extends Playwright with page.ai() / page.extract().
Cloudflare Browser Run (rebrand of Browser Rendering, early 2026) added Live View, Human-in-the-Loop, CDP access, session recordings, 4x concurrency. Cheapest at small scale if you’re already on Workers.
Bright Data Agent Browser is the enterprise pick. $5-8/GB, integrated proxy + CAPTCHA, 95% feature coverage on AIMultiple’s remote-browsers benchmark.
The choice between them mostly reduces to: who do you trust to run your real Chrome, and how much do you care about stealth, OSS, or session replay. Browser Arena (built by Notte) is the most honest comparison I’ve seen.
The Obscura wildcard
There’s one project worth flagging that doesn’t fit anywhere else and might rearrange the whole cloud-browser layer if it matures. Obscura is a Rust headless browser. V8 for JS, full CDP, drop-in for Puppeteer and Playwright. The numbers:
- 30 MB memory vs 200+ MB for headless Chrome
- 85 ms page load vs ~500 ms
- Instant startup vs ~2 seconds
- 70 MB binary vs 300+ MB
- Stealth built in, not bolted on (per-session fingerprint randomization, 3,520 blocked tracker domains, navigator.webdriver masking, native function spoofing)
- Apache 2.0
If the resource numbers hold up at scale, the entire remote-browser pricing model gets rewritten. Browserbase, Hyperbrowser, Anchor all charge for the Chrome resources they’re running on your behalf, and they sell stealth as a premium add-on. Obscura is 7x lighter on memory, faster on every page-load metric, starts instantly, and ships stealth as default behavior. Same CDP surface, same Puppeteer/Playwright compatibility.
It’s still early. Real-world coverage of edge cases (downloads, dialogs, iframe quirks, full request interception) is the part that takes years to nail and Chromium has had two decades. But the architectural bet - that automation deserves a browser engine designed for automation rather than a desktop browser stripped of its UI - is sound, and the early numbers suggest the bet is going to pay off. If Obscura gets to feature-parity with Chromium’s CDP surface, the cloud-browser business changes shape.
The tooling
If you’re not buying a browser agent off the shelf, you’re wiring up your own agent with browser tooling. The interesting options here are mostly OSS.
Stagehand is Playwright with AI escape hatches. You write deterministic code where you can; you call act("click sign in") or extract({ schema }) where you can’t. Successful AI actions get cached as selectors so subsequent runs skip the LLM. TypeScript-first, Browserbase ships it as their primary offering, the Model Gateway routes every provider through one key with no markup. The most polished option in this category.
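The caching idea is worth spelling out, because it's why the economics work. This is not Stagehand's actual internals — just the shape of the mechanism, with names I invented:

```typescript
// Resolve a natural-language instruction with the LLM once, cache the
// resulting selector, and replay it at Playwright speed afterward.
// Hypothetical sketch, not Stagehand's real implementation.
type Resolver = (instruction: string) => Promise<string>; // LLM -> selector

class ActionCache {
  private cache = new Map<string, string>();
  llmCalls = 0; // exposed for observability

  constructor(private resolve: Resolver) {}

  async selectorFor(instruction: string): Promise<string> {
    const hit = this.cache.get(instruction);
    if (hit !== undefined) return hit; // cached: no model call at all
    this.llmCalls++;
    const selector = await this.resolve(instruction);
    this.cache.set(instruction, selector); // later runs skip the LLM
    return selector;
  }

  // If the cached selector stops matching (page changed), drop it and
  // let the next call re-resolve through the model.
  invalidate(instruction: string) {
    this.cache.delete(instruction);
  }
}
```

The invalidate-on-failure path is the part that matters in production: the cache decays runs toward deterministic Playwright, and the LLM only comes back when the page drifts.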
MCP servers are the path of least resistance if your agent is Claude Code or Cursor and you just want it to drive Chrome occasionally. Steve Kinney’s framing of driving vs. debugging is the right one. Playwright MCP drives the browser (clicks, types, fills, navigates). Chrome DevTools MCP debugs it (network, console, performance). Install both first-party servers in lean modes and let the agent pick.
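For Claude Code, that setup is a few lines of `.mcp.json`. Package names were current when I wrote this — check the first-party READMEs before copying, since both projects move fast:

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "chrome-devtools": {
      "command": "npx",
      "args": ["chrome-devtools-mcp@latest"]
    }
  }
}
```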
HyperAgent extends Playwright with page.ai() / page.extract() and runs on Hyperbrowser’s stealth-focused infra. Notte ships its own headless browser with a perception layer that converts pages into structured natural-language maps. Both are smaller communities, both are plausible if Stagehand doesn’t fit.
Code mode (browsemode, ABP) is the framing I described above and the lane I’m building in. Same bucket as the rest of the tooling, different stage.
The benchmarks (with the usual grain of salt)
Current WebVoyager leaderboard (top 10, plus two entries worth tracking):
| Rank | System | Organization | Score |
|---|---|---|---|
| 1 | Jina | Om Labs | 98.9% |
| 2 | Alumnium | Alumnium | 98.6% |
| 3 | Surfer 2 | H Company | 97.1% |
| 4 | Magnitude | Magnitude | 93.9% |
| 5 | AIME Browser-Use | Aime | 92.3% |
| 6 | Surfer-H + Holo1 | H Company | 92.2% |
| 7 | Browserable | Browserable | 90.4% |
| 8 | Browser Use | Browser Use | 89.1% |
| 9 | GLM-5V-Turbo | Z.ai | 88.5% |
| 10 | Operator | OpenAI | 87.0% |
| 12 | Skyvern 2.0 | Skyvern | 85.9% |
| 13 | Project Mariner | Google | 83.5% |
WebVoyager is saturated and almost certainly being benchmaxxed. The top 10 systems are all above 87%, the top four above 93%, the top two within 0.3 points of each other. Once a benchmark runs out of headroom it stops differentiating - the gaps between systems live in the 1% you can’t measure with this, and the incentive shifts from building a better agent to overfitting to the test set. Same pattern that hollowed out every coding benchmark before it. The honest signal moved to harder benchmarks the moment WebVoyager tipped over.
Browser Use’s 97% on Online-Mind2Web is the most impressive single number. ABP + Opus 4.6 scored 90.5%. Most frontier agents manage only ~30%.
Now the other half of the picture. An Illusion of Progress? (Xue et al., COLM 2025) showed that prior benchmarks hugely overstated capability and frontier agents complete only ~30% of real tasks. ClawBench (153 live-site write tasks) caps the best frontier model at 33.3%. WebBench (5,750 tasks) tops at Skyvern’s 64.4%.
The pattern: the easier the benchmark a vendor cites, the more impressive the number. The moment you move to live sites with real writes, real auth, real CAPTCHAs, everyone collapses to roughly the same range.
I covered this in your eval sucks and it applies double here. These scores are mostly vendor benchmarks tuned by the vendor on a small task set. Treat them as relative ordering, not absolute capability.
The actual ceiling: prompt injection
The interesting story of 2026 isn’t reliability. It’s that prompt injection moved from theoretical to wild.
Google’s threat intel reported a 32% relative increase in malicious indirect prompt injection content between November 2025 and February 2026, scanned across CommonCrawl. Unit 42 documented the first large-scale IDPI attacks in the wild in March 2026. Ad review evasion, system prompt leakage, on live commercial platforms. Vectra reports 84% attack success rates against agentic systems with production CVEs above 9.0. OpenAI shipped Lockdown Mode for ChatGPT on February 13. Anthropic published browser-use injection defenses and continues to find novel attacks faster than they patch them.
OWASP put prompt injection at #1 on the 2026 AI threat list. Every framework above is vulnerable to some flavor of “page contains hidden text telling the agent to email itself the user’s cookies”, and none of them have a complete answer.
The arXiv paper from November 2025 on building production browser agents is blunt. Programmatic safety boundaries beat LLM-judged ones. Specialization beats generalization. Prompt-injection mitigation is unsolved at the architecture level.
If you’re shipping a browser agent in 2026 and you haven’t thought about this, you have a real problem. Sandbox the runtime. Constrain the action surface. Don’t give the agent ambient credentials it can’t itself reason about. Treat any text the page renders as adversarial input.
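"Constrain the action surface" deserves a concrete shape. A minimal sketch of a programmatic boundary — the allowlist and type names are mine, and a real deployment would gate far more than navigation — but the structural point is that nothing the page renders can widen the policy:

```typescript
// Programmatic safety boundary: the model proposes actions, a fixed
// policy decides, and page content has no path to changing the policy.
// Illustrative sketch; names and the allowlist are hypothetical.
type Action =
  | { kind: "click"; selector: string }
  | { kind: "type"; selector: string; text: string }
  | { kind: "navigate"; url: string };

// The task defines where the agent may go; injected text cannot add hosts.
const ALLOWED_HOSTS = new Set(["github.com", "docs.github.com"]);

function permitted(action: Action): boolean {
  switch (action.kind) {
    case "click":
    case "type":
      return true; // in-page actions stay inside the sandboxed session
    case "navigate": {
      // Navigation is the classic exfiltration channel ("visit
      // evil.example/?c=<cookies>"), so it gets the hard gate.
      try {
        return ALLOWED_HOSTS.has(new URL(action.url).hostname);
      } catch {
        return false; // malformed URL: deny by default
      }
    }
  }
}
```

This is the "programmatic beats LLM-judged" point from the arXiv paper in miniature: permitted() is a pure function you can test, not a judgment call an injected prompt can argue with.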
What I’m actually picking
My preferred harness is pi. I want browser tooling that fits cleanly into it, not a black-box agent product and not an MCP server.
MCP is the wrong abstraction for what I’m doing. It’s an integration layer for editors and chat clients, not a programming model for agents that need fine-grained control over a runtime. Every MCP browser tool I’ve used flattens the surface to a small set of generic verbs (click, type, screenshot) and hides everything underneath. That’s exactly the wrong direction for an agent harness where I’d rather expose more, not less.
What I want is direct CDP. The browser as a real object I can inspect, scan, and drive. No wrapper deciding what’s interesting. No tool schema deciding what’s reachable. The two projects in this space that are CDP-native are Browser Use (since their migration off Playwright) and browsemode, which is the one I’m building. Browser Use is a great agent if you want a browser agent. Browsemode is what I reach for when I want browser tooling for an agent that already speaks code.
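Mechanically, "direct CDP" just means JSON commands with incrementing ids over a WebSocket to Chrome's debugging endpoint (launch with `--remote-debugging-port`). A sketch of the message framing — the socket wiring is left out, and CdpSession is my name for it, not an API from either project:

```typescript
// The core of a CDP client: commands carry an incrementing id, responses
// echo it back, events arrive with no id. Framing only; no real socket.
type CdpCommand = {
  id: number;
  method: string;
  params?: Record<string, unknown>;
};

class CdpSession {
  private nextId = 0;
  private pending = new Map<number, (result: unknown) => void>();

  // Build the wire message and remember who to wake on the reply.
  send(
    method: string,
    params?: Record<string, unknown>,
  ): { wire: string; reply: Promise<unknown> } {
    const id = ++this.nextId;
    const reply = new Promise<unknown>(res => this.pending.set(id, res));
    const cmd: CdpCommand = { id, method, params };
    return { wire: JSON.stringify(cmd), reply };
  }

  // Route an incoming frame: id present -> command response,
  // id absent -> event (fanned out elsewhere, omitted here).
  receive(raw: string) {
    const msg = JSON.parse(raw);
    if (typeof msg.id === "number") {
      this.pending.get(msg.id)?.(msg.result);
      this.pending.delete(msg.id);
    }
  }
}
```

Everything else — tabs, frames, the accessibility tree, network interception — is just more methods on this one surface, which is exactly why I don't want a wrapper deciding which of them are interesting.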
That’s the bet. Code-mode framing, CDP under the hood, scan-then-script flow, no MCP layer in between.
The honest version
The framework choice matters less than the runtime choice (where does your Chrome live), the model choice (computer-use vs. text+screenshot vs. DOM), and the safety boundaries (what can the agent actually do, regardless of what it decides).
Both shapes work for some workloads. Neither is solved. The benchmarks are theater. Prompt injection is winning.
Pick the shape that fits your problem, not the one with the highest leaderboard number. Then dogfood it until it stops embarrassing you.
Snapshot: May 2026. The space moves fast. Verify before betting a roadmap on any number above.