The most interesting part of the latest AI security benchmark cycle is not that Claude can find hard bugs in Firefox, or that vendors keep posting heroic model screenshots.

It is that once you remove the expert babysitter, the leaderboard starts to look less like a test of intelligence and more like a test of economics.

That is why a post now sitting near the top of /r/netsec/hot is worth more than another round of model bragging. Hacktron's new write-up, titled "Why Mythos doesn't matter (for us)", argues that for real-world vulnerability research, smaller models run repeatedly can beat larger frontier models on cost-to-recall. The headline sounds like benchmark trolling. The underlying point is better than that.

What is actually verified

Hacktron's benchmark is built on a real target and real disclosed bugs. The company says it tested different model configurations on oauth2-proxy v7.15.0, using as ground truth two authentication-bypass vulnerabilities it had already found:

  • CVE-2026-34457 / GHSA-5hvv-m4w4-gf6v: health-check user-agent matching bypass in auth_request mode
  • CVE-2026-40575 / GHSA-7x63-xv5r-3p2x: authentication bypass via X-Forwarded-Uri header spoofing

Those advisories are public through GitHub's advisory API, and oauth2-proxy's v7.15.2 release notes explicitly say the release fixed multiple critical vulnerabilities, including those two bypasses. So the benchmark is not built on imaginary bugs or private scoring rules alone. It is anchored to public remediation artifacts.

The other verified part is the benchmark's shape. Hacktron is not claiming it handed raw source code to a naked model and watched magic happen. The post describes a pipeline with code parsing, call-graph construction, context gathering, code enrichment, prompt assembly, deduplication, and later human validation. For the benchmark itself, Hacktron says it was measuring recall before the validation step, not precision after full triage.

That detail matters because it changes what the table means. This is not a clean statement that one model is better at security research in general. It is a narrower claim: inside Hacktron's pipeline, on this target, which model setup finds the known bugs most cheaply and most often?
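
To make that narrower claim concrete, here is a minimal Python sketch of what measuring recall before validation could look like. It is not Hacktron's scoring code. The run outputs and the FP-* labels are invented for illustration, and only the two CVE identifiers come from the advisories above.

```python
from collections import Counter

# The two publicly disclosed bugs used as ground truth in the benchmark.
GROUND_TRUTH = {"CVE-2026-34457", "CVE-2026-40575"}


def per_finding_recall(runs, ground_truth=GROUND_TRUTH):
    """Return, for each known bug, the fraction of runs that surfaced it.

    `runs` is a list of sets, one set of reported finding IDs per scan.
    Anything outside `ground_truth` is ignored, because recall measured
    before validation does not penalize false positives.
    """
    if not runs:
        raise ValueError("need at least one run")
    hits = Counter()
    for reported in runs:
        for bug in ground_truth & set(reported):
            hits[bug] += 1
    return {bug: hits[bug] / len(runs) for bug in sorted(ground_truth)}


# Hypothetical outputs from four scans; the FP-* entries stand in for
# unvalidated noise that a later triage step would have to remove.
example_runs = [
    {"CVE-2026-34457", "FP-1"},
    {"CVE-2026-34457", "CVE-2026-40575"},
    {"CVE-2026-40575", "FP-2", "FP-3"},
    {"CVE-2026-34457", "CVE-2026-40575"},
]

print(per_finding_recall(example_runs))
```

The unvalidated noise never enters the calculation, which is why a recall table of this kind says nothing about what triage will cost afterwards.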

Why this matters more than the benchmark flex

The background to this debate is easy to see. Anthropic's March write-up about its Mozilla partnership said Claude Opus 4.6 discovered 22 Firefox vulnerabilities in two weeks, with Mozilla classifying 14 of them as high severity. That is real news, and it supports the broader idea that top-tier models are becoming useful security tools.

But it also reflects a human-in-the-loop workflow. Anthropic's post describes researchers validating findings, steering the work, and focusing the search. Hacktron's complaint is not that those results are fake. It is that most buyers do not need an AI-assisted elite research lab. They want a system that can grind through ordinary codebases without a senior operator constantly nudging it back onto the trail.

That is where the economics change.

In Hacktron's table, frontier models still look strong in raw recall. Claude Opus 4.6 and Gemini 3.1 Pro surface the two known findings more reliably than the weakest small models. But Hacktron says the cost story flips once repeated runs enter the picture. In its reported numbers, Gemini 3.1 Flash Lite surfaced Finding A in 9 of 10 runs and Finding B in 10 of 10 runs at an average cost of about $3.70, while Claude Opus 4.6 averaged about $79 per scan. The post also argues that GPT-5.4 Nano, despite weak single-run performance, becomes competitive when rerun enough times because one Opus-priced pass buys many Nano-priced passes.
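
The arithmetic behind that rerun argument fits in a few lines of Python. This is my illustration, not Hacktron's method. It treats both reported averages as per-scan costs and assumes repeated runs are independent, which is optimistic, since runs that share prompts and context tend to miss the same things. The 20 percent single-run hit rate is a made-up placeholder, not a figure from the post.

```python
import math


def runs_per_budget(budget_usd, cost_per_run_usd):
    """How many whole runs of a cheaper setup fit inside a given budget."""
    return math.floor(budget_usd / cost_per_run_usd)


def cumulative_recall(p_single, runs):
    """Chance of at least one hit across `runs` attempts.

    Assumes each run hits independently with probability `p_single`.
    Correlated misses would make the real number lower.
    """
    return 1.0 - (1.0 - p_single) ** runs


OPUS_COST = 79.0        # reported average per scan, Claude Opus 4.6
FLASH_LITE_COST = 3.70  # reported average, Gemini 3.1 Flash Lite

affordable_runs = runs_per_budget(OPUS_COST, FLASH_LITE_COST)
print(f"Flash Lite runs per Opus-priced scan: {affordable_runs}")

# Hypothetical weak model: finds a given bug in only 20% of single runs.
HYPOTHETICAL_P = 0.20
for n in (1, 5, 10, affordable_runs):
    print(f"{n:>2} runs -> {cumulative_recall(HYPOTHETICAL_P, n):.3f}")
```

At the reported prices, one Opus-priced scan covers 21 Flash Lite scans. Whether that many reruns actually lifts a weak model to acceptable recall depends on how correlated its misses are, which only measured data can settle.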

That argument is less glamorous than "frontier model found a browser zero-day," but it may be closer to how this market will settle. Most enterprise software is not Firefox's JavaScript engine. A lot of it is boring Go, Python, Java, Node, auth glue, API edges, config mistakes, and trust-boundary confusion. If small models can be run many times against those codebases with good context packaging, the winning system may be the one that produces the cheapest acceptable recall curve, not the one that looks smartest in a single pass.

The real thesis is about systems, not IQ

This is why the post landed on Reddit.

The catchy version is "small models can beat big ones." The better version is that autonomous security work is a systems problem before it is a pure model problem. Context quality, target decomposition, routing, deduplication, validation cost, and retry strategy all shape the result. A model that is slightly worse per run can still win if it is cheap enough to call over and over inside a workflow that knows how to frame the problem well.
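
A rough Python sketch of that loop, with retries bounded by a budget and findings deduplicated across runs, might look like the following. Everything here is hypothetical: `scan_once` stands in for whatever model call and context packaging a real pipeline uses, the file paths are invented, and a finding is reduced to a (file, bug class) pair purely to keep deduplication simple.

```python
from typing import Callable, Iterable, Set, Tuple

Finding = Tuple[str, str]  # (file path, bug class); a simplifying assumption


def scan_loop(
    scan_once: Callable[[int], Iterable[Finding]],
    cost_per_run: float,
    budget: float,
) -> Tuple[Set[Finding], int, float]:
    """Rerun a cheap scanner until the budget is spent, merging results.

    Returns the deduplicated findings, the number of runs, and the spend.
    """
    seen: Set[Finding] = set()
    spent = 0.0
    runs = 0
    while spent + cost_per_run <= budget:
        seen.update(scan_once(runs))  # set union doubles as deduplication
        spent += cost_per_run
        runs += 1
    return seen, runs, spent


def fake_scan(run_index: int):
    """Stand-in scanner returning canned results instead of calling a model."""
    canned = [
        [("pkg/a.go", "auth-bypass")],
        [("pkg/a.go", "auth-bypass"), ("pkg/b.go", "header-spoofing")],
        [],
    ]
    return canned[run_index % len(canned)]


findings, runs, spent = scan_loop(fake_scan, cost_per_run=1.0, budget=3.0)
print(runs, spent, sorted(findings))
```

In this sketch the model is one swappable function and the rest is plumbing, which is the point: retry policy, deduplication key, and budget are design choices that shape recall independently of which model sits behind `scan_once`.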

That is also why Hacktron's own caveats matter. The company says its prompts may be unintentionally optimized for Gemini Flash, because smaller models are cheaper to iterate on during development. It benchmarked one target, not a broad public suite. It measured recall before the expensive precision-cleanup step. And its pipeline is proprietary enough that outsiders cannot cleanly separate model quality from workflow quality.

Those are not small caveats. They are the whole story.

If Hacktron had published a naked model-vs-model chart with no workflow details, the piece would be easy to dismiss. Instead, the article accidentally makes a stronger point than its anti-Mythos framing suggests: benchmark discourse around security agents is getting distorted by people talking as if the model is the product.

It is not. The product is the loop.

Where the uncertainty still sits

Several claims in the article should stay attributed.

Hacktron's table, average finding counts, and cost numbers are vendor-reported results from its own pipeline. I did not independently rerun the scans. The post's comparison to Anthropic's Mythos framing is also partly interpretive. Anthropic did not claim that a fully autonomous cheap-model stack would lose everywhere. Hacktron is reading against the direction of the hype, not disproving a single formal benchmark.

There is also a target-selection issue. oauth2-proxy is a meaningful real-world project, but it is still one project. A broader public benchmark across multiple codebases and bug classes could shift the picture a lot. The lesson might hold for auth middleware and ordinary application code, then weaken badly for browser engines, kernels, or weird parser logic.

The Reddit reaction was also narrower than a mass industry backlash. The /r/netsec thread was hot enough to notice, but I could not verify much public reaction on X, so I am treating Reddit as the main reaction layer rather than pretending there was broad cross-platform consensus.

What developers should take from this

Do not read this as proof that frontier models are overrated. Read it as a warning that AI security evaluation is starting to split into two different games.

One game asks whether an elite model, guided by an elite operator, can find severe bugs in hard targets. That matters. Anthropic and Mozilla showed that the answer is increasingly yes.

The other game asks what kind of pipeline gives a team the best bug-finding return per dollar when the work has to scale beyond a few carefully guided research sessions. That question matters more for products, procurement, and real security operations. It is also where smaller models, repeated passes, and better context engineering may beat the prestige option.

That is not a small distinction. It is the difference between a demo economy and an operational one.

Sources