One of the stranger things about AI security is how many people trust benchmark scores they would never trust anywhere else.
If someone told you a new static analyzer catches 90% of vulnerabilities, your first question would be: 90% of what? In what code? Under what assumptions? What did it miss? But when an LLM benchmark shows a leaderboard, people often skip those questions and go straight to conclusions.
I did too, until I tried replicating one.
I rebuilt a ZeroDayBench-style workflow locally: a vulnerability detection framework, an LLM integration path, and a deliberately vulnerable target application. Nothing exotic. Just enough structure to test the same claims in a controlled environment. The result was useful not because it proved the benchmark wrong, but because it showed where benchmark truth ends and engineering truth begins.
The easiest part was getting impressive-looking output. Pattern-heavy classes—weak auth, some IDOR shapes, straightforward injection patterns—are where current systems look good. They identify common signatures quickly. They generate patches that are often plausible on first read. They produce neat reports. If your goal is demo quality, you can stop there and look successful.
The hard part starts when you ask a less flattering question: would you ship this?
That is where things change. Complex logic flaws remain hard. Context-dependent authorization mistakes remain hard. Patch confidence numbers do not mean much if the patch changes behavior in ways your tests do not cover. A model can produce code that looks cleaner and still quietly break security invariants.
This is the core distinction: an LLM can be a fast triage engine without being a reliable security authority.
A lot of confusion comes from collapsing those roles. People want one system to do both: broad discovery and final judgment. But those are different jobs. Discovery rewards speed and recall. Judgment rewards precision and determinism. Most failures I saw came from pretending those constraints were compatible by default.
The phrase "autonomous remediation" is a good example. It sounds like a capability. Often it is a packaging choice. If you do not force deterministic validation after patch generation, autonomy just means nobody checked carefully.
The practical framing that held up was simple: assistant, not authority.
Used that way, the system is very valuable. It widens the search space. It drafts candidate fixes. It shortens the loop between suspicion and investigation. But the final gate has to stay deterministic: reproducible checks, explicit exploitability criteria, and human review where business logic is involved.
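Concretely, "the final gate has to stay deterministic" implies a pipeline shape where the model proposes and scripted checks dispose. Here is a minimal sketch of that gate, assuming the exploit reproduction and the regression suite are both scripted as commands; all names (`run_deterministic_gate`, the command arguments) are hypothetical, not part of any real framework:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class GateResult:
    patch_id: str
    exploit_blocked: bool
    tests_pass: bool

    @property
    def shippable(self) -> bool:
        # Both checks must pass; neither verdict comes from an LLM.
        return self.exploit_blocked and self.tests_pass

def run_deterministic_gate(patch_id: str,
                           exploit_cmd: list[str],
                           test_cmd: list[str]) -> GateResult:
    """Re-run a scripted exploit and the regression suite against a candidate patch.

    The LLM only drafted the patch; pass/fail comes from reproducible commands.
    """
    exploit = subprocess.run(exploit_cmd, capture_output=True)
    tests = subprocess.run(test_cmd, capture_output=True)
    return GateResult(
        patch_id=patch_id,
        # If the exploit script now fails, the patch blocked it.
        exploit_blocked=(exploit.returncode != 0),
        tests_pass=(tests.returncode == 0),
    )
```

The point of the sketch is the division of labor: the model's output only ever enters this gate as input, and business-logic classes still get routed to human review on top of it.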
This also changes what a benchmark should optimize for.
Empirical snapshot from the local replication harness
To make this less abstract, here is what one controlled local run actually looked like:
| Vulnerability class | Test cases | Positive findings |
|---|---|---|
| SQL Injection | 4 | 4 |
| XSS | 3 | 3 |
| IDOR | 3 | 3 |
| Authentication bypass | 3 | 0 |
| Total | 13 | 10 |
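Read as rates, the table above makes the asymmetry explicit: the overall number looks strong while one class sits at zero. A quick pass over the same figures:

```python
# Per-class results copied from the table above: (test cases, positive findings).
results = {
    "SQL Injection": (4, 4),
    "XSS": (3, 3),
    "IDOR": (3, 3),
    "Authentication bypass": (3, 0),
}

for cls, (cases, found) in results.items():
    print(f"{cls}: {found}/{cases} = {found / cases:.0%}")

total_cases = sum(c for c, _ in results.values())
total_found = sum(f for _, f in results.values())
print(f"Overall: {total_found}/{total_cases} = {total_found / total_cases:.0%}")
```

An overall 77% hides a 0% class; the headline number and the per-class view support very different deployment decisions.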
Operational and cost signals from the same run:
- End-to-end testing consumed roughly 1,000 tokens (estimated cost: ~$0.04).
- A focused AI-assisted pass used 203 tokens for recon and 135 tokens for vulnerability analysis.
- The biggest bottlenecks were not model quality but environment friction:
  - target app runtime failures,
  - toolchain version mismatches (for example, Python 3.12 requirements),
  - missing Docker/runtime dependencies for some tools.
This is exactly why a single benchmark score is not enough. Even when detection looks strong in a controlled slice, delivery confidence still depends on reproducibility, runtime constraints, and deterministic validation.
Data sources in this workspace:
- 05-Workspace/docs/FINAL_TEST_REPORT.md
- 05-Workspace/docs/FINAL_TESTING_REPORT.md
Most benchmark reporting overweights single-number performance and underweights operational friction. In practice, teams care about questions like these:
- How many false positives did this generate?
- How often did suggested patches preserve behavior?
- How long did triage take with and without the model?
- How much reviewer effort did this save or create?
- Which vulnerability classes improved, and which stayed weak?
Without those, you can get a high score and still build a workflow that drains security engineering time.
There is also a reproducibility problem. If the same model shows different outcomes under minor prompt or context changes, you do not really have a stable benchmark result. You have a screenshot of one run. The fix is boring but necessary: fixed targets, fixed seeds where possible, versioned prompts, and repeated trials with variance reporting.
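Variance reporting does not need to be elaborate. A sketch of the "repeated trials" half, assuming a fixed target and a known case count (the run counts below are hypothetical, not from the replication harness):

```python
from statistics import mean, stdev

def summarize_trials(detections_per_trial: list[int], cases: int) -> dict:
    """Report detection rate as mean and spread across repeated runs,
    rather than a screenshot of one run."""
    rates = [d / cases for d in detections_per_trial]
    return {
        "trials": len(rates),
        "mean_rate": round(mean(rates), 3),
        "stdev": round(stdev(rates), 3) if len(rates) > 1 else 0.0,
    }

# Hypothetical: five repeated runs on a fixed target with versioned prompts.
print(summarize_trials([10, 9, 10, 8, 10], cases=13))
```

A result reported as a mean with a spread, tied to a specific target and prompt version, is something another team can actually check.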
None of this makes benchmarks less useful. It makes them more useful by giving them the right job.
A good benchmark should not be a trophy generator. It should be a decision tool. It should help you decide whether a model belongs in your pipeline, at what stage, and with what guardrails. It should help you estimate risk and staffing implications. It should make you less likely to be surprised in production.
The best next step is not another leaderboard screenshot. It is comparative runs under controlled conditions, across multiple models, with per-class deltas and reviewer-cost metrics. That tells you where to trust automation and where to keep humans in the loop.
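The per-class delta itself is a trivial computation; the value is in forcing the comparison onto identical targets. A sketch, with hypothetical rates for two models (none of these numbers come from the runs above):

```python
def per_class_delta(baseline: dict[str, float],
                    candidate: dict[str, float]) -> dict[str, float]:
    """Detection-rate deltas per vulnerability class, assuming both models
    were run on the same fixed targets under the same conditions."""
    return {cls: round(candidate[cls] - baseline[cls], 2) for cls in baseline}

# Hypothetical per-class detection rates for two models.
baseline = {"SQLi": 0.90, "XSS": 0.80, "IDOR": 0.70, "Auth bypass": 0.10}
candidate = {"SQLi": 0.95, "XSS": 0.80, "IDOR": 0.85, "Auth bypass": 0.10}
print(per_class_delta(baseline, candidate))
```

A table of deltas like this, plus reviewer-cost per class, answers the staffing question directly: automate where the delta and the rate are both high, keep humans where the rate stays near zero.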
In other words: benchmark less like marketing, and more like engineering.