<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Vulnerability-Research on Napat&#39;s Inverse Blog</title>
    <link>/tags/vulnerability-research/</link>
    <description>Recent content in Vulnerability-Research on Napat&#39;s Inverse Blog</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Sat, 21 Mar 2026 18:48:00 +0700</lastBuildDate>
    <atom:link href="/tags/vulnerability-research/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>ZeroDayBench Replication: What Actually Holds Up in Practice</title>
      <link>/ai-analysis/zerodaybench-replication-field-notes/</link>
      <pubDate>Sat, 21 Mar 2026 18:48:00 +0700</pubDate>
      <guid>/ai-analysis/zerodaybench-replication-field-notes/</guid>
      <description>&lt;p&gt;One of the stranger things about AI security is how many people trust benchmark scores they would never trust anywhere else.&lt;/p&gt;
&lt;p&gt;If someone told you a new static analyzer caught 90% of vulnerabilities, your first questions would be: 90% of what? In what code? Under what assumptions? What did it miss? But when an LLM benchmark publishes a leaderboard, people often skip those questions and jump straight to conclusions.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
