PoC Quality Gates: How to Pass Vulnerability Program Triage

There’s a moment every vulnerability researcher dreads. You’ve spent weeks on a PoC. You submit it. The triage team comes back with: “Unable to reproduce.”

Not “invalid.” Not “out of scope.” Just… we couldn’t make it crash.

That response tells you nothing about your research. It tells you everything about your presentation.

The real bottleneck

Most people think vulnerability programs reject PoCs because the finding isn’t serious enough. In my experience, that’s rarely the case. Programs reject PoCs because the triager spent fifteen minutes trying to reproduce it, failed, and moved on to the next one in their queue.

Triage is not peer review. It’s triage — a battlefield metaphor that’s more accurate than most people realize. Someone is lying on a stretcher, bleeding. The doctor has thirty seconds to decide: operate now, or move to the next patient. Your PoC is the patient. Your documentation is whether the doctor can find the wound.

The triager is not your adversary. They’re a resource-constrained professional doing an exhausting job. Most triage teams at major programs are understaffed, burnout-prone, and processing dozens of submissions a week from researchers whose documentation ranges from excellent to unintelligible. They want to accept your finding. You just need to make it possible.

The fundamental insight is this: a vulnerability is not a fact about code. It is a fact about code as demonstrated by you. If the demonstration fails, the fact doesn’t exist — at least not to the person making the decision.

What “reproducible” actually means

Reproducibility has three layers, and most researchers nail only the first.

The first layer is: it crashes on my machine. Congratulations. You have a local artifact, not a vulnerability report.

The second layer is: it crashes on your machine. This is where most submissions die. Different kernel config, different compiler flags, different ASAN settings — suddenly your use-after-free becomes a silent corruption that manifests as something completely unrelated, or doesn’t manifest at all.

The third layer, the one that actually gets accepted, is: here is a self-contained environment that reproduces this on any compatible host. Not “run this script and hope.” Not even “here are my config flags.” Here is a Dockerfile that builds the exact kernel and boots it under QEMU (a container shares the host’s kernel, so the target kernel has to run in a VM), a syzkaller reproduction program that hammers the race window, and a docker-compose up that produces the crash within two minutes. The modern gold standard for Layer 3 is not documentation; it’s infrastructure.

Telling a researcher to “list their config flags” is 2015-era advice. Professional PoCs in 2026 are delivered as reproducible environments: a Dockerfile, a Vagrantfile, or a syzkaller repro case that eliminates the “it doesn’t crash on my machine” variable entirely. If the triager can reproduce your finding by running a single command, you’ve won. If they can’t, you’ve lost — regardless of how real the bug is.

Reproducibility is not binary

There’s a subtlety that most writing on this topic misses: reproducibility isn’t a switch. It’s a probability distribution.

A clean null-pointer dereference crashes every time. A use-after-free with specific heap layout might crash 60% of the time. A race condition with a narrow timing window might crash 10% of the time. A double-fetch vulnerability with CPU affinity requirements might crash 2% of the time.

If you submit a PoC that crashes 10% of the time and the triager runs it three times, there’s a 73% chance they’ll see nothing. They’ll mark it “unable to reproduce” and you’ll feel cheated.

The solution is statistical reproducibility. Your PoC should include a reliability harness: a loop with a bounded runtime that makes a crash all but certain within a reasonable timeframe. If your bug triggers 10% of the time on each independent attempt, a loop of 50 iterations will crash with about 99.5% probability. Run it for two minutes. If it doesn’t crash, something is almost certainly wrong with the environment, not the bug.
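The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch that assumes every attempt is independent and has the same trigger rate, which real race conditions only approximate; the function names are illustrative.

```python
import math


def miss_probability(p: float, runs: int) -> float:
    """Chance that `runs` independent attempts all fail to trigger the bug."""
    return (1.0 - p) ** runs


def runs_needed(p: float, confidence: float) -> int:
    """Smallest number of independent attempts that triggers the bug at
    least once with probability >= confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# A 10% bug run three times by the triager: how often do they see nothing?
print(f"miss chance over 3 runs:  {miss_probability(0.10, 3):.1%}")      # 72.9%
# The same bug inside a 50-iteration harness.
print(f"hit chance over 50 runs:  {1 - miss_probability(0.10, 50):.1%}")  # 99.5%
# Iterations needed to reach 99% confidence.
print(f"runs for 99% confidence:  {runs_needed(0.10, 0.99)}")             # 44
```

Measuring the per-attempt rate on the reproduction environment itself, rather than on a development machine, keeps the loop count honest.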

The key artifacts for statistical reproducibility:

- A loop count or time budget (“run for 120 seconds or until crash”)
- A success signal (exit code 0 on crash detection, non-zero on timeout)
- A reliability metric (“triggers in ~45 seconds on average, 100% within 2 minutes”)
- Heap grooming or setup steps that narrow the timing window
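A harness along these lines might look like the sketch below. It assumes the crash signature shows up in the stdout or stderr of the command being looped; for a kernel target that usually means the command is a wrapper that boots the guest and echoes its console. The function name, defaults, and the demo command are all hypothetical.

```python
import subprocess
import sys
import time


def run_until_crash(cmd, signature, budget_s=120.0, max_iters=1000):
    """Run `cmd` repeatedly until `signature` appears in its output.

    Returns (crashed, attempts, elapsed_seconds). Stops when either the
    time budget or the iteration cap is exhausted.
    """
    start = time.monotonic()
    attempts = 0
    while attempts < max_iters and time.monotonic() - start < budget_s:
        attempts += 1
        remaining = max(budget_s - (time.monotonic() - start), 0.1)
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True,
                errors="replace", timeout=remaining,
            )
        except subprocess.TimeoutExpired:
            break  # the attempt outlived the budget; report a timeout
        if signature in proc.stdout + proc.stderr:
            return True, attempts, time.monotonic() - start
    return False, attempts, time.monotonic() - start


def exit_code(crashed: bool) -> int:
    """Exit code 0 on crash detection, non-zero on timeout."""
    return 0 if crashed else 1


# Demo with a stand-in command that "crashes" on the first attempt.
demo_cmd = [sys.executable, "-c", "print('BUG: KASAN: slab-use-after-free in demo_fn')"]
crashed, attempts, elapsed = run_until_crash(demo_cmd, "BUG: KASAN", budget_s=30)
print("crashed" if crashed else "timeout", "after", attempts, "attempt(s)")
```

A real entry point would finish with `sys.exit(exit_code(crashed))` so the triager's tooling can read the result without parsing any text.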

A triager who sees “crashes within 30 seconds, run ./harness.sh” is going to get a very different result than one who sees “crashes sometimes, try running it a few times.” The first one respects their time. The second one wastes it.

The negative control

Here’s an idea that sounds obvious but almost nobody does: include a test that doesn’t crash.

If you claim your PoC triggers a use-after-free in function X, also show that the same test without your trigger input does not crash. This seems trivial. But it’s incredibly powerful, because it tells the triager: “I understand the difference between my bug and the baseline noise of this system.”

Kernel test setups crash sometimes, under certain conditions, for reasons completely unrelated to your vulnerability. If you can’t distinguish your crash from the background crash rate, you can’t claim you’ve demonstrated anything. This is especially critical for race conditions, where random crashes from unrelated bugs can masquerade as your finding.

A negative control takes five minutes to add and eliminates the single most common triage objection: “we see crashes, but we’re not sure they’re caused by your PoC.”
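One way to make the comparison explicit is to count signature hits in both sets of run logs. The sketch below assumes each run's log is already captured as a string; the function names and verdict wording are illustrative, not a standard.

```python
def count_hits(logs, signature):
    """Number of run logs that contain the crash signature."""
    return sum(1 for log in logs if signature in log)


def compare_to_baseline(trigger_logs, baseline_logs, signature):
    """Summarize how runs with the trigger differ from the negative control."""
    if not trigger_logs or not baseline_logs:
        raise ValueError("need at least one trigger run and one baseline run")
    t_hits = count_hits(trigger_logs, signature)
    b_hits = count_hits(baseline_logs, signature)
    if t_hits == 0:
        verdict = "not reproduced"
    elif b_hits == 0:
        verdict = "signature appears only with the trigger"
    else:
        verdict = "baseline also shows the signature: tighten the control"
    return {
        "trigger_rate": t_hits / len(trigger_logs),
        "baseline_rate": b_hits / len(baseline_logs),
        "verdict": verdict,
    }


# Hypothetical logs: four runs with the trigger input, four without it.
sig = "BUG: KASAN: slab-use-after-free in demo_release"
with_trigger = [sig + "+0x4c/0x90", "ok", sig + "+0x4c/0x90", "ok"]
without_trigger = ["ok", "ok", "ok", "ok"]
summary = compare_to_baseline(with_trigger, without_trigger, sig)
print(summary)
```

Running the baseline for at least as long as the positive run matters here: a short baseline that shows no crash says little about the background rate.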

Evidence, not narrative

The biggest mistake I see is treating a vulnerability report like a blog post. Researchers write flowing narratives about their discovery process. They explain their thought process. They describe the “journey.”

Nobody cares about your journey.

What triage needs is evidence. Structured, machine-readable, unambiguous evidence. Think of it like a court filing. The judge doesn’t want to hear how you felt when you found the bug. They want the exhibit list.

For memory corruption, this means kernel log signatures: the exact KASAN output, the BUG: message, the call trace. Not “the system became unstable.” Not “I observed unexpected behavior.” The raw, copy-pasteable log line that proves memory corruption occurred at this address, in this function, through this path.
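Pulling that signature out of a log can be automated. The sketch below matches the general shape of a KASAN report header and its access line; the sample log, function name, address, and task name are invented for illustration, and the patterns may need adjusting for a given kernel version.

```python
import re

# Header line, e.g. "BUG: KASAN: slab-use-after-free in func+0x4c/0x90"
HEADER = re.compile(
    r"BUG: KASAN: (?P<kind>[\w-]+) in (?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# Access line, e.g. "Read of size 8 at addr ffff... by task poc/1337"
ACCESS = re.compile(
    r"(?P<access>Read|Write) of size (?P<size>\d+) "
    r"at addr (?P<addr>[0-9a-f]+) by task (?P<task>\S+)"
)


def extract_signature(log: str):
    """Return the first KASAN report's key fields, or None if there is none."""
    header = HEADER.search(log)
    if header is None:
        return None
    fields = header.groupdict()
    access = ACCESS.search(log, header.end())
    if access is not None:
        fields.update(access.groupdict())
        fields["size"] = int(fields["size"])
    return fields


# Hypothetical log excerpt, not taken from a real report.
SAMPLE_LOG = """\
[   42.123456] ==================================================================
[   42.123460] BUG: KASAN: slab-use-after-free in demo_release+0x4c/0x90
[   42.123465] Read of size 8 at addr ffff888012345678 by task poc/1337
"""

signature = extract_signature(SAMPLE_LOG)
print(signature)
```

The extracted fields are a summary only; the raw log lines should still be attached verbatim.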

For logic bugs, this means a concrete input that produces the wrong output, with the expected output clearly stated. Not “authorization can be bypassed.” But: “sending this exact request to this endpoint returns data belonging to user ID 42 when authenticated as user ID 7.”
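For a logic bug, the same idea can be written down as a small structured record. This is a sketch only: the endpoint and field names are hypothetical, and the user IDs are the ones from the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LogicBugExhibit:
    request: str            # exact request that was sent
    authenticated_as: int   # user ID the session belongs to
    expected: str           # what a correct server should do
    observed_owner_id: int  # owner of the data that actually came back

    def is_violation(self) -> bool:
        """True when the response exposes data owned by someone else."""
        return self.observed_owner_id != self.authenticated_as

    def to_json(self) -> str:
        record = asdict(self)
        record["violation"] = self.is_violation()
        return json.dumps(record, indent=2)


exhibit = LogicBugExhibit(
    request="GET /api/orders/1001",  # hypothetical endpoint
    authenticated_as=7,
    expected="403 Forbidden, or only data owned by user 7",
    observed_owner_id=42,
)
print(exhibit.to_json())
```

Stating the expected output alongside the observed one is what turns the record from an anecdote into something a triager can check.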

The difference between these two approaches is the difference between a report that gets a “duplicate” or “informative” label and one that gets a severity rating and a bounty.

The age of silent mitigations

Here’s a problem that’s getting worse fast: a crash is becoming a luxury.

Modern targets ship with hardware-assisted mitigations such as MTE (Memory Tagging Extension), PAC (Pointer Authentication Codes), and CET (Control-flow Enforcement Technology), plus software-level defenses like CFI (Control Flow Integrity). These don’t just make exploitation harder. They make demonstration harder. A use-after-free that would have crashed spectacularly on a 2020 kernel can now be stopped quietly by the mitigation. The process dies with a SIGSEGV that looks identical to any other SIGSEGV. Or worse: the corruption is detected and the operation is simply rejected, with no crash at all.

If your PoC relies on a visible crash as evidence, you’re fighting the last war.

The modern evidence standard for mitigated targets is different. When a crash isn’t possible, your evidence package needs to shift from crash logs to state observation: register snapshots showing the corrupted pointer before PAC catches it, branch-tracing logs showing the gadget chain was reachable before CFI blocked it, MTE fault reports showing the tag mismatch. These are harder to collect, but they’re also more convincing — because they prove you understand the vulnerability at the hardware level, not just the software level.

A report that says “here is the MTE fault report showing tag mismatch at this address” will be taken more seriously than one that says “I disabled MTE and it crashes.” The first researcher understood the target. The second one defeated it.

Vulnerability as code

If you automate your quality checks — a script that builds, runs, and classifies each PoC — you’ve done something important beyond saving time. You’ve made it possible for someone else to verify your work by running one command.

This is huge. Triage teams are more likely to trust a PoC that comes with a Makefile and a test.sh than one that comes with a paragraph saying “compile with gcc and run as root.” Not because they’re lazy. Because a test script is falsifiable. They can run it, and if it works, the trust is earned. If it doesn’t, the report is dead — but it would have died anyway, just slower.

I’ve started thinking of this as “vulnerability as code.” The PoC is not a document. It’s a program. Programs should compile, run, and produce deterministic output. If your PoC can’t do that, it’s not ready to submit.

The gold standard isn’t a Makefile — it’s a full reproduction environment. A docker-compose.yml that sets up the exact kernel version, the exact toolchain, the exact config flags, and runs your PoC inside it. Or a syzkaller repro case that the triager can drop into their existing syzkaller instance. This eliminates the “it worked on my machine” problem entirely. The triager doesn’t need your machine. They need a Docker daemon.

Yes, containerizing kernel exploits is harder than containerizing web exploits. Yes, QEMU-based setups are heavier than Docker. But the investment pays for itself across every submission you make. You build the environment once and reuse it for every PoC in the same target family. And every time a triager pulls your repo and sees docker-compose up, your credibility goes up — not because you’re flashy, but because you respected their time.

The compound effect of a standard library

One thing I’ve noticed: the researchers who consistently submit clean, well-evidenced PoCs tend to have something in common. They don’t treat each submission as a one-off event. They maintain a personal library — a standard collection of primitives, triggers, and reproduction environments that they reuse and refine across projects.

Think of it as a researcher’s dojo. Every time you build a reliable PoC, you’re not just proving a vulnerability; you’re adding a tool to your standard library. The negative control template becomes reusable. The Docker environment becomes a base image. The one-sentence description becomes muscle memory. The reliability harness becomes a script you copy and adapt.

This is where the “Second Brain” methodology isn’t just productivity theater — it’s a direct quality amplifier. A researcher with a well-organized knowledge base is less likely to submit sloppy work, not because they’re more careful, but because they’ve already solved the problem before. They’re not starting from scratch each time. They’re composing from tested components.

The gap between a junior researcher and a senior one isn’t talent. It’s the size and organization of their standard library.

The uncomfortable truth about escalation claims

Let me talk about the elephant in the room. Many PoCs claim to demonstrate an exploit — full root access, arbitrary code execution, data exfiltration — when what they actually demonstrate is a crash. The escalation path is described in text, sometimes with pseudocode, but never actually implemented.

I understand why researchers do this. A crash is a finding. An exploit is a bounty. The financial incentive pushes people to frame their work in the most impressive light possible.

But here’s the problem: when a triager sees “privilege escalation to root” and the PoC just panics the kernel, something breaks. Not just the report. The trust relationship.

And here’s what nobody wants to admit: triagers keep score. Not formally: there’s no spreadsheet with your name on it (that you know of). But internally, experienced triage teams develop a mental model of each researcher. “This one overclaims.” “This one’s reliable.” “This one submits noise.” Once you’re flagged as an overclaimer, the damage is not to that one report. It’s to every future report you submit. Your critical findings get deprioritized. Your edge cases stop getting the benefit of the doubt.

Reputation in this field is a slow-building, fast-destroying asset. Overclaiming is reputational suicide — not just for one submission, but for your entire track record within a program. You become the researcher who cried root.

The smarter play — counterintuitive as it seems — is to claim less than you can prove. “I can reliably trigger a use-after-free in this kernel function. I believe this is exploitable for LPE, but I have not yet demonstrated the full chain.” That report is more likely to get a higher severity rating than one that claims root without proof. Because the first one is honest, and honesty is the scarcest resource in vulnerability triage.

A framework, not a checklist

What I’ve found works is a simple classification for every PoC I write:

Trigger. It crashes, and I can prove it. I have logs, I have signatures, I have a negative control, and I have a reliability harness.

Primitive. It doesn’t crash, but I can demonstrate a controlled effect — arbitrary read, arbitrary write, type confusion. The primitive is reliable even if the exploitation is theoretical.

Concept. I believe this code path is vulnerable, but I haven’t triggered it yet. This is research, not a report.

Overclaimed. I said it does something, but the evidence only supports something weaker.
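The four categories can be turned into a small self-check. The mapping below is one possible reading of the definitions above, not a formal rule: the field names are invented, and a crash that lacks a negative control or a harness is treated as a Concept because it does not yet meet the Trigger definition.

```python
from dataclasses import dataclass
from enum import Enum


class PoCClass(Enum):
    TRIGGER = "Trigger"
    PRIMITIVE = "Primitive"
    CONCEPT = "Concept"
    OVERCLAIMED = "Overclaimed"


@dataclass
class Evidence:
    crash_logged: bool          # a crash log with a signature exists
    negative_control: bool      # a baseline run without the trigger exists
    reliability_harness: bool   # a bounded loop reproduces the crash
    controlled_primitive: bool  # e.g. a demonstrated arbitrary read or write
    claims_full_exploit: bool   # the report claims root, code execution, etc.
    full_exploit_shown: bool    # that chain is actually implemented


def classify(e: Evidence) -> PoCClass:
    # A claim the evidence does not support outranks everything else.
    if e.claims_full_exploit and not e.full_exploit_shown:
        return PoCClass.OVERCLAIMED
    if e.controlled_primitive:
        return PoCClass.PRIMITIVE
    if e.crash_logged and e.negative_control and e.reliability_harness:
        return PoCClass.TRIGGER
    # Anything short of the full Trigger evidence set stays a Concept here.
    return PoCClass.CONCEPT


crash_only_but_claims_root = Evidence(True, True, True, False, True, False)
print(classify(crash_only_but_claims_root).value)  # Overclaimed
```

Checking the overclaim condition first is deliberate: a solid crash does not rescue a report whose headline says root.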

That last category is the one nobody wants to admit exists in their portfolio. But it does. The fastest way to improve as a vulnerability researcher is to go through your old submissions and honestly classify them. If you’re like most people, you’ll find at least one that you’d rather not think about.

The one-sentence test

Before you submit anything, try this: describe the finding in one sentence, using only factual claims — including the impact.

Not: “This is a critical vulnerability that could allow attackers to compromise the system.”

Not even: “Sending a malformed packet of [this format] to [this endpoint] causes a use-after-free in [this function], demonstrated on kernel [version].”

But: “An unprivileged user can trigger a use-after-free in [this function] via [this input path], leading to Local Privilege Escalation by corrupting [this kernel object], demonstrated on kernel [version] with [these config flags], producing [this KASAN signature].”

Knowing where it crashes is step one. Knowing what it breaks is step two. Without the impact vector, you’ve described a bug, not a vulnerability. The one-sentence test forces you to articulate both. If you can’t fill in the “leading to [impact]” part with something concrete, you probably have a concept, not a finding.

If you can write that sentence, it should be the first thing in your report. If you can’t, you’re not ready to submit.
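The test can be enforced mechanically by refusing to build the sentence while any slot is empty. This is a sketch: the field names and all sample values (function, device path, object name, kernel version) are hypothetical.

```python
REQUIRED_FIELDS = (
    "actor", "bug_class", "function", "input_path", "impact",
    "target_object", "version", "config", "signature",
)

TEMPLATE = (
    "{actor} can trigger a {bug_class} in {function} via {input_path}, "
    "leading to {impact} by corrupting {target_object}, demonstrated on "
    "kernel {version} with {config}, producing {signature}."
)


def one_sentence(fields: dict) -> str:
    """Build the summary sentence, or fail loudly if a slot is unfilled."""
    missing = [k for k in REQUIRED_FIELDS if not str(fields.get(k, "")).strip()]
    if missing:
        raise ValueError("not ready to submit, missing: " + ", ".join(missing))
    return TEMPLATE.format(**fields)


finding = {
    "actor": "An unprivileged user",
    "bug_class": "use-after-free",
    "function": "demo_release()",
    "input_path": "a racing close() on /dev/demo",
    "impact": "Local Privilege Escalation",
    "target_object": "a freed struct demo_ctx",
    "version": "6.1",
    "config": "CONFIG_KASAN=y",
    "signature": "BUG: KASAN: slab-use-after-free in demo_release",
}
print(one_sentence(finding))
```

Leaving `impact` blank raises an error, which is the point: without it the sentence describes a bug, not a vulnerability.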

The meta-lesson

I think the deeper lesson here isn’t really about vulnerability research at all. It’s about empathy for the person on the other side of your submission.

Every field has gatekeepers. Researchers submit papers to conferences. Startups pitch investors. Writers send manuscripts to editors. And in every case, the people on the other side are overwhelmed, underpaid, and looking for reasons to say yes — but they can only say yes to things they can verify. Saying yes to an unverifiable claim isn’t generosity. It’s negligence.

The people who consistently get past the gatekeepers aren’t necessarily the most talented. They’re the ones who understand that the gatekeeper’s job is hard, and make it easier. Not by being clever, but by being clear. Not by demanding trust, but by building it — one reproducible, well-evidenced submission at a time.

In vulnerability research, clarity means reproducibility. It means evidence. It means treating your PoC not as a demonstration of your skill, but as a gift to the person who has to evaluate it. A gift that says: I know your job is hard. I’ve done everything I can to make this easy for you. All you have to do is press enter.

The best PoC I ever submitted wasn’t the most sophisticated. It was the one where the triager didn’t have to ask a single follow-up question.

Appendix: PoC Submission Template

If you want to put these ideas into practice immediately, here’s the structure I use for every submission:

## One-Sentence Summary (with Impact)
[An unprivileged user can trigger X in function F via input path P,
leading to IMPACT by corrupting/exploiting TARGET, demonstrated on
kernel/app V with config C, producing log signature S]

## Classification
[ ] Trigger    [ ] Primitive    [ ] Concept    [ ] Overclaimed

## Environment
- Target:       [kernel version / app version / firmware]
- Toolchain:    [compiler, version, flags]
- Config:       [relevant .config flags or environment variables]
- Mitigations:  [KASLR, KPTI, CFI, MTE, PAC — which are ON/OFF]
- Reproduction: [docker-compose up | vagrant up | syzkaller repro]

## Reliability
- Trigger rate: [~X% per attempt, ~Y seconds average]
- Harness:      [./harness.sh runs N iterations or T seconds]
- Negative:     [./harness.sh --baseline produces no crash in T seconds]

## Evidence Package
- Positive:     [attached log showing crash / fault / wrong output]
- Negative:     [attached log showing baseline run — no crash]
- Signature:    [exact grep-able string from log]
- Mitigated:    [if applicable: MTE fault report, PAC mismatch, CFI block]

## Steps to Reproduce
1. git clone <repo> && cd <dir>
2. docker-compose up   (or: syz-execprog repro.syz inside the test VM)
3. ./harness.sh        (or: ./harness.sh --baseline for negative control)
4. dmesg | grep "<signature>"

## Claimed Impact
[What this vulnerability *actually* demonstrates — with the specific
kernel object, data structure, or privilege boundary affected.
If the escalation is theoretical, say so explicitly.]

## What I Did NOT Prove
[Optional but recommended. "I did not demonstrate code execution.
I did not bypass CFI. I did not test on kernel 6.x."]

This template forces honesty before you submit. If you can’t fill in the “Evidence Package” section, you’re not ready. If you can’t fill in the “Reliability” section with a concrete trigger rate, you haven’t run your PoC enough times. If you can’t fill in the “What I Did NOT Prove” section, you’re probably overclaiming. And if the one-sentence summary doesn’t include an impact vector, you’ve found a bug but not yet a vulnerability.
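A pre-submission check for the template could be as small as the sketch below. It assumes the report keeps the `## ` headings shown above and marks the chosen classification with `[x]`; it treats every section as required, including “What I Did NOT Prove”, and it only detects empty sections, not weak ones.

```python
import re

REQUIRED_SECTIONS = [
    "One-Sentence Summary (with Impact)",
    "Classification",
    "Environment",
    "Reliability",
    "Evidence Package",
    "Steps to Reproduce",
    "Claimed Impact",
    "What I Did NOT Prove",
]


def split_sections(report: str) -> dict:
    """Map each '## ' heading to the stripped text that follows it."""
    sections = {}
    current = None
    for line in report.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def unfilled_sections(report: str) -> list:
    """List required sections that are missing or empty."""
    sections = split_sections(report)
    problems = [name for name in REQUIRED_SECTIONS if not sections.get(name)]
    classification = sections.get("Classification", "")
    if classification and len(re.findall(r"\[[xX]\]", classification)) != 1:
        problems.append("Classification (check exactly one box)")
    return problems


draft = """\
## One-Sentence Summary (with Impact)
An unprivileged user can trigger a use-after-free in demo_release().

## Classification
[x] Trigger    [ ] Primitive    [ ] Concept    [ ] Overclaimed
"""
print(unfilled_sections(draft))
```

For the draft above, the check reports the six sections that have not been written yet.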
