The Sandpile Theory of Security

Most companies think about security the wrong way.

They think a vulnerability is a thing. A defect. A ticket. A row in a dashboard. Something with a CVE number, a CVSS score, an owner, and a due date.

This is a natural way to think, because it is how security work is usually organized. A scanner finds a vulnerability. Someone triages it. Someone patches it. The number goes down. Progress.

But this view is misleading in the same way that looking at one grain of sand is misleading if what you care about is an avalanche.

In real systems, vulnerabilities are not independent objects. They are connected. They accumulate. They interact with identity systems, deployment pipelines, shared libraries, cloud permissions, abandoned services, developer habits, and organizational incentives. A single bug is rarely the whole story. It is usually the match. The fuel has been piling up for years.

This is why vulnerability management often feels strangely futile. You fix hundreds of issues and still feel unsafe. Then some tiny thing (a forgotten S3 bucket, an exposed token, one dependency in one obscure service) causes a disaster wildly out of proportion to its apparent size.

That is not bad luck. That is the nature of complex systems.

A better model for security is not a checklist. It is a sandpile.

Imagine dropping grains of sand, one by one, onto a table. At first nothing interesting happens. The pile grows. Each grain is harmless. But eventually the pile reaches a critical state. After that, the next grain may do nothing, or it may cause a small slide, or it may cause half the pile to collapse.

The strange thing is that you cannot tell by looking at the next grain how big the slide will be. The cause looks tiny. The consequence can be huge.

This is self-organized criticality. And it is a much better metaphor for security than the spreadsheet.
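
The sandpile can be simulated in a few lines. What follows is a minimal sketch of the classic Bak-Tang-Wiesenfeld model, not anything specific to security: the grid size, the toppling threshold of four, and the number of grains are assumptions chosen only to keep the run short.

```python
import random

def drop_grain(grid, row, col, threshold=4):
    """Add one grain at (row, col), then topple cells until the grid is stable.

    Returns the avalanche size, counted as the number of topplings.
    Grains pushed past the edge of the grid fall off the table.
    """
    size = len(grid)
    grid[row][col] += 1
    unstable = [(row, col)]
    topplings = 0
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue
        # The cell topples: it sheds one grain to each of its four neighbors.
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings

def simulate(size=20, grains=5000, seed=1):
    """Drop grains at random positions and record each avalanche size."""
    rng = random.Random(seed)
    grid = [[0] * size for _ in range(size)]
    return [
        drop_grain(grid, rng.randrange(size), rng.randrange(size))
        for _ in range(grains)
    ]

avalanches = simulate()
quiet = sum(1 for a in avalanches if a == 0)
print(f"grains that caused no slide: {quiet / len(avalanches):.0%}")
print(f"largest slide: {max(avalanches)} topplings")
```

Every grain in the simulation is identical. What differs is the state of the pile it lands on, and that is the whole point.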

The Modern Enterprise Is a Sandpile

Every dependency is a grain. Every exception to a policy is a grain. Every over-permissioned role, stale secret, unpatched library, manual production access, orphaned server, and “temporary” firewall rule is a grain. Each one seems tolerable. None is the obvious cause of catastrophe.

Then one day there is an incident report.

The report says the attacker exploited a specific vulnerability. This is true in the narrow sense, and false in the useful sense. The attacker did not merely exploit a vulnerability. They exploited the shape of the pile.

This explains why the distribution of attacks is so lopsided. In theory, every vulnerability matters. In practice, only a small fraction matter a lot. The world does not exploit vulnerabilities evenly. Attackers do not distribute effort like compliance auditors. They swarm around the few bugs that are reachable, automatable, reliable, profitable, and connected to valuable systems.

So the distribution is heavily skewed. A small number of exploited vulnerabilities account for a huge share of real attacks. This should change how we think about prioritization.

The old question was:

How severe is this vulnerability?

The better question is:

Where is this vulnerability in the system?

A critical vulnerability on a dead internal box may matter less than a medium vulnerability in an internet-facing service connected to identity, secrets, and deployment permissions. Severity is not irrelevant, but it is not enough. CVSS tells you how sharp the knife is. It does not tell you whether the knife is sitting in a locked drawer or already pressed against your neck.

This is why EPSS-like thinking is closer to reality than CVSS-only thinking. Exploit probability matters because the attacker’s economy matters. Reachability matters because topology matters. Blast radius matters because the system matters.
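
To make the contrast concrete, here is a rough sketch of a priority score that multiplies severity by exploit probability, exposure, and blast radius. The formula, the 0.1 discount for internal systems, and the sample numbers are all illustrative assumptions, not a standard and not part of CVSS or EPSS.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    name: str
    cvss: float            # CVSS base score, 0.0 to 10.0
    epss: float            # EPSS exploitation probability, 0.0 to 1.0
    internet_facing: bool  # is the vulnerable service reachable from outside?
    blast_radius: int      # systems reachable after compromise (hypothetical input)

def priority(finding: Finding) -> float:
    """Hypothetical score: how sharp the knife is, times where it is sitting."""
    exposure = 1.0 if finding.internet_facing else 0.1  # assumed discount
    return (finding.cvss / 10.0) * finding.epss * exposure * (1 + finding.blast_radius)

# Made-up findings matching the two cases described above.
dead_box = Finding("critical on a dead internal box", 9.8, 0.02, False, 0)
hub = Finding("medium on an internet-facing hub", 5.5, 0.40, True, 25)

for finding in sorted([dead_box, hub], key=priority, reverse=True):
    print(f"{priority(finding):8.3f}  {finding.name}")
```

Under these assumptions the medium finding outranks the critical one by a wide margin, because three of the four factors describe the system rather than the bug.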

Security teams already know this intuitively. They know not all criticals are critical. They know some “mediums” make them nervous. They know the scariest finding is often not the one with the biggest number, but the one that sits at the intersection of too many things.

A shared library used by everything.

A CI/CD system that can write to production.

An identity provider trusted by every service.

A cloud IAM role with permissions no one fully understands.

A logging agent with access everywhere.

These are not ordinary assets. They are hubs.

And in a network, hubs matter disproportionately.

Hubs and Load-Bearing Systems

This is one of the most important lessons security can borrow from network science: risk is not spread evenly. Most nodes are boring. A few are load-bearing. If one ordinary service is compromised, you have an incident. If your identity provider is compromised, you have a company-wide event. If your build pipeline is compromised, you may not even know which systems are still yours.
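
One simple way to find the load-bearing nodes is to ask, for each system, how much else falls if it falls. The sketch below does that with a breadth-first search over a small trust graph. The graph, the node names, and the meaning of an edge are hypothetical, invented for illustration.

```python
from collections import deque

# Hypothetical trust graph: an edge A -> B means
# "compromising A gives a path into B".
TRUST = {
    "idp": ["ci", "wiki", "billing", "api"],
    "ci": ["api", "billing", "prod-db"],
    "api": ["prod-db"],
    "wiki": [],
    "billing": [],
    "prod-db": [],
}

def blast_radius(graph, start):
    """Return every node reachable from start, excluding start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen

# Rank systems by how many others they can drag down with them.
ranking = sorted(TRUST, key=lambda n: len(blast_radius(TRUST, n)), reverse=True)
for node in ranking:
    print(f"{node:8s} reaches {len(blast_radius(TRUST, node))} other systems")
```

In this toy graph the identity provider reaches all five other systems and the build pipeline reaches three, while most nodes reach none. That is the shape the paragraph above describes: most nodes are boring, a few are load-bearing.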

This means a mature security program should look less like gardening and more like structural engineering.

The gardener tries to remove every weed.

The structural engineer asks which columns hold up the building.

Most vulnerability management programs still behave like gardeners. They count weeds. They create charts showing weed reduction over time. They celebrate when the number of weeds falls by 17%.

But the attacker is not trying to admire your garden. The attacker is looking for the column.

This is why “patch everything” is both obviously right and practically wrong. Of course, in an ideal world, everything would be patched. In the real world, time is finite, engineers are busy, systems are old, dependencies are tangled, and patching can break production. A strategy that assumes infinite capacity is not a strategy. It is a wish.

The uncomfortable truth is that zero vulnerabilities is not a realistic goal. Worse, it may not even be the right mental model.

Complex systems generate vulnerability the way cities generate traffic. You can reduce it. You can route it. You can make it less deadly. But you cannot eliminate it while the system remains alive and changing.

A company that ships software will create vulnerabilities. A company that uses cloud services will create misconfigurations. A company that hires humans will create exceptions. Every act of growth adds grains to the pile.

So the question is not:

How do we prevent every avalanche?

The question is:

How do we prevent the avalanche that buries the village?

This leads to a different kind of vulnerability management.

1. Map the Pile

First, map the pile.

Not every asset. Not every theoretical dependency. The actual topology of risk. Which systems talk to which systems? Which identities can assume which roles? Which services can deploy to production? Which libraries are everywhere? Which vendors sit inside the blast radius? Which logs contain secrets? Which “internal” systems are one proxy rule away from the internet?

Most companies do not know their real topology. They have diagrams, but diagrams are often aspirations. The real architecture is in Terraform drift, Slack messages, emergency exceptions, forgotten scripts, and the muscle memory of senior engineers.
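
The gap between the diagram and reality can be measured once both are written down as edges. This is a minimal sketch that assumes you already have two sets of (source, destination) pairs: one transcribed from the architecture diagram, one observed in practice by whatever means you collect it. The system names are made up.

```python
def topology_drift(documented, observed):
    """Compare the connections the diagram claims with the ones actually seen."""
    return {
        # Real connections nobody drew: this is where surprises live.
        "undocumented": sorted(observed - documented),
        # Drawn connections that no longer exist: the diagram is stale here.
        "stale": sorted(documented - observed),
    }

# Hypothetical edges, each meaning "source talks to destination".
documented = {("web", "api"), ("api", "db"), ("api", "cache")}
observed = {("web", "api"), ("api", "db"), ("legacy-cron", "db"), ("ci", "prod")}

drift = topology_drift(documented, observed)
print("undocumented:", drift["undocumented"])
print("stale:", drift["stale"])
```

The undocumented list is the interesting one. Each entry is a path that exists in the pile but not in anyone's mental model of it.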

Attackers discover the real architecture faster than defenders because they have fewer assumptions. They follow what works.

2. Prioritize the Head, Not the Tail

Second, prioritize the head, not the tail.

The long tail of vulnerabilities can absorb infinite attention. There will always be old packages, minor findings, weak configurations, and theoretical issues. Some should be fixed. But if the tail consumes the team, the head goes unattended.

The head is where exploitation is likely, reachability is real, and blast radius is large. A vulnerability in the head deserves urgency. A vulnerability in the tail deserves process. Mixing them is how teams burn out.

This is also why dashboards can be dangerous. A dashboard rewards what is countable. But risk is not proportional to count. Closing 500 low-risk findings may look better than fixing one toxic combination of internet exposure, privilege escalation, and production access. But the latter may matter more.
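
A small piece of arithmetic shows how the two views diverge. The risk weights below are invented purely for illustration; the point is only that a count and a weighted sum can rank the same two sprints in opposite order.

```python
def dashboard(closed_findings):
    """Summarize a sprint two ways: by count and by (assumed) risk removed."""
    return {
        "closed_count": len(closed_findings),
        "risk_removed": sum(f["risk"] for f in closed_findings),
    }

# Sprint A closes 500 low-risk findings; sprint B fixes one toxic combination.
# The risk values are made up, not derived from any scoring standard.
sprint_a = [{"id": f"low-{i}", "risk": 0.01} for i in range(500)]
sprint_b = [{"id": "toxic-combination", "risk": 40.0}]

for name, sprint in (("A", sprint_a), ("B", sprint_b)):
    summary = dashboard(sprint)
    print(f"sprint {name}: closed={summary['closed_count']}, "
          f"risk removed={summary['risk_removed']:.1f}")
```

A dashboard that shows only the first column will reward sprint A every time.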

Good security metrics should make the important thing visible, not merely the abundant thing.

3. Watch for Pre-Critical Signals

Third, watch for pre-critical signals.

Before a sandpile collapses, it often becomes unstable in subtle ways. In organizations, the same thing happens. Systems heading toward failure usually give off signals before the incident.

MTTR quietly gets longer.
Security exceptions become normal.
Alert volume rises, but trust in alerts falls.
No one knows who owns key systems.
Infrastructure changes become scary.
Patching a dependency requires five teams and three weeks of meetings.
The backlog grows faster than it shrinks.

These are not just operational annoyances. They are signs that the pile is steepening.
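
Two of these signals are easy to compute from data most teams already have. The sketch below fits a least-squares slope to weekly MTTR values and compares findings opened against findings closed. The function names, the sample series, and the choice of any positive slope as the trigger are assumptions for illustration.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator

def steepening_signals(mttr_days, opened, closed):
    """Return the warning signs present in the given weekly series."""
    signals = []
    if slope(mttr_days) > 0:
        signals.append("MTTR is trending up")
    if sum(opened) > sum(closed):
        signals.append("backlog grows faster than it shrinks")
    return signals

# Hypothetical weekly numbers.
mttr_days = [4, 5, 5, 7, 9, 12]
opened = [30, 32, 35]
closed = [28, 25, 20]

for signal in steepening_signals(mttr_days, opened, closed):
    print("warning:", signal)
```

Neither number says anything about a specific vulnerability. Both say something about whether the organization can still absorb change.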

The naive view says the problem is that there are too many vulnerabilities. The systems view says the problem is that the organization has lost the ability to absorb change. That is much more dangerous.

A company with many vulnerabilities but fast response may be safer than a company with fewer vulnerabilities but slow, brittle, confused response.

Resilience beats cleanliness.

4. Create Small Avalanches on Purpose

Fourth, create small avalanches on purpose.

This is the part that sounds wrong until you think about it. If large avalanches are caused by accumulated instability, then one way to prevent them is to release instability before it becomes catastrophic. Forest managers use controlled burns. Security teams should do the same.

Whether it is a bug bounty, a simulated breach, a red team getting a little too close to production, or forcing a credential rotation in the middle of a Tuesday, the point is not theater. The point is to discover where the pile wants to slide while the slide is still small.

Many companies avoid these exercises because they are disruptive. But that is exactly why they are useful. A test that never disrupts anything is often just a ritual. The real question is not whether the system looks secure when undisturbed. The real question is how it behaves when poked.

Attackers are poking it anyway. Better to poke it first.

5. Design for Graceful Failure

Fifth, design for graceful failure.

Most security programs still speak the language of prevention. Prevent compromise. Prevent exploitation. Prevent unauthorized access.

Prevention matters. But in a sufficiently complex system, prevention alone is a fantasy. Something will fail. Someone will click. Some dependency will break. Some token will leak. Some vendor will be compromised. Some zero-day will arrive on a Friday night.

The mature question is:

What happens next?

Can the attacker move laterally?

Can they reach secrets?

Can they alter builds?

Can they impersonate employees?

Can they persist?

Can they destroy logs?

Can they turn one compromised service into the whole company?

This is where resilience becomes more important than purity. The best systems are not the ones that never fail. They are the ones whose failures stay small.

A great security architecture is not a wall. It is a set of compartments.

The goal is not to make intrusion impossible. The goal is to make intrusion disappointing.

An attacker gets into one service and finds no credentials.

They steal a token and find it is short-lived.

They compromise a workload and cannot reach metadata.

They phish an employee and hit strong device binding.

They find a vulnerability but cannot reach the vulnerable path.

They get code execution but not deployment authority.

Every disappointment is a small avalanche instead of a big one.
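
The effect of compartments can be shown by computing what an attacker can reach from one compromised service, before and after a few connections are removed. The edges and names below are hypothetical; an edge means the source holds some credential or network path that leads to the destination.

```python
def reachable(edges, start):
    """Return every node an attacker can reach from start by following edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen - {start}

# A flat design: the web service can read secrets and reach cloud metadata.
flat = {
    ("web", "secrets"),
    ("web", "metadata"),
    ("metadata", "cloud-admin"),
    ("secrets", "ci"),
    ("ci", "prod"),
}

# The compartmented design removes two edges: no stored credentials,
# no route from the workload to the metadata service.
compartmented = flat - {("web", "secrets"), ("web", "metadata")}

print("flat:", sorted(reachable(flat, "web")))
print("compartmented:", sorted(reachable(compartmented, "web")))
```

In this toy example the same initial compromise reaches five systems in the flat design and none in the compartmented one. The vulnerability in the web service did not change. Only the shape of the pile did.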

This is the deeper reason hub nodes matter. You cannot protect everything equally, but you can make sure the things that connect everything are unusually hard to abuse. Identity, CI/CD, cloud IAM, secrets management, endpoint management, and shared libraries are not just systems. They are the load-bearing beams of the company.

Treating them like ordinary assets is malpractice.

From Spreadsheets to Graphs

The security industry often talks as if the future will be solved by better scoring. Better severity scores, better asset scores, better risk scores. Scores are useful, but they can also preserve the wrong worldview. They imply that risk is a property of individual items.

But the worst incidents are usually properties of relationships.

A medium bug plus public exposure plus weak identity plus excessive permission plus poor logging plus slow response equals a crisis. None of the ingredients alone tells the story. The risk lives in the combination.

This is why the next generation of vulnerability management should look more like graph analysis than spreadsheet sorting.

Not “show me all critical CVEs.”

Show me vulnerabilities reachable from the internet that touch services with privileged identities.

Show me dependencies used by twenty production systems where exploit code exists.

Show me assets whose compromise creates a path to deployment pipelines.

Show me identity routes from a developer laptop to production secrets.

Show me systems with rising MTTR and unclear ownership.

Show me where a small trigger can produce a large cascade.

That is the real work: finding the avalanche paths.
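
Questions like these are path queries, not sort orders. As a rough sketch, the query "show me identity routes from a developer laptop to production secrets" becomes a breadth-first search for a path between two nodes. The graph and its node names are hypothetical.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return the shortest list of hops from source to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the route.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical access graph: an edge means "can authenticate to or act on".
ACCESS = {
    "dev-laptop": ["vpn", "git"],
    "vpn": ["bastion"],
    "git": ["ci"],
    "ci": ["prod-secrets"],
    "bastion": ["staging"],
}

route = shortest_path(ACCESS, "dev-laptop", "prod-secrets")
print(" -> ".join(route) if route else "no route")
```

Here the route runs dev-laptop, git, ci, prod-secrets: three hops, none of which would stand out in a list sorted by severity.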

The companies that understand this will still patch vulnerabilities. But patching will not be the center of the worldview. The center will be topology, exploitability, concentration, and resilience.

They will ask fewer questions like:

How many vulnerabilities do we have?

And more questions like:

Which vulnerabilities could become avalanches?

That one change may be the difference between a security program that looks good and a security program that works.

Because the attacker does not care about your average risk.

The attacker cares about the steepest part of the pile.

References

Bak, Per. “How Nature Works: The Science of Self-Organized Criticality.” Springer (1996).
EPSS (Exploit Prediction Scoring System). https://www.first.org/epss/
CVSS (Common Vulnerability Scoring System). https://www.first.org/cvss/
Newman, M.E.J. “Networks: An Introduction.” Oxford University Press (2010).
