A gigawatt AI data center feels strong. A fortress: fences, guards, cameras, backup generators, thick walls. Mathematically, it can be the most fragile thing you have ever built.
The cloud is not in the sky. It is a building. A very expensive building, wired to a substation, fed by fiber, filled with GPUs, and cooled by industrial water and power systems.
For about twenty years, "data center security" mostly meant cybersecurity: firewalls, identity, encryption, zero trust. That work still matters. It is no longer the whole job. AI turns a data center into something stranger, a machine that converts electricity into intelligence, and those machines are now among the most capital-dense objects on Earth.
The numbers are public. OpenAI's Stargate was announced as $500B over four years, later framed as a $500B, 10GW program. Meta announced a $10B, 4 million square foot AI campus in Louisiana. TSMC put its US investment at $165B. These are not normal buildings. They are energy-to-intelligence machines.
Our first real lesson building this in Israel was simple. The bigger the machine, the more dangerous it is to put it all in one place.
What a data center actually is
Strip the marketing away and a data center is a machine for two jobs: push power into racks, pull heat out of them, safely, for years. Most of the cost and most of the risk lives in that plumbing, not in the walls.
Power comes in high and gets stepped down stage by stage. The grid delivers high voltage, often 138, 230, or 345 kV. An on-site substation steps it to medium voltage, around 11 to 33 kV. Near the hall, transformers drop it to about 415 V, then the rack takes it lower, and the chip runs at under a volt.
Why keep voltage high for as long as possible? Because power lost as heat in a wire rises with the square of the current: P = I²R, where I is the current and R is the wire's resistance. Higher voltage means lower current for the same power, which means less heat and less copper. That single equation shapes the entire building.
The hardware is heavier than it sounds. The big step-down transformers run 50 to 100 MVA each (an MVA is roughly a megawatt of capacity), they are custom-built, and lead times can pass a year because they depend on a special grain-oriented steel with few suppliers. You cannot order one next-day.
So operators buy redundancy and name it precisely. N means exactly enough gear to run. N+1 means one spare. 2N means a full second set. The Uptime Institute turns this into Tiers: a Tier III site is concurrently maintainable, meaning you can service any part without going dark; a Tier IV site is fault tolerant, meaning a single failure should not take it down.
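The N, N+1, and 2N labels reduce to a small counting rule. The sketch below shows it for a hypothetical pod; the 2,500 kW load and 1,000 kW unit size are illustrative assumptions, not figures from any real site.

```python
import math


def units_installed(load_kw: int, unit_kw: int, scheme: str) -> int:
    """Return how many units a redundancy scheme installs for a given load."""
    n = math.ceil(load_kw / unit_kw)  # N: exactly enough units to carry the load
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1  # one spare unit
    if scheme == "2N":
        return 2 * n  # a full second set
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical pod: 2,500 kW of load served by 1,000 kW units (assumed values).
for scheme in ("N", "N+1", "2N"):
    print(scheme, units_installed(2500, 1000, scheme))
```

With these assumed numbers, N is 3 units, N+1 is 4, and 2N is 6, which is why 2N roughly doubles the electrical capital cost.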
Inside, the building is already broken into pieces on purpose. A hall is split into pods of roughly 1.6 to 2.5 MW, each with its own transformers, switchgear, and a diesel generator the size of a locomotive engine, around 3 MW and 4,000 horsepower, holding a day or two of fuel. Batteries in the UPS react in under ten milliseconds and carry the load for the sixty or so seconds it takes a generator to start. Even one building is built around dividing risk. The whole industry already talks this way: engineers will tell you an electrical fault has a smaller "blast radius" than a cooling failure. Blast radius is not my metaphor. It is shop language.
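To give a feel for what the UPS batteries have to do in that window, here is a rough sizing sketch. The 2,000 kW pod load is an assumed value inside the 1.6 to 2.5 MW range above, and the calculation ignores conversion losses and design margin.

```python
def ride_through_kwh(load_kw: float, seconds: float) -> float:
    """Energy the batteries must deliver to hold a load for a given time."""
    return load_kw * seconds / 3600.0  # kW * s -> kWh


pod_load_kw = 2000.0      # assumed pod load, within the range quoted above
generator_start_s = 60.0  # "sixty or so seconds" from the text

energy = ride_through_kwh(pod_load_kw, generator_start_s)
print(f"Battery energy to bridge the gap: {energy:.1f} kWh")
```

The energy is modest. The hard part is delivering a full pod's power within milliseconds, which is why the batteries sit permanently in the power path.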
Compute is heat
Here is the part people miss. Every watt you put into a GPU comes back out as about a watt of heat. A 100 MW hall is a 100 MW heater. Cooling is not a side system. It is half the machine.
Efficiency gets one number: PUE, total facility power divided by IT power. A PUE of 1.5 means you burn an extra half watt of overhead for every watt of compute. The rough industry average is around 1.6; the best hyperscalers run near 1.1. Water gets its own number, WUE, in liters per kWh, and it adds up fast. A mid-size site can drink hundreds of millions of liters a year if it leans on evaporative cooling.
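The two ratios are easy to turn into numbers. This is a minimal sketch: the 30 MW IT load and the WUE of 1.0 liters per kWh are hypothetical inputs chosen for illustration, and WUE is taken here as liters per kWh of IT energy.

```python
HOURS_PER_YEAR = 8760


def overhead_per_it_watt(pue: float) -> float:
    """Watts of non-IT overhead burned for every watt of compute."""
    return pue - 1.0


def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility draw implied by an IT load and a PUE."""
    return it_power_mw * pue


def annual_water_liters(it_power_mw: float, wue_l_per_kwh: float) -> float:
    """Yearly water use, assuming constant IT load and WUE per kWh of IT energy."""
    it_energy_kwh = it_power_mw * 1000.0 * HOURS_PER_YEAR
    return it_energy_kwh * wue_l_per_kwh


# Hypothetical mid-size site: 30 MW of IT load (assumed).
print(overhead_per_it_watt(1.5))       # overhead per watt at PUE 1.5
print(facility_power_mw(30.0, 1.5))    # total draw in MW
print(annual_water_liters(30.0, 1.0))  # liters per year at an assumed WUE of 1.0
```

At those assumed inputs the site uses about 263 million liters a year, consistent with the "hundreds of millions" figure above.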
Chillers are the biggest non-IT draw, so the cheapest cooling is the cooling you skip. "Free cooling" uses the outside air directly, and it works best where the air is dry, because dry air cools further through evaporation. That is one quiet reason a desert is attractive. As racks get denser, air alone stops working, and direct-to-chip liquid cooling becomes mandatory rather than optional.
Why everything wants to be in one place
Now the honest case for concentration, because it is real. Training a large model is synchronous. Tens of thousands of GPUs work in lockstep and wait on each other, so the links between them have to be fast and short. High-speed copper only reaches a couple of meters, and the in-rack NVLink fabric is far faster than the network between racks. The cheaper your tokens need to be, the tighter you pack the GPUs.
That is why rack power keeps climbing. Racks sat under 10 kW for years. Nvidia's GB200 NVL72 packs 72 GPUs into one liquid-cooled rack at 120 kW and up. The 2027 roadmap, built on a new 800 VDC power architecture, points at 600 kW to a megawatt per rack.
The power physics is just as blunt. Deliver 600 kW at the old 54 V and you need around 11,000 amps of current. Do it at 800 V and you need about 750. Since loss scales with current squared, that is roughly 200 times less heat lost in the same copper. Density is not stupidity. It is efficiency. That is exactly what makes it dangerous.
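The arithmetic behind those figures is short enough to check directly. The sketch assumes the same conductor resistance in both cases, so the resistance cancels out of the ratio.

```python
def current_amps(power_w: float, volts: float) -> float:
    """Current needed to deliver a given power at a given voltage (I = P / V)."""
    return power_w / volts


def resistive_loss_ratio(power_w: float, v_low: float, v_high: float) -> float:
    """How many times more I^2 * R heat the low-voltage case produces.

    Assumes the same resistance R in both cases, so R cancels.
    """
    i_low = current_amps(power_w, v_low)
    i_high = current_amps(power_w, v_high)
    return (i_low / i_high) ** 2


rack_w = 600_000.0  # 600 kW rack, from the text
print(f"{current_amps(rack_w, 54.0):,.0f} A at 54 V")
print(f"{current_amps(rack_w, 800.0):,.0f} A at 800 V")
print(f"loss ratio: {resistive_loss_ratio(rack_w, 54.0, 800.0):.0f}x")
```

The ratio comes out near 219, which is where the "roughly 200 times" figure comes from.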
The grid is a failure domain too
Concentration does not stop at the fence. A training run is synchronized, so its power draw is jagged. Meta's Llama 3 paper describes tens of thousands of GPUs swinging power at the same instant, "on the order of tens of megawatts," enough to stress the grid. Engineers have literally run fake workloads just to keep the swings from hurting the power system.
At gigawatt scale this stops being a site problem. Grid analysts model losing a few gigawatts of load at once and find a real path to a cascade. It is not hypothetical. On April 28, 2025, the Iberian grid lost about 2,200 MW in a span of seconds, and Spain and Portugal went fully dark in under half a minute. In July 2024, a single fault knocked roughly 1.5 GW of Virginia data centers off the grid at once. Put a gigawatt behind one interconnection and a local hiccup becomes a regional one.
The plan we didn't build
When we started planning gigawatt-scale AI capacity in Israel, the first instinct was the obvious one. Build the biggest possible bunker. One giant site, one campus, one national machine. It looks clean on a slide, it simplifies procurement, it makes the networking easy, and it makes the spreadsheet investors love.
Then we asked one security question. How much of the country's AI capacity are we putting inside one failure domain?
That question killed the plan. Not because a giant site cannot be built, but because in Israel you cannot pretend geography is theoretical. Drones, rockets, grid events, and war-risk insurance are operating conditions here, not consulting words.
The UAE made it concrete. On March 1, 2026, Iranian drones struck two Amazon AWS data centers in Dubai directly. Two of three availability zones in the ME-CENTRAL-1 region went down. 109 services failed. Banks, payments platforms, and ride-hailing apps went dark. The buildings had every standard protection: fences, cameras, redundant power, backup generators. None of it was rated for a drone strike. AWS told customers to migrate workloads to another region and warned that recovery would take weeks, given the physical damage involved. (Sources: BBC, Data Center Dynamics.)
Ukraine has taught the same lesson at far larger scale. The point is that thicker walls are the wrong answer. A smaller target per site is the right one.
The unit of security is the failure domain
A failure domain is a part of a system that can fail on its own without taking the rest down. Google Cloud uses exactly this language: a zone or region is a failure domain, and spreading across them improves availability. Google's own example: two things at 99.9% availability, placed in separate failure domains, drop the odds of both failing at once toward one in a million, if they are truly independent.
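The one-in-a-million figure is just the product of the individual failure probabilities. A minimal sketch, valid only under the independence assumption stated above:

```python
import math


def joint_failure_probability(availabilities: list[float]) -> float:
    """Probability that every component is down at once, assuming independence."""
    return math.prod(1.0 - a for a in availabilities)


# Two components at 99.9% availability in separate failure domains.
p_both_down = joint_failure_probability([0.999, 0.999])
print(f"{p_both_down:.2e}")
```

Each component is down 0.1% of the time, so both are down together about 0.0001% of the time. If the two share a dependency, the product no longer applies.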
This is not a metric I invented. It is the same idea six serious fields already use, just pointed at GPUs.
| Field | What they already call it |
|---|---|
| Cloud | Failure domains, zones, regions |
| Insurance | Probable maximum loss |
| Power grids | N-1 contingency planning |
| Defense | Mission assurance |
| Critical infrastructure | Consequence-based risk (CISA) |
| Data centers | Uptime Tier III and Tier IV |
So the security question is not "how big is the building?" It is "how much value, power, and national capability can be lost in one local event, and how fast can the rest keep running?"
The math is just division
Take a $40B AI platform. There are a few ways to build it.
One giant site: one event can reach the whole $40B. Split it into four independent sites: each holds about $10B. Split it into twelve 100 MW modules: each holds about $3.3B. That is the entire trick. Division.
| Architecture | Failure domains | Value per domain | Concentration |
|---|---|---|---|
| One giant site | 1 | $40B | 1.00 |
| Four sites | 4 | $10B | 0.25 |
| Twelve modules | 12 | $3.3B | 0.08 |
The concentration number is the same one regulators use for markets, the Herfindahl-Hirschman index: square each site's share of the total and add them up. One site scores 1.0. Four equal sites score 0.25, and the reciprocal, 4, is the number of equal sites the portfolio behaves like. Twelve equal modules score about 0.08, which behaves like twelve. The rule of thumb: splitting into n equal sites cuts your worst single-event exposure by 1 minus 1 over n. Four sites is a 75% smaller worst case. Twelve modules is about 92% smaller.
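The concentration column can be reproduced in a few lines. This sketch uses the $40B example from the table; the function names are mine, and the unequal split at the end is a hypothetical case added to show why equal shares matter.

```python
def concentration(site_values: list[float]) -> float:
    """Sum of squared shares of total value (a Herfindahl-style index)."""
    total = sum(site_values)
    return sum((v / total) ** 2 for v in site_values)


def effective_sites(site_values: list[float]) -> float:
    """Number of equal-sized sites the portfolio behaves like."""
    return 1.0 / concentration(site_values)


def worst_case_reduction(site_values: list[float]) -> float:
    """Fraction by which the largest single-site loss shrinks versus one big site."""
    return 1.0 - max(site_values) / sum(site_values)


total_b = 40.0  # $40B platform, from the text
for n in (1, 4, 12):
    sites = [total_b / n] * n
    print(n, round(concentration(sites), 2), f"{worst_case_reduction(sites):.0%}")

# Hypothetical unequal split: four sites, but one holds most of the value.
lopsided = [31.0, 3.0, 3.0, 3.0]
print(round(effective_sites(lopsided), 2))
```

The lopsided case has four buildings but behaves like about 1.6 sites, so the count of sites alone says little without the shares.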
To make value density concrete: Meta's Louisiana campus is roughly $10B across about 371,600 square meters, near $27,000 of investment per built square meter. Stargate is about $50M per megawatt at the program level. A boring GPU hall can be worth more per square meter than a famous tower. Not because the walls are special, but because every rack is full of capital and every megawatt is monetized.
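The value-density figures follow from the public headline numbers with one unit conversion. A sketch of the arithmetic, treating the announced totals as the only inputs:

```python
SQ_M_PER_SQ_FT = 0.09290304  # exact conversion factor


def value_per_sq_m(investment_usd: float, area_sq_ft: float) -> float:
    """Investment per built square meter."""
    return investment_usd / (area_sq_ft * SQ_M_PER_SQ_FT)


def value_per_mw(investment_usd: float, capacity_gw: float) -> float:
    """Investment per megawatt of capacity."""
    return investment_usd / (capacity_gw * 1000.0)


# Meta Louisiana: $10B over 4 million square feet (figures from the text).
print(f"${value_per_sq_m(10e9, 4e6):,.0f} per square meter")

# Stargate: $500B for 10 GW (figures from the text).
print(f"${value_per_mw(500e9, 10.0):,.0f} per MW")
```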
The honest part
I want to be careful here, because the easy version of this argument is wrong. Distribution does not always lower your total expected loss. If every site has the same independent chance of failure, the long-run average loss can look about the same whether you have one site or four.
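That caveat is easy to verify. In the sketch below, the 1% annual failure probability per site is a made-up illustrative value, and each site is assumed to lose its full value when it fails.

```python
def expected_loss(site_values: list[float], p_fail: float) -> float:
    """Long-run average loss when each site fails independently with p_fail."""
    return sum(p_fail * v for v in site_values)


def worst_single_event(site_values: list[float]) -> float:
    """Largest loss one local event can cause."""
    return max(site_values)


p = 0.01  # hypothetical per-site annual failure probability
one_site = [40.0]
four_sites = [10.0] * 4  # $B, from the table above

for name, sites in (("one site", one_site), ("four sites", four_sites)):
    print(name, round(expected_loss(sites, p), 3), worst_single_event(sites))
```

Both layouts average $0.4B of loss per year under these assumptions. Only the worst single event changes, from $40B to $10B.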
What distribution reliably lowers is different and still decisive: the worst single event, the tail risk, the recovery time, and how easy you are to coerce. The probability that all your sites fail at once is small only if they are independent. If they are independent, with n sites that each fail with probability p, that probability is p to the power n. The moment they share a common weakness, it jumps to that common-mode risk no matter how many sites you have. The entire job is keeping the shared risk small.
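One simple way to see the jump is to add a single shared-failure term to the independent case. This model is my own simplification, not a standard the text cites: with probability `p_common` one event takes every site down, otherwise sites fail independently. All numbers are hypothetical.

```python
def p_all_fail(p_site: float, n_sites: int, p_common: float = 0.0) -> float:
    """Probability that every site is down at once.

    Assumed model: a common-mode event (probability p_common) takes out all
    sites; otherwise each site fails independently with probability p_site.
    """
    return p_common + (1.0 - p_common) * p_site ** n_sites


p = 0.01  # hypothetical per-site failure probability
for n in (1, 4, 12):
    independent = p_all_fail(p, n)
    shared = p_all_fail(p, n, p_common=0.001)  # hypothetical shared risk
    print(n, f"{independent:.1e}", f"{shared:.1e}")
```

With four or more sites the independent term becomes negligible and the result sits at the shared-risk floor of 0.001. Adding sites beyond that point buys nothing until the common dependency is removed.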
Distributed on paper is not distributed
This is where people fool themselves. They say "distributed" because there are multiple buildings. But multiple buildings can still be one failure domain.
Four buildings on the same substation are not four failure domains. Four campuses on the same fiber route are not four failure domains. Four sites running on one brittle control plane can still fail together. Real separation has to cover power, fiber, cooling, water, the security perimeter, operations, control systems, permits, and emergency response. The test is one question: what would have to fail for all of them to go at once? The grid blackouts above are exactly that, common-mode failure in the wild.
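That test can be made mechanical: list what each site depends on, then merge any sites that share a dependency. The sketch below does this with a small union-find. The site names and dependency labels are invented for illustration, and a real audit would cover far more dependency types than two.

```python
from collections import defaultdict


def failure_domains(site_deps: dict[str, set[str]]) -> list[set[str]]:
    """Group sites into failure domains: sites sharing any dependency merge."""
    parent = {site: site for site in site_deps}

    def find(site: str) -> str:
        # Walk to the group's root, shortening the path as we go.
        while parent[site] != site:
            parent[site] = parent[parent[site]]
            site = parent[site]
        return site

    first_user: dict[str, str] = {}
    for site, deps in site_deps.items():
        for dep in deps:
            if dep in first_user:
                parent[find(site)] = find(first_user[dep])  # merge the groups
            else:
                first_user[dep] = site

    groups: dict[str, set[str]] = defaultdict(set)
    for site in site_deps:
        groups[find(site)].add(site)
    return list(groups.values())


# Hypothetical portfolio: four sites, two dependency types.
sites = {
    "A": {"substation-1", "fiber-south"},
    "B": {"substation-1", "fiber-north"},
    "C": {"substation-2", "fiber-north"},
    "D": {"substation-3", "fiber-east"},
}
domains = failure_domains(sites)
print(len(domains), sorted(sorted(d) for d in domains))
```

Here A and B share a substation and B and C share a fiber route, so four sites collapse into two failure domains: A, B, and C together, and D alone.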
It costs more, and that is the point
Distribution is more expensive. Full stop. You duplicate substations, cooling plants, security, and staff, you lose economies of scale, and you pay a networking penalty when sites sit far apart. In a calm place you could call that waste.
In a high-risk place it is the floor, not a luxury. And the risk is getting worse, not better. Long-range ballistic missiles and cheap drones put fixed, high-value sites within reach from much further away, and AI compute itself is becoming a weapon, the thing that runs the targeting, the intelligence, the decision loop in a real conflict. The more strategic the compute, the more a single concentrated site reads as a single point of national failure. So I model single-event disruption scenarios the way insurers, cloud architects, and grid planners already do: failure domains, probable maximum loss, blast radius, continuity of operations. Not attack recipes. Box sizes. The smaller the box, the easier it is to insure, defend, finance, recover, and explain to a government.
The same lesson is now in orbit
The strange part is that the industry is about to make the identical mistake one altitude higher. I wrote about it in "They're Made Out of Compute": the plan to fly data centers into space, packed into tight clusters, fragile, exposed, and so far undefended. Same instinct, same blind spot. Whether the failure domain is a desert campus or an orbital cluster, the question never changes. How much can one event take?
The Israel lesson
Israel is a small country with real advantages for this: sun, talent, urgency, and a genuine need for its own compute. Small countries do not get to treat infrastructure risk as theoretical. If AI compute becomes critical to defense, the economy, science, and government, then data centers are national infrastructure, and national infrastructure should not hang on one building.
Gigawatt-scale AI is the goal. The single-site concentration model is what we moved away from. A network of smaller factories, modular, hardened, connected, and spread across real failure domains, is the shape that survives.
The most valuable square meter in the AI age may not be in Manhattan. It may be a GPU hall wired to a few hundred megawatts. But the most secure AI platform will not be the one with the biggest building. It will be the one with the smartest spread of risk. Dense is efficient. Distributed is survivable. When intelligence becomes infrastructure, survivable wins.
In April 2024 I stopped my bike near home and filmed what looked like a scene from a film: interceptors and incoming fire lighting up the sky over Israel. I have thought about it ever since. It is hard to look at a site plan the same way after that. A map stops being real estate and becomes a list of failure domains.
Sources
- AWS ME-CENTRAL-1 UAE drone strikes, March 1-3, 2026: two facilities directly struck, two of three availability zones offline, 109+ services degraded, recovery expected to take weeks. BBC, Data Center Dynamics.
- OpenAI Stargate, announced at $500B over four years and later framed as a $500B, 10GW program. OpenAI.
- Meta's $10B, 4 million square foot AI data center in Richland Parish, Louisiana. Opportunity Louisiana.
- TSMC's planned US investment of up to $165B. TSMC.
- Nvidia GB200 NVL72: 72 GPUs in one liquid-cooled rack, 120 kW and up, single NVLink domain. Nvidia.
- Nvidia 800 VDC architecture for 1 MW racks and beyond, starting 2027. Nvidia.
- Synchronized training power swings on the order of tens of megawatts. Meta, The Llama 3 Herd of Models.
- How a data center's power and cooling actually work (power chain, PUE, WUE, pods, generators, redundancy). SemiAnalysis, Datacenter Anatomy Part 1 and Part 2.
- Grid cascade risk from large synchronized loads, and the July 2024 Virginia event where ~1.5 GW of data centers disconnected at once. SemiAnalysis.
- April 28, 2025 Iberian blackout: about 2,200 MW lost in seconds, full collapse in under half a minute. ENTSO-E final report.
- Failure domains, zones, and regions as the unit of reliability. Google Cloud.
- Uptime Institute Tier classification (Tier III concurrently maintainable, Tier IV fault tolerant). Uptime Institute.
- Critical infrastructure framed by the consequence of disruption. CISA.
Disclosure: Magn builds AI compute infrastructure on the ground, in Israel. Our position is the one in this piece: at gigawatt scale, AI data centers are national critical infrastructure, and they should be distributed across real failure domains, not concentrated in one place.
Compute helped me draft and proof this story.