A few quarters ago, a protocol team we had been speaking with about validator architecture decided to bring operations in-house. We had walked them through how we run signing clusters with threshold signing. They chose to run a single signer. Within months they had a double-sign event, the slashing was enforced on-chain, and the validator was retired. We are not naming them. We are also not telling this story for sport. We are telling it because the architecture they did not ship is the entire point of this article.
Seven years ago we wrote a piece called "On Validator Setup — Part 2" describing how 01node ran validators in 2019: a primary node, a backup node, sentries in front, YubiHSM behind, VPN-only access. That architecture worked. It still works. But on its own it is no longer enough, and the reason is the failure mode we just described — a single signer that, when its state gets out of sync, can sign two blocks at the same height. In Cosmos and Ethereum that is the worst thing a validator can do.
A validator is no longer a single node
In 2019 the assumption baked into most validator documentation was that a validator is one machine. One IP. One signing key. The operator put a sentry in front, kept the key on a hardware module, and that was the architecture.
In 2026 that assumption is wrong. A validator at 01node is a cluster of machines that cooperate to produce a single signature. Several of them have to agree. None of them on its own can sign anything. The signing key, in the strict sense, does not exist on any one of them — only shards of it do.
This is not a security upgrade in the marketing sense. It is a structural change. A 2019 validator could double-sign because one process held one full key. A 2026 validator cannot, because no process holds the full key.
The 2-of-3 cluster, in plain math
The base configuration of a signing cluster is 2-of-3 threshold. Three nodes. To produce a valid signature, two of them have to participate. If one of them is unreachable, the remaining two still sign and the validator does not miss a block. If two are unreachable, the cluster correctly fails to sign rather than risk signing from a partial, possibly inconsistent view of its own state.
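To make the quorum arithmetic concrete, here is a minimal sketch of the decision a 2-of-3 cluster faces before it attempts a signature. This is our illustration rather than Horcrux or Web3Signer code; the threshold constant and the reachability flags are assumptions for the example.

```go
package main

import "fmt"

const threshold = 2 // 2-of-3: two cosigners must participate

// canSign reports whether enough cosigners are reachable to meet the threshold.
// With one node down the cluster still signs; with two down it fails closed.
func canSign(reachable []bool) bool {
	up := 0
	for _, ok := range reachable {
		if ok {
			up++
		}
	}
	return up >= threshold
}

func main() {
	fmt.Println(canSign([]bool{true, true, true}))   // true  — all healthy
	fmt.Println(canSign([]bool{true, false, true}))  // true  — one node in maintenance
	fmt.Println(canSign([]bool{true, false, false})) // false — refuse rather than guess
}
```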
Three properties fall out of this directly:
- Fault tolerance of one node. Any single node can be down — for a kernel upgrade, a hardware swap, a network outage at one datacenter — and the cluster keeps signing. We do rolling upgrades by taking nodes out one at a time. Zero blocks missed during a chain-binary upgrade is the normal expected outcome, not a heroic one.
- Slashing-resistance to single-node compromise. If an attacker pops one of the three nodes — through a kernel zero-day, a side-channel attack on the host, anything — they hold one shard of the signing key. That shard, on its own, signs nothing. They would need to compromise a second node, in a different physical and network environment, simultaneously.
- No single intentional double-sign. Even an insider with full credentials on one node cannot unilaterally produce a duplicate signature. The protocol forces them to involve a second node, which can refuse.
The slashing-protection step before the math runs
Two-of-three threshold signing prevents a class of attacks. It does not, on its own, prevent every double-sign. The cluster can still in principle be asked to sign two blocks at the same height — for example after a hard reboot where state got out of sync, or during a network partition where the cluster is running in two halves.
For that, every signing cluster runs a check before it signs anything. It looks at the highest block it has already signed and refuses to sign any block at a lower or equal height with a different hash. That check is a single boolean comparison; it runs in microseconds; it sits in the critical path before the cryptographic part of the signature is even computed.
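In sketch form, the gate looks roughly like the following. This is a simplified illustration, not Horcrux or Web3Signer code: Horcrux also records round and step, and Web3Signer keeps the equivalent history in its slashing-protection database, but the shape of the refusal is the same.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// signState holds the high-water mark of what this signer has already signed.
type signState struct {
	mu         sync.Mutex
	lastHeight int64
	lastHash   string
}

var errDoubleSign = errors.New("refusing to sign: conflicts with previously signed block")

// check runs before any cryptography. It rejects any request at or below the
// last signed height that carries a different hash, and advances the mark otherwise.
func (s *signState) check(height int64, hash string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	switch {
	case height < s.lastHeight:
		return errDoubleSign // already signed past this height
	case height == s.lastHeight && hash != s.lastHash:
		return errDoubleSign // same height, different block: the slashable case
	}
	s.lastHeight, s.lastHash = height, hash
	return nil
}

func main() {
	st := &signState{}
	fmt.Println(st.check(100, "abc")) // <nil> — first signature at height 100
	fmt.Println(st.check(100, "abc")) // <nil> — re-signing the same block is harmless
	fmt.Println(st.check(100, "def")) // error — a different block at the same height
	fmt.Println(st.check(99, "xyz"))  // error — a lower height after we have moved on
}
```

The state also has to survive restarts, which is exactly the reboot scenario above: an in-memory mark like this one shows the shape of the check, while the real signers persist it, which is why the gate still holds after a crash or a partition heal.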
In Cosmos-ecosystem deployments we run this through Horcrux, which has the comparison built into its state machine. On Ethereum we run Web3Signer, which has the equivalent check via its slashing-protection database. Either way, the cluster has to clear the local "have I already signed at this height" gate before it does anything else. A misconfigured cluster cannot get past it. A compromised cluster cannot get past it. A confused cluster after a partition heal cannot get past it.
This is the part that the team in our opening anecdote did not have. They had a single signer, and they had no last-signed-block enforcement layer in front of it. When their host briefly desynced and re-signed, there was nothing to stop it.
Key sharding is the security property that matters
The cluster does not just split work; it splits the key. In a 2-of-3 deployment, the validator signing key is generated as three cryptographic shards. Each node holds one shard. No node ever has, or can reconstruct, the full key on its own.
The signing operation never reconstructs the key either. Two nodes each compute a partial signature using their shard, and the partial signatures are combined into a valid full signature. The full key never materialises in memory, on disk, or in transit.
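To show the algebra that makes this possible, here is a toy 2-of-3 example: a key is split into three shards with a degree-one polynomial, two nodes each compute a partial result from their own shard, and the partials are combined with Lagrange coefficients into exactly the value the full key would have produced. Everything here is an assumption for illustration: the small prime field, the linear "signing" function, and the names are ours, and this is not Ed25519 and not the protocol Horcrux or Obol actually run. It only demonstrates why the full key never has to exist in one place.

```go
package main

import "fmt"

// A toy prime field, small enough for int64 arithmetic. Real schemes work in
// an elliptic-curve group; this is purely illustrative.
const q = 7919

func mod(a int64) int64 { return ((a % q) + q) % q }

// modpow computes b^e mod q (used for modular inverses via Fermat, since q is prime).
func modpow(b, e int64) int64 {
	r := int64(1)
	b = mod(b)
	for ; e > 0; e >>= 1 {
		if e&1 == 1 {
			r = mod(r * b)
		}
		b = mod(b * b)
	}
	return r
}

func inv(a int64) int64 { return modpow(a, q-2) }

// shard evaluates the degree-one polynomial f(z) = key + a*z, so any two of the
// three shards determine f, but one shard alone says nothing about key = f(0).
func shard(key, a, z int64) int64 { return mod(key + a*z) }

// lagrangeAt0 returns the coefficient for participant i within the set ids,
// such that sum(lambda_i * f(i)) = f(0).
func lagrangeAt0(i int64, ids []int64) int64 {
	num, den := int64(1), int64(1)
	for _, j := range ids {
		if j == i {
			continue
		}
		num = mod(num * mod(-j))
		den = mod(den * mod(i-j))
	}
	return mod(num * inv(den))
}

func main() {
	key := int64(4242) // the full signing key -- never handed to any node
	a := int64(1337)   // random polynomial coefficient used only at key-generation time
	e := int64(99)     // toy "challenge" derived from the block being signed

	// Three shards, one per cosigner. The full key is discarded after this point.
	s1, s2 := shard(key, a, 1), shard(key, a, 2)
	_ = shard(key, a, 3) // the third shard exists, but this round only nodes 1 and 2 take part

	// Each participating node computes a partial result from its own shard only.
	p1 := mod(s1 * e)
	p2 := mod(s2 * e)

	// Combining the partials with Lagrange weights gives exactly what the full
	// key would have produced -- without the key ever being reconstructed.
	ids := []int64{1, 2}
	combined := mod(lagrangeAt0(1, ids)*p1 + lagrangeAt0(2, ids)*p2)

	fmt.Println(combined == mod(key*e)) // true
}
```

Production threshold schemes do the analogous linear combination inside an elliptic-curve group and take considerable care with nonces; what survives the simplification is the property that each partial depends only on one shard.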
That property is what changes the threat model. In 2019, "key compromise" meant someone got the YubiHSM PIN and dumped the key. In 2026, "key compromise" requires simultaneous compromise of two of three independently administered hosts in different physical and network environments, plus the network paths between them, plus the timing to combine partial signatures inside the protocol window. The bar is not slightly higher. It is qualitatively different.
What we run, by ecosystem
For Cosmos-ecosystem chains — Cosmos Hub, Celestia, Osmosis, Persistence, Babylon, and the rest of the chains we validate — the threshold signer is Horcrux. We have Horcrux running in production across our Cosmos-ecosystem footprint, behind YubiHSM key custody.
For Ethereum, the equivalent role is split between two technologies. Web3Signer runs as the slashing-protected remote signer, sitting between our validator clients and the HSM-custodied keys. On top of that, for the validator clusters we operate inside Lido's Simple DVT Module, we run Distributed Validator Technology via Obol and SSV. We participate in five Lido Simple DVT clusters — four with Obol, one with SSV — and each of those clusters is itself a threshold-signing arrangement across multiple operators, of which we are one.
The honest distinction: Horcrux and Web3Signer give us threshold signing inside our own infrastructure. Obol and SSV give us threshold signing across multiple operators, where we are one of several. Both reduce single-point-of-failure risk; the second one also reduces operator-collusion risk by including operators we do not control.
The trade-offs
There is no free lunch in this architecture, and we want to be specific about the cost.
- Complexity is real. A 2-of-3 cluster has more moving parts than a single signer. State has to be coordinated across nodes. Network paths have to be reliable. Monitoring has to cover not just node health but cluster consensus. The on-call engineer has to understand multi-node failure modes.
- Hardware cost is multiplied. Three signers across three environments cost more than one signer in one environment. For us, that cost is the price of admission to the architecture; for some smaller operators, it is the reason they still run single signers and accept the slashing risk.
- First-block-of-cluster risk. Bringing a fresh cluster online for a new chain has its own checklist; you cannot skip the part where you verify the slashing-protection database is initialised on every node before you allow the validator to attest (a sketch of that pre-flight gate follows this list).
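As an illustration of that last checklist item, the pre-flight gate we mean has roughly this shape. The probe function and node names below are hypothetical placeholders, since the real check depends on the signer in use (Horcrux's signing state versus Web3Signer's slashing-protection database); the rule is simply that every node must answer before the validator is allowed to attest.

```go
package main

import (
	"fmt"
	"os"
)

// probeSlashingProtection is a hypothetical stand-in for whatever check fits the
// signer in use -- substitute the real probe for your stack.
func probeSlashingProtection(node string) error {
	// signer-specific check goes here
	return nil
}

func main() {
	nodes := []string{"signer-1", "signer-2", "signer-3"} // illustrative names

	for _, n := range nodes {
		if err := probeSlashingProtection(n); err != nil {
			fmt.Printf("%s: slashing protection not initialised: %v\n", n, err)
			os.Exit(1) // fail closed: do not let the validator attest
		}
	}
	fmt.Println("all signers report initialised slashing protection; safe to enable attestation")
}
```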
Those are the actual disadvantages. None of them is "this architecture does not work". All of them are "this architecture takes engineering to operate". The team in our opening anecdote was not wrong to want to operate their own validator. They were wrong about the floor of complexity required to do it without slashing.
Why we are publishing this now
This is the third article in a sequence. The first, "On Validator Setup — Part 2", was a 2019 description of how we ran a single-signer Cosmos validator with HSM-backed keys and dual-DC redundancy. The second, "On Validator Setup: Seven Years Later", is the full architectural review we published earlier this month — what we kept, what we added, what we would do differently. This piece is the focused middle: the specific reason a 2026 validator is a cluster and not a node.
For counterparties evaluating us as an operator, the cluster architecture is the part that translates directly into "zero slashing in six years across forty-plus mainnets". The trust-pack version is on /security with explicit active / in-progress / planned status per control. The case-study version is on /case-studies and /operator-credentials with on-chain verifiable receipts. This article is the why behind both.
The team we declined to name, by the way, has since rebuilt their operations on a clustered architecture. We hope they ship blocks for a long time.