Engineering|12 MIN READ|MAR 23, 2025

1.5 Milliseconds: How Our Signing Speed Triggered the Horcrux Race Condition

On March 6, 2025, our Osmosis validator double-signed at block 30968345. Two signatures, 1.5 milliseconds apart, exposed a Horcrux race condition that had been latent in production for 20 months. We were one of hundreds of operators running the affected version — and the only one fast enough to hit the bug. This is what happened, why it happened to us specifically, and how we reimbursed every affected delegator in full.

01node Team

Infrastructure Engineers

At 21:35:48.759 UTC on March 6, 2025, our Osmosis validator produced a signature on block 30968345. At 21:35:48.760 UTC — 1.5 milliseconds later — it produced a different signature on the same height. That window is shorter than the round-trip time to your nearest DNS resolver. It is shorter than the keyboard event loop on a modern operating system. It is also the window in which a latent race condition in Horcrux v3.3.1 reordered the signing state and produced two conflicting signatures for the same block.

The result was a double-sign, a 5% slash amounting to roughly 75,306 OSMO, a tombstoned validator, and an obligation we honoured: every one of the 4,650 affected delegators was reimbursed in full out of our operational reserve within days. The total reimbursement, recorded on-chain, was 75,305,896,736 uOSMO (75,305.896736 OSMO): the full slash, returned.

This article is the technical and operational record of that incident, written in the same voice as the rest of our engineering content. We are publishing it because the alternative — leaving the on-chain history without explanation — is not an option a serious operator takes. Below: what happened, why it happened to us specifically, what Strangelove and we did about it, and what changed in how we talk about our slashing posture afterward.

The race condition, in 200 words

Horcrux is a threshold-signing daemon for Cosmos-ecosystem validators. We have used it in production since 2022. In normal operation, when a validator client asks Horcrux to sign a block, Horcrux first checks its persistent signature state to ensure it has not already signed at that Height-Round-Step (HRS). If the state shows a prior signature, the request is rejected. If not, Horcrux computes the partial signature, updates the state, and returns the result.

In versions v3.1.0 through v3.3.1 — every Horcrux release since July 2023 — that “check then write” sequence was implemented with two separate locks rather than a single atomic operation. From the official Strangelove security advisory: “The HRSKey() method used a read lock to check the current signature state. The cacheAndMarshal() method used a separate write lock to update the state. Because these unlocked in the middle to perform checks rather than occurring under a single lock, they were not atomic.”

In a sequential workload — one signing request at a time — that split-lock pattern is harmless. The first signature completes its check, writes state, and the next request sees the updated state. In a workload with two near-simultaneous requests, both can pass the “have I signed?” check before either has finished writing the “yes I have” state update. Both proceed to sign. The race condition manifests.
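The check-then-write pattern described above can be reproduced in a few lines. The following is a minimal Python sketch, not Horcrux's Go code: the class names, the `gap` hook, and the round and step values are hypothetical, introduced only for illustration. A barrier forces both requests past the check before either writes, which makes the race deterministic here rather than a one-in-a-billion event.

```python
import threading
import time


class SplitLockSigner:
    """Racy sketch: the check and the write are two separate lock acquisitions."""

    def __init__(self, gap=lambda: None):
        self._lock = threading.Lock()
        self._signed = set()     # HRS values that already have a signature
        self.signatures = []     # every (hrs, payload) pair that was signed
        self._gap = gap          # hypothetical hook run between check and write

    def _already_signed(self, hrs):
        with self._lock:         # lock taken, then released after the check
            return hrs in self._signed

    def _record(self, hrs, payload):
        with self._lock:         # lock taken again, separately, for the write
            self._signed.add(hrs)
            self.signatures.append((hrs, payload))

    def sign(self, hrs, payload):
        if self._already_signed(hrs):
            return False
        self._gap()              # unlocked window: a second request can slip in
        self._record(hrs, payload)
        return True


class SingleLockSigner(SplitLockSigner):
    """Fixed sketch: check and write happen under one lock acquisition."""

    def sign(self, hrs, payload):
        with self._lock:
            if hrs in self._signed:
                return False
            self._gap()          # still inside the lock, so nothing can interleave
            self._signed.add(hrs)
            self.signatures.append((hrs, payload))
            return True


def sign_concurrently(signer, hrs, payloads):
    """Issue one sign request per payload, all for the same HRS, in parallel."""
    threads = [threading.Thread(target=signer.sign, args=(hrs, p)) for p in payloads]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return signer.signatures


# Height is from the article; round and step are illustrative placeholders.
HRS = (30968345, 0, 2)

# The barrier holds each request after its check until both have passed it.
barrier = threading.Barrier(2, timeout=5)
racy = sign_concurrently(SplitLockSigner(gap=barrier.wait), HRS, ["block-A", "block-B"])
safe = sign_concurrently(
    SingleLockSigner(gap=lambda: time.sleep(0.01)), HRS, ["block-A", "block-B"]
)

print(len(racy), len(safe))  # prints "2 1"
```

The split-lock signer returns two conflicting signatures for one HRS; the single-lock signer returns one and rejects the second request, even with a deliberate delay inside the critical section.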

Why us, and why nobody else

Strangelove’s own advisory contains the framing that we want to be clear about, verbatim:

> “affecting one validator out of hundreds that have been using the affected software versions to validate over the past few years”

> “probability on typical hardware in the range of 1 in 1 billion per signed vote”

Two pieces of information sit inside those sentences. The first: the bug had been in production for 20 months across hundreds of validators and nobody, including Strangelove’s own internal monitoring, had observed it trigger. The second: the rate at which the bug could trigger depended on timing, specifically on whether a second sign request for the same HRS could arrive inside the gap between the first request’s read-check and its write-update. On infrastructure where such requests arrive milliseconds apart (say, a software signer running on a shared cloud VM with normal OS scheduling jitter), the first request has almost always written its state before the second is checked, and the race condition was practically unreachable. On infrastructure where they arrive less than a millisecond apart, the race condition became practically reachable.
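To put the advisory's per-vote figure in context, the chance of seeing at least one trigger over n signed votes is 1 - (1 - p)^n. The sketch below is a rough illustration only: the vote counts are hypothetical, not measured from any validator, and the function name is ours. The advisory's 1-in-a-billion figure is quoted for typical hardware, so on low-latency infrastructure the per-vote probability itself would be an input to revisit.

```python
import math


def prob_at_least_one(p_per_vote: float, votes: int) -> float:
    """P(at least one trigger) = 1 - (1 - p)^n, computed stably for tiny p."""
    return -math.expm1(votes * math.log1p(-p_per_vote))


P_ADVISORY = 1e-9  # "1 in 1 billion per signed vote", quoted from the advisory

# Hypothetical vote counts, chosen only to show the shape of the curve.
for votes in (10**6, 10**8, 10**9, 10**10):
    print(f"{votes:>14,} votes -> {prob_at_least_one(P_ADVISORY, votes):.4f}")
```

At one billion votes the probability is about 63% (1 - 1/e), which is why a per-vote rate that sounds negligible can still surface somewhere across a large fleet over a long enough period.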

Our signing infrastructure runs on dedicated AMD EPYC bare metal in two Tier III datacenters, behind YubiHSM 2 hardware key custody, on AS41536 with direct peering to upstream carriers. The network and compute paths between our validator client and our Horcrux signing nodes are measured in tens of microseconds. When Osmosis block 30968345 propagated, our validator client issued two sign requests to Horcrux within the timing envelope of a single TCP retransmission window. The split-lock pattern raced. We hit the 1-in-a-billion event.

We do not say this to deflect responsibility. We were running the software. The slashing happened on our validator. Delegators delegated to us, not to Horcrux. We are presenting the timing analysis because, after the incident, it shaped how we think about validator infrastructure latency as a security variable rather than only a performance variable.

The response timeline

The chain-level events:

- March 6, 2025, 21:35:48 UTC. Double-sign at block 30968345. Validator immediately tombstoned by Osmosis. 5% slash of validator and delegator stake applied automatically (~75,306 OSMO).
- March 6, 2025, 23:25 UTC. We confirmed the root cause was non-trivial (not a configuration error, not a key conflict, not an infrastructure failure) and reported the incident upstream to Strangelove.
- March 7, 2025, 01:03 UTC. Strangelove identified the race condition.
- March 7, 2025, later the same day. Horcrux v3.3.2 released with the fix: a single mutex covering both read and write of signature state. From the advisory: “The fix implements a single mutex lock that covers both the reading of the current signature state and the subsequent writing of any updates”, replacing the split locks with a single atomic operation and ensuring “only one signature request for a given HRS can proceed at a time.”

On our side, the patched Horcrux version was deployed across our Cosmos-ecosystem signing infrastructure within hours of release. We launched a new Osmosis validator (the old one being permanently tombstoned and unrecoverable) and published a delegator-facing redelegation guide that walked through moving stake from the tombstoned validator to the new one using Keplr.

The reimbursement

Strangelove publicly stated in the advisory that they would “be working with 01node to reimburse those impacted by the tombstone event slash.” Without waiting for that arrangement to formalise, we reimbursed the affected delegators ourselves from our operational reserve. The numbers, recorded on-chain on Osmosis:

- 4,650 delegator addresses received a refund.
- 75,305,896,736 uOSMO total reimbursed (= 75,305.9 OSMO; the full amount slashed).
- The refunds were issued as direct transfers from our reserve address to each affected delegator’s wallet: no claim process, no application, no missed addresses.
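For anyone reconciling these figures against exported transfer data, the arithmetic is simple. The sketch below assumes the refunds are available as a mapping from recipient address to amount in uOSMO; that data shape, the function names, and the example addresses are hypothetical. Integer uOSMO amounts and `Decimal` keep the totals exact.

```python
from decimal import Decimal

UOSMO_PER_OSMO = 10**6  # 1 OSMO = 1,000,000 uOSMO


def uosmo_to_osmo(amount_uosmo: int) -> Decimal:
    """Convert an integer uOSMO amount to OSMO without float rounding."""
    return Decimal(amount_uosmo) / UOSMO_PER_OSMO


def reconcile(refunds: dict[str, int], expected_total_uosmo: int,
              expected_recipients: int) -> bool:
    """True if the refund set matches both the stated total and recipient count."""
    return (
        len(refunds) == expected_recipients
        and sum(refunds.values()) == expected_total_uosmo
    )


# Total from the article.
print(uosmo_to_osmo(75_305_896_736))  # prints 75305.896736

# Tiny made-up example; real input would hold 4,650 entries.
example = {"osmo1exampleaaa": 600, "osmo1examplebbb": 400}
print(reconcile(example, expected_total_uosmo=1_000, expected_recipients=2))  # prints True
```

Run against the full transfer set, the check would be `reconcile(refunds, 75_305_896_736, 4_650)`.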

For an institutional delegator reading this years later: the question that matters for due diligence is not “did this operator slash?” — the on-chain record makes that visible whether or not the operator discloses it. The question that matters is: when slashing happens, what is the operator’s policy? Our policy, demonstrated by this event, is that delegators do not absorb operational losses caused by software running on our infrastructure. Slashing penalties from vendor-attributed bugs are paid by us, not by delegators. We will hold that policy regardless of whether the vendor reimburses us first, after, or not at all.

What changed in our infrastructure

The most consequential change was upstream: Horcrux v3.3.2 patched the underlying race condition for everyone, not just for us. The single-mutex fix eliminates the timing window that allowed our hardware to trigger the bug. Every Horcrux user is protected from this specific issue as soon as they upgrade.

On our side, the operational change is narrow: we now treat “signing-loop latency” as a first-class security telemetry signal alongside the standard validator metrics. Sub-millisecond signing intervals between adjacent slots are not a problem; they are normal for our infrastructure. But monitoring the distribution of those intervals is now part of how we observe our own signing stack — not because we expect another race-condition bug, but because the next class of subtle concurrency issue is more likely to be visible in those timing distributions than in any other observable.
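As an illustration of what watching that distribution can look like, the sketch below summarises the gaps between consecutive signing timestamps. It is not a description of our production tooling: the function name, the nanosecond timestamp input, and the sample values are assumptions made for the example.

```python
import statistics


def interval_summary(timestamps_ns):
    """Summarise gaps between consecutive signing timestamps, in microseconds."""
    ts = sorted(timestamps_ns)
    gaps_us = [(b - a) / 1_000 for a, b in zip(ts, ts[1:])]
    if len(gaps_us) < 2:
        raise ValueError("need at least three timestamps to summarise intervals")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(gaps_us, n=100, method="inclusive")
    return {
        "count": len(gaps_us),
        "min_us": min(gaps_us),
        "p50_us": statistics.median(gaps_us),
        "p99_us": cuts[98],
        "max_us": max(gaps_us),
    }


# Hypothetical sample: four sign events, two of them 1.5 microseconds apart.
sample_ns = [0, 1_000_000, 1_001_500, 2_000_000]
summary = interval_summary(sample_ns)
print(summary)
```

The useful signal is a shift in the shape of the distribution over time (for example, the minimum or lower percentiles moving), rather than any single small interval.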

We did not change our threshold-signing architecture. The Horcrux 2-of-3 threshold model remained sound; the bug was in the local-state code path that every threshold signer must implement, not in the cross-operator coordination. We did not change our HSM-backed key custody. The YubiHSM boundary held; no key material was ever at risk. We did not change our chain selection or our delegator commitments. We onboarded the new Osmosis validator and continued operations.

What we changed about how we talk about slashing

Before March 2025, our public-facing materials carried “zero slashing events” as a headline credential. After March 2025, that statement became inaccurate. We are updating the materials accordingly.

The replacement framing is “zero delegator loss” — outcome-focused rather than input-focused. The framing is supported by the data: across six years of operations and 40+ proof-of-stake mainnets, total delegator losses caused by 01node operations are zero. One slashing event occurred; full reimbursement was issued. The net delegator outcome is zero.

This is a deliberate framing choice. Saying “zero slashing” after the Osmosis incident would be dishonest. Saying nothing about historical incidents in a context where they are on-chain verifiable would be performatively dishonest. Saying “zero delegator loss because we reimburse” is what is actually true and what we want institutional counterparties evaluating us to know.

We are also using this incident as the canonical example in a separate compliance-track article we have been writing about institutional incident response. The pattern that worked for us is the pattern that works in regulated finance: rapid root-cause analysis with the vendor, full disclosure of the technical chain of events, reimbursement of impacted parties before settling vendor reimbursement, and public documentation of what changed afterward. Operators that get this pattern right earn institutional trust; operators that hide or minimise pay later, when the discrepancy between public claims and on-chain history surfaces during due diligence.

How to verify everything in this article

Every claim above maps to a public source:

- The double-sign event, the tombstoning, and the slash: visible on the Osmosis chain at block 30968345 and on Mintscan’s Osmosis validator history.
- The technical root cause and the fix: Strangelove’s security advisory GHSA-6wxf-7784-62fp on GitHub.
- The reimbursement: on-chain transfers from our reserve address on Osmosis, totalling 75,305,896,736 uOSMO across 4,650 destination addresses.
- The new validator: address listed on the /networks/osmosis page; redelegation is supported in Keplr via the standard Cosmos delegation flow.

For institutional counterparties running vendor risk on us, the relevant disclosed-incident summary, framed for inclusion in vendor-risk responses, is on the security page at /security#disclosed-incidents. The unredacted detail, including reimbursement transaction hashes and our internal post-mortem, is delivered under NDA on request to [email protected].

For delegators reading this who were affected by the incident: thank you for your patience while we resolved the chain-level mechanics and processed reimbursement. The new Osmosis validator address is available on our delegation page. If you have not yet re-delegated, you may do so at your convenience.

Staking | Infrastructure | Validator | Engineering | 01node
