Conversation


@alanshaw alanshaw commented Oct 29, 2025

📖 Preview

The Storacha Network is not set up well to deal with failure cases. We have implemented the happy path, but there are various things that can go wrong that we currently have no story for. This RFC attempts to enumerate a bunch of them that currently live in my head, with the hope that we can gain consensus on what should happen and allow further RFC(s) to be opened and/or specs created/amended to deal with them.

@alanshaw alanshaw requested a review from a team October 29, 2025 10:36
Comment on lines +21 to +22
* Node leaving the network
* How do we detect this and clean up? How do we repair all the data it was storing? We should stop considering that node as a candidate for uploads.

Do we currently store the number of intended replicas of a blob as state somewhere? I think ideally we automatically attempt to re-establish the requested number of replicas any time the actual number drops, which I think is exactly when a node leaves (voluntarily or involuntarily).

I'm curious what we want to do in the case where the node held the only copy. Do we simply lose the data, and call that the risk of a single replica? That seems pretty reasonable to me, as replicated data should be considered the norm, but we should probably make sure we're comfortable being explicit about that.
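
To make that concrete, here's a rough sketch (with made-up names, not the actual Storacha schema) of what storing the intended replica count as state and topping back up whenever the actual count drops could look like:

```go
// Hypothetical sketch: track the requested replica count as state and
// reconcile whenever the observed count drops (e.g. a node leaves).
// Names and types are illustrative only.
package replicas

import "context"

// BlobReplicaState records how many replicas were requested for a blob
// and which nodes are currently believed to hold one.
type BlobReplicaState struct {
	Blob      string   // multihash of the blob
	Requested int      // replication factor requested at upload time
	Holders   []string // node IDs believed to hold a replica
}

// Replicator creates one additional replica on some healthy node,
// excluding nodes that already hold one.
type Replicator interface {
	ReplicateOnce(ctx context.Context, blob string, exclude []string) (node string, err error)
}

// Reconcile tops the blob back up to its requested replica count. It is
// idempotent: calling it when enough replicas exist is a no-op, so the same
// path can handle node departure, replication failure and data loss.
func Reconcile(ctx context.Context, st *BlobReplicaState, r Replicator) error {
	for len(st.Holders) < st.Requested {
		node, err := r.ReplicateOnce(ctx, st.Blob, st.Holders)
		if err != nil {
			return err // give up for now; a later pass retries
		}
		st.Holders = append(st.Holders, node)
	}
	return nil
}
```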

Comment on lines +15 to +16
* Allocation failure
* If we cannot allocate on a node, we should try a _different_ node. We should consider that an allocation failure could be temporary and caused by a full disk or adverse network conditions. Aside: we should consider increasing the probability of allocating on nodes that are close to the client uploading the data, to mitigate firewalls (inability to reach the node) and allow for faster upload times.

Comment on lines +7 to +8
* Replication failure
* What happens when a replication does not succeed? Currently nothing.

Ideally, I'd like to handle this with the same mechanism as when a node leaves: the number of actual replicas is not equal to the requested number, so we attempt to replicate on nodes until it is.

Comment on lines +9 to +10
* Data deletion
* Long standing problem. We have a better story for this on storage nodes but no implementation. An interesting/difficult problem is how to deal with deletion in the context of our PDP root aggregates as well as our Filecoin deal aggregations. There is also data stored on the IPNI chain on the indexer that needs to be cleaned up. Or consider implementing [RFC #52](https://github.com/storacha/RFC/pull/52)

Hmm. This is interesting. Leaving Filecoin aside, is this a signal that the strategy of aggregating blobs for PDP is flawed? The cost of not aggregating is the increased gas/storage ratio for many small blobs. I'm spitballing, but could there be a way to carry a similar cost ratio all the way to the Storacha customer, to discourage that behavior, or to compensate for it? Or would that just be too much complexity for too little value?

Comment on lines +11 to +12
* Data loss
* If a node loses some data, what happens? How do we know? We should probably be less inclined to use them for storing data. How do we ensure minimum replicas are maintained? This is linked to proving failure.

Again, I'd like to handle this idempotently, like when a replication fails or a node leaves: if data is lost, that replica no longer exists, so the number of replicas is too low, so we replicate. (That doesn't address how to weight that node lower, though.)

Comment on lines +13 to +14
* Proving failure
* If a node falls below the successful proving threshold, what happens? We probably should NOT attempt to store _more_ data on that node.

I would expect some sort of softer version of the data loss case. The word I want to use here is "slashing", but I think that usually implies something harsher than what we probably want here, at least at first. If bad behavior continues, it should probably escalate into something that feels more like "slashing", though.

Comment on lines +17 to +18
* Accept failure
* After uploading a blob the upload service should at least retry a blob/accept invocation to a storage node. If the node responds with an error receipt, what is the implication? Has the node failed to store the data? How should it affect our desire to store _more_ data on the node?

If the node "successfully" returns an error receipt, we can be confident (I think) that it knows that it failed, so we can call that a failure to store. We should probably choose a different node and try again.

It should probably also play into the "reputation"/weighting system, yeah. But maybe with a lower impact than other things, since it's not terrible to run into this and have to try another node, unless it happens a lot, at which point the effect would add up.

If the node fails to return a receipt, I'm not exactly sure. Can we follow up by asking for the receipt, and ideally get a success receipt, a fail receipt, or an affirmative 404 that the receipt doesn't exist and (presumably) the storage didn't happen?
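
Roughly the decision flow I'm imagining, as a non-authoritative sketch; the receipt-polling interface and statuses here are assumptions, not an existing API:

```go
// Hypothetical sketch of handling blob/accept outcomes by polling for a
// receipt. Receipt states and function names are illustrative.
package accept

import (
	"context"
	"errors"
)

type ReceiptStatus int

const (
	ReceiptOK       ReceiptStatus = iota // node accepted and stored the blob
	ReceiptError                         // node returned an error receipt: storage failed
	ReceiptNotFound                      // node affirms no receipt exists: assume not stored
)

type Node interface {
	PollAcceptReceipt(ctx context.Context, blob string) (ReceiptStatus, error)
}

// ErrTryAnotherNode signals the caller to reallocate the blob elsewhere
// (and to lower the node's weight in whatever reputation system we adopt).
var ErrTryAnotherNode = errors.New("accept failed: choose a different node")

func ConfirmAccept(ctx context.Context, n Node, blob string) error {
	status, err := n.PollAcceptReceipt(ctx, blob)
	if err != nil {
		return err // transient poll failure: caller should retry the poll
	}
	switch status {
	case ReceiptOK:
		return nil
	case ReceiptError, ReceiptNotFound:
		// Error receipt, or affirmative "no receipt": treat as not stored.
		return ErrTryAnotherNode
	default:
		return errors.New("unknown receipt status")
	}
}
```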

Comment on lines +19 to +20
* Retrieval failure
* How do we observe and verify failures to retrieve? How do we prevent storage nodes from accepting retrieval invocations, claiming the egress and not sending the data?

In AWS or Cloudflare, you pretty much have to trust the host to tell the truth about what it sent your clients. The best assurance you have is logging: if the logs don't match the bill, there's an obvious problem, and if the logs don't match what the clients receive, you have a less obvious/certain problem, but still something that hopefully is evident enough to be caught.

Can we make similar logging available enough to customers in real-ish time so that they can keep the nodes honest by reporting issues to us? We'd still have to trust the nodes a fair bit, but maybe issues become manual investigations with a penalty of getting kicked off the network for any fraud?
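
As a very rough illustration of that reconciliation, assuming we can collect node egress claims and client-side retrieval reports (all types here are invented for the example, and the matching is deliberately naive):

```go
// Sketch: compare what a node claims it egressed with what clients report
// receiving, and flag mismatches for manual investigation.
package egress

type EgressClaim struct {
	Node  string
	Blob  string
	Bytes uint64 // bytes the node says it served
}

type ObservedRetrieval struct {
	Blob  string
	Bytes uint64 // bytes the client reports receiving
}

type Discrepancy struct {
	Node              string
	Blob              string
	Claimed, Observed uint64
}

// Audit sums client-reported bytes per blob and flags any node claim that
// exceeds what clients report having received. Flagged claims are candidates
// for investigation (and ultimately for removal from the network).
func Audit(claims []EgressClaim, observed []ObservedRetrieval) []Discrepancy {
	seen := make(map[string]uint64)
	for _, o := range observed {
		seen[o.Blob] += o.Bytes
	}
	var out []Discrepancy
	for _, c := range claims {
		if seen[c.Blob] < c.Bytes {
			out = append(out, Discrepancy{Node: c.Node, Blob: c.Blob, Claimed: c.Bytes, Observed: seen[c.Blob]})
		}
	}
	return out
}
```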

* Node leaving the network
* How do we detect this and clean up? How do we repair all the data it was storing? We should stop considering that node as a candidate for uploads.

A lot of these failure cases boil down to us maintaining node weights, and to how each of these situations affects a weighting. That said, it would be good to find some time to consider whether weights are the right solution or if there is a better alternative, like a broader reputation system that is maybe provable and observable. There are some much harder problems here, though, and it would be good to consider what we could put in place to either solve them or align incentives in a way that makes bad behaviour undesirable.

I think it's enough to sign an audit log of our reasoning for altering weights. I don't think we need to "prove" our justifications with unimpeachable evidence, such that (eg) a smart contract could come to the same conclusions. I think we make assertions based on our own policies and judgements, make those decisions transparent, and then stand behind them.

Ultimately, we're a central-but-replaceable service holding this routing table. If people don't like our management of it, if we do our job right, they could run their own on top of the same network. This portion is akin to Bluesky: someone else can build their own Bluesky clone on atproto if they don't like how the company is running the app.
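
For example, a signed audit-log entry for a weight change might look something like this sketch, assuming an ed25519 service key; field names and encoding are illustrative:

```go
// Minimal sketch of a signed audit-log entry for weight changes.
package auditlog

import (
	"crypto/ed25519"
	"encoding/json"
	"time"
)

type WeightChange struct {
	Node      string    `json:"node"`
	OldWeight float64   `json:"old_weight"`
	NewWeight float64   `json:"new_weight"`
	Reason    string    `json:"reason"` // e.g. "repeated proving failures"
	Time      time.Time `json:"time"`
}

type SignedEntry struct {
	Entry     WeightChange `json:"entry"`
	Signature []byte       `json:"signature"` // over the JSON encoding of Entry
}

// Sign serialises the entry and signs it with the service key so anyone can
// verify that we, and not a third party, made and justified the change.
func Sign(priv ed25519.PrivateKey, wc WeightChange) (SignedEntry, error) {
	payload, err := json.Marshal(wc)
	if err != nil {
		return SignedEntry{}, err
	}
	return SignedEntry{Entry: wc, Signature: ed25519.Sign(priv, payload)}, nil
}
```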

@fforbeck fforbeck left a comment


Just sharing some thoughts here. Not sure how feasible they are at this stage.

* Replication failure
* What happens when a replication does not succeed? Currently nothing.

Considering that replication is our durability contract with the user, I believe we should keep retrying until the minimum replication factor is reached; otherwise, we need to mark that blob as "at_risk" or something so we don't silently fail to replicate, which could become data loss in the future when nodes are not available.
Perhaps we could have some sort of control loop that periodically checks whether a blob's replica count is below the target threshold and, if so, creates new replicas on healthy nodes — something like the sketch below.
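
A sketch only; the store and replicator interfaces here are assumptions about what we'd need, not existing APIs:

```go
// One way the control loop could look: a periodic pass that finds blobs
// below their target replica count and tops them up on healthy nodes.
package repair

import (
	"context"
	"time"
)

type UnderReplicated struct {
	Blob    string
	Missing int // target minus current replica count
}

type Store interface {
	ListUnderReplicated(ctx context.Context) ([]UnderReplicated, error)
	MarkAtRisk(ctx context.Context, blob string) error
}

type Replicator interface {
	Replicate(ctx context.Context, blob string, copies int) error
}

func RunControlLoop(ctx context.Context, st Store, r Replicator, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			blobs, err := st.ListUnderReplicated(ctx)
			if err != nil {
				continue // try again next tick
			}
			for _, b := range blobs {
				if err := r.Replicate(ctx, b.Blob, b.Missing); err != nil {
					// Could not restore the target yet: flag it so the
					// failure is visible rather than silent.
					_ = st.MarkAtRisk(ctx, b.Blob)
				}
			}
		}
	}
}
```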


* Replication failure
* What happens when a replication does not succeed? Currently nothing.
* Data deletion

What if we handle it in two phases:

  1. First, mark data for deletion in a central auditable log that is publicly available for nodes to query. This info gets propagated to the nodes that hold the replica (for quick lookup).
  2. Kick off the byte clean-up process in the background; when data that is in the node's local delete-intent log is requested, it is not served to the client (sketched below).
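
A minimal sketch of phase 2 on a storage node, assuming a local delete-intent log synced from the central auditable log (all interfaces are hypothetical):

```go
// Sketch: refuse to serve blobs marked for deletion and clean up lazily.
package deletion

import (
	"context"
	"errors"
)

var ErrGone = errors.New("blob is marked for deletion")

type DeleteIntentLog interface {
	MarkedForDeletion(blob string) bool
}

type BlobStore interface {
	Get(ctx context.Context, blob string) ([]byte, error)
	Remove(ctx context.Context, blob string) error
}

// Serve refuses to serve blobs in the delete-intent log and opportunistically
// kicks off the byte clean-up in the background.
func Serve(ctx context.Context, log DeleteIntentLog, store BlobStore, blob string) ([]byte, error) {
	if log.MarkedForDeletion(blob) {
		go func() { _ = store.Remove(context.Background(), blob) }()
		return nil, ErrGone
	}
	return store.Get(ctx, blob)
}
```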

Regarding the PDP root aggregates, would it be possible to make the root aggregate expire every X min/hour/day/etc, and then rebuild it taking the deletion data into account? I mean, we wouldn't include blobs that are marked as deleted in the local deletion log.

With that, once a blob is no longer referenced by any currently provable root, we can do the byte clean-up. The deletion won't happen immediately, but within a time window we define.
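
A sketch of that rebuild step, with placeholder types and a placeholder aggregation function, just to show deleted blobs being filtered out before re-aggregating:

```go
// Sketch: when an aggregate expires, rebuild it from surviving blobs only,
// so deleted blobs stop being referenced by any provable root.
package aggregate

type Blob struct {
	Hash string
	Size uint64
}

// RebuildAggregate filters out blobs present in the deletion log before
// re-aggregating. Once a blob is absent from every live aggregate, its bytes
// can be cleaned up within the defined time window.
func RebuildAggregate(blobs []Blob, deleted map[string]bool, aggregate func([]Blob) []byte) []byte {
	kept := make([]Blob, 0, len(blobs))
	for _, b := range blobs {
		if !deleted[b.Hash] {
			kept = append(kept, b)
		}
	}
	return aggregate(kept)
}
```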

For Filecoin, is it required to store on Filecoin by default? Like, can we make Filecoin opt-in, not default durability?
Then we can say that, by default, the client's data stays in our hot tier, where they have control, fast performance, and the ability to delete. They could optionally publish to Filecoin for more durability, but that path is more "permanent" (I'm assuming we can't simply delete), and then we would recommend using encryption if the client wants to store there. That would allow us to stop serving the data and delete any encryption keys we may hold (if we hold any).

* What happens when a replication does not succeed? Currently nothing.
* Data deletion
* Long standing problem. We have a better story for this on storage nodes but no implementation. An interesting/difficult problem is how to deal with deletion in the context of our PDP root aggregates as well as our Filecoin deal aggregations. There is also data stored on the IPNI chain on the indexer that needs to be cleaned up. Or consider implementing [RFC #52](https://github.com/storacha/RFC/pull/52)
* Data loss

I guess we will need to continuously check data possession and content retrieval, and if any failures are found, report them as data/replica loss so we can kick off some sort of repair process to restore the required replication level. Do we have a global map of the replicas?
