[RFC] Pluggable ClusterManagerService #20443

@msfroh

Description

Is your feature request related to a problem? Please describe

In #17957, @yupeng9 proposed a "cloud-native" architecture for OpenSearch that replaces the cluster model with a decoupled "controller" that pushes configuration to external storage to be consumed by data nodes that will adopt the specified configuration.

The status quo

That issue has its own diagram, but I think a better way of looking at it is that the "normal" way of running OpenSearch can be summarized like this:

[Image: the status quo, with cluster managers and data nodes tightly coupled in a single cluster and the elected cluster manager replicating cluster state to every node]

That is, cluster managers and data nodes are tightly-coupled within a cluster. The lead cluster manager builds "the cluster state" which is replicated to every other node.

Where we want to get

The linked cloud-native RFC proposed a decoupled architecture that looks like that of most cloud services. Ignoring the pull-based indexing and segment replication pieces, I think the following captures the important parts:

[Image: the decoupled target architecture, with a control plane pushing configuration to external storage that data nodes consume]

(Also, unlike the linked cloud-native RFC, this highlights the fact that "someone" needs to react to the data plane state to know how to assign shards and potentially move shards around. We've assumed that's the control plane, though it could theoretically be a distinct data plane component. Shard management doesn't need to be tightly coupled to metadata management.)

Remote cluster state

There has been work done on OpenSearch core to implement remote cluster state. It doesn't yet deal with the global nature of cluster state, and is aiming more for a lazy-loading approach to avoid pushing the full cluster state everywhere. The current implementation kind of looks like:

[Image: the current remote cluster state implementation, with cluster managers persisting cluster state to a remote store]

Pros:

  • Reuses existing cluster manager logic.
  • Focuses on compatibility with existing logic.
  • Better potential for plugin integration, since it is OpenSearch code.
  • Delivers incremental value for users of the existing cluster model. (More scalable, more durable.)

Cons:

  • Will take a long time to fully decouple cluster managers from existing cluster assumptions.
  • Maintains consistency guarantees of existing cluster model, which prevents us from reaping the benefit of a decoupled, eventually-consistent design.

"Cloud-native" work in progress

Over the past year, we've implemented that architecture with the data plane logic encapsulated in a plugin in https://github.com/opensearch-project/cluster-etcd/tree/main/plugin. The controller is implemented in https://github.com/TejasNaikk/cluster-controller/. Essentially, the architecture looks like:

[Image: the clusterless work in progress, with a standalone controller and the cluster-etcd data plane plugin communicating through an external (etcd) store]

Pros:

  • Doesn’t require any refactoring of existing cluster manager code. (In fact, we just disable all existing cluster manager code. ClusterManagerService never gets created in this implementation.)
  • Quick to deliver direct benefits of decoupled metadata management. (Cannot have cluster manager issues if you don’t have cluster managers. Don’t need to break apart the cluster state if you construct it from parts.)
  • Works on the foundational assumption that metadata is eventually consistent, prioritizing availability.

Cons:

  • Duplication of effort / logic from existing cluster managers.
  • New features need to be duplicated as well, over time.
  • Potential for drift from cluster manager behavior over time (albeit with consistent behavior from the standalone controller).
  • Does not play well with plugins that manage custom objects. CRUD operations on plugin entities should be control plane operations.

I don't really like the name "cloud-native" for this implementation (since it's vague) and prefer "clusterless", since there is no cluster formation in the OpenSearch sense. Data nodes start up and don't even try to join a cluster.

Based on some review with @shwetathareja and her colleagues, the data plane side seems promising, but the standalone controller is too decoupled from OpenSearch and essentially needs to duplicate work from the cluster manager. Indeed, you can see the APIs to manage indices, aliases, and templates here. There's also new shard allocation logic implemented here.

Describe the solution you'd like

After giving it some more thought, I think we can borrow the idea behind the data plane implementation in the cluster-etcd plugin and apply it to the cluster manager side of things. Essentially, we would end up with a design that looks something like the following:

[Image: the proposed design, with a small cluster of real cluster managers whose ClusterManagerService reads and writes partial cluster state through pluggable supplier/publisher implementations backed by an external store]

The basic idea is as follows:

  1. We use real OpenSearch cluster managers in their own cluster. That is, we have a cluster of (say) 3 cluster managers and one of them is elected the leader.
  2. Cluster manager APIs would be unchanged at the REST layer. For example, PUT /<index> would still pass through RestCreateIndexAction to TransportCreateIndexAction.
  3. We still route the request to the elected leader (per the logic inherited from TransportClusterManagerNodeAction). We can benefit from OpenSearch's existing leader election logic to avoid concurrent updates to the config store. (See below for discussion of multi-tenancy and scaling operations across the "control plane cluster".)
  4. New logic: Pretty much every cluster manager API ends up submitting a ClusterStateUpdateTask. For example, this is where it happens for create index. These tasks currently take a full cluster state and return a full cluster state with whatever operation applied. One goal of this exercise is to remove cluster state size limits by only materializing required cluster state. Proposal: Any ClusterStateUpdateTask should specify what parts of cluster state it needs for its input (see the first sketch after this list). Note that it is fine if the ClusterStateUpdateTask receives too much (so the status quo, where it gets everything, will still work).
  5. Currently, the submitClusterStateUpdateTask API gets sent to the singleton ClusterManagerService. The proposal here is that we extract from ClusterManagerService the parts that strictly require membership in a "cluster". This is almost complete, since we have clusterStatePublisher and clusterStateSupplier. We just need a way to a) replace clusterStateSupplier with a function that can return a partial cluster state (see the previous item) and b) let a plugin provide an implementation of ClusterStatePublisher (see the second sketch after this list).
  6. For a "clusterless" or "cloud-native" implementation, the ClusterManagerService will be able to read the relevant parts of "cluster state" from the plugged-in clusterStateSupplier, which will fetch the relevant parts from the external store to build the (partial) ClusterState object on the fly. Then ClusterManagerService will pass it to the ClusterStateUpdateTask (on its task executor), which will produce a new ClusterState. The old and new ClusterState are passed as a ClusterChangedEvent to the plugged-in clusterStatePublisher, which will write the required changes to the external store to be consumed by data nodes (or to be consumed by a future cluster manager operation). Note that the plugin may cache parts of cluster state to avoid reads from the external store. (The plugin should assume that no other writer is modifying the goal state for the managed cluster.)
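
To make step 4 concrete, here is a minimal sketch of what declaring the required parts of cluster state could look like. Everything below (ClusterStatePart, PartialStateAware, requiredParts(), the example task) is a hypothetical name for illustration, not an existing OpenSearch API:

```java
// Hypothetical sketch: a way for a ClusterStateUpdateTask to declare which
// slices of cluster state it needs. Names below are illustrative only.
import java.util.EnumSet;
import java.util.Set;

enum ClusterStatePart {
    METADATA,          // index metadata, templates, aliases
    ROUTING_TABLE,     // shard allocation
    NODES,             // known data nodes
    CUSTOMS,           // plugin-managed customs
    EVERYTHING         // default: behave like today's full-state tasks
}

interface PartialStateAware {
    // By default a task behaves exactly as today and asks for the full state,
    // so existing tasks keep working unchanged.
    default Set<ClusterStatePart> requiredParts() {
        return EnumSet.of(ClusterStatePart.EVERYTHING);
    }
}

// A create-index style task would only ask for metadata and the routing table,
// letting the pluggable supplier materialize just those parts.
final class CreateIndexTask implements PartialStateAware {
    @Override
    public Set<ClusterStatePart> requiredParts() {
        return EnumSet.of(ClusterStatePart.METADATA, ClusterStatePart.ROUTING_TABLE);
    }
}
```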

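Steps 5 and 6 boil down to two extension points that a plugin could provide. Again, this is only a rough sketch under assumed names and signatures; none of these interfaces exist in core in this exact form:

```java
// Hypothetical sketch of the two plugin points described above.
// ClusterStatePart comes from the previous sketch.
import java.util.Set;

// Stand-ins for the real ClusterState / ClusterChangedEvent types.
final class PartialClusterState { /* only the requested parts are populated */ }
final class StateChange { /* old + new (partial) state, i.e. a ClusterChangedEvent */ }

// (a) Replaces the current clusterStateSupplier: given the parts a task asked
// for, build a partial state, possibly by reading from the external store
// (with optional caching inside the plugin).
interface PartialClusterStateSupplier {
    PartialClusterState get(Set<ClusterStatePart> requiredParts);
}

// (b) A pluggable publisher: instead of broadcasting to cluster members,
// a "clusterless" plugin writes the resulting diff to the external store.
interface PluggableClusterStatePublisher {
    void publish(StateChange change);
}

// A plugin would register both (e.g. through a new, hypothetical hook on
// ClusterPlugin), and ClusterManagerService would use them in place of its
// built-in supplier and publisher.
```
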
Pros:

  • Pushes cluster state reading and writing with an external store into a plugin, not OpenSearch core (unlike existing remote cluster state logic, which is littered through core). In particular, the decisions of how to decompose and serialize cluster state are removed from core. As long as the logic is consistent with the data plane plugin, we are free to evolve.
  • Reuses existing OpenSearch cluster manager logic (unlike the standalone controller that needs to reimplement each API, plus shard allocation logic, etc).
  • Relatively quick to get started.
    • It's a fairly tactical change to ClusterManagerService to add the plugin points. Of course, the plugin needs to know how to handle all possible ClusterState changes, but at least we know that the new cluster state has been validated by existing logic.
    • To get started quickly, we could assume that by default any ClusterStateUpdateTask requires the full cluster state, even though that initially limits the maximum size of a managed cluster. Eventually, we would want every task to require only a (small) subset of the overall cluster state.

Cons:

  • We don't get to kill off cluster managers altogether -- I was so looking forward to that. (Though we do still get rid of "clusters".)

This is still not fully complete, of course. Since the data nodes never join the cluster containing the cluster manager nodes, we never process node join or removal requests. We also need an alternative to the Coordinator/JoinHelper logic that periodically looks at the heartbeats / actual state published to the external store to add and remove nodes, calling AllocationService#reroute as needed.
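
A very rough sketch of what that join/leave replacement might look like, assuming a plugin-provided view of the heartbeats in the external store. All of the types and names below are illustrative stand-ins; in a real implementation the reroute hook would go through AllocationService#reroute:

```java
// Hypothetical sketch of a periodic reconciler that replaces Coordinator/JoinHelper
// for a clusterless data plane.
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface ExternalStoreClient {
    Set<String> liveDataNodeIds();   // node IDs with a recent heartbeat
    Set<String> knownDataNodeIds();  // node IDs currently in the goal state
    void addNode(String nodeId);
    void removeNode(String nodeId);
}

final class NodeLifecycleReconciler {
    private final ExternalStoreClient store;
    private final Runnable rerouteHook; // stand-in for calling AllocationService#reroute
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    NodeLifecycleReconciler(ExternalStoreClient store, Runnable rerouteHook) {
        this.store = store;
        this.rerouteHook = rerouteHook;
    }

    void start() {
        scheduler.scheduleAtFixedRate(this::reconcile, 0, 30, TimeUnit.SECONDS);
    }

    private void reconcile() {
        Set<String> live = store.liveDataNodeIds();
        Set<String> known = store.knownDataNodeIds();
        boolean changed = false;
        for (String nodeId : live) {
            if (!known.contains(nodeId)) { store.addNode(nodeId); changed = true; }
        }
        for (String nodeId : known) {
            if (!live.contains(nodeId)) { store.removeNode(nodeId); changed = true; }
        }
        if (changed) {
            rerouteHook.run(); // would trigger shard reassignment via AllocationService#reroute
        }
    }
}
```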

We also need to be careful to differentiate between the ClusterService used to manage the cluster of cluster managers and the ClusterService for the "cluster" of data nodes that we're managing. In theory, the former gets wired up fully from the Node constructor, while the latter gets injected (into Actions for example).

Related component

Cluster Manager

Describe alternatives you've considered

See the first section.

Additional context

Above, I alluded to multi-tenancy. If we're still going to bring up several cluster manager hosts, it would be nice if they could manage more than one set of data nodes -- we could call it a "cluster" of data nodes, though the data nodes don't join or form a "cluster".

We can call it a "tenant" (as in "multi-tenant"), a "data cluster" (since it would feel a lot like a traditional OpenSearch cluster, though no individual node has or needs the full cluster state, and nodes don't chatter amongst themselves), or even a "namespace" (if we want to think of it as a logical grouping of nodes and indices, such that shards from those indices are assigned to those nodes).

Regardless of what we call it, each data cluster can be managed independently. That is, if we have no shared resources, a single pool of cluster managers can manage many data clusters. In this case, the elected leader of the cluster managers can be responsible for delegating management of individual data clusters to the other cluster manager nodes. (We never want multiple nodes managing the same data cluster, but picking one for each cluster is essentially the same problem as picking a primary for each index shard.)

Identifying a particular "tenant/data cluster/namespace" for a request is a challenge. In the standalone Controller implementation, we prepended the target to the existing REST API path (like PUT /<cluster-name>/<index-name>). Modifying all existing cluster manager REST APIs to add new routes (only if we're running in a mode where multi-tenancy makes sense) is a bit of a hassle, though. Maybe we could add a URL param or something?
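
For example, a single optional URL parameter could be resolved centrally instead of adding new routes everywhere. A minimal sketch, where the "Request" interface stands in for RestRequest and the parameter name "data_cluster" is made up purely for illustration:

```java
// Hypothetical sketch: resolve the target data cluster from an optional URL
// parameter so existing routes don't need new path prefixes.
final class DataClusterResolver {
    interface Request {
        String param(String key); // mirrors RestRequest#param(String)
    }

    private static final String PARAM = "data_cluster";
    private final String defaultCluster;

    DataClusterResolver(String defaultCluster) {
        this.defaultCluster = defaultCluster;
    }

    // PUT /my-index?data_cluster=tenant-a  -> "tenant-a"
    // PUT /my-index                        -> defaultCluster (single-tenant mode)
    String resolve(Request request) {
        String fromParam = request.param(PARAM);
        return (fromParam == null || fromParam.isEmpty()) ? defaultCluster : fromParam;
    }
}
```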
