[BUG] cluster state call blocks http worker thread #20426

@bowenlan-amzn

Describe the bug

When handling /_cluster/state/_all/_all requests, the ClusterState.toXContent() serialization runs synchronously on the Netty HTTP worker thread. For large clusters, this serialization can take 4+ seconds, during which the HTTP worker thread is blocked and cannot process other requests.

Related component

Cluster Manager

To Reproduce

 98.2% (491ms out of 500ms) cpu usage by thread 'opensearch[...][http_server_worker][T#2]'
      org.opensearch.common.settings.Settings.convertMapsToArrays(Settings.java:195)
      org.opensearch.common.settings.Settings.toXContent(Settings.java:602)
      org.opensearch.cluster.metadata.IndexMetadata$Builder.toXContent(IndexMetadata.java:1875)
      org.opensearch.cluster.metadata.Metadata$Builder.toXContent(Metadata.java:1962)
      org.opensearch.cluster.metadata.Metadata.toXContent(Metadata.java:1088)
      org.opensearch.cluster.ClusterState.toXContent(ClusterState.java:535)
      org.opensearch.rest.action.admin.cluster.RestClusterStateAction$1.buildResponse(RestClusterStateAction.java:166)
      ...
      io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
      io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)

Expected behavior

Heavy serialization work should be offloaded to a separate thread pool instead of running on the Netty I/O thread, similar to how other CPU-intensive operations are handled.

This could be fixed by following the pattern below, already used for GetMappingsResponse, which offloads serialization to the management thread pool.

// Process serialization on GENERIC pool since the serialization of the raw mappings to XContent can be too slow to execute
// on an IO thread
threadPool.executor(ThreadPool.Names.MANAGEMENT)
    .execute(ActionRunnable.wrap(this, l -> new RestBuilderListener<GetMappingsResponse>(channel) {
        @Override
        public RestResponse buildResponse(final GetMappingsResponse response, final XContentBuilder builder)
            throws Exception {
            if (threadPool.relativeTimeInMillis() - startTimeMs > timeout.millis()) {
                throw new OpenSearchTimeoutException("Timed out getting mappings");
            }
            builder.startObject();
            response.toXContent(builder, request);
            builder.endObject();
            return new BytesRestResponse(RestStatus.OK, builder);
        }
    }.onResponse(getMappingsResponse)));
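
To illustrate why the hand-off matters, here is a minimal, self-contained sketch that uses only the JDK. It is not OpenSearch code: `OffloadSketch`, `IO_WORKER`, `MANAGEMENT`, `slowSerialize`, `handleInline`, and `handleOffloaded` are hypothetical names, a single-threaded executor stands in for one Netty HTTP worker, and a `Thread.sleep` stands in for the expensive `ClusterState.toXContent()` call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadSketch {
    // Hypothetical stand-in for a single Netty HTTP worker thread (an event loop).
    static final ExecutorService IO_WORKER =
        Executors.newSingleThreadExecutor(r -> daemon(r, "http_server_worker"));
    // Hypothetical stand-in for the MANAGEMENT thread pool.
    static final ExecutorService MANAGEMENT =
        Executors.newFixedThreadPool(2, r -> daemon(r, "management"));

    static Thread daemon(Runnable r, String name) {
        Thread t = new Thread(r, name);
        t.setDaemon(true); // let the JVM exit without an explicit shutdown
        return t;
    }

    // Simulates slow serialization and reports which thread did the work.
    static String slowSerialize(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Thread.currentThread().getName();
    }

    // Current behavior: the serialization runs on the I/O worker itself,
    // so every request queued behind it has to wait.
    static CompletableFuture<String> handleInline(long millis) {
        return CompletableFuture.supplyAsync(() -> slowSerialize(millis), IO_WORKER);
    }

    // Proposed behavior: the I/O worker only hands the work to another pool
    // and is immediately free to pick up the next request.
    static CompletableFuture<String> handleOffloaded(long millis) {
        CompletableFuture<String> result = new CompletableFuture<>();
        IO_WORKER.execute(() -> MANAGEMENT.execute(() -> result.complete(slowSerialize(millis))));
        return result;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> heavy = handleOffloaded(500); // large cluster state
        CompletableFuture<String> cheap = handleInline(0);      // unrelated small request
        System.out.println("cheap request served by " + cheap.get());
        System.out.println("heavy serialization ran on " + heavy.get());
    }
}
```

In this sketch the cheap request does not wait for the 500 ms serialization, because that work runs on a `management` thread rather than on `http_server_worker`. If the heavy request went through `handleInline` instead, the cheap request would queue behind it, which is the behavior this issue describes.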

Additional Details

No response
