[BUG] Workload group (Query Group) creation fails when request hits data node due to updated_at validation against CM clock #20485

@hyanggeun

Describe the bug

Summary
Creating a WLM workload group (PUT /_wlm/workload_group) intermittently fails with "WorkloadGroup.updatedAtInMillis is not a valid epoch" when the REST request is served by a data node. The error does not occur when the same request is sent directly to the cluster-manager node.

Environment

  • OpenSearch: 3.4.0
  • Feature: Workload Management (WLM) workload groups
  • TLS enabled
  • Cluster with separate data and cluster-manager nodes
  • NTP appears healthy (sub-50ms jitter, no obvious clock skew in node stats)

Current Behavior
Request sometimes returns:

{
  "error": {
    "root_cause": [
      { "type": "illegal_argument_exception", "reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch" }
    ],
    "type": "illegal_argument_exception",
    "reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
  },
  "status": 400
}

It happens more often when the REST request is served by a data node; sending the same request directly to the cluster-manager node avoids the error.
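
The behavior is consistent with a check that rejects any updatedAtInMillis later than the validating node's own current time. The sketch below is a minimal model under that assumption, not the actual OpenSearch code: the function name is_valid_epoch is hypothetical, and the two clock values are the data-1 and cm-0 timestamps reported under To Reproduce.

```python
def is_valid_epoch(updated_at_ms: int, validator_now_ms: int) -> bool:
    """Assumed check: the timestamp must lie between epoch 0 and the validator's 'now'."""
    return 0 <= updated_at_ms <= validator_now_ms


# Timestamp stamped on the REST (data) node, then validated against the cluster-manager clock.
data_node_clock_ms = 1769418930211  # data-1
cm_clock_ms = 1769418930198         # cm-0

# data-1 is ahead of cm-0, so its timestamp appears to be "in the future" on cm-0.
rejected = not is_valid_epoch(data_node_clock_ms, cm_clock_ms)

# When the cluster-manager both stamps and validates with its own clock, the check passes.
accepted = is_valid_epoch(cm_clock_ms, cm_clock_ms)

print(rejected, accepted)
```

Under this model, any REST node whose clock is even a few milliseconds ahead of the cluster-manager would trigger the error, which would explain why the failure is intermittent and tied to the serving node.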

Related component

Plugins

To Reproduce

  1. Check node timestamps:
GET _nodes/stats?filter_path=nodes.*.name,nodes.*.timestamp
{
  "nodes": {
    "RIYgIe82SJKB6_v0tFhrHw": {
      "timestamp": 1769418930198,
      "name": "cm-0"
    },
    "uRuwSYQYQOCVumlqnvp9AQ": {
      "timestamp": 1769418930193,
      "name": "data-0"
    },
    "lbuwkeheQg6e8CYkUrmhBQ": {
      "timestamp": 1769418930202,
      "name": "cm-1"
    },
    "QnvaRRBlRv6kg2DGoXKmiw": {
      "timestamp": 1769418930215,
      "name": "cm-2"
    },
    "88Hk51bYRailzRngI33wmg": {
      "timestamp": 1769418930211,
      "name": "data-1"
    }
  }
}

  2. Send the request directly to the data-1 node (1769418930211), which forwards it to the cm-0 node (1769418930198).

  3. The API call fails on the data-1 node:

curl -XPUT https://localhost:9200/_wlm/workload_group
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
  },
  "status": 400
}
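
The node timestamps from the first step can be turned into per-node offsets relative to cm-0 to see which nodes are ahead of it. This is a rough sketch only: each node reports its timestamp when it builds its own stats response, so the differences approximate clock skew rather than measure it, and the helper name skew_vs is hypothetical.

```python
# Timestamps (ms since epoch) copied from the _nodes/stats output in this report.
node_timestamps_ms = {
    "cm-0": 1769418930198,
    "data-0": 1769418930193,
    "cm-1": 1769418930202,
    "cm-2": 1769418930215,
    "data-1": 1769418930211,
}


def skew_vs(reference: str, timestamps: dict) -> dict:
    """Return each node's offset in ms relative to the reference node."""
    base = timestamps[reference]
    return {name: ts - base for name, ts in timestamps.items()}


skew = skew_vs("cm-0", node_timestamps_ms)

# Nodes with a positive offset would stamp timestamps that cm-0 sees as "in the future".
ahead_of_cm0 = sorted(name for name, delta in skew.items() if delta > 0)

for name, delta in sorted(skew.items()):
    print(f"{name}: {delta:+d} ms")
print("ahead of cm-0:", ahead_of_cm0)
```

With these values data-1 is 13 ms ahead of cm-0 while data-0 is 5 ms behind, which matches the observation that the request fails when served by data-1.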

Expected behavior

Workload group (query group) creation should not fail due to minor clock skew between the REST node and the cluster-manager node.
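
One way to get this behavior would be to allow a small tolerance when checking the upper bound of the timestamp. The sketch below only illustrates that idea and is not a proposed patch: the function name is_valid_epoch_with_tolerance and the max_skew_ms parameter are hypothetical, and the 50 ms default is taken from the NTP jitter mentioned under Environment. Stamping the timestamp on the cluster-manager node instead of the REST node would be another option.

```python
def is_valid_epoch_with_tolerance(
    updated_at_ms: int, validator_now_ms: int, max_skew_ms: int = 50
) -> bool:
    """Accept timestamps up to max_skew_ms ahead of the validator's clock (hypothetical)."""
    return 0 <= updated_at_ms <= validator_now_ms + max_skew_ms


cm_now_ms = 1769418930198  # cm-0 timestamp from this report

# The 13 ms lead of data-1 falls within the tolerance and is accepted.
minor_skew_ok = is_valid_epoch_with_tolerance(1769418930211, cm_now_ms)

# A timestamp one second in the future is still rejected.
far_future_ok = is_valid_epoch_with_tolerance(cm_now_ms + 1000, cm_now_ms)

print(minor_skew_ok, far_future_ok)
```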

Additional Details

Plugins
workload management

Host/Environment (please complete the following information):

  • OS: Ubuntu
  • Version: 22
