Description
Describe the bug
Summary
Creating a WLM workload group (PUT /_wlm/workload_group) intermittently fails with WorkloadGroup.updatedAtInMillis is not a valid epoch when the REST request is served by a data node. The failure rate increases when traffic is routed to data nodes; the failure disappears when the request is sent directly to the cluster-manager node.
Environment
- OpenSearch: 3.4.0
- Feature: Workload Management (WLM) workload groups
- TLS enabled
- Cluster with separate data and cluster-manager nodes
- NTP appears healthy (sub-50ms jitter, no obvious clock skew in node stats)
Current Behavior
Request sometimes returns:
{
"error": {
"root_cause": [
{ "type": "illegal_argument_exception", "reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch" }
],
"type": "illegal_argument_exception",
"reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
},
"status": 400
}
It happens more often when the REST request is served by a data node; sending the same request directly to the cluster-manager node avoids the error.
Related component
Plugins
To Reproduce
- Check node timestamps:
GET _nodes/stats?filter_path=nodes.*.name,nodes.*.timestamp
{
  "nodes": {
    "RIYgIe82SJKB6_v0tFhrHw": {
      "timestamp": 1769418930198,
      "name": "cm-0"
    },
    "uRuwSYQYQOCVumlqnvp9AQ": {
      "timestamp": 1769418930193,
      "name": "data-0"
    },
    "lbuwkeheQg6e8CYkUrmhBQ": {
      "timestamp": 1769418930202,
      "name": "cm-1"
    },
    "QnvaRRBlRv6kg2DGoXKmiw": {
      "timestamp": 1769418930215,
      "name": "cm-2"
    },
    "88Hk51bYRailzRngI33wmg": {
      "timestamp": 1769418930211,
      "name": "data-1"
    }
  }
}
- Call the API directly on the data-1 node (1769418930211), which forwards the request to the cm-0 node (1769418930198)
- The API call fails on the data-1 node:
curl -XPUT https://localhost:9200/_wlm/workload_group
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
}
],
"type": "illegal_argument_exception",
"reason": "WorkloadGroup.updatedAtInMillis is not a valid epoch"
},
"status": 400
}
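As a rough illustration of the skew involved, the sketch below compares the timestamps reported in the node stats output above. The helper name `offset_ms` is hypothetical. Node stats timestamps are taken when each node assembles its own stats, so the differences include collection latency and are only an approximation of real clock offset.

```python
# Timestamps (epoch milliseconds) copied from the _nodes/stats output in this report.
node_timestamps = {
    "cm-0": 1769418930198,
    "data-0": 1769418930193,
    "cm-1": 1769418930202,
    "cm-2": 1769418930215,
    "data-1": 1769418930211,
}


def offset_ms(sender: str, receiver: str) -> int:
    """Return how far the sender's reported timestamp is ahead of the receiver's, in ms.

    Hypothetical helper for illustration; not part of OpenSearch.
    """
    return node_timestamps[sender] - node_timestamps[receiver]


# A positive value means the data node's timestamp is later than cm-0's.
for data_node in ("data-0", "data-1"):
    print(data_node, "-> cm-0:", offset_ms(data_node, "cm-0"), "ms")
```

With these values, data-1 is 13 ms ahead of cm-0 while data-0 is 5 ms behind it, which would be consistent with the failure appearing only for some data-node requests.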
Expected behavior
Workload group creation should not fail due to minor clock skew between the REST node and the cluster-manager node.
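A minimal sketch of what skew-tolerant validation could look like, written in Python for brevity. It assumes (this report does not confirm it) that the timestamp is set on the node serving the REST request and then rejected by the cluster-manager when it is later than the cluster-manager's own clock. The function names and the `ALLOWED_SKEW_MS` value are hypothetical and are not OpenSearch settings or APIs.

```python
# Hypothetical tolerance in milliseconds; an assumption, not an OpenSearch setting.
ALLOWED_SKEW_MS = 1000


def is_valid_epoch_strict(updated_at_ms: int, now_ms: int) -> bool:
    """Assumed current behavior: reject any timestamp later than the local clock."""
    return 0 <= updated_at_ms <= now_ms


def is_valid_epoch_tolerant(
    updated_at_ms: int, now_ms: int, allowed_skew_ms: int = ALLOWED_SKEW_MS
) -> bool:
    """Accept timestamps up to allowed_skew_ms ahead of the local clock."""
    return 0 <= updated_at_ms <= now_ms + allowed_skew_ms


# Values taken from the node stats in this report.
data_1_ms = 1769418930211  # node that served the REST request
cm_0_ms = 1769418930198    # cluster-manager node

print(is_valid_epoch_strict(data_1_ms, cm_0_ms))    # False: 13 ms "in the future"
print(is_valid_epoch_tolerant(data_1_ms, cm_0_ms))  # True: within the tolerance
```

An alternative design under the same assumption would be to assign the timestamp on the cluster-manager node itself, which removes the cross-node comparison entirely.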
Additional Details
Plugins
workload management
Host/Environment (please complete the following information):
- OS: Ubuntu
- Version: 22