Add Minimum Load Threshold Setting #519

Open
mskouba wants to merge 1 commit into kubeai-project:main from mskouba:add-min-threshold

Conversation

mskouba (Contributor) commented May 6, 2025

We have been using KubeAI to load balance across hundreds of vLLM instances using the PrefixHash load balancing strategy. We've encountered an edge case that occurs when two conditions are met: (1) you have hundreds of instances running, and (2) your traffic is such that a large percentage of it (say 25%) shares the same prefix.

As traffic ramps up, the calculated load threshold becomes a very small number because the large number of instances sits in the denominator. With such a low threshold, no instance appears ready for the next request. This would be acceptable if each request had a unique prefix, since the default instance each request is assigned to would be more or less random and load would be distributed relatively evenly. However, when some prefixes are very common, all of those requests are routed to the same instance as traffic ramps up.

A potential solution is to define a setting, a minimum load threshold, which can essentially override the calculated threshold. This would allow requests with common prefixes to be distributed more evenly to other instances, instead of the high instance count making every instance appear to exceed the load threshold so that requests fall back to the default instance.
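As a rough sketch of the proposed setting, the minimum would act as a floor under the calculated threshold (the parameter name `min_load_threshold` here is hypothetical, and this is illustrative Python, not KubeAI's actual Go code):

```python
def chwbl_threshold(total_load: int, num_endpoints: int,
                    load_factor: float, min_load_threshold: float = 0.0) -> float:
    """Per-endpoint load threshold with a hypothetical minimum override.

    Without the floor, hundreds of endpoints in the denominator can make
    the threshold tiny while traffic is still ramping up.
    """
    avg_load = (total_load + 1) / num_endpoints
    return max(avg_load * load_factor, min_load_threshold)

# 100 in-flight requests across 300 endpoints, load factor 1.25:
print(chwbl_threshold(100, 300, 1.25))       # ~0.42: every endpoint looks busy
print(chwbl_threshold(100, 300, 1.25, 2.0))  # floor of 2.0 lets idle endpoints accept
```

With the floor set above 1, an idle endpoint (load 0) can always accept a request even while the group average is tiny.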

Would love to get your feedback on this issue and potential solution!

samos123 requested a review from nstogner on May 7, 2025 at 00:07
samos123 (Contributor) commented May 7, 2025

Thanks for the detailed write-up. No need to change anything yet; I need to think some more and review more thoroughly.

Another approach could be dynamically tweaking the replication number: https://www.kubeai.org/reference/kubernetes-api/#prefixhash

replication (integer): Replication is the number of replicas of each endpoint on the hash ring. Higher values will result in a more even distribution of load but will decrease lookup performance.

The default is 256, but maybe that number should be dynamic based on number of instances.

Curious, did you modify the replication number and play around with that? That would be a valuable datapoint.

mskouba (Contributor, Author) commented May 7, 2025

Thanks for taking a look and thinking this one over. We have tried increasing replication, specifically we tried values of 256, 1024, and 131072. All of these tests showed the same behavior where one instance becomes overloaded.

My understanding of the replication parameter is that it determines how many entries each endpoint has on the hash ring. If that is true, these results seem logical, since the common-prefix requests will always match the same endpoint entry on the hash ring, regardless of whether there are 10 thousand entries or 10 million.
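The point above can be illustrated with a toy consistent-hash ring (this is a minimal sketch, not KubeAI's actual implementation): replication multiplies the endpoint entries on the ring, but a single key still hashes to exactly one point, so one endpoint owns all requests for that prefix at any replication level.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(endpoints, replication):
    # Each endpoint gets `replication` points on the ring.
    return sorted((h(f"{ep}-{i}"), ep) for ep in endpoints for i in range(replication))

def lookup(ring, key: str):
    hashes = [p for p, _ in ring]
    i = bisect.bisect(hashes, h(key)) % len(ring)
    return ring[i][1]

endpoints = [f"pod-{n}" for n in range(300)]
# The same prefix hashes to one point, so exactly one endpoint owns it,
# whatever the replication level:
owners = {r: lookup(build_ring(endpoints, r), "common-prefix") for r in (256, 1024)}
print(owners)
```

Raising replication changes which endpoint owns the hot prefix, but never spreads it across endpoints — consistent with the 256/1024/131072 results above.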

If there are any other settings or info that would be useful for me to provide please let me know.

nstogner (Contributor) commented May 8, 2025

KubeAI produces some metrics that should help diagnose what is happening here: https://github.com/substratusai/kubeai/blob/8007ab256b536f8b15e31dd14bd2c1b9eadb3e3e/internal/metrics/metrics.go#L43

Can you verify the version of KubeAI you are running? (the image tag on the KubeAI Deployment).

In theory your use case should be handled by meanLoadFactor (https://www.kubeai.org/reference/kubernetes-api/#prefixhash). Have you tried using a value like 110%?

samos123 (Contributor) commented May 8, 2025

More details on the scenario from @mskouba

I think the combination of having hundreds of instances and a prefix shared by a large portion of traffic causes issues because of how the threshold used to decide whether an endpoint is overloaded is calculated.

When we have, say, 300 endpoints and are ramping up traffic, take 100 requests as an example with 25% of them sharing the same prefix: 25 requests have the same prefix hash and thus the same default endpoint. When the balancer checks the load on an endpoint with no requests, it calculates the average load across the group as (100 + 1) / 300, roughly 1/3. This gets multiplied by the load factor, which can be 1.10, 1.25, or even 2; let's use 1.25 as an example, giving a threshold of 1/3 * 1.25 = 5/12. This is then compared to the endpoint's load (0) plus 1, which is always greater than the threshold of 5/12 until traffic gets high enough to raise avgLoad and mitigate this. Until that happens, these requests can't find any endpoint that appears suitable and are scheduled on the default endpoint. That works out fine when each request has a different prefix, since they all have different default endpoints, but with the same prefix on all those requests they all share one default endpoint and get routed there.
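The walkthrough above can be sketched in a few lines (an illustrative sketch of the described check, not the actual Go code):

```python
# Numbers from the walkthrough: 300 endpoints, 100 in-flight requests, load factor 1.25.
num_endpoints, total_load, load_factor = 300, 100, 1.25

avg_load = (total_load + 1) / num_endpoints   # ~1/3
threshold = avg_load * load_factor            # ~5/12, i.e. well below 1

idle_load = 0
fits = idle_load + 1 <= threshold  # the check adds 1 to the endpoint's load
print(fits)  # False: even a completely idle endpoint looks overloaded
```

Because the +1 pushes every endpoint's effective load to at least 1 while the threshold stays below 1 during ramp-up, no endpoint ever qualifies and everything falls back to the default endpoint.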

Replication doesn't help, since the common prefix means those requests all land on the same part of the hash ring and thus get the same default endpoint. The mean load factor isn't very effective either, because with so many instances in the average the avgLoad value is so small that the factor would have to be set so high that it would no longer work as intended once the system hits high throughput. That's how I ended up with the imperfect solution of adding another parameter. Another option would be to drop the +1 when checking whether an endpoint's load exceeds the threshold, but I'm not sure that is really an acceptable solution.
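The "+1 removal" idea can be compared directly against the original check in the same ramp-up scenario (again an illustrative sketch, not the Go implementation):

```python
threshold = (100 + 1) / 300 * 1.25  # same ramp-up scenario: threshold < 1

def fits_with_plus_one(load: int) -> bool:
    return load + 1 <= threshold    # original check

def fits_without_plus_one(load: int) -> bool:
    return load <= threshold        # the proposed "+1 removal"

print(fits_with_plus_one(0))     # False: idle endpoint rejected
print(fits_without_plus_one(0))  # True: idle endpoint can take the request
```

Dropping the +1 means an idle endpoint passes any positive threshold, so common-prefix requests can spill over to other instances during ramp-up.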

alpe (Collaborator) commented May 12, 2025

Very nice analysis of the problem @samos123 ! I can confirm your findings.
This is not a trivial problem, so to help with a solution I started this request simulator, which can be used to compare different implementations.

One alternative solution that came to mind is to compare an endpoint's load not with the average but with its next neighbour only, while totalLoad + 1 < len(endpoints). This works very well for this setup, but it would be good to have some more scenarios defined to find an optimal solution for real-world cases.
An example implementation is in the same branch.
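One possible reading of this heuristic can be sketched as follows (a loose sketch only; the real implementation is in the linked branch, and the 1.25 load factor is just an example value):

```python
def acceptable(endpoint_load: int, neighbour_load: int,
               total_load: int, num_endpoints: int,
               load_factor: float = 1.25) -> bool:
    # While at least one endpoint must still be free (totalLoad + 1 < len(endpoints)),
    # compare against the next neighbour on the ring instead of the group average.
    if total_load + 1 < num_endpoints:
        return endpoint_load <= neighbour_load
    # Otherwise fall back to the original average-based threshold.
    avg_load = (total_load + 1) / num_endpoints
    return endpoint_load + 1 <= avg_load * load_factor

print(acceptable(5, 3, 100, 300))  # False: neighbour is less loaded, move on
print(acceptable(0, 2, 100, 300))  # True: no busier than its neighbour
```

The neighbour comparison avoids the tiny group-average threshold entirely while capacity remains, which matches the reported performance gain.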

Please note: while the load distribution is not perfect, there is a significant performance gain. For 15k requests on 100/300 nodes, total run time drops from 3.7s to 35ms and the top endpoint's share of requests drops from 24.95% to 0.91% on my box.

mskouba (Contributor, Author) commented May 14, 2025

Thanks @alpe, the simulator is a great tool for this! As for the various suggested options, I've done some testing with each approach in a live environment with a high number of vLLM endpoints. Overall, removing the +1 from the endpoint load when comparing it to the calculated threshold seems to be the best approach for this use case: it spreads the load well at all request volumes while keeping throughput and the cache hit ratio high.

The additional-parameter approach appears to add more complexity than is needed given how well the previous approach worked. While it could still be feasible, it also requires the parameter to be well tuned for each situation, which could prove challenging in an environment that scales from a handful of instances to very large counts.

When testing the neighbour-load approach, there was definitely a significant improvement, but I still saw some endpoints become overburdened. That makes some sense: this approach smooths the load from one instance to the next, but still theoretically lets the number of requests routed to the original instance grow indefinitely as you linearly add requests and endpoints. This can be seen when running the simulator as well.

Curious to hear your thoughts on the +1 removal approach and whether there are any drawbacks I haven't considered.

alpe (Collaborator) commented May 15, 2025

Thanks @mskouba for testing the different scenarios in a prod environment! Very interesting findings.
At first I could not replicate these results with the simulator, but once I also decreased the in-flight count of the endpoints (to simulate request completion), they made sense.
When requests pile up very fast, removing the "+ 1" gives the same result as the original code. But when a free endpoint is available, the distribution is much better than with the "next neighbour" algorithm.

samos123 (Contributor) commented
+1 to the +1 removal approach. Appreciate you both digging deep into that. I would like to get the +1 approach and the simulator merged.

mskouba (Contributor, Author) commented May 19, 2025

Thanks for the feedback, I've opened #528 with the +1 removal update.

samos123 pushed a commit that referenced this pull request May 23, 2025
Update chwbl load balancing to not add one to the endpoint's load when
determining whether an endpoint can accommodate an additional request.
Per the conversation in #519, this solution helps handle traffic better
in situations where there are many endpoints and many requests with a
common prefix.

Co-authored-by: Michael Kouba <michael.kouba@calypsoai.com>
thehonker pushed a commit to thehonker/kubeai that referenced this pull request Oct 6, 2025