Add Minimum Load Threshold Setting #519
Conversation
Thanks for the detailed write up. No need to change anything yet; I need to think some more and review more thoroughly. Another approach could be dynamically tweaking the replication integer. Replication is the number of replicas of each endpoint on the hash ring. The default is 256, but maybe that number should be dynamic based on the number of instances. Curious, did you modify the replication number and play around with that? That would be a valuable data point.
Thanks for taking a look and thinking this one over. We have tried increasing replication; specifically, we tried values of 256, 1024, and 131072. All of these tests showed the same behavior, where one instance becomes overloaded. My understanding of the replication parameter is that it determines how many entries each endpoint has on the hash ring. If that is true, then these results seem logical, since the common-prefix requests will always match the same endpoint replica on the hash ring regardless of whether there are 10 thousand entries or 10 million. If there are any other settings or info that would be useful for me to provide, please let me know.
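To illustrate why raising replication doesn't change the outcome, here is a toy consistent-hash ring (a simplified sketch, not KubeAI's actual implementation; all names are illustrative): a fixed prefix hash deterministically resolves to one endpoint, no matter how many virtual nodes each endpoint contributes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// pickEndpoint maps a key onto a toy hash ring where each endpoint has
// `replication` virtual nodes. Because the lookup is deterministic, every
// request with the same key resolves to the same endpoint regardless of
// how many virtual nodes exist.
func pickEndpoint(key string, endpoints []string, replication int) string {
	type vnode struct {
		hash  uint32
		owner string
	}
	var ring []vnode
	for _, ep := range endpoints {
		for i := 0; i < replication; i++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%s-%d", ep, i)
			ring = append(ring, vnode{h.Sum32(), ep})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })

	h := fnv.New32a()
	h.Write([]byte(key))
	kh := h.Sum32()
	// First virtual node clockwise from the key's hash (wrapping around).
	idx := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= kh }) % len(ring)
	return ring[idx].owner
}

func main() {
	eps := []string{"ep-0", "ep-1", "ep-2"}
	// 25 requests sharing one prefix all land on a single default endpoint.
	seen := map[string]bool{}
	for i := 0; i < 25; i++ {
		seen[pickEndpoint("common-prefix", eps, 256)] = true
	}
	fmt.Println(len(seen)) // 1
}
```

Increasing `replication` only adds more virtual nodes to the ring; it never changes the fact that one key maps to exactly one default endpoint.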
KubeAI produces some metrics that should help diagnose what is happening here: https://github.com/substratusai/kubeai/blob/8007ab256b536f8b15e31dd14bd2c1b9eadb3e3e/internal/metrics/metrics.go#L43 Can you verify the version of KubeAI you are running (the image tag on the KubeAI Deployment)? In theory your use case should be handled by
More details on the scenario from @mskouba: I think the combination of having hundreds of instances and a prefix shared by a large portion of traffic leads to issues because of how the threshold used to decide whether an endpoint is overloaded is calculated.

Say we have 300 endpoints and are ramping up traffic. With 100 in-flight requests, 25% of which have the same prefix, we have 25 requests with the same prefix hash and thus the same default endpoint. When the balancer checks the load on an endpoint with no requests, it calculates the average load across the group as (100 + 1) / 300, roughly 1/3. This gets multiplied by the load factor (which can be 1.10, 1.25, or even 2; let's use 1.25 as an example), so the threshold is 1/3 * 1.25, or 5/12. This is then compared against the endpoint's load (0) plus 1, which is always greater than the threshold of 5/12 until traffic is high enough to raise avgLoad and mitigate this. Until that happens, these requests can't find any endpoint that appears suitable and are scheduled on the default endpoint.

This works out fine when each request has a different prefix, since they all have different default endpoints. But when the prefix is the same on all of those requests, they all share one default endpoint and get routed there. Replication doesn't help, since the common prefix means they all end up on the same part of the hash ring and thus get the same default endpoint. The load factor isn't very effective either: avgLoad is so small, being averaged across a high number of instances, that the load factor would have to be set so high that it wouldn't work as intended once the system hits high throughput.

That's how I ended up with the imperfect solution of another parameter. Another solution would be to get rid of the +1 when checking whether an endpoint's load exceeds the threshold, but I'm not sure that is really an acceptable solution.
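The arithmetic above can be sketched as a small bounded-loads check (a simplified sketch mirroring the worked example; the function and parameter names are illustrative, not KubeAI's actual identifiers):

```go
package main

import "fmt"

// canAccept reports whether an endpoint with the given in-flight load can
// take one more request under a CHWBL-style bounded-loads check, as
// described in the comment above.
func canAccept(endpointLoad, totalLoad, numEndpoints int, loadFactor float64) bool {
	avgLoad := float64(totalLoad+1) / float64(numEndpoints) // incoming request counted
	threshold := avgLoad * loadFactor
	return float64(endpointLoad)+1 <= threshold
}

func main() {
	// 100 in-flight requests across 300 endpoints, load factor 1.25:
	// threshold = (101/300) * 1.25 ≈ 0.42, so even an idle endpoint
	// fails the check, because 0 + 1 > 0.42.
	fmt.Println(canAccept(0, 100, 300, 1.25)) // false

	// Once traffic is high enough, the average rises and idle endpoints pass.
	fmt.Println(canAccept(0, 1000, 300, 1.25)) // true
}
```

The `+1` on the endpoint side is what makes every endpoint look overloaded whenever the threshold dips below 1, which is guaranteed at low traffic with a large endpoint count.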
Very nice analysis of the problem @samos123! I can confirm your findings. One alternative solution that came to my mind is to compare the load of an endpoint not with the avg but with the next neighbour, only when the

Please note: while the load distribution is not perfect, there is a significant performance gain. For 15k requests on 100/300 nodes I get from
Thanks @alpe, the simulator is a great tool for this! As for the various suggested options, I've done some testing with each approach in a live environment with a high number of vLLM endpoints.

Overall, removing the +1 added to the endpoint load when comparing it against the calculated threshold seems to be the best approach for this use case: it spreads the load well at all request volumes while keeping throughput and cache hit ratio high.

The additional-parameter approach appears to add more complexity than is needed, given that the previous approach worked very well. While it could still be feasible, it also requires the parameter to be well tuned for each situation, which could prove challenging in an environment that scales from a low instance count to very high scale.

When testing the neighbour-load approach, there was definitely a significant improvement, but I still saw some endpoints become overburdened. This makes some sense: the approach smooths load from one instance to the next, but still theoretically allows the number of requests routed to the original instance to increase indefinitely as you linearly add requests and endpoints. This can be seen when running the simulator as well.

Curious to hear your thoughts on the +1 removal approach and whether there are any drawbacks I haven't considered.
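To make the comparison concrete, the +1 removal amounts to a one-line change in the overload check (a simplified sketch, not the actual chwbl code; function names are illustrative):

```go
package main

import "fmt"

// overThresholdOld counts the incoming request against the endpoint's
// load, so an idle endpoint is rejected whenever the threshold is below 1.
func overThresholdOld(endpointLoad int, threshold float64) bool {
	return float64(endpointLoad)+1 > threshold
}

// overThresholdNew compares the endpoint's current load directly, so an
// idle endpoint (load 0) stays eligible even when a large endpoint count
// drives the threshold down to a small fraction.
func overThresholdNew(endpointLoad int, threshold float64) bool {
	return float64(endpointLoad) > threshold
}

func main() {
	threshold := 5.0 / 12.0 // the worked example above: (101/300) * 1.25
	fmt.Println(overThresholdOld(0, threshold)) // true: idle endpoint rejected
	fmt.Println(overThresholdNew(0, threshold)) // false: idle endpoint accepted
}
```

With the +1 removed, an endpoint with zero in-flight requests can never appear overloaded, which is exactly the behaviour the many-endpoints, common-prefix scenario needs.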
Thanks @mskouba for testing the different scenarios in a prod environment! Very interesting findings.
+1 to the +1 removal approach. Appreciate you both digging deep into that. I would like to get the +1 approach and the simulator merged.
Thanks for the feedback! I've opened #528 with the +1 removal update.
Update chwbl load balancing to not add one to the endpoint's load when determining whether an endpoint can accommodate an additional request. Per the conversation in #519, this solution helps handle traffic better in situations where there are many endpoints and many requests with a common prefix. Co-authored-by: Michael Kouba <michael.kouba@calypsoai.com>
We have been using KubeAI to load balance across hundreds of vLLM instances using the PrefixHash load balancing strategy. We've encountered an edge case that occurs when two conditions are met: 1) you have hundreds of instances running, and 2) a large percentage of your traffic (say 25%) has the same prefix.
This leads to issues where the calculated load threshold is a very small number as traffic ramps up, due to the large number of instances in the denominator. Because of this low load threshold, no instances appear ready for the next request. This would be acceptable if each request had a unique prefix, as the default instance each request is assigned to would be more or less random and load would be distributed relatively evenly. However, when some prefixes are very common, those requests are all routed to the same instance as traffic ramps up.
A potential solution is to define a setting, minimum load threshold, which essentially overrides the calculated threshold. This would allow requests with common prefixes to be distributed more evenly across other instances, avoiding situations where the high number of instances makes every instance appear to exceed the load threshold and forces requests onto the default instance.
Would love to get your feedback on this issue and potential solution!