Add support for multi-process port sharing with CIBIR.#5798
Add support for multi-process port sharing with CIBIR.#5798ProjectsByJackHe wants to merge 20 commits intomainfrom
Conversation
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (66.66%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #5798 +/- ##
==========================================
- Coverage 85.99% 85.07% -0.93%
==========================================
Files 60 60
Lines 18729 18732 +3
==========================================
- Hits 16106 15936 -170
- Misses 2623 2796 +173 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
guhetier
left a comment
There was a problem hiding this comment.
In general, I am concerned that we keep adding incremental exceptions to the port reservation logic to solve the next issue, but without having a clear design goal.
This code will be hard to maintain and confusing for apps as there is no simple rule about what can be done with ports.
I think we need to take the time soon to come up with a clear story about when can port be shared and when they can't + document it + check we implement it.
I agree. I added a CIBIR.md. |
please see the updated XDP.md and CIBIR.md |
|
Maybe we should actually go through the rest of the standardization process for https://datatracker.ietf.org/doc/draft-banks-quic-cibir/ since you're still using it.... |
Do you have more context about it? I am interested in understanding whether we should work on standardizing CIBIR or if another solution was adopted by other deployments that we should align with. |
I know of no other solution/proposal that achieves what CIBIR does. We can chat more on Discord about the process. |
src/platform/datapath_winuser.c
Outdated
| "[data][%p] CIBIR detected, %s", | ||
| Socket, | ||
| "ignoring port collision by assuming some \ | ||
| other MsQuic CIBIR process has reserved the OS port. \ |
There was a problem hiding this comment.
It's too bad we don't have a way to test if MsQuic with CIBIR is the other endpoint - unless we do?
Could we create a very simple (cooperative, not adversarial) protocol to send a UDP packet to the port and check if the response seems MsQuic-CIBIR-like? I suppose this would introduce a path where the other process has exited, and then we have to retry binding...potentially forever.
There was a problem hiding this comment.
I think your suggestion is: when a port collision occurs and CIBIR is enabled, instead of assuming the
other process that previously reserved the port is also a CIBIR process, we should first do a quick test to
verify that?
If that is indeed the suggestion I would like to offer a little pushback because that seems a bit overkill.
I think it's reasonable to expect apps that enable CIBIR to know what they are doing.
Moving this discussion offline...
There was a problem hiding this comment.
Reactivating: my request is more generally to consider whether we can apply any heuristic beyond "it's probably CIBIR" before potentially stealing an OS port.
A concrete suggestion is to follow the existing port reservation model here, but add support for persistent reservations, which can span processes, rather than ephemeral reservations, which are constrained to one process.
There was a problem hiding this comment.
@mtfriesen Maybe you can give some more details about how you see things?
Do you mean enforcing a permanent port reservation to exist? Or trying to use a potentially existing one and ignore the failure if it doesn't exist? Something else?
I am not very familiar with permanent port reservations (from the doc, I am somewhat confused about what the token is meant to help with).
There was a problem hiding this comment.
To be super clear, while I provided one concrete suggestion, my request is to demonstrate clear reasoning for whatever we choose to do:
- Describe the problem (potential port stealing) and reason about its likelihood and impact. Random, hard-to-reproduce failures on 2% of prod machines would be frustrating, as would tests or dev environments randomly failing.
- Come up with some possible solutions, including existing APIs that might help, or anything else we can implement with reasonable upfront and maintenance cost.
- Evaluate whether the costs/risks of any of the solutions in (2) would be enough to offset eliminating the expected reliability issues in (1).
It's not obvious to me, after quite a few years in networking at Microsoft, that customers/deployments can be trusted to "know what they're doing" and some rather popular concepts (containers) exist precisely because it's hard to know what you're doing at scale.
| _In_ uint8_t Length | ||
| ) | ||
| { | ||
| uint64_t Value = 0; |
There was a problem hiding this comment.
please assert that Length <= 8
| // In the case of a port collision while cibir is enabled, | ||
| // MsQuic assumes another MsQuic cibir process has done the | ||
| // job of reserving the OS ports via socket creation/bind. | ||
| // At least that process can fall back to using the OS stack if |
There was a problem hiding this comment.
That comment concerns me: if CIBIR is enabled, we need XDP to honor it. Falling back to OS sockets seems problematic to me, we don't honor an option the app set without reporting it.
There was a problem hiding this comment.
... except if you consider that CIBIR can also be used for dispatching between different host, with a load balancer dispatching instead of XDP. But that should not be a transparent fallback.
src/platform/datapath_winuser.c
Outdated
There was a problem hiding this comment.
nit: Consider not logging an error when we are actually fine because CIBIR is in use. It will avoid confusion when looking at logs, "fake errors" are misleading.
There was a problem hiding this comment.
Agreed, this falls into my general concern that we don't have clear and robust error detection and error handling here. Our behavior should be clearly defined, with the ability to fill out a matrix of possible actions by MsQuic and non-MsQuic (and CIBIR and non-CIBIR) processes on the system, and the matrix should have reasonable observabvle behavior in each entry.
|
In my opinion, this entire code path is becoming a maintenance problem (independently from this specific PR). Reviewing changes in a function that long and with multiple level of loops, tests and jump is not reasonable. We need to refactor it, with intentional design. We cannot expect a user to understand all the special cases and subtly different behavior that have emerged from this code. |
This comment has been made multiple times, and while I agree there is probably room for improvement, as we've discussed, port pools rich with features tend to become intrinsically complicated when properly implemented. |
|
Complex and complicated are different. Portpool are complex, but what we expose to our users or the way we implement it doesn't have to be complicated and confusing. A good implementation can abstract the complexity and present a clear behavior, this is not what we are doing now. |
Description
Fixes #5795
The XDP datapath can be configured to intercept packets based on QUIC Connection ID instead of local port.
This behavior existed in MsQuic but was not heavily exercised until recently.
One issue was that MsQuic always attempted to reserve UDP / TCP sockets for each application server process.
But for multiple server processes that may want to share a single port, we would run into port collision errors.
This PR adds support for CIBIR across multiple processes on the same port and document the behavior
Testing
A new DataPathTest was added.
Documentation
Settings.md