Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4978 +/- ##
==========================================
+ Coverage 86.08% 86.13% +0.04%
==========================================
Files 144 144
Lines 17540 17658 +118
Branches 35 35
==========================================
+ Hits 15100 15210 +110
- Misses 2171 2183 +12
+ Partials 269 265 -4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
4b80868 to
f0af4a7
Compare
|
I considered a solution using timers in the blocking select statements, but couldn't convince myself that it would be a good solution. If we had a solution which relied on a 30 second timer to unblock the broadcaster, I think that would mean users could get into a situation where their nginx conf was late for that same amount of time, meaning dropped traffic if backend pod IPs changed. The current solution of separate goroutines with the subscriber being able to unblock the publisher I think solves the deadlocking issue without having to use timers. However, I think the new complexity does introduce possible edge cases or race conditions, but hopefully the checks that are put in place and unit tests cover them + functional tests. |
|
Regarding the release note, users aren't going to have any idea what the "broadcaster" is since it's an internal detail. Let's be more generic to just mention the deadlock when sending config. |
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent an occasional deadlock in the NGINX agent deployment broadcaster loop (triggered by unsubscribe + publish timing) so configuration updates continue to reach agents reliably. It also updates test coverage around shutdown/unsubscribe behavior and increases conformance test timeouts to accommodate slower TLS cases.
Changes:
- Split broadcaster responsibilities into separate subscriber-management and publisher goroutines, introducing contexts/locking to avoid deadlock scenarios.
- Add/extend unit tests to cover unsubscribe and shutdown interactions while publishing.
- Increase conformance test runner
go testtimeout to 15 minutes.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
internal/controller/nginx/agent/broadcast/broadcast.go |
Refactors broadcaster loop into separate subscriber/publisher paths with cancellation and locking. |
internal/controller/nginx/agent/broadcast/broadcast_test.go |
Adds tests for publish/unsubscribe/shutdown interactions and send completion behavior. |
tests/Makefile |
Increases conformance go test timeout to 15m. |
Comments suppressed due to low confidence (1)
internal/controller/nginx/agent/broadcast/broadcast.go:108
Subscribe()unconditionally sends onb.subCh. If the broadcaster has already shut down (stopCh closed / ctx canceled) andsubscriber()has exited, this send will block forever. Consider makingsubChbuffered and/or using aselectthat returns early onb.broadcasterCtx.Done()(similarly forCancelSubscriptionsending onunsubCh).
func (b *DeploymentBroadcaster) Subscribe() SubscriberChannels {
listenCh := make(chan NginxAgentMessage)
responseCh := make(chan struct{})
id := string(uuid.NewUUID())
// Create listener context as child of broadcaster context
//nolint:gosec // G118: cancel is called when unsubscribing in subscriber()
listenerCtx, cancel := context.WithCancel(b.broadcasterCtx)
subscriberChans := SubscriberChannels{
ID: id,
ListenCh: listenCh,
ResponseCh: responseCh,
}
storedChans := storedChannels{
id: id,
listenCh: listenCh,
responseCh: responseCh,
listenerCtx: listenerCtx,
cancel: cancel,
}
b.subCh <- storedChans
return subscriberChans
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I think I missed a case in the original bug. The original bug dealt with an issue of an unsubscribe message having a race with a publishCh message (or just regular nginx update message). There would be a deadlock where the publishCh message case just blocks forever due to waiting for a non-existing listener to respond since it already unsubscribed. However, there is also the case of a race condition of a subscribe and publishCh message. The DeploymentBroadcaster's Send function is this: However, if we are in a scenario where a Subscriber subscribes (nginx agent connects to NGF or re-connects) at the same time a Send message is in transit, there is a possibility where the Send message to the publishCh goes first, so NGF sends the message to all the previous subscribers (not the new one), finishes successfully, then returns len(b.listeners) > 0 which returns true since it did send to some of the listeners. Then the subscribe comes in, adds the agent to the listeners, but nothing triggers to send it the message. So I think this could lead to stale nginx configuration with no real tellers / clues since NGF thinks it fully updated all its listeners. Since Agent doesn't pull config (I don't think it does), it instead waits for NGF to push, this means that in this scenario there won't really be a change / update to the nginx conf until there is an event batch + nginx conf change. My current solution does not solve this issue, so I'm working on a solution. However, this seems like a more likely tie to issues like #4697. |
|
@bjee19 The case you're describing should be solved by |
|
@sjberman That function seems to make sense to me but I still think there is a small edge case potential between when Unless I'm missing something, I still think the edge case persists, now the new subscriber has nginx conf, but its not guaranteed to be the latest conf. And it won't be until a new conf is generated and pushed to the subscriber. If the logic is valid, I don't think this is a likely scenario, but I still think it is a possibility, especially when dealing in scenarios where there are replicas of nginx per gateway constantly spinning up and down (aka more subscribe calls happening). Any thoughts? |
|
@bjee19 Yeah that's a good point. It's a really tight race, but it technically could exist between those two points. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
internal/controller/nginx/agent/broadcast/broadcast.go:108
Subscribeunconditionally does a blocking send onb.subCh. If the broadcaster is already shutting down (nosubscriber()goroutine receiving), this can block forever. Consider selecting onb.broadcasterCtx.Done()when sending tosubCh(and returning an error/empty subscription) to avoid shutdown-time deadlocks.
func (b *DeploymentBroadcaster) Subscribe() SubscriberChannels {
listenCh := make(chan NginxAgentMessage)
responseCh := make(chan struct{})
id := string(uuid.NewUUID())
// Create listener context as child of broadcaster context
//nolint:gosec // G118: cancel is called when unsubscribing in subscriber()
listenerCtx, cancel := context.WithCancel(b.broadcasterCtx)
subscriberChans := SubscriberChannels{
ID: id,
ListenCh: listenCh,
ResponseCh: responseCh,
}
storedChans := storedChannels{
id: id,
listenCh: listenCh,
responseCh: responseCh,
listenerCtx: listenerCtx,
cancel: cancel,
}
b.subCh <- storedChans
return subscriberChans
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // either the new subscriber gets the lock, applies the latest config before subscribing, | ||
| // then applies any concurrent updates that come in after subscribing, or the subscriber gets the lock after the | ||
| // concurrent update, and applies that latest config in setInitialConfig() before subscribing, | ||
| // ensuring it doesn't miss any updates. |
There was a problem hiding this comment.
What about the approach of subscribing beforehand?
Also, should we actually be using read locks instead of a write lock? setInitialConfig isn't changing the files, right?
There was a problem hiding this comment.
I asked copilot about the approach of subscribing beforehand and with a drain to filter out stale configs, and it basically said that approach was more complex and could open us up to more race conditions and ordering issues.
Its proposed solution of moving the lock did seem simpler and I couldn't find really too many downsides.
Yea it makes sense to use a read lock here.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
I think this is good to review, copilot reviews has refined it enough |
|
Running the pipeline just a few times with the expanded functional tests to ensure changes still work after a few iterations. |
Proposed changes
Problem: In the broadcaster, a situation would occur occasionally involving an unsubscribe and publish message which could result in a deadlock happening in the loop, resulting in no nginx configurations being sent to agent.
Solution: Add a separate goroutine to the broadcaster loop which handles the publishing of messages while keeping the original to handle subscriptions and stopping the publisher loop. Now, if a message is being handled in the publishCh, while that listener has already unsubscribed, any blocking statements should exit via context cancel and the system should be able to progress. Also increases conformance test max timeout to 15 minutes because of tls tests.
Testing: Unit tests were added. Ran functional tests through pipeline many times and the problem did not occur.
Closes #4842
Checklist
Before creating a PR, run through this checklist and mark each as complete.
Release notes
If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.