Fix broadcaster loop#4978

Open
bjee19 wants to merge 17 commits into main from fix/broadcaster-deadlock

Conversation

@bjee19
Contributor

@bjee19 bjee19 commented Mar 23, 2026

Proposed changes

Problem: In the broadcaster, an unsubscribe racing with a publish message could occasionally deadlock the loop, resulting in no nginx configurations being sent to the agent.

Solution: Add a separate goroutine to the broadcaster that handles publishing messages, while the original goroutine handles subscriptions and stopping the publisher loop. Now, if a message on publishCh is being handled for a listener that has already unsubscribed, any blocking sends exit via context cancellation and the system can progress. Also increases the conformance test max timeout to 15 minutes because of the TLS tests.

Testing: Unit tests were added. Ran functional tests through pipeline many times and the problem did not occur.

Closes #4842

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Fix issue with communication between NGF and NGINX Pods which could lead to deadlocking and stale NGINX configuration.

@github-actions github-actions bot added the bug Something isn't working label Mar 23, 2026
@codecov

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 90.12346% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.13%. Comparing base (d5f9a6a) to head (aec518d).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...rnal/controller/nginx/agent/broadcast/broadcast.go 91.89% 6 Missing ⚠️
internal/controller/nginx/agent/command.go 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4978      +/-   ##
==========================================
+ Coverage   86.08%   86.13%   +0.04%     
==========================================
  Files         144      144              
  Lines       17540    17658     +118     
  Branches       35       35              
==========================================
+ Hits        15100    15210     +110     
- Misses       2171     2183      +12     
+ Partials      269      265       -4     


@github-actions github-actions bot added the tests Pull requests that update tests label Mar 24, 2026
@bjee19 bjee19 force-pushed the fix/broadcaster-deadlock branch from 4b80868 to f0af4a7 Compare March 25, 2026 17:52
@bjee19
Contributor Author

bjee19 commented Mar 25, 2026

I considered a solution using timers in the blocking select statements, but couldn't convince myself it would be a good one. If we relied on a 30-second timer to unblock the broadcaster, users could end up with their nginx conf delayed by that same amount of time, meaning dropped traffic if backend pod IPs changed.

The current solution of separate goroutines, with the subscriber able to unblock the publisher, solves the deadlocking issue without timers. However, the new complexity does introduce possible edge cases or race conditions; hopefully the checks put in place, the unit tests, and the functional tests cover them.

@bjee19 bjee19 marked this pull request as ready for review March 25, 2026 21:25
@bjee19 bjee19 requested a review from a team as a code owner March 25, 2026 21:25
@sjberman sjberman requested a review from Copilot March 25, 2026 21:41
@sjberman
Collaborator

sjberman commented Mar 25, 2026

Regarding the release note, users aren't going to have any idea what the "broadcaster" is since it's an internal detail. Let's be more generic to just mention the deadlock when sending config.


Copilot AI left a comment


Pull request overview

This PR aims to prevent an occasional deadlock in the NGINX agent deployment broadcaster loop (triggered by unsubscribe + publish timing) so configuration updates continue to reach agents reliably. It also updates test coverage around shutdown/unsubscribe behavior and increases conformance test timeouts to accommodate slower TLS cases.

Changes:

  • Split broadcaster responsibilities into separate subscriber-management and publisher goroutines, introducing contexts/locking to avoid deadlock scenarios.
  • Add/extend unit tests to cover unsubscribe and shutdown interactions while publishing.
  • Increase conformance test runner go test timeout to 15 minutes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File Description
internal/controller/nginx/agent/broadcast/broadcast.go Refactors broadcaster loop into separate subscriber/publisher paths with cancellation and locking.
internal/controller/nginx/agent/broadcast/broadcast_test.go Adds tests for publish/unsubscribe/shutdown interactions and send completion behavior.
tests/Makefile Increases conformance go test timeout to 15m.
Comments suppressed due to low confidence (1)

internal/controller/nginx/agent/broadcast/broadcast.go:108

  • Subscribe() unconditionally sends on b.subCh. If the broadcaster has already shut down (stopCh closed / ctx canceled) and subscriber() has exited, this send will block forever. Consider making subCh buffered and/or using a select that returns early on b.broadcasterCtx.Done() (similarly for CancelSubscription sending on unsubCh).
func (b *DeploymentBroadcaster) Subscribe() SubscriberChannels {
	listenCh := make(chan NginxAgentMessage)
	responseCh := make(chan struct{})
	id := string(uuid.NewUUID())
	// Create listener context as child of broadcaster context
	//nolint:gosec // G118: cancel is called when unsubscribing in subscriber()
	listenerCtx, cancel := context.WithCancel(b.broadcasterCtx)

	subscriberChans := SubscriberChannels{
		ID:         id,
		ListenCh:   listenCh,
		ResponseCh: responseCh,
	}
	storedChans := storedChannels{
		id:          id,
		listenCh:    listenCh,
		responseCh:  responseCh,
		listenerCtx: listenerCtx,
		cancel:      cancel,
	}

	b.subCh <- storedChans
	return subscriberChans
}


@bjee19
Contributor Author

bjee19 commented Mar 26, 2026

I think I missed a case in the original bug. The original bug was a race between an unsubscribe message and a publishCh message (a regular nginx update message): the publishCh case would block forever waiting for a response from a listener that no longer exists because it already unsubscribed.

However, there is also the case of a race condition of a subscribe and publishCh message. The DeploymentBroadcaster's Send function is this:

// Send the message to all listeners. Wait for all listeners to respond.
// Returns true if there were listeners that received the message.
func (b *DeploymentBroadcaster) Send(message NginxAgentMessage) bool {
	b.publishCh <- message
	<-b.doneCh

	return len(b.listeners) > 0
}

However, if a subscriber subscribes (nginx agent connects or re-connects to NGF) at the same time a Send is in transit, the Send to publishCh can go first: NGF sends the message to all the previous subscribers (not the new one), finishes successfully, and returns len(b.listeners) > 0, which is true since it did send to some listeners.

Then the subscribe comes in and adds the agent to the listeners, but nothing triggers sending it the message. So this could lead to stale nginx configuration with no real clues, since NGF thinks it fully updated all its listeners.

Since the agent doesn't pull config (I don't think it does) and instead waits for NGF to push, in this scenario the nginx conf won't be updated until there is an event batch plus an nginx conf change.

My current solution does not solve this issue, so I'm working on one. This also seems more likely to be tied to issues like #4697.

@sjberman
Collaborator

@bjee19 The case you're describing should be solved by setInitialConfig when a new subscriber comes in. This ensures that when a subscribe call first happens, it gets the latest config even if one isn't in the queue for it.

@bjee19
Contributor Author

bjee19 commented Mar 26, 2026

@sjberman That function makes sense to me, but I still think there is a small edge-case window between when setInitialConfig is called in commandService.Subscribe() and when broadcaster.Subscribe() is called. In this part https://github.com/nginx/nginx-gateway-fabric/blob/main/internal/controller/nginx/agent/command.go#L160-L162, the deployment file lock is released after setInitialConfig, so the handler could call updateNginxConf, generate the files, and call UpdateConfig (which sends a message to the broadcaster) all before the broadcaster reads the message on the subscribe channel.

Unless I'm missing something, the edge case persists: the new subscriber has an nginx conf, but it's not guaranteed to be the latest one, and it won't be until a new conf is generated and pushed to it. I don't think this is a likely scenario, but it is possible, especially when replicas of nginx per gateway are constantly spinning up and down (i.e. more subscribe calls happening).

Any thoughts?

@sjberman
Collaborator

@bjee19 Yeah that's a good point. It's a really tight race, but it technically could exist between those two points.



// either the new subscriber gets the lock, applies the latest config before subscribing,
// then applies any concurrent updates that come in after subscribing, or the subscriber gets the lock after the
// concurrent update, and applies that latest config in setInitialConfig() before subscribing,
// ensuring it doesn't miss any updates.
Collaborator


What about the approach of subscribing beforehand?

Also, should we actually be using read locks instead of a write lock? setInitialConfig isn't changing the files, right?

Contributor Author


I asked Copilot about the approach of subscribing beforehand with a drain to filter out stale configs, and it said that approach was more complex and could open us up to more race conditions and ordering issues.

Its proposed solution of moving the lock seemed simpler, and I couldn't really find many downsides.

Yeah, it makes sense to use a read lock here.
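The read-lock point can be sketched with sync.RWMutex (the `deployment` type and methods here are hypothetical, assuming setInitialConfig only reads the stored files as suggested above): concurrent subscribers share a read lock while config generation still takes the exclusive write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// deployment is an illustrative stand-in for the real deployment store.
type deployment struct {
	mu    sync.RWMutex
	files []string
}

// setInitialConfig only reads the stored files, so it can take a read
// lock: many subscribers may snapshot the config concurrently.
func (d *deployment) setInitialConfig() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	out := make([]string, len(d.files)) // copy so callers can't mutate state
	copy(out, d.files)
	return out
}

// updateFiles mutates the stored files, so it needs the write lock,
// which also blocks out readers mid-update.
func (d *deployment) updateFiles(files []string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.files = files
}

func main() {
	d := &deployment{}
	d.updateFiles([]string{"nginx.conf"})
	fmt.Println(d.setInitialConfig())
}
```

The trade-off is the usual RWMutex one: readers never block each other, but a writer still waits for in-flight readers, which preserves the "subscriber sees either the old or the new config, never a torn one" guarantee.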


@bjee19
Contributor Author

bjee19 commented Mar 27, 2026

I think this is good to review; the Copilot reviews have refined it enough.

@bjee19
Contributor Author

bjee19 commented Mar 27, 2026

Running the pipeline just a few times with the expanded functional tests to ensure changes still work after a few iterations.


Labels

bug Something isn't working release-notes tests Pull requests that update tests

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

Subscriber and broadcaster select statement logic results in NGF occasionally deadlocking

3 participants