Fix broadcaster loop#4978

Open
bjee19 wants to merge 17 commits into main from fix/broadcaster-deadlock

Conversation

@bjee19
Contributor

@bjee19 bjee19 commented Mar 23, 2026

Proposed changes

Problem: In the broadcaster, an unsubscribe racing with a publish message could occasionally deadlock the loop, resulting in no nginx configurations being sent to the agent.

Solution: Add a separate goroutine to the broadcaster that handles publishing messages, while the original goroutine handles subscriptions and stopping the publisher loop. Now, if a message on publishCh is being handled for a listener that has already unsubscribed, any blocking sends exit via context cancellation and the system can progress. Also increases the conformance test max timeout to 15 minutes because of the TLS tests.

Testing: Unit tests were added. Ran functional tests through pipeline many times and the problem did not occur.

Closes #4842

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Fix issue with communication between NGF and NGINX Pods which could lead to deadlocking and stale NGINX configuration.

@github-actions github-actions bot added the bug Something isn't working label Mar 23, 2026
@codecov

codecov bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 90.12346% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.13%. Comparing base (d5f9a6a) to head (aec518d).
⚠️ Report is 13 commits behind head on main.

Files with missing lines Patch % Lines
...rnal/controller/nginx/agent/broadcast/broadcast.go 91.89% 6 Missing ⚠️
internal/controller/nginx/agent/command.go 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4978      +/-   ##
==========================================
+ Coverage   86.08%   86.13%   +0.04%     
==========================================
  Files         144      144              
  Lines       17540    17658     +118     
  Branches       35       35              
==========================================
+ Hits        15100    15210     +110     
- Misses       2171     2183      +12     
+ Partials      269      265       -4     


@github-actions github-actions bot added the tests Pull requests that update tests label Mar 24, 2026
@bjee19 bjee19 force-pushed the fix/broadcaster-deadlock branch from 4b80868 to f0af4a7 Compare March 25, 2026 17:52
@bjee19
Contributor Author

bjee19 commented Mar 25, 2026

I considered a solution using timers in the blocking select statements, but couldn't convince myself it would be a good one. If we relied on a 30-second timer to unblock the broadcaster, users could end up with their nginx conf delayed by that same amount of time, meaning dropped traffic if backend pod IPs changed.

The current solution of separate goroutines, with the subscriber able to unblock the publisher, solves the deadlocking issue without timers. However, the new complexity does introduce possible edge cases or race conditions; hopefully the checks put in place, the unit tests, and the functional tests cover them.

@bjee19 bjee19 marked this pull request as ready for review March 25, 2026 21:25
@bjee19 bjee19 requested a review from a team as a code owner March 25, 2026 21:25
@sjberman sjberman requested a review from Copilot March 25, 2026 21:41
@sjberman
Collaborator

sjberman commented Mar 25, 2026

Regarding the release note, users aren't going to have any idea what the "broadcaster" is since it's an internal detail. Let's be more generic to just mention the deadlock when sending config.


Copilot AI left a comment


Pull request overview

This PR aims to prevent an occasional deadlock in the NGINX agent deployment broadcaster loop (triggered by unsubscribe + publish timing) so configuration updates continue to reach agents reliably. It also updates test coverage around shutdown/unsubscribe behavior and increases conformance test timeouts to accommodate slower TLS cases.

Changes:

  • Split broadcaster responsibilities into separate subscriber-management and publisher goroutines, introducing contexts/locking to avoid deadlock scenarios.
  • Add/extend unit tests to cover unsubscribe and shutdown interactions while publishing.
  • Increase conformance test runner go test timeout to 15 minutes.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 11 comments.

File Description
internal/controller/nginx/agent/broadcast/broadcast.go Refactors broadcaster loop into separate subscriber/publisher paths with cancellation and locking.
internal/controller/nginx/agent/broadcast/broadcast_test.go Adds tests for publish/unsubscribe/shutdown interactions and send completion behavior.
tests/Makefile Increases conformance go test timeout to 15m.
Comments suppressed due to low confidence (1)

internal/controller/nginx/agent/broadcast/broadcast.go:108

  • Subscribe() unconditionally sends on b.subCh. If the broadcaster has already shut down (stopCh closed / ctx canceled) and subscriber() has exited, this send will block forever. Consider making subCh buffered and/or using a select that returns early on b.broadcasterCtx.Done() (similarly for CancelSubscription sending on unsubCh).
func (b *DeploymentBroadcaster) Subscribe() SubscriberChannels {
	listenCh := make(chan NginxAgentMessage)
	responseCh := make(chan struct{})
	id := string(uuid.NewUUID())
	// Create listener context as child of broadcaster context
	//nolint:gosec // G118: cancel is called when unsubscribing in subscriber()
	listenerCtx, cancel := context.WithCancel(b.broadcasterCtx)

	subscriberChans := SubscriberChannels{
		ID:         id,
		ListenCh:   listenCh,
		ResponseCh: responseCh,
	}
	storedChans := storedChannels{
		id:          id,
		listenCh:    listenCh,
		responseCh:  responseCh,
		listenerCtx: listenerCtx,
		cancel:      cancel,
	}

	b.subCh <- storedChans
	return subscriberChans
}


@bjee19
Contributor Author

bjee19 commented Mar 26, 2026

I think I missed a case in the original bug. The original bug was a race between an unsubscribe message and a publishCh message (a regular nginx update message): the publishCh case would block forever waiting for a response from a listener that no longer exists because it already unsubscribed.

However, there is also the case of a race condition of a subscribe and publishCh message. The DeploymentBroadcaster's Send function is this:

// Send the message to all listeners. Wait for all listeners to respond.
// Returns true if there were listeners that received the message.
func (b *DeploymentBroadcaster) Send(message NginxAgentMessage) bool {
	b.publishCh <- message
	<-b.doneCh

	return len(b.listeners) > 0
}

However, if a subscriber subscribes (nginx agent connects or re-connects to NGF) at the same time a Send is in transit, the Send to publishCh can go first: NGF sends the message to all the previous subscribers (not the new one), finishes successfully, and returns len(b.listeners) > 0, which is true since it did send to some listeners.

Then the subscribe comes in and adds the agent to the listeners, but nothing triggers sending it the message. So this could lead to stale nginx configuration with no real clues, since NGF thinks it fully updated all its listeners.

Since the agent doesn't pull config (I don't think it does) and instead waits for NGF to push, in this scenario the nginx conf won't be updated until there is an event batch plus an nginx conf change.

My current solution does not solve this issue, so I'm working on one. This also seems more likely to be tied to issues like #4697.

@sjberman
Collaborator

@bjee19 The case you're describing should be solved by setInitialConfig when a new subscriber comes in. This ensures that when a subscribe call first happens, it gets the latest config even if one isn't in the queue for it.

@bjee19
Contributor Author

bjee19 commented Mar 26, 2026

@sjberman That function makes sense to me, but I still think there is a small edge-case window between when setInitialConfig is called in commandService.Subscribe() and when broadcaster.Subscribe() is called. In this part https://github.com/nginx/nginx-gateway-fabric/blob/main/internal/controller/nginx/agent/command.go#L160-L162, the deployment file lock is released after setInitialConfig, so the handler could call updateNginxConf, generate the files, and call UpdateConfig (which sends a message to the broadcaster) all before the broadcaster reads the message on the subscribe channel.

Unless I'm missing something, the edge case persists: the new subscriber has an nginx conf, but it's not guaranteed to be the latest one, and it won't be until a new conf is generated and pushed to it. I don't think this is a likely scenario, but it is possible, especially when replicas of nginx per gateway are constantly spinning up and down (i.e. more subscribe calls happening).

Any thoughts?

@sjberman
Collaborator

@bjee19 Yeah that's a good point. It's a really tight race, but it technically could exist between those two points.



// either the new subscriber gets the lock, applies the latest config before subscribing,
// then applies any concurrent updates that come in after subscribing, or the subscriber gets the lock after the
// concurrent update, and applies that latest config in setInitialConfig() before subscribing,
// ensuring it doesn't miss any updates.
Collaborator


What about the approach of subscribing beforehand?

Also, should we actually be using read locks instead of a write lock? setInitialConfig isn't changing the files, right?

Contributor Author


I asked Copilot about the approach of subscribing beforehand with a drain to filter out stale configs, and it said that approach was more complex and could open us up to more race conditions and ordering issues.

Its proposed solution of moving the lock seemed simpler, and I couldn't really find many downsides.

Yeah, it makes sense to use a read lock here.
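The read-lock point can be sketched with sync.RWMutex (the `deployment` type and methods here are hypothetical, assuming setInitialConfig only reads the stored files as suggested above): concurrent subscribers share a read lock while config generation still takes the exclusive write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// deployment is an illustrative stand-in for the real deployment store.
type deployment struct {
	mu    sync.RWMutex
	files []string
}

// setInitialConfig only reads the stored files, so it can take a read
// lock: many subscribers may snapshot the config concurrently.
func (d *deployment) setInitialConfig() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	out := make([]string, len(d.files)) // copy so callers can't mutate state
	copy(out, d.files)
	return out
}

// updateFiles mutates the stored files, so it needs the write lock,
// which also blocks out readers mid-update.
func (d *deployment) updateFiles(files []string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.files = files
}

func main() {
	d := &deployment{}
	d.updateFiles([]string{"nginx.conf"})
	fmt.Println(d.setInitialConfig())
}
```

The trade-off is the usual RWMutex one: readers never block each other, but a writer still waits for in-flight readers, which preserves the "subscriber sees either the old or the new config, never a torn one" guarantee.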


@bjee19
Contributor Author

bjee19 commented Mar 27, 2026

I think this is good to review; the Copilot reviews have refined it enough.

@bjee19
Contributor Author

bjee19 commented Mar 27, 2026

Running the pipeline just a few times with the expanded functional tests to ensure changes still work after a few iterations.


Labels

bug Something isn't working release-notes tests Pull requests that update tests

Projects

Status: 🆕 New

Development

Successfully merging this pull request may close these issues.

Subscriber and broadcaster select statement logic results in NGF occasionally deadlocking

3 participants