Cleanly shutdown CBS and queue watching thread by MrCreosote · Pull Request #102 · kbase/JobRunner

MrCreosote · 2025-07-28T16:59:13Z

When running integration tests with the callback server via Callback, I found that the tests would hang forever. I suspect this is because the queue watching process would never terminate. This PR reworks shutdown for the callback server and watch thread when set up via JobRunner.callback(). For now it doesn't alter how run() works, since that has the watch loop and CBS termination built in and seems to work as is.

The new stop() and wait_for_stop() methods will be exercised in the next PR.

I manually tested that starting and curling the root of the CBS in a container works.
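
To illustrate the kind of hang described above, here is a minimal sketch of a queue-watching thread that can be asked to exit. It is not the JobRunner code: the names `watch`, `run_demo`, and the message strings are hypothetical, and it assumes the watcher reads from a standard-library `queue.Queue`. The point is only that a watcher which re-checks a stop flag between bounded `get()` calls can terminate, whereas one that blocks indefinitely on an empty queue cannot.

```python
import queue
import threading


def watch(q, stop_event, handled, poll_seconds=0.05):
    """Handle messages from q until stop_event is set (hypothetical watcher)."""
    while not stop_event.is_set():
        try:
            # The timeout keeps the loop from blocking forever on an empty
            # queue, so the stop flag is re-checked regularly.
            msg = q.get(timeout=poll_seconds)
        except queue.Empty:
            continue
        handled.append(msg)
        q.task_done()


def run_demo():
    q = queue.Queue()
    stop_event = threading.Event()
    handled = []
    watcher = threading.Thread(target=watch, args=(q, stop_event, handled))
    watcher.start()

    q.put("job-submitted")
    q.put("job-finished")
    q.join()  # wait until both messages have been handled

    stop_event.set()          # ask the watcher to exit
    watcher.join(timeout=2)   # block until it has actually exited
    return handled, watcher.is_alive()


if __name__ == "__main__":
    print(run_demo())
```

If the `timeout` argument were dropped from `q.get()`, the watcher in this sketch would never see the stop flag once the queue was empty, and `watcher.join()` would only return because of its own timeout.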

codecov · 2025-07-28T17:02:27Z

Codecov Report

❌ Patch coverage is 38.63636% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.35%. Comparing base (d744990) to head (36e0ce6).
⚠️ Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
JobRunner/JobRunner.py	42.10%	11 Missing ⚠️
JobRunner/Callback.py	37.50%	10 Missing ⚠️
JobRunner/callback_server.py	33.33%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #102      +/-   ##
==========================================
- Coverage   79.23%   78.35%   -0.89%     
==========================================
  Files          13       13              
  Lines        1074     1095      +21     
==========================================
+ Hits          851      858       +7     
- Misses        223      237      +14

MrCreosote · 2025-07-28T17:04:49Z

It looks like the stuff I changed didn't have tests. I'll add them in the next PR.

Copilot

Pull Request Overview

This PR implements a clean shutdown mechanism for the callback server (CBS) and queue watching thread when using JobRunner.callback(). Previously, integration tests would hang indefinitely because the queue watching process and callback server would never terminate properly.

Key changes include:

Added shutdown event mechanism using multiprocessing.Event to coordinate termination between processes
Converted synchronous blocking operations to threaded operations with proper cleanup
Introduced stop() and wait_for_stop() methods for graceful shutdown

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
JobRunner/callback_server.py	Added shutdown event parameter and async listener to cleanly stop the Sanic server
JobRunner/JobRunner.py	Added threading support, shutdown coordination with Event, and new stop/wait methods
JobRunner/Callback.py	Refactored to use direct JobRunner instance instead of subprocess, enabling proper cleanup

Copilot · 2025-07-29T00:56:36Z

JobRunner/JobRunner.py

+        After calling this method this instance of the job runner is no longer usable.
+        """
+        self._shutdown_event.set()
+        self._stop = True
+        self.wait_for_stop()
+
+    def wait_for_stop(self):


[nitpick] Calling wait_for_stop() within stop() creates a blocking operation that may not be desired. Consider separating the shutdown initiation (stop) from the waiting (wait_for_stop) to give callers more control over when to block.

Suggested change

-        After calling this method this instance of the job runner is no longer usable.
-        """
-        self._shutdown_event.set()
-        self._stop = True
-        self.wait_for_stop()
-
-    def wait_for_stop(self):
+        This method initiates the shutdown process but does not block.
+        Call `wait_for_stop()` to block until the shutdown is complete.
+        """
+        self._shutdown_event.set()
+        self._stop = True
+
+    def wait_for_stop(self):
+        """
+        Block until the callback server and watch thread have finished.
+        """

Calling wait_for_stop() within stop() creates a blocking operation that may not be desired.

The blocking operation is how it's supposed to work: the calling thread should block until the CBS process and thread have exited and no zombie processes are left.
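
For readers following this exchange, a rough sketch of a blocking `stop()` is given here. It is an illustration under stated assumptions, not the PR's implementation: the `Runner` class and `is_running()` are hypothetical, the child process is a plain `subprocess.Popen` that exits when its stdin is closed (standing in for a server that reacts to a shutdown event), and the watch thread simply waits on a `threading.Event`.

```python
import subprocess
import sys
import threading


class Runner:
    """Hypothetical stand-in that owns one child process and one watch thread."""

    def __init__(self):
        self._stop_event = threading.Event()
        # Assumption: the child exits once its stdin reaches EOF. This stands
        # in for a server process that shuts down when told to.
        self._child = subprocess.Popen(
            [sys.executable, "-c", "import sys; sys.stdin.read()"],
            stdin=subprocess.PIPE,
        )
        self._thread = threading.Thread(target=self._watch)
        self._thread.start()

    def _watch(self):
        # Stand-in for a watch loop: return as soon as stop is requested.
        self._stop_event.wait()

    def stop(self):
        """Request shutdown, then block until the child and thread have exited."""
        self._child.stdin.close()   # signals the child to exit
        self._stop_event.set()      # signals the watch thread to exit
        self.wait_for_stop()

    def wait_for_stop(self, timeout=10):
        """Block until the child process has been reaped and the thread joined."""
        self._child.wait(timeout=timeout)   # reaps the child, so no zombie is left
        self._thread.join(timeout=timeout)

    def is_running(self):
        return self._child.poll() is None or self._thread.is_alive()


if __name__ == "__main__":
    runner = Runner()
    runner.stop()
    print(runner.is_running())
```

With this shape, a caller that returns from `stop()` knows nothing is left running. A caller that prefers not to block could still run `stop()` on a separate thread.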

Copilot · 2025-07-29T00:56:42Z

JobRunner/callback_server.py

+        @app.after_server_start
+        async def shutdown_listener(app, _):
+            while not shutdown_event.is_set():
+                await asyncio.sleep(0.1)


[nitpick] The 0.1 second polling interval in the shutdown listener creates unnecessary CPU usage. Consider using a larger interval (e.g., 1.0 second) or implementing an event-driven approach if more responsive shutdown is needed.

Suggested change

-                await asyncio.sleep(0.1)
+                await asyncio.sleep(1.0)

100 milliseconds is an eternity for a modern processor

Besides, one of your AI compatriots suggested it. Argue with him / her / it

I can't wait for the point where code reviews are just AIs arguing with each other
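
To make the polling trade-off in this thread concrete, here is a small sketch of a coroutine that polls a `multiprocessing.Event` with `asyncio.sleep`. It uses plain asyncio and no Sanic, so it is not the listener from callback_server.py. The names `wait_for_shutdown` and `run_demo` are hypothetical, and a `threading.Timer` stands in for whatever calls `stop()` elsewhere.

```python
import asyncio
import multiprocessing
import threading


async def wait_for_shutdown(shutdown_event, interval=0.1):
    """Poll shutdown_event every `interval` seconds; return the number of polls."""
    polls = 0
    while not shutdown_event.is_set():
        polls += 1
        # asyncio.sleep yields to the event loop, so other coroutines keep
        # running between polls; each wakeup is only a flag check.
        await asyncio.sleep(interval)
    return polls


def run_demo(delay=0.35, interval=0.1):
    shutdown_event = multiprocessing.Event()
    # Assumption for the demo: another thread sets the event after `delay`
    # seconds, standing in for a stop() call made elsewhere.
    timer = threading.Timer(delay, shutdown_event.set)
    timer.start()
    polls = asyncio.run(wait_for_shutdown(shutdown_event, interval))
    timer.join()
    return polls, shutdown_event.is_set()


if __name__ == "__main__":
    print(run_demo())
```

At a 0.1 second interval the loop wakes about ten times per second, and shutdown is noticed within roughly one interval. Raising the interval to 1.0 second reduces wakeups but also delays shutdown by up to a second.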

bio-boris · 2025-07-29T01:08:08Z

JobRunner/Callback.py

        return job_params

-    def run(self):
+    def start(self):


Can we give this a more descriptive name? It is not clear what is being started.

I have forgotten what much of the code does, and from looking at these files it is not obvious what even is a "CallBack" vs a "CallBackServer"

self.cbs is no longer used, is that right? If so, it can be removed.

Fixed both issues

Callback.py is a small helper to start the callback server from the command line (or a docker command, really). callback_server.py is the Sanic server code.

That said, no argument from me about naming

MrCreosote requested a review from bio-boris July 28, 2025 16:59

bio-boris requested a review from Copilot July 29, 2025 00:55

Copilot AI reviewed Jul 29, 2025

bio-boris reviewed Jul 29, 2025

Remove unused instance var, clarify Callback.start() method

36e0ce6

bio-boris approved these changes Jul 29, 2025

MrCreosote merged commit f037f7f into main Jul 29, 2025
10 of 12 checks passed

bio-boris mentioned this pull request Aug 15, 2025

Update release notes #115

Merged

3 participants
