Skip to content

fix: assertion failure in JobQueue::stop#5774

Draft
a1q123456 wants to merge 34 commits intodevelopfrom
a1q123456/fix-job-queue-stop
Draft

fix: assertion failure in JobQueue::stop#5774
a1q123456 wants to merge 34 commits intodevelopfrom
a1q123456/fix-job-queue-stop

Conversation

@a1q123456
Copy link
Copy Markdown
Contributor

@a1q123456 a1q123456 commented Sep 8, 2025

High Level Overview of Change

We're getting the assertion failure in Antithesis in JobQueue::stop at line 316 and it's because m_processCount will be 0 when there's a coroutine suspended (either by the first yield() before it reaches the user function or by an yield() in the user function) and nSuspend_ will be 1. The assertion failure indicates that coroutines in rippled do not have threads to resume and exit cleanly when the job queue is getting closed.

This PR makes JobQueue resume all suspended coroutines so that coroutines can check if it should exit after waking up.

Context of Change

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking change that only restructures code)
  • Performance (increase or change in throughput and/or latency)
  • Tests (you added tests for code that already exists, or your new feature included in this PR)
  • Documentation update
  • Chore (no impact to binary, e.g. .gitignore, formatting, dropping support for older tooling)
  • Release

API Impact

  • Public API: New feature (new methods and/or new fields)
  • Public API: Breaking change (in general, breaking changes should only impact the next api_version)
  • libxrpl change (any change that may affect libxrpl or dependents of libxrpl)
  • Peer protocol change (must be backward compatible or bump the peer protocol version)

@a1q123456 a1q123456 requested a review from a team September 8, 2025 13:52
…a suspended coroutine

Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@a1q123456 a1q123456 force-pushed the a1q123456/fix-job-queue-stop branch from 636bfd2 to 32a3f0a Compare September 8, 2025 13:54
@codecov
Copy link
Copy Markdown

codecov Bot commented Sep 8, 2025

Codecov Report

❌ Patch coverage is 92.18750% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.9%. Comparing base (fe9c8d5) to head (02793cd).
⚠️ Report is 195 commits behind head on develop.

Files with missing lines Patch % Lines
include/xrpl/core/Coro.ipp 95.0% 2 Missing ⚠️
src/libxrpl/core/detail/JobQueue.cpp 87.5% 2 Missing ⚠️
include/xrpl/core/JobQueue.h 87.5% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           develop   #5774     +/-   ##
=========================================
- Coverage     79.9%   79.9%   -0.0%     
=========================================
  Files          839     840      +1     
  Lines        65528   65582     +54     
  Branches      7273    7267      -6     
=========================================
+ Hits         52330   52373     +43     
- Misses       13198   13209     +11     
Files with missing lines Coverage Δ
include/xrpl/core/JobQueue.h 90.9% <87.5%> (-9.1%) ⬇️
include/xrpl/core/Coro.ipp 94.8% <95.0%> (-5.2%) ⬇️
src/libxrpl/core/detail/JobQueue.cpp 90.3% <87.5%> (-0.5%) ⬇️

... and 438 files with indirect coverage changes

Impacted file tree graph

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/xrpld/core/Coro.ipp Outdated
std::lock_guard lock(jq_.m_mutex);
if (shouldStop())
{
return;
Copy link
Copy Markdown
Contributor Author

@a1q123456 a1q123456 Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is very hard to cover in unit tests but this is valid, because we have to call jobQueue->stop() and then call coro->yield() right after m_suspendedCoros gets copied and before we call jobCounter_.join in JobQueue::stop

@a1q123456 a1q123456 requested review from vlntb and ximinez September 9, 2025 10:08
@a1q123456
Copy link
Copy Markdown
Contributor Author

Seems that we're getting an error in Antithesis, converting this PR to draft.

@a1q123456 a1q123456 marked this pull request as draft September 16, 2025 09:45
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@a1q123456 a1q123456 added the DraftRunCI Normally CI does not run on draft PRs. This opts in. label Sep 16, 2025
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@a1q123456 a1q123456 removed the DraftRunCI Normally CI does not run on draft PRs. This opts in. label Sep 18, 2025
@a1q123456 a1q123456 marked this pull request as ready for review September 18, 2025 08:45
a1q123456 and others added 6 commits September 18, 2025 22:29
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@ximinez ximinez removed their request for review September 24, 2025 18:32
@bthomee bthomee requested a review from Bronek September 24, 2025 18:32
a1q123456 and others added 3 commits September 25, 2025 14:47
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Comment thread include/xrpl/core/Coro.ipp
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@a1q123456 a1q123456 requested a review from gowsiany November 21, 2025 16:54
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@bthomee bthomee requested review from vvysokikh1 and removed request for Bronek November 25, 2025 15:01
// coro->post().
// When post() failed, we won't get a thread to let
// the Coro finish. We should ignore the coroutine and
// let it destruct, as the JobQueu has been signaled to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: JobQueue.

Comment thread include/xrpl/core/JobQueue.h Outdated

/** Returns true if the coroutine should stop executing */
[[nodiscard]] bool
shouldStop() const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will change the contract of the class. The class description suggests that when a JobQueue stops, it wait for all the jobs to finish. Here you seem to be adding an abort functionality. This will change the expected behaviour of the class. I would recommend finding an alternate solution for the problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's an alternate solution and this approach is quite conventional, the only difference is that we could provide a std::stop_token to all async functions and then the stop token does the same as shouldStop, but adding shouldStop requires least amount of changes.

Comment thread include/xrpl/core/Coro.ipp Outdated
// Resume the coroutine so that it has a chance to clean things up
if (state_ == CoroState::Suspended)
{
resume();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might trigger an out of order execution of the job assigned to this coroutine. I think that could be buggy, if not borderline dangerous.

Example: We created 5 coroutines, each with a job. They all got scheduled in a job queue. We freed coroutine no. 3. It gets resumed/executed, before others.

Copy link
Copy Markdown
Contributor Author

@a1q123456 a1q123456 Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the issue with the boost coroutine where we can't throw an exception directly outside a coroutine when it's suspended. To notify coroutines that the underlying thread pool is getting destructed or the coroutine itself is getting destructed, we have to resume the coroutine and let it exit itself, and then, inside the coroutine, it has to check if it should stop before doing any other job.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the JobQueue class so that we can leverage the internal clean up process of boost coroutine.

Comment on lines +100 to +101
// In the case where make_shared<Coro>() succeeds and then the JobQueue
// stops before coro_ gets executed, post() will still be called and
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to fix this issue, where JobQueue gets stopped before coroutines finish. It shouldn't. It should keep waiting for coroutines.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobQueue won't fully stop before all coroutines stop.

{
// We should resume the suspended coroutines so that the coroutines
// get a chance to exit cleanly.
for (auto& [_, coro] : suspendedCoros)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class name is JobQueue, so I am going with the assumption of jobs being processed to completion in FIFO order. If you add a new job to the queue at this point, it will be the last job to get executed(since you deactivated queuing). We would know that when this job finishes, the queue is empty. This job can also do any clean-up at this point. It can signal that the queue is now empty and hence JobQueue can stop.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is impossible to add a new job when the job queue finishes except for jobs that resumes coroutines, in which case, the order doesn't matter that much because the coroutines will just exit without doing any other things.

@pratikmankawde
Copy link
Copy Markdown
Contributor

I didn't understand the argument(from PR description) that

The assertion failure indicates that coroutines in rippled do not have threads to resume and exit cleanly when the job queue is getting closed.

How come there are no threads? Workers object is alive in JobQueue class. What am I missing here?

@a1q123456
Copy link
Copy Markdown
Contributor Author

I didn't understand the argument(from PR description) that

The assertion failure indicates that coroutines in rippled do not have threads to resume and exit cleanly when the job queue is getting closed.

How come there are no threads? Workers object is alive in JobQueue class. What am I missing here?

It happens when the thread pool in the job queue has been emptied (during destruction) and then a coroutine just suddenly realises that it needs to resume.

…queue-stop

Signed-off-by: JCW <a1q123456@users.noreply.github.com>

# Conflicts:
#	include/xrpl/core/Coro.ipp
#	src/libxrpl/core/detail/JobQueue.cpp
#	src/test/core/Coroutine_test.cpp
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@ximinez ximinez modified the milestones: 3.1.0, 3.2.0 Jan 28, 2026
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
…queue-stop

Signed-off-by: JCW <a1q123456@users.noreply.github.com>

# Conflicts:
#	include/xrpl/core/Coro.ipp
#	include/xrpl/core/JobQueue.h
#	src/libxrpl/core/detail/JobQueue.cpp
#	src/test/core/Coroutine_test.cpp
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

This PR has conflicts, please resolve them in order for the PR to be reviewed.

@godexsoft
Copy link
Copy Markdown
Contributor

We recently merged a refactor to develop that enables clang-tidy's readability-identifier-naming. Your branch now has heavy conflicts that are largely mechanical. Below is a workflow that aligns your branch's naming with develop before merging, which should minimize the merge conflicts.

One-time setup

If you don't already have clang-tidy working in your env, on macOS:

brew install llvm@21
# Follow brew's hint to put $(brew --prefix llvm@21)/bin on PATH so run-clang-tidy is found.

Workflow on your branch (before merging develop)

1. Grab the new .clang-tidy from develop without pulling anything else. Sync your fork on GitHub first, then:

git remote -v   # should show 'upstream' among others; if not:
# git remote set-url upstream git@github.com:XRPLF/rippled.git
git fetch upstream
git checkout upstream/develop -- .clang-tidy

2. Reconfigure conan/cmake so compile_commands.json is fresh.

3. Apply renames for the files modified in your PR:

git diff --name-only $(git merge-base HEAD upstream/develop) HEAD \
  | grep -E '\.(cpp|h|hpp|ipp)$' \
  | xargs run-clang-tidy -p build -fix -allow-no-checks
# or -p .build, or whatever your build dir is called

4. Build + test, then commit as a single dedicated commit:

cmake --build build -j8
git commit -am "refactor: Align identifier naming with develop"

5. Now merge develop:

git merge upstream/develop

Extra

Run clang-tidy once more after the merge to catch any stragglers introduced from develop's side:

run-clang-tidy -p build -fix -allow-no-checks src tests
# or -p .build, or whatever your build dir is called

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bug PR: has conflicts Triaged Issue/PR has been triaged for viability, liveliness, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants