fix: assertion failure in JobQueue::stop#5774
Conversation
…a suspended coroutine Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
636bfd2 to
32a3f0a
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5774 +/- ##
=========================================
- Coverage 79.9% 79.9% -0.0%
=========================================
Files 839 840 +1
Lines 65528 65582 +54
Branches 7273 7267 -6
=========================================
+ Hits 52330 52373 +43
- Misses 13198 13209 +11
🚀 New features to boost your workflow:
|
| std::lock_guard lock(jq_.m_mutex); | ||
| if (shouldStop()) | ||
| { | ||
| return; |
There was a problem hiding this comment.
This case is very hard to cover in unit tests but this is valid, because we have to call jobQueue->stop() and then call coro->yield() right after m_suspendedCoros gets copied and before we call jobCounter_.join in JobQueue::stop
|
Seems that we're getting an error in Antithesis, converting this PR to draft. |
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
| // coro->post(). | ||
| // When post() failed, we won't get a thread to let | ||
| // the Coro finish. We should ignore the coroutine and | ||
| // let it destruct, as the JobQueu has been signaled to |
|
|
||
| /** Returns true if the coroutine should stop executing */ | ||
| [[nodiscard]] bool | ||
| shouldStop() const; |
There was a problem hiding this comment.
This will change the contract of the class. The class description suggests that when a JobQueue stops, it wait for all the jobs to finish. Here you seem to be adding an abort functionality. This will change the expected behaviour of the class. I would recommend finding an alternate solution for the problem.
There was a problem hiding this comment.
I don't think there's an alternate solution and this approach is quite conventional, the only difference is that we could provide a std::stop_token to all async functions and then the stop token does the same as shouldStop, but adding shouldStop requires least amount of changes.
| // Resume the coroutine so that it has a chance to clean things up | ||
| if (state_ == CoroState::Suspended) | ||
| { | ||
| resume(); |
There was a problem hiding this comment.
This might trigger an out of order execution of the job assigned to this coroutine. I think that could be buggy, if not borderline dangerous.
Example: We created 5 coroutines, each with a job. They all got scheduled in a job queue. We freed coroutine no. 3. It gets resumed/executed, before others.
There was a problem hiding this comment.
This is the issue with the boost coroutine where we can't throw an exception directly outside a coroutine when it's suspended. To notify coroutines that the underlying thread pool is getting destructed or the coroutine itself is getting destructed, we have to resume the coroutine and let it exit itself, and then, inside the coroutine, it has to check if it should stop before doing any other job.
There was a problem hiding this comment.
I've updated the JobQueue class so that we can leverage the internal clean up process of boost coroutine.
| // In the case where make_shared<Coro>() succeeds and then the JobQueue | ||
| // stops before coro_ gets executed, post() will still be called and |
There was a problem hiding this comment.
I think we need to fix this issue, where JobQueue gets stopped before coroutines finish. It shouldn't. It should keep waiting for coroutines.
There was a problem hiding this comment.
JobQueue won't fully stop before all coroutines stop.
| { | ||
| // We should resume the suspended coroutines so that the coroutines | ||
| // get a chance to exit cleanly. | ||
| for (auto& [_, coro] : suspendedCoros) |
There was a problem hiding this comment.
The class name is JobQueue, so I am going with the assumption of jobs being processed to completion in FIFO order. If you add a new job to the queue at this point, it will be the last job to get executed(since you deactivated queuing). We would know that when this job finishes, the queue is empty. This job can also do any clean-up at this point. It can signal that the queue is now empty and hence JobQueue can stop.
There was a problem hiding this comment.
It is impossible to add a new job when the job queue finishes except for jobs that resumes coroutines, in which case, the order doesn't matter that much because the coroutines will just exit without doing any other things.
|
I didn't understand the argument(from PR description) that
How come there are no threads? |
It happens when the thread pool in the job queue has been emptied (during destruction) and then a coroutine just suddenly realises that it needs to resume. |
…queue-stop Signed-off-by: JCW <a1q123456@users.noreply.github.com> # Conflicts: # include/xrpl/core/Coro.ipp # src/libxrpl/core/detail/JobQueue.cpp # src/test/core/Coroutine_test.cpp
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
…queue-stop Signed-off-by: JCW <a1q123456@users.noreply.github.com> # Conflicts: # include/xrpl/core/Coro.ipp # include/xrpl/core/JobQueue.h # src/libxrpl/core/detail/JobQueue.cpp # src/test/core/Coroutine_test.cpp
Signed-off-by: JCW <a1q123456@users.noreply.github.com>
|
This PR has conflicts, please resolve them in order for the PR to be reviewed. |
|
We recently merged a refactor to One-time setupIf you don't already have clang-tidy working in your env, on macOS: brew install llvm@21
# Follow brew's hint to put $(brew --prefix llvm@21)/bin on PATH so run-clang-tidy is found.Workflow on your branch (before merging develop)1. Grab the new git remote -v # should show 'upstream' among others; if not:
# git remote set-url upstream git@github.com:XRPLF/rippled.git
git fetch upstream
git checkout upstream/develop -- .clang-tidy2. Reconfigure conan/cmake so 3. Apply renames for the files modified in your PR: git diff --name-only $(git merge-base HEAD upstream/develop) HEAD \
| grep -E '\.(cpp|h|hpp|ipp)$' \
| xargs run-clang-tidy -p build -fix -allow-no-checks
# or -p .build, or whatever your build dir is called4. Build + test, then commit as a single dedicated commit: cmake --build build -j8
git commit -am "refactor: Align identifier naming with develop"5. Now merge develop: git merge upstream/developExtraRun clang-tidy once more after the merge to catch any stragglers introduced from develop's side: run-clang-tidy -p build -fix -allow-no-checks src tests
# or -p .build, or whatever your build dir is called |
High Level Overview of Change
We're getting the assertion failure in Antithesis in JobQueue::stop at line 316 and it's because
m_processCountwill be 0 when there's a coroutine suspended (either by the first yield() before it reaches the user function or by an yield() in the user function) and nSuspend_ will be 1. The assertion failure indicates that coroutines in rippled do not have threads to resume and exit cleanly when the job queue is getting closed.This PR makes JobQueue resume all suspended coroutines so that coroutines can check if it should exit after waking up.
Context of Change
Type of Change
.gitignore, formatting, dropping support for older tooling)API Impact
libxrplchange (any change that may affectlibxrplor dependents oflibxrpl)