Skip to content

Comments

Feature/thread safe async await#21

Open
iv461 wants to merge 35 commits intomainfrom
feature/thread_safe_async_await
Open

Feature/thread safe async await#21
iv461 wants to merge 35 commits intomainfrom
feature/thread_safe_async_await

Conversation

@iv461
Copy link
Owner

@iv461 iv461 commented Feb 20, 2026

Makes the async/await API thread-safe:

  • Adds synchronization for the timeout timers
  • Adds synchronization/locking for TFListener
  • Adds synchronization for get_goal_id (rclcpp_action)
  • Multiple coroutine race fixes

Adds Thread Sanitizer (TSAN) instrumentation compile flags and a list of rules for ignoring all the noise (dataraces in CycloneDDS).

TSAN evaluation

I've tested extensively with TSAN to find data races. For this, I've made the synchronization conditional and then wrote test cases that provoke data races. For this, I'm using the MultiThreadedExecutor with 8 threads, and spawning 100 timers that use the ICEY async/await API for TF, services and actions.
To ensure that the timer callbacks are called concurrently, they belong to an reentrant callback group.
Using this setup, TSAN correctly detected multiple data-races. on the task-timer hashtable,

Races fixed

  • Race/Use-after-free between launch_async that dispatches an async operation to a new thread in promise.await_suspend(), and the destructor of this promise that is called when another thread resumes the coroutine that holds this same promise. Fixed in cdb6094
  • Race on coroutine handle with detached coroutines, i.e. coroutines that are not awaited. These are only the "top-level" coroutines that are called from the ROS callback, which is not a coroutine but a regular function. When the destructor of the task runs, it checks whether the coroutine is done. Meanwhile, another thread might resume this coroutine and race on the coroutine handle occurs. This is solved by simply setting the coroutine_ to nullptr when detaching, this works because with detached coroutines, the task does not have to do anything with the coroutine.
  • Races with the cancellation functions in the action client, added locking the appropriate mutexes

While testing with TSAN, I found among the races in ICEY also two races in rclcpp:

  • rclcpp: Race for the PRNG state in when creating an UUID for an action goal (ClientBase:generate_goal_id) (fixed in 0df6334)
  • A race occurs when cleaning up internal references/std:weak_ptr<TimerBase> : This one is very likely a false-positive caused by TSAN using non-instrumented rclcpp code. It can't possibly be a race because it happens in the control block of a shared_ptr, which is thread-safe.

Other changes

  • Fast single-threaded code is used in notify_if_any_relevant_transform_was_received if multi-theading support is disabled
  • Fixed some use-after-free bugs in the unit-tests (AsyncAwaitTwoNodeTest) related to capturing by reference in a coroutine lambda
  • Improved the actions API to use the regular rclcpp_actions type for WrappedResult so that users do not have to rewrite their code unnecessarily when using ICEY
  • Improved clarity of action examples
  • Improved API parity with regular ROS: Passing callbackgroup arguments and action options

TODO:

  • I still need a good solution to arbitrate the race between the timeout and the response callback (TF/service, etc.). We have to arbitrate this race because the executor cannot guarantee that only one of the two will fire. Ideally, this would be handled in the executor, but I can't modify it. In ICEY, I only want to use public APIs. Ideally, the solution would be lock-free and would not require dynamic memory allocations per request. However, I don't think this is possible. Instead, I'm trying to reuse the task cancellation operation, which is an atomic test-and-set operation, as arbitration. However, this suffers from the ABA problem, as described in the last paragraph of the first section of the Wikipedia article. The problem is that the task key is the promise's address, which may be the same in subsequent calls.

@iv461 iv461 linked an issue Feb 20, 2026 that may be closed by this pull request
Ivo Ivanov added 20 commits February 20, 2026 23:32
…n anymore and can executore multiple timers from multiple nodes, but the MT executor only uses the main thread ?!
…sary occurence of an icey-type that is the same as in rclcpp_action.
…tched the async operation and detached the coroutine checks again if this coroutine is finished in it's destructor, while another thread continues the same coroutine. The fix is easy, since we do not have to do anything with this coroutine state in case we detached, we simply set the pointer to the coroutine state (aka. coroutine_handle) to zero
Ivo Ivanov added 6 commits February 22, 2026 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multi-threaded executor

1 participant