Skip to content

Support more types in decoupled lookback fastpath#7575

Open
bernhardmgruber wants to merge 2 commits intoNVIDIA:mainfrom
bernhardmgruber:lookback_types
Open

Support more types in decoupled lookback fastpath#7575
bernhardmgruber wants to merge 2 commits intoNVIDIA:mainfrom
bernhardmgruber:lookback_types

Conversation

@bernhardmgruber
Copy link
Contributor

@bernhardmgruber bernhardmgruber commented Feb 9, 2026

The goal is to reduce the use of is_primitive<T> in the long term.

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner February 9, 2026 14:04
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 9, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 9, 2026
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

static constexpr ::cuda::std::size_t max_smem_per_block = 48 * 1024;

// The size in bytes of the largest machine word that can be atomically read/written in a single instruction.
inline static constexpr int largest_atomic_word_size = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is true for all architectures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know TBH. We hardcoded 16 in the past and I know the build-in atomics like __nv_atomic_load support up to 16 bytes. Also, cub::detail::store_release etc. are also only implemented for up to 16 bytes. I think we are fine wrt. this PR, but we may revisit this in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this requires SM90 otherwise its only 8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name of the variable may not be accurate then. We have ld.acquire.gpu.v2.u64 on SM70, and ld.cg.v2.u64+__threadfence() before that. (I am looking at the implementation of cub::detail::load_acquire). What this variable means is what is the largest message size in bytes that we can pass from one thread to another using a store release and a load acquire.

::cuda::std::_If<is_primitive<ValueT>::value && (sizeof(ValueT) + sizeof(KeyT) < 16),
// TODO(bgruber): remove the check for is_primitive<ValueT> in CCCL 4.0
::cuda::std::_If<(is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>)
&& (sizeof(ValueT) + sizeof(KeyT) < largest_atomic_word_size),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(sizeof(ValueT) + sizeof(KeyT) < largest_atomic_word_size) or
(sizeof(ValueT) + sizeof(KeyT) <= largest_atomic_word_size)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it must be less than, since we also need to fit the status word into the 16 bytes.

The goal is to reduce the use of is_primitive<T> in the long term.
Comment on lines +59 to +60
static_assert(!cub::detail::is_primitive_v<segment>);
// static_assert(!cuda::std::is_trivially_copyable_v<segment>); // TODO(bgruber): why is this important?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pauleonix I need your input here please. Why does segment need to be a non-primitive type and what should this mean here? Because it is trivially copyable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to explicitly test the non-primitive path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this PR make segment take the primitive path now, because it's trivially copyable. So I guess segment must be made non-trivially copyable by e.g. adding a non-trivial destructor or a copy ctor? Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed offline.

@github-actions
Copy link
Contributor

😬 CI Workflow Results

🟥 Finished in 6h 04m: Pass: 92%/99 | Total: 7d 05h | Max: 6h 04m | Hits: 27%/102486

See results here.

template <typename T, bool SINGLE_WORD = detail::is_primitive<T>::value>
template <typename T,
// TODO(bgruber): remove the check for is_primitive<T> in CCCL 4.0
bool single_word = detail::is_primitive<T>::value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool single_word = detail::is_primitive<T>::value
bool SingleWord = detail::is_primitive<T>::value

typename KeyT,
bool SINGLE_WORD = detail::is_primitive<ValueT>::value && (sizeof(ValueT) + sizeof(KeyT) < 16)>
// TODO(bgruber): remove the check for is_primitive<ValueT> in CCCL 4.0
bool single_word = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool single_word = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>)
bool SingleWord = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

4 participants