Support more types in decoupled lookback fastpath#7575
Support more types in decoupled lookback fastpath#7575bernhardmgruber wants to merge 2 commits intoNVIDIA:mainfrom
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| static constexpr ::cuda::std::size_t max_smem_per_block = 48 * 1024; | ||
|
|
||
| // The size in bytes of the largest machine word that can be atomically read/written in a single instruction. | ||
| inline static constexpr int largest_atomic_word_size = 16; |
There was a problem hiding this comment.
I don't think this is true for all architectures
There was a problem hiding this comment.
I don't know TBH. We hardcoded 16 in the past and I know the build-in atomics like __nv_atomic_load support up to 16 bytes. Also, cub::detail::store_release etc. are also only implemented for up to 16 bytes. I think we are fine wrt. this PR, but we may revisit this in the future.
There was a problem hiding this comment.
I believe this requires SM90 otherwise its only 8
There was a problem hiding this comment.
I think the name of the variable may not be accurate then. We have ld.acquire.gpu.v2.u64 on SM70, and ld.cg.v2.u64+__threadfence() before that. (I am looking at the implementation of cub::detail::load_acquire). What this variable means is what is the largest message size in bytes that we can pass from one thread to another using a store release and a load acquire.
| ::cuda::std::_If<is_primitive<ValueT>::value && (sizeof(ValueT) + sizeof(KeyT) < 16), | ||
| // TODO(bgruber): remove the check for is_primitive<ValueT> in CCCL 4.0 | ||
| ::cuda::std::_If<(is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>) | ||
| && (sizeof(ValueT) + sizeof(KeyT) < largest_atomic_word_size), |
There was a problem hiding this comment.
(sizeof(ValueT) + sizeof(KeyT) < largest_atomic_word_size) or
(sizeof(ValueT) + sizeof(KeyT) <= largest_atomic_word_size)?
There was a problem hiding this comment.
I think it must be less than, since we also need to fit the status word into the 16 bytes.
The goal is to reduce the use of is_primitive<T> in the long term.
0c8dc6d to
cb10516
Compare
| static_assert(!cub::detail::is_primitive_v<segment>); | ||
| // static_assert(!cuda::std::is_trivially_copyable_v<segment>); // TODO(bgruber): why is this important? |
There was a problem hiding this comment.
@pauleonix I need your input here please. Why does segment need to be a non-primitive type and what should this mean here? Because it is trivially copyable.
There was a problem hiding this comment.
We wanted to explicitly test the non-primitive path.
There was a problem hiding this comment.
The changes in this PR make segment take the primitive path now, because it's trivially copyable. So I guess segment must be made non-trivially copyable by e.g. adding a non-trivial destructor or a copy ctor? Does that make sense?
There was a problem hiding this comment.
Confirmed offline.
😬 CI Workflow Results🟥 Finished in 6h 04m: Pass: 92%/99 | Total: 7d 05h | Max: 6h 04m | Hits: 27%/102486See results here. |
| template <typename T, bool SINGLE_WORD = detail::is_primitive<T>::value> | ||
| template <typename T, | ||
| // TODO(bgruber): remove the check for is_primitive<T> in CCCL 4.0 | ||
| bool single_word = detail::is_primitive<T>::value |
There was a problem hiding this comment.
| bool single_word = detail::is_primitive<T>::value | |
| bool SingleWord = detail::is_primitive<T>::value |
| typename KeyT, | ||
| bool SINGLE_WORD = detail::is_primitive<ValueT>::value && (sizeof(ValueT) + sizeof(KeyT) < 16)> | ||
| // TODO(bgruber): remove the check for is_primitive<ValueT> in CCCL 4.0 | ||
| bool single_word = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>) |
There was a problem hiding this comment.
| bool single_word = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>) | |
| bool SingleWord = (detail::is_primitive<ValueT>::value || ::cuda::std::is_trivially_copyable_v<ValueT>) |
The goal is to reduce the use of
is_primitive<T>in the long term.