In sonic-moe for sm100, it is mentioned that there is a relay warp arriving at the peer-CTA's mbarrier to signal tcgen05.mma.2cta that the cp.async of gather_A completes. However, this mbarrier arrive violates the ptx memory model. The call stack is as follows:
So the equivalent ptx being called here is mbarrier.cta.arrive.release.shared::cluster (when unspecified, the default scope is cta and default semantic is release). However, this arrive does not generate a release-acquire relation between the relay warp and 2cta mma warp in the cluster scope. There are two legal ways to do this.
Producer cta:
cp.async addr, val
cp.async.wait_group 0
mbarrier.cluster.arrive.release.shared::cluster
Consumer cta:
mbarrier.cluster.try_wait.acquire.shared::cta
tcgen05.mma.2cta addr
however, this will emit a MEMBAR.GPU in the hot loop. An optimized way to do this is:
Producer cta:
cp.async addr, val
cp.async.wait_group 0
fence.release.sync_restrict::shared::cta.cluster
mbarrier.cluster.arrive.relaxed.shared::cluster
Consumer cta:
mbarrier.cluster.try_wait.acquire.shared::cta
tcgen05.mma.2cta addr
which avoids the MEMBAR.GPU.
In sonic-moe for sm100, it is mentioned that there is a relay warp arriving at the peer-CTA's mbarrier to signal
tcgen05.mma.2ctathat thecp.asyncof gather_A completes. However, this mbarrier arrive violates the ptx memory model. The call stack is as follows:So the equivalent ptx being called here is
mbarrier.cta.arrive.release.shared::cluster(when unspecified, the default scope isctaand default semantic isrelease). However, this arrive does not generate arelease-acquirerelation between the relay warp and 2cta mma warp in theclusterscope. There are two legal ways to do this.Producer cta: cp.async addr, val cp.async.wait_group 0 mbarrier.cluster.arrive.release.shared::cluster Consumer cta: mbarrier.cluster.try_wait.acquire.shared::cta tcgen05.mma.2cta addrhowever, this will emit a
MEMBAR.GPUin the hot loop. An optimized way to do this is:Producer cta: cp.async addr, val cp.async.wait_group 0 fence.release.sync_restrict::shared::cta.cluster mbarrier.cluster.arrive.relaxed.shared::cluster Consumer cta: mbarrier.cluster.try_wait.acquire.shared::cta tcgen05.mma.2cta addrwhich avoids the
MEMBAR.GPU.