[Hipblastl][tensilelite] Speed up AlmostEqual #4909

Merged
Alex-Vasile merged 1 commit into develop from users/alvasile/faster_comparison on Mar 2, 2026
Conversation

@Alex-Vasile Alex-Vasile commented Feb 26, 2026

Motivation

The current implementation of AlmostEqual stores intermediate results back in the input datatypes (e.g. Half or Float8), which are not natively supported by CPUs.

This results in a lot of conversions back and forth within the hot loop of ReferenceValidator.cpp. So much so that, for several Tensile yaml files, this comparison function ends up taking more time than the reference gemm calculation.

This is part 1 of 2. A follow-up PR will focus on restructuring the code so that the hot loop can be vectorized and unrolled.

Also, not all tests were correctly handling infinity and NaN checks.

Technical Details

The current implementation keeps intermediate results in the input format (e.g. Half), which is not natively supported by CPUs. This introduces a lot of extra conversions for each calculation.

For all non-native dtypes, cast them to float (which is big enough for their precision and range) and perform all operations with floats.
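A minimal sketch of that widen-then-compute idea, not the code merged in this PR. The names `FakeHalf`, `ComputeType`, and `absDiff` are hypothetical, and `FakeHalf` is only a stand-in for a narrow type such as Half that needs a conversion for every arithmetic step:

```cpp
#include <cassert>
#include <cmath>
#include <type_traits>

// Hypothetical stand-in for a narrow type (e.g. Half) that CPUs do not
// support natively: arithmetic has to go through a conversion to float.
struct FakeHalf
{
    float storage; // a real Half would hold 16 bits instead
    explicit FakeHalf(float v) : storage(v) {}
    explicit operator float() const { return storage; }
};

// Assumed trait: native floating-point types compute in their own type,
// every other type is widened to float.
template <typename T>
using ComputeType =
    typename std::conditional<std::is_floating_point<T>::value, T, float>::type;

// Widen each input once on entry, then keep every intermediate result in
// ComputeType<T> instead of converting back to T after each operation.
template <typename T>
ComputeType<T> absDiff(T a, T b)
{
    using C = ComputeType<T>;
    const C wa = static_cast<C>(a);
    const C wb = static_cast<C>(b);
    return std::fabs(wa - wb);
}
```

With this shape, each element is converted once per input rather than once per intermediate operation, which is the saving the description above refers to.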

Now all types that can represent inf and NaN handle these cases correctly via the first part of the check (a == b || <abs_diff_check>).
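To illustrate why the equality test comes first, here is a hedged sketch in plain float. `almostEqualSketch` and its `tolerance` parameter are assumptions made for this example and are not taken from the PR; the actual difference check may be relative or type-dependent:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical comparison performed entirely in float. `tolerance` is an
// assumed absolute threshold chosen for illustration.
inline bool almostEqualSketch(float a, float b, float tolerance)
{
    // inf == inf is true, so matching infinities pass on the first operand.
    // Without it, inf - inf would produce NaN and the difference check
    // would reject two identical infinities.
    // NaN == NaN is false and NaN also fails the difference check, so in
    // this sketch any comparison involving NaN returns false.
    return a == b || std::fabs(a - b) < tolerance;
}
```

Whether two NaN values should count as equal is not stated in the description; this sketch treats them as unequal, which is what IEEE 754 comparison gives by default.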

Test Plan

  1. Run current tests.
  2. Add extensive testing of current implementation.

Test Result

Passing tests

Submission Checklist

codecov-commenter commented Feb 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (76.88%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4909   +/-   ##
========================================
  Coverage    66.18%   66.18%           
========================================
  Files         1742     1742           
  Lines       269819   269819           
  Branches     37507    37507           
========================================
  Hits        178577   178577           
  Misses       75570    75570           
  Partials     15672    15672           
Flag        Coverage Δ        *Carryforward flag
hipBLAS     90.67% <ø> (ø)    Carried forward from 8e3a4bb
hipBLASLt   43.55% <ø> (ø)
hipCUB      82.38% <ø> (ø)    Carried forward from 8e3a4bb
hipDNN      81.95% <ø> (ø)    Carried forward from 8e3a4bb
hipFFT      55.93% <ø> (ø)    Carried forward from 8e3a4bb
hipRAND     76.12% <ø> (ø)    Carried forward from 8e3a4bb
hipSOLVER   68.81% <ø> (ø)    Carried forward from 8e3a4bb
hipSPARSE   84.70% <ø> (ø)    Carried forward from 8e3a4bb
rocBLAS     47.97% <ø> (ø)    Carried forward from 8e3a4bb
rocFFT      52.93% <ø> (ø)    Carried forward from 8e3a4bb
rocRAND     57.06% <ø> (ø)    Carried forward from 8e3a4bb
rocSOLVER   76.88% <ø> (ø)    Carried forward from 8e3a4bb
rocSPARSE   71.53% <ø> (ø)    Carried forward from 8e3a4bb

*This pull request uses carry forward flags.

@Alex-Vasile Alex-Vasile force-pushed the users/alvasile/faster_comparison branch from 74b1745 to 8cd9c60 Compare February 27, 2026 19:05
@talumbau talumbau left a comment

LGTM 👍

@Alex-Vasile Alex-Vasile enabled auto-merge (squash) February 27, 2026 22:22
@Alex-Vasile Alex-Vasile force-pushed the users/alvasile/faster_comparison branch from 8cd9c60 to 7abee7c Compare February 27, 2026 22:23
@math-ci-webhook
perfci run on commit cb6bbf6

math-ci run

Signed-off-by: Alex Vasile <48962821+Alex-Vasile@users.noreply.github.com>
@Alex-Vasile Alex-Vasile force-pushed the users/alvasile/faster_comparison branch from 7abee7c to 26f5b04 Compare March 2, 2026 14:06
@math-ci-webhook
perfci run on commit 6601702

math-ci run

@Alex-Vasile Alex-Vasile merged commit 4ca3b9f into develop Mar 2, 2026
34 checks passed
@Alex-Vasile Alex-Vasile deleted the users/alvasile/faster_comparison branch March 2, 2026 20:05
assistant-librarian bot pushed a commit to ROCm/hipBLASLt that referenced this pull request Mar 2, 2026
[Hipblastl][tensilelite] Speed up AlmostEqual

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: Alex Vasile <48962821+Alex-Vasile@users.noreply.github.com>
kokolchin pushed a commit to kokolchin/rocm-libraries that referenced this pull request Mar 4, 2026
NaveenElumalaiAMD pushed a commit that referenced this pull request Mar 6, 2026
5 participants