Skip to content

gdb: add "catch hiperr" to break on HIP API errors#129

Merged
amd-shahab merged 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs
Jun 10, 2026
Merged

gdb: add "catch hiperr" to break on HIP API errors#129
amd-shahab merged 1 commit into
amd-stagingfrom
users/shvahedi/hip-catch-errs

Conversation

@amd-shahab

@amd-shahab amd-shahab commented May 15, 2026

Copy link
Copy Markdown
Contributor

This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.

--

What

This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:

(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (HIP error)
hipErrorInvalidDevice (101): invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.

Design

Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.

The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:

uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.

Test

A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

  1. (Pending) catchpoints work.
  2. Multiple catchpoints in one execution are reported accordingly.
  3. Errors of indirect HIP calls like "kenel<<<...>>>()" are caught.
  4. The convenience variable can be used for conditional situations.
  5. Conditional and temporary catchpoints work as expected.

ROCM-1548

@amd-shahab amd-shahab self-assigned this May 15, 2026
@amd-shahab amd-shahab requested a review from a team as a code owner May 15, 2026 18:55
@lumachad

Copy link
Copy Markdown
Collaborator

@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip".

@lancesix lancesix left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First quick first pass of comments.

One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.cpp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
@amd-shahab amd-shahab added the ci:skip Skip all pre-commit / CI jobs while the label is up label May 18, 2026
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/amd64-tdep.c
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c Outdated
Comment thread gdb/break-catch-hiperr.c
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 823bf4f to 745f0bc Compare May 18, 2026 10:43
@amd-shahab

amd-shahab commented May 18, 2026

Copy link
Copy Markdown
Contributor Author
  • error string not part of API

  • __hipOnError -> @samp{__hipOnError}

  • "int main" -> "int\nmain"

  • gdb_test -> gdb_test_on_output

  • target_read_stack -> target_read_memory

  • drop the "ignored" param name

  • Else -> Otherwise

  • NULL -> nullptr

  • removed dangling "backtrace" command from the documentation

  • !arg -> arg != nullptr

  • binay -> binary

  • relvant -> relevant

  • bool print_one (const bp_location **) -> bool print_one (const bp_location **last_loc)

  • rename variable "x" to "dontcare"

  • comment on true/false return of amd64_fetch_hiperr_parameters()

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 745f0bc to 1da282b Compare May 18, 2026 10:52
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • emit a warning when the "version" is not expected.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 1da282b to d616325 Compare May 18, 2026 12:37
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • make the invocation of the test fully argument dependent
  • add comments for every method in break-catch-hiperr.c

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from d616325 to 4b40f98 Compare May 18, 2026 14:19
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • check for the existence of __hipOnError symbol. If not found, skip the test.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 4b40f98 to 5005bc1 Compare May 18, 2026 19:21
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • check for the right frame (__hipOnError)

Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 5 times, most recently from 81df767 to 5a4629e Compare May 19, 2026 14:34
@amd-shahab amd-shahab removed the ci:skip Skip all pre-commit / CI jobs while the label is up label May 19, 2026
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 3 times, most recently from b2e7ebe to 3ff9b7d Compare May 20, 2026 14:14
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 6674390 to 159656f Compare June 8, 2026 12:20
@amd-shahab amd-shahab requested a review from lancesix June 8, 2026 13:53
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 159656f to 79a5af5 Compare June 8, 2026 20:09
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • $_hiperr prints as hipError_t if that type is known to GDB.
  • Add tests for that.
  • Update the documentation to reflect that.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 79a5af5 to 93dcde0 Compare June 9, 2026 08:00
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • Add test for when there's no debug symbols

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 93dcde0 to be5705e Compare June 9, 2026 10:25
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • remove static from static struct block_symbol bs definition, so the symbol will always be looked up.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from be5705e to 1d5cec4 Compare June 9, 2026 11:25

@palves palves left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit log, drop the "when it is targeted for amdgcn-amd-amdhsa" clause from the first sentence.

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/doc/gdb.texinfo
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/amd64-tdep.c Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp
Comment thread gdb/testsuite/gdb.rocm/hip-catch-errors.exp Outdated
@palves palves assigned amd-shahab and unassigned palves Jun 9, 2026
@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 1d5cec4 to f06b3aa Compare June 9, 2026 14:06
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • remove "amdgcn target" in commit log
  • put $_hiperr example with conditional breakpoint
  • test: do not match thread name so things work on Windows
  • doc: reword the first paragraph
  • comments: remove "supposedly"
  • fix a few typos

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch 2 times, most recently from df5f4c7 to e66e680 Compare June 9, 2026 15:26
@palves

palves commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator
  • test: do not match thread name so things work on Windows

AFAICS, you're still matching the thread name. The thread name is the thing in double quotes.

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from e66e680 to 58dd05a Compare June 9, 2026 16:04
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • check hipError_t is a 4-byte enum.

@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • test: do not match thread name so things work on Windows

AFAICS, you're still matching the thread name. The thread name is the thing in double quotes.

Now the pattern is:

set hit_line "Thread $::decimal .* hit $type $::decimal \\(HIP error\\)"

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 58dd05a to 1f98718 Compare June 9, 2026 16:12
@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • test: replace the double-quoted part in hit_line with an any (.*).
  • test: fix indentation of a closing } for the foreach

@amd-shahab amd-shahab force-pushed the users/shvahedi/hip-catch-errs branch from 1f98718 to 080a051 Compare June 9, 2026 16:44
@amd-shahab

amd-shahab commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author
  • doc: add hiperr to catch section of the documentation. cross-linked by the "HIP API Errors" subsection.

@amd-shahab

Copy link
Copy Markdown
Contributor Author
  • test: spell out the whole header line for the output of info types hipError_t.

Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
Comment thread gdb/doc/gdb.texinfo Outdated
This change adds a new feature to GDB that enables it to stop
whenever HIP API errors are returned.  An abridged sample of
such execution looks like:

  (gdb) catch hiperr
  (gdb) run
  Thread 1 "prog" hit Catchpoint 1 (HIP error)
  HIP API call failed with error hipErrorInvalidDevice (101):
    invalid device ordinal

Moreover, a $_hiperr convenience variable is added that holds
the error.  For further details, please read the updated section
of documentation in this change.

Most of the implementation is in the newly added module called
break-catch-hiperr.c.  The naming and its context follows the
break-catch-throw.c module.

The whole mechanism works by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library.  Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.

There's an architecture-dependent section that is responsible
for extracting the error parameters (error number, error name,
error string) from a single argument passed to the __hipOnError()
function as a structure pointer.  The expected format of this
struct is:

  uint32_t    version;   // for now version is 1; future ones will
                         // be backward compatible by means of
                         // extending this structure
  uint32_t    err_no;    // the error number
  const char *err_name;  // name of the error
  const char *err_str;   // a phrase about the error

The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered.  It will
affect the reported data of the catchpoint though.

A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:

1. (Pending) catchpoints work.
2. Multiple catchpoints in one execution are reported accordingly.
3. Errors of indirect HIP calls like "kernel<<<...>>>()" are caught.
4. The convenience variable can be used for conditional situations.
5. Conditional and temporary catchpoints work as expected.
6. $_hiperr is printed correctly in different contexts.

Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
@amd-shahab

amd-shahab commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author
  • doc: drop [if condition...] part from catch hiperr entry.
  • doc: reword the catch hiperr section to the suggested paragraph.
  • doc: add @cindex for HIP runtime error to go with HIP runtime error convenience variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants