gdb: add "catch hiperr" to break on HIP API errors#129
Conversation
|
@amd-shahab If this shouldn't be tested yet and isn't mergeable until the dependent PR, feel free to label it "ci:skip". |
lancesix
left a comment
There was a problem hiding this comment.
First quick first pass of comments.
One question regarding the test: what kind of guarantee do we have that the error code and error string are going to be stable over time? Let's imagine that the hip runtime decides to localize the error string (not the name). Couldn't that be a legitimate thing to do, but it would break the test, would it?
823bf4f to
745f0bc
Compare
|
745f0bc to
1da282b
Compare
|
1da282b to
d616325
Compare
|
d616325 to
4b40f98
Compare
|
4b40f98 to
5005bc1
Compare
|
81df767 to
5a4629e
Compare
b2e7ebe to
3ff9b7d
Compare
6674390 to
159656f
Compare
159656f to
79a5af5
Compare
|
79a5af5 to
93dcde0
Compare
|
93dcde0 to
be5705e
Compare
|
be5705e to
1d5cec4
Compare
palves
left a comment
There was a problem hiding this comment.
In commit log, drop the "when it is targeted for amdgcn-amd-amdhsa" clause from the first sentence.
1d5cec4 to
f06b3aa
Compare
|
df5f4c7 to
e66e680
Compare
AFAICS, you're still matching the thread name. The thread name is the thing in double quotes. |
e66e680 to
58dd05a
Compare
|
Now the pattern is: |
58dd05a to
1f98718
Compare
|
1f98718 to
080a051
Compare
|
|
This change adds a new feature to GDB that enables it to stop
whenever HIP API errors are returned. An abridged sample of
such execution looks like:
(gdb) catch hiperr
(gdb) run
Thread 1 "prog" hit Catchpoint 1 (HIP error)
HIP API call failed with error hipErrorInvalidDevice (101):
invalid device ordinal
Moreover, a $_hiperr convenience variable is added that holds
the error. For further details, please read the updated section
of documentation in this change.
Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.
The whole mechanism works by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.
There's an architecture-dependent section that is responsible
for extracting the error parameters (error number, error name,
error string) from a single argument passed to the __hipOnError()
function as a structure pointer. The expected format of this
struct is:
uint32_t version; // for now version is 1; future ones will
// be backward compatible by means of
// extending this structure
uint32_t err_no; // the error number
const char *err_name; // name of the error
const char *err_str; // a phrase about the error
The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:
1. (Pending) catchpoints work.
2. Multiple catchpoints in one execution are reported accordingly.
3. Errors of indirect HIP calls like "kernel<<<...>>>()" are caught.
4. The convenience variable can be used for conditional situations.
5. Conditional and temporary catchpoints work as expected.
6. $_hiperr is printed correctly in different contexts.
Change-Id: Iff0808350cf7856da17ff07be8ed10c76f9370f8
|
This change relies on "Add __hipOnError debugger hook for HIP API failures" PR to be functional. Although, without it, ROCgdb won't be broken. Please read the commit message for the description of this change itself.
--
What
This change adds a new feature to GDB, when it is targeted for
amdgcn-amd-amdhsa, that enables it to stop whenever HIP API
errors are returned. An abridged sample of execution looks
like this:
(gdb) catch hiperr
(gdb) run
Thread 1 "hip" hit Catchpoint 1 (HIP error)
hipErrorInvalidDevice (101): invalid device ordinal
Moreover, a $_hiperr convenience variable is added that holds
the error code. For further details, please see the updated
documentation in this change.
Design
Most of the implementation is in the newly added module called
break-catch-hiperr.c. The naming and its context follows the
break-catch-throw.c module.
The whole mechanism work by setting a breakpoint (catchpoint)
on a __hipOnError() symbol, that must be provided by the CLR
library. Enabling the catchpoint when there's no __hipOnError
symbol, simply has no effect.
There's a architect dependent section that is responsible for
extracting the error parameters (code, name, description) from
a single argument passed to the __hipOnError() function as a
structure pointer. The expected format of this struct is:
uint32_t version; // at this point, only 1 is supported
uint32_t code; // the numerical error code
const char *name; // name of the error
const char *desc; // a phrase about the error
The code is devised in such a way that if the __hipOnError()
exists, but for whatever reason the parameters cannot be deduced,
it won't interfere with the catchpoint being triggered. It will
affect the reported data of the catchpoint though.
Test
A new test, gdb.rocm/hip-catch-errors, is added to the mix to
ensure that:
ROCM-1548