8381650: Vector rotate operations on AArch64 with NEON #30574
raneashay wants to merge 1 commit into openjdk:master
Before this patch, on AArch64 processors with NEON, `RotateLeftV` nodes were decomposed into expressions of the form `(x << n) | (x >> (N-n))`, where `n` is the number of bits to rotate by and `N` is the size of the type in bits. `RotateRightV` nodes were similar, with `n` substituted by `N-n`. This decomposition happens at the level of Ideal graph rewrites, and these expressions translate to three instructions on NEON: `SHL` for `x << n`, `USHR` for `x >> (N-n)`, and `ORR` for combining the two values.

However, NEON supports the `SLI` instruction, which shifts left while also preserving the destination register's low bits that a pure left-shift operation would have overwritten with zeroes. This allows us to lower a rotate operation into `USHR` + `SLI` instructions, thus emitting one fewer instruction than before. Of course, this only works when the number of bits to rotate by is a known constant, so this patch does not modify the lowering of variable-count rotates, letting them decompose into LeftShift + RightShift + Or nodes as before.

Perhaps of note, this patch enables the optimized lowering not just for 32- and 64-bit integers, but also for subword types (specifically, `byte` and `short`). I've included a good deal of tests for coverage, but I am unsure whether anything in the rest of C2's compilation might break from allowing subword types to be lowered the same way as `int` and `long` types.
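To make the `USHR` + `SLI` lowering described above concrete, here is a minimal scalar sketch that models a single 8-bit lane in plain C++. It is not HotSpot code: the helper names (`ushr8`, `sli8`, `rotl8_reference`, `rotl8_ushr_sli`) are hypothetical, and the model assumes the usual NEON immediate ranges (1 to 8 for `USHR`, 0 to 7 for `SLI` on byte lanes).

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 8-bit NEON lane. All helper names are hypothetical.

// USHR: logical shift right by an immediate in [1, 8]; shifting by 8 yields 0.
static uint8_t ushr8(uint8_t src, int shift) {
    return shift >= 8 ? 0 : static_cast<uint8_t>(src >> shift);
}

// SLI: shift src left by an immediate in [0, 7] and insert the result into dst,
// leaving the low `shift` bits of dst unchanged.
static uint8_t sli8(uint8_t dst, uint8_t src, int shift) {
    uint8_t keep_mask = static_cast<uint8_t>((1u << shift) - 1);
    return static_cast<uint8_t>((src << shift) | (dst & keep_mask));
}

// Reference rotate-left using the three-operation decomposition: SHL, USHR, ORR.
static uint8_t rotl8_reference(uint8_t x, int n) {
    n &= 7;
    return static_cast<uint8_t>((x << n) | (x >> ((8 - n) & 7)));
}

// Rotate-left as a USHR into dst followed by an SLI into the same dst.
static uint8_t rotl8_ushr_sli(uint8_t x, int n) {
    int lshift = n & 7;
    int rshift = 8 - lshift;         // always in [1, 8]
    uint8_t dst = ushr8(x, rshift);  // high bits of x land in the low end of dst
    return sli8(dst, x, lshift);     // low bits of x move up; dst's low bits survive
}
```

Checking `rotl8_ushr_sli` against `rotl8_reference` for every byte value and every count from 0 to 7 is a cheap way to convince oneself that the two-instruction sequence computes the same rotation as the three-instruction one.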
    int raw_shift = (int)$shift$$constant;

    // Compute left and right shift amounts.
    int lshift, rshift;
    if (opc == Op_RotateLeftV) {
      lshift = raw_shift & (esize - 1);
      rshift = esize - lshift;
    } else {
      assert(opc == Op_RotateRightV, "unexpected opcode");
      rshift = raw_shift & (esize - 1);
      lshift = esize - rshift;
    }
This looks somewhat rococo.
Suggested change:

    - int raw_shift = (int)$shift$$constant;
    - // Compute left and right shift amounts.
    - int lshift, rshift;
    - if (opc == Op_RotateLeftV) {
    -   lshift = raw_shift & (esize - 1);
    -   rshift = esize - lshift;
    - } else {
    -   assert(opc == Op_RotateRightV, "unexpected opcode");
    -   rshift = raw_shift & (esize - 1);
    -   lshift = esize - rshift;
    - }
    + int raw_shift = checked_cast<int>(opc == Op_RotateLeftV ?
    +                                  $shift$$constant : -$shift$$constant);
    + int lshift = raw_shift & (esize - 1);
    + int rshift = -lshift & (esize - 1);
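The two ways of deriving the shift amounts can be compared outside of HotSpot with a small standalone sketch. Everything here is a hypothetical stand-in: `Shifts`, `shifts_branching`, and `shifts_negating` are illustrative names, `is_left` models `opc == Op_RotateLeftV`, and `count` models `$shift$$constant`. Under those assumptions, the two forms agree whenever the count is not a multiple of `esize`; when it is, they produce different amounts, and this excerpt does not show how such counts are handled before reaching this code.

```cpp
#include <cassert>

// Hypothetical container for the two immediates.
struct Shifts {
    int lshift;
    int rshift;
};

// Branching form, modeled on the code in the patch.
static Shifts shifts_branching(bool is_left, int count, int esize) {
    Shifts s;
    if (is_left) {
        s.lshift = count & (esize - 1);
        s.rshift = esize - s.lshift;
    } else {
        s.rshift = count & (esize - 1);
        s.lshift = esize - s.rshift;
    }
    return s;
}

// Branch-free form, modeled on the review suggestion: a right rotate by
// `count` is treated as a left rotate by `-count`.
static Shifts shifts_negating(bool is_left, int count, int esize) {
    int raw_shift = is_left ? count : -count;
    Shifts s;
    s.lshift = raw_shift & (esize - 1);
    s.rshift = -s.lshift & (esize - 1);
    return s;
}

// True when both forms yield the same pair of immediates.
static bool same_shifts(bool is_left, int count, int esize) {
    Shifts a = shifts_branching(is_left, count, esize);
    Shifts b = shifts_negating(is_left, count, esize);
    return a.lshift == b.lshift && a.rshift == b.rshift;
}
```

For example, with `esize == 8` a left rotate by 3 gives `lshift == 3` and `rshift == 5` in both forms, while a count of 0 gives `rshift == 8` in the branching form and `rshift == 0` in the branch-free form.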
               $src$$FloatRegister, rshift);
        __ sli($dst$$FloatRegister, get_arrangement(this),
               $src$$FloatRegister, lshift);
    }
Please move all of this logic to class MacroAssembler.
Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/30574/head:pull/30574
$ git checkout pull/30574

Update a local copy of the PR:
$ git checkout pull/30574
$ git pull https://git.openjdk.org/jdk.git pull/30574/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 30574

View PR using the GUI difftool:
$ git pr show -t 30574

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30574.diff