
merge main into amd-staging #1597

Merged
z1-cciauto merged 59 commits into amd-staging from
amd/merge/upstream_merge_20260227164042
Feb 28, 2026

Conversation

@ronlieb (Collaborator) commented Feb 27, 2026

No description provided.

jpienaar and others added 30 commits February 27, 2026 18:44
…lvm#181252)

Replace the manual region dissolution code in
simplifyBranchConditionForVFAndUF with the general
removeBranchOnConst. simplifyBranchConditionForVFAndUF now just creates
a (BranchOnCond true) or updates BranchOnTwoConds.

The loop then gets automatically removed by running removeBranchOnConst.

This removes a bunch of special logic to handle header phi replacements
and CFG updates. With the new code, there's no restriction on what kind
of header phi recipes the loop contains.

Note that VPEVLBasedIVRecipe needs to be marked as readnone. This is
technically unrelated, but I could not find an independent test that
would be impacted.

The code to deal with epilogue resume values now needs updating, because
we may simplify a reduction directly to the start value.

PR: llvm#181252
Currently for thin-lto, the imported static global values (functions,
variables, etc) will be promoted/renamed from e.g., foo() to
foo.llvm.<hash>(). Such a renaming caused difficulties in live patching
since the function name changes ([1]).

It is possible that some global value names have to be promoted to avoid
name collisions and linker failures. But in practice, the majority of name
promotions can be avoided.

In [2], the suggestion is that the thin-LTO pre-link phase decides whether
a particular global value needs name promotion. If it does, the name will
be promoted later in thinBackend().

I compiled a particular linux kernel version (latest bpf-next tree)
and found 1216 global values with suffix .llvm.<hash>. With this patch,
the number of promoted functions is 2, a 98% reduction from the
original kernel build.

If some native objects are not participating in LTO, name promotions
have to be done to avoid potential linker issues, so the current
implementation cannot be on by default. But in certain cases, e.g., a Linux
kernel build, people can enable the lld flag --lto-whole-program-visibility
to reduce the number of functions like foo.llvm.<hash>().

For ThinLTOCodeGenerator.cpp, which is used by the llvm-lto tool and a
few other rare cases, reducing the number of renamings due to promotion
is not implemented, as the lld flag '--lto-whole-program-visibility' is not
supported in ThinLTOCodeGenerator.cpp for now. In summary, this pull request
only supports the llvm-lto2 style workflow.

  [1] https://lpc.events/event/19/contributions/2212
  [2] https://discourse.llvm.org/t/rfc-avoid-functions-like-foo-llvm-for-kernel-live-patch/89400
Update the test to more cleanly handle making a 'blocking' call using a
custom command instead of Python `time.sleep`, which we cannot easily
interrupt.

This should improve the overall performance of the tests, locally they
took around 30s and now finish in around 6s.
…cross incremental scans (llvm#183328)

Add a test that verifies symlink aliases to a module map directory
produce the same PCM across incremental scans.
When there's a dependency cycle between modules, the dependency scanner
may encounter a deadlock. This was caused by not respecting the lock
timeout. But even with the timeout implemented, leaving
`unsafeMaybeUnlock()` unimplemented means trying to take a lock after a
timeout would still fail and prevent making progress. This PR implements
this API in a way that avoids UB on `std::mutex` (when it's unlocked by
someone else than the owner). Lastly, this PR makes sure that
`unsafeUnlock()` ends the wait of existing threads, so that they don't
need to hit the full timeout amount.

This PR also implements `-fimplicit-modules-lock-timeout=<seconds>` that
allows tweaking the default 90-second lock timeout, and adds `#pragma
clang __debug sleep` that makes it easier to achieve desired execution
ordering.

rdar://170738600
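The locking behavior described above can be sketched roughly as follows. This is a hedged illustration, not Clang's actual implementation: a lock built on `std::condition_variable` can be released by a non-owning thread without the UB that `std::mutex::unlock()` would incur, and releasing it wakes current waiters immediately rather than letting them run out the full timeout. The names (`SafeLock`, `tryLockFor`, `unsafeUnlock`) are illustrative.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative sketch only; names do not match Clang's implementation.
class SafeLock {
  std::mutex m;
  std::condition_variable cv;
  bool held = false;

public:
  // Try to acquire, giving up after `timeout` (the analogue of the
  // lock timeout behavior described above).
  bool tryLockFor(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> g(m);
    if (!cv.wait_for(g, timeout, [&] { return !held; }))
      return false; // timed out; caller can still make progress
    held = true;
    return true;
  }

  // Safe to call from any thread (unlike std::mutex::unlock, which is UB
  // when invoked by a non-owner); wakes all current waiters immediately.
  void unsafeUnlock() {
    {
      std::lock_guard<std::mutex> g(m);
      held = false;
    }
    cv.notify_all();
  }
};
```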
…part 22) (llvm#183681)

Tests converted from test/Lower: intentout-deallocate.f90
Tests converted from test/Lower/Intrinsics: abs.f90, achar.f90,
acospi.f90, adjustl.f90
Part 1 of changes needed for USM alloc/dealloc impl.


This is part of the SYCL support upstreaming effort. The relevant RFCs
can be found here:


https://discourse.llvm.org/t/rfc-add-full-support-for-the-sycl-programming-model/74080
https://discourse.llvm.org/t/rfc-sycl-runtime-upstreaming/74479

---------

Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
Enable Flang to match Clang behavior for command-line recording in DWARF
producer strings when using -grecord-command-line.

Signed-off-by: Yangyu Chen <cyy@cyyself.name>
Just like other bitcode libs such as ockl.bc and ocml.bc, link asanrtl.bc
with '-mlink-builtin-bitcode' in the driver when GPU ASan is enabled.
…llvm#183781)

Two related crashes were fixed in vector.mask handling:

1. MaskOp::fold() crashes with a null pointer dereference when the mask
is all-true and the mask body has no maskable operation (only a
vector.yield). getMaskableOp() returns nullptr in this case, and the
fold was calling nullptr->dropAllUses(). Fixed by returning failure()
when there is no maskable op, deferring to the canonicalizer.

2. CanonializeEmptyMaskOp creates an invalid arith.select when the mask
type is a vector (e.g., vector<1xi1>) but the result type is a scalar
(e.g., i32). arith.select with a vector condition requires the value
types to be vectors of the same shape. Fixed by bailing out when any
result type doesn't match the mask shape.

Regression tests are added for both cases.

Fixes llvm#177833
…83199)

Using physical register 0, aka NoRegister, also just looked suspicious.
…vm#178587)" (llvm#183782)

There is a conflict with existing code; see
  llvm#178587
Revert for now; will resolve the conflict and resubmit later.
This allows us to support more lifetimes, and also gets rid of
the quadratic call to isPotentiallyReachable.

Reviewers: pcc, usama54321

Reviewed By: pcc

Pull Request: llvm#182425
Instead of excluding the whole package, push any existing parse_headers
failures to individual targets. In some cases we can avoid suppressing a
target by adding a few missing deps.
… support (llvm#183442)

This is the second of three patches aimed at supporting indirect symbol
handling for the SystemZ backend. It adds an external name to both MC
sections and symbols and makes the relevant printers and writers use
the external name when present. Furthermore, the ALIAS HLASM instruction
is emitted after every XATTR instruction.

Depends on llvm#183441.
…4171)

When hoisting loop invariant instructions, we can preserve profile
metadata because it depends solely on the condition (which is loop
invariant) rather than where we are in the control flow graph.
…x) (llvm#183363)

Add a pre-commit test case for the inefficient asm generated for
std::bit_floor(x) on PowerPC.
)

Summary:
This enables primarily `stop.cpp` and `descriptor.cpp`. Requires a
little bit of wrangling to get it to compile. Unlike the CUDA build,
this build uses an in-tree libc++ configured for the GPU. This is
configured without thread support, environment, or filesystem, and it is
not POSIX at all. So, no mutexes, pthreads, or get/setenv.

I tested stop, but I don't know if it's actually legal to exit from
OpenMP offloading.
…m#182512)

LLVM converts a sqrt libcall to an intrinsic call if the argument is known
to be in range (greater than or equal to 0.0). In this case the compiler is
not able to deduce the non-negativity on its own. Extended ValueTracking
to understand such loops.

Created new APIs for matching intrinsics with three operands (these
previously existed only for two operands):
`matchSimpleTernaryIntrinsicRecurrence` and `matchThreeInputRecurrence`.

Fixes llvm#174813
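A hedged sketch of the loop shape in question (an assumed example, not taken from the PR's test cases): the induction variable starts at zero and only grows, so every sqrt argument is provably non-negative and the libcall can be treated as the intrinsic without a guard.

```cpp
#include <cmath>

// Assumed example loop shape, not from the PR itself: the induction
// variable starts at 0.0 and strictly increases, so std::sqrt never sees
// a negative argument.
double sumOfSquareRoots(int n) {
  double acc = 0.0;
  for (double x = 0.0; x < static_cast<double>(n); x += 1.0)
    acc += std::sqrt(x); // argument is >= 0.0 on every iteration
  return acc;
}
```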
…lvm#181030)

This implements the TOKENIZE intrinsic per the Fortran 2023 Standard.

TOKENIZE is a more complicated addition to the flang intrinsics, as it
is the first subroutine that has multiple unique footprints. Intrinsic
functions have already addressed this challenge, however subroutines and
functions are processed slightly differently and the function code was
not a good 1:1 solution for the subroutines. To solve this, the function
code was used as a model to add error buffering within the intrinsics
processing and to select the most appropriate error message for a
given subroutine footprint.

A simple FIR compile test was added to show the proper compilation of
each case. A thorough negative path test has also been added, ensuring
that all possible errors are reported as expected.

Testing prior to commit:

= check-flang ==========================================
```
Testing Time: 139.51s

Total Discovered Tests: 4153
  Unsupported      :   77 (1.85%)
  Passed           : 4065 (97.88%)
  Expectedly Failed:   11 (0.26%)


FLANG Container Test completed 2 minutes (160 s).

Total Time: 2 minutes (160 s)
Completed : Wed Feb 11 04:05:50 PM CST 2026
```

= check-flang-rt ==========================================
```
Testing Time: 1.55s

Total Discovered Tests: 258
  Passed: 258 (100.00%)


FLANG Container Test completed 0 minutes (55 s).

Total Time: 0 minutes (56 s)
Completed : Wed Feb 11 04:08:32 PM CST 2026
```

= llvm-test-suite ==========================================
```
Testing Time: 1886.64s

Total Discovered Tests: 6926
  Passed: 6926 (100.00%)


CCE SLES Container debug compile completed 31 minutes (1895 s).
CCE SLES Container debug install completed in 0 minutes (0 s).

Total Time: 31 minutes (1895 s)
Completed : Wed Feb 11 05:46:52 PM CST 2026
```

Additionally, (FYI) an executable test has been written and will be
added to the llvm-test-suite under a separate PR.

---------

Co-authored-by: Kevin Wyatt <kwyatt@hpe.com>
…#183176)

Adjusting `VariableReferenceStorage` to only need to track permanent vs
temporary storage by making `VariableStore` the common base class.

Moved the subclasses of `VariableStore` into the Variables.cpp file,
since they're no longer referenced externally.

Expanded the tests by adding an updated core dump with variables in
the argument scope that we can use to validate variable storage.
…183405)

This commit updates the LLVM::decomposeValue and LLVM::composeValue
methods to handle aggregate types - LLVM arrays and structs, and to have
different behaviors on dealing with types like pointers that can't be
bitcast to fixed-size integers. This allows the "any type" on
gpu.subgroup_broadcast to be more comprehensive - you can broadcast a
memref to a subgroup by decomposing it, for example.

(This branched off of getting an LLM to implement
ValueBoundsOpInterface on subgroup_broadcast, having it add handling
for the dimensions of shaped types, and realizing that there's no
fundamental reason you can't broadcast a memref or the like)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…low (llvm#181755)

Rather than mapping out full "reachability" between blocks in a region
to find loops and using `LoopBlocks` to find the bodies of said loops,
use SCCs (strongly-connected components) to provide this information.

This brings in LLVM's generic `SCCIterator` (which uses Tarjan's
algorithm) as the implementation for sorting the basic blocks of the CFG
into their SCCs.

This PR greatly reduces the compile-time footprint of the pass, making
memory use and time taken negligible where it might previously have
caused stalls and OOM (e.g. llvm#47793,
usagi-coffee/tree-sitter-abl#114).

------

Supersedes llvm#179722

Fixes llvm#47793
Fixes llvm#165041 (probably)

Thanks to @jkbz64 for the initial investigations (w/ AI; see llvm#179722)
into why this pass was slow and memory-hungry, and for showing that SCCs
were the key.

Also thanks to the Cheerp compiler project for bringing `SCCIterator` to
light in this context ([blog
post](https://cheerp.io/blog/control-flow#fix-the-irreducible-control-flow),
[implementation](https://github.com/leaningtech/cheerp-compiler/blob/master/llvm/lib/CheerpUtils/FixIrreducibleControlFlow.cpp)).
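The SCC approach above can be sketched as a minimal stand-alone version of Tarjan's algorithm (which is what LLVM's generic `SCCIterator` implements); the names and data structures here are illustrative, not LLVM's. Components with more than one block (or a self-loop) are exactly the loops the pass needs to find.

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch of Tarjan's SCC algorithm; not LLVM's SCCIterator.
struct TarjanSCC {
  const std::vector<std::vector<int>> &adj; // adjacency list of the CFG
  std::vector<int> index, low, stack;
  std::vector<bool> onStack;
  std::vector<std::vector<int>> sccs; // emitted in reverse topological order
  int counter = 0;

  explicit TarjanSCC(const std::vector<std::vector<int>> &g)
      : adj(g), index(g.size(), -1), low(g.size(), 0),
        onStack(g.size(), false) {
    for (int v = 0; v < static_cast<int>(g.size()); ++v)
      if (index[v] < 0)
        dfs(v);
  }

  void dfs(int v) {
    index[v] = low[v] = counter++;
    stack.push_back(v);
    onStack[v] = true;
    for (int w : adj[v]) {
      if (index[w] < 0) {
        dfs(w);
        low[v] = std::min(low[v], low[w]);
      } else if (onStack[w]) {
        low[v] = std::min(low[v], index[w]);
      }
    }
    if (low[v] == index[v]) { // v is the root of an SCC: pop it off
      std::vector<int> scc;
      int w;
      do {
        w = stack.back();
        stack.pop_back();
        onStack[w] = false;
        scc.push_back(w);
      } while (w != v);
      sccs.push_back(scc);
    }
  }
};
```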
Fix linking of 'ockl.bc' for OpenMP by switching from
`-mlink-bitcode-file` to `-mlink-builtin-bitcode`
…lvm#182640)

This patch makes it so that renumbering indices when inserting
instructions into the SlotIndexes analysis renumbers the entire list if
the list is otherwise densely packed. This fixes a case we saw on
AArch64 with a lot of spills where every single spill instruction
insertion required a renumbering of most of the instructions in a large
function, making the operation approximately quadratic.

This is not NFC as heuristics depend on the SlotIndex numbers, although
this should mostly be a wash as LRs should be extended ~equally.
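A toy model of the fix above (illustrative only; LLVM's SlotIndexes is far more involved): indices are spaced with gaps so a new instruction can take a number between its neighbors, and when the neighborhood is densely packed the whole list is renumbered in one linear pass instead of cascading local renumberings, which is what made repeated insertion roughly quadratic.

```cpp
#include <cstddef>
#include <vector>

// Toy model only; constants and names are illustrative, not LLVM's.
constexpr int kSpacing = 4; // gap left between consecutive indices

// One O(n) pass that restores full gaps everywhere.
void renumberAll(std::vector<int> &idx) {
  for (std::size_t i = 0; i < idx.size(); ++i)
    idx[i] = static_cast<int>(i) * kSpacing;
}

// Insert a new index between positions p and p+1. If no free number
// remains between the neighbors, renumber everything once instead of
// shifting neighbors on every insertion.
void insertBetween(std::vector<int> &idx, std::size_t p) {
  if (idx[p + 1] - idx[p] < 2) // densely packed here
    renumberAll(idx);
  idx.insert(idx.begin() + p + 1, (idx[p] + idx[p + 1]) / 2);
}
```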
This PR adds `JSONFormat` support for reading and writing
`TUSummaryEncoding`. The implementation exploits similarities in the
structures of `TUSummary` and `TUSummaryEncoding` by reusing existing
`JSONFormat` support for `TUSummary`. Duplication of tests has been
avoided by parameterizing the test fixture that runs all relevant
read/write tests against `TUSummary`, for `TUSummaryEncoding`. This
ensures that the two serialization paths remain in lockstep.
Bigcheese and others added 14 commits February 27, 2026 12:29
After header search has found a header, it looks for module maps that
cover that header. This patch uses the parsed representation of module
maps to do this search instead of relying on FileEntryRef lookups after
stating headers in module maps.

This behavior is currently gated behind the
`-fmodules-lazy-load-module-maps` `-cc1` flag.
…perand is a block argument of its successor (llvm#183797)

When `simplifyBrToBlockWithSinglePred` merges a block into its sole
predecessor, it calls `inlineBlockBefore` which replaces each block
argument with the corresponding value passed by the branch. If one of
those values is itself a block argument of the successor block, the call
`replaceAllUsesWith(arg, arg)` is a no-op. Any uses of that argument
outside the block (e.g. in a downstream block) are therefore not
replaced, and when the successor block is erased the argument is
destroyed while those uses are still live, triggering the assertion
`use_empty() && "Cannot destroy a value that still has uses!"` in
`IRObjectWithUseList::~IRObjectWithUseList`.

Guard against this by returning early when any branch operand is a block
argument owned by the destination block.

Fixes llvm#126213
…m#181177)

Checks that isReversibleBranch() returns false
 - when the immediate value is 63 and needs +1 adjustment
 - when the immediate value is 0 and needs -1 adjustment

Checks that reverseBranchCondition() adjusts
 - the opcode
 - the immediate operand if necessary (+/-1)
 - the register operands if necessary (swap)
This variable ends up being unused in builds without assertions. Mark it
[[maybe_unused]] per the coding standards.
…and UF. (llvm#181252)"

This reverts commit 9c53215.

Appears to cause crashes with ordered reductions, revert while I
investigate
…m#183825)

Currently, as pointed out in the reviews for llvm#183405, decomposeValues
and composeValues should be able to emit zexts and truncations for cases
like i48 and vector<3xi16> becoming i32s, but currently that's an assert.
This commit fixes that limitation.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Account for masked VPInstruction when verifying the operands in the
constructor. Fixes a crash when trying to unroll VPlans for predicated
early exits.
The `exact` flag with the following semantics

> If the `exact` attribute is present, it is assumed that the index type
> width is such that the conversion does not lose information. When this
> assumption is violated, the result is poison.
can be added to index_cast and index_castui operations. This unlocks
the following lowerings:

*   index_cast (signed) exact     -> trunc nsw
*   index_castui (unsigned) exact -> trunc nuw
*   index_castui nneg exact       -> trunc nuw nsw

Changes:

* Add ArithExactFlagInterface.
* Update Arith_IntBinaryOpWithExactFlag to use ArithExactFlagInterface.
* Update IndexCastOp and IndexCastUIOp to declare
`ArithExactFlagInterface`.
* Update canonicalization patterns.
* Update roundtrip, lowering, and canonicalization tests.
Updates formatter_bytecode.py to support compilation and disassembly for
synthetic formatters, in other words support for multiple functions
(signatures).

This includes a number of other changes:
* String parsing and encoding have bugs fixed
* CLI args are updated, primarily to support an output file
* Added uleb encoding/decoding support

This work is a prelude to ongoing work on a Python-to-formatter-bytecode
compiler. The Python compiler will emit assembly, and this module
(formatter_bytecode) will compile it into binary bytecode.
Fixing test failures on my local desktop with incremental
building.

@z1-cciauto z1-cciauto merged commit e884a8c into amd-staging Feb 28, 2026
37 checks passed
@z1-cciauto z1-cciauto deleted the amd/merge/upstream_merge_20260227164042 branch February 28, 2026 02:10