Add NonLazyBind to __ykrt_control_point declaration. #300

Merged
ltratt merged 1 commit into main from ykllvm-nonlazybind-control-point on Feb 1, 2026

@Pavel-Durov Pavel-Durov commented Jan 23, 2026

This PR optimises calls to __ykrt_control_point by bypassing PLT (Procedure Linkage Table) indirection, reducing call overhead in the hot path of the interpreter loop.

Changes

1. ControlPoint.cpp — Add NonLazyBind Attribute

NF = Function::Create(FType, GlobalVariable::ExternalLinkage,
                      YK_NEW_CONTROL_POINT, M);
// Use NonLazyBind to avoid PLT indirection, reducing call overhead.
NF->addFnAttr(Attribute::NonLazyBind);

Marks the __ykrt_control_point function declaration with NonLazyBind, signalling to the backend that PLT should be avoided.

2. X86MCInstLower.cpp — Handle NonLazyBind in Patchpoint Lowering

When lowering patchpoints with a GlobalAddress target, check for the NonLazyBind attribute:

case MachineOperand::MO_GlobalAddress: {
  const GlobalValue *GV = CalleeMO.getGlobal();
  if (const Function *F = dyn_cast<Function>(GV)) {
    UseGOTPCREL = F->hasFnAttribute(Attribute::NonLazyBind);
  }
  // ...
}

If NonLazyBind is set, emit a GOT-relative load instead of an immediate move:

if (UseGOTPCREL) {
  // Emit: mov symbol@GOTPCREL(%rip), %ScratchReg
  MCSymbol *Sym = MCIL.GetSymbolFromOperand(CalleeMO);
  const MCExpr *Expr = MCSymbolRefExpr::create(
      Sym, MCSymbolRefExpr::VK_GOTPCREL, Ctx);

  EmitAndCountInstruction(MCInstBuilder(X86::MOV64rm)
                              .addReg(ScratchReg)
                              .addReg(X86::RIP)
                              .addImm(1)
                              .addReg(0)
                              .addExpr(Expr)
                              .addReg(0));
  EncodedBytes = 7 + (X86II::isX86_64ExtendedReg(ScratchReg) ? 3 : 2);
}

Code Generation Comparison

| Without NonLazyBind | With NonLazyBind |
| --- | --- |
| mov $symbol, %reg | mov symbol@GOTPCREL(%rip), %reg |
| call *%reg | call *%reg |
| PLT stub → lazy resolution | Direct GOT load → eager binding |
| 12–13 bytes | 9–10 bytes |

Rationale

  • PLT calls go through a stub that resolves the symbol lazily on first use
  • GOTPCREL calls load the address directly from the Global Offset Table (resolved at startup)
  • The control point is called on every iteration of the interpreter loop, making PLT overhead significant
  • This follows the same pattern LLVM uses in X86Subtarget::classifyGlobalFunctionReference() for functions with NonLazyBind
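
The difference between lazy and eager binding described above can be illustrated with a small, simplified model. This is only a sketch: `Resolver`, `LazySlot`, `EagerSlot` and `fakeControlPoint` are hypothetical names invented for illustration, and the real dynamic linker and PLT stubs work at the machine-code level rather than through C++ objects.

```cpp
#include <cassert>

using ControlPointFn = int (*)(int);

// Stand-in for the real callee; the name and signature are illustrative only.
int fakeControlPoint(int x) { return x + 1; }

// Models the dynamic linker's symbol lookup and counts how often it runs.
struct Resolver {
  unsigned lookups = 0;
  ControlPointFn resolve() {
    ++lookups;
    return &fakeControlPoint;
  }
};

// Lazy binding: the first call pays for the lookup, later calls reuse it.
struct LazySlot {
  ControlPointFn cached = nullptr;
  int call(Resolver &r, int arg) {
    if (cached == nullptr) // extra work on the call path, every time
      cached = r.resolve();
    assert(cached != nullptr);
    return cached(arg);
  }
};

// Eager binding: the slot is filled once, before any call is made.
struct EagerSlot {
  ControlPointFn target;
  explicit EagerSlot(Resolver &r) : target(r.resolve()) {}
  int call(int arg) const { return target(arg); }
};
```

In this model both variants perform exactly one lookup. What changes is when it happens (at first call versus at construction) and whether each call carries extra work before reaching the target, which is the part that matters for a function called on every interpreter loop iteration.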

Byte Encoding

  • MOV64rm (RIP-relative): 7 bytes (REX.W + opcode + ModR/M + 32-bit displacement)
  • CALL64r: 2 bytes (normal registers) or 3 bytes (extended registers R8–R15)
  • Total: 9–10 bytes vs 12–13 bytes for immediate move
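
The arithmetic above can be checked with a few compile-time constants. This is a standalone sketch: the constant and function names are made up for illustration and are not taken from LLVM, and the 10-byte figure for the immediate move assumes the usual REX.W + opcode + 64-bit immediate encoding.

```cpp
// Sizes in bytes of the x86-64 instructions discussed above. These are
// illustrative constants, not values read from LLVM.
constexpr unsigned kMovRipRelLoad = 7; // REX.W + opcode + ModR/M + disp32
constexpr unsigned kMovImm64 = 10;     // REX.W + opcode + imm64 (assumed)

// `call *%reg` is 2 bytes (opcode + ModR/M), plus a REX prefix for R8-R15.
constexpr unsigned callRegSize(bool extendedReg) {
  return extendedReg ? 3 : 2;
}

// Total size of the "load address, then call through register" sequence.
constexpr unsigned sequenceSize(bool useGotPcrel, bool extendedReg) {
  return (useGotPcrel ? kMovRipRelLoad : kMovImm64) + callRegSize(extendedReg);
}

static_assert(sequenceSize(true, false) == 9, "GOTPCREL, normal register");
static_assert(sequenceSize(true, true) == 10, "GOTPCREL, extended register");
static_assert(sequenceSize(false, false) == 12, "immediate, normal register");
static_assert(sequenceSize(false, true) == 13, "immediate, extended register");
```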

vext01 commented Jan 28, 2026

I ran simple.c under rr using ykllvm main and your branch and can confirm that this change does what it says on the tin.

In main, when we call the control point, we have a call [r11] which jumps to PLT resolution routines.

The first time this happens, this routine does quite a lot of computation (lots of looping and strcmp()), before we eventually land at __ykrt_control_point.

Subsequent calls to the control point are cheaper: the call [r11] still jumps to the PLT resolution routine, but the previous resolution is cached and we quickly jump to __ykrt_control_point. There is still a cmp presumably to check if the target has been cached (and presumably you can invalidate the cache by (e.g.) dlopen()ing a library that contains a symbol with the same name?).

So, your change saves us time by:

  • avoiding symbol resolution on first call to the control point (quite expensive, but only done once).
  • avoiding mov, test, jne for subsequent calls to the control point (less expensive than the initial symbol resolution, but probably worthwhile given the frequency we call the control point).

So, although you've touched parts of LLVM none of us are particularly familiar with:

  • a) your explanation makes sense to me.
  • b) it only affects patchpoints.

This means that the impact of the change is (mostly) contained.

One improvement I would suggest, though, is gating the modified behavior, perhaps behind the same flag that gates the pass that does control point injection (-yk-patch-control-point). That way, no LLVM tests that use patchpoints are affected: when the flag is not used, the original LLVM behavior applies.
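
As a rough sketch of what such gating amounts to, the decision can be written as a pure predicate. Everything here is hypothetical: `CalleeInfo`, `useGotPcrel` and the `gateEnabled` parameter are illustrative names, and this thread does not show how the flag is actually wired into X86MCInstLower.cpp.

```cpp
// Hypothetical summary of what the lowering code knows about the callee.
struct CalleeInfo {
  bool isFunction;     // the global address refers to a function
  bool hasNonLazyBind; // the function carries the NonLazyBind attribute
};

// With the gate off, the original lowering is always used, so existing
// patchpoint tests would see no change in behaviour.
constexpr bool useGotPcrel(bool gateEnabled, CalleeInfo callee) {
  return gateEnabled && callee.isFunction && callee.hasNonLazyBind;
}
```

Keeping the gate as the first condition means the new path cannot be reached unless the flag is set, regardless of which attributes a callee happens to carry.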

@stephenrkell

This looks good to me.

The only downside I can think of is that some tools like to breakpoint the PLT, and so miss cross-DSO calls that bypass it. Actually, ltrace is the only one I'm aware of that does this, but there may be others.

Where are the incoming calls to __ykrt_control_point coming from? Are they always cross-DSO or otherwise possibly-spanning-over-2GB? If not, then you could maybe do something even more direct if you don't care about keeping __ykrt_control_point preemptible. But I guess you've thought of that.
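
For context on the 2GB question: a direct `call rel32` on x86-64 encodes a signed 32-bit displacement measured from the end of the 5-byte instruction. The check below is a minimal sketch of that range test only. The function and constant names are hypothetical, the addresses in the usage lines are made up, and symbol preemptibility is a separate concern that this does not model.

```cpp
#include <cstdint>

// Size in bytes of a direct near call: opcode E8 followed by a rel32.
constexpr std::uint64_t kCallRel32Size = 5;

// True if `target` is reachable from a direct call placed at `callSite`.
// The displacement is relative to the address of the next instruction and
// must lie in [-2^31, 2^31 - 1].
constexpr bool fitsInRel32(std::uint64_t callSite, std::uint64_t target) {
  return target >= callSite + kCallRel32Size
             ? target - (callSite + kCallRel32Size) <= 0x7fffffffULL
             : (callSite + kCallRel32Size) - target <= 0x80000000ULL;
}

// A nearby target is reachable; one mapped far away is not.
static_assert(fitsInRel32(0x400000, 0x401000), "nearby target");
static_assert(!fitsInRel32(0x400000, 0x7f0000000000ULL), "distant target");
```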

jryans commented Jan 28, 2026

In broad terms, this seems sensible enough to me, though I confess I am also not especially familiar with this slice of the LLVM codebase.

I do wonder if you could potentially reduce duplication by triggering the code paths you've borrowed parts from (instead of duplicating bits of their behaviour), but perhaps that's not especially important in an experimental LLVM fork like this.

ltratt commented Jan 28, 2026

Thanks both for the reviews!

@Pavel-Durov Do we think deduplication is possible / reasonable, or is it a little harder than someone like me (who doesn't know the codebase) might think?

@Pavel-Durov
Author

> One improvement I would suggest, though, is gating the modified behavior, perhaps behind the same flag that gates the pass that does control point injection (-yk-patch-control-point). That way, no LLVM tests that use patchpoints are affected: when the flag is not used, the original LLVM behavior applies.

Added 8b7954905e9c63cae18b613135d1b8dba4d7a51d, de65382e96916b2d8e680bc331a44fc280bd0005

@Pavel-Durov
Author

@stephenrkell

Where Do Calls to __ykrt_control_point Come From?

Interpreter code calls __ykrt_control_point(mt, &loc) inside the main loop.
Example in yklua

Are They Always Cross-DSO?

Calls to __ykrt_control_point are cross-DSO.

The architecture is:

┌─────────────────────────────────┐     ┌──────────────────────────┐
│  Interpreter Binary             │────►│  libykcapi.so (cdylib)   │
│                                 │     │                          │
│  - Compiled with ykllvm         │     │  - Contains ykrt         │
│  - Links: -lykcapi              │     │  - Exports:              │
│  - Calls __ykrt_control_point   │     │    __ykrt_control_point  │
│    via patchpoint               │     │                          │
└─────────────────────────────────┘     └──────────────────────────┘                       

@Pavel-Durov
Author

> Thanks both for the reviews!
>
> @Pavel-Durov Do we think deduplication is possible / reasonable, or is it a little harder than someone like me (who doesn't know the codebase) might think?

I think this duplication is minimal and acceptable (though maybe I am missing something here): the patchpoint path is a narrow, self-contained change in X86MCInstLower.cpp.
The main benefit is that it avoids touching shared, target-agnostic code (like SelectionDAGBuilder). This also makes the patch smaller and, I think, easier to maintain when merging upstream LLVM.

ltratt commented Jan 30, 2026

Works for me. @vext01 OK with you? If so, we're probably ready for squashing.

vext01 commented Jan 30, 2026

Please squash

Apply the NonLazyBind function attribute to __ykrt_control_point so that,
together with the X86MCInstLower change, patchpoint calls to the control
point avoid PLT trampolines.
@Pavel-Durov Pavel-Durov force-pushed the ykllvm-nonlazybind-control-point branch from de65382 to fb75d84 Compare January 31, 2026 10:58
@Pavel-Durov
Author

Done 👉 fb75d84

@ltratt ltratt added this pull request to the merge queue Feb 1, 2026
Merged via the queue into main with commit 0d65c10 Feb 1, 2026
2 checks passed