linux/efa: Fix config probe race with file-based completion#369

Open
mjonuschat wants to merge 1 commit into amzn:master from mjonuschat:fix/efa-config-race

@mjonuschat

Issue #, if available: #364

Description of changes:

Fix config probe race condition by replacing PID polling with file-based completion signaling

The EFA driver's CMake-based config probing launches compile_conftest.sh processes in the background
via nohup in runbg.sh and tracks them by polling /proc/$pid. The nohup wrapper exits before
the actual compile and config.h write complete, so the PID disappears from /proc
prematurely and the waiter treats the probe as finished. On machines with high core counts this leaves
config.h missing a significant number of defines (30-33 of the expected 57 in the test below), leading
to build failures or a misconfigured driver module.

Replace PID polling with file-based completion signaling. Each background probe now touches a
done-file on completion, and the waiter polls for that file instead of the PID. Config probes continue
to run in parallel. Rename wait_for_pid to wait_for_completion and pids to pending throughout to
reflect the new mechanism.
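
For illustration, the done-file handshake described above can be sketched as follows. This is a minimal standalone sketch and not the code from this PR: `run_probe`, the `HAVE_*` names, the `.done` suffix, and the temporary directory layout are all hypothetical, and `sleep 1` stands in for the real compile step. Only the names `wait_for_completion` and `pending` come from the description.

```shell
#!/bin/sh
# Hypothetical sketch of file-based completion signaling.
# Not the PR's implementation; names and paths are illustrative.

workdir=$(mktemp -d)
config="$workdir/config.h"
: > "$config"

# Stand-in for one background probe: do the work, append the define,
# and only then touch the done-file.
run_probe() {
    name=$1
    (
        sleep 1                                  # stands in for the compile
        echo "#define HAVE_$name 1" >> "$config"
        touch "$workdir/$name.done"              # completion signal, written last
    ) &
}

# Poll for the probe's done-file rather than for a PID in /proc.
wait_for_completion() {
    while [ ! -e "$workdir/$1.done" ]; do
        sleep 1
    done
}

pending="A B C"

# Launch all probes first so they run in parallel, then wait for each.
for p in $pending; do run_probe "$p"; done
for p in $pending; do wait_for_completion "$p"; done

defines=$(grep -c '^#define' "$config")
echo "defines: $defines"
rm -rf "$workdir"
```

The point of the ordering inside `run_probe` is that the done-file appears only after the write to config.h has finished, so the waiter cannot observe completion early regardless of which wrapper process exits first.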

Testing

Built against kernel 6.16 on a 72-CPU machine. A known-good baseline (57 defines) was established using a synchronous build. The fix was then verified over 10 consecutive builds, all producing identical sorted config.h output (10/10 PASS). PID polling consistently produced only 30-33 out of 57 defines under the same conditions.

| Approach | Defines | Expected | Result |
| --- | --- | --- | --- |
| PID polling (before) | 30-33 | 57 | FAIL |
| File-based completion signaling (this fix) | 57 | 57 | PASS (10/10) |
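
A comparison of sorted config.h output against a baseline, as used in the testing above, might look roughly like the sketch below. The exact commands used for this PR are not shown in the description, so this is an assumption: `compare_defines` is a hypothetical helper, and the three files are small stand-ins rather than real build output.

```shell
#!/bin/sh
# Hypothetical sketch: compare a build's defines against a known-good baseline.
# File names and contents are stand-ins, not real config.h output.

workdir=$(mktemp -d)

printf '#define HAVE_B 1\n#define HAVE_A 1\n' > "$workdir/baseline.h"
printf '#define HAVE_A 1\n#define HAVE_B 1\n' > "$workdir/complete.h"
printf '#define HAVE_A 1\n' > "$workdir/truncated.h"

# Print PASS if both files contain the same lines regardless of order,
# FAIL otherwise. Parallel probes may append defines in any order, so
# both sides are sorted before comparing.
compare_defines() {
    sort "$1" > "$workdir/expected.sorted"
    sort "$2" > "$workdir/actual.sorted"
    if cmp -s "$workdir/expected.sorted" "$workdir/actual.sorted"; then
        echo PASS
    else
        echo FAIL
    fi
}

complete_result=$(compare_defines "$workdir/baseline.h" "$workdir/complete.h")
truncated_result=$(compare_defines "$workdir/baseline.h" "$workdir/truncated.h")

echo "complete: $complete_result"
echo "truncated: $truncated_result"
rm -rf "$workdir"
```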

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

The config probing system launches compile_conftest.sh processes in the
background via nohup in runbg.sh and tracks them by polling /proc/$pid.
The nohup wrapper exits before the actual compile and config.h write
complete, causing the PID to disappear from /proc prematurely. On
machines with high core counts this results in config.h missing a
significant number of defines.

Replace PID polling with file-based completion signaling. Each
background probe now touches a done-file on completion, and the waiter
polls for that file instead. Config probes continue to run in parallel.

Rename wait_for_pid to wait_for_completion and pids to pending
throughout to reflect the new mechanism.