
Fix severe memory leak during HMC thermalization and skip sweeps#1

Open
ThinkOffApp wants to merge 1 commit into main from fix-memory-leak

Conversation

@ThinkOffApp
Owner

This PR fixes a critical memory leak in the su2_2q_signal_scan.py script. The C++ lattice objects allocated via the action staple within the multihit/skip sweeps were not being garbage collected during the long thermalization loops, leading to OOM crashes. The fix explicitly calls clear_gpt_caches() and gc.collect() inside the thermalization and skip-sweep loops to maintain a flat memory profile.
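The loop pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the "clear caches each sweep" fix, not the actual script: `clear_gpt_caches()` is stubbed out here (its real body would empty GPT's internal caches), and the HMC update itself is omitted.

```python
import gc


def clear_gpt_caches():
    """Stand-in for the project's cache-clearing helper (body is hypothetical).

    In the real script this would empty GPT's cached staple/exp objects,
    e.g. default_staple_cache.clear() and default_exp_cache.clear().
    """
    pass


def thermalize(ntherm):
    """Run `ntherm` thermalization sweeps, freeing memory after each one."""
    for i in range(ntherm):
        # one_sweep(U)  # HMC update step, omitted in this sketch

        # Free memory aggressively during thermalization sweeps
        clear_gpt_caches()
        gc.collect()
    return ntherm
```

The same clear-and-collect pair would be repeated inside the skip-sweep loop between measurements.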


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1436606cea

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

    save_checkpoint("thermalization", i + 1, meas_start)

    # Free memory aggressively during thermalization sweeps
    clear_gpt_caches()

P1: Stop invalidating stencil caches on every sweep

Calling clear_gpt_caches() inside the thermalization loop clears default_staple_cache/default_exp_cache before the next one_sweep, so GPT must rebuild cached transport/stencil kernels on every sweep instead of reusing them (gpt/lib/gpt/qcd/gauge/stencil/staple.py:37-56, gpt/lib/gpt/core/foundation/lattice/matrix/exp.py:188-203). In long runs with large ntherm, this turns a one-time setup into per-sweep overhead and can drastically slow experiments; cache clearing should be batched less frequently rather than done every iteration.
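The reviewer's suggestion to batch cache clearing rather than clear on every iteration could be sketched as below. The interval value and the helper name `maybe_clear_caches` are illustrative assumptions, not part of the PR; the project's `clear_gpt_caches()` is stubbed as a comment.

```python
import gc

# Hypothetical batching interval: clear caches once every 50 sweeps
# instead of on every iteration, so GPT can reuse its cached
# transport/stencil kernels between clears.
CACHE_CLEAR_INTERVAL = 50


def maybe_clear_caches(sweep_index, interval=CACHE_CLEAR_INTERVAL):
    """Clear caches only every `interval` sweeps; return True if cleared."""
    if (sweep_index + 1) % interval == 0:
        # clear_gpt_caches()  # project helper, stubbed in this sketch
        gc.collect()
        return True
    return False
```

With this scheme a run with large ntherm pays the kernel-rebuild cost roughly ntherm/interval times instead of ntherm times, while still bounding peak memory between clears.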

