Fix race condition causing flaky hang in WritePreparedTransactionSeqnoTest #14361

Open
anand1976 wants to merge 1 commit into facebook:main from anand1976:txn_test

Conversation

@anand1976
Contributor

Summary

  • Fix a race condition in three WritePreparedTransactionSeqnoTest tests (SeqnoGoesBackwardsDuringErrorRecovery, SeqnoDiscrepancyDuringErrorRecovery, ConcurrentWritesDuringErrorRecovery) that could cause permanent hangs.
  • The tests inject a filesystem error during flush via a WriteManifest sync point callback, then wait for background error recovery to complete. The bug was in the ordering of cleanup once recovery had started: SetFilesystemActive(true) was called before ClearCallBack, leaving a window in which recovery's ResumeImpl could trigger the still-registered callback and disable the filesystem again. With the filesystem permanently disabled, every recovery retry failed and recovery exited without firing the RecoverSuccess sync point, so the test thread blocked forever.
  • The fix swaps the order so ClearCallBack is called before SetFilesystemActive(true), ensuring the filesystem cannot be re-disabled by a late callback firing.
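
For readers unfamiliar with the pattern, the ordering problem can be sketched with a small single-threaded model. This is not the RocksDB test code: FakeSyncPoint, FakeFilesystem, RecoveryAttempt, and RunScenario are hypothetical stand-ins, and the race is simulated by running one recovery attempt in the window between the two cleanup steps rather than with real threads.

```cpp
#include <cassert>  // used by the usage example below
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a sync point registry (not RocksDB's SyncPoint).
class FakeSyncPoint {
 public:
  void SetCallBack(const std::string& point, std::function<void()> cb) {
    callbacks_[point] = std::move(cb);
  }
  void ClearCallBack(const std::string& point) { callbacks_.erase(point); }
  // Invoked by the code under test when it reaches the named point.
  void Process(const std::string& point) {
    auto it = callbacks_.find(point);
    if (it != callbacks_.end()) it->second();
  }

 private:
  std::map<std::string, std::function<void()>> callbacks_;
};

// Hypothetical stand-in for a fault-injection filesystem.
struct FakeFilesystem {
  bool active = true;
};

// Models one recovery attempt: it reaches the WriteManifest sync point,
// then succeeds only if the filesystem is still active afterwards.
bool RecoveryAttempt(FakeSyncPoint& sp, FakeFilesystem& fs) {
  sp.Process("WriteManifest");
  return fs.active;
}

enum class Order { kEnableThenClear, kClearThenEnable };

// Returns true if recovery eventually succeeds when one recovery attempt
// lands in the window between the two cleanup steps (the worst case).
bool RunScenario(Order order) {
  FakeSyncPoint sp;
  FakeFilesystem fs;
  // Inject the error: the callback disables the filesystem.
  sp.SetCallBack("WriteManifest", [&fs] { fs.active = false; });
  sp.Process("WriteManifest");  // the flush hits the injected error

  bool recovered = false;
  if (order == Order::kEnableThenClear) {
    fs.active = true;                      // step 1: re-enable filesystem
    recovered = RecoveryAttempt(sp, fs);   // window: callback fires again
    sp.ClearCallBack("WriteManifest");     // step 2: too late
  } else {
    sp.ClearCallBack("WriteManifest");     // step 1: callback is gone
    recovered = RecoveryAttempt(sp, fs);   // window: fails, but harmlessly
    fs.active = true;                      // step 2: re-enable filesystem
  }
  // A later retry, after both cleanup steps have run.
  return recovered || RecoveryAttempt(sp, fs);
}
```

In this model, `assert(RunScenario(Order::kClearThenEnable))` passes and `assert(!RunScenario(Order::kEnableThenClear))` passes: with the old ordering the late callback leaves the filesystem disabled, so no retry can succeed, while with the new ordering the attempt in the window only fails once and the next retry recovers.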

Test plan

  • Stress tested with gtest_parallel (500 iterations, 32 workers, 60s timeout) with no hangs observed.
  • Before the fix, the hang reproduced at a ~7% rate under stress with a 15s timeout.

…Test

Clear the WriteManifest sync point callback before re-enabling the
filesystem, not after. The previous ordering allowed recovery's
ResumeImpl to trigger the callback between SetFilesystemActive(true)
and ClearCallBack, re-disabling the filesystem permanently and causing
recovery to exit without firing the RecoverSuccess sync point, leaving
the test thread blocked forever.
@meta-codesync

meta-codesync bot commented Feb 20, 2026

@anand1976 has imported this pull request. If you are a Meta employee, you can view this in D93929251.
