Skip to content

[fix](editlog) Fix EditLog.txId data race by converting to AtomicLong#62059

Open
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/editlog-txid-race
Open

[fix](editlog) Fix EditLog.txId data race by converting to AtomicLong#62059
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/editlog-txid-race

Conversation

@dataroaring
Copy link
Copy Markdown
Contributor

Summary

  • The txId field is a plain long read/written by multiple threads (flusher, DDL callers, bulk writers) without synchronization
  • Long writes are not guaranteed atomic (JLS §17.7), and even on platforms with atomic 64-bit writes, there is no happens-before ordering between threads
  • Convert txId to AtomicLong and update all 6 read/write sites

Test plan

  • Verify FE master starts and writes edit logs normally
  • Verify edit log rolls correctly after edit_log_roll_num transactions

🤖 Generated with Claude Code

The txId field is a plain long that is read and written by multiple threads
without synchronization:
- The flusher thread increments txId in flushEditLog() (txId += batch.size())
- DDL caller threads increment txId in logEditDirectly() (txId++)
- logEdit(op, List) also increments txId (txId += entries.size())

Since long writes are not atomic on all JVMs (JLS §17.7 permits non-atomic
64-bit writes), concurrent read/write can produce torn values. Even on
platforms with atomic 64-bit writes, the lack of happens-before ordering
means one thread's write may not be visible to another.

Fix: convert txId from plain long to AtomicLong, updating all read/write
sites to use the appropriate atomic operations.
Copilot AI review requested due to automatic review settings April 2, 2026 14:28
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a data race on EditLog.txId by replacing a plain long with an AtomicLong, ensuring atomic reads/writes across multiple threads interacting with edit log flushing and direct logging.

Changes:

  • Convert txId from long to AtomicLong.
  • Update all txId increment/read/reset sites to use atomic operations.
  • Adjust logging to print the atomic value via get().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +267 to 274
txId.addAndGet(batch.size());
// update statistics, etc. (optional, can be added as needed)
if (txId >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", txId,
if (txId.get() >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", txId.get(),
Config.edit_log_roll_num);
rollEditLog();
txId = 0;
txId.set(0);
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get() / rollEditLog() / set(0) sequence is not atomic with respect to concurrent increments. Another thread can increment txId after the threshold check (or during rollEditLog()), and set(0) can then wipe out those increments, causing missed roll accounting and potentially delaying/accelerating future rolls. Consider guarding the roll decision/reset with a CAS pattern (e.g., capture the post-increment value and compareAndSet it back to 0 only if unchanged) or synchronizing the roll/reset with the same lock used for rolling so increments can’t be lost.

Copilot uses AI. Check for mistakes.
Comment on lines +1657 to 1664
long newTxId = txId.incrementAndGet();

if (txId >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", txId,
if (newTxId >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", newTxId,
Config.edit_log_roll_num);
rollEditLog();
txId = 0;
txId.set(0);
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to flushEditLog(), incrementAndGet() followed by rollEditLog() and set(0) can lose increments performed by other threads while the roll is in progress, because the reset is unconditional. If logEditDirectly() can run concurrently with other increment sites, prefer making the reset conditional (CAS) or performing the roll+reset under a shared lock that also covers all txId increments related to rolling.

Copilot uses AI. Check for mistakes.
Comment on lines +267 to +270
txId.addAndGet(batch.size());
// update statistics, etc. (optional, can be added as needed)
if (txId >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", txId,
if (txId.get() >= Config.edit_log_roll_num) {
LOG.info("txId {} is equal to or larger than edit_log_roll_num {}, will roll edit.", txId.get(),
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

txId.get() is called multiple times after an increment, which can make logs harder to interpret under concurrency (the logged value may not match the value used for the decision). Prefer capturing the post-increment value in a local long (e.g., from addAndGet(...)) and using that for both the threshold check and the log message.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants