Skip to content

[fix](editlog) Fix BDBJEJournal.write() burning journal IDs on failure#62063

Closed
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/bdbje-journal-id
Closed

[fix](editlog) Fix BDBJEJournal.write() burning journal IDs on failure#62063
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/bdbje-journal-id

Conversation

@dataroaring
Copy link
Copy Markdown
Contributor

Summary

  • Both single-item write() and batch write() call nextJournalId.getAndIncrement() / getAndAdd() before the actual BDB put/commit
  • If the write fails (e.g. ReplicaWriteException, InsufficientAcksException), the journal IDs are permanently consumed but no journal entry exists
  • On retry, the write uses a new ID, creating gaps in the journal ID sequence
  • These gaps can confuse the replayer which expects contiguous journal IDs
  • Fix: read the current ID with get() before writing, advance with incrementAndGet() / addAndGet() only after successful commit. Both write methods are synchronized, so the pattern is safe.

Test plan

  • Verify journal IDs remain contiguous after transient write failures
  • Verify replayer correctly replays all journals without gaps

🤖 Generated with Claude Code

…uccessful write

Both single-item write() and batch write() call nextJournalId.getAndIncrement()
/ getAndAdd() before the actual BDB put/commit. If the write fails (e.g.
ReplicaWriteException, InsufficientAcksException), the journal IDs are
permanently consumed but no journal entry exists for them. On retry, the
write uses a new ID, creating gaps in the journal ID sequence.

These gaps can confuse the replayer which expects contiguous journal IDs,
potentially causing it to skip entries or fail to find expected journals.

Fix: read the current ID with nextJournalId.get() before writing, and only
advance the ID with incrementAndGet()/addAndGet() after the write succeeds.
Both write methods are synchronized, so the get-then-advance pattern is safe.
Copilot AI review requested due to automatic review settings April 2, 2026 14:29
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 2, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes journal ID gaps by only advancing nextJournalId after a successful BDBJE put/commit, preventing “burned” IDs on transient write failures that confuse the replayer.

Changes:

  • Batch write(JournalBatch) now reads nextJournalId with get() and advances via addAndGet() only after a successful transaction commit.
  • Single-item write(short, Writable) now reads nextJournalId with get() and advances via incrementAndGet() only after a successful put.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

long dataSize = 0;
long firstId = nextJournalId.getAndAdd(entitySize);
// Reserve IDs only after successful commit to avoid burning IDs on write failure.
long firstId = nextJournalId.get();
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextJournalId.addAndGet(entitySize) is performed inside the retry loop. This is safe only if the method exits/breaks immediately after a successful commit; otherwise a future refactor (or a subtle control-flow change) could advance the counter multiple times for the same batch. To make this robust, consider structuring the success path to return/break immediately after txn.commit() (and then advance once), or perform the addAndGet in a single place that is executed exactly once when writeSucceed is finalized.

Copilot uses AI. Check for mistakes.

txn.commit();
txn = null;
nextJournalId.addAndGet(entitySize);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nextJournalId.addAndGet(entitySize) is performed inside the retry loop. This is safe only if the method exits/breaks immediately after a successful commit; otherwise a future refactor (or a subtle control-flow change) could advance the counter multiple times for the same batch. To make this robust, consider structuring the success path to return/break immediately after txn.commit() (and then advance once), or perform the addAndGet in a single place that is executed exactly once when writeSucceed is finalized.

Copilot uses AI. Check for mistakes.
Comment on lines +242 to +243
// id is the key. Reserve ID only after successful write to avoid burning IDs on failure.
// This is safe because the method is synchronized.
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says “Reserve ID only after successful write”, but the code still selects the ID before the write; what changes is when nextJournalId is advanced. Consider rewording to something like “Advance nextJournalId only after successful write/commit” to match the actual behavior and avoid confusion for future maintainers.

Suggested change
// id is the key. Reserve ID only after successful write to avoid burning IDs on failure.
// This is safe because the method is synchronized.
// id is the key. We advance nextJournalId only after successful write/commit
// to avoid burning IDs on failure. This is safe because the method is synchronized.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants