[fix](editlog) Fix concurrency bugs and robustness issues in editlog subsystem by dataroaring · Pull Request #62051 · apache/doris

dataroaring · 2026-04-02T11:01:12Z

Summary

Fix 8 bugs in the editlog subsystem (concurrency, data loss, correctness):

EditLog.txId data race — Convert txId from long to AtomicLong. Flusher and DDL threads read/write without synchronization, risking torn reads.
EditLog.logEdit(op, List) missing roll check — Bulk-write path never checked edit_log_roll_num, so edit log wouldn't roll.
BDBEnvironment.close() not thread-safe — close() iterated openedDatabases without writeLock, while openDatabase() modifies the list concurrently. Causes ConcurrentModificationException when replayer handles RollbackException while HTTP/heartbeat threads call getMaxJournalId().
EditLog.flushEditLog() silent hang — Producer threads blocked on req.lock.wait() never notified when System.exit(-1) is delayed by shutdown hooks. Now notifies all pending items before exit.
BDBJEJournal.write() burns journal IDs on failure — nextJournalId was advanced before the write, so failures skip IDs permanently. Now advanced only after successful commit.
BDBEnvironment.removeDatabase() index corruption — Manual index tracking computed wrong removal index. Replaced with iterator.remove().
JournalObservable.upperBound() off-by-one — Binary search with right = size - 1 could miss notifying observers whose targetJournalVersion exactly matches journalId. Fixed to standard upper_bound.
Dead editStream field — Never assigned (always null), so getEditLogSize() always NPEs. Removed both.

Test plan

Verify FE master starts and writes edit logs normally
Verify follower FE replays journals correctly
Verify SHOW FRONTENDS works concurrently with journal operations
Verify edit log rolls after edit_log_roll_num transactions via both single and bulk write paths

🤖 Generated with Claude Code

Thearas · 2026-04-02T11:01:18Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Copilot

Pull request overview

This PR hardens the FE editlog/journal subsystem by addressing concurrency hazards, NPE risks, and robustness issues around batching, rolling, and BDB environment lifecycle.

Changes:

Fix editlog rolling/counting behavior across code paths (queue-based flushing, direct writes, and list/batch writes).
Improve failure handling and thread-safety in the editlog flusher and BDB environment/database management.
Prevent NPEs and reduce ID “burn” on failed BDB journal writes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
fe/fe-core/src/main/java/org/apache/doris/qe/JournalObservable.java	Replaces custom binary search with a standard `upper_bound`-style implementation for observer notification.
fe/fe-core/src/main/java/org/apache/doris/persist/EditLog.java	Converts `txId` to `AtomicLong`, adds roll checks for batch-list logging, and notifies queued producers on fatal flush errors.
fe/fe-core/src/main/java/org/apache/doris/journal/bdbje/BDBJEJournal.java	Defers advancing `nextJournalId` until after successful commit/write to avoid burning IDs on failures.
fe/fe-core/src/main/java/org/apache/doris/journal/bdbje/BDBEnvironment.java	Fixes `removeDatabase()` removal logic and makes `close()` thread-safe by holding the write lock.
fe/fe-core/src/main/java/org/apache/doris/ha/BDBHA.java	Adds null-check for `getReplicationGroupAdmin()` in `getLeader()` to avoid NPE during partitions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T11:07:13Z

fe/fe-core/src/main/java/org/apache/doris/persist/EditLog.java

+            // Notify all pending items so producer threads don't hang if System.exit is delayed
+            for (EditLogItem req : batch) {
+                synchronized (req.lock) {
+                    req.finished = true;
+                    req.logId = -1;
+                    req.lock.notifyAll();
+                }
+            }


In the error path you notify waiters by setting finished=true and logId=-1, but the caller side (e.g. logEditWithQueue() / EditLogItem.await()) will treat this as a normal completion and continue executing follow-up logic with an invalid logId until the JVM actually halts. Consider propagating a fatal signal instead (e.g., store the throwable on the request and have await/logEditWithQueue throw an Error when completion is due to failure), so callers don’t proceed with partially persisted state during shutdown hooks / delayed exit.

Suggested change

// Notify all pending items so producer threads don't hang if System.exit is delayed

for (EditLogItem req : batch) {

synchronized (req.lock) {

req.finished = true;

req.logId = -1;

req.lock.notifyAll();

}

}

Copilot · 2026-04-02T11:07:13Z

fe/fe-core/src/main/java/org/apache/doris/ha/BDBHA.java

    public InetSocketAddress getLeader() {
        ReplicationGroupAdmin replicationGroupAdmin = environment.getReplicationGroupAdmin();
+        if (replicationGroupAdmin == null) {
+            return null;


getLeader() now returns null when getReplicationGroupAdmin() is null, but several call sites assume a non-null return and dereference it (e.g. StatisticsUtil.isMaster() calls getLeader().equals(...)). To avoid introducing new NPEs, either keep the previous “throw” behavior but with a clearer exception (so existing try/catch paths handle it), or update the contract/callers to null-check consistently before use.

Suggested change

return null;

throw new IllegalStateException("ReplicationGroupAdmin is not initialized");

…subsystem 1. EditLog.txId data race: Convert `txId` from plain `long` to `AtomicLong`. The flusher thread and DDL caller threads both read/write `txId` without synchronization, causing torn reads on 64-bit values. 2. EditLog.logEdit(op, List<T>) missing roll check: This bulk-write path incremented txId but never checked against `edit_log_roll_num`, so the edit log would never roll when this path was used. 3. BDBEnvironment.close() not thread-safe: `close()` iterated `openedDatabases` without holding `lock.writeLock()`, while `openDatabase()` concurrently modifies the same list under the write lock. This causes ConcurrentModificationException when a RollbackException triggers close() on the replayer thread while HTTP/heartbeat/RPC threads call openDatabase() via getMaxJournalId(). 4. EditLog.flushEditLog() silent hang on fatal error: When `System.exit(-1)` is called after a write failure, producer threads blocked on `req.lock.wait()` are never notified. If `System.exit` is delayed by shutdown hooks, those threads hang indefinitely. 5. BDBJEJournal.write() burns journal IDs on failure: Both single and batch write methods called `nextJournalId.getAndIncrement/getAndAdd` before the actual write, so a failed write permanently skips journal IDs. Now IDs are only advanced after successful commit. 6. BDBEnvironment.removeDatabase() index corruption: Manual index tracking with `openedDatabases.remove(index)` computed wrong index when the target was not the first element. Replaced with `iterator.remove()`. 7. JournalObservable.upperBound() off-by-one: The binary search used `right = size - 1` and `right = middle - 1`, which could miss notifying observers whose targetJournalVersion exactly equals the journalId. Fixed to standard upper_bound pattern. 8. Remove dead `editStream` field and `getEditLogSize()` method from EditLog. The field was never assigned (always null), so `getEditLogSize()` would always NPE.

dataroaring · 2026-04-02T14:24:14Z

Splitting into individual PRs for each fix.

Copilot AI review requested due to automatic review settings April 2, 2026 11:01

Copilot started reviewing on behalf of dataroaring April 2, 2026 11:02 View session

Copilot AI reviewed Apr 2, 2026

View reviewed changes

dataroaring force-pushed the editlog_review branch from c935ad4 to 1ec3092 Compare April 2, 2026 14:15

dataroaring changed the title ~~[fix](editlog) Fix concurrency bugs, NPE risks in editlog subsystem~~ [fix](editlog) Fix concurrency bugs and robustness issues in editlog subsystem Apr 2, 2026

dataroaring closed this Apr 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix](editlog) Fix concurrency bugs and robustness issues in editlog subsystem#62051

[fix](editlog) Fix concurrency bugs and robustness issues in editlog subsystem#62051
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:editlog_review

dataroaring commented Apr 2, 2026 •

edited

Loading

Uh oh!

Thearas commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

Copilot AI Apr 2, 2026

Uh oh!

dataroaring commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	return null;
	throw new IllegalStateException("ReplicationGroupAdmin is not initialized");

Conversation

dataroaring commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Thearas commented Apr 2, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

dataroaring commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dataroaring commented Apr 2, 2026 •

edited

Loading