branch-4.1: [fix](streaming-job) Fix PG replication slot leak when streaming task is cancelled during pause/resume #62010 #62033

Open
github-actions[bot] wants to merge 1 commit into branch-4.1 from auto-pick-62010-branch-4.1

@github-actions github-actions bot commented Apr 2, 2026

Cherry-picked from #62010

… is cancelled during pause/resume (#62010)

### What problem does this PR solve?

Problem Summary:

When a PostgreSQL CDC streaming job is paused and resumed, the PG replication slot can be permanently leaked, causing all subsequent tasks to fail with:
`replication slot "doris_cdc_xxx" is active for PID xxx`

**Root cause:**

The CDC client reuses a single `SourceReader` instance per jobId (`Env.getOrCreateReader`). When FE cancels a task (PAUSE), the BE HTTP connection is closed, but the CDC client's `buildStreamRecords` thread may still be blocked in `pollRecords` (up to 15s timeout). Before the old task finishes, the new task (after RESUME) arrives at the same CDC client and calls `prepareStreamSplit`, which overwrites `this.streamReader` with a new Fetcher without closing the old one. The old Debezium reader (holding the PG replication connection) is leaked: its reference is lost, so `finishSplitRecords` in the old task's finally block closes the new Fetcher instead, and the PG slot is never released.
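
The sequence described above can be reproduced in miniature. The following is only a sketch: `Fetcher` and `LeakyReader` are hypothetical stand-ins written for illustration, not the actual Doris CDC client classes, and the two tasks are run sequentially on one thread to keep the ordering deterministic.

```java
public class SlotLeakDemo {

    /** Hypothetical stand-in for a Debezium-backed fetcher holding a replication connection. */
    static final class Fetcher implements AutoCloseable {
        final String taskId;
        boolean closed = false;

        Fetcher(String taskId) {
            this.taskId = taskId;
        }

        @Override
        public void close() {
            closed = true; // in the real client this would release the PG replication slot
        }
    }

    /** Hypothetical reader shared by every task of one jobId. */
    static final class LeakyReader {
        private Fetcher streamReader;

        Fetcher prepareStreamSplit(String taskId) {
            // Bug being illustrated: the previous fetcher is overwritten, not closed.
            this.streamReader = new Fetcher(taskId);
            return this.streamReader;
        }

        void finishSplitRecords() {
            // Closes whatever the field points to now, which may belong to another task.
            if (streamReader != null) {
                streamReader.close();
                streamReader = null;
            }
        }
    }

    static String run() {
        LeakyReader reader = new LeakyReader();
        Fetcher oldTask = reader.prepareStreamSplit("task-1"); // task cancelled by PAUSE
        Fetcher newTask = reader.prepareStreamSplit("task-2"); // task created after RESUME
        reader.finishSplitRecords();                           // old task's finally block
        return "old closed=" + oldTask.closed + ", new closed=" + newTask.closed;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "old closed=false, new closed=true"
    }
}
```

In this model the first fetcher is never closed, which corresponds to the replication slot staying active, while the second task loses the fetcher it just created.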

From the logs, the slot remained occupied for 25+ minutes until the test timed out:
`Failed to start replication stream at LSN{0/318EBC8}; when setting up multiple connectors for the same database host, please make sure to use a distinct replication slot name for each.`

**Fix:**

Close the previous stream/binlog reader before creating a new one in `prepareStreamSplit` (PG) and `prepareBinlogSplit` (MySQL). This ensures the old Debezium connection is properly released when a new task reuses the same SourceReader.
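
A minimal sketch of the close-before-replace pattern, assuming a simplified reader: `Fetcher` and `Reader` below are hypothetical illustration classes rather than the code changed in this PR, and the `synchronized` modifier is an added assumption that the description does not mention.

```java
public class CloseBeforeReplaceDemo {

    /** Hypothetical stand-in for a Debezium-backed fetcher holding a replication connection. */
    static final class Fetcher implements AutoCloseable {
        final String taskId;
        boolean closed = false;

        Fetcher(String taskId) {
            this.taskId = taskId;
        }

        @Override
        public void close() {
            closed = true; // in the real client this would release the PG replication slot
        }
    }

    /** Hypothetical reader shared by every task of one jobId. */
    static final class Reader {
        private Fetcher streamReader;

        // synchronized is an assumption of this sketch, not something the PR states.
        synchronized Fetcher prepareStreamSplit(String taskId) {
            // Release the previous fetcher before its reference is lost.
            if (streamReader != null) {
                streamReader.close();
            }
            streamReader = new Fetcher(taskId);
            return streamReader;
        }
    }

    static String run() {
        Reader reader = new Reader();
        Fetcher oldTask = reader.prepareStreamSplit("task-1"); // task cancelled by PAUSE
        Fetcher newTask = reader.prepareStreamSplit("task-2"); // task created after RESUME
        return "old closed=" + oldTask.closed + ", new closed=" + newTask.closed;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "old closed=true, new closed=false"
    }
}
```

Here the old fetcher is released at the moment the new task takes over the shared reader, so no connection is left without an owner.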

github-actions bot requested a review from yiguolei as a code owner April 2, 2026 04:19

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring dataroaring closed this Apr 2, 2026
@dataroaring dataroaring reopened this Apr 2, 2026
@hello-stephen
Contributor

run buildall

@JNSimba
Member

JNSimba commented Apr 2, 2026

run external
