[fix](executor) prevent BE crash when split process throws unexpectedly #62044

Open
eldenmoon wants to merge 2 commits into apache:master from eldenmoon:exp-safe-executor

@eldenmoon
Member

Catch exceptions around split->process() in TimeSharingTaskExecutor and
convert them to a split failure status, while keeping MEM_ALLOC_FAILED
mapped to MemoryLimitExceeded.

This avoids worker thread termination and a BE crash in cases such as:

terminate called after throwing an instance of 'doris::Exception'
what(): [E-7412] assert cast err:[E-7412] Bad cast from
...
doris::vectorized::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:420
10# doris::vectorized::PrioritizedSplitRunner::process() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/executor/time_sharing/prioritized_split_runner.cpp:104
11# doris::vectorized::TimeSharingTaskExecutor::_dispatch_thread() at /home/zcp/repo_center/doris_release/doris/be/src/vec/exec/executor/time_sharing/time_sharing_task_executor.cpp:568
12#
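
For context, an exception that escapes a thread's entry function ends in std::terminate, which is what the trace above shows. The sketch below illustrates the general guard pattern only. `SplitStatus`, `run_guarded`, and `dispatch_all` are hypothetical names invented for this example and are not Doris APIs; the actual change works with `Result` and `Status`.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a status type; not the Doris Status class.
struct SplitStatus {
    bool ok;
    std::string message;
};

// Runs one unit of work and converts any exception into a failed status,
// so nothing propagates out of the dispatch loop.
inline SplitStatus run_guarded(const std::function<void()>& work) {
    try {
        work();
        return SplitStatus{true, ""};
    } catch (const std::exception& e) {
        return SplitStatus{false, std::string("split failed: ") + e.what()};
    } catch (...) {
        return SplitStatus{false, "split failed: unknown exception"};
    }
}

// Simplified dispatch loop: a failing split is recorded and the loop
// continues with the next split instead of terminating.
inline std::vector<SplitStatus> dispatch_all(
        const std::vector<std::function<void()>>& splits) {
    std::vector<SplitStatus> results;
    for (const auto& split : splits) {
        results.push_back(run_guarded(split));
    }
    return results;
}
```

With this guard in place, a split that throws yields a failed status and the loop moves on to the next split.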

Copilot AI review requested due to automatic review settings April 2, 2026 07:30
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copilot AI left a comment

Pull request overview

Prevents backend worker-thread termination (and potential BE crash) when split->process() throws unexpectedly in the time-sharing scan executor by converting thrown exceptions into split failure statuses.

Changes:

  • Wrap PrioritizedSplitRunner::process() invocation in a try/catch to prevent exceptions from escaping the dispatch thread.
  • Map doris::Exception (including special-casing MEM_ALLOC_FAILED) and other exceptions to appropriate Status errors returned via Result.
  • Keep enable_thread_catch_bad_alloc scoped around split->process() to preserve existing memory-exception behavior.
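
The scoping mentioned in the last bullet relies on a scope guard, so the counter is restored even when the work throws. A minimal sketch of that idea follows. `ScopeExit`, `catch_bad_alloc_depth`, and `run_with_flag` are hypothetical names for illustration and do not reproduce the `Defer` helper or the flag used in the patch.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical thread-local counter, standing in for the flag named above.
static thread_local int catch_bad_alloc_depth = 0;

// Minimal scope guard: runs the callable when the scope exits,
// whether by normal return or by exception.
template <typename Fn>
class ScopeExit {
public:
    explicit ScopeExit(Fn fn) : _fn(std::move(fn)) {}
    ~ScopeExit() { _fn(); }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    Fn _fn;
};

// Returns the depth observed while the work ran. The guard decrements
// the counter on every exit path, including when `work` throws.
template <typename Work>
int run_with_flag(Work&& work) {
    ++catch_bad_alloc_depth;
    auto restore = [] { --catch_bad_alloc_depth; };
    ScopeExit<decltype(restore)> guard(restore);
    int seen = catch_bad_alloc_depth;
    std::forward<Work>(work)();
    return seen;
}
```

Because the decrement lives in a destructor, a try/catch placed around the call does not need to restore the counter itself.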


Comment on lines +556 to +569
auto blocked_future_result = [&]() -> Result<SharedListenableFuture<Void>> {
    try {
        doris::enable_thread_catch_bad_alloc++;
        Defer defer {[&]() { doris::enable_thread_catch_bad_alloc--; }};
        return split->process();
    } catch (const doris::Exception& e) {
        if (e.code() == doris::ErrorCode::MEM_ALLOC_FAILED) {
            return unexpected(Status::MemoryLimitExceeded(
                    "PreCatch error code:{}, {}, __FILE__:{}, __LINE__:{}, "
                    "__FUNCTION__:{}",
                    e.code(), e.to_string(), __FILE__, __LINE__, __PRETTY_FUNCTION__));
        }
        return unexpected(e.to_status());
    } catch (const std::exception& e) {

Copilot AI Apr 2, 2026

This try/catch block duplicates the exception-to-Status mapping logic already implemented in common/exception.h (including the enable_thread_catch_bad_alloc guard and the "PreCatch" message). To avoid future divergence, consider factoring the conversion into a shared helper (e.g., a function that converts an Exception/std::exception to Status) and reuse it here for the Result<...> error path.
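
One possible shape for such a shared helper is a function that rethrows the in-flight exception and maps it in a single place (sometimes called a Lippincott function). This is a sketch under assumptions: `Code`, `Status`, `CodedException`, `current_exception_to_status`, and `run_and_convert` are hypothetical stand-ins and not the types or functions in common/exception.h.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins; these are not the Doris types.
enum class Code { Ok, MemAllocFailed, MemoryLimitExceeded, InternalError };

struct Status {
    Code code;
    std::string msg;
};

class CodedException : public std::exception {
public:
    CodedException(Code code, std::string msg) : _code(code), _msg(std::move(msg)) {}
    Code code() const noexcept { return _code; }
    const char* what() const noexcept override { return _msg.c_str(); }

private:
    Code _code;
    std::string _msg;
};

// Must be called from inside a catch block: rethrows the exception
// currently being handled and maps it to a Status in exactly one place.
inline Status current_exception_to_status() {
    try {
        throw;
    } catch (const CodedException& e) {
        if (e.code() == Code::MemAllocFailed) {
            // Allocation failures are reported as a memory-limit error.
            return Status{Code::MemoryLimitExceeded, e.what()};
        }
        return Status{e.code(), e.what()};
    } catch (const std::exception& e) {
        return Status{Code::InternalError, e.what()};
    } catch (...) {
        return Status{Code::InternalError, "unknown exception"};
    }
}

// Call sites shrink to a single catch-all that delegates to the helper.
template <typename Fn>
Status run_and_convert(Fn&& fn) {
    try {
        std::forward<Fn>(fn)();
        return Status{Code::Ok, ""};
    } catch (...) {
        return current_exception_to_status();
    }
}
```

Keeping the mapping in one function means a new exception type or error code only has to be handled once, which is the divergence risk this comment raises.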

@eldenmoon force-pushed the exp-safe-executor branch from c1429c7 to a67425a on April 2, 2026 07:44
@eldenmoon added the dev/4.1.x, dev/4.0.x, and usercase labels on Apr 2, 2026
…executor.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@eldenmoon
Member Author

/review

@eldenmoon
Member Author

run buildall

@doris-robot

TPC-H: Total hot run time: 29125 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8f13f3a212046cbc7af127306d470223b5a72f7e, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17628	3663	3658	3658
q2	
q3	10655	926	601	601
q4	4682	470	355	355
q5	7485	1353	1141	1141
q6	188	164	137	137
q7	911	955	753	753
q8	9301	1380	1274	1274
q9	5402	5324	5252	5252
q10	6239	2021	1779	1779
q11	480	278	270	270
q12	841	681	509	509
q13	18046	2795	2169	2169
q14	284	287	262	262
q15	
q16	861	861	788	788
q17	1028	1072	747	747
q18	6460	5683	5537	5537
q19	1150	1286	1065	1065
q20	595	546	412	412
q21	4376	2586	2063	2063
q22	483	401	353	353
Total cold run time: 97095 ms
Total hot run time: 29125 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4603	4588	4570	4570
q2	
q3	4721	4775	4166	4166
q4	2049	2075	1324	1324
q5	4897	4896	5152	4896
q6	210	178	141	141
q7	2004	1795	1586	1586
q8	3341	3093	3039	3039
q9	8345	8286	8365	8286
q10	4601	4520	4262	4262
q11	574	399	382	382
q12	654	728	510	510
q13	2649	3050	2426	2426
q14	302	329	309	309
q15	
q16	768	770	685	685
q17	1344	1261	1235	1235
q18	7982	7385	7044	7044
q19	1127	1156	1127	1127
q20	2272	2263	2078	2078
q21	5985	5351	4846	4846
q22	517	492	414	414
Total cold run time: 58945 ms
Total hot run time: 53326 ms

@doris-robot

TPC-DS: Total hot run time: 179999 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8f13f3a212046cbc7af127306d470223b5a72f7e, data reload: false

query5	4349	641	506	506
query6	339	230	204	204
query7	4216	621	326	326
query8	327	268	222	222
query9	8742	3878	3893	3878
query10	472	407	341	341
query11	6642	5481	5148	5148
query12	185	136	127	127
query13	1278	625	441	441
query14	5628	5191	4703	4703
query14_1	4170	4086	4094	4086
query15	205	211	181	181
query16	979	480	453	453
query17	938	752	628	628
query18	2458	498	367	367
query19	241	223	189	189
query20	135	130	131	130
query21	229	144	113	113
query22	13957	14870	14631	14631
query23	18353	17093	16682	16682
query23_1	16769	16867	16587	16587
query24	7468	1735	1321	1321
query24_1	1361	1340	1320	1320
query25	576	521	441	441
query26	1260	333	178	178
query27	2668	611	371	371
query28	4492	1879	1887	1879
query29	960	685	582	582
query30	300	234	192	192
query31	1083	1043	931	931
query32	97	70	72	70
query33	551	348	293	293
query34	1176	1147	700	700
query35	727	769	655	655
query36	1251	1184	1062	1062
query37	157	101	84	84
query38	3082	3012	2976	2976
query39	915	890	872	872
query39_1	832	840	831	831
query40	238	158	136	136
query41	61	59	58	58
query42	274	275	274	274
query43	311	314	277	277
query44	
query45	210	194	191	191
query46	1155	1286	800	800
query47	2342	2337	2246	2246
query48	386	400	294	294
query49	637	532	439	439
query50	710	285	217	217
query51	4331	4337	4269	4269
query52	278	283	263	263
query53	329	337	272	272
query54	334	285	263	263
query55	99	96	88	88
query56	325	322	328	322
query57	1746	1725	1485	1485
query58	305	281	278	278
query59	2872	2991	2733	2733
query60	339	334	326	326
query61	161	158	155	155
query62	703	622	573	573
query63	311	266	273	266
query64	5321	1452	1141	1141
query65	
query66	1499	484	411	411
query67	24280	24262	24163	24163
query68	
query69	485	347	330	330
query70	1006	1006	994	994
query71	375	337	323	323
query72	3231	2882	2594	2594
query73	813	778	456	456
query74	9824	9691	9618	9618
query75	3535	3366	3009	3009
query76	2297	1139	775	775
query77	396	398	334	334
query78	11381	11360	10802	10802
query79	1472	1144	827	827
query80	817	773	668	668
query81	465	273	243	243
query82	1399	150	119	119
query83	378	286	263	263
query84	306	144	116	116
query85	873	528	453	453
query86	388	344	279	279
query87	3318	3199	3103	3103
query88	3594	2714	2721	2714
query89	477	412	376	376
query90	1991	182	176	176
query91	181	169	141	141
query92	78	78	68	68
query93	904	881	501	501
query94	553	348	304	304
query95	646	358	419	358
query96	984	763	350	350
query97	2670	2679	2557	2557
query98	240	238	224	224
query99	1063	1074	966	966
Total cold run time: 258776 ms
Total hot run time: 179999 ms

@doris-robot

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.93% (20046/37875)
Line Coverage 36.52% (188197/515326)
Region Coverage 32.80% (146182/445658)
Branch Coverage 33.94% (63969/188496)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 43.48% (10/23) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.71% (26599/37094)
Line Coverage 54.67% (280857/513777)
Region Coverage 51.79% (232944/449771)
Branch Coverage 53.19% (100567/189062)

@eldenmoon
Member Author

skip check_coverage
