add exit code dependent retry policy by aspiringmind-code · Pull Request #9276 · dmwm/CRABServer

aspiringmind-code · 2026-02-24T13:35:55Z

cmsdmwmbot · 2026-02-24T13:40:41Z

Jenkins results:

Python3 Pylint check: succeeded
Pycodestyle check: succeeded
- 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2738/artifact/artifacts/PullRequestReport.html

belforte · 2026-02-24T14:31:05Z

src/python/TaskWorker/Actions/RetryJob.py

+        delay = policy.get("delay", 900)
+        self.logger.info(f"Sleeping {delay} seconds before retry (exit code {exitCode})")
+        time.sleep(delay)
+        if exitCode in [8020, 8021, 8022, 8028, 84, 85, 86, 92, 134, 8001, 65]:


Why is this line there? Isn't what to do fully defined by the table?

belforte · 2026-02-24T14:36:51Z

src/python/TaskWorker/Actions/RetryJob.py

+# Exit-code dependent retry policy
+# ----------------------------------------------------------------------
+
+EXIT_RETRY_POLICY = {


This dictionary, which will grow as we add other exit codes, could be better organized: order the entries by increasing exit code value and include only the "long" codes, e.g. 8021.
The short exit codes can be added later using short_code = long_code % 128. Or, if we want to keep it easy for the reader to find a short exit code here, add it as a key in the sub-dictionary.
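A minimal sketch of the layout suggested here, assuming the table is keyed by long exit codes only and that short codes are derived with `long_code % 128`. The `delay` values and the helper names (`short_code`, `build_lookup`, `get_policy`) are illustrative and not taken from the PR.

```python
# Hypothetical table: long exit codes only, listed in increasing order.
# The delay values are placeholders, not the ones used in the PR.
EXIT_RETRY_POLICY = {
    8001: {"delay": 900},
    8020: {"delay": 900},
    8021: {"delay": 1800},
    8028: {"delay": 900},
}


def short_code(long_code):
    """Derive the short exit code as proposed in the review comment."""
    return long_code % 128


def build_lookup(policy_table):
    """Index every policy by its long code and by the derived short code."""
    lookup = {}
    for long_code in sorted(policy_table):
        # Copy the policy and record which long code it came from.
        policy = dict(policy_table[long_code], long_code=long_code)
        lookup[long_code] = policy
        # setdefault keeps the first entry if two long codes share a short code.
        lookup.setdefault(short_code(long_code), policy)
    return lookup


POLICY_LOOKUP = build_lookup(EXIT_RETRY_POLICY)


def get_policy(exit_code):
    """Return the policy for a long or short exit code, or None if unlisted."""
    return POLICY_LOOKUP.get(exit_code)
```

With a lookup like this, the retry logic can be driven by `get_policy(exitCode)` alone, so a separate hard-coded list of exit codes would not be needed.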

belforte

See a couple of inline comments.

belforte · 2026-02-24T15:58:16Z

More on the substance: it is not good to use sleep inside RetryJob, i.e. inside PostJob (which calls it).
We are limited in how many PostJobs can run concurrently, due to memory constraints. The preferred implementation would be what is done when waiting for ASO: exit with a proper exit code that tells DAGMan to rerun the Post (or Pre?) step after a delay (an example of a delay in PreJob is the use of deferTime there).

Notice that delaying the PostJob also delays status reporting, since the DAG node is still not completed. Rather, once we introduce resubmission delays of several hours (days?), we should worry about properly reporting this to the user.
I think that currently jobs are reported as "toRetry" or "cooloff" (unfortunately there is some inconsistency) when the DAG node has completed with an error but has not been resubmitted yet. At least that is a status that appears at times, but I have not done a careful study of the current implementation.
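A rough sketch of the non-blocking alternative described above, under stated assumptions: the script does not sleep, it only checks whether the delay has elapsed and returns a dedicated exit code that the DAG is configured to treat as "run me again later" (HTCondor DAGMan offers a DEFER option on SCRIPT lines for this). `DEFER_EXIT_CODE` and both function names are hypothetical; how CRAB's PreJob actually uses deferTime is not shown in this thread.

```python
import time

# Hypothetical value: it must match whatever status the DAG description
# declares as the "defer" status for this script.
DEFER_EXIT_CODE = 4


def seconds_until_retry(failure_time, delay, now=None):
    """Seconds still to wait before the job may be resubmitted (0 if none)."""
    now = time.time() if now is None else now
    return max(0, int(failure_time + delay - now))


def prejob_exit_code(failure_time, delay, now=None):
    """Return DEFER_EXIT_CODE while the delay is pending, 0 once it has elapsed.

    The process exits immediately in both cases, so no script slot or memory
    is held while waiting.
    """
    if seconds_until_retry(failure_time, delay, now) > 0:
        return DEFER_EXIT_CODE
    return 0
```

The failure time and the policy delay would have to be persisted by the PostJob (for example in the retry info file) so that a later invocation can read them back.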

cmsdmwmbot · 2026-02-24T17:30:37Z

Jenkins results:

Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 105 comments to review
Pycodestyle check: succeeded
- 185 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2739/artifact/artifacts/PullRequestReport.html

belforte · 2026-02-24T17:48:36Z

src/python/TaskWorker/Actions/PreJob.py

+        if os.path.exists(retry_info_file):
+            try:
+                with open(retry_info_file, "r", encoding="utf-8") as fd:
+                    retry_info = literal_eval(fd.read())


Time to make it a JSON file?
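One possible shape for that change, as a sketch only: read the file as JSON and fall back to `literal_eval` for files written in the old format. The function names `load_retry_info` and `save_retry_info` are hypothetical, and the actual structure of the retry info is assumed to be a plain dictionary.

```python
import json
import os
from ast import literal_eval


def load_retry_info(path):
    """Load retry info from JSON, accepting legacy Python-literal files too."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fd:
        text = fd.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Legacy format: the file holds the repr() of a Python dictionary.
        try:
            return literal_eval(text)
        except (ValueError, SyntaxError):
            return {}


def save_retry_info(path, retry_info):
    """Write retry info as JSON via a temporary file, then rename it in place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fd:
        json.dump(retry_info, fd)
    os.replace(tmp_path, path)
```

One caveat of the switch: JSON object keys are always strings, so a dictionary keyed by integer exit codes would come back with string keys after a round trip.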

cmsdmwmbot · 2026-02-24T18:43:56Z

Jenkins results:

Python3 Pylint check: failed
- 1 warnings and errors that must be fixed
- 110 comments to review
Pycodestyle check: succeeded
- 191 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2740/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-30T15:01:18Z

Jenkins results:

Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 120 comments to review
Pycodestyle check: succeeded
- 223 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2766/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-31T08:18:14Z

Jenkins results:

Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 120 comments to review
Pycodestyle check: succeeded
- 222 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2769/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-31T09:59:18Z

Jenkins results:

Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 120 comments to review
Pycodestyle check: succeeded
- 221 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2770/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-31T11:33:25Z

Jenkins results:

Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 122 comments to review
Pycodestyle check: succeeded
- 264 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2771/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-31T13:30:37Z

Jenkins results:

Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 114 comments to review
Pycodestyle check: succeeded
- 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2772/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-03-31T14:41:27Z

Jenkins results:

Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 115 comments to review
Pycodestyle check: succeeded
- 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2773/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-04-01T11:24:23Z

Jenkins results:

Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 117 comments to review
Pycodestyle check: succeeded
- 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2776/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2026-04-01T14:01:27Z

Jenkins results:

Python3 Pylint check: failed
- 4 warnings and errors that must be fixed
- 117 comments to review
Pycodestyle check: succeeded
- 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2777/artifact/artifacts/PullRequestReport.html

add exit code dependent retry policy

4c94acf

belforte reviewed Feb 24, 2026

adding retry delay condition to needsDefer

e70893e

belforte reviewed Feb 24, 2026

add exit code dependent ability to change maxmemory, maxjobruntime and site

595b71b

aspiringmind-code added 2 commits March 30, 2026 11:00

Merge branch 'dmwm:master' into improve_resubmit

3614084

add resubmit_counter and eff max retries

b76d758

use ExprTree

69afe55

remove abort and chnage jobads too

5b92334

remove jobconst, add to adstoPort

9bd8224

strictly policy dependent

9fca35e

avoid exprtree

8b3db7e

proper use of inkey and test with 8020

1e980e2

be free of use_resubmit_info

f492af6

aspiringmind-code marked this pull request as ready for review April 1, 2026 14:25

3 participants