add exit code dependent retry policy #9276

Open
aspiringmind-code wants to merge 12 commits into dmwm:master from aspiringmind-code:improve_resubmit

@aspiringmind-code
Contributor

Fix #9264

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2738/artifact/artifacts/PullRequestReport.html

delay = policy.get("delay", 900)
self.logger.info(f"Sleeping {delay} seconds before retry (exit code {exitCode})")
time.sleep(delay)
if exitCode in [8020, 8021, 8022, 8028, 84, 85, 86, 92, 134, 8001, 65]:
Member

Why is this line there? Isn't what to do already fully defined by the table?
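
One way to read this comment: if every retryable exit code has an entry in the policy table, a lookup in the table can replace the hard-coded list. The following is only a minimal sketch under that assumption. The table entries, the "max_retries" field, and the helper name retry_decision are hypothetical and not taken from the PR; only the "delay" key and its 900-second default appear in the quoted code.

```python
# Hypothetical policy table; the real EXIT_RETRY_POLICY in the PR may differ.
EXIT_RETRY_POLICY = {
    8021: {"delay": 900, "max_retries": 3},
    8028: {"delay": 1800, "max_retries": 2},
}


def retry_decision(exit_code, retries_so_far, policy_table=EXIT_RETRY_POLICY):
    """Return (should_retry, delay_seconds) using only the policy table."""
    policy = policy_table.get(exit_code)
    if policy is None:
        # No exit-code-specific policy: leave the decision to existing logic.
        return False, 0
    if retries_so_far >= policy.get("max_retries", 0):
        return False, 0
    return True, policy.get("delay", 900)
```

With this shape, adding a new exit code means adding one table entry, and no second list has to be kept in sync.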

# Exit-code dependent retry policy
# ----------------------------------------------------------------------

EXIT_RETRY_POLICY = {
Member

This dictionary, which will grow as we add other exit codes, could be better organized: order it by increasing exit code value and have it contain only the "long" codes, e.g. 8021.
The short exit codes can be added later by using short_code=long_code%128 or, if we want to keep it easy for the reader to find a short exit code here, by adding it as a key in the sub-dictionary.
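
A rough sketch of the organization suggested here, assuming a table keyed only by long exit codes in increasing order. The delay values and the names LONG_CODE_POLICY, short_code, and build_lookup are hypothetical. The formula is the one given in the comment; the long and short values in the list quoted earlier (for example 8021 and 85, or 8001 and 65) are consistent with it, and they are equally consistent with long_code % 256, so the modulus is kept in a single function in case it needs changing.

```python
# Hypothetical entries keyed by long exit code, in increasing order.
LONG_CODE_POLICY = {
    8001: {"delay": 900},
    8020: {"delay": 900},
    8021: {"delay": 900},
    8028: {"delay": 1800},
}


def short_code(long_code):
    """Short form of a long exit code, using the formula from the review comment."""
    return long_code % 128


def build_lookup(long_policy):
    """Map both long and short codes to the same policy entry."""
    lookup = {}
    for long_code in sorted(long_policy):
        # Record the short code inside the sub-dictionary so readers can find it.
        entry = dict(long_policy[long_code], short_code=short_code(long_code))
        lookup[long_code] = entry
        # If two long codes share a short code, the lowest long code wins.
        lookup.setdefault(entry["short_code"], entry)
    return lookup


LOOKUP = build_lookup(LONG_CODE_POLICY)
```

Both suggestions in the comment are covered: the short code is derived rather than typed by hand, and it is still visible as the "short_code" field of each entry.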

Member

@belforte belforte left a comment

See a couple of inline comments.

@belforte
Member

More on the "substance": it is not good to use sleep inside RetryJob, i.e. inside PostJob (which calls it).
We are limited in how many PostJobs can run concurrently, due to memory constraints. The preferred implementation would be what is done when waiting for ASO: exit with a proper exit code which tells Dagman to rerun the Post (or Pre ?) step after a delay (an example of delay in PreJob is the use of deferTime there).

Notice that delaying the PostJob also delays the status reporting, since the DAG node is still not completed. Rather, once we introduce re-submission delays of several hours (days ?), we should worry about properly reporting this to the user.
I think that currently jobs are reported as "toRetry" or "cooloff" (unfortunately there is some inconsistency) when the DAG node is completed with error but not resubmitted yet. At least that's a status that appears at times, but I have not done a careful study of the current implementation.
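
The alternative to sleeping described above could look roughly like the sketch below: instead of blocking, the script compares the current time with the earliest allowed resubmission time and exits at once with a "defer" status if it is too early. This is only an illustration. The function name, the way the retry time is obtained, and the value of DEFER_EXIT_CODE are assumptions; the actual status that makes Dagman rerun the script after a delay depends on how the DAG is configured.

```python
import time

# Assumption: DAGMan is configured to re-run the script after a delay when it
# exits with this status; the value 4 is illustrative only.
DEFER_EXIT_CODE = 4


def postjob_exit_status(retry_not_before, final_status, now=None):
    """Return the exit status to use instead of sleeping.

    retry_not_before: epoch seconds before which the job must not be resubmitted.
    final_status: the status the script would exit with once the wait is over.
    """
    now = time.time() if now is None else now
    if now < retry_not_before:
        # Exit immediately so the process and its memory are released;
        # the scheduler is expected to call the script again later.
        return DEFER_EXIT_CODE
    return final_status
```

Because the process exits right away, a long delay does not hold one of the limited concurrent PostJob slots. The status-reporting concern raised above still applies, since the DAG node remains incomplete while deferred.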

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 105 comments to review
  • Pycodestyle check: succeeded
    • 185 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2739/artifact/artifacts/PullRequestReport.html

if os.path.exists(retry_info_file):
    try:
        with open(retry_info_file, "r", encoding="utf-8") as fd:
            retry_info = literal_eval(fd.read())
Member

Time to make it a JSON file?
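
A possible shape for such a change, sketched under the assumption that existing files were written as Python literals and must remain readable during a transition. The helper names load_retry_info and save_retry_info are hypothetical and not part of the PR.

```python
import json
import os
from ast import literal_eval


def load_retry_info(path):
    """Read retry info as JSON, falling back to the legacy literal format."""
    if not os.path.exists(path):
        return {}
    with open(path, "r", encoding="utf-8") as fd:
        text = fd.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Legacy file written as a Python literal, e.g. via str(dict).
        return literal_eval(text)


def save_retry_info(path, retry_info):
    """Write retry info as JSON."""
    with open(path, "w", encoding="utf-8") as fd:
        json.dump(retry_info, fd)
```

One caveat of the switch: JSON object keys are always strings, so a dictionary keyed by integers would come back with string keys after a round trip.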

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 110 comments to review
  • Pycodestyle check: succeeded
    • 191 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2740/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 223 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2766/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 222 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2769/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 120 comments to review
  • Pycodestyle check: succeeded
    • 221 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2770/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 122 comments to review
  • Pycodestyle check: succeeded
    • 264 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2771/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 114 comments to review
  • Pycodestyle check: succeeded
    • 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2772/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 115 comments to review
  • Pycodestyle check: succeeded
    • 266 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2773/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 117 comments to review
  • Pycodestyle check: succeeded
    • 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2776/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 117 comments to review
  • Pycodestyle check: succeeded
    • 267 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/2777/artifact/artifacts/PullRequestReport.html

@aspiringmind-code aspiringmind-code marked this pull request as ready for review April 1, 2026 14:25

Development

Successfully merging this pull request may close these issues.

improve resubmission policies

3 participants