Open
Description
I am encountering an issue when submitting multiple jobs using torchrun-hpc. If jobs launched from the same directory are granted allocations within the same second, their output directory names conflict, because the name is built from a timestamp with one-second resolution. I don't think this is uncommon for users submitting 10+ jobs at once. Example of the directory naming: torchrun_hpc-scaffold_2026-01-28_06h25m09s
I think the most trivial change would be to append milliseconds to the timestamp, which makes a conflict significantly less likely. Even better would be a fully unique identifier, such as the scheduler job ID, but that would not apply when running interactively.
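As a rough sketch of how both ideas could be combined (this is not the actual torchrun-hpc code; `unique_suffix`, `make_launch_dir`, and the `JOB_ID_VARS` list are hypothetical names, and the prefix is just copied from the example above): add milliseconds to the timestamp, append a scheduler job ID when one is present in the environment, and fall back to a numeric suffix if the directory already exists. It assumes `mkdir` fails on an existing directory, which is what makes the retry safe.

```python
import os
from datetime import datetime
from pathlib import Path

# Assumption: job-ID environment variables for Slurm, LSF, and Flux.
# The schedulers torchrun-hpc actually supports may differ.
JOB_ID_VARS = ("SLURM_JOB_ID", "LSB_JOBID", "FLUX_JOB_ID")


def unique_suffix(now=None, env=None):
    """Build a timestamp suffix with milliseconds, plus a job ID if available."""
    now = now or datetime.now()
    env = os.environ if env is None else env
    # Same layout as the existing names, with milliseconds appended.
    stamp = now.strftime("%Y-%m-%d_%Hh%Mm%Ss") + f"{now.microsecond // 1000:03d}ms"
    for var in JOB_ID_VARS:
        job_id = env.get(var)
        if job_id:
            return f"{stamp}_job{job_id}"
    # Interactive run: no scheduler job ID, timestamp only.
    return stamp


def make_launch_dir(parent, prefix="torchrun_hpc-scaffold", now=None, env=None):
    """Create and return a directory that did not exist before this call."""
    base = f"{prefix}_{unique_suffix(now, env)}"
    for attempt in range(1000):
        name = base if attempt == 0 else f"{base}_{attempt}"
        path = Path(parent) / name
        try:
            # mkdir raises FileExistsError if another job created it first.
            path.mkdir()
            return path
        except FileExistsError:
            continue
    raise RuntimeError(f"could not create a unique directory for {base}")
```

With this, two launches in the same millisecond without a job ID would end up in `..._06h25m09s123ms` and `..._06h25m09s123ms_1` rather than sharing one directory.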