Jobs Now Always Automatically Requeued On Prolog Failure
As of Thursday, October 26th, jobs that fail to start due to a prolog script error will always be requeued.
Prolog scripts are configured by system administrators to verify the health of a node. They verify things like file system availability and network bandwidth. They run after a job is granted an allocation, but before that job begins execution. A prolog script error occurs when a node allocated for a job is found to be unfit to run that job. These errors indicate a problem with the node, not a problem with the job.
This change affects only users who specify the --no-requeue option for their jobs. That option is preferable when a job that fails mid-run cannot restart cleanly without user intervention. Jobs submitted with --no-requeue will now be requeued when a prolog script fails, because in that case the job has not yet begun execution. The behavior without this option is the same as before: jobs are requeued upon any system failure, whether it occurs in the prolog scripts before job execution or in the middle of job execution.
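As an illustration, a batch script that uses this option might look like the minimal sketch below. The time limit, the describe_start helper, and the application name are hypothetical placeholders. Slurm documents the SLURM_RESTART_COUNT environment variable as being set for jobs that have been restarted or requeued; whether a requeue after a prolog failure increments it on this system is an assumption you should verify before relying on it.

```shell
#!/bin/bash
#SBATCH --no-requeue        # request no requeue (prolog failures are now requeued regardless)
#SBATCH --time=00:10:00     # placeholder time limit, adjust for your job

# Report whether this run looks like a first start or a restart.
# SLURM_RESTART_COUNT is unset on a first run, so default it to 0.
describe_start() {
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -gt 0 ]; then
        echo "restart ${restarts}"
    else
        echo "first start"
    fi
}

describe_start

# Launch the real work here, for example:
# srun ./my_application     # hypothetical executable name
```

Logging this line at the top of a job's output makes it easier to tell afterwards whether a job was requeued before it ran.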
The aim of this configuration change is to reduce the need for user intervention when system issues prevent a job from starting.
If you have any questions about this change or you experience any other issues, please contact us.