Slurm controller

Investigating potential scheduling delays

Resolved

A fix addressing the root cause of the scheduling delays has been deployed. Job dispatch times have returned to normal, and the issue is now resolved.

We appreciate users’ patience while we worked with the Slurm development team to identify and address the problem.

Monitoring

The root cause of the scheduling delays reported earlier has been identified: a bug caused the job scheduler to make inefficient decisions on systems where many jobs request licenses (such as Sherlock), leaving jobs waiting longer than expected to start.
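For illustration, license usage on a Slurm cluster can be inspected with standard Slurm commands. This is only a sketch: the exact output fields vary by Slurm version, and the snippet is guarded so it is a harmless no-op on machines without Slurm installed.

```shell
# Sketch: inspect cluster-wide license counts and spot pending jobs
# that are waiting on a license (run on a Sherlock login node).
if command -v scontrol >/dev/null 2>&1; then
  # Total / in-use counts for each license tracked by the controller.
  scontrol show licenses
  # Pending jobs whose reason mentions licenses ("%r" is the pending reason).
  squeue --states=PENDING --format="%.12i %.16r" | grep -i license || true
else
  echo "Slurm tools not found: run this on a Sherlock login node"
fi
```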

The workaround currently in place has been validated, and scheduling is back to normal: no further delays are being observed. We are keeping this issue open until an official fix is released upstream and deployed on Sherlock.

Problem Identified

We continue to work with the scheduler developers on this issue and are making good progress. A likely source of the scheduling delays has been identified, and we are now validating possible workarounds before a fix can be developed, tested, and deployed.

As a reminder, all jobs will eventually start, so no action is required on your part. We appreciate your patience and will continue to post updates as we approach final resolution.

Investigating

The scheduling delays are still being investigated.

As mentioned initially, all jobs will eventually execute, so no action is required on your part, aside from a little more patience than usual.

We’re aware of the trouble this may cause, and are working with the scheduler developers to identify the problem and find a path to resolution.

Investigating

We’re currently investigating some scheduling delays with jobs that have recently been submitted on Sherlock.

Under certain circumstances, jobs may take longer than usual to be dispatched and may wait in the queue longer than expected. All jobs will eventually start, so we recommend keeping them in the queue rather than cancelling them (re-submitting them later will only put them back at the end of the line).
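If you want to check why your own jobs are still pending, Slurm's `squeue` can show the scheduler's reason for each one. A minimal sketch (output formatting is an example; the snippet is guarded so it does nothing harmful off the cluster):

```shell
# Sketch: list your pending jobs and the scheduler's reason why each has
# not started yet ("%r" prints reasons such as Priority or Resources).
if command -v squeue >/dev/null 2>&1; then
  squeue -u "$USER" --states=PENDING --format="%.12i %.9P %.24j %.16r"
else
  echo "squeue not found: run this on a Sherlock login node"
fi
```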

We’re working with the scheduler support and development teams on this incident, and will post updates when we have them.