Slurm controller

Scheduler issues

Resolved

We’re still waiting on a definitive diagnosis from the vendor, but the issue hasn’t occurred in the last month, so we’re marking it as resolved. We will of course continue to monitor the system closely.

Monitoring

The problem was mitigated at around 8pm last night, and the system has been stable since. The scheduler is back to its normal state, and jobs should be allocated and run normally again.

Investigating

This problem unfortunately started occurring again today, so we’re re-opening the incident: scheduler latencies and jobs stuck in the COMPLETING state are to be expected. We’re working on the issue and will post updates as they become available.

Resolved

We’re still waiting on a definitive diagnosis and fix from the vendor, but the issue hasn’t occurred in the last three weeks, so we’re marking it as resolved. We will of course continue to monitor the system closely.
The temporary limits on the job submission rate and multi-partition jobs have been lifted.

Monitoring

Occurrences of the problem seem to have decreased and some workarounds have been put in place, but no definitive fix has been deployed yet, and the scheduler developers are still investigating. We’ll continue to monitor the issue.

Investigating

We’re still investigating the issue with the scheduler support team. In the meantime, to mitigate the impact and minimize the frequency of these incidents, we’ve temporarily added two additional restrictions on Sherlock:

  1. job submission rate is now limited to 300 jobs per user per hour (that’s 5 jobs/minute); a simple way to pace submissions is sketched after this update. When the limit is exceeded, job submissions will be rejected with the following message: error: Reached jobs per hour limit
  2. job submission to multiple partitions is now restricted: jobs can still be submitted with multiple partitions listed (so scripts don’t need to be modified), but only one of those partitions will be considered for dispatch by the scheduler.

We hope this will help limit the occurrence of these blocking situations while a proper solution is being worked on.
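
For users who need to submit large series of jobs while the temporary limit is in place, one option is to pace submissions so they stay under 5 jobs per minute. The sketch below (Python, wrapping the standard sbatch command) is only one possible, unofficial way to do that: the interval is derived from the 300 jobs/hour limit described above, and the job script names are placeholders to replace with your own.

    #!/usr/bin/env python3
    """Pace a large series of sbatch submissions to stay under the temporary rate limit.

    Rough sketch only: the 300 jobs/hour figure comes from the limit described above,
    and the job script names below are placeholders.
    """
    import subprocess
    import time

    JOBS_PER_HOUR = 300                   # temporary per-user limit described above
    MIN_INTERVAL = 3600 / JOBS_PER_HOUR   # seconds between submissions (12s)

    def submit_paced(job_scripts):
        """Submit each script with sbatch, sleeping between calls to stay under the limit."""
        for script in job_scripts:
            result = subprocess.run(["sbatch", script], capture_output=True, text=True)
            if result.returncode != 0:
                # Rejected submissions (e.g. "Reached jobs per hour limit") end up here.
                print(f"submission of {script} failed: {result.stderr.strip()}")
            else:
                print(result.stdout.strip())
            time.sleep(MIN_INTERVAL)

    if __name__ == "__main__":
        # Hypothetical list of job scripts; replace with your own.
        submit_paced([f"job_{i}.sbatch" for i in range(10)])

Depending on how the limit counts array tasks, grouping many small tasks into a single job array submission (sbatch --array) may also reduce the number of individual submissions.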

Investigating

The scheduler development team is still investigating the issue, and has proposed a few ways to gather additional diagnostic information. As the issue is not readily reproducible, we have to wait for the next occurrence to gather more data. We’re continuing to work on the problem and will provide more updates as we have them.

Investigating

The issue is still ongoing; there is no new update to report yet. The scheduler may appear to be working normally most of the time, but jobs can get stuck in the CG (completing) state at any time. Vendor support is working on the problem, and we’ll post updates as they become available.

Investigating

The issue is still ongoing and under investigation; vendor support is working on diagnosing the problem.
Possible symptoms include: jobs stuck in the CG state, jobs appearing to be running but not producing any output, jobs being re-queued multiple times, or “sdev” sessions hanging.

Please avoid submitting or cancelling large series of jobs while we’re working on the problem; it will help us diagnose the issue faster.

Investigating

We’re still observing periods of decreased responsiveness from the scheduler, with jobs staying in the “completing” state for extended periods of time. We’re continuing to investigate and are well aware of the inconvenience.

Monitoring

Vendor support provided a fix, which has now been deployed. The scheduler is back to its normal operational state. We’ll continue to monitor it, but we’re confident the incident has been resolved.

Investigating

The scheduler is currently down; vendor support is engaged.

Investigating

We’re currently investigating issues with the scheduler, where jobs may appear stuck in the “Completing” (CG) state, or be shown as “Running” (R) while not producing any output.
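
To check whether any of your own jobs are affected, you can look for jobs sitting in the CG (completing) state. The sketch below is a minimal, read-only Python example that wraps the standard squeue command; it assumes squeue is available in your PATH and simply lists your jobs in that state along with how long they have been completing.

    #!/usr/bin/env python3
    """List your own jobs currently stuck in the "completing" (CG) state.

    Minimal, read-only sketch assuming the standard Slurm squeue command is in PATH.
    """
    import getpass
    import subprocess

    def completing_jobs(user):
        """Return (job id, elapsed time, job name) for the user's jobs in the CG state."""
        out = subprocess.run(
            ["squeue", "-u", user, "-t", "CG", "-h", "-o", "%i %M %j"],
            capture_output=True, text=True, check=True,
        ).stdout
        # %i = job id, %M = time used, %j = job name (listed last so names with spaces survive)
        return [tuple(line.split(maxsplit=2)) for line in out.splitlines() if line.strip()]

    if __name__ == "__main__":
        for jobid, elapsed, name in completing_jobs(getpass.getuser()):
            print(f"job {jobid} ({name}) has been completing for {elapsed}")

The same check can be done directly from the command line with: squeue -u $USER -t CG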