Issue with GPU jobs

Resolved

The issue has been resolved and GPU jobs are running again.

Problem Identified

To avoid repeated job failures and re-queuing, all the GPU nodes have been drained.

Users can continue to submit GPU jobs to any partition; the jobs will remain queued and will run once the software issue with the scheduler has been resolved.
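For reference, a minimal sketch of what this looks like from the user side (the partition name "gpu" and the script name "job.sh" are placeholders; substitute your own):

# The drained GPU nodes are visible with sinfo:
sinfo -t drain

# GPU jobs can still be submitted as usual and will wait in the queue:
sbatch --partition=gpu --gres=gpu:1 job.sh

# Pending jobs, and the reason they are waiting, can be checked with:
squeue -u $USER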

Problem Identified

We’re currently seeing an issue with the scheduler where, under some circumstances, GPU jobs may fail with messages like this:

*** Error in `slurmstepd: [23098430.interactive]': free(): invalid next size (fast): 0x00002ab338000d60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x81329)[0x2ab32822d329]
/usr/lib64/slurm/libslurmfull.so(slurm_xfree+0x25)[0x2ab326b65d48]
/usr/lib64/slurm/gpu_nvml.so(gpu_p_usage_read+0x422)[0x2ab32c888018]
/usr/lib64/slurm/libslurmfull.so(gpu_g_usage_read+0xa)[0x2ab326b77dc3]
/usr/lib64/slurm/jobacct_gather_linux.so(+0x2d19)[0x2ab32b095d19]
/usr/lib64/slurm/jobacct_gather_linux.so(jag_common_poll_data+0x12d)[0x2ab32b09653d]
/usr/lib64/slurm/jobacct_gather_linux.so(jobacct_gather_p_poll_data+0x47)[0x2ab32b0952ee]
/usr/lib64/slurm/libslurmfull.so(+0x180aa5)[0x2ab326b8caa5]
/usr/lib64/slurm/libslurmfull.so(+0x180cf1)[0x2ab326b8ccf1]
/usr/lib64/libpthread.so.0(+0x7ea5)[0x2ab32766fea5]
/usr/lib64/libc.so.6(clone+0x6d)[0x2ab3282aab0d]

As the backtrace shows, slurmstepd crashes with heap corruption inside the NVML GPU plugin (gpu_p_usage_read in gpu_nvml.so) while the job accounting gatherer polls GPU usage data. This is caused by a new GPU usage gathering feature introduced in Slurm 23.02. The Slurm developers are aware of the issue, have reproduced it, and are working on a fix.
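For site administrators, a hedged sketch of where the failing component lives in the configuration (illustrative only: the values below are examples, and switching the plugin to "none" disables all per-job accounting, not just GPU usage gathering, so the real remedy remains the upstream fix):

# slurm.conf: the polling that crashes runs inside the job accounting
# gatherer, which on GPU nodes loads the NVML GPU plugin seen in the
# backtrace (gpu_nvml.so).
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# A temporary bypass (requires restarting the Slurm daemons) would be:
# JobAcctGatherType=jobacct_gather/none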
