Issue with GPU jobs

Resolved 2023-07-05 10:40 PDT

The issue has been resolved and GPU jobs are running again.

Identified 2023-07-05 08:54 PDT

To avoid repeated job failures and re-queuing, all the GPU nodes have been drained.

Users can continue to submit GPU jobs to any partition, and they will run when the software issue with the scheduler has been resolved.

Identified 2023-07-05 06:18 PDT

We’re currently seeing an issue with the scheduler where GPU jobs may fail under some circumstances with messages like this:

****** Error in `slurmstepd: [23098430.interactive]': free(): invalid next size (fast): 0x00002ab338000d60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x81329)[0x2ab32822d329]
/usr/lib64/slurm/libslurmfull.so(slurm_xfree+0x25)[0x2ab326b65d48]
/usr/lib64/slurm/gpu_nvml.so(gpu_p_usage_read+0x422)[0x2ab32c888018]
/usr/lib64/slurm/libslurmfull.so(gpu_g_usage_read+0xa)[0x2ab326b77dc3]
/usr/lib64/slurm/jobacct_gather_linux.so(+0x2d19)[0x2ab32b095d19]
/usr/lib64/slurm/jobacct_gather_linux.so(jag_common_poll_data+0x12d)[0x2ab32b09653d]
/usr/lib64/slurm/jobacct_gather_linux.so(jobacct_gather_p_poll_data+0x47)[0x2ab32b0952ee]
/usr/lib64/slurm/libslurmfull.so(+0x180aa5)[0x2ab326b8caa5]
/usr/lib64/slurm/libslurmfull.so(+0x180cf1)[0x2ab326b8ccf1]
/usr/lib64/libpthread.so.0(+0x7ea5)[0x2ab32766fea5]
/usr/lib64/libc.so.6(clone+0x6d)[0x2ab3282aab0d]

This is caused by a new GPU usage gathering feature introduced in Slurm 23.02. The developers are aware of the issue, have reproduced it, and are working on a fix.
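
To check whether a failed job hit this particular problem, its output can be searched for the signature quoted above. The following is only a minimal sketch: the helper name `matches_gpu_usage_crash` and the matching rules are hypothetical, derived solely from the error text in this notice, and other failure modes may look different.

```python
import re

# Hypothetical helper: the name and matching rules are illustrative only,
# based on the error text quoted in this notice.
HEAP_ERROR = re.compile(
    r"Error in `slurmstepd: \[[^\]]+\]': free\(\): invalid next size"
)
GPU_FRAME = "gpu_nvml.so(gpu_p_usage_read"


def matches_gpu_usage_crash(log_text: str) -> bool:
    """Return True if log_text contains both the slurmstepd heap error
    and the gpu_p_usage_read backtrace frame."""
    return bool(HEAP_ERROR.search(log_text)) and GPU_FRAME in log_text


# Two lines taken from the message quoted in this notice.
sample = (
    "****** Error in `slurmstepd: [23098430.interactive]': "
    "free(): invalid next size (fast): 0x00002ab338000d60 ***\n"
    "/usr/lib64/slurm/gpu_nvml.so(gpu_p_usage_read+0x422)[0x2ab32c888018]\n"
)
print(matches_gpu_usage_crash(sample))
```

Requiring both the heap error and the `gpu_p_usage_read` frame keeps the check narrow, so unrelated `slurmstepd` crashes are not attributed to this issue.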