Issue with GPU jobs

Resolved

The issue has been resolved and GPU jobs are running again.

Identified

To avoid repeated job failures and re-queuing, all the GPU nodes have been drained.

Users can continue to submit GPU jobs to any partition. Those jobs will run once the software issue with the scheduler has been resolved.

Identified

We’re currently seeing an issue with the scheduler, where GPU jobs may be failing under some circumstances with messages like this:

****** Error in `slurmstepd: [23098430.interactive]': free(): invalid next size (fast): 0x00002ab338000d60 ***
======= Backtrace: =========
/usr/lib64/libc.so.6(+0x81329)[0x2ab32822d329]
/usr/lib64/slurm/libslurmfull.so(slurm_xfree+0x25)[0x2ab326b65d48]
/usr/lib64/slurm/gpu_nvml.so(gpu_p_usage_read+0x422)[0x2ab32c888018]
/usr/lib64/slurm/libslurmfull.so(gpu_g_usage_read+0xa)[0x2ab326b77dc3]
/usr/lib64/slurm/jobacct_gather_linux.so(+0x2d19)[0x2ab32b095d19]
/usr/lib64/slurm/jobacct_gather_linux.so(jag_common_poll_data+0x12d)[0x2ab32b09653d]
/usr/lib64/slurm/jobacct_gather_linux.so(jobacct_gather_p_poll_data+0x47)[0x2ab32b0952ee]
/usr/lib64/slurm/libslurmfull.so(+0x180aa5)[0x2ab326b8caa5]
/usr/lib64/slurm/libslurmfull.so(+0x180cf1)[0x2ab326b8ccf1]
/usr/lib64/libpthread.so.0(+0x7ea5)[0x2ab32766fea5]
/usr/lib64/libc.so.6(clone+0x6d)[0x2ab3282aab0d]

This is caused by a new GPU usage gathering feature introduced in Slurm 23.02. The Slurm developers are aware of the issue, have reproduced it, and are working on a fix.
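
To check whether a failed job hit this particular crash, the message above can be matched against the job's captured output. The following is a minimal sketch, not an official tool: the function name `find_gpu_usage_crash` is hypothetical, the patterns are taken only from the backtrace quoted in this notice, and it assumes the slurmstepd message ended up in text you can read (a terminal capture or the job's output file).

```python
import re

# Patterns copied from the error quoted in this notice (an assumption:
# other occurrences of the crash may be worded slightly differently).
ERROR_RE = re.compile(
    r"Error in `slurmstepd: \[(\d+)\.[^\]]*\]': free\(\): invalid next size"
)
FRAME_MARKER = "gpu_nvml.so(gpu_p_usage_read"


def find_gpu_usage_crash(log_text):
    """Return the job ID from the crash message, or None if there is no match.

    Both the glibc error line and the gpu_p_usage_read frame must be present,
    so unrelated heap-corruption errors are not reported.
    """
    match = ERROR_RE.search(log_text)
    if match and FRAME_MARKER in log_text:
        return match.group(1)
    return None
```

For example, passing the contents of a job's output file to `find_gpu_usage_crash` returns the job ID as a string when both signatures are found, and `None` for any other failure, which would need to be investigated separately.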

2 affected services: