$HOME

NFS issues

Resolved

After a week of regular production without incident, we’re confident that the NFS problem has been definitively fixed, and we’re marking this issue as resolved.

Monitoring

The vendor has provided a patch to fix these NFS issues. It was deployed over the weekend, and the situation seems to have stabilized. We will continue monitoring the situation very closely until we’re certain the issue has been definitively fixed.

Update

New occurrences of the NFS issue are forcing us to reboot a large number of compute nodes on Sherlock 2.0. Nodes and partitions may be unavailable during that time.

Identified

We continue to experience recurrent issues and service interruptions with our NFS filer, which serves both $HOME and $PI_HOME. It also hosts software modules and some critical components of the scheduler infrastructure.

The range of symptoms is pretty wide and includes:

  • inability to connect to the login nodes,
  • frozen sessions,
  • Slurm error messages when submitting jobs or querying the queue,
  • stuck jobs on compute nodes that don’t seem to progress.

We’re very aware of the impact this has on our community of users, and we’ll continue putting pressure on our vendor for a timely resolution.

Monitoring

The vendor identified the source of the issue, and completed initial testing of a fix. Some more validation is required before the patch can be released and deployed on our systems.

Identified

We’re still experiencing NFS issues on Sherlock 2.0. Vendor support is engaged.

Resolved

The issue has been identified and worked around. Vendor support has been notified and is working on the reported data. We’ll continue monitoring the system until this is completely fixed.

Investigating

We’re experiencing NFS-related issues on Sherlock 2.0. Login nodes are currently not allowing proper connections, and an investigation is underway.