$HOME

NFS issues

Resolved

After a week of regular production without any incident, we’re now confident that the NFS problem has been definitively fixed, and we’re marking this issue as resolved.

Monitoring

The vendor has provided a patch to fix these NFS issues. It was deployed over the weekend and the situation seems to have stabilized. We will continue monitoring the situation very closely until we’re certain the issue has been definitively fixed.

Update

New occurrences of this issue are forcing us to reboot a large number of compute nodes on Sherlock 2.0. Nodes and partitions may be unavailable during that time.

Identified

We continue to experience recurrent issues and service interruptions with our NFS filer, which serves both $HOME and $PI_HOME. It also hosts software modules and some critical components of the scheduler infrastructure.

The range of symptoms is wide and includes:

  • inability to connect to the login nodes,
  • frozen sessions,
  • Slurm error messages when submitting jobs or querying the queue,
  • stuck jobs on compute nodes that don’t seem to progress.

We’re very aware of the impact this has on our community of users, and we’ll continue putting pressure on our vendor for a timely resolution.

Monitoring

The vendor has identified the source of the issue and completed initial testing of a fix. More validation is required before the patch can be released and deployed on our systems.

Identified

We’re still experiencing NFS issues on Sherlock 2.0. Vendor support is engaged.

Resolved

The issue has been identified and worked around. Vendor support has been notified and is working on the reported data. We’ll continue monitoring the system until this is completely fixed.

Open

We’re experiencing NFS-related issues on Sherlock 2.0. Login nodes are currently not accepting connections properly. An investigation is underway.