NFS issues

Resolved Mar 19, 2018 10:44 AM PDT

After a week of regular production without incident, we’re now confident that the NFS problem has been definitively fixed, and we’re marking this issue as resolved.

Monitoring Mar 13, 2018 9:42 AM PDT

The vendor has provided a patch to fix these NFS issues. It was deployed over the weekend, and the situation seems to have stabilized. We will continue monitoring the situation very closely until we’re absolutely positive the issue has been definitively fixed.

Updated Mar 6, 2018 11:37 AM PST

New occurrences of that issue are forcing us to reboot a good chunk of the compute nodes on Sherlock 2.0. Nodes and partitions may be unavailable during that time.

Problem Identified Mar 5, 2018 7:56 AM PST

We continue to experience recurrent issues and service interruptions with our NFS filer, which serves both $HOME and $PI_HOME. It also hosts software modules and some critical components of the scheduler infrastructure.

The range of symptoms is pretty wide and includes:

inability to connect to the login nodes,
frozen sessions,
Slurm error messages when submitting jobs or querying the queue,
stuck jobs on compute nodes that don’t seem to progress

We’re very aware of the impact this has on our community of users, and we’ll continue putting pressure on our vendor for a timely resolution.

Monitoring Feb 21, 2018 10:59 AM PST

The vendor identified the source of the issue, and completed initial testing of a fix. Some more validation is required before the patch can be released and deployed on our systems.

Problem Identified Feb 6, 2018 3:17 PM PST

We’re still experiencing NFS issues on Sherlock 2.0. Vendor support is engaged.

Resolved Feb 3, 2018 8:45 PM PST

The issue has been identified and worked around. Vendor support has been notified and is working on the reported data. We’ll continue monitoring the system until this is completely fixed.

Opened Feb 3, 2018 6:25 PM PST

We’re experiencing NFS-related issues on Sherlock 2.0. Login nodes are currently not allowing proper connections. Investigation is underway.