$SCRATCH

Filesystem instabilities

Resolved

Even though we may not have been updating this issue much, a lot has happened over the last few weeks to address the ongoing $SCRATCH issues. We're well aware of their impact on Sherlock users' work, and that many users depend on $SCRATCH for their research and publications. That's why we want to apologize for the recurring disruptions; we'll also send a more detailed note to the mailing list to provide some more context.

Right now, we'd like to announce that all the issues that have surfaced since the last maintenance have been identified, diagnosed, and fixed, and that all the patches have been deployed. We're confident that the $SCRATCH instabilities are now behind us, and that things should run smoothly from now on.

Again, we're very sorry for the turbulence of the past few weeks, and we encourage users to refer to the upcoming email announcement for more details.

Monitoring

The recent $SCRATCH stability issues have now been addressed.

We're now deploying a final set of fixes that should address the remaining latency problems and restore $SCRATCH to its original performance levels. We'll keep this incident open until the fixes are fully deployed, and will continue to monitor the filesystem closely.

Again, thanks for your patience and understanding while we’ve been working on this issue.

Monitoring

We’ve been actively testing a set of patches to address a variety of issues affecting $SCRATCH, and we’re now ready to start deploying them on the cluster.

To that end, we'll need to take a more aggressive approach than usual to ensure the fixes are deployed in a reasonable amount of time. In practice, we'll start actively draining compute nodes rather than waiting for them to become idle and available for patching. This won't affect running jobs, but it may increase queue wait times, as it will temporarily reduce the amount of resources available to run new jobs.
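
For the curious, on a Slurm-based cluster like Sherlock, draining a node typically looks something like the sketch below; the node name and reason string are purely illustrative, not the exact commands we'll run:

    # mark the node to drain: running jobs keep going, but no new jobs are scheduled on it
    scontrol update NodeName=sh-101-20 State=DRAIN Reason="scratch client patching"
    # once patched, return the node to service
    scontrol update NodeName=sh-101-20 State=RESUME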

Thanks again for your patience and understanding while we’re working on those issues.

Monitoring

The cause of the ongoing issue has been identified, and we deployed a workaround last night, which has helped stabilize the situation overnight.
We're working closely with the filesystem developers on a definitive fix, and we hope to have the issue fully resolved soon.

Investigating

We’re making progress towards identifying the cause of the ongoing issue with $SCRATCH, and are currently working with our vendor’s support and development teams to get this fixed as quickly as possible.

We really appreciate your patience and understanding during these frustrating times, and we deeply apologize for the disruption.

Monitoring

The hardware issues identified earlier have been resolved.
We’re still investigating some performance issues and periodic slowdowns, and will continue monitoring the system very closely.

Investigating

We’re currently investigating a degraded hardware component on the filesystem.

Monitoring

We're confident that most of the issues on $SCRATCH have been identified and fixed. We're keeping a close eye on the filesystem and will continue to monitor the situation.

Investigating

While we work on restoring compute nodes to service, we're also aware of some instabilities on $SCRATCH. Symptoms may include seemingly empty directories, commands that hang (such as ls or cd), or overall slowness.
We’re investigating the issue.
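
If you'd like to check whether $SCRATCH is responding from your session, a timeout-guarded listing can help distinguish a hang from ordinary slowness; the 15-second threshold below is just an example:

    # prints "possibly hanging" if the listing doesn't complete within 15 seconds
    timeout 15 ls -d $SCRATCH > /dev/null && echo "responsive" || echo "possibly hanging"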