Maintenance extension

Resolved Feb 6, 2019 6:01 PM PST

The vast majority of compute nodes have been restored to service, which marks the official end of yesterday’s maintenance.
Thanks again for your patience and understanding while we were working on restoring service.

Monitoring Feb 6, 2019 10:55 AM PST

We’re still making progress in restoring all compute nodes to service, and we hope to have most of them up and running later today.

Monitoring Feb 6, 2019 12:38 AM PST

We’re re-opening the cluster to user logins and we’ve lifted the scheduler reservation, so pending jobs have started running again.

Please note that Sherlock is not yet in a full production state, as many nodes are still down and need more work. We’ll continue to work on the remaining nodes tomorrow, but in the meantime, users can connect, access their files and submit jobs to the scheduler.

Again, we’re very sorry about the inconvenience and appreciate your patience and understanding while we’re working through these problems.

Problem Identified Feb 5, 2019 11:24 PM PST

We’re making some progress towards a resolution, but putting nodes back into a workable state takes a considerable amount of time, and progress is slow. Currently, about half of the cluster is still unavailable, and we may have to open the cluster to users while nodes are still being worked on.

We’re continuing to work on the situation and will keep this issue updated.

Investigating Feb 5, 2019 10:07 PM PST

We’re continuing to work on issues that arose during the scheduled maintenance, to bring Sherlock back into production as quickly as possible. We’ll post updates as they become available.