$OAK

Some Oak components were temporarily unavailable

Resolved
Assessed

This issue was opened retrospectively.

A series of hardware issues occurred on some Oak components over the weekend. Oak is an extensive distributed storage system built to withstand hardware faults, but one failure led to a brief, partial disruption of the file system, impacting a single storage chassis. The SRC Oak team went on site to fix the issue on Saturday night, and the file system has been operating normally since then.

Some files on Oak may not have been accessible during this partial outage. Access has since been restored, but processes that were actively using those files may have remained stuck and will need to be cleared by restarting the compute nodes they were running on. A number of compute nodes are currently draining as a result, which may delay the start of new jobs, but job scheduling will resume normally once the currently running jobs terminate.

If you have any questions, please don’t hesitate to contact srcc-support@stanford.edu.