Core networking issue

Postmortem

One of the Ethernet switches that provide the core backbone network on Sherlock crashed last night, which affected the cluster’s internal connectivity:

  • the scheduler may have been unresponsive at times,
  • access to the $HOME and $GROUP_HOME file systems may have been disrupted,
  • some network connectivity issues to login nodes, DTNs, and the outside may have occurred.

Physical intervention was required, and the issue was fixed at 8:40pm last night. All systems have now returned to normal.

Resolved
Assessed

Network connectivity issues that prevent the scheduler from operating properly have been reported.

4 Affected Services: