Sunday 2nd June 2013

Network Outage on sms-jay

sms-jay suddenly stopped responding. The system is being looked at by an engineer.

  • Update (23:31): The issue looks to be the same as the fault affecting the machine last week. However, rather than a network driver being at fault, it looks like an issue with the access switch that the server is connected to. We’re continuing the investigation. In the meantime, the system has been restarted and failed over onto the secondary access switch.
  • Update (01:52): sms-jay has stopped responding again.
  • Update (03:12): Following the second outage, a system reboot did not resolve the issue, and the switch port that the server is attached to refused to detect the link as being up on either access switch. The cables were replaced, but to no avail. The network card in the server was replaced, also to no avail. The server has therefore been plugged into another pair of switches whilst the offending ports are diagnosed. It does not look to be a switch issue - but there is some fault taking place whereby the link is being forced down at the switch port level (on two separate switches).
  • Update (03:13): sms-jay is back up and being closely monitored.
  • Update (09:52): Upon reviewing the logs from the machine, we found a series of ECC (RAM) errors logged on the device's management console. The RAM will be replaced today whilst the existing sticks undergo testing.
  • Update (13:40): sms-jay has been deliberately powered down for an emergency RAM exchange. It should be back up in approximately 10 mins.
  • Update (14:39): The server has been powered back up and has been running for approximately 1 hour now. Performance will be degraded during the RAID array rebuild (from the unclean shutdown). New memory and a new CPU have been installed - the CPU and memory are being tested separately to attempt to identify a cause of these repeated issues.
  • Update (15:32): The memory check on all modules has completed 1 full pass without error. The CPU stress test is still running without error. We will leave these components to test for a further 24 hours; however, it looks unlikely that the issue is hardware-related.