Saturday 23rd May 2015

Network Interruption

We are experiencing packet loss on some internal routes.

  • Update (13:38): We can see an extremely large amount of traffic targeting our edge routers. Updates to follow.
  • Update (13:48): We have blackholed the device being targeted with our upstream providers (a sketch of this mechanism follows these updates).
  • Update (13:53): Full service is now restored.
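
For illustration only, the sketch below shows one common way a blackhole like the one in the 13:48 update is signalled to upstream providers: announcing a /32 for the targeted address tagged with the well-known BLACKHOLE community (65535:666, RFC 7999). The address, next-hop, community value and the ExaBGP-style announcement format are assumptions made for the example, not details of our actual setup.

    #!/usr/bin/env python3
    """Illustrative remotely triggered blackhole (RTBH) announcement.

    Assumptions (not taken from this incident): the upstream providers accept
    a /32 route tagged with the well-known BLACKHOLE community (RFC 7999),
    and an ExaBGP-style daemon reads announcement lines from this process's
    standard output.
    """
    import sys
    import time

    BLACKHOLE_COMMUNITY = "65535:666"  # RFC 7999 value; many upstreams also define their own
    NEXT_HOP = "192.0.2.1"             # documentation address standing in for a discard next-hop


    def announce_blackhole(victim_ip: str) -> None:
        """Ask upstreams to drop all traffic destined for victim_ip."""
        line = (f"announce route {victim_ip}/32 "
                f"next-hop {NEXT_HOP} community [{BLACKHOLE_COMMUNITY}]")
        sys.stdout.write(line + "\n")
        sys.stdout.flush()


    if __name__ == "__main__":
        announce_blackhole("203.0.113.10")  # hypothetical targeted address
        while True:                         # keep the announcing process alive for the daemon
            time.sleep(60)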

Post-Mortem

Our report on the incident is as follows.

Issue

Total packet loss; some customers' servers were completely inaccessible.

Outage Length

The duration was 15 minutes.

Underlying cause

The vendor is continually diagnosing our network core in an effort to identify and resolve the outstanding issues we have been experiencing.

This diagnosis involves gathering information from the switches and, in some cases, making minor adjustments. One such configuration change, to the network edge filtering, left a window open for attack.

The increased traffic flow targeting the routing engine led to increased CPU utilisation and a subsequent restart of the packet forwarding process (under current network conditions, this can take up to 10 minutes to recover).
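
To illustrate why the edge filtering matters, the sketch below models the kind of rate limiter (a token bucket) that such filtering typically applies to traffic destined for the routing engine. The rates, names and Python form are assumptions chosen for the example, not our actual configuration.

    import time


    class TokenBucket:
        """Illustrative policer of the sort edge filtering normally applies to
        routing-engine-bound traffic (rates are invented for this example)."""

        def __init__(self, rate_pps: float, burst: float):
            self.rate = rate_pps       # sustained packets/sec allowed through to the routing engine
            self.burst = burst         # maximum burst, in packets
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True            # packet reaches the routing engine
            return False               # packet dropped at the edge, protecting routing engine CPU


    # With the policer in place, a flood is capped near the configured rate; with it
    # removed (as during the change window) every packet reaches the routing engine,
    # CPU utilisation climbs and the packet forwarding process can restart.
    policer = TokenBucket(rate_pps=1000, burst=2000)
    flood = 1_000_000
    passed = sum(policer.allow() for _ in range(flood))
    print(f"{passed} of {flood} flood packets reached the routing engine")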

Symptoms

Access to servers behind the affected subnets was impossible.

Resolution

The backup switching/routing network was manually brought into service to restore connectivity.

After approximately 7 minutes, the primary network was restored to full health and traffic was successfully and cleanly failed back.

The network is currently very healthy, with full firewalling and fully automated availability of routing and switching. Through the continued efforts of the vendor and our own team, the historical issues we have experienced should now be considered resolved.