Experiencing some packet loss to some internal routes.
Post-Mortem
Our report on the incident is as follows.
Issue
Total packet loss; some customers' servers were completely inaccessible.
Outage Length
The duration was 15 minutes.
Underlying Cause
The vendor is carrying out continual diagnosis of our network core in an effort to identify and resolve the outstanding issues we have been experiencing.
This diagnosis involves gathering information from the switches and, in some cases, making minor adjustments. One such configuration change to the network edge filtering left a window open for attack.
The increased traffic flow targeting the routing engine led to increased CPU utilisation and a subsequent restart of the packet forwarding process (under current network conditions, this process can take up to 10 minutes to recover).
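For illustration only, the sketch below (plain Python rather than the vendor's configuration language, and not our actual filter) shows the kind of rate limit that edge filtering normally places in front of the routing engine. With that protection temporarily relaxed, flood traffic could reach the control plane directly and push CPU utilisation high enough to trigger the packet forwarding restart described above.

# Conceptual sketch only: a token-bucket policer of the kind edge filtering
# normally applies to traffic destined for the routing engine, so that a flood
# is dropped at the edge instead of consuming control-plane CPU.
import time

class TokenBucketPolicer:
    def __init__(self, rate_pps: float, burst: int):
        self.rate = rate_pps           # sustained packets per second allowed
        self.capacity = burst          # maximum burst size, in packets
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Return True to forward a control-plane packet, False to drop it."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical numbers: permit 1,000 pps with a 500-packet burst toward the
# routing engine; everything above that is dropped before it reaches the CPU.
policer = TokenBucketPolicer(rate_pps=1000, burst=500)
forwarded = sum(policer.allow() for _ in range(10_000))
print(f"{forwarded} of 10000 packets passed to the routing engine")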
Symptoms
Access to servers on the affected subnets was impossible.
Resolution
The backup switching/routing network was manually brought into service to restore connectivity.
After approximately 7 minutes, the primary network was restored to full health and traffic was successfully and cleanly failed back.
The network is now fully healthy, with full firewalling and fully automated availability of routing and switching. Through the continued efforts of the vendor and our own team, the historic issues we have experienced should now be considered resolved.