Sonassi

Experiencing packet loss within network core.

Update (18:15): High CPU load has been identified on all edge routers; investigation is continuing.
Update (18:30): Both core routers are non-responsive. An engineer on site at the data centre is investigating the issue.
Update (18:45): Upon reboot, one router has resumed partial functionality. Investigation is continuing.
Update (19:00): No new updates
Update (19:15): Attempts are still being made to restore functionality on both routers. Connectivity will be lost while one router is power-cycled and the other is booting.
Update (19:30): No new updates
Update (19:45): No new updates
Update (20:10): Both routers are online, but high packet loss is still present. Investigation is continuing.
Update (20:15): No new updates
Update (20:30): No new updates
Update (20:40): We are seeing more consistent network throughput and packet loss is significantly reduced; however, the issue does not appear to be fully resolved.
Update (20:50): Our monitoring is reporting 0% packet loss and full network health. However, we have been forced to shut down an entire switching network to resolve the issue, so the network is currently running in a “degraded” state. We are continuing to investigate the issue.
Update (22:00): The network has been fully restored and is no longer degraded. The issue is resolved; a post-mortem will be posted in due course.
Update (01:45): An HA failover test was attempted to verify the fix.

Post-Mortem

Our report from the incident is as follows.

Issue

Significant packet loss affecting all servers at our Joule House location, causing a total loss of service.

Outage Length

The duration was 97 minutes.

Underlying cause

Currently under investigation with Juniper TAC to identify and isolate the issue. It appears to be a repeat incident whereby flooding within a single access switch caused significant control plane CPU consumption within other network devices.

Symptoms

Our external monitoring probes immediately reported the fault. End users will have noticed the issue, as it affected overall service.
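For readers unfamiliar with how a probe turns raw results into a packet loss figure, the following is a minimal sketch only. It is not our monitoring system: the names `ProbeWindow`, `packet_loss_percent` and `is_fault`, and the 5% alert threshold, are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeWindow:
    """Counts from one measurement window of a hypothetical external probe."""
    sent: int      # probe packets sent during the window
    received: int  # replies received during the window


def packet_loss_percent(window: ProbeWindow) -> float:
    """Return the percentage of probe packets that went unanswered."""
    if window.sent <= 0:
        raise ValueError("no probes sent in this window")
    lost = window.sent - window.received
    return 100.0 * lost / window.sent


def is_fault(window: ProbeWindow, threshold_percent: float = 5.0) -> bool:
    """Flag a fault when loss meets or exceeds the (assumed) alert threshold."""
    return packet_loss_percent(window) >= threshold_percent
```

Under these assumptions, a window with 20 probes sent and 15 replies received works out to 25% loss and would be flagged, while a window with every reply received reports 0% loss.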

Resolution

Whilst the symptoms have been resolved, we believe the underlying issue is still present and is the result of a firmware bug. The issue has been escalated to the vendor for in-depth analysis and urgent review.

Our network and transit team is continuing to investigate, replicate and resolve this issue in an isolated environment, in parallel with Juniper TAC's efforts, so that a swift resolution can be reached.

The network status will remain as high until we are confident of a permanent resolution.

Monday 13th April 2015