12:09 We believe we have identified a fix for this issue and are liaising with the engineers on site to bring services back online. We will keep you updated.
12:18 Connectivity has now been reinstated and we are monitoring the issue further before closing this outage off.
12:29 Connectivity is stable and operational. We will now set this incident back to operational and continue a full root-cause investigation with our senior networking resources. We apologise for the interruption to your service.
]]>08:14 - Our network engineers are working alongside our onsite teams and hope to have full connectivity reinstated shortly.
08:16 - Connectivity has been reinstated. We are still investigating root cause and checking full resiliency is back in place.
11:15 - Connectivity has remained stable but full resiliency is yet to be reinstated. There are planned emergency works to replace a switch this afternoon to fully reinstate resiliency. More information and communications will be forthcoming once a plan is finalised and booked in with our engineers.
15:45 - In order to fully reinstate resiliency for those impacted by this morning's outage, we will be commencing a switch replacement shortly. During this window all services impacted are deemed "at risk", although we do not expect an impact to services. This is an emergency change which is essential to bring full resiliency back online. We will update this status when the work has been completed.
17:30 - EMERGENCY MAINTENANCE COMPLETED - Our team has completed the emergency maintenance and added resiliency back to the impacted part of our network. The entire platform is no longer deemed "at risk". We will make no further changes tonight and our team do not expect any further impact to your service relating to this issue. We will be in touch with more information once the full investigation begins tomorrow.
]]>21:48 - We are currently looking into a major outage which looks to be a network related incident. Our escalation teams are working on this and we will provide further updates as soon as possible.
22:20 - We have identified the network outage and are liaising with third parties and our own networking teams to mitigate the impact as soon as possible.
00:09 - Our senior networking team are working on a resolution and currently expect to have a further update in one hour's time.
01:30 - We are still awaiting a resolution to this network issue. Third parties and our own teams are investigating and making progress in resolving the fault. There will be further updates to follow.
]]>We're aware of a disruption to our services and our team are currently investigating.
Updates will follow as soon as possible.
20/6/23 11:00 BST
Actions
Our network team will be carrying out maintenance on some of the routes within the Sonassi network.
Impact
No downtime or disruption to service is expected.
]]>We're aware of a disruption to our services and our network team are currently investigating.
Updates will follow as soon as possible.
Update (11:47 GMT): The issue has been stable since 11:15. Engineers are continuing to monitor.
]]>Actions
As per notifications sent in July, we will be adjusting the default TLS Ciphers defined on all stack load balancers.
The change removes the usage of RSA and weaker AES-CBC ciphers in order to meet security and PCI requirements.
Impact
The change will mean that the following clients and operating systems will be unable to communicate via HTTPS with the server:
If this isn't possible, then please get in touch and we can ensure that these ciphers remain enabled on your stack. However, please be aware that you will need sufficient justification for the usage of weaker ciphers in order to pass PCI scans.
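As a rough compatibility check ahead of the change, you can attempt a TLS handshake restricted to a stronger (non-RSA, non-CBC) cipher set from the client in question; the hostname below is only a placeholder for your own store's domain and the cipher string is illustrative rather than the exact list retained on the load balancers:
openssl s_client -connect www.example.com:443 -cipher 'ECDHE+AESGCM'
If the handshake completes, that client can negotiate the stronger ciphers; if it fails, the client would be affected by the change and should be upgraded.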
]]>Disruption
We need to perform some emergency maintenance on the backend systems that power my.sonassi.com and its management functionality.
Viewing credentials or stack management functionality (such as PHP version changes or issuing VPN bundles) will be unavailable during the maintenance period.
Estimated Downtime
Between 30 minutes - 1 hour.
Actions
my.sonassi.com will continue to be available for support requests during this time.
15:30 - Work has been completed.
]]>Update (01:23): Service has now been restored and customers can now submit support requests.
Update (00:10): We are currently aware of an issue with my.sonassi.com, so customers will not be able to raise support requests. We are looking into this and will update as soon as possible.
]]>Disruption
Replacement hardware - Network interruption
Estimated Downtime
10 Minutes - 1 Hour of expected downtime depending on location.
Actions
We're replacing certain components of our core network that we suspect have been causing issues with the Sonassi network over the past few months.
]]>An incident has occurred affecting a small number of customers within the datacentre. Our network team are looking into the issue now and we'll update this page once we have some further information.
Update (18:39): The issue has been located and fixed by our networking team. Further information will be communicated to customers affected.
Disruption
Potential total loss of network connectivity for all customer infrastructure.
Estimated Downtime
Up to 30 minutes.
Actions
An issue has been detected with a number of pieces of networking equipment. During the maintenance window these switches will be replaced with different hardware.
Updates
Disruption
Unavailability of support portal and stack control panel. This will not affect customer services/infrastructure.
Estimated Downtime
Up to 30 minutes.
Actions
The support portal platform hardware is being upgraded during which time services will be unavailable.
Updates
Update (14:39): We're investigating an issue at one of our datacentres
Update (16:31): We can confirm all services should now be back online and further investigation on the issue will be carried out.
Disruption
Migration of network edge routers.
Estimated Downtime
Up to 15 minutes.
Actions
We will be migrating our network edge routing infrastructure from Equinix MA3 into our own LDEX2 facility. During this window there will be disruption to edge routing whilst routing re-converges in the new location.
Disruption
To all stacks in our main facility
Estimated Downtime
Up to 180 minutes.
Actions
Scheduled upgrades to customers' stacks are taking place to improve networking performance and reliability. You will have received an email specifying the date on which your stack(s) will be upgraded.
]]>Disruption
"This will disrupt Autoscaling, Overflow servers, IPSEC VPN tunnels, and some my.sonassi.com portal functionality."
Estimated Downtime
Up to 90 minutes.
Actions
Upgrade of some critical infrastructure
]]>Disruption
Replacement of core routers.
Estimated Downtime
Up to 5 minutes.
Actions
Core routers are being upgraded to newer hardware. This should be a seamless zero-downtime migration between existing and new devices.
]]>Disruption
Maintenance on my.sonassi.com control panel.
Estimated Downtime
Up to 120 minutes.
Actions
Scheduled upgrades to my.sonassi.com are taking place to improve performance and reliability. This is the second part of a two-part maintenance programme for the upgrade.
]]>Disruption
Maintenance on my.sonassi.com control panel.
Estimated Downtime
Up to 60 minutes.
Actions
Scheduled upgrades to my.sonassi.com are taking place to improve performance and reliability. This is the first part of a two-part maintenance programme for the upgrade.
Updates
Disruption
Failover testing of core network devices.
Estimated Downtime
Up to 5 minutes.
Actions
Ahead of the upcoming data centre migration, testing of all core networking equipment is being performed to ensure that failover operates correctly and downtime is kept to a minimum during the procedure.
Updates
Estimated Downtime
A short network blip is expected during failover.
Disruption
As with any maintenance work there is an increased risk during the maintenance window as network services will be operating on reduced paths.
]]>Estimated Downtime
Up to 1 hour.
Disruption
Upgrades to internal infrastructure which will disrupt overflow and autoscaling services during the times specified. Other services should remain unaffected.
]]>Disruption
The following services will be unavailable while the work is performed:
A support telephone number will be provided as an alternative to report any emergency issues while the work is being performed.
Estimated Downtime
15 minutes.
Actions
Our infrastructure team will be performing work to upgrade the platform that hosts the my.sonassi.com ticketing and ordering system.
]]>Disruption
Network equipment investigation/replacement per incident 160. There will be brief periods of network instability during this window.
Estimated Downtime
Up to 4 hours.
Actions
Network infrastructure will be monitored whilst configuration is updated to replicate conditions of the previous incident and implement a resolution.
]]>Disruption
Support portal platform upgrade per incident 160. The support portal will be entirely unavailable during this period.
Estimated Downtime
Up to 60 minutes.
Actions
The underlying hardware platform for the support portal is being replaced with newer/larger infrastructure. This will result in improved performance of the portal as a whole as well as offering improved reliability.
Updates
149.86.96.0/24 range.
Disruption
Core router upgrade programme, per incident 153. All customers services will be affected during this maintenance window whilst the switchover occurs.
Estimated Downtime
Up to 25 minutes.
Actions
Our core routers are being replaced in UK-1/MA3. These devices are responsible for all routing in the UK and as such, all services will be briefly impacted during the necessary maintenance window. The procedure will include powering off the secondary router, installing the replacement - then briefly suspending traffic flows to redirect over the new router. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
Updates
Disruption
Aggregation switch upgrade programme, per incident 153. Several customers services will be affected during this maintenance window during the switchover.
Estimated Downtime
Up to 15 minutes.
Actions
We will be upgrading a number of aggregation switches across UK-1/MA3. The procedure will include powering off the secondary switch, installing the replacement - then briefly suspending traffic flows to redirect over the new switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
Status
Do not raise an emergency ticket if you use CloudFlare. Customers are advised to view https://www.cloudflarestatus.com/ and disable CloudFlare in the meantime.
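A quick way to check whether a domain is still being served via CloudFlare after disabling it (the hostname below is a placeholder) is to look at what it currently resolves to:
dig +short www.example.com
If the addresses returned still belong to CloudFlare rather than your stack's own IPs, the DNS change has not yet taken effect or propagated.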
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for the entire MA3 facility and a single isolated network segment.
Outage Length
The duration was between 60 and 180 minutes.
Underlying cause
Unfortunately, this was a repeat incident of a similar nature to https://status.sonassi.com/incident/149/
We believe a malfunctioning aggregation switch as part of the backbone of the network core began sending out malformed/erroneous L2 packets, driving up CPU utilisation on other routers and switches. Control plane traffic was disturbed and multi-chassis aggregated links degraded, resulting in loss of downstream connectivity to rack pods and subsequent customer stacks.
Symptoms differing from the last incident led diagnosis down an incorrect initial path to resolution, extending resolution times and resulting in the isolation of an entire network segment (a single rack pod).
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
A repeat incident of a malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; permanently powering off the device (with subsequent replacement due) resolved the underlying issue.
Prevention
The network architecture and equipment in use is more than adequate, offering extreme levels of availability, with multiple layers of redundancy designed into every tier of the network stack. However, a failed/failing device has caused significant network disruption even though the device had not explicitly failed or demonstrated any signs of error or malfunction - other than the successful operation of the network in its absence.
Whilst rare, the eventual degradation of the switch silicon is believed to be the cause, and the device (and its paired device) will be replaced with latest-generation hardware.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for the entire MA3 facility.
Outage Length
The duration was between 55 and 95 minutes.
Underlying cause
A surge in CPU load on both core routers caused disruption to control plane traffic, although forwarding and routing were operating without issue.
Customer stacks feature two network interfaces, attached to two diverse switching and routing networks; the active interface is selected by performing a reachability check (using ARP) to its respective gateway. As the core routers were not responding to control plane traffic, ARP requests were being dropped, resulting in stacks taking both primary and secondary interfaces offline; ultimately severing all connectivity.
HA stacks were further affected by this issue, where the shut-down interfaces left the members of the cluster unable to reach each other, resulting in a "split brain" scenario. Even when connectivity was restored, manual intervention was required to address the split brain.
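As a rough illustration of the reachability check described above (the interface name and gateway address below are placeholders rather than the actual MageStack configuration), a gateway ARP probe on Linux looks like:
arping -c 3 -I bond0 192.0.2.1
When no replies are received, the stack treats that path as unavailable - which is why unanswered ARP during this incident caused both the primary and secondary interfaces to be taken offline.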
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
A malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; rebooting the affected switch was sufficient to allow CPU loads to drop and for stacks to "bring up" their network interfaces after having a successful ARP check.
Prevention
We have identified several areas in which improvements can be made,
In addition to the above, deploying newer generation aggregation switches and core routers will go a long way towards addressing control plane capacity. Substantial investment in extremely high-capacity, latest-generation hardware has already been made and the timeline for replacement of existing hardware will be brought forwards.
]]>13:30 10/06/2019
Disruption
The control panel and support portal at my.sonassi.com will be briefly inaccessible.
Estimated Downtime
5 minutes.
Actions
Following an earlier incident, maintenance is taking place in order to replace failed devices that the support portal system is currently connected to.
]]>Update (15:40): MailChimp is suffering a global outage. Disable MailChimp immediately on your store if you are experiencing downtime.
Update (15:41): Disable the module using MageRun
mr_examplecom config:set 'mailchimp/ecommerce/active' 0
Update (17:14): MailChimp's service status is healthy again, but it is recommended to remove/disable the MailChimp module from your store due to inherent reliance on MailChimp's API.
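If you disable the module via MageRun as shown above, note that the change will not take effect on a cached store until the configuration cache is refreshed; using the same example alias (mr_examplecom is the placeholder used above - substitute your own store's alias), that would be:
mr_examplecom cache:flush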
Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for a single rack of equipment.
Outage Length
The duration was up to 18 minutes.
Underlying cause
A Power Supply Unit (PSU) in another server within the same rack, connected to the B power feed, catastrophically failed. The catastrophic failure, unlike that of a typical PSU failure, caused a spike/surge in electrical activity, which triggered the Moulded Case Circuit Breaker (MCCB) for the B power feed to trip and cut off power.
On restoration of power to the B feed, the access switch booted into a “fail-safe” environment, in which the configuration was not fully restored. This led to the stack detecting that the physical link was back up and beginning to route traffic over the B network; however, the traffic was not being routed correctly.
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
Network engineers immediately restored the running configuration of the affected access switches, and services gradually returned to normal. The disruption in network flow and packet loss caused delays in service restoration on the multi-server stack, as each stack member was unable to cleanly communicate with its peers. This in turn caused load averages to increase, resulting in delayed service restoration even after the network connectivity was fully operational.
Prevention
Monitoring, disaster recovery procedures and staff action all worked exactly as anticipated, rapidly identifying and resolving the issue and keeping downtime to a minimum.
This was an exceptional situation and we believe there is nothing that could have been done differently to reduce or avoid the downtime.
]]>30/06/2018 - ~
Disruption
None.
Estimated Downtime
Eternity.
Actions
TLS v1.0 is permanently disabled in line with the deadline for 30/06/2018. See https://www.sonassi.com/blog/pci-dss-v3-and-tls-v1-0 for more information.
]]>We've identified a problem with a third party service, MailChimp.
If your store is experiencing issues and using MailChimp, please disable the module prior to contacting support
]]>10:15-10:30 07/06/2018
Disruption
Scheduled upgrades for my.sonassi.com
Estimated Downtime
15 minutes
Actions
Customers are encouraged to call for support during this planned maintenance window.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs.
Outage Length
The duration was 6 minutes.
Underlying cause
The outage experienced was due to one of our transit providers suffering an unexpected reboot of a router in Manchester. High CPU was noted at the time of the reboot, which was responsible for dropped packets prior to the reboot.
Symptoms
Our monitoring probes immediately reported the packet loss, which affected approximately 30% of our total inbound traffic.
Resolution
Our network operations team immediately shut down the connectivity to the affected transit provider and re-routed traffic around them. This restored full connectivity within seconds. The affected transit provider will remain "shut down" until we have seen consistent healthy performance, after which it will be added back to our transit pool.
Sonassi maintains connectivity from multiple independent transit providers to provide internet connectivity resilience. In this instance, a single provider failed, resulting in some traffic being briefly dropped prior to re-routing.
]]>Disruption (13/12/2017)
Disruption (14/12/2017)
We are currently investigating a possible network incident.
09:00-10:00 14/06/2017
Disruption
Scheduled upgrades for my.sonassi.com
Estimated Downtime
1 hour
Actions
This is a non-service affecting upgrade of our support portal at my.sonassi.com, these important feature updates will bring new functionality and control for your account. Critical support will be available via email and telephone, where details will be provided on my.sonassi.com during the maintenance.
]]>We are currently investigating a possible network incident.
We are currently investigating a possible network incident.
Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs.
Outage Length
The duration was 9 minutes.
Underlying cause
A high-volume DOS attack targeted a single customer and was large enough to saturate the connectivity of one of our transit providers.
Symptoms
Our monitoring probes immediately reported the attack. Despite the attack being targeted at a single customer, the volume affected all our customers causing high levels of packet loss.
Resolution
Initially, without full information available, we interpreted the packet loss as an issue with a single transit provider and shut down our connectivity to said provider. At that point, traffic re-routed to our other (larger capacity) providers and the issue looked to be resolved as the larger capacity transit "absorbed" the attack. Moments later, our Level3 dDoS mitigation platform automatically activated and began scrubbing the malicious traffic and we restored connectivity to the original transit provider.
From start of attack to mitigation - the total time was 9 minutes. Our dDoS mitigation platform is a relatively new addition to the Sonassi network to offer an unprecedented level of protection to customers - and we are extremely happy that yet another large volume DOS attack was mitigated with only minimal disruption prior to activation.
]]>Start Date / Time: 23/11/16 16:00
Finish Date / Time: 23/11/16 18:00
Disruption
Access switch replacement, internal network capacity increase.
Estimated Downtime
Up to 15 minutes
Actions
We will be upgrading the access switches within each affected rack. The procedure will include failing traffic over to the backup switch, replacing the primary switch, restoring normal traffic flows, then replacing the backup switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
]]>Start Date / Time: 22/11/16 16:00
Finish Date / Time: 22/11/16 18:00
Disruption
Access switch replacement, internal network capacity increase.
Estimated Downtime
Up to 15 minutes
Actions
We will be upgrading the access switches within each affected rack. The procedure will include failing traffic over to the backup switch, replacing the primary switch, restoring normal traffic flows, then replacing the backup switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Very brief loss of connectivity affecting around 20% of total traffic.
Outage Length
The duration was <4 minutes.
Underlying cause
An upstream provider suffered a router reboot, which caused traffic to be dropped both in/out of our network, for the small volume of traffic that passes via that provider.
Symptoms
Our monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
The issue was resolved by temporarily dropping our connection to the respective provider and re-routing traffic over our other transit providers. The issue immediately subsided and normal traffic flows resumed. The upstream provider has since restored a previous configuration on their device and has it running stably again; we have since restored our connection to them and are operating at 100% capacity.
]]>PayPal's API domain is suffering a DNS outage and is not resolving correctly (more information)
Start Date / Time: 13/09/16 21:00
Finish Date / Time: 14/09/16 06:00
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>Start Date / Time: 12/09/16 21:00
Finish Date / Time: 13/09/16 06:00
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs to our legacy IP ranges.
Outage Length
The duration was 9 minutes.
Underlying cause
A large-volume DOS attack (300Gb per second / 200 million packets per second) targeted a single customer.
Symptoms
Our monitoring probes immediately reported the attack. Despite the attack being targeted at a single customer, the volume affected all our customers causing high levels of packet loss.
Resolution
The issue was resolved by the automatic activation of our Level3 DOS mitigation platform. From start of attack to mitigation - the total time was 9 minutes. Our dDoS mitigation platform is a new addition to the Sonassi network to offer an unprecedented level of protection to customers - and we are extremely happy that a DOS attack of such significant volume was mitigated so successfully.
]]>If you are a BT customer and cannot access your server, please contact BT.
]]>19/07/2016 23:00 - 20/07/2016 06:00 BST and 21/07/2016 00:01 - 21/07/2016 06:00 BST
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>16/04/2016 14:00 - 17/04/2016 00:00 GMT and 17/04/2016 14:00 - 18/04/2016 00:00 GMT
Disruption
Core network capacity upgrade.
Estimated Downtime
0-5 minutes
Actions
We are increasing the capacity of our core and aggregation network to improve performance and scalability for our customers. This will involve deploying new network switches and replacing previous devices one at a time. It should largely be a downtime free operation; there may be small windows of packet loss during failover between our A and B switching networks.
April 2016
Disruption
A MageStack security update is being deployed; this important security update may cause 502 errors to be displayed briefly on your store.
Estimated Downtime
1-5 minutes
Actions
The automated update is being monitored by our team and you will be notified at the start and finish of the works.
]]>25/02/2016 20:30 - 25/02/2016 21:30 GMT
Disruption
Hardware upgrade for support portal server.
Estimated Downtime
30-60 minutes
Actions
We are upgrading the hardware used for our support portal.
Updates
18/02/2016 18:00 - 18/02/2016 19:30 GMT
Disruption
Hardware upgrade for support portal server.
Estimated Downtime
5-10 minutes
Actions
We are upgrading the hardware used for our support portal.
]]>13/02/2016 23:00 - 14/02/2016 01:00 GMT
Disruption
Firmware upgrade on border routers
Estimated Downtime
5-10 minutes
Actions
Each core router will be upgraded in turn to their latest firmware release. Failover should occur between the devices, resulting in a brief period of downtime; however, we would like to allow for a window of up to 10 minutes of possible downtime.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs to our legacy IP ranges.
Outage Length
The duration was 30 seconds.
Underlying cause
The transit provider carrying our legacy IP range inbound traffic experienced an interface flap at LINX, triggering a re-route and re-convergence of routing.
Symptoms
Our external monitoring probes immediately reported the fault. A very small number of users would have been unable to access their servers.
Resolution
The issue resolved itself by automatically selecting another carrier (Level3) when the connection at LINX dropped. The cause of the downtime was merely the delay of end-user ISP route re-convergence.
All customers are already being migrated from our legacy IP range as part of our 2015 IP migration, giving us full control of all customer traffic, both inbound and outbound.
]]>Magento has released a new security patch for versions 1.4 and newer, SUPEE-6482
The vulnerabilities
This bundle includes protection against the following security-related issues:
What you need to do
You must apply this new security patch as soon as possible. It can be downloaded from https://www.magentocommerce.com/download
You can either patch the store yourself using the instructions below, or submit a (chargeable) maintenance support ticket at https://www.theclientarea.info where our support team can apply the patch on your behalf (est. 5-60 mins application time).
More information
Read more about the patch here, http://us5.campaign-archive1.com/?u=34ff0d4b547cfa0a6a6901212&id=90740291cb
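For reference, the official SUPEE patches are distributed as shell scripts that are run from the Magento root directory. A typical application looks like the following, although the exact filename depends on your Magento edition and version and is only illustrative here:
sh PATCH_SUPEE-6482_CE_1.x_v1.sh
After applying the patch, flush the Magento cache and confirm the storefront and admin still behave as expected.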
]]>Estimated Downtime
Periodic loss of connectivity for short periods
Actions
One of our transit providers is performing maintenance on their core network as part of their continued service enhancement programme.
]]>Magento has released a new security patch for versions 1.6 and newer, SUPEE-6285
The vulnerabilities
This bundle includes protection against the following security-related issues:
What you need to do
You must apply this new security patch as soon as possible. It can be downloaded from https://www.magentocommerce.com/download
You can either patch the store yourself using the instructions below, or submit a (chargeable) maintenance support ticket at https://www.theclientarea.info where our support team can apply the patch on your behalf (est. 5-10 mins application time).
More information
Read more about the patch here, http://us5.campaign-archive1.com/?u=34ff0d4b547cfa0a6a6901212&id=d47fcf1c6d
]]>Post-Mortem
Our report from the incident is as follows.
Issue
A very small number of customers reported connectivity issues; these were unreproducible and unconfirmed by our network team.
Outage Length
No outage.
Underlying cause
We collected several traceroutes from customers, observing both the forward and reverse path to ascertain what commonality may have existed. However, no single cause could be identified.
Symptoms
Customers reported slow page load times and general difficulty connecting to their stores.
Resolution
No action was taken by our team. We had 5 isolated reports from customers, which led us to create a network alert in case of a network-wide event. It is our policy that after 5 isolated reports, we put out an unconfirmed notification whilst we investigate.
As we were unable to identify any fault, the issue can only be attributed to a wider, unknown internet congestion issue.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total packet loss; some customers' servers were completely inaccessible.
Outage Length
The duration was 15 minutes.
Underlying cause
Continual diagnosis of our network core is taking place by the vendor in an effort to identify and resolve the outstanding issues we are experiencing.
This diagnosis involves gathering information from the switches and in some cases, making minor adjustments. A configuration change was made to the network edge filtering that left a window open for attack.
The increased traffic flow targeting the routing engine led to increased CPU utilisation and a subsequent restart of the packet forwarding process (under current network conditions, this can take up to 10 minutes to recover).
Symptoms
Access to servers behind the affected subnets was impossible.
Resolution
The backup switching/routing network was manually restored to resume connectivity.
After approximately 7 minutes, the primary network was restored to full health and traffic was successfully, cleanly failed back.
The current network condition is very healthy, with full firewalling and full automated availability of routing and switching. Through continued efforts from the vendor and our own team, the historic issues we have experienced should now be considered resolved.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
A very small number of IP subnets encountered a routing loop and some customers' servers were inaccessible.
Outage Length
The duration was 23 minutes.
Underlying cause
The affected IP ranges are those we carry as a legacy from a historic provider - they do not form part of our multihomed BGP network and as such are subject to possible outages should the transit provider supplying them encounter problems.
The routing loop occurred because of a configuration change on the upstream provider's network.
Symptoms
Access to servers behind the affected subnets was impossible, a routing loop was visible on a traceroute.
Resolution
The provider reverted the change and service was immediately restored.
The long-term plan is to renumber all these IP addresses into our own, so that they can be announced over our resilient, multihomed BGP network. This task is already underway and customers will be contacted soon to arrange for changeover dates.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Significant packet loss, affecting all servers at our Joule House location, causing a total service loss to all servers.
Outage Length
The duration was 97 minutes.
Underlying cause
Currently under investigation with Juniper TAC to identify and isolate the issue. It appears to be a repeat incident whereby flooding within a single access switch caused significant control plane CPU consumption within other network devices.
Symptoms
Our external monitoring probes immediately reported the fault. End users will have noticed the issue as it had an effect on overall service.
Resolution
Whilst the symptoms have been resolved, we believe the underlying issue to still be present and the result of a firmware bug. The issue has been elevated to the vendor for in-depth analysis and urgent review.
Our network and transit team is continuing to investigate, replicate and resolve this issue in an isolated environment in parallel with Juniper TAC's efforts, so that a swift resolution can be reached.
The network status will remain as high until we are confident of a permanent resolution.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Significant packet loss, causing over 50% of packets to be dropped to a single rack of equipment and a secondary symptomatic effect of <10% loss within the network core. This had a significant effect on servers at Joule House, causing a total service loss to the servers connected to the respective access switch stack.
Outage Length
The duration was 63 minutes.
Underlying cause
Flooding within a single access switch, causing significant control plane CPU consumption within other network devices.
Symptoms
Our external monitoring probes immediately reported the fault. End users will have noticed the issue as it had an effect on overall service.
Resolution
The switch generating the traffic was observed to be consuming 100% CPU, it was initially power cycled in the hope that the device would become responsive again. Unfortunately, the issue propagated to the remaining 5 switches within the stack (in a single rack), generating further problems.
To avoid major network disruption for the entire location, all access switches were powered off simultaneously, then powered back up one at a time. This restored service and resolved the issue at hand.
Technical reports will be submitted to the vendor for analysis, however, with upcoming access-layer networking upgrades, it is unlikely this will be pursued further.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 4 hours long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Estimated Downtime
Indefinitely
Actions
It has now reached 6 months from the initial notification that our legacy shared hosting offering would be decommissioned.
As shared hosting is no longer part of our offering at Sonassi, the servers will be powered down as of 12:00 indefinitely.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 1 hour long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 4 hours long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Minor intermittent packet loss, causing some packets to be dropped. This had negligible effect on servers at Joule House; the effects may not even have been noticeable to end users.
Outage Length
The intermittent packet loss duration was 2 minutes.
Underlying cause
The load on a core router suddenly increased with no known cause.
Symptoms
Our external monitoring probes immediately reported the fault. End users may not have noticed the issue as it had near negligible effect on overall service.
Resolution
As the router was non-responsive to input, it was deemed necessary to restart the device. Seamless failover completed to the other router whilst the device was powered down. There was no loss of service during failover.
The router was powered back on, underwent a consistency check and was added back into the routing pool.
]]>Estimated Downtime
<0 minutes
Actions
Some minor software updates are due. An automated update will take place on your server during the maintenance window, you should not notice any outages during this period.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Minor network outage
Outage Length
3 seconds
Underlying cause
One of our transit providers (Cogent) experienced a router failure within their network. Increasing CPU usage on their core router caused packets to be progressively dropped.
Symptoms
Our external monitoring probes immediately reported the fault. Some customers (whose traffic was routed over Cogent), experienced an extremely brief window (<1 minute) of slow page load times or server inaccessibility.
Resolution
Once the packet loss threshold was hit, our internal BGP latency and packet loss measuring device automatically de-preferenced Cogent from the available BGP routes. Once Cogent was removed, traffic continued to flow out over our remaining carriers as normal.
Convergence took <5 seconds, but propagation at other ISPs may have taken a couple of minutes, which is why some customers may have experienced a slightly longer outage.
Our automated systems and monitoring systems behaved exactly as designed for this disaster scenario and recovered the carrier failure in less than 5 seconds.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
23 hours and 3 minutes.
Underlying cause
A HDD developed bad sectors which caused the RAID array to fail. Counter-intuitively, the healthy drive was forced out of the array due to the faulty drive being able to replicate its data to it.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
Resolution
We immediately removed the drives from the server and installed them into a brand new server, to rule out power supply, motherboard or backplane failure.
We attempted to rebuild the RAID array in order to remove the failed drive, however this proved futile.
Instead, we were forced to create a new single disk RAID array, then recover the data from the failing disk to this one. We did encounter issues with the MySQL databases and this posed a concern that there may have been data loss. However, indications show data was retrieved without error. Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage.
The likelihood of a brand new drive failing within 10 days of 2 other drives failing is unprecedented - an almost impossible possibility. We have not completed testing the old chassis, but at this point we can only assume a power-related issue that could have been overvolting the drives and causing drive failure.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
27 hours and 30 minutes.
Underlying cause
A HDD completely failed in the server leaving the RAID array degraded. However, the remaining drive immediately then began showing signs it was failing, resulting in the server mounting the filesystem as read-only as a precautionary measure against data corruption.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
Resolution
We immediately removed the failed drive and replaced it. We attempted several times to check the filesystem health and copy the data from the failing drive to the new RAID array, eventually resulting in 100% customer data retrieval.
Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage. As we do not operate off-site backups on shared hosting, this was absolutely necessary to recover customer data.
A secondary reason for the downtime duration was that our monitoring platform failed to identify an issue, given the machine was responding to healthcheck tests in a normal fashion.
The servers automatically run daily and weekly HDD self-tests, but no early warning signs had been issued, so we had no indication this was about to happen.
As a result of this incident, we will
Estimated Downtime
<5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a BGP feed being taken offline. Routes should gracefully failover to an alternate carrier during their maintenance window.
Post-Mortem
Our report from the incident is as follows.
Issue
Partial packet loss affecting global connectivity.
Outage Length
5 minutes.
Underlying cause
One of our carriers saw a large burst of traffic within their network resulting in a loss of connectivity.
Symptoms
Our internal and external monitoring probes immediately reported a fault.
Resolution
No active steps needed to be taken. Connectivity gracefully failed over to another carrier. A manual de-preference of that carrier was added to prevent its use until connectivity continuity was restored.
]]>Estimated Downtime
20 minutes
Actions
Since the restart, I/O wait has been higher than expected on sms-jay. We feel that a kernel update would be a wise manoeuvre given the age of the current release.
Estimated Downtime
None
Actions
One of our carriers is undertaking maintenance, which will result in a BGP feed being taken offline. Routes should gracefully failover to an alternate carrier during their maintenance window.
Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting worldwide connectivity.
Outage Length
25 minutes
Underlying cause
We had been made aware by one of our carriers that maintenance would be conducted between 12am and 6am; we had made provisions for this and were prepared for systems to automatically switch over during the outage on one carrier.
However, the failover did not occur as planned, as it appeared our other carrier was also affected.
Symptoms
Our internal and external monitoring probes immediately reported a fault.
Resolution
Despite our efforts, the BGP sessions could not be re-established. As a last resort, both routers were consecutively rebooted by an on-site technician. This re-established the BGP sessions and connectivity was restored.
We believe there may be commonality between the carriers (shared fibre/conduits/backhauls). An investigation has been launched to see why and how both were simultaneously affected.
Services are being closely monitored and it is likely that some failover tests will be conducted throughout the week to test and guarantee against future failure.
]]>We have null routed the traffic at our network edge and expect the problem to be resolved shortly.
Estimated Downtime
Up to 240 seconds per customer
Actions
MageStack is automatically kept up to date; however, some older releases need human oversight during a manual upgrade for major point upgrades.
]]>One of our carriers has reported a fault with routing in the UK.
Update (17:54): We have failed over to another carrier in the interim whilst this is investigated. Updates to follow.
Update (18:02): The issue stems from a major fault at LINX. We’re routing around LINX right now, but other ISPs may still be sending traffic via LINX and there might still be users experiencing packet loss.
Update (18:15): Traffic flows have reached normal levels again; the issue at LINX looks to be resolved.
Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting UK ISPs.
Underlying cause
Broadcast traffic at LINX causing high CPU usage on core routers.
Symptoms
Inaccessibility of our network from certain ISPs.
Resolution
We dropped our peering session with LINX whilst the issue was diagnosed and identified. The issue at LINX has reportedly been fixed and our network team will enable peering at LINX once we are confident their routers are once again stable.
]]>Estimated Downtime
Up to 120 seconds in small bursts across the course of the day
Actions
Following some repeated firmware issues on access switches, we are systematically replacing all access switches in every rack in Joule House.
As we run a redundant switching network, we will begin by powering down the B switching network, and replacing cabling and switches. Then powering it back up. Then repeating the same process on the A switching network.
When the switching networks go down and back up (4 times in total), it will cause a spanning tree calculation, which will result in momentary downtime (<30s).
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting a very small number of users worldwide, resulting in slow page load times, or no response.
Underlying cause
Fibre break within Level 3 / Global Crossing’s core infrastructure. Our other carriers were unaffected, so only those ISPs peering with Level 3 saw issues.
Symptoms
Up to 100% packet loss from a limited number of national and international ISPs where traffic was coming in via Level 3.
Resolution
We temporarily stopped using L3/GBLX and prioritised our routes over Cogent until L3/GBLX had put repairs in place. We received confirmation that at 3:00am the broken fibre had been fixed, and routes began to flow back over L3/GBLX once again.
]]>Estimated Downtime
300 seconds
Actions
Shared hosting server maintenance
]]>Estimated Downtime
Up to 180 minutes
Actions
A HDD failed on sms-sagat and required the replacement of the drive. On-site engineers replaced the drive within minutes of notification.
The RAID array now needs to rebuild; while it is degraded, performance will be relatively poor.
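If the server uses Linux software RAID (an assumption here - hardware RAID controllers expose this through their own vendor tools instead), the rebuild progress can be checked with:
cat /proc/mdstat
The output shows a completion percentage and an estimated finish time for the resync while the array remains degraded.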
22/03/2013 11:00pm
Disruption
Switch over of power feeds from B only to A+B
Estimated Downtime
Up to 10 minutes
Actions
Following the ‘B’ feed power failure on 13/03/2012 - failover switches were moved to the ‘A’ feed.
As the ‘B’ feed has now been repaired, all affected servers must now be moved back to the ‘B’ feed. This means powering down the switching equipment and checking routes propagate correctly afterwards.
Downtime should be minimal during the swap - but there will be a loss of service.
Updates to follow.
19/01/2012 12:00am
Disruption
Switch over of power feeds from B only to A+B
Estimated Downtime
Up to 5 minutes
Actions
Following the ‘A’ feed power failure on 15/01/2012 - single-fed servers were manually moved to the 'B’ feed to immediately restore service.
As the 'A’ feed has now been repaired, all affected servers must now be moved back to the primary 'A’ feed. This means powering down your machine, moving its PDU back to the 'A’ feed and then powering it back up.
Downtime should be minimal during the swap - but there will be a loss of service.
04:05:00 15/01/2013
Disruption
There has been a total loss of power on a single feed at Joule House.
Updates to follow.
18/01/2013 00:00am until 18/01/2013 00:30am
Disruption
Customers will experience a potential loss of service for up to 60 seconds in any one instance, during the maintenance window of 30 minutes
Estimated Downtime
60 seconds
Actions
Core network maintenance
]]>15/11/2012 (TBC)
Disruption
Brief total loss of connectivity to our entire network
Estimated Downtime
Up to 5 minutes
Actions
Our racks at Joule House are currently fed via multiple point-to-point links back to Reynolds House. However, a fault has been identified in the interconnects at Reynolds House that affects both of these point-to-point links.
Engineers are currently putting in a 3rd link and will switch the service onto this replacement whilst both the primary and secondary links are tested and repaired.
08:53:51 15/11/2012
…
16:57:57 15/11/2012
Disruption
We have seen frequent but very brief outages (lasting between 20 seconds and 4 minutes) since 08:53 today.
This is currently under investigation with our NOC team.
Updates to follow.
Please contact us at https://sms-sagat.theclientarea.info/ if your machine is unavailable.
]]>Minor packet loss affecting a very small number of users worldwide, resulting in slow page load times, or no response
Underlying cause
Failed components within Level 3 / Global Crossing’s core infrastructure
Symptoms
<1% packet loss across the entire network
Resolution
We have temporarily stopped using L3/GBLX and are advertising our routes using Cogent until such time as L3/GBLX have put repairs in place.
]]>19/10/2012 13:45
Disruption
There is some reported packet loss from specific UK broadband providers to our network. We are currently investigating it.
15/09/2012 11:45am
Disruption
Multiple security updates are being applied to all shared hosting servers
Estimated Downtime
10 minutes
Actions
PHP/MySQL/Roundcube/PHPMyAdmin will be removed and re-installed on all shared hosting servers with the latest security patches applied.
03/11/2012 12:00am
Disruption
A works order has been raised to migrate all dedicated servers from Reynolds House and Delta House to our new facility, Joule House.
Estimated Downtime
120-240 minutes
Actions
Servers will be cleanly powered down, then transported to the new data-centre, where they will be installed and powered back up.
Some customers will experience a loss of service during the migration.
We have a 3-year lease at Joule House, so this will mark the last of any future data centre migrations.
]]>12/09/2012 22:00
Disruption
A works order has been raised to carry out Essential Maintenance Work to add additional redundancy measures to our infrastructure.
Estimated Downtime
120 minutes
Actions
Re-routing of traffic is required to perform essential maintenance work upon our infrastructure.
Customers should not experience a loss of service, but may notice traffic being re-routed during the maintenance window.
]]>19/08/2012 11:00am
Disruption
Shared hosting customers and dedicated hosting customers in both Reynolds House and Delta House will be affected.
Estimated Downtime
15 minutes
Actions
A core router has failed and the network is currently running degraded. We had intentions of upgrading the core network infrastructure over the coming months with newer (more powerful) equipment; however, due to this failure, we intend to bring the go-live date forward.
As a result, the maintenance window will include removing all previous routing hardware and replacing it with the newer, more powerful, units.
Minor packet loss affecting a very small number of dedicated servers at Delta House
Underlying cause
A minor DOS attack to a boundary shared firewall
Symptoms
<10% packet loss to some subnets
Resolution
We isolated the affected subnets and removed them from the shared firewall routes to let the core routers divert the traffic to the affected server.
]]>Complete loss of power at Delta House, Manchester
Underlying cause
The core UPS infrastructure suffered a fatal error, cutting power on the entire “A” feed and causing a data centre site-wide outage affecting over 50 racks.
Symptoms
Complete loss of power and subsequent access to any equipment in Delta House
Resolution
A Sonassi engineer was on-site within 15 minutes of the incident. Data centre electricians had diagnosed the fault to be within the UPS and promptly bypassed it, to run power via mains directly.
Our engineer manually booted and tested each server in every rack to ensure it started cleanly with all HTTP services running. All machines started without fault or issue.
Engineers from the UPS manufacturer were dispatched to the data centre and are currently implementing a repair.
]]>Power was briefly restored on generator with the UPS in bypass, then terminated when it was switched back to primary/mains.
Power should be restored, however the UPS is currently being investigated by engineers.
Update (18:53 - 01/07/2012)
All affected racks in the facility have been booted and an engineer on site has manually verified that HTTP services on each server have come back up cleanly.
Please submit a support ticket at theclientarea.info if you are experiencing any difficulties with your server.
]]>Complete loss of all connectivity to our entire network
Underlying cause
Our upstream transit provider’s switch failed at Reynolds House, Manchester, causing the loss of connectivity on one site feed. Simultaneously, a power breaker failed at Delta House, Manchester, rendering the redundant feed unavailable too.
Symptoms
Complete loss of all connectivity to our entire network
Resolution
A temporary alternative transit feed has been provisioned and routes are now sent via this (around 3 hops more than the previous feed). This is a semi-permanent fix whilst the switch is replaced in Reynolds House. There is expected to be planned maintenance to change the transit back to the primary feed, but nothing has been scheduled yet. The network is now operating as it should be and we apologise for the downtime. Downtime incurred was 180 minutes.
]]>Whilst all our racks still have power, the routing equipment providing connectivity does not.
Engineers are on site working on two solutions simultaneously.
We apologise for this incident and downtime thus far and will provide future updates here.
]]>At this stage, the reason for the fault is unknown
More information will follow when it is available.
]]>03/06/2012 11:00pm
Disruption
Only shared hosting customers using web node sms-jay will be affected.
Estimated Downtime
20 minutes
Actions
The hard drives from sms-akuma will be removed and inserted into a new server, whilst the original chassis is tested. After testing, sms-akuma will be moved back to the original (or new) chassis - this will be scheduled for approximately 14 days from today.
]]>sms-akuma restarted
Underlying cause
Watchdog triggered a reboot
Symptoms
Complete loss of service on sms-akuma.
Resolution
The automatic watchdog monitoring service restarted the server after detecting a non-recoverable error.
This is the second incident of this nature within 60 days - so the chassis will be taken down for fault testing and sms-akuma will be migrated to another physical server.
This maintenance window is scheduled for 03/06/2012 11:00pm to minimise disruption. Downtime should be a maximum of 20 minutes.
]]>sms-akuma restarted
Underlying cause
Watchdog triggered a reboot
Symptoms
Complete loss of service on sms-akuma.
Resolution
The automatic watchdog monitoring service restarted the server after detecting a non-recoverable error. After an automatic fsck, the system came back up successfully following 36 minutes of downtime.
]]>sms-sagat unresponsive
Underlying cause
Memory page fault caused a kernel panic
Symptoms
Complete loss of service on sms-sagat
Resolution
Continual memory tests are running on the system, but so far have completed without error. It is assumed it was a software fault (not hardware).
The RAID array is also degraded and being re-built, so performance is limited.
–
Follow Up
A SMART test was run on all drives and one drive reported bad sectors. As a result, this drive has been removed and replaced and the RAID array is rebuilding. An off-line snapshot has been taken of the system whilst the RAID array is degraded.
]]>Within the last 15 minutes, we have received several Pingdom notifications reporting connectivity dropping and immediately coming back up. However, this does not correspond with our own monitoring reports.
Both Pingdom’s monitoring service and Pingdom’s FPT are showing strange results - however, other 3rd party services are reporting no issues.
At the moment, we are investigating what is going on, but it looks to be an issue with Pingdom rather than our connectivity. Enquiries are under way.
]]>DDOS attack to our transit provider’s network
Underlying cause
External high volume attack from multiple sources targeting a customer subnet
Symptoms
Intermittent loss of service on multiple subnets
Resolution
From the information gathered so far, the evidence points to a single attack on one customer.
The team are still looking through logs and progressing the incident with the relevant authorities, and further measures are currently being invoked to reduce such attacks in future.
]]>A formal investigation is under way at present, however, we have been assured our own connectivity should not be affected any more.
We would like to apologise for the outage last night, which spanned 11 minutes in total, but we hope our proactive response to the situation and the clarity of information throughout was of some benefit to concerned customers.
We are currently discussing means to prevent this happening again, however, as the attack was not directed at subnets within our own network, it will still be hard to mitigate.
For reliability and performance, we hand off BGP to our upstream provider who uses multiple peers and handles external (internet) routes on our behalf - however, this was our downfall, as when another customer of theirs fell victim to a DDoS attack, it saturated the common transit uplinks affecting the entire data centre.
We are not in doubt of our current peers/transit providers, as they have served us well, with 3 years of 100% network connectivity, and we have full faith in their ability to deal with future issues.
]]>Engineers are still working on a resolution and to identify the root issue - but at present we are awaiting updates.
What we know
The issue is outside of Sonassi Hosting’s network; our transit provider is experiencing difficulties at the data centre which is something that we cannot remedy. They have engineers on site working on a fix.
We still have 100% power and 100% cooling, and our internal network (from the edge in) is 100% functional; however, outbound/inbound national routes are flapping.
First ever significant outage
This is our first ever significant outage in 3 years of operations and certainly not what our clients are accustomed to.
We would like to reassure all customers that we will remain available on here and Twitter (@sonassi @sonassihosting) if you want to talk to us directly.
]]>Another update will be given within the hour.
]]>Access will be limited over the next 1hr.
This will not affect any dedicated or shared hosting customers.
]]>If you need any assistance with managing your account, just email the team at support@sonassihosting.com and we’ll take care of all your requests.
]]>Resolved
Services are back up and running now, apologies for the slight glitch and 1 minute downtime.
]]>Downtime may be incurred, but should be minimal.
Resolved
A fault occurred during a routine configuration change and the switch took longer to reboot than expected.
]]>Resolved
A service did not auto-restart as it should have done and required manual intervention. All services on sms-sagat are up and running again.
]]>You can find it at http://status.sonassihosting.com/
]]>