12:09 We believe we have identified a fix for this issue and are liaising with the engineers on site to bring services back online. We will keep you updated.
12:18 Connectivity has now been reinstated and we are monitoring the issue further before closing this outage off.
12:29 Connectivity is stable and operational. We will now set this incident back to operational and continue a full root-cause investigation with our senior networking resources. We apologise for the interruption to your service.
]]>08:14 - Our network engineers are working alongside our onsite teams and hope to have full connectivity reinstated shortly.
08:16 - Connectivity has been reinstated. We are still investigating root cause and checking full resiliency is back in place.
11:15 - Connectivity has remained stable but full resiliency is yet to be reinstated. There are planned emergency works to replace a switch this afternoon to fully reinstate resiliency. More information and communications will be forthcoming once a plan is finalised and booked in with our engineers.
15:45 - In order to fully reinstate resiliency for those impacted by this morning's outage, we will be commencing a switch replacement shortly. During this window all services impacted are deemed "at risk", although we do not expect an impact to services. This is an emergency change which is essential to bring full resiliency back online. We will update this status when the work has been completed.
17:30 - EMERGENCY MAINTENANCE COMPLETED - Our team has completed the emergency maintenance and added resiliency back to the impacted part of our network. The entire platform is no longer deemed "at risk". We will make no further changes tonight and our team do not expect any further impact to your service relating to this issue. We will be in touch with more information once the full investigation begins tomorrow.
]]>21:48 - We are currently looking into a major outage which looks to be a network related incident. Our escalation teams are working on this and we will provide further updates as soon as possible.
22:20 - We have identified the network outage and are liaising with third parties and our own networking teams to mitigate the impact as soon as possible.
00:09 - Our senior networking team are working on a resolution and currently expect to have a further update in one hour's time.
01:30 - We are still awaiting a resolution to this network issue. Third parties and our own teams are investigating and making progress in resolving the fault. There will be further updates to follow.
]]>We're aware of a disruption to our services and our team are currently investigating.
Updates will follow as soon as possible.
20/6/23 11:00 BST
Actions
Our network team will be carrying out maintenance on some of the routes within the Sonassi network.
Impact
No downtime or disruption to service is expected.
]]>We're aware of a disruption to our services and our network team are currently investigating.
Updates will follow as soon as possible.
Update (11:47 GMT): The issue has been stable since 11:15. Engineers are continuing to monitor.
]]>Actions
As per notifications sent in July, we will be adjusting the default TLS Ciphers defined on all stack load balancers.
The change removes the usage of RSA and weaker AES-CBC ciphers in order to meet security and PCI requirements.
Impact
The change will mean that the following clients and operating systems will be unable to communicate via HTTPS with the server:
If this isn't possible, then please get in touch and we can ensure that these ciphers remain enabled on your stack. However, please be aware that you will need sufficient justification for the usage of weaker ciphers in order to pass PCI scans.
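As a rough compatibility check ahead of the change, you can attempt a TLS handshake restricted to a stronger (non-RSA, non-CBC) cipher set from the client in question; the hostname below is only a placeholder for your own store's domain and the cipher string is illustrative rather than the exact list retained on the load balancers:
openssl s_client -connect www.example.com:443 -cipher 'ECDHE+AESGCM'
If the handshake completes, that client can negotiate the stronger ciphers; if it fails, the client would be affected by the change and should be upgraded.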
]]>Disruption
We need to perform some emergency maintenance on the backend systems that power my.sonassi.com and its management functionality.
Viewing credentials or stack management functionality (such as PHP version changes or issuing VPN bundles) will be unavailable during the maintenance period.
Estimated Downtime
Between 30 minutes - 1 hour.
Actions
my.sonassi.com will continue to be available for support requests during this time.
15:30 - Work has been completed.
]]>Update (01:23): Service has now been restored and customers can now submit support requests.
Update (00:10): We are currently aware of an issue with my.sonassi.com, so customers will not be able to raise support requests. We are looking into this and will update as soon as possible.
]]>Disruption
Replacement hardware - Network interruption
Estimated Downtime
10 Minutes - 1 Hour of expected downtime depending on location.
Actions
We're replacing certain components of our core network that we suspect have been causing issues with the Sonassi network over the past few months.
]]>An incident has occurred affecting a small number of customers within the datacentre. Our network team are looking into the issue now and we'll update this page once we have some further information.
Update (18:39): The issue has been located and fixed by our networking team. Further information will be communicated to customers affected.
Disruption
Potential total loss of network connectivity for all customer infrastructure.
Estimated Downtime
Up to 30 minutes.
Actions
An issue has been detected with a number of pieces of networking equipment. During the maintenance window these switches will be replaced with different hardware.
Updates
Disruption
Unavailability of support portal and stack control panel. This will not affect customer services/infrastructure.
Estimated Downtime
Up to 30 minutes.
Actions
The support portal platform hardware is being upgraded during which time services will be unavailable.
Updates
Update (14:39): We're investigating an issue at one of our datacentres
Update (16:31): We can confirm all services should now be back online and further investigation on the issue will be carried out.
Disruption
Migration of network edge routers.
Estimated Downtime
Up to 15 minutes.
Actions
We will be migrating our network edge routing infrastructure from Equinix MA3 into our own LDEX2 facility. During this window there will be disruption to edge routing whilst routing re-converges in the new location.
Disruption
To all stacks in our main facility
Estimated Downtime
Up to 180 minutes.
Actions
Scheduled upgrades to customers' stacks are taking place to improve networking performance and reliability. You will have received an email specifying the date on which your stack(s) will be upgraded.
]]>Disruption
"This will disrupt Autoscaling, Overflow servers, IPSEC VPN tunnels, and some my.sonassi.com portal functionality."
Estimated Downtime
Up to 90 minutes.
Actions
Upgrade of some critical infrastructure
]]>Disruption
Replacement of core routers.
Estimated Downtime
Up to 5 minutes.
Actions
Core routers are being upgraded to newer hardware. This should be a seamless zero-downtime migration between existing and new devices.
]]>Disruption
Maintenance on my.sonassi.com control panel.
Estimated Downtime
Up to 120 minutes.
Actions
Scheduled upgrades to my.sonassi.com are taking place to improve performance and reliability. This is the second part of a two-part maintenance programme for the upgrade.
]]>Disruption
Maintenance on my.sonassi.com control panel.
Estimated Downtime
Up to 60 minutes.
Actions
Scheduled upgrades to my.sonassi.com are taking place to improve performance and reliability. This is the first part of a two-part maintenance programme for the upgrade.
Updates
Disruption
Failover testing of core network devices.
Estimated Downtime
Up to 5 minutes.
Actions
Ahead of the upcoming data centre migration, testing of all core networking equipment is being performed to ensure that failover operates correctly and downtime is kept to a minimum during the procedure.
Updates
Estimated Downtime
A short network blip is expected during failover.
Disruption
As with any maintenance work there is an increased risk during the maintenance window as network services will be operating on reduced paths.
]]>Estimated Downtime
Up to 1 hour.
Disruption
Upgrades to internal infrastructure which will disrupt overflow and autoscaling services during the times specified. Other services should remain unaffected.
]]>Disruption
The following services will be unavailable while the work is performed:
A support telephone number will be provided as an alternative to report any emergency issues while the work is being performed.
Estimated Downtime
15 minutes.
Actions
Our infrastructure team will be performing work to upgrade the platform that hosts the my.sonassi.com ticketing and ordering system.
]]>Disruption
Network equipment investigation/replacement per incident 160. There will be brief periods of network instability during this window.
Estimated Downtime
Up to 4 hours.
Actions
Network infrastructure will be monitored whilst configuration is updated to replicate conditions of the previous incident and implement a resolution.
]]>Disruption
Support portal platform upgrade per incident 160. The support portal will be entirely unavailable during this period.
Estimated Downtime
Up to 60 minutes.
Actions
The underlying hardware platform for the support portal is being replaced with newer/larger infrastructure. This will result in improved performance of the portal as a whole as well as offering improved reliability.
Updates
149.86.96.0/24 range.
Disruption
Core router upgrade programme, per incident 153. All customers services will be affected during this maintenance window whilst the switchover occurs.
Estimated Downtime
Up to 25 minutes.
Actions
Our core routers are being replaced in UK-1/MA3. These devices are responsible for all routing in the UK and as such, all services will be briefly impacted during the necessary maintenance window. The procedure will include powering off the secondary router, installing the replacement - then briefly suspending traffic flows to redirect over the new router. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
Updates
Disruption
Aggregation switch upgrade programme, per incident 153. Several customers services will be affected during this maintenance window during the switchover.
Estimated Downtime
Up to 15 minutes.
Actions
We will be upgrading a number of aggregation switches across UK-1/MA3. The procedure will include powering off the secondary switch, installing the replacement - then briefly suspending traffic flows to redirect over the new switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
Status
Do not raise an emergency ticket if you use CloudFlare. Customers are advised to view https://www.cloudflarestatus.com/ and disable CloudFlare in the meantime.
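A quick way to check whether a domain is still being served via CloudFlare after disabling it (the hostname below is a placeholder) is to look at what it currently resolves to:
dig +short www.example.com
If the addresses returned still belong to CloudFlare rather than your stack's own IPs, the DNS change has not yet taken effect or propagated.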
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for the entire MA3 facility and a single isolated network segment.
Outage Length
The duration was between 60 and 180 minutes.
Underlying cause
Unfortunately, this was a repeat incident of a similar nature to https://status.sonassi.com/incident/149/
We believe a malfunctioning aggregation switch as part of the backbone of the network core began sending out malformed/erroneous L2 packets, driving up CPU utilisation on other routers and switches. Control plane traffic was disturbed and multi-chassis aggregated links degraded, resulting in loss of downstream connectivity to rack pods and subsequent customer stacks.
Symptoms differing from the last incident led diagnosis down an incorrect initial path to resolution, extending resolution times and resulting in the isolation of an entire network segment (a single rack pod).
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
A repeat incident of a malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; permanently powering off the device (with subsequent replacement due) resolved the underlying issue.
Prevention
The network architecture and equipment in use is more than adequate, offering extreme levels of availability, with multiple layers of redundancy designed into every tier of the network stack. However, a failed/failing device has caused significant network disruption even though the device had not explicitly failed or demonstrated any signs of error or malfunction - other than the successful operation of the network in its absence.
Whilst rare, the eventual degradation of the switch silicon is believed to be the cause, and the device (and its paired device) will be replaced with latest-generation hardware.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for the entire MA3 facility.
Outage Length
The duration was between 55 and 95 minutes.
Underlying cause
A surge in CPU load on both core routers caused disruption to control plane traffic, although forwarding and routing were operating without issue.
Customer stacks feature two network interfaces, attached to two diverse switching and routing networks; the active interface is selected by performing a reachability check (using ARP) to its respective gateway. As the core routers were not responding to control plane traffic, ARP requests were being dropped, resulting in stacks taking both primary and secondary interfaces offline; ultimately severing all connectivity.
HA stacks were further affected by this issue, where the shut-down interfaces left the members of the cluster unable to reach each other, resulting in a "split brain" scenario. Even when connectivity was restored, manual intervention was required to address the split brain.
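As a rough illustration of the reachability check described above (the interface name and gateway address below are placeholders rather than the actual MageStack configuration), a gateway ARP probe on Linux looks like:
arping -c 3 -I bond0 192.0.2.1
When no replies are received, the stack treats that path as unavailable - which is why unanswered ARP during this incident caused both the primary and secondary interfaces to be taken offline.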
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
A malfunctioning aggregation switch appeared to be the source of increased CPU load throughout the network; rebooting the affected switch was sufficient to allow CPU loads to drop and for stacks to "bring up" their network interfaces after having a successful ARP check.
Prevention
We have identified several areas in which improvements can be made,
In addition to the above, deploying newer generation aggregation switches and core routers will go a long way towards addressing control plane capacity. Substantial investment in extremely high-capacity, latest-generation hardware has already been made and the timeline for replacement of existing hardware will be brought forwards.
]]>13:30 10/06/2019
Disruption
The control panel and support portal at my.sonassi.com will be briefly inaccessible.
Estimated Downtime
5 minutes.
Actions
Following an earlier incident, maintenance is taking place in order to replace failed devices that the support portal system is currently connected to.
]]>Update (15:40): MailChimp is suffering a global outage. Disable MailChimp immediately on your store if you are experiencing downtime.
Update (15:41): Disable the module using MageRun
mr_examplecom config:set 'mailchimp/ecommerce/active' 0
Update (17:14): MailChimp's service status is healthy again, but it is recommended to remove/disable the MailChimp module from your store due to inherent reliance on MailChimp's API.
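If you disable the module via MageRun as shown above, note that the change will not take effect on a cached store until the configuration cache is refreshed; using the same example alias (mr_examplecom is the placeholder used above - substitute your own store's alias), that would be:
mr_examplecom cache:flush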
Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity, high load and periods of unavailability for a single rack of equipment.
Outage Length
The duration was up to 18 minutes.
Underlying cause
A Power Supply Unit (PSU) in another server within the same rack, connected to the B power feed, catastrophically failed. The catastrophic failure, unlike that of a typical PSU failure, caused a spike/surge in electrical activity, which triggered the Moulded Case Circuit Breaker (MCCB) for the B power feed to trip and cut off power.
On restoration of power to the B feed, the access switch booted into a “fail-safe” environment, in which the configuration was not fully restored. This led to the stack detecting that the physical link was back up and beginning to route traffic over the B network; however, the traffic was not being routed correctly.
Symptoms
Our facilities monitoring and service monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
Network engineers immediately restored the running configuration of the affected access switches, and services gradually returned to normal. The disruption in network flow and packet loss caused delays in service restoration on the multi-server stack, as each stack member was unable to cleanly communicate with its peers. This in turn caused load averages to increase, resulting in delayed service restoration even after the network connectivity was fully operational.
Prevention
Monitoring, disaster recovery procedures and staff action all worked exactly as anticipated, rapidly identifying and resolving the issue and keeping downtime to a minimum.
This was an exceptional situation and we believe there is nothing that could have been done differently to reduce or avoid the downtime.
]]>30/06/2018 - ~
Disruption
None.
Estimated Downtime
Eternity.
Actions
TLS v1.0 is permanently disabled in line with the deadline for 30/06/2018. See https://www.sonassi.com/blog/pci-dss-v3-and-tls-v1-0 for more information.
]]>We've identified a problem with a third party service, MailChimp.
If your store is experiencing issues and using MailChimp, please disable the module prior to contacting support
]]>10:15-10:30 07/06/2018
Disruption
Scheduled upgrades for my.sonassi.com
Estimated Downtime
15 minutes
Actions
Customers are encouraged to call for support during this planned maintenance window.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs.
Outage Length
The duration was 6 minutes.
Underlying cause
The outage experienced was due to one of our transit providers suffering an unexpected reboot of a router in Manchester. High CPU was noted at the time of the reboot, which was responsible for dropped packets prior to the reboot.
Symptoms
Our monitoring probes immediately reported the packet loss, which affected approximately 30% of our total inbound traffic.
Resolution
Our network operations team immediately shut down the connectivity to the affected transit provider and re-routed traffic around them. This restored full connectivity within seconds. The affected transit provider will remain "shut down" until we have seen consistent healthy performance, after which it will be added back to our transit pool.
Sonassi maintains connectivity from multiple independent transit providers to provide internet connectivity resilience. In this instance, a single provider failed, resulting in some traffic being briefly dropped prior to re-routing.
]]>Disruption (13/12/2017)
Disruption (14/12/2017)
We are currently investigating a possible network incident.
09:00-10:00 14/06/2017
Disruption
Scheduled upgrades for my.sonassi.com
Estimated Downtime
1 hour
Actions
This is a non-service affecting upgrade of our support portal at my.sonassi.com, these important feature updates will bring new functionality and control for your account. Critical support will be available via email and telephone, where details will be provided on my.sonassi.com during the maintenance.
]]>We are currently investigating a possible network incident.
We are currently investigating a possible network incident.
Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs.
Outage Length
The duration was 9 minutes.
Underlying cause
A high-volume DOS attack targeted a single customer and was large enough to saturate the connectivity of one of our transit providers.
Symptoms
Our monitoring probes immediately reported the attack. Despite the attack being targeted at a single customer, the volume affected all our customers causing high levels of packet loss.
Resolution
Initially, without full information available, we interpreted the packet loss as an issue with a single transit provider and shut down our connectivity to said provider. At that point, traffic re-routed to our other (larger capacity) providers and the issue looked to be resolved as the larger capacity transit "absorbed" the attack. Moments later, our Level3 dDoS mitigation platform automatically activated and began scrubbing the malicious traffic and we restored connectivity to the original transit provider.
From start of attack to mitigation - the total time was 9 minutes. Our dDoS mitigation platform is a relatively new addition to the Sonassi network to offer an unprecedented level of protection to customers - and we are extremely happy that yet another large volume DOS attack was mitigated with only minimal disruption prior to activation.
]]>Start Date / Time: 23/11/16 16:00
Finish Date / Time: 23/11/16 18:00
Disruption
Access switch replacement, internal network capacity increase.
Estimated Downtime
Up to 15 minutes
Actions
We will be upgrading the access switches within each affected rack. The procedure will include failing traffic over to the backup switch, replacing the primary switch, restoring normal traffic flows, then replacing the backup switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
]]>Start Date / Time: 22/11/16 16:00
Finish Date / Time: 22/11/16 18:00
Disruption
Access switch replacement, internal network capacity increase.
Estimated Downtime
Up to 15 minutes
Actions
We will be upgrading the access switches within each affected rack. The procedure will include failing traffic over to the backup switch, replacing the primary switch, restoring normal traffic flows, then replacing the backup switch. Downtime should be kept to an absolute minimum through this process unless an unplanned incident occurs.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Very brief loss of connectivity affecting around 20% of total traffic.
Outage Length
The duration was <4 minutes.
Underlying cause
An upstream provider suffered a router reboot, which caused traffic to be dropped both in/out of our network, for the small volume of traffic that passes via that provider.
Symptoms
Our monitoring probes immediately reported the incident. Customers would have experienced slow page load times through to a completely inaccessible site.
Resolution
The issue was resolved by temporarily dropping our connection to the respective provider and re-routing traffic over our other transit providers. The issue immediately subsided and normal traffic flows resumed. The upstream provider has since restored a previous configuration on their device and has it running stably again; we have since restored our connection to them and are operating at 100% capacity.
]]>PayPal's API domain is suffering a DNS outage and is not resolving correctly (more information)
Start Date / Time: 13/09/16 21:00
Finish Date / Time: 14/09/16 06:00
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>Start Date / Time: 12/09/16 21:00
Finish Date / Time: 13/09/16 06:00
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs to our legacy IP ranges.
Outage Length
The duration was 9 minutes.
Underlying cause
A large-volume DOS attack (300Gb per second / 200 million packets per second) targeted a single customer.
Symptoms
Our monitoring probes immediately reported the attack. Despite the attack being targeted at a single customer, the volume affected all our customers causing high levels of packet loss.
Resolution
The issue was resolved by the automatic activation of our Level3 DOS mitigation platform. From start of attack to mitigation - the total time was 9 minutes. Our dDoS mitigation platform is a new addition to the Sonassi network to offer an unprecedented level of protection to customers - and we are extremely happy that a DOS attack of such significant volume was mitigated so successfully.
]]>If you are a BT customer and cannot access your server, please contact BT.
]]>19/07/2016 23:00 - 20/07/2016 06:00 BST and 21/07/2016 00:01 - 21/07/2016 06:00 BST
Disruption
Upstream transit provider maintenance.
Estimated Downtime
Up to 15 minutes
Actions
An upstream transit provider will be undertaking scheduled maintenance as part of their on-going network enhancement programme. This may affect our support portal, but should have limited impact on customers' traffic.
]]>16/04/2016 14:00 - 17/04/2016 00:00 GMT and 17/04/2016 14:00 - 18/04/2016 00:00 GMT
Disruption
Core network capacity upgrade.
Estimated Downtime
0-5 minutes
Actions
We are increasing the capacity of our core and aggregation network to improve performance and scalability for our customers. This will involve deploying new network switches and replacing previous devices one at a time. It should largely be a downtime free operation; there may be small windows of packet loss during failover between our A and B switching networks.
April 2016
Disruption
A MageStack security update is being deployed; this important security update may cause 502 errors to be displayed briefly on your store.
Estimated Downtime
1-5 minutes
Actions
The automated update is being monitored by our team and you will be notified at the start and finish of the works.
]]>25/02/2016 20:30 - 25/02/2016 21:30 GMT
Disruption
Hardware upgrade for support portal server.
Estimated Downtime
30-60 minutes
Actions
We are upgrading the hardware used for our support portal.
Updates
18/02/2016 18:00 - 18/02/2016 19:30 GMT
Disruption
Hardware upgrade for support portal server.
Estimated Downtime
5-10 minutes
Actions
We are upgrading the hardware used for our support portal.
]]>13/02/2016 23:00 - 14/02/2016 01:00 GMT
Disruption
Firmware upgrade on border routers
Estimated Downtime
5-10 minutes
Actions
Each core router will be upgraded in turn to their latest firmware release. Failover should occur between the devices, resulting in a brief period of downtime; however, we would like to allow for a window of up to 10 minutes of possible downtime.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Loss of connectivity from some ISPs to our legacy IP ranges.
Outage Length
The duration was 30 seconds.
Underlying cause
The transit provider carrying our legacy IP range inbound traffic experienced an interface flap at LINX, triggering a re-route and re-convergence of routing.
Symptoms
Our external monitoring probes immediately reported the fault. A very small number of users would have been unable to access their servers.
Resolution
The issue resolved itself by automatically selecting another carrier (Level3) when the connection at LINX dropped. The cause of the downtime was merely the delay of end-user ISP route re-convergence.
All customers are already being migrated from our legacy IP range as part of our 2015 IP migration, giving us full control of all customer traffic, both inbound and outbound.
]]>Magento has released a new security patch for versions 1.4 and newer, SUPEE-6482
The vulnerabilities
This bundle includes protection against the following security-related issues:
What you need to do
You must apply this new security patch as soon as possible. It can be downloaded from https://www.magentocommerce.com/download
You can either patch the store yourself using the instructions below, or submit a (chargeable) maintenance support ticket at https://www.theclientarea.info where our support team can apply the patch on your behalf (est. 5-60 mins application time).
More information
Read more about the patch here, http://us5.campaign-archive1.com/?u=34ff0d4b547cfa0a6a6901212&id=90740291cb
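For reference, the official SUPEE patches are distributed as shell scripts that are run from the Magento root directory. A typical application looks like the following, although the exact filename depends on your Magento edition and version and is only illustrative here:
sh PATCH_SUPEE-6482_CE_1.x_v1.sh
After applying the patch, flush the Magento cache and confirm the storefront and admin still behave as expected.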
]]>Estimated Downtime
Periodic loss of connectivity for short periods
Actions
One of our transit providers is performing maintenance on their core network as part of their continued service enhancement programme.
]]>Magento has released a new security patch for versions 1.6 and newer, SUPEE-6285
The vulnerabilities
This bundle includes protection against the following security-related issues:
What you need to do
You must apply this new security patch as soon as possible. It can be downloaded from https://www.magentocommerce.com/download
You can either patch the store yourself using the instructions below, or submit a (chargeable) maintenance support ticket at https://www.theclientarea.info where our support team can apply the patch on your behalf (est. 5-10 mins application time).
More information
Read more about the patch here, http://us5.campaign-archive1.com/?u=34ff0d4b547cfa0a6a6901212&id=d47fcf1c6d
]]>Post-Mortem
Our report from the incident is as follows.
Issue
A very small number of customers reported connectivity issues; these were unreproducible and unconfirmed by our network team.
Outage Length
No outage.
Underlying cause
We collected several traceroutes from customers, observing both the forward and reverse path to ascertain what commonality may have existed. However, no single cause could be identified.
Symptoms
Customers reported slow page load times and general difficulty connecting to their stores.
Resolution
No action was taken by our team. We had 5 isolated reports from customers, which led us to create a network alert in case of a network-wide event. It is our policy that after 5 isolated reports, we put out an unconfirmed notification whilst we investigate.
As we were unable to identify any fault, the issue can only be attributed to a wider, unknown internet congestion issue.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total packet loss; some customers' servers were completely inaccessible.
Outage Length
The duration was 15 minutes.
Underlying cause
Continual diagnosis of our network core is taking place by the vendor in an effort to identify and resolve the outstanding issues we are experiencing.
This diagnosis involves gathering information from the switches and in some cases, making minor adjustments. A configuration change was made to the network edge filtering that left a window open for attack.
The increased traffic flow targeting the routing engine led to increased CPU utilisation and a subsequent restart of the packet forwarding process (under current network conditions, this can take up to 10 minutes to recover).
Symptoms
Access to servers behind the affected subnets was impossible.
Resolution
The backup switching/routing network was manually restored to resume connectivity.
After approximately 7 minutes, the primary network was restored to full health and traffic was successfully, cleanly failed back.
The current network condition is very healthy, with full firewalling and full automated availability of routing and switching. Through continued efforts from the vendor and our own team, the historic issues we have experienced should now be considered resolved.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
A very small number of IP subnets encountered a routing loop and some customers' servers were inaccessible.
Outage Length
The duration was 23 minutes.
Underlying cause
The affected IP ranges are those we carry as a legacy from a historic provider - they do not form part of our multihomed BGP network and as such are subject to possible outages should the transit provider supplying them encounter problems.
The routing loop occurred because of a configuration change on the upstream provider's network.
Symptoms
Access to servers behind the affected subnets was impossible, a routing loop was visible on a traceroute.
Resolution
The provider reverted the change and service was immediately restored.
The long-term plan is to renumber all these IP addresses into our own, so that they can be announced over our resilient, multihomed BGP network. This task is already underway and customers will be contacted soon to arrange for changeover dates.
]]>Estimated Downtime
Periodic loss of connectivity for <10 minutes
Actions
Software updates are to be applied to each switch within our network. This will require powering down each device in turn to update the firmware.
This will result in a small amount of downtime which may affect a small number of customers as each access switch is powered down for update.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Significant packet loss, affecting all servers at our Joule House location, causing a total service loss to all servers.
Outage Length
The duration was 97 minutes.
Underlying cause
Currently under investigation with Juniper TAC to identify and isolate the issue. It appears to be a repeat incident whereby flooding within a single access switch caused significant control plane CPU consumption within other network devices.
Symptoms
Our external monitoring probes immediately reported the fault. End users will have noticed the issue as it had an effect on overall service.
Resolution
Whilst the symptoms have been resolved, we believe the underlying issue to still be present and the result of a firmware bug. The issue has been elevated to the vendor for in-depth analysis and urgent review.
Our network and transit team is continuing to investigate, replicate and resolve this issue in an isolated environment in parallel with Juniper TAC's efforts, so that a swift resolution can be reached.
The network status will remain as high until we are confident of a permanent resolution.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Significant packet loss, causing over 50% of packets to be dropped to a single rack of equipment and a secondary symptomatic effect of <10% loss within the network core. This had a significant effect on servers at Joule House, causing a total service loss to the servers connected to the respective access switch stack.
Outage Length
The duration was 63 minutes.
Underlying cause
Flooding within a single access switch, causing significant control plane CPU consumption within other network devices.
Symptoms
Our external monitoring probes immediately reported the fault. End users will have noticed the issue as it had an effect on overall service.
Resolution
The switch generating the traffic was observed to be consuming 100% CPU, it was initially power cycled in the hope that the device would become responsive again. Unfortunately, the issue propagated to the remaining 5 switches within the stack (in a single rack), generating further problems.
To avoid major network disruption for the entire location, all access switches were powered off simultaneously, then powered back up one at a time. This restored service and resolved the issue at hand.
Technical reports will be submitted to the vendor for analysis, however, with upcoming access-layer networking upgrades, it is unlikely this will be pursued further.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 4 hours long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Estimated Downtime
Indefinitely
Actions
It has now reached 6 months from the initial notification that our legacy shared hosting offering would be decommissioned.
As shared hosting is no longer part of our offering at Sonassi, the servers will be powered down as of 12:00 indefinitely.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 1 hour long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Estimated Downtime
Periodic loss of connectivity for <5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a brief loss of connectivity for some customers. The maintenance window is 4 hours long, so it is possible that there may be a number of small outages, each less than a minute.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Minor intermittent packet loss, causing some packets to be dropped. This had negligible effect on servers at Joule House; the effects may not even have been noticeable to end users.
Outage Length
The intermittent packet loss duration was 2 minutes.
Underlying cause
The load on a core router suddenly increased with no known cause.
Symptoms
Our external monitoring probes immediately reported the fault. End users may not have noticed the issue as it had near negligible effect on overall service.
Resolution
As the router was non-responsive to input, it was deemed necessary to restart the device. Seamless failover completed to the other router whilst the device was powered down. There was no loss of service during failover.
The router was powered back on, underwent a consistency check and was added back into the routing pool.
]]>Estimated Downtime
<0 minutes
Actions
Some minor software updates are due. An automated update will take place on your server during the maintenance window, you should not notice any outages during this period.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Minor network outage
Outage Length
3 seconds
Underlying cause
One of our transit providers (Cogent) experienced a router failure within their network. Increasing CPU usage on their core router caused packets to be progressively dropped.
Symptoms
Our external monitoring probes immediately reported the fault. Some customers (whose traffic was routed over Cogent), experienced an extremely brief window (<1 minute) of slow page load times or server inaccessibility.
Resolution
Once the packet loss threshold was hit, our internal BGP latency and packet loss measuring device automatically de-preferenced Cogent from the available BGP routes. Once Cogent was removed, traffic continued to flow out over our remaining carriers as normal.
Convergence took <5 seconds, but propagation at other ISPs may have taken a couple of minutes, which is why some customers may have experienced a slightly longer outage.
Our automated systems and monitoring systems behaved exactly as designed for this disaster scenario and recovered the carrier failure in less than 5 seconds.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
23 hours and 3 minutes.
Underlying cause
A HDD developed bad sectors which caused the RAID array to fail. Counter-intuitively, the healthy drive was forced out of the array due to the faulty drive being able to replicate its data to it.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
Resolution
We immediately removed the drives from the server and installed them into a brand new server, to rule out power supply, motherboard or backplane failure.
We attempted to rebuild the RAID array in order to remove the failed drive, however this proved futile.
Instead, we were forced to create a new single disk RAID array, then recover the data from the failing disk to this one. We did encounter issues with the MySQL databases and this posed a concern that there may have been data loss. However, indications show data was retrieved without error. Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage.
The likelihood of a brand new drive failing within 10 days of 2 other drives failing is unprecedented - an almost impossible possibility. We have not completed testing the old chassis, but at this point we can only assume a power-related issue that could have been overvolting the drives and causing drive failure.
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Total service outage on sms-sagat
Outage Length
27 hours and 30 minutes.
Underlying cause
A HDD completely failed in the server leaving the RAID array degraded. However, the remaining drive immediately then began showing signs it was failing, resulting in the server mounting the filesystem as read-only as a precautionary measure against data corruption.
Symptoms
Our internal and external monitoring probes did not report a fault. The machine was still responding to ICMP and HTTP tests, and some customer websites appeared to be browsable, but the read-only file system meant that they were effectively unusable.
Resolution
We immediately removed the failed drive and replaced it. We attempted several times to check the filesystem health and copy the data from the failing drive to the new RAID array, eventually resulting in 100% customer data retrieval.
Retrieving the data took a considerable amount of time and was the primary cause of such an extended outage. As we do not operate off-site backups on shared hosting, this was absolutely necessary to recover customer data.
A secondary reason for the downtime duration was that our monitoring platform failed to identify an issue, given the machine was responding to healthcheck tests in a normal fashion.
The servers automatically run daily and weekly HDD self-tests, but no early warning signs had been issued, so we had no indication this was about to happen.
As a result of this incident, we will
Estimated Downtime
<5 minutes
Actions
One of our carriers is undertaking maintenance, which will result in a BGP feed being taken offline. Routes should gracefully failover to an alternate carrier during their maintenance window.
Post-Mortem
Our report from the incident is as follows.
Issue
Partial packet loss affecting global connectivity.
Outage Length
5 minutes.
Underlying cause
One of our carriers saw a large burst of traffic within their network resulting in a loss of connectivity.
Symptoms
Our internal and external monitoring probes immediately reported a fault.
Resolution
No active steps needed to be taken. Connectivity gracefully failed over to another carrier. A manual de-preference of that carrier was added to prevent its use until connectivity continuity was restored.
]]>Estimated Downtime
20 minutes
Actions
Since the restart, I/O wait has been higher than expected on sms-jay. We feel that a kernel update would be a wise manoeuvre given the age of the current release.
Estimated Downtime
None
Actions
One of our carriers is undertaking maintenance, which will result in a BGP feed being taken offline. Routes should gracefully failover to an alternate carrier during their maintenance window.
Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting worldwide connectivity.
Outage Length
25 minutes
Underlying cause
We had been made aware by one of our carriers that maintenance would be conducted between 12am and 6am; we had made provisions for this and were prepared for systems to automatically switch over during the outage on one carrier.
However, the failover did not occur as planned, as it appeared our other carrier was also affected.
Symptoms
Our internal and external monitoring probes immediately reported a fault.
Resolution
Despite our efforts, the BGP sessions could not be re-established. As a last resort, both routers were consecutively rebooted by an on-site technician. This re-established the BGP sessions and connectivity was restored.
We believe there may be commonality between the carriers (shared fibre/conduits/backhauls). An investigation has been launched to see why and how both were simultaneously affected.
Services are being closely monitored and it is likely that some failover tests will be conducted throughout the week to test and guarantee against future failure.
]]>We have null routed the traffic at our network edge and expect the problem to be resolved shortly.
Estimated Downtime
Up to 240 seconds per customer
Actions
MageStack is automatically kept up to date; however, some older releases need human oversight during a manual upgrade for major point upgrades.
]]>One of our carriers has reported a fault with routing in the UK.
Update (17:54): We have failed over to another carrier in the interim whilst this is investigated. Updates to follow.
Update (18:02): The issue stems from a major fault at LINX. We’re routing around LINX right now, but other ISPs may still be sending traffic via LINX and there might still be users experiencing packet loss.
Update (18:15): Traffic flows have reached normal levels again; the issue at LINX looks to be resolved.
Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting UK ISPs.
Underlying cause
Broadcast traffic at LINX causing high CPU usage on core routers.
Symptoms
Inaccessibility of our network from certain ISPs.
Resolution
We dropped our peering session with LINX whilst the issue was diagnosed and identified. The issue at LINX has reportedly been fixed and our network team will enable peering at LINX once we are confident their routers are once again stable.
]]>Estimated Downtime
Up to 120 seconds in small bursts across the course of the day
Actions
Following some repeated firmware issues on access switches, we are systematically replacing all access switches in every rack in Joule House.
As we run a redundant switching network, we will begin by powering down the B switching network, and replacing cabling and switches. Then powering it back up. Then repeating the same process on the A switching network.
When the switching networks go down and back up (4 times in total), it will cause a spanning tree calculation, which will result in momentary downtime (<30s).
]]>Post-Mortem
Our report from the incident is as follows.
Issue
Packet loss affecting a very small number of users worldwide, resulting in slow page load times, or no response.
Underlying cause
Fibre break within Level 3 / Global Crossing’s core infrastructure. Our other carriers were unaffected, so only those ISPs peering with Level 3 saw issues.
Symptoms
Up to 100% packet loss from a limited number of national and international ISPs where traffic was coming in via Level 3.
Resolution
We temporarily stopped using L3/GBLX and prioritised our routes over Cogent until L3/GBLX had put repairs in place. We received confirmation that at 3:00am the broken fibre had been fixed, and routes began to flow back over L3/GBLX once again.
]]>Estimated Downtime
300 seconds
Actions
Shared hosting server maintenance
]]>Estimated Downtime
Up to 180 minutes
Actions
A HDD failed on sms-sagat and required the replacement of the drive. On-site engineers replaced the drive within minutes of notification.
The RAID array now needs to rebuild; while it is degraded, performance will be relatively poor.
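If the server uses Linux software RAID (an assumption here - hardware RAID controllers expose this through their own vendor tools instead), the rebuild progress can be checked with:
cat /proc/mdstat
The output shows a completion percentage and an estimated finish time for the resync while the array remains degraded.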
22/03/2013 11:00pm
Disruption
Switch over of power feeds from B only to A+B
Estimated Downtime
Up to 10 minutes
Actions
Following the ‘B’ feed power failure on 13/03/2012 - failover switches were moved to the ‘A’ feed.
As the ‘B’ feed has now been repaired, all affected servers must now be moved back to the ‘B’ feed. This means powering down the switching equipment and checking routes propagate correctly afterwards.
Downtime should be minimal during the swap - but there will be a loss of service.
Updates to follow.
19/01/2012 12:00am
Disruption
Switch over of power feeds from B only to A+B
Estimated Downtime
Up to 5 minutes
Actions
Following the ‘A’ feed power failure on 15/01/2012 - single-fed servers were manually moved to the 'B’ feed to immediately restore service.
As the 'A’ feed has now been repaired, all affected servers must now be moved back to the primary 'A’ feed. This means powering down your machine, moving its PDU back to the 'A’ feed and then powering it back up.
Downtime should be minimal during the swap - but there will be a loss of service.
04:05:00 15/01/2013
Disruption
There has been a total loss of power on a single feed at Joule House.
Updates to follow.
18/01/2013 00:00am until 18/01/2013 00:30am
Disruption
Customers will experience a potential loss of service for up to 60 seconds in any one instance, during the maintenance window of 30 minutes
Estimated Downtime
60 seconds
Actions
Core network maintenance
]]>15/11/2012 (TBC)
Disruption
Brief total loss of connectivity to our entire network
Estimated Downtime
Up to 5 minutes
Actions
Our racks at Joule House are currently fed via multiple point-to-point links back to Reynolds House. However, a fault has been identified in the interconnects at Reynolds House that affects both of these point-to-point links.
Engineers are currently putting in a 3rd link and will switch the service onto this replacement whilst both the primary and secondary links are tested and repaired.
08:53:51 15/11/2012
…
16:57:57 15/11/2012
Disruption
We have seen frequent but very brief outages (lasting between 20 seconds and 4 minutes) since 08:53 today.
This is currently under investigation with our NOC team.
Updates to follow.
Please contact us at https://sms-sagat.theclientarea.info/ if your machine is unavailable.
]]>Minor packet loss affecting a very small number of users worldwide, resulting in slow page load times, or no response
Underlying cause
Failed components within Level 3 / Global Crossing’s core infrastructure
Symptoms
<1% packet loss across the entire network
Resolution
We have temporarily stopped using L3/GBLX and are advertising our routes using Cogent until such time as L3/GBLX have put repairs in place.
]]>19/10/2012 13:45
Disruption
There is some reported packet loss from specific UK broadband providers to our network. We are currently investigating it.
15/09/2012 11:45am
Disruption
Multiple security updates are being applied to all shared hosting servers
Estimated Downtime
10 minutes
Actions
PHP/MySQL/Roundcube/PHPMyAdmin will be removed and re-installed on all shared hosting servers with the latest security patches applied.
03/11/2012 12:00am
Disruption
A works order has been raised to migrate all dedicated servers from Reynolds House and Delta House to our new facility, Joule House.
Estimated Downtime
120-240 minutes
Actions
Servers will be cleanly powered down, then transported to the new data-centre, where they will be installed and powered back up.
Some customers will experience a loss of service during the migration.
We have a 3-year lease at Joule House, so this will mark the last of any future data centre migrations.
]]>12/09/2012 22:00
Disruption
A works order has been raised to carry out Essential Maintenance Work to add additional redundancy measures to our infrastructure.
Estimated Downtime
120 minutes
Actions
Re-routing of traffic is required to perform essential maintenance work upon our infrastructure.
Customers should not experience a loss of service, but may notice traffic being re-routed during the maintenance window.
]]>19/08/2012 11:00am
Disruption
Shared hosting customers and dedicated hosting customers in both Reynolds House and Delta House will be affected.
Estimated Downtime
15 minutes
Actions
A core router has failed and the network is currently running degraded. We had intentions of upgrading the core network infrastructure over the coming months with newer (more powerful) equipment; however, due to this failure, we intend to bring the go-live date forward.
As a result, the maintenance window will include removing all previous routing hardware and replacing it with the newer, more powerful, units.
Minor packet loss affecting a very small number of dedicated servers at Delta House
Underlying cause
A minor DOS attack to a boundary shared firewall
Symptoms
<10% packet loss to some subnets
Resolution
We isolated the affected subnets and removed them from the shared firewall routes to let the core routers divert the traffic to the affected server.
]]>Complete loss of power at Delta House, Manchester
Underlying cause
The core UPS infrastructure suffered a fatal error, cutting power on the entire “A” feed and causing a data centre site-wide outage affecting over 50 racks.
Symptoms
Complete loss of power and subsequent access to any equipment in Delta House
Resolution
A Sonassi engineer was on-site within 15 minutes of the incident. Data centre electricians had diagnosed the fault to be within the UPS and promptly bypassed it, to run power via mains directly.
Our engineer manually booted and tested each server in every rack to ensure it started cleanly with all HTTP services running. All machines started without fault or issue.
Engineers from the UPS manufacturer were dispatched to the data centre and are currently implementing a repair.
]]>Power was briefly restored on generator with the UPS in bypass, then terminated when it was switched back to primary/mains.
Power should be restored, however the UPS is currently being investigated by engineers.
Update (18:53 - 01/07/2012)
All affected racks in the facility have been booted and an engineer on site has manually verified that HTTP services on each server have come back up cleanly.
Please submit a support ticket at theclientarea.info if you are experiencing any difficulties with your server.
]]>Complete loss of all connectivity to our entire network
Underlying cause
Our upstream transit provider’s switch failed at Reynolds House, Manchester, causing the loss of connectivity on one site feed. Simultaneously, a power breaker failed at Delta House, Manchester, rendering the redundant feed unavailable too.
Symptoms
Complete loss of all connectivity to our entire network
Resolution
A temporary alternative transit feed has been provisioned and routes are now sent via this (around 3 hops more than the previous feed). This is a semi-permanent fix whilst the switch is replaced in Reynolds House. There is expected to be planned maintenance to change the transit back to the primary feed, but nothing has been scheduled yet. The network is now operating as it should be and we apologise for the downtime. Downtime incurred was 180 minutes.
]]>Whilst all our racks still have power, the routing equipment providing connectivity does not.
Engineers are on site working on two solutions simultaneously.
We apologise for this incident and downtime thus far and will provide future updates here.
]]>At this stage, the reason for the fault is unknown
More information will follow when it is available.
]]>03/06/2012 11:00pm
Disruption
Only shared hosting customers using web node sms-jay will be affected.
Estimated Downtime
20 minutes
Actions
The hard drives from sms-akuma will be removed and inserted into a new server, whilst the original chassis is tested. After testing, sms-akuma will be moved back to the original (or new) chassis - this will be scheduled for approximately 14 days from today.
]]>sms-akuma restarted
Underlying cause
Watchdog triggered a reboot
Symptoms
Complete loss of service on sms-akuma.
Resolution
The automatic watchdog monitoring service restarted the server after detecting a non-recoverable error.
This is the second incident of this nature within 60 days - so the chassis will be taken down for fault testing and sms-akuma will be migrated to another physical server.
This maintenance window is scheduled for 03/06/2012 11:00pm to minimise disruption. Downtime should be a maximum of 20 minutes.
]]>sms-akuma restarted
Underlying cause
Watchdog triggered a reboot
Symptoms
Complete loss of service on sms-akuma.
Resolution
The automatic watchdog monitoring service restarted the server after detecting a non-recoverable error. After an automatic fsck, the system came back up successfully following 36 minutes of downtime.
]]>sms-sagat unresponsive
Underlying cause
Memory page fault caused a kernel panic
Symptoms
Complete loss of service on sms-sagat
Resolution
Continual memory tests are running on the system, but so far have completed without error. It is assumed it was a software fault (not hardware).
The RAID array is also degraded and being re-built, so performance is limited.
–
Follow Up
A SMART test was run on all drives and one drive reported bad sectors. As a result, this drive has been removed and replaced and the RAID array is rebuilding. An off-line snapshot has been taken of the system whilst the RAID array is degraded.
]]>Within the last 15 minutes, we have received several Pingdom notifications reporting connectivity dropping and immediately coming back up. However, this does not correspond with our own monitoring reports.
Both Pingdom’s monitoring service and Pingdom’s FPT are showing strange results - however, other 3rd party services are reporting no issues.
At the moment, we are investigating what is going on, but it looks to be an issue with Pingdom rather than our connectivity. Enquiries are under way.
]]>DDOS attack to our transit provider’s network
Underlying cause
External high volume attack from multiple sources targeting a customer subnet
Symptoms
Intermittent loss of service on multiple subnets
Resolution
From the information gathered so far, the evidence points to a single attack on one customer.
The team are still looking through logs and progressing the incident with the relevant authorities, and further measures are currently being invoked to reduce such attacks in future.
]]>A formal investigation is under way at present, however, we have been assured our own connectivity should not be affected any more.
We would like to apologise for the outage last night, which spanned 11 minutes in total, but we hope our proactive response to the situation and the clarity of information throughout was of some benefit to concerned customers.
We are currently discussing means to prevent this happening again, however, as the attack was not directed at subnets within our own network, it will still be hard to mitigate.
For reliability and performance, we hand off BGP to our upstream provider who uses multiple peers and handles external (internet) routes on our behalf - however, this was our downfall, as when another customer of theirs fell victim to a DDoS attack, it saturated the common transit uplinks affecting the entire data centre.
We are not in doubt of our current peers/transit providers, as they have served us well, with 3 years of 100% network connectivity, and we have full faith in their ability to deal with future issues.
]]>Engineers are still working on a resolution and to identify the root issue - but at present we are awaiting updates.
What we know
The issue is outside of Sonassi Hosting’s network; our transit provider is experiencing difficulties at the data centre which is something that we cannot remedy. They have engineers on site working on a fix.
We still have 100% power and 100% cooling, and our internal network (from the edge in) is 100% functional; however, outbound/inbound national routes are flapping.
First ever significant outage
This is our first ever significant outage in 3 years of operations and certainly not what our clients are accustomed to.
We would like to reassure all customers that we will remain available on here and Twitter (@sonassi @sonassihosting) if you want to talk to us directly.
]]>Another update will be given within the hour.
]]>Access will be limited over the next 1hr.
This will not affect any dedicated or shared hosting customers.
]]>If you need any assistance with managing your account, just email the team at support@sonassihosting.com and we’ll take care of all your requests.
]]>Resolved
Services are back up and running now, apologies for the slight glitch and 1 minute downtime.
]]>Downtime may be incurred, but should be minimal.
Resolved
A fault occurred during a routine configuration change and the switch took longer to reboot than expected.
]]>Resolved
A service did not auto-restart as it should have done and required manual intervention. All services on sms-sagat are up and running again.
]]>You can find it at http://status.sonassihosting.com/
]]>