Cloud adoption has grown rapidly over the last couple of years. But the path is not always an easy one, often because SLAs for adoption and migration are not clearly demarcated. And while the cloud service provider is responsible for its own infrastructure and security, relying on that responsibility alone is not always enough and can create challenging situations.
The list below rounds up the major cloud outages of 2019: situations in which customer systems and devices were jeopardized. These incidents not only highlight how critical cloud security and reliability are, but also point to the stronger systems and processes needed to avoid similar failures.
Apple iCloud

In July, many iCloud users across the globe received a “Service Unavailable – DNS failure” message for several hours. This widespread cloud outage disrupted services such as the App Store, Apple TV, Apple ID, Apple Music, Apple Books, Subscriptions, and more. Until the issue was resolved, users were unable to use functions such as Find My iPhone. Apple confirmed that the outage was due to a ‘BGP route flap’ that caused severe packet loss for North American users.
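Outages like this often surface first as failed name resolution on the client side. As a minimal monitoring sketch (the hostnames below are illustrative placeholders, not Apple's real endpoints), a script can probe whether the services it depends on still resolve:

```python
import socket

def dns_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        # Resolution failed: the classic symptom of a DNS outage.
        return False

# Illustrative endpoint list, not real service hostnames.
endpoints = ["www.example.com", "nonexistent.invalid"]
for host in endpoints:
    print(host, "resolves" if dns_resolves(host) else "DNS failure")
```

A check like this helps distinguish a name-resolution failure from a service that resolves but is unreachable, which matters when deciding whether the problem lies in DNS or in routing.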
Salesforce

In May 2019, Salesforce faced one of its most serious service disruptions during the deployment of a database script to the Pardot Marketing Cloud. The faulty script granted elevated permissions to regular users. To keep employees from accessing or stealing sensitive corporate data, Salesforce was forced to block user access, and it also cut network access to other Salesforce services such as Service Cloud and Sales Cloud to prevent further damage. As a result, customers could not access Pardot Marketing Cloud for 20 hours, and it took 12 days to fully restore the other Salesforce services.
Amazon Web Services

In August 2019, an Amazon AWS US-EAST-1 datacenter in North Virginia suffered a power failure, and the datacenter’s backup generators failed as well. This left 7.5% of the EBS volumes and EC2 instances unavailable. After power was restored, Amazon determined that some of the hardware had been damaged, with loss of data. Some customers suffered extensive data loss, raising questions about the safety of data stored in the cloud. The AWS incident was one of the most significant hardware failures of 2019, a reminder that hosting data in the cloud is never completely safe and reliable.
Facebook and Instagram
In early 2019, a server configuration change caused issues with Instagram and Facebook. During the outage, which lasted around 14 hours, users experienced glitches across the Facebook-owned properties WhatsApp and Instagram. Facebook received numerous complaints throughout 2019 about its apps not working, including users being unable to access the Messenger app.
Google Cloud Servers
The cloud servers in the us-east1 region were cut off from the rest of the globe by a Cloud Networking and Load Balancing issue, triggered by physical damage to multiple concurrent fiber bundles serving that region’s network paths. Google carried out extensive mitigation work after the incident; even so, users complained of increased latency.
Microsoft

Microsoft faced its share of cloud outages in 2019, affecting Azure, Dynamics, Microsoft 365, and DevOps. In May, Microsoft suffered an outage lasting more than an hour that produced network connectivity errors in Microsoft Azure. This deeply affected its cloud services, including Xbox Live, Office 365, Microsoft Teams, and several others used by Microsoft’s commercial customers. The root cause was an incorrect name server delegation that affected DNS resolution and network connectivity, with downstream impact on dependent services. Although no customer DNS records were impacted during the incident, it was still a significant failure for one of the world’s largest companies.
Google Cloud Platform
Google Cloud Platform (GCP) recently experienced significant issues with services including Cloud Dataflow, Cloud Storage, and Compute Engine, affecting multiple products and impacting major APIs globally. Interestingly, this incident came almost 20 days after users faced 100% packet loss to and from 20% of instances in GCP’s us-west1-b zone for 2.5 hours. The reason for this significant failure was an issue with the associated Chubby lock service, which caused the control plane to gain and lose leadership in quick succession.
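Chubby-style lock services grant leadership through time-limited leases that the leader must keep renewing. The toy simulation below (the numbers are purely illustrative, and this is a sketch of the failure mode, not GCP's implementation) shows how renewal latencies oscillating around the lease TTL make leadership flap, the gaining-and-losing-leadership pattern described above:

```python
def count_transitions(renew_latencies, lease_ttl):
    """Each round, the node keeps leadership only if its lease renewal
    completes before the lease expires. Returns the number of
    leadership transitions (gains plus losses)."""
    transitions = 0
    is_leader = False
    for latency in renew_latencies:
        holds_lease = latency < lease_ttl
        if holds_lease != is_leader:
            transitions += 1
            is_leader = holds_lease
    return transitions

# Healthy renewals: a single transition (the initial acquisition).
print(count_transitions([0.5, 0.6, 0.5, 0.4], lease_ttl=1.0))       # 1
# Latencies oscillating around the TTL: leadership flaps every round.
print(count_transitions([0.5, 1.2, 0.5, 1.3, 0.6], lease_ttl=1.0))  # 5
```

The usual mitigation is to keep the lease TTL comfortably above worst-case renewal latency and to add hysteresis, so a briefly slow renewal does not immediately dethrone the leader.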
All these cloud outages were eye-openers for the entire industry, as they involved some of the top global brands. The causes were unusual and largely unexpected, which makes them all the more worrying.
Clearly, small, medium, and large firms alike need to analyze these incidents and put systems in place to ensure that such outages can be prevented.