Since the global pandemic made us all switch to online activities a year ago, reliable software is everything. The secret of software providers’ success lies in preparing for different failures and reacting quickly whenever things go wrong. DevOps methodology provides smart tools that can make your cloud software more reliable than ever before.
When the Internet entered general use a few decades ago, users were deeply fascinated and ready to accept even major difficulties with connection, website loading or different software functionalities. In 2021, when software offers various functionalities, end-users are incredibly aware of their needs and ready to express their demands publicly. Hours-long outages of work-related apps like Slack, Teams or Office 365, social media, shops, or streaming platforms are unacceptable. Software providers need to create a detailed plan to build and maintain fully reliable systems as customers’ trust is at stake.
Giants’ loudest outages of 2020
Initially hailed as the year of cloud computing, 2020 soon turned into a year of frustration, as global cloud-based software providers suffered outages that significantly impacted web services, apps, and overall business. Major cloud service providers had to draw conclusions and find effective remedies for the future.
Microsoft Azure, March 2020
- Duration of outage: 6 hours
- Region: Microsoft’s East U.S. datacenter region
- Cause: A cooling system failure that required manually resetting the cooling system’s controllers
- Affected: Storage, compute, networking, and other services
IBM Cloud, June 2020
- Duration of outage: 4 hours
- Cause: Multi-zone interruption of services caused by a third-party network provider that flooded the IBM Cloud network with incorrect routing
- Affected: IBM Cloud customers in Washington, D.C., Dallas, London, Frankfurt, and Sydney; IBM Cloud services and 80+ data centres, including general cloud services, Kubernetes services, App Connect, and Watson AI cloud services
Cloudflare, July 2020
- Date of outage: July 17, 2020
- Cause: A configuration error in Cloudflare’s global backbone network resulted in a 50% traffic drop across its network
- Affected: A significant chunk of internet services, including several big-name clients such as Discord, Feedly, GitLab, League of Legends, Patreon, Politico, and Shopify
Amazon Web Services, November 2020
- Date of outage: November 25, 2020
- Cause: A multi-hour, global outage was triggered by a small addition of capacity to Amazon Kinesis
- Affected: The US-EAST-1 region, where the outage knocked out the services of prominent AWS customers (1Password, Adobe Spark, Autodesk, Flickr, iRobot, Roku, Twilio, The Washington Post, and Glassdoor) as well as other AWS services, such as Lambda, Lex, Macie, Managed Blockchain, Marketplace, MediaLive, MediaConvert, Personalize, Rekognition, SageMaker, and WorkSpaces
Google Cloud, December 2020
- Duration of outage: nearly 1 hour, December 14, 2020
- Cause: Google Cloud experienced a widespread global authentication system outage due to an internal storage quota issue
- Affected: Major Google services including YouTube, Google Maps, Google Docs, and Gmail
The remedy: DevOps practices
The meaning of DevOps is still evolving, but at its core, DevOps is a set of practices.
A compound of development (Dev) and operations (Ops), DevOps is the union of people, process, and technology to provide value to customers continually. What does DevOps mean for teams? DevOps enables formerly siloed roles (development, IT operations, quality engineering, and security) to coordinate and collaborate to produce better, more reliable products. By adopting a DevOps culture along with DevOps practices and tools, teams gain the ability to better respond to customer needs, increase confidence in the applications they build and achieve business goals faster. (Source: Microsoft)
With the DevOps methodology, development and operations teams can work together, apply automation, and use the same tools in shorter development cycles, so you get results much faster.
The DevOps methodology is based on nine pillars that, when combined, form a complete approach that benefits your business and leads to project success. It’s faster, more reliable, and more secure.
In adopting DevOps practices, teams work to ensure system reliability and high availability, aiming for zero downtime while reinforcing security and governance. DevOps teams seek to identify issues before they affect the customer experience and to mitigate problems immediately when they occur.
Maintaining this vigilance requires:
- rich telemetry,
- actionable alerting,
- full visibility into applications and the underlying system.
Failure as part of a plan
Reliable software is crucial not only from a customer’s perspective. It’s also essential for software developers and operators. When an engineering team faces disruption, they step into an interrupt-driven development phase, which is an easy way to burn out as a group and individually.
DevOps tools to build software reliability
First of all, we need to think about achieving and maintaining software reliability as a constant process that requires engagement. You decide with your team on the level of reliability your company wants to provide. After that, all team members need to work on it consistently every day. These are just a few of the effective tools to use:
Analysing and learning through retrospectives is a great way to understand why something happened, but also why things work at all. Your team will discover what was done to resolve the incident and why certain decisions were made. When you identify all the factors that contributed to an outage, you can analyse every single one of them in detail, explore weak areas and plan better decisions for the future.
Rule the chaos before it takes over your system. Causing outages intentionally in a controlled way provides your team with priceless knowledge. It is a great way to build resilience and reliability among engineers. Chaos engineering helps build better functioning software.
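To make the idea of a controlled, intentional failure more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: `with_chaos`, `ChaosError`, `fetch_price`, and `price_with_fallback` are hypothetical names invented for this example, not part of any real chaos engineering tool. The wrapper makes a chosen fraction of calls fail on purpose, so the team can check whether the calling code degrades gracefully.

```python
import random


class ChaosError(RuntimeError):
    """Raised when a failure is injected on purpose."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of its calls fail deliberately."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0)
        if rng.random() < failure_rate:
            raise ChaosError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_price(item):
    # Hypothetical stand-in for a call to a downstream service
    return {"book": 12.5}.get(item, 0.0)


def price_with_fallback(fetch, item, default=None):
    """The behaviour under test: survive a failing dependency."""
    try:
        return fetch(item)
    except ChaosError:
        return default


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so that an experiment is repeatable
    flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, rng=rng)
    results = [price_with_fallback(flaky_fetch, "book") for _ in range(10)]
    print(results)
```

In a real experiment the failure would be injected into infrastructure rather than a single function, but the principle is the same: limit the blast radius, make the experiment repeatable, and observe how the rest of the system reacts.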
Quick on/off turning
There are multiple techniques to use, such as canary releases, A/B testing, blue/green deployments, rolling updates, dark launching, and feature flags. They are used across the software stack, and the reason we use them is simple: complex systems and deploys need a simple light switch that makes some parts go dark when a failure occurs.
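The "light switch" idea can be sketched as a small feature flag store with a percentage rollout and a kill switch. This is a simplified illustration, not the API of any particular feature flag product; the `FeatureFlags` class and the flag name `new_checkout` are assumptions made up for the example.

```python
import hashlib


class FeatureFlags:
    """In-memory flag store with percentage rollout (hypothetical sketch)."""

    def __init__(self):
        self._rollout = {}  # flag name -> percentage of users (0..100)

    def set_rollout(self, name, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[name] = percent

    def kill(self, name):
        """The light switch: turn the feature off for everyone."""
        self._rollout[name] = 0

    def is_enabled(self, name, user_id):
        percent = self._rollout.get(name, 0)  # unknown flags are off
        # Hash flag name and user id so each user lands in a stable bucket
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100  # 0..99
        return bucket < percent


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.set_rollout("new_checkout", 10)  # canary: about 10% of users
    print(sum(flags.is_enabled("new_checkout", u) for u in range(1000)))
    flags.kill("new_checkout")             # failure detected: go dark
    print(any(flags.is_enabled("new_checkout", u) for u in range(1000)))
```

Hashing the user id instead of drawing a random number keeps each user's experience stable between requests, which makes a canary release easier to reason about.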
Complex systems include many running applications, and no individual can keep track of them all. Maintaining and keeping control of thousands of microservices is easier when you have a record of all of them and their inner workings. Your team then knows exactly what they rely on in a complex system.
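A record of services and what they rely on can be as simple as a dictionary. The sketch below is a toy example: the service names, owners, and the `impacted_by` helper are all hypothetical. It shows how such a record answers a practical question during an incident: which services are affected, directly or indirectly, when one of them fails?

```python
# Hypothetical catalog: every name and owner here is illustrative only
CATALOG = {
    "checkout": {"owner": "payments-team", "depends_on": ["orders", "auth"]},
    "orders": {"owner": "orders-team", "depends_on": ["database"]},
    "auth": {"owner": "identity-team", "depends_on": ["database"]},
    "database": {"owner": "platform-team", "depends_on": []},
}


def impacted_by(catalog, failed_service):
    """Return all services that depend on failed_service, directly or not."""
    impacted = set()
    frontier = [failed_service]
    while frontier:
        current = frontier.pop()
        for name, entry in catalog.items():
            if current in entry["depends_on"] and name not in impacted:
                impacted.add(name)
                frontier.append(name)  # follow the chain upwards
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by(CATALOG, "database")))
```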
It’s essential to have a plan and an overall knowledge base in case something happens. Simple runbooks are a repository of rules to share before a crisis hits the system. For example, when a database approaches its maximum disk capacity, your team will get an alert and a checklist of actions to take.
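The disk capacity example could look roughly like the following sketch, which pairs an alert with its checklist. The 85% threshold, the runbook name, and the checklist steps are placeholders chosen for illustration; a real runbook would contain the steps your own team has agreed on.

```python
# Hypothetical runbook repository: the steps below are placeholders
RUNBOOKS = {
    "database_disk_nearly_full": [
        "Confirm disk usage on the database host",
        "Remove or archive old logs and temporary files",
        "Check for unexpected data growth since the last deploy",
        "Extend the volume or escalate to the on-call lead",
    ],
}


def check_disk(used_bytes, total_bytes, threshold=0.85):
    """Return an alert with its runbook checklist, or None if usage is fine."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    usage = used_bytes / total_bytes
    if usage < threshold:
        return None
    runbook = "database_disk_nearly_full"
    return {
        "runbook": runbook,
        "usage": usage,
        "checklist": RUNBOOKS[runbook],
    }


if __name__ == "__main__":
    alert = check_disk(used_bytes=460, total_bytes=500)
    if alert is not None:
        print(f"ALERT: disk at {alert['usage']:.0%}")
        for step_number, step in enumerate(alert["checklist"], start=1):
            print(f"  {step_number}. {step}")
```

On a real host, the usage figures could come from the standard library call `shutil.disk_usage(path)`, which returns total, used, and free bytes.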
It is challenging to measure reliability, but you can explore it by implementing SLOs (service-level objectives). By alerting on SLOs that reflect elements of the customer experience, you can get closer to what reliability means for your particular system.
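One common way to alert on an SLO is to track its error budget, meaning the number of failures the objective still allows. The sketch below assumes a request-based availability SLO; the function names, the 99.9% target, and the 10% alert threshold are illustrative choices, not recommendations.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the required success ratio, for example 0.999.
    The result is negative when the budget has been overspent.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # No traffic yet or a 100% target: any failure exhausts the budget
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures


def should_alert(slo_target, total_requests, failed_requests, min_remaining=0.1):
    """Alert once less than min_remaining of the error budget is left."""
    remaining = error_budget_remaining(slo_target, total_requests, failed_requests)
    return remaining < min_remaining


if __name__ == "__main__":
    # With a 99.9% target, 1,000,000 requests allow about 1,000 failures
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.1%}")
    print(should_alert(0.999, 1_000_000, 950))
```

Alerting on the remaining budget rather than on every individual error ties the alert to what customers actually experience over the measurement window.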
Turn failure into high reliability
Building a reliable system is a process of constant improvement. It never ends and requires devoted engineers ready to search for the best solution, test different paths, and most of all, eagerly learn from failure. Remember that users’ needs change just like the elements they trust and rely on.