Complexity is a natural byproduct of a highly heterogeneous and distributed architecture. Now we better understand its impact.

Although we get different messages from cloud computing providers, we now have data that suggests public cloud outages are getting worse. The Uptime Institute recently released its 2022 Outage Analysis report, which included such findings as “high outage rates remain an issue.” Indeed, one in five organizations reported a “serious” or “severe” outage that resulted in significant financial losses, reputational damage, compliance breaches, or, in some severe cases, loss of life. The report concludes that there has been a slight upward trend in the prevalence of major outages in the past three years.

I’m usually not one to bust out the quotes, but this statement by Andy Lawrence of the Uptime Institute is worth mentioning: “The lack of improvement in overall outage rates is partly the result of the immensity of recent investment in digital infrastructure and all the associated complexity that operators face as they transition to hybrid, distributed architectures.”

Complexity is not a new challenge for IT. However, we recently created much more complexity through quick digital transformations and the wild rush to cloud and multicloud in response to the pandemic. These factors resulted in a new high in the number and types of systems that support businesses. Most enterprises reported that they once supported about 500 cloud services for the entire enterprise and now support about 3,000 services over a multicloud deployment.

These numbers indicate that the technology doesn’t cause the outages; it’s how the technology is used and the amount of technology in use. As the report states, nearly 40% of organizations have suffered a major outage caused by human error. Of these incidents, 85% have a root cause of staff failing to follow procedures or flaws in the processes and procedures themselves.
The root causes of complexity are well understood. There are many more moving parts to oversee in multicloud and cloud architectures and not enough money to quadruple operations staff. Cause, meet effect.

Why does this complexity happen in the first place? Much better operations tools are now available, such as AIOps and cross-cloud multicloud monitoring solutions. These tools allow developers and innovators to leverage best-of-breed technologies to build and deploy business-changing technologies. Developers can deploy the optimal choices for storage systems, AI systems, compute, databases, etc., that may come from one or (more likely) many cloud providers. The result is a complex and highly heterogeneous multicloud deployment that requires staff with specialized skills to operate it effectively and limit the number of outages. Ironically, most IT organizations can’t get approval for an increased ops budget because cloud computing promised to make operations less expensive.

What’s the solution? As I’ve stated here a few times, abstraction and automation layers remove humans (and human errors) from the front and center of all operations processes. These layers also include tools for ops planning or replanning to optimize multicloud operations, which can take your operations game to the next level.

That brings us back to the original problem. Rebooting cloud and multicloud operations to incorporate abstraction and automation layers requires more money and more skills. Until enterprises reach a tipping point where the complexity costs more to manage than it does to address directly, we’ll see more outages. It’s too bad that we must do damage just to understand how to avoid doing damage. Sadly, we’ve been here many times before.