It’s a Wednesday, and the accounting team is closing out this month’s sales and running end-of-month processing on a multicloud platform deployed four months ago. They run sales order entry on one cloud provider and the accounting application on another. Spanning both clouds are a common security system and an API manager, among other services.
What took only a few hours to process from start to finish last month now takes almost a day. You get an angry call from the CFO: “What the heck is going on?” Better put, what is happening with your multicloud’s performance this month?
Multicloud deployments, and cloud deployments in general, behave differently at different stress levels. There was little stress during last month’s processing; this month there’s a medium stress level that is causing a severe performance issue.
Those of you who diagnose and fix performance problems already understand this, but if not, here’s the best way to think about cloud performance: Every component depends on every other component functioning well. Trouble arises when one component does not pull its weight in the “cloud performance supply chain.” The problem might stem from network or database latency, memory I/O latency, or storage performance. The result is the same: Overall performance will suffer.
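To make that concrete, here is a minimal sketch in plain Python. The per-component latencies and batch size are made up for illustration, but they show how a single slow link in the chain can push an end-of-month run from a couple of hours to most of a day:

```python
# Hypothetical per-item latencies, in seconds, for one sales order moving through
# the chain. Last month every step was fast; this month the intercloud API hop is slow.
last_month = {"order_lookup": 0.05, "api_manager": 0.04, "accounting_post": 0.06}
this_month = {"order_lookup": 0.05, "api_manager": 0.90, "accounting_post": 0.06}

def end_to_end_hours(latencies, items):
    """Total run time when each item passes through every component serially."""
    return items * sum(latencies.values()) / 3600

ITEMS = 50_000  # illustrative batch size
print(f"Last month: {end_to_end_hours(last_month, ITEMS):.1f} hours")  # ~2.1 hours
print(f"This month: {end_to_end_hours(this_month, ITEMS):.1f} hours")  # ~14.0 hours
```

Nothing broke outright; one component simply slowed down, and every transaction that passes through it pays the price.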
In our example, any component that failed to perform could have triggered a cascading set of events that killed overall performance. In this case, end-of-month processing suffered even though the load only increased from a small to a medium stress level.
Of course, the slowest component sets your overall performance, and the cloud is no different. The culprit might be poor network performance, a slow database, a shortage of CPU resources, or a badly behaved application. These “cloud gremlins” can keep cloud architects and developers chasing for days, sometimes months, and in many instances they are not easy to track down. So, where do you look?
The best answer is to employ a good cloud management and operations tool, preferably one that can provide operational observability. Instead of wading through piles of detailed data (often called noise), you get the meaning of the data. A good tool typically indicates where the performance issue exists and can even provide the root cause.
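Under the hood, these tools do something like the following sketch: time each step of a transaction and surface the slowest one. The timing spans here are hand-rolled and the sleeps are placeholders standing in for real service calls; a proper observability tool collects the same kind of data automatically and at scale:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def span(name):
    """Record how long a named step takes, much like a trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def process_order(order_id):
    # Each sleep is a placeholder for a real call in the pipeline.
    with span("fetch_order"):
        time.sleep(0.01)   # sales order entry service on cloud A
    with span("cross_cloud_api"):
        time.sleep(0.25)   # API manager hop between clouds
    with span("post_to_ledger"):
        time.sleep(0.02)   # accounting application on cloud B

process_order(order_id=1)
print({name: f"{seconds * 1000:.0f} ms" for name, seconds in timings.items()})
print("Slowest step:", max(timings, key=timings.get))
```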
The network might have a latency problem, which is easy to diagnose. The tool could also trace the issue to a poorly performing VPN that carries data from one cloud provider to another. This is a frequent problem in multicloud deployments: intercloud communications are heavily relied upon, and thus heavily stressed, yet the connections between clouds are rarely maintained as well as they should be. Indeed, in the last several performance problems I was called in to diagnose, the root cause was an intercloud networking issue.
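A quick way to confirm a suspicion about the intercloud link is to probe it directly. The sketch below measures the median round-trip time to health endpoints on each cloud; the URLs are placeholders, so substitute whatever your services actually expose across the VPN:

```python
import statistics
import time
import urllib.request

# Placeholder health-check URLs; replace with the endpoints your services expose.
ENDPOINTS = {
    "cloud_a_orders": "https://orders.cloud-a.example.internal/health",
    "cloud_b_ledger": "https://ledger.cloud-b.example.internal/health",
}

def median_rtt_ms(url, samples=5):
    """Median round-trip time, in milliseconds, for a simple GET."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        urllib.request.urlopen(url, timeout=5).read()
        times.append((time.perf_counter() - start) * 1000)
    return statistics.median(times)

for name, url in ENDPOINTS.items():
    print(f"{name}: {median_rtt_ms(url):.0f} ms")
# A cross-cloud path that sits well above its normal baseline points at the
# VPN or intercloud link rather than at the applications themselves.
```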
Other frequent problems with multicloud deployments include database performance issues on a single cloud provider that cause latency across several applications. Often the applications themselves are blamed, and code fixes are even ordered. Only when the code fixes fail to help does the database emerge as the culprit. The moral of that story: diagnose first, fix second.
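The diagnosis can be as simple as timing the database call separately from the application work that consumes its results. This sketch uses an in-memory SQLite table purely as a stand-in for the managed database in question:

```python
import sqlite3
import time

# In-memory stand-in for the managed database on the provider in question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(100_000)])

start = time.perf_counter()
rows = conn.execute(
    "SELECT id % 12 AS month, SUM(amount) FROM sales GROUP BY month"
).fetchall()
db_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
report = {month: round(total, 2) for month, total in rows}  # stand-in app-side work
app_ms = (time.perf_counter() - start) * 1000

print(f"database: {db_ms:.1f} ms, application: {app_ms:.1f} ms")
# If the database side dominates, rewriting application code will not help;
# diagnose first, fix second.
```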
Of course, the list goes on. Multiclouds are complex, distributed platform deployments, and the applications and data that reside on them can be just as complex. Performance issues will pop up often. My best advice is to invest in a good set of cloudops technologies that operate across providers and can quickly diagnose the most common problems. Some even provide self-healing services to correct issues proactively. These tools pay for themselves with the first problem they solve.