Many devops teams focus on implementing CI/CD pipelines, automating regression testing, configuring infrastructure as code, and containerizing application runtime environments. Collectively, these practices and technologies help organizations deploy applications more frequently and reduce errors from manual steps and configuration.
But many businesses want more and expect SaaS-like performance from their applications. This isn’t just about how reliable the application is (that is, how many 9s of uptime) or how fast it responds. Those are table stakes for monitoring applications. More businesses are using technology in strategic ways where user issues can affect revenue or operations.
That desire drives a whole new set of monitoring considerations around applications. Knowing that the web server is responding, that a microservice has millisecond response times, and that database query performance meets SLAs is no longer sufficient.
Think about the last time you flew on an airplane. We all expect to land safely and are angered by flight delays or issues with our luggage. Going deeper, we’re hoping for a better experience from the time we walk into the airport: How long does it take to get through security? How pleasant are the waiting areas, and can we find something to eat? Is boarding the airplane seamless? Are the inflight entertainment and Wi-Fi working properly?
Then think about what happens if something goes wrong. If there are delays, how accurately and efficiently does the airline communicate status and flight options to you? If the inflight entertainment isn’t working, can the crew fix it quickly so you can still watch a movie? You consider all these things when answering the question “How was your flight?”
The new questions to ask on application monitoring
Developers, engineers, and managers should think about an expanded set of application-monitoring requirements. More specifically, devops teams looking to excel in operational performance should consider monitoring that addresses questions like these:
- How well are your CI/CD and testing pipelines performing, and how quickly does the team resolve issues that break the build?
- How well is the application addressing user needs and expectations?
- What application improvements can be discerned from user behaviors?
- Can operational incidents be isolated and resolved with minimal user impact?
- To what extent are developers disrupted by operational incidents (that is, engaging in firefighting)?
- Are slowly increasing usage metrics starting to affect performance?
- How quickly can application data get loaded, processed, and reported on?
- If the application is on a public cloud, are costs growing faster than expectations, and are there other cost optimizations worth considering?
- How should applications monitor for context such as browser, device, location, and time of day?
- What additional monitoring is required around APIs, especially ones used by third-party applications?
- What monitoring tools will be required for large-scale IoT applications or others interfacing with blockchains?
- Are there any new versions, patches, or alerts on any application components?
- Are there any security issues or breaches in the application?
This is a greatly expanded scope of concerns beyond what is traditionally bucketed and provisioned under application monitoring. Yet, as more businesses operate like technology companies, devops teams should be thinking about these requirements.
Here are four options for devops teams to address this scope of application monitoring. As with all transformation priorities, devops teams should start small and focus their monitoring efforts where there is the most to learn about usage and the greatest risk to address.
1. Aggregate information on the user experience
If you’re building web or mobile applications, it’s common to embed an analytics tracker to capture users, visits, and other usage metrics. More advanced metrics can also be captured directly by the application and stored in log files, databases, or piped into data streams. There may also be relevant data in user-registration systems, single-sign-on (SSO) tools, and customer relationship management (CRM) systems.
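As a minimal sketch of the capture side, the snippet below writes one usage event as a JSON line that could be tailed into a log pipeline or piped into a data stream; the event names and fields are hypothetical, so adapt them to your application’s schema.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("usage")

def track_event(user_id: str, event: str, **context) -> None:
    """Write one usage event as a JSON line for downstream aggregation."""
    record = {
        "ts": time.time(),     # epoch timestamp for ordering
        "user_id": user_id,    # join key against SSO or CRM data
        "event": event,        # e.g., "report_exported"
        **context,             # browser, device, location, and so on
    }
    logger.info(json.dumps(record))

track_event("u-1001", "report_exported", device="mobile", region="us-east")
```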
In my experience, business teams and marketers are more likely to be looking at user behavior in customer-facing applications. The usage and behavior of internal applications are more often an afterthought when IT deploys enterprise or internal workflow applications.
In both cases, developers, database engineers, and IT operations have a vested interest in capturing metrics, reviewing usage, understanding patterns, and evaluating the satisfaction of application users. One way to do this is to aggregate the relevant metrics into a data warehouse or data lake and then use tools like Tableau, Microsoft Power BI, or an open source data visualization tool to monitor performance and discover insights on user behavior.
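Continuing the sketch, and assuming the JSON-line events above have landed in a file, a short script can roll them up into the daily shape a warehouse table and a Tableau or Power BI dashboard would expect; the file name and aggregates are illustrative.

```python
import json
import pandas as pd

# Load the JSON-line events (hypothetical landing file from the tracker).
with open("usage_events.jsonl") as f:
    events = pd.DataFrame(json.loads(line) for line in f)

# Bucket events by calendar day.
events["date"] = pd.to_datetime(events["ts"], unit="s").dt.date

# Roll up to daily active users and event volume.
daily = events.groupby("date").agg(
    active_users=("user_id", "nunique"),  # distinct users per day
    events=("event", "count"),            # total events per day
)
daily.to_csv("daily_usage.csv")           # stage for the warehouse load
```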
2. Large IT organizations should define CI/CD performance metrics
Large development teams supporting more application development and microservice architectures should establish CI/CD pipeline and testing metrics. For large organizations, these metrics represent the throughput and quality of the team’s work and should alert on blockers and quality issues that can impede progress. Tools like Jenkins, Jira, and Git can track different aspects of the develop, test, build, and deploy processes.
Some of these metrics should be strategic and tied to key devops performance indicators. As a CIO, I am most interested in strategic metrics like features released per quarter and the defect-escape rates.
But development teams should also establish the metrics that make software development run like a manufacturing assembly line. Metrics like deployment frequency, build duration, build failure rate, and automation test coverage provide indicators on the efficiency of the team and quality of work.
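Here is a minimal sketch of how those assembly-line metrics might be computed from build records; the record shape is made up, and in practice you would pull the data from your CI system (for example, the Jenkins API).

```python
from datetime import date

# Hypothetical build records; in practice, pull these from your CI system.
builds = [
    {"day": date(2024, 1, 8), "duration_s": 412, "passed": True,  "deployed": True},
    {"day": date(2024, 1, 8), "duration_s": 455, "passed": False, "deployed": False},
    {"day": date(2024, 1, 9), "duration_s": 390, "passed": True,  "deployed": True},
]

failure_rate = sum(not b["passed"] for b in builds) / len(builds)
avg_duration = sum(b["duration_s"] for b in builds) / len(builds)
active_days = {b["day"] for b in builds}
deploys_per_day = sum(b["deployed"] for b in builds) / len(active_days)

print(f"build failure rate:     {failure_rate:.0%}")
print(f"average build duration: {avg_duration:.0f}s")
print(f"deployment frequency:   {deploys_per_day:.1f}/day")
```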
3. Use machine learning to improve incident response
Once applications are running in production, a primary concern for devops teams is handling incidents that affect users, put the organization at risk, or put the IT team into firefighting mode. When there is an issue, the key questions are whether application monitors are in place to alert IT to the issue and how quickly IT can resolve it. It is also important to understand who gets involved in researching, diagnosing, and resolving the issue. Incidents escalated to developers often take longer and are more expensive to resolve.
Proactive devops teams implement more monitors, log more data, and invest in code-level exception handling to make sure there are sufficient indicators and data to detect and diagnose issues. But this can also overwhelm teams when a single application has several alerts coming from different tools indicating one or more issues.
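A minimal sketch of what code-level exception handling with diagnostic context can look like; the function names and fields are hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

def validate(payload: dict) -> None:
    # Hypothetical domain check; raises on bad input.
    if "sku" not in payload:
        raise ValueError("missing sku")

def process_order(order_id: str, payload: dict) -> bool:
    try:
        validate(payload)
        return True
    except Exception:
        # logger.exception records the stack trace; logging the order id
        # and payload keys gives operations enough context to triage
        # without escalating to a developer.
        logger.exception("order %s failed; keys=%s", order_id, list(payload))
        return False

process_order("o-42", {"qty": 1})  # logs the failure with context
```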
New autonomous digital operations and AIops tools from vendors like BigPanda, BMC, HPE, IBM, and Splunk aim to simplify operations management and improve response times to incidents by using machine learning. These tools correlate information from multiple monitors, aid in discovering root causes, forecast future concerns with predictive analytics, and automate elements of incident response. Organizations that manage mission-critical user experiences and large-scale applications connected to multiple data sources, microservices, and third-party APIs should benefit from machine-learning-backed tools.
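As a toy illustration of the correlation idea (not any vendor’s API), the sketch below groups alerts from different monitors into one incident when they hit the same service within a short time window; the alerts and window size are made up.

```python
from collections import defaultdict

# Hypothetical alerts raised by several monitoring tools.
alerts = [
    {"t": 100, "source": "apm",   "service": "checkout", "msg": "latency p99 > 2s"},
    {"t": 130, "source": "logs",  "service": "checkout", "msg": "DB timeout errors"},
    {"t": 900, "source": "synth", "service": "search",   "msg": "probe failed"},
]

WINDOW = 300  # seconds; alerts this close on one service likely share a cause

incidents = defaultdict(list)  # service -> list of incident groups
for alert in sorted(alerts, key=lambda a: a["t"]):
    groups = incidents[alert["service"]]
    if groups and alert["t"] - groups[-1][-1]["t"] <= WINDOW:
        groups[-1].append(alert)   # within the window: same incident
    else:
        groups.append([alert])     # gap too large: open a new incident

for service, groups in incidents.items():
    for group in groups:
        print(f"incident on {service}: {[a['msg'] for a in group]}")
```

The two checkout alerts collapse into one incident while the search probe failure stands alone, which is the basic behavior AIops tools deliver at much larger scale.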
4. Develop a holistic view of the operating environment
The last aspect of monitoring requires developing a holistic view of usage, environment, infrastructure, and ecosystem of the applications. Think of it like facilities management practices applied to applications, showing all assets, asset maintenance factors such as patching and security alerts, long-term usage patterns, and anomalous activities. This macro view provides devops management teams a set of indicators and tasks to better maintain a portfolio of applications.
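To make the macro view concrete, here is a minimal sketch, with made-up fields, of the per-application roll-up it implies: patch status, open security alerts, and the long-term usage trend combined into a simple health indicator.

```python
# Hypothetical per-application records; real data would come from asset,
# patch, and monitoring systems.
portfolio = [
    {"app": "billing",  "pending_patches": 3, "security_alerts": 1, "usage_trend": 0.12},
    {"app": "intranet", "pending_patches": 0, "security_alerts": 0, "usage_trend": -0.30},
]

def health(app: dict) -> str:
    if app["security_alerts"] > 0:
        return "act now"   # security alerts and breaches come first
    if app["pending_patches"] > 0 or app["usage_trend"] < -0.25:
        return "review"    # maintenance debt or fading usage
    return "ok"

for app in portfolio:
    print(app["app"], "->", health(app))
```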