You don’t have to look very hard to find technology platforms that advertise ML (machine learning), automation, and AI (artificial intelligence) capabilities. Once devops became mainstream, it bred process, technology, and IT culture movements with similar names, including cloudops, dataops, sysops, and AIops.
This may leave some of you skeptical about whether applying machine learning in IT operations can deliver business and IT value. Skepticism is healthy, but the proliferation of these movements shouldn’t come as a surprise. I assure you that there are significant opportunities, and AIops is on my list of devops capabilities to boost in 2021.
IT environments have become more complex during the past decade, with autoscaling public and private clouds, edge computing infrastructure supporting IoT (Internet of Things), machine learning experiments on massive-scale databases, new integrations, frequent application deployments, mission-critical legacy systems, and highly leveraged microservices. There are also plenty of variables outside of IT control, such as security incidents, disparate end-user computing configurations, and volatile application usage patterns.
It’s a challenging environment if your job is to respond to incidents, resolve application problems, perform root cause analysis, diagnose complex user issues, validate operational risks, identify security weaknesses, or forecast computing costs.
This is where AIops solutions aim to help. In a previous article, I wrote about how AIops helps IT and SREs improve application monitoring and resolve incidents. But I still wanted to know more about how different solutions implement data cleansing, analytics, machine learning, and automation to simplify IT and deliver business impact.
Six AIops solution providers shared some answers that paint a broad picture of what problems AIops solves for the business and IT, what types of machine learning algorithms are used in their solutions, and how their products support automation.
Devo provides real-time ops and security visibility
Paco Huerta, senior director of IT operations and discoverability at Devo, says AIops should help IT be a step ahead of end-user issues. “AI in Devo provides automatic, full contextual insights across hybrid environments at scale, enabling operators to pinpoint an issue’s exact cause before the end-user is impacted.”
IT is under constant pressure, and Devo helps sift through the noise, quickly find the problem’s root cause, and assess risks. Inside Devo, a variety of open source and proprietary ML algorithms are at work, including time-series anomaly detection and an ML workbench to develop and deploy models. Models in Devo are stream based, so they learn continuously and adapt fast.
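To make the idea of a stream-based model concrete, here is a minimal sketch of time-series anomaly detection that learns one sample at a time. It is not Devo's implementation; the class name, the z-score threshold, the warmup length, and the latency values are all assumptions chosen for illustration. It keeps a running mean and variance with Welford's online update and flags values that sit far from the mean.

```python
import math


class StreamingAnomalyDetector:
    """Flags values far from a running mean, updated one sample at a time.

    Hypothetical illustration only; not any vendor's algorithm.
    """

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # z-score above which a value is flagged
        self.warmup = warmup        # samples to observe before flagging anything
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, value):
        """Score the value against past samples, then learn from it."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's online update of mean and squared deviations
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return is_anomaly


# Made-up response times in milliseconds, with one obvious spike
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 250, 99]
detector = StreamingAnomalyDetector()
flags = [detector.update(v) for v in latencies_ms]
anomalies = [v for v, flagged in zip(latencies_ms, flags) if flagged]
print(anomalies)
```

Note that this sketch also learns from the values it flags, which widens its tolerance after a spike. A production system would likely treat outliers more carefully and account for seasonality.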
Micro Focus aims to find and fix IT ops problems
Michael Procopio, AIops product marketing manager at Micro Focus, says full-stack AIops helps IT sift through enormous data sets to find and fix problems. “IT environments today produce more data than humans can process, and machine learning can reduce hundreds of alerts or millions of log files to a few suspects that humans can easily handle. Data reduction makes finding problems faster, and automation is the key to fixing problems faster. We call it full-stack AIops when linking the two can provide a find-to-fix solution with little or no human intervention.”
Micro Focus’s AIops solutions include Operations Bridge, which collects all events, metrics, and logs, including system-patch level and compliance data from more than 200 third-party tools and technologies. It then correlates against the service map, topology, and dependency data to build an accurate business service model.
The platform leverages unsupervised ML, including clustering, regression, inferential statistics, custom logic, and seasonality algorithms. It also utilizes operator feedback to improve system accuracy and direct future actions.
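The data reduction Procopio describes can be illustrated with a very small sketch. This is a rule-based stand-in for clustering, not Micro Focus's method: it masks the variable parts of each alert message so that similar alerts collapse into one group. The function names, the masking rules, and the sample alerts are assumptions made up for the example.

```python
import re
from collections import defaultdict


def template_of(message):
    """Mask variable parts (hex ids, numbers) so similar alerts share one key."""
    masked = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)
    masked = re.sub(r"\d+", "<NUM>", masked)
    return masked


def group_alerts(alerts):
    """Bucket (host, message) alerts by their masked message template."""
    groups = defaultdict(list)
    for host, message in alerts:
        groups[template_of(message)].append(host)
    return dict(groups)


# Hypothetical alert feed
alerts = [
    ("web-01", "Disk usage at 91% on /dev/sda1"),
    ("web-02", "Disk usage at 95% on /dev/sda1"),
    ("db-01", "Connection pool exhausted after 30 retries"),
    ("web-03", "Disk usage at 93% on /dev/sda2"),
    ("db-02", "Connection pool exhausted after 12 retries"),
]
groups = group_alerts(alerts)
for template, hosts in groups.items():
    print(f"{len(hosts)} alerts: {template} -> {hosts}")
```

Five alerts reduce to two suspects here. Real platforms would presumably cluster on far richer features, such as topology, timing, and metric behavior, rather than message text alone.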
Moogsoft enhances the cognitive capabilities of IT ops
Will Cappelli, field CTO at Moogsoft, stresses that IT operations need AI to keep up with the fast pace of devops-driven changes. “Modern IT systems exhibit complex behaviors, and their components and connecting topologies are continually changing under the pressure of changes deployed frequently with CI/CD [continuous integration/continuous delivery]. AI is needed to make sense of the self-descriptive data, including logs, event records, and metrics generated by modern IT systems; to anticipate problems and outages; and to support execution of responses to the issues revealed by the signals the AI technology has interpreted.”
Moogsoft’s AI performs several functions in sequence. It selects high-information data sets from within a background of noise aggregated from log files and other operating systems. Then it discovers correlational patterns in those high-information data sets and determines which of the correlations are causal. Finally, it assists in the robotic execution of a response.
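The first step in that sequence, picking high-information data out of background noise, can be sketched with a simple information-theoretic score. This is only an assumption about how such a filter might work, not a description of Moogsoft's algorithm: each event type is scored by its surprisal, -log2(p), so rare events score high and routine chatter scores low. The function names, the threshold, and the event stream are hypothetical, and the sketch does not attempt the later correlation and causality steps.

```python
import math
from collections import Counter


def surprisal_scores(events):
    """Return {event_type: -log2(p)}, where p is the type's relative frequency."""
    counts = Counter(events)
    total = len(events)
    return {etype: -math.log2(n / total) for etype, n in counts.items()}


def high_information(events, min_bits=4.0):
    """Keep event types whose surprisal meets the threshold, rarest first."""
    scores = surprisal_scores(events)
    kept = [(etype, bits) for etype, bits in scores.items() if bits >= min_bits]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical stream: mostly heartbeats, a few logins, two rare failures
events = ["heartbeat"] * 56 + ["login"] * 6 + ["disk_full"] + ["oom_kill"]
selected = high_information(events)
for etype, bits in selected:
    print(f"{etype}: {bits:.1f} bits")
```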
Moogsoft states that AIops can have a direct impact on revenue and brand reputation. When an intelligent response is executed automatically, it shortens the MTTR (mean time to recovery) of incidents that impact customers and employees.
OpsRamp helps IT meet service-level objectives
Neil Pearson, OpsRamp’s principal product manager for event management and automation, states that the automation in AIops helps IT teams perform better at their jobs, and that’s good for business. “AIops is the application of various AI technologies, including ML, deep learning, and robotic process automation (RPA), to automate complex, manually intensive, repetitive tasks. It typically involves ingesting a large amount of data from different sources and different formats. We focus on detecting anomalies, predicting and preventing repeat alerts and incidents from the initial discovery of resources through to resolution. It’s about making people measurably better at their jobs and helping companies get better at their business.”
OpsRamp ingests and processes large volumes of data from multiple sources, such as metrics, logs, network packets, and traces, to identify the needle in the haystack that is the root cause of an issue. It uses deep learning and natural language processing algorithms to remove the noise and assist operations by recommending how to resolve issues and keep them from recurring. OpsRamp helps IT design auto-response policies that reduce manual interventions and help prioritize problems based on business impact.
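As a rough illustration of an auto-response policy that prioritizes by business impact, consider the sketch below. Everything in it is hypothetical: the service weights, the symptom names, the action names, and the scoring formula are assumptions for the example and are not taken from OpsRamp's product.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    service: str
    symptom: str
    users_affected: int


# Hypothetical weights: how much each service matters to the business
SERVICE_WEIGHT = {"checkout": 10, "search": 5, "internal-wiki": 1}

# Hypothetical policy table: symptom -> name of an automated action
AUTO_RESPONSES = {"disk_full": "purge_temp_files", "service_down": "restart_service"}


def impact(incident):
    """Business impact score: service weight times affected users."""
    return SERVICE_WEIGHT.get(incident.service, 1) * incident.users_affected


def triage(incidents):
    """Order incidents by impact and attach an auto-response, or escalate."""
    ranked = sorted(incidents, key=impact, reverse=True)
    return [(i.service, AUTO_RESPONSES.get(i.symptom, "escalate_to_human")) for i in ranked]


incidents = [
    Incident("internal-wiki", "service_down", 200),
    Incident("checkout", "disk_full", 50),
    Incident("search", "slow_queries", 60),
]
plan = triage(incidents)
for service, action in plan:
    print(f"{service}: {action}")
```

The checkout incident affects the fewest users but ranks first because of its weight, and the symptom with no matching policy falls through to a human.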
Resolve fuels agile, autonomous IT operations
Vijay Kurkal, CEO of Resolve, believes a “self-healing IT” can become a reality using AI and automation to close the loop between problem and resolution. “AIops tools quickly identify existing or potential performance issues, spot anomalies, pinpoint the root cause of problems, and even predict future issues to trigger proactive fixes before the business is impacted. By coupling insights from AI with automation, organizations can maximize the value and potential of these technologies and create a closed loop of discovery, analysis, detection, prediction, and automation, thus bringing organizations closer to the ever-elusive self-healing IT.”
Resolve can also automatically discover applications and infrastructure, generate rich topology maps, and identify dependencies between business-critical applications and underlying infrastructure. Understanding these relationships makes troubleshooting easier and facilitates overall IT management, offering a single pane of glass into complex, cross-domain environments. This data can be automatically pushed to the CMDB (configuration management database) in near real time, ensuring accurate inventory information and creating a strong ITSM (IT service management) foundation.
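Why dependency data makes troubleshooting easier is simple to show with a toy topology. The sketch below is an assumption-laden illustration, not Resolve's discovery engine: the component names and the map itself are invented, and the function simply walks the dependency graph in reverse to list everything affected when one piece of infrastructure fails.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on
DEPENDS_ON = {
    "storefront": ["web-tier", "payments-api"],
    "payments-api": ["orders-db"],
    "web-tier": ["cache"],
    "reporting": ["orders-db"],
    "orders-db": ["san-01"],
    "cache": [],
    "san-01": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its dependents
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first search upward from the failed component
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("san-01", DEPENDS_ON)))
print(sorted(impacted_by("cache", DEPENDS_ON)))
```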
Resolve Insights utilizes many ML algorithms, including anomaly detection, event pattern identification, and predictive algorithms. The goal is to enhance the overall customer and employee experience by improving the performance of critical apps and infrastructure, maximizing uptime, and providing insights that inform optimization efforts.
Splunk helps IT manage complex operating environments
Andi Mann, chief technology advocate at Splunk, is also a highly regarded devops leader and author of books on innovation and IT operations. He suggests that IT must progress beyond a legacy operating model designed to support monolithic applications to one focused on being data driven, embracing automation, and committing to service delivery practices.
“As modern approaches accelerate technology adoption and engagement in a global, 24/7, electronic marketplace, the complexity of modern systems is too high for humans to effectively manage, and ‘old-school’ IT operations techniques designed for legacy monoliths fail to keep up. It is only with a data-driven approach, applying advanced algorithmic processing, machine learning, artificial intelligence, response automation, and workflow orchestration—aka AIops—that service delivery teams can cope with these new levels of complexity. Splunk addresses these challenges with AIops, providing a data-driven approach to ITops, observability, and security to ensure the performance, availability, functionality, stability, and impact that their business—and their customers—demand.”
Splunk takes a “white box” approach to machine learning and comes prepopulated with 30 algorithms for anomaly detection, classification, clustering, cross-validation, feature extraction, preprocessing, regression, and time series analysis. It also has more than 300 open source Python algorithms from the scikit-learn, pandas, statsmodels, NumPy, and SciPy libraries.
AIops can be a big step forward for all IT teams
Mann reminds me of my old days working with IT operations teams on maintaining high availability and performance of Web applications. When customers and employees escalated issues, we knew we had to get system and application monitors in place. When there were repeat incident types, we developed playbooks and standard operating procedures to resolve them. Where possible we built scripts to restart Web servers, clean out database tablespaces, and archive old files from primary storage systems.
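A playbook of that era often boiled down to a loop like the one sketched here: check health, restart, check again, and escalate if nothing helps. This is a hypothetical reconstruction for illustration, not a script from any team mentioned above. The health check and restart actions are passed in as functions, and the example uses simulated ones so it runs anywhere.

```python
import time


def run_playbook(is_healthy, restart, max_attempts=3, wait_seconds=0.0):
    """Restart a service until its health check passes or attempts run out.

    Returns the number of restarts performed. Raises RuntimeError if the
    service is still unhealthy after max_attempts restarts.
    """
    attempts = 0
    while not is_healthy():
        if attempts >= max_attempts:
            raise RuntimeError("still unhealthy after restarts; escalate")
        restart()
        attempts += 1
        time.sleep(wait_seconds)  # give the service time to come back up
    return attempts


# Simulated service that recovers after two restarts
state = {"restarts": 0}


def fake_health_check():
    return state["restarts"] >= 2


def fake_restart():
    state["restarts"] += 1


restarts_needed = run_playbook(fake_health_check, fake_restart)
print(restarts_needed)
```

In a real environment the two callables would wrap an HTTP health probe and a service manager command, and the escalation path would open a ticket or page someone.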
Today’s scale, complexity, and service expectations all require IT to accelerate these disciplines, and that’s exactly what AIops solutions address. AIops platforms centralize and cleanse operational data, leverage machine learning to pinpoint different problems, and provide a framework to automate resolutions. The end goal is to provide better experiences, reduce toil, and free IT to pursue business-impacting projects and innovations.