Blog

Artificial Intelligence: Coming to the Rescue of ITOps

Mckinsey Global Institute Report of 2018 states that Artificial Intelligence (AI) has the potential to create annual value of $3.5 billion -$5.8 billion across different industry sectors. Today, AI in Finance and IT alone accounts for about $100 billion and hence it is becoming quite the game changer in the IT world.

 With the onset of cloud adoption, the world of IT DevOps has changed dramatically. The focus of IT Ops is changing to an integrated, service-centric approach that maximizes business services availability. AI can help IT Ops in early detection of outages, potential Root Cause prediction, finding systems and nodes which are susceptible to outages, average resolution time and more. This article highlights a few use cases where AI can be integrated with IT Ops, simplifying day-to-day operations and making remediation more robust.

1.)    Predictive analytics of outages:  False positive causes threat alert fatigue for IT Ops teams. The survey indicates that about 52% of security alerts are generally false positives. This puts a lot of pressure on the teams as they have to review each of these alerts manually. In such a scenario, deep neural networks can predict whether an alert will result into outages.

Alerts                                              Layers                                              Yes/No

Feed Forward back propagation with 2 hidden layers should yield good results in terms of predicting outages as illustrated above. All alert types within a stipulated time can act as inputs and outages would be the output. Historical data should be used to train the model. Every enterprise has its own fault line and weakness, and it is only through historical data that latent features are surfaced, hence every enterprise should build their own customized model as “one size fits all” model has a higher likelihood of not delivering expected outcomes.

 The alternate method is a logistic regression where all “alert types” are input variables and “binary outages” would be the output.

Logistic regression measures the relationship between the categorical dependent variables and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Thus, it treats the same set of problems as probit regression using similar techniques, with the latter using a cumulative normal distribution curve instead. 

2.)    Root Cause classification and prediction:  This is a two-step process. In the first step, root cause classification is done based on key word search. From free flow Root Cause Analysis fields, Natural Language Processing (NLP) is used to extract key values and classify into predefined root causes. This can be either supervised or unsupervised.

In the second step, Random Forest for Multi-Class Neural Network can be used to predict root causes while other attributes act as input. Based on the data volume and the datatype, one can choose the right classification model. In general, Random Forest has better accuracy, but it needs structured data and right labeling and it is less fault tolerant to data quality. While Multi-Class Neural Network will need a large volume of data to train, it is more fault tolerant but slightly less accurate.

 

3.)    Prediction of average time to close a ticket:  A simple weighted average formula can be used to predict time taken for ticket resolution.

Avg time (t) = (a1.T1 + a2.T2+ a3.T3 )/(count of T1+T2+T3)

Where T1 are ticket types.

Other attributes can be used to segment the ticket into right cohorts to make it more predictable. This helps in better resource planning and utilization. Weightage of features can be done heuristically or empirically.

4.)    Unusual Load on System: Simple anomaly detection algorithms can inform whether the system is going through a normal load or it has high variance. A high variance / deviation from average on time series can inform the unusual activities or resources that are not freeing up. However, the algorithm should take care of seasonality as a system load is a function of time and season.

Given the above scenarios it is obvious that AI has a tremendous opportunity to serve IT operations. It can be used for several IT Ops including prediction, event correlation, detection of unusual loads on system (e.g. cyber-attack) and remediation based on root cause analysis.  

About the Author:

Vivek Singh is the Senior Director at Relevance Lab and has around 22 years of IT experience in several large enterprises and startups. He is a data architect, an open source evangelist and a chief contributor of Open Source Data Quality project. He is the author of a novel “The Reverse Journey”.

(The article was originally published in Devops.com and can be read here: https://devops.com/artificial-intelligence-coming-to-the-rescue-of-itops/ 

Author


Avatar