Data Analytics, Machine Learning and Root Cause Analysis — A Practical Path to Continuous Improvement
Continuous improvement is a key theme in many operating plants, and considerable money is spent to understand the key gaps in a facility and to optimize maintenance and operational goals, the ultimate intent being to produce a commodity in the most cost-effective, safe, and reliable way. Because the customer is ultimately paying for that commodity, the plant must keep operating and other secondary costs in check to maximize profit and hold expenses and waste to a minimum.
In any operating or manufacturing environment, data analytics and machine learning play key roles; used correctly, they can help make sense of the large volumes of data that flow through a plant over the course of its lifecycle. At the very least, a good data analyst or engineer with a sound understanding of plant machinery and processes can draw a conclusive assessment and recommendation from the patterns discovered when big data is processed. It is extremely difficult and inefficient for the human mind to see patterns and correlations, and to recognize root causes, when large amounts of data are generated regularly by the various systems in a running plant. Often, such data is not looked at carefully, leading to inefficiencies and lost opportunities to optimize and improve these continuous and complex operating systems and units. Several free programming tools on the market give the interested engineer the capability to process continuous and complex data streams.
This article discusses a few simple yet valuable tools that any engineer can use to take a quick look at the complex data continuously generated by complex systems, and to make sense of it before experimenting with machine learning algorithms. Discoveries made through such data can point engineers toward corrective action without a heavily trial-and-error approach, converging quickly on a root cause assessment and thereby saving the cost and waste of repeat failures. The reader should bear in mind that data analytics and machine learning are simply additional tools in the arsenal available to engineering, alongside reliability-centered maintenance and condition-monitoring technologies such as vibration monitoring, thermal scanning, and lube oil condition monitoring. Their added advantage is that they are growing technologies, with many people engaged in research and development to further their applications in this space. Operating industries that have relied on traditional technologies for decades can benefit immensely from applying new strategies to old, recurring, and complex challenges. As the renowned physicist Albert Einstein once stated, "We cannot solve our problems with the same thinking we used when we created them," a motto that applies to the world of continuous improvement if we are to gain an advantage in an increasingly competitive world.
At this stage of development, data analytics and machine learning are not meant to completely substitute for these technologies but rather to complement them, helping engineers draw conclusions, troubleshoot issues, assess data, and make recommendations more efficiently and comprehensively, without wasting time analyzing big data manually. The following graphs and visualizations can be generated easily through Python, using any available interpreter and IDE.
The first step in any machine learning exercise is to visualize the data, where possible, and assess it to develop an understanding of the dataset.
Heat Map
A heat map provides a visual representation of the linear correlation between the variables in a dataset. The visualization gives a quick indication of what is related to what, and by what magnitude. Values approaching 1 indicate perfect positive linear correlation, whereas values approaching -1 indicate an exact inverse correlation. As can be seen, the diagonal is all 1s, showing that each feature on the x-axis is perfectly correlated with itself on the y-axis. A heat map is a good way to assess, from a weighted standpoint, the initial set of features that can be used in a machine learning algorithm. Note, however, that a heat map shows only pairwise correlations; it does not capture the combined effect of several variables on a specific target variable whose performance we are trying to predict. We will discuss other tools that can be used to see the correlation and importance of various variables relative to the target variable.
There are various ways to plot a heat map, such as using the Seaborn library within Python, which can easily take the required variables and present a colored map to represent the correlations visually (see Fig. 1).
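As a minimal sketch of this step, the snippet below computes a Pearson correlation matrix with pandas and renders it with Seaborn. The sensor names and the synthetic readings are illustrative assumptions, not values from the article's dataset.

```python
# Correlation heat map sketch: synthetic "plant" data stands in for real
# historian exports (column names are hypothetical).
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(42)
n = 500
suction_pressure = rng.normal(100, 5, n)
flow_rate = 2.0 * suction_pressure + rng.normal(0, 3, n)  # strongly correlated
bearing_temp = rng.normal(60, 2, n)                       # independent of both

df = pd.DataFrame({
    "suction_pressure": suction_pressure,
    "flow_rate": flow_rate,
    "bearing_temp": bearing_temp,
})

corr = df.corr()  # Pearson correlation matrix; the diagonal is always 1.0
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heat map")
plt.tight_layout()
plt.savefig("heatmap.png")
```

With real data, `df` would simply be loaded from the plant historian; the rest of the recipe is unchanged.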
Parallel Coordinates
Parallel coordinates show how a target variable varies with other, possibly dependent or independent, variables plotted on parallel axes. In Fig. 2, although the data has been censored for privacy reasons, it can be observed from the color of the inlet flow that the highest throughput (the red zone) is achieved at certain percentages and values on the individual parallel axes. A visual understanding of data in this format can help the data scientist or engineer develop a concise operating map for various pieces of machinery or process conditions. Furthermore, real-time trending of such data can quickly pinpoint operating excursions, where the system or machine begins to operate outside its accepted envelope. Information such as this helps ensure that systems operate safely, reliably, and profitably, leading to cost savings. One key advantage of visualization in this format is the quick understanding of very big data, something the human mind cannot easily comprehend quickly and efficiently. Such a representation can reveal areas hidden within the data, including, but not limited to, the variables causing a particular issue in a root cause process aimed at preventing recurring problems.
The reader is encouraged to start with Plotly, a library available in Python that can generate good parallel plots with multiple features, for a concise and valuable visual representation of data. As with the heat map, the data scientist or engineer can assess which features have the most relevant bearing on the problem at hand and use them within the machine learning algorithm to fine-tune the model.
In an operating plant, one can use such a plot to examine compositions, flow rates, pressures, and temperatures, and how these affect optimal versus not-so-optimal operating conditions. Plots such as these can be used with rotating equipment to study vibration and bad-actor turbomachinery. A good example would be correlating various process variables, motor amps, and other conditions with what causes a spike in vibration signatures leading to equipment trips. This provides a quick and easy way to assess data, compared with spending heavily on advanced machine learning algorithms or sophisticated detailed analysis without first understanding the initial dataset and what it indicates about the plant. Many major companies are allocating and spending extensive amounts of money to implement data analytics and machine learning and integrate them into their systems; while this is beneficial, it is equally important to first understand what we are trying to achieve by implementing such practices and where the biggest bang for the buck can be realized in an operating context.
Visualization of Data in 3D, 4D, 5D, and Beyond
Big data visualization in multiple dimensions can provide a good understanding of how the data changes and transforms based on a multitude of variables. This helps in understanding the impact of various variables on a particular variable, as well as in determining transition zones. The analyst can see how a particular variable changes with time and under the influence of other variables. For instance, in Fig. 3, the flow rate is depicted as changing with concentration. Color and hue can provide additional insight; here, a distinct color was chosen to show where the transition occurs between the blue zone and the red zone. One can instead use a slowly changing hue to show a more gradual transition between variables. This can help discern ranges of transition and deepen understanding of the data. Another example is using such a plot to understand pump performance and the transition into zones away from the best efficiency point (BEP). The reader can apply this tool in various applications to understand what the data brings to the table. The plots can be generated in as many dimensions as helps to assess, analyze, and present the data in a meaningful form to drive decisions and solutions.
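The idea of encoding a fourth dimension as color on a 3D scatter can be sketched as follows; the variable names and the synthetic flow relationship are illustrative assumptions, not the censored data behind Fig. 3.

```python
# 3D scatter sketch: three process variables on the axes, a fourth
# (flow rate) encoded as color, mimicking the transition-zone view.
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 300
temp = rng.uniform(50, 100, n)
pressure = rng.uniform(5, 25, n)
concentration = rng.uniform(0.1, 0.9, n)
# Hypothetical target driven mostly by concentration
flow_rate = 10 * concentration + 0.2 * temp + rng.normal(0, 1, n)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(temp, pressure, concentration, c=flow_rate, cmap="coolwarm")
ax.set_xlabel("temperature")
ax.set_ylabel("pressure")
ax.set_zlabel("concentration")
fig.colorbar(sc, label="flow rate")  # the fourth dimension as hue
plt.savefig("scatter_4d.png")
```

Swapping `cmap="coolwarm"` for a perceptually uniform map such as `"viridis"` gives the slowly changing hue mentioned above, rather than a sharp blue-to-red break.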
Machine Learning
The preceding part of this paper covered some of the ways in which a data scientist can visualize data and make sense of what it indicates. This can help uncover aspects of the data not visible to the human eye or comprehensible by the human brain, especially when the dataset is very large and continuous, as is often the case in operations. Several additional visualization and data analytics tools, including statistical tools, can be used for better understanding.
In the second part of this paper, I would like to touch briefly on machine learning; this is by no means comprehensive coverage. Machine learning is a vast, complex, dynamically evolving, and advanced field that in many instances must be customized to the data and the final outcome one is trying to achieve. This makes it very interesting and capable of producing results that suit and optimize specific systems to a very good degree. Applying machine learning strategies hinges on filtering and fine-tuning the data, then defining a model, training it, testing it, and finally reviewing the outcomes of its predictive assessment. There are various models and algorithms, which the reader is encouraged to research, each with distinct advantages and disadvantages depending on the data or system being analyzed. For instance, the random forest method provides a quick way to assess which features matter in a given data stream, scoring how important each feature is to the overall prediction. Random forest is a good supervised learning algorithm that, with hyperparameter tuning, can run quickly and efficiently on big data. The support vector machine (SVM) is another good supervised learning technique: it can be applied to both classification and regression problems and used to understand various plant issues.
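The random forest workflow described above, from data to trained model to feature-importance scores, can be sketched with scikit-learn as follows. The feature names and the vibration relationship are hypothetical, constructed so that one feature (motor amps) genuinely dominates.

```python
# Random forest feature-importance sketch on synthetic "vibration" data
# (feature names and the target relationship are illustrative assumptions).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
X = pd.DataFrame({
    "motor_amps": rng.normal(50, 5, n),
    "suction_pressure": rng.normal(100, 10, n),
    "ambient_temp": rng.normal(25, 3, n),  # deliberately irrelevant
})
# Hypothetical target: vibration driven mostly by motor amps
y = 0.8 * X["motor_amps"] + 0.1 * X["suction_pressure"] + rng.normal(0, 1, n)

# Hold out data, train, then score importances and predictive accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

importances = dict(zip(X.columns, model.feature_importances_))
print(sorted(importances.items(), key=lambda kv: -kv[1]))
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```

On real plant data, the importance ranking is the quick triage step: it tells the engineer which handful of the hundreds of historian tags deserve a closer look in the root cause process.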
There are various algorithms and machine learning strategies at the disposal of the data scientist or plant engineer to help them make the best sense of data and make decisions that keep a plant running safely, reliably, and cost-effectively. Engineers are encouraged to use these algorithms and tools to dig deeper into causes and eliminate potential waste in a running plant. Constant exposure to machine learning and data analysis in a plant or industrial engineering setting can provide immense benefits as the technology evolves, helping organizations rely on internal engineering expertise to solve complex issues.