Engineering Application of Data Science
Reading some of the recent posts and discussions (some on LinkedIn and some on the SPE discussion sites) regarding data driven analytics and the role of geology and physics when data science is applied to Petroleum Engineering related problems inspired me to write this piece. What follows is the result of 25 years of research and development in the application of data driven solutions and artificial intelligence to drilling, reservoir, and production engineering. This is not aimed at criticizing anyone or any school of thought. The objective is to share my views and experience rather than engage in a debate.
Of course, I have never hesitated to engage in meaningful debates (especially on this topic), but I have learned that when debates are not face to face, and are conducted through social/professional networks, they do not serve their purpose. I have observed that the discussions on these sites, on this specific topic, are a mixed bag. They include some thoughtful comments and insights that I have enjoyed and learned from, and some superficial statements that do not deserve replies, including those that use only a few words as a soft insult with no substance and others that use unprofessional language.
When it comes to “Engineering Application of Data Science”, there exists a spectrum. On the ends of this spectrum there are those that display a "religious" view toward physics and engineering, or toward statistics. The idea they subscribe to, whether it be statistics or physics, is more a "belief" that should not and must not be challenged, rather than a never-ending quest to find the best solution. I have found myself in continuous disagreement with both these camps.
On one end of the spectrum, the “true believers” in physics and geology that I usually call “traditionalists” cannot even imagine that there might be other ways of solving problems without first understanding the underlying physics, formulating the governing equations, and then using math to solve those equations. The human brain never uses this path to find solutions (think of the simple problem of catching a ball that is thrown to you). Artificial Intelligence (AI) is an attempt to mimic the method that the brain uses to solve problems. Many physics-based problems are now solved this way using AI. Driverless cars are examples that probably everyone can relate to.
On the other side of the spectrum, there are the “true believers” in statistics. They “believe” that “all” problems can be solved with data. Of course this may not be too far from the truth (in a hypothetical world), but the problem is that these folks cannot see that there is a major difference between how problems are solved using statistical methods versus machine learning. They cannot digest the difference between stochastic data models and data generated by a well-defined and sometimes well-understood physical phenomenon. Of course the more advanced ones in this group call machine learning and AI statistical methods. I do not have a problem with that, as long as they do not try to completely ignore the physics (or geology) or try to fit a pre-determined probability distribution, or a pre-determined equation (or set of equations, whether linear or pre-defined non-linear, aka multivariate analysis), to every dataset that they see, regardless of its origin. To them, whether the data is from social media or a numerical simulation makes no difference. Domain expertise is not very relevant to this group, although in our industry they have learned not to say that out loud, and they sometimes even hire some domain experts in their organizations, just to avert criticism or to be able to use the right abbreviations and language.
I disagree with both sides of this spectrum. To me, the tool used to accomplish the objective is not relevant; rather, the key is to find a proper and practical solution. My definition of a “proper” and “practical” solution relates, first and foremost, to “Prediction”. In other words, the resulting solution (model) must be able to generate accurate predictions. Once this condition is met, then and only then, it is important for the solution (model) to be able to provide information regarding the nature of the process. If the predictions are not accurate, I cannot trust the information provided, regardless of the fact that it may include “internal consistency”.
As you may have noticed, “Data Science” has become a buzzword and a marketing ploy, and it has started to lose some of its original attraction as a scientific approach to finding proper and practical solutions. Overnight, an army of “data scientists” started emerging from every corner of the world. So much so that it is now being translated to the ability to open an Excel file. Here is how I see this discipline.
Since its introduction as a discipline in the mid-90s, “Data Science” has been used as a synonym for applied statistics. Today, Data Science is used in multiple disciplines and is enjoying immense popularity. What has been causing confusion is the essence of Data Science as it is applied to physics-based versus non-physics-based disciplines. Such distinctions surface once Data Science is applied to industrial applications and starts moving above and beyond simple academic problems.
So what is the difference between Data Science as it is applied to physics-based versus non-physics-based disciplines? When Data Science is applied to non-physics-based problems, it is merely applied statistics. Application of Data Science in social networks and social media, consumer relations, demographics, or politics (some may even include medical and/or pharmaceutical sciences in this list) takes a purely statistical form, since there are no sets of governing partial differential (or other mathematical) equations that have been developed to model human behavior or the response of human biology to drugs. In such cases (non-physics-based areas), the relationship between correlation and causation cannot be resolved using physical experiments. Correlations, as long as they are not absurd, are usually justified or explained by scientists and statisticians using psychological, sociological, or biological reasoning.
On the other hand, when Data Science is applied to physics-based problems such as self-driving cars, or multi-phase fluid flow in reactors (CFD) or in porous media (reservoir simulation), it is a completely different story. The interactions between parameters that are of interest to physics-based problem solving, despite their complex nature, have been understood and modeled by scientists and engineers for decades. Therefore, treating the data that is generated from such phenomena (regardless of whether it is measured by sensors or generated by simulation) as just numbers that need to be processed in order to learn their interactions is a gross mistreatment and over-simplification of the problem, and hardly ever generates useful results. That is why many such attempts have, at best, resulted in unattractive and mediocre outcomes. So much so that many engineers (and scientists) have concluded that Data Science has few serious applications in industrial and engineering disciplines.
The question may arise that if the interactions between parameters that are of interest to engineers and scientists have been understood and modeled for decades, then how could Data Science contribute to industrial and engineering problems? The answer is: “a considerable (and sometimes game changing and transformational) increase in the efficiency (and even accuracy) of the problem solving”. So much so that it may change a solution from an academic exercise into a real-life solution. For example, many of the governing equations that can be solved to build and control a driverless car are well known. However, solving this complex set of non-linear equations and incorporating them into a real-time process that actually controls and drives a car in the street is beyond the capabilities of any computer today (or in the foreseeable future). Data driven analytics and machine learning contribute significantly to accomplishing such tasks.
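The efficiency argument can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: `slow_solver` is a hypothetical stand-in for an expensive physics-based solve (here a Newton iteration on y + exp(y) = x), and the "data driven" model is nothing more than a table of solver runs with linear interpolation. It is not how a driverless car or a reservoir model is actually built; it only shows a fast approximation being trained on solver outputs and then judged by its predictions at points it has not seen.

```python
import bisect
import math

def slow_solver(x, tol=1e-12):
    # Hypothetical stand-in for an expensive physics solve:
    # find y such that y + exp(y) = x, using Newton iteration.
    y = 0.0
    for _ in range(100):
        step = (y + math.exp(y) - x) / (1.0 + math.exp(y))
        y -= step
        if abs(step) < tol:
            break
    return y

def make_surrogate(xs, ys):
    # Build a fast approximation from stored (input, output) pairs
    # by piecewise-linear interpolation; xs must be sorted ascending.
    def predict(x):
        i = bisect.bisect_right(xs, x) - 1
        i = max(0, min(i, len(xs) - 2))  # clamp to a valid interval
        t = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ys[i] + t * (ys[i + 1] - ys[i])
    return predict

# "Training" data: 41 solver runs on a regular grid over [1, 5].
train_x = [1.0 + 0.1 * i for i in range(41)]
train_y = [slow_solver(x) for x in train_x]
surrogate = make_surrogate(train_x, train_y)

# Blind check: compare against the solver at points not used for training.
test_x = [1.05 + 0.1 * i for i in range(40)]
max_err = max(abs(surrogate(x) - slow_solver(x)) for x in test_x)
print("largest error at unseen points:", max_err)
```

The check at unseen points reflects the criterion stated earlier in this piece: the approximation earns trust through the accuracy of its predictions, and only after that is it worth asking what it says about the process.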
There is a flourishing future for Data Science as the new generation of engineers and scientists are exposed to it and start using it in their everyday life. The way to clarify and distinguish the application of Data Science to physics-based disciplines, and to demonstrate its useful and game changing applications in engineering and industry, is to develop a new generation of engineers and scientists that are well versed in the application of Data Science. In other words, the objective should be to train and develop “engineers” that understand and are capable of efficiently applying Data Science to engineering problem solving.
Thomas, I know what you are saying, but I disagree. Statistics has always been at the centre of physics. Quantum theory is all about statistics. Monte Carlo was invented by physicists. Hamiltonian Markov Chain Monte Carlo was invented by physicists. Bose-Einstein statistics is a cornerstone. All many-body theory is statistical. Statistical physics is a mandatory part of an undergraduate degree. I myself also studied stochastic processes as a physics undergraduate. Where I would agree is that different academic silos do not talk to each other enough.
Until not too long ago, if you asked what was the opposite science of physics the answer was statistics, and vice versa. Thankfully, things have changed, and a recent example is the language used by CERN to announce the results of the Higgs boson experiments. A purely predictive approach, one that ignores the underlying prior knowledge, be it physics, medicine, chemistry or whatever, is unlikely to produce meaningful results unless there is no underlying state of knowledge. Even then, sequential experimentation, when possible, is much preferred for early-stage learning, especially when the cost of an improper decision is high (e.g. self-driving cars, a new drug). It is generally uncommon (and unreasonable, in my view) for a physicist or an engineer or a physician to possess deep statistical and ML knowledge. Collaboration is key. What I don't see is this being a problem limited to physics. In my past life in health research, purely predictive ML type models that ignore the biology and clinical manifestation have been considered a bit of a joke and never lived up to their expectations. When paired with substantive domain knowledge, results are quite different, but other problems emerge (excellent publications by Peter Austin).
Excellent overview. Another way of looking at it is to think of error. Statisticians think of any measurement as having an error. If you are doing a sociological survey, the answers from the same person may change on two different days. However, when you are dealing with a simulation model, the answer is identical every time you run the simulation with the same parameter inputs. So the whole understanding of 'error' (or variance) has changed. We need to think of model error, and measurement error, in different ways. I know about seeds and the chaotic or noisy behaviour of geomodelling, but the principle is the same. It can be quite hard for statisticians to digest 'there is no error'. When you regress against simulation model outputs, there is no error in the conventional statistical sense.
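A minimal sketch of this point, under stated assumptions: `simulator` is a hypothetical deterministic model (a made-up rate equation, not any real reservoir simulator) and `noisy_measurement` is a hypothetical sensor that adds Gaussian noise to it. Repeating the simulator at identical inputs gives zero variance, while repeating the measurement does not. Fitting a straight line to the simulator outputs still leaves residuals, but they are reproducible model misfit rather than random error.

```python
import random
import statistics

def simulator(pressure_drop, permeability):
    # Hypothetical deterministic model: same inputs always give the same output.
    return permeability * pressure_drop / (1.0 + 0.01 * pressure_drop)

def noisy_measurement(pressure_drop, permeability, rng, sigma=0.5):
    # Hypothetical sensor: the true response plus random measurement error.
    return simulator(pressure_drop, permeability) + rng.gauss(0.0, sigma)

def replicate_variance(run, n=50):
    # Variance across n repeated runs at identical inputs.
    return statistics.pvariance([run() for _ in range(n)])

def linear_fit_residuals(xs, ys):
    # Ordinary least squares for y = a + b*x, returning the residuals.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return [y - (a + b * x) for x, y in zip(xs, ys)]

rng = random.Random(0)
sim_var = replicate_variance(lambda: simulator(100.0, 2.0))
meas_var = replicate_variance(lambda: noisy_measurement(100.0, 2.0, rng))

# Regress against simulator outputs twice: the misfit is identical each time.
drops = [10.0 * i for i in range(1, 21)]
residuals = linear_fit_residuals(drops, [simulator(dp, 2.0) for dp in drops])
residuals_again = linear_fit_residuals(drops, [simulator(dp, 2.0) for dp in drops])

print("replicate variance, simulator:  ", sim_var)
print("replicate variance, measurement:", meas_var)
print("largest misfit of linear model: ", max(abs(r) for r in residuals))
print("misfit reproducible:", residuals == residuals_again)
```

The residuals from the straight-line fit are non-zero because the line is the wrong functional form, not because the data are noisy, which is why the usual statistical reading of residuals as random error does not carry over.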
I totally agree that today's application of machine learning in engineering is limited to accelerating the computationally expensive parts (e.g. predicting field updates at each time step in CFD calculations). However, physics-aware machine learning is slowly making progress in science and engineering and needs more people with a knowledge of machine learning and an interest in science or engineering to push the field forward.
Thanks Dr Mohaghegh for your article. It clarifies and refocuses on what this discipline really is for those engineers transitioning into Data Science when solving physics-governed problems. The question I have is: is it true that in engineering and industrial applications we step back from caring excessively about the physics underneath and focus more on what is presented as a reality? For engineers, this could be a bit challenging, but I believe engineers can provide a more 'realistic' and 'to the point' approach when selecting 'predictors' with predictive power to resolve the problem at hand, and that is where I think a Data Scientist for Oil & Gas makes a difference, because of their physics-oriented mindset. I agree with you!!!