Decision making in data generation for regression models

Ivan Marroquin, Ph.D.

Published Mar 18, 2018

Regression analysis is concerned with building a model that represents the relationship between an observation and a set of predictive variables. There are several types of regression techniques, ranging from the well-known multiple linear regression to those derived from machine learning (e.g., neural networks, Naïve Bayes, support vector machines, random forests, etc.). In fact, one fascinating aspect of regression analysis is that you can even cook up your own regression techniques. However, the fun and creativity do not begin here!

The first decision any data scientist has to make is what kind of information he or she needs. At this stage, a data scientist starts evaluating the opportunities and limitations of a data set, or ascertaining its suitability for quantitative analysis. It is then that the audacity and innovation of the data scientist emerge, as there is no limit to how data can be gathered, generated, transformed, or tweaked to suit the needs of the regression model.

For this blog post, the intent is to provide an end-to-end case study that highlights how you could work around a lack of direct measurements. I approached this issue from the following angle. First, I decided to build the regression model using synthetic data. Second, I tested the performance of the model on real data. Of course, there are advantages and disadvantages regarding how, when, or why to use synthetic data. In my opinion, the inclusion of a synthetic data set should not be an issue as long as it is relevant to the analysis and we understand its limitations.

So, this study is about estimating the depth of an orebody deposit from very low-frequency electromagnetic (VLF-EM) data. Knowledge of this depth can be crucial for mining companies because it puts them in a better position to judge the feasibility of extracting ore from a specific area. An orebody (Figure 1a) is a natural occurrence of, or an aggregate of, minerals in the form of lode, vein, seam, or placer deposits. From these deposits a valuable metal can be mined. VLF-EM is a geophysical method used to delineate relatively shallow subsurface conducting geologic features in highly resistive surroundings. An orebody is indicated by the presence of a cross-over anomaly along the VLF-EM profile (Figure 1b).

Figure 1. (a) Diagram showing the presence of an orebody deposit embedded in a more resistive host rock. (b) An idealized VLF-EM profile measured perpendicular to the orebody deposit. The portion of the profile in red denotes the extent of the cross-over anomaly.

To generate synthetic data that match, as much as possible, the characteristics of a VLF-EM anomaly, I opted for a simple mathematical model based on an infinite current filament within a more resistive environment (Figure 2). With this model, I assumed that there is a current channel located in the upper part of the orebody deposit that produces the measured VLF-EM anomaly. In the mathematical model, the primary field (Hpx) interacts with the current filament and generates a secondary electromagnetic field (Hs). By computing the ratio of the vertical secondary magnetic field (Hsz) to the sum of the horizontal primary and secondary magnetic fields (Hpx + Hsx), the mathematical model is able to emulate the presence of an orebody deposit.
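To make the ratio Hsz / (Hpx + Hsx) more concrete, here is a minimal sketch of how such a profile could be computed. It is not the exact model behind the figures. It assumes the textbook magnetic field of an infinite straight line current, a primary field normalized to 1, and a hypothetical `strength` parameter (filament current divided by 2π, in the same normalized units) that stands in for whatever the resistivity contrast does to the induced current.

```python
def vlf_ratio(x, depth, strength=1.0):
    """Hsz / (Hpx + Hsx) at horizontal offset x (m) from the filament.

    Assumptions (not from the article): the filament runs perpendicular to
    the profile at `depth` metres (> 0) below the stations, Hpx = 1, and
    `strength` is a hypothetical, normalized filament current.
    """
    r2 = x * x + depth * depth      # squared distance from station to filament
    hsx = strength * depth / r2     # horizontal secondary field
    hsz = -strength * x / r2        # vertical secondary field
    return hsz / (1.0 + hsx)


def synthetic_profile(depth, strength=1.0, n_stations=41, spacing=12.5):
    """Profile sampled at evenly spaced stations centred on the filament."""
    centre = (n_stations - 1) // 2
    return [vlf_ratio((i - centre) * spacing, depth, strength)
            for i in range(n_stations)]


shallow = synthetic_profile(depth=5.0)
deep = synthetic_profile(depth=30.0)
print(max(shallow) > max(deep))     # a shallower filament gives a stronger anomaly
```

Under these assumptions the profile is antisymmetric about the filament position, which produces the maximum, zero crossing, and minimum of a cross-over anomaly.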

Figure 2. Schematic representation of the mathematical model used to generate synthetic VLF-EM data.

I got what I needed. On the one hand, I can investigate how the depth of the orebody deposit influences the strength of the cross-over anomaly. On the other hand, I can also incorporate the effect that the resistivity contrast between an orebody deposit and its host rock has on the shape of the cross-over anomaly.

I designed a synthetic data set with the following characteristics (Figure 3a). First, the data set was divided into five major sections. Each section represents a resistivity of the surrounding medium (i.e., 4,000, 8,000, 12,000, 16,000, and 20,000 ohm-m). Then, each major section was further divided into eleven sub-sections describing the resistivity contrast between the orebody and the host medium (i.e., 25, 23, 21, 19, 17, 15, 13, 11, 9, 7, and 5). For each resistivity contrast, the orebody depth varied from 5 to 30 m with a step of 5 m. In the end, the synthetic data set consisted of 330 VLF-EM profiles (5 × 11 × 6) with 41 measurements each (Figure 3b). Each of these profiles has the following common characteristics:

1) The distance between the stations is 12.5 m, and

2) The cross-over anomaly has three representative points: a maximum, an inflection, and a minimum.
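As a quick sanity check on the design, the parameter grid can be enumerated in a few lines. This is only a sketch of the bookkeeping. The variable names are mine, and centring the 41 stations on the anomaly is an assumption.

```python
from itertools import product

host_resistivities = [4000, 8000, 12000, 16000, 20000]   # ohm-m
contrasts = list(range(25, 3, -2))                        # 25, 23, ..., 5
depths = list(range(5, 31, 5))                            # 5, 10, ..., 30 m

# One entry per synthetic profile: (host resistivity, contrast, depth).
design = list(product(host_resistivities, contrasts, depths))

# 41 stations, 12.5 m apart, centred on the anomaly (assumed).
stations = [(i - 20) * 12.5 for i in range(41)]

print(len(design), len(stations))
```

Each entry in `design` would then be turned into one 41-point profile by the forward model.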

Figure 3. (a) Schematic representation of how the synthetic VLF-EM data was generated. (b) VLF-EM profile produced using 41 stations.

Now that the necessary information was generated, two groups of candidate predictors were computed:

1) To capture the influence of the resistivity contrast between the orebody deposit and the host rock on the shape of the VLF-EM anomaly: the standard deviation, the variance, and the peak-to-trough slope were measured.

2) To represent the effect of the orebody deposit depth on the strength of the VLF-EM anomaly: the first six components from the output of a principal component analysis.
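The two groups of predictors could be computed along these lines. This is a sketch with NumPy, not the original implementation. The use of sample statistics (`ddof=1`), the definition of the slope as the straight line from the maximum to the minimum, and running the principal component analysis across stations after mean-centring are all assumptions on my part. The toy profiles exist only to exercise the functions.

```python
import numpy as np


def shape_predictors(profiles, spacing=12.5):
    """Standard deviation, variance, and peak-to-trough slope per profile."""
    profiles = np.asarray(profiles, dtype=float)
    std = profiles.std(axis=1, ddof=1)
    var = profiles.var(axis=1, ddof=1)
    i_max = profiles.argmax(axis=1)
    i_min = profiles.argmin(axis=1)
    rows = np.arange(profiles.shape[0])
    rise = profiles[rows, i_min] - profiles[rows, i_max]
    run = (i_min - i_max) * spacing          # assumes max and min differ
    return np.column_stack([std, var, rise / run])


def pca_scores(profiles, n_components=6):
    """Scores on the leading principal components (stations as variables)."""
    profiles = np.asarray(profiles, dtype=float)
    centered = profiles - profiles.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Toy cross-over profiles, only to exercise the functions.
x = (np.arange(41) - 20) * 12.5
toy = np.array([-x / (x**2 + d**2 + d) for d in range(5, 31, 5)])

features = shape_predictors(toy)
scores = pca_scores(toy)
print(features.shape, scores.shape)
```

Note that the variance is simply the square of the standard deviation, so these two predictors carry the same information on different scales.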

In order to maximize the capability of the regression model, it is important to choose the predictive variables that are just right for the task at hand. In other words, avoid both too few predictors (an underspecified regression model tends to produce biased estimates) and too many predictors (an overspecified regression model tends to produce less precise estimates). To address these concerns, I used the stepwise regression technique coupled with a Fisher test (F-test). Based on the analysis, the retained predictors were the second and third principal components and the standard deviation.
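For readers who have not used stepwise regression, here is a minimal forward-selection sketch driven by a partial F-statistic. It is a simplified stand-in, not the procedure I ran. It only adds predictors (no removal step), it uses a fixed, hypothetical F-to-enter threshold of 4.0 instead of a p-value, and the data are random numbers made up for the demonstration.

```python
import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_stepwise(X, y, f_to_enter=4.0):
    """Add, one at a time, the predictor with the largest partial F."""
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)                 # intercept-only model
    while len(selected) < p:
        best = None
        for j in range(p):
            if j in selected:
                continue
            dof = n - len(selected) - 2           # residual dof after adding j
            if dof <= 0:
                continue
            new = rss(X, y, selected + [j])
            f_stat = (current - new) / (max(new, np.finfo(float).tiny) / dof)
            if best is None or f_stat > best[0]:
                best = (f_stat, j, new)
        if best is None or best[0] < f_to_enter:
            break                                 # nothing left worth adding
        selected.append(best[1])
        current = best[2]
    return selected


# Made-up data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
selected = forward_stepwise(X, y)
print(selected)
```

A full stepwise procedure would also re-test the predictors already in the model after each addition and drop any that no longer pass the test.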

To test the quality of the regression model, I used the synthetic data set as input to estimate the depth of the line conductor. The cross-plot shown in Figure 4 indicates that the model does, in general, work. There seems to be a curvilinear relationship, which may suggest that higher-order predictors should be included.

Figure 4. Evaluation of the regression model quality.

We are almost finished with the process. However, there is one more thing to do. Although the learned regression model is ready to be applied to real VLF-EM data, the characteristics of the synthetic data may fail to fully reflect the useful patterns of real data and their peculiar challenges (e.g., presence of noise, quality of field data measurements, properties of the cross-over anomaly, etc.).

Therefore, it is desirable to process the real data so that it presents the aforementioned characteristics of the synthetic data. In order to accomplish this, I came up with the following workflow. First, extract from the VLF-EM profile the section that corresponds to the cross-over anomaly. Second, estimate the position of the inflection point between the maximum and minimum points. At this point, the extracted section of the VLF-EM profile is divided into two parts, named the left and right components. Third, take the left (or right) component and apply an interpolation with a step of 12.5 m to produce 21 measurements. Finally, a new VLF-EM profile is generated by taking the interpolated left (or right) component and applying a translation with respect to the point of inflection.
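The workflow above can be sketched as follows. Several choices here are my assumptions for the sake of a runnable example, not a record of the original processing. The inflection point is estimated as the place where the profile crosses the level halfway between its maximum and minimum. The final step is implemented as a point reflection of the interpolated component through the inflection point. The function name `rebuild_profile` is hypothetical, and the input is assumed to be the already extracted cross-over section.

```python
import numpy as np


def rebuild_profile(x, y, side="left", spacing=12.5, n_half=21):
    """Rebuild a 41-point profile from one component of a cross-over anomaly.

    x, y: increasing station positions (m) and VLF-EM values of the
    extracted cross-over section.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 2: inflection point, taken as the mid-level crossing (assumed).
    lo, hi = sorted((int(y.argmax()), int(y.argmin())))
    level = 0.5 * (y[lo] + y[hi])
    s = y[lo:hi + 1] - level
    k = int(np.nonzero(s[:-1] * s[1:] <= 0)[0][0])     # first sign change
    t = s[k] / (s[k] - s[k + 1])
    x_inf = x[lo + k] + t * (x[lo + k + 1] - x[lo + k])

    # Step 3: resample one side every `spacing` metres, moving outward from
    # the inflection point. np.interp holds the end value if the section is
    # shorter than (n_half - 1) * spacing.
    sign = -1.0 if side == "left" else 1.0
    half = np.interp(x_inf + sign * np.arange(n_half) * spacing, x, y)

    # Step 4: reflect that side through the inflection point (assumed).
    mirrored = 2.0 * level - half[1:]
    if side == "left":
        values = np.concatenate([half[::-1], mirrored])
    else:
        values = np.concatenate([mirrored[::-1], half])
    positions = (np.arange(2 * n_half - 1) - (n_half - 1)) * spacing
    return positions, values


# Antisymmetric toy anomaly: both components should reproduce it.
x = (np.arange(41) - 20) * 12.5
y = -x / (x**2 + 110.0)
pos, from_left = rebuild_profile(x, y, side="left")
_, from_right = rebuild_profile(x, y, side="right")
print(np.allclose(from_left, y), np.allclose(from_right, y))
```

On field data the two components generally differ, which is why each one yields its own rebuilt profile and its own depth estimate.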

To demonstrate the capacity of the predictive model to estimate the orebody deposit depth, a VLF-EM profile from a geophysical survey (north of Québec, Canada) was used (Figure 5a). Following the proposed workflow, the original cross-over anomaly section was used to produce two new profiles (Figures 5b and 5c).

Figure 5. (a) VLF-EM profile measured during a geophysical survey (north of Québec, Canada). The extracted cross-over anomaly is shown as a segment in black. Following the proposed workflow, two new VLF-EM signals were generated: (b) left component and (c) right component.

The estimated orebody deposit depth using the left and right components is 0 m, which is in agreement with the known vertical placement of this deposit.

I hope that you enjoyed this post. If you have encountered a similar situation and would like to share it, or even expand on the topic, let me know in the comments and I will add it in.
