Clustering Geoscience data: Part 2 – Selecting and preparing the inputs for clustering

In our previous post - Clustering Geoscience data: Part 1 – Choosing the right tool for the job - we discussed algorithm selection. In this post we continue with this theme and discuss how to choose and prepare appropriate input data for clustering, and how these choices can impact the end results.

Clustering is an unsupervised method of exploring the structure of data by grouping together samples that are similar (with respect to a distance/similarity measure). Even though the clustering itself is unsupervised, we can - and should - apply supervision and guidance by being selective about what variables are used and how they are prepared. We also need to consider the most appropriate clustering method and what type of distance/similarity metric will be used.
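As a small illustration of why the distance/similarity measure matters, the sketch below compares Euclidean distance with cosine distance on three made-up two-variable samples. The sample values and the helper names (`euclidean`, `cosine_distance`, `nearest`) are assumptions made for this example only and are not taken from any particular library or dataset.

```python
import math

def euclidean(a, b):
    """Straight-line distance: sensitive to absolute magnitudes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: compares proportions, ignores overall magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(target, candidates, metric):
    """Return the name of the candidate closest to target under the given metric."""
    return min(candidates, key=lambda name: metric(target, candidates[name]))

# Hypothetical samples: B has the same element ratio as A at twice the
# concentration; C is closer to A in absolute terms but has a different ratio.
sample_a = (1.0, 2.0)
others = {"B": (2.0, 4.0), "C": (1.5, 1.5)}

nearest_euclidean = nearest(sample_a, others, euclidean)
nearest_cosine = nearest(sample_a, others, cosine_distance)
print(nearest_euclidean, nearest_cosine)
```

With these values the two measures disagree about which sample is most similar to A, so a clustering built on one measure can group the samples differently from a clustering built on the other.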

Two important factors to consider before undertaking unsupervised learning are:

· Domain knowledge – Different sets of inputs will produce clusters that can have very different meanings. While tempting, using all the available variables will often produce cluster solutions that are difficult to interpret. A geologist will often have inherent knowledge that helps in selecting the optimal variables for a particular unsupervised task.

· Pre-processing – Manipulation and tidying of a dataset are key to providing an optimal and useful solution. Examples include, but are not limited to: identification and possible removal of correlated variables, standardisation (or normalisation) of the data, and assessment of outliers and their likely effect on the end result.
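Two of the pre-processing steps listed above, standardisation and the identification of correlated variables, can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended workflow: the assay values are invented, the column names are hypothetical, and the 0.95 correlation threshold is an arbitrary choice for the example.

```python
import math
import statistics

def zscore(values):
    """Standardise a column to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(columns, threshold=0.95):
    """List pairs of column names whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical assay table (invented values). "Fe2O3_pct" is a fixed
# multiple of "Fe_pct", so it carries no additional information.
assays = {
    "Fe_pct": [5.0, 6.0, 7.0, 8.0, 9.0],
    "Fe2O3_pct": [7.15, 8.58, 10.01, 11.44, 12.87],
    "Cu_ppm": [120.0, 80.0, 400.0, 95.0, 150.0],
}

redundant = correlated_pairs(assays)  # candidates for removal before clustering
scaled = {name: zscore(col) for name, col in assays.items()}  # comparable scales
print(redundant)
```

Whether a flagged variable is actually dropped remains a domain decision. The check only highlights candidates, and an outlier such as the single high Cu value here would still need to be assessed separately.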

The input variables, the processing and the clustering algorithm that are chosen are all dependent on one thing – what patterns you are looking to extract from the data. Throwing all the data in with no pre-processing is fraught with danger and will often lead to more questions than answers.

For a more detailed explanation of these concepts read the full blog here, or contact us with any questions at information@solvegeosolutions.com.  
