Modeling with Sparse Data in Materials Development
By Avy Tahan
Materials development experiment data is often characterized by sparsity. Sparse data is not missing data; let’s clarify that from the outset. Sparse data is data in which a significant proportion of input features hold many zero values. This is problematic for machine learning modeling because the distribution of such data is widely scattered and uneven, giving a model little chance to learn a useful mapping from input to target.
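A minimal illustration of what this looks like in practice; the formulation matrix below is invented purely for demonstration:

```python
import numpy as np

# Hypothetical formulation matrix: rows are experiments, columns are
# raw-material proportions. Most materials appear in only a few runs.
X = np.array([
    [0.7, 0.0, 0.3, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.1],
    [0.6, 0.0, 0.4, 0.0],
])

# Per-feature sparsity: the fraction of zero entries in each column.
sparsity = (X == 0).mean(axis=0)
print(sparsity)  # [0.25 0.75 0.25 0.75]
```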
The Problem
In materials development, sparsity manifests particularly in input features that represent the proportions of raw materials included in experimental formulations, many of which may be used only rarely. The problem is compounded by the sheer number of such sparse raw-material features in the dataset, included to allow a wide range of possible combinations but ultimately exploding the dimensionality of the data and scattering it even more remotely throughout the feature space. Furthermore, the dataset may contain groups of raw materials that are mutually exclusive, representing alternatives, so that when one holds a non-zero value the others are zero. These ‘either-or’ constraints create ‘orthogonal’ subsets in the data which would need to be modelled separately, subset size permitting; a quick check for such pairs is sketched below.
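As a hedged sketch, one way to surface candidate ‘either-or’ pairs is to check whether two raw-material columns are ever non-zero in the same experiment; the column names and values here are hypothetical:

```python
import itertools
import pandas as pd

# Hypothetical formulation table; binder_A and binder_B are alternatives.
df = pd.DataFrame({
    "binder_A": [0.2, 0.0, 0.3, 0.0],
    "binder_B": [0.0, 0.25, 0.0, 0.3],
    "filler":   [0.8, 0.75, 0.7, 0.7],
})

# Flag column pairs that are never simultaneously non-zero.
for a, b in itertools.combinations(df.columns, 2):
    if not ((df[a] != 0) & (df[b] != 0)).any():
        print(f"{a} and {b} look mutually exclusive")
```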
Considerations
Technical discussions on the subject often suggest applying some transformation to the input features to turn them from sparse to dense. Techniques which project the data into an alternative feature space, typified by Principal Component Analysis (PCA), effectively create new features as mixtures of the original ones. This makes the new features, and any model built on them, far less interpretable, which hinders experimental understanding.
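A minimal sketch of the interpretability issue, using scikit-learn’s PCA on an invented sparse formulation matrix: each component comes out as a weighted blend of every raw material, so no component maps back to a single ingredient.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Invented sparse formulation data: 50 experiments, 10 raw materials,
# with roughly 70% of entries zeroed out.
X = rng.random((50, 10)) * (rng.random((50, 10)) > 0.7)

pca = PCA(n_components=3)
X_dense = pca.fit_transform(X)  # dense, but no longer raw-material axes

# Each row of components_ mixes all 10 original features together.
print(pca.components_.round(2))
```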
Other advice on sparse datasets claims that particular machine learning models, such as tree-based models, are better at ‘handling’ them. Such vague claims must be treated with caution: they may refer only to a particular algorithm’s implementation being able to train on sparse input at all, and not to the actual skill of the trained model in making accurate predictions. As usual, the performance of a model must be determined through formal evaluation techniques such as cross-validation.
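For instance, a hedged sketch of putting such a claim to the test: cross-validating a tree-based regressor on the sparse matrix rather than taking ‘handles sparse data’ at face value. The data and scoring choices below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Invented sparse inputs and a noisy target for demonstration.
X = rng.random((100, 20)) * (rng.random((100, 20)) > 0.8)
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=200, random_state=0)
# 5-fold cross-validated R^2: evidence that the model actually predicts,
# not merely that it trains on sparse input without error.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```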
Strategies
Techniques to deal effectively with sparse data are somewhat limited, but may work well in certain scenarios. The first and most obvious step is simply to remove from the dataset those input features whose sparsity exceeds some predefined threshold, which has the desirable side effect of reducing the dimensionality of the data. Extremely sparse features are unlikely to express themselves in a machine learning model in any case, as denser features with higher variance tend to dominate. If expert knowledge indicates that a removed feature is important for prediction, then future experiments should include it more often to increase its density, so that it can be incorporated into subsequent modelling.
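A minimal sketch of such a filter, with the 90% threshold chosen purely for illustration:

```python
import pandas as pd

def drop_sparse_features(df: pd.DataFrame, max_zero_frac: float = 0.9) -> pd.DataFrame:
    """Drop columns whose fraction of zero entries exceeds the threshold."""
    zero_frac = (df == 0).mean()
    keep = zero_frac[zero_frac <= max_zero_frac].index
    return df[keep]
```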
In the more extreme case of a high-dimensional, sparse dataset with few data points, building an effective predictive model is near impossible. The approach should instead be one of sequential learning: navigating the feature space wisely towards some optimal target while expanding the dataset along the way, as espoused by Bayesian Optimization.
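As a hedged sketch of that loop, here is a minimal Bayesian-optimization-style sequential learner built on scikit-learn’s Gaussian process with an upper-confidence-bound acquisition; the objective function and formulation space are invented stand-ins for real lab experiments:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def run_experiment(x):
    # Stand-in for a real lab measurement of a formulation property.
    return -np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
# Candidate formulations: proportions of 3 raw materials summing to 1.
candidates = rng.dirichlet(np.ones(3), size=500)

# Seed the dataset with a handful of initial experiments.
X = candidates[:5]
y = np.array([run_experiment(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + 1.96 * sigma                # upper confidence bound
    x_next = candidates[np.argmax(ucb)]    # most promising next experiment
    X = np.vstack([X, x_next])
    y = np.append(y, run_experiment(x_next))

print("Best formulation found:", X[np.argmax(y)])
```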
A final possibility, especially relevant to materials development experimental data, is to aggregate groups of sparse raw-material features into single features according to their specific function in the experiment, so long as this makes sense from a chemistry standpoint; this is where the expert knowledge of a materials scientist comes into play. Where appropriate, this will both reduce the dimensionality of the data and decrease its sparsity, while somewhat retaining the independence of the original features, all of which may have a positive impact on the resulting models, in terms of both their performance and their interpretability.
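A minimal sketch of such functional aggregation, where the grouping of columns by role (binders, solvents) is a hypothetical mapping that a materials scientist would supply:

```python
import pandas as pd

# Hypothetical mapping from raw materials to their functional role.
groups = {
    "binder_total":  ["binder_A", "binder_B", "binder_C"],
    "solvent_total": ["solvent_X", "solvent_Y"],
}

def aggregate_by_function(df: pd.DataFrame, groups: dict) -> pd.DataFrame:
    """Sum each group of sparse raw-material proportions into one feature."""
    out = df.drop(columns=[c for cols in groups.values() for c in cols])
    for name, cols in groups.items():
        out[name] = df[cols].sum(axis=1)
    return out
```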
The Bottom Line
In conclusion, sparse data is a reality that cannot be ignored and must be handled with care on a case-by-case basis. All the above-mentioned techniques should be considered and applied as appropriate, depending on the goal of the research, the field of study and the particular characteristics of the dataset. Where one technique succeeds, others will fail, and it is left to the skill and discretion of the machine learning practitioner to judiciously determine the most suitable tools to apply. And, if all else fails, it may simply be necessary to roll up your sleeves and take the time to collect and curate a better dataset.