Challenges in Machine Learning Methodologies Applied to Chemistry and Drug Discovery
The first challenge: Database generation
With the advent of machine learning techniques in the closing decades of the twentieth century, and their wide dissemination in recent years, many applications have been proposed in computational chemistry research. Broader adoption, however, still faces several obstacles: a shortage of data for training predictive models and for analysing and validating them, the risk of biases introduced by a poorly conducted training process, and little consensus on good practices regarding which descriptors should be used in each scientific investigation.
Even though machine learning techniques were developed in the twentieth century, their limited use in scientific research, especially in chemistry, means these methodologies are still perceived as very recent. A brief survey of the databases used to train predictive algorithms corroborates this: few are available, compared with the databases used in other areas, such as economics and information science.
The algorithms used in model training need databases with thousands of records and cases so that property predictions, for example, reach a minimum level of reliability. One can imagine that today, given a collection of organic compound structures, a neural network could be trained to anticipate the physical properties of a new compound from the data already gathered. However, every compound in such a database had to pass through a battery of characterisation analyses, which, as is well known, are not fast enough for a database of structures and properties to be built in only a few years.
This is certainly one of the most difficult challenges, but not the hardest to overcome. At the current pace of development and data collection, within a few years we will be able to establish datasets available to the entire scientific community, providing standards for comparing models and formalising the training protocols needed to increase confidence in the results obtained with machine learning algorithms.
The second challenge: What is the correct procedure?
The second challenge, described in the introduction to this topic, concerns the practices used to develop prediction models. It is well known that each model needs a minimum amount of data before it can generate results consistent with real observations and, finally, be used for prediction. However, deciding how much of the data goes into training a machine learning model and how much is reserved for the subsequent validation step is as important as the dataset itself. For example, a model such as a Random Forest can be trained with 90% of the data in a set, ensuring a generous amount of information in the training stage. But this can overfit the model: it acquires biases and performs well only on the data used for training, invalidating it for data outside the initial set.
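The hold-out idea behind this split can be sketched in plain Python. This is a toy illustration, not a specific library's API; the function name, test fraction and placeholder dataset are all assumptions for the example:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a dataset and split it into training and test subsets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = data[:]                 # copy so the original order is preserved
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

compounds = list(range(100))           # placeholder for 100 characterised compounds
train, test = train_test_split(compounds, test_fraction=0.2)
print(len(train), len(test))           # 80 20
```

Keeping the test subset strictly apart from training is what lets the validation step detect the overfitting described above.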
For this reason, good model development practice must follow a standard that guarantees reproducibility, precision and, therefore, reliable results. Such a standard arises naturally from scientific discussion, as we encounter the limitations of each approach and improve on them. This is a clear concern, but not a difficult one to overcome.
The third challenge: How to describe chemicals and their properties?
On the other hand, probably the biggest challenge in computational chemistry, when machine learning algorithms are used to investigate the properties of chemical systems of interest, is the correct use and development of molecular descriptors: functions that translate the chemical information of the molecules in a database into computable values that can be fed into the training algorithms of predictive models.
Descriptors are any functions used to convey relevant information about the chemical properties of a molecule to the model being trained. They come in several formats and dimensions. Some transmit only the number of atoms in a molecule, its centre of mass, moment of inertia or number of electrons, among other mostly structural properties, while others carry data derived from electronic properties, such as the electric dipole, magnetic permittivity or absorption at a specific wavelength. There is no formal prescription of what these functions should look like or when to use them; this varies with the system under investigation and with the researchers conducting the process. That makes this a complicated topic, since it can lead to a lack of standardisation of procedures and reduce the reliability of results.
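As an illustration of the simplest, purely structural kind of descriptor, the snippet below derives two values directly from a molecular formula. The mass table, descriptor choice and formula parser are a toy sketch for this article, not a standard cheminformatics routine:

```python
import re

# Minimal atomic mass table (IUPAC standard atomic weights, abridged)
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}

def simple_descriptors(formula):
    """Toy structural descriptors computed from a molecular formula string."""
    counts = {}
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(num) if num else 1)
    n_atoms = sum(counts.values())
    mol_weight = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {"n_atoms": n_atoms, "mol_weight": round(mol_weight, 3)}

print(simple_descriptors("C2H6O"))    # ethanol: 9 atoms, ~46.069 g/mol
```

Electronic descriptors, in contrast, cannot be read off the formula; they require quantum-chemical calculation or experimental measurement.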
Recent studies show that a training algorithm need not be fed a single descriptor; a linear combination of descriptors can be used instead, which makes the standardisation process even more complicated and lengthy. Moreover, the descriptors themselves are an active area of scientific development, and entire works are dedicated to this purpose.
Perspectives
Notwithstanding all the discussion developed around the data and functions to be used, one last issue cannot be ignored: choosing the right algorithm for the prediction, clustering or classification task at hand, and refining the hyperparameters of each machine learning algorithm. Thus the combination of method, hyperparameters, descriptors and dataset needed to train a machine learning model, despite its enormous potential for solving complex problems in chemistry, brings with it new problems to be solved.
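Hyperparameter refinement is often done as a grid search over candidate settings. The sketch below uses a placeholder scoring function where a real workflow would train the model and score it on held-out data; the grid values and parameter names are hypothetical:

```python
from itertools import product

# Hypothetical hyperparameter grid for a tree-ensemble model
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [4, 8, None],
}

def validation_score(params):
    """Placeholder: a real workflow trains the model and scores it on validation data."""
    depth = params["max_depth"] or 16        # treat None (unlimited) as deep
    return 1.0 / (1 + abs(params["n_trees"] - 100) / 100 + abs(depth - 8) / 8)

# Evaluate every combination in the grid and keep the best-scoring one
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=validation_score,
)
print(best)    # {'n_trees': 100, 'max_depth': 8}
```

Each point in the grid multiplies the training cost, which is one reason the choice of method, descriptors and data cannot be separated from the tuning step.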
However, these new problems are intrinsically easier to solve, requiring mainly greater attention to the standardisation of these methodologies and the publication of more relevant scientific work on the subject.