Challenges in Machine Learning Methodologies Applied to Chemistry and Drug Discovery
The first challenge: Database generation
With the advent of machine learning techniques in the closing decades of the twentieth century, and their wide dissemination in recent years, many applications have been proposed in computational chemistry research. Broader adoption, however, still faces several obstacles: a shortage of data for training predictive models and for analysing and validating them, the risk of biases introduced by a poorly conducted training process, and little consensus on good practices regarding which descriptors should be used in each scientific investigation.
Even though machine learning techniques were developed in the twentieth century, their limited use in scientific research, especially in chemistry, means these methodologies are still perceived as very recent. A brief survey of the databases used to train predictive algorithms corroborates this: few are available, compared with the databases used in other areas, such as economics and information science.
The algorithms used in model training need databases with thousands of records and cases so that property predictions, for example, reach a minimum level of reliability. One can imagine that today, given a collection of organic compound structures, a neural network could be trained to anticipate the physical properties of a new compound from the data already gathered. However, every compound in such a database had to pass through a battery of characterisation analyses, which, as is well known, are not fast enough for a database of structures and properties to be built in only a few years.
This is certainly one of the most difficult challenges, but not the hardest to overcome. At the current pace of development and data collection, within a few years we will be able to establish datasets available to the entire scientific community, providing standards for comparing models and formalising the training protocols needed to increase confidence in the results obtained with machine learning algorithms.
The second challenge: What is the correct procedure?
The second challenge, described in the introduction to this topic, concerns the practices used to develop prediction models. It is well known that each model needs a minimum amount of data before it can generate results consistent with real observations and, finally, be used for prediction. However, deciding how much of the data goes into training a machine learning model and how much is reserved for the subsequent validation step is as important as the dataset itself. For example, a model such as a Random Forest can be trained with 90% of the data in a set, ensuring a generous amount of information in the training stage. But this can overfit the model: it acquires biases and performs well only on the data used for training, invalidating it for data outside the initial set.
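The hold-out idea behind this split can be sketched in plain Python. This is a toy illustration, not a specific library's API; the function name, test fraction and placeholder dataset are all assumptions for the example:

```python
import random

def train_test_split(data, test_fraction=0.2, seed=42):
    """Shuffle a dataset and split it into training and test subsets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = data[:]                 # copy so the original order is preserved
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

compounds = list(range(100))           # placeholder for 100 characterised compounds
train, test = train_test_split(compounds, test_fraction=0.2)
print(len(train), len(test))           # 80 20
```

Keeping the test subset strictly apart from training is what lets the validation step detect the overfitting described above.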
For this reason, good model development practice must follow a standard that guarantees reproducibility, precision and, therefore, reliable results. Such a standard arises naturally from scientific discussion, as we encounter the limitations of each approach and improve on them. This is a clear concern, but not a difficult one to overcome.
The third challenge: How to describe chemicals and their properties?
On the other hand, probably the biggest challenge in computational chemistry, when machine learning algorithms are used to investigate the properties of chemical systems of interest, is the correct use and development of molecular descriptors: functions that translate the chemical information of the molecules in a database into computable values that can be fed into the training algorithms of predictive models.
Descriptors are any functions used to convey relevant information about the chemical properties of a molecule to the model being trained. They come in several formats and dimensions. Some transmit only the number of atoms in a molecule, its centre of mass, moment of inertia or number of electrons, among other mostly structural properties, while others carry data derived from electronic properties, such as the electric dipole, magnetic permittivity or absorption at a specific wavelength. There is no formal prescription of what these functions should look like or when to use them; this varies with the system under investigation and with the researchers conducting the process. That makes this a complicated topic, since it can lead to a lack of standardisation of procedures and reduce the reliability of results.
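As an illustration of the simplest, purely structural kind of descriptor, the snippet below derives two values directly from a molecular formula. The mass table, descriptor choice and formula parser are a toy sketch for this article, not a standard cheminformatics routine:

```python
import re

# Minimal atomic mass table (IUPAC standard atomic weights, abridged)
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007}

def simple_descriptors(formula):
    """Toy structural descriptors computed from a molecular formula string."""
    counts = {}
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(num) if num else 1)
    n_atoms = sum(counts.values())
    mol_weight = sum(ATOMIC_MASS[el] * n for el, n in counts.items())
    return {"n_atoms": n_atoms, "mol_weight": round(mol_weight, 3)}

print(simple_descriptors("C2H6O"))    # ethanol: 9 atoms, ~46.069 g/mol
```

Electronic descriptors, in contrast, cannot be read off the formula; they require quantum-chemical calculation or experimental measurement.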
Recent studies show that a training algorithm need not be fed a single descriptor; a linear combination of descriptors can be used instead, which makes the standardisation process even more complicated and lengthy. Moreover, the descriptors themselves are an active area of scientific development, and entire works are dedicated to this purpose.
Perspectives
Notwithstanding all the discussion developed around the data and functions to be used, one last issue cannot be ignored: choosing the right algorithm for the prediction, clustering or classification task at hand, and refining the hyperparameters of each machine learning algorithm. Thus the combination of method, hyperparameters, descriptors and dataset needed to train a machine learning model, despite its enormous potential for solving complex problems in chemistry, brings with it new problems to be solved.
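Hyperparameter refinement is often done as a grid search over candidate settings. The sketch below uses a placeholder scoring function where a real workflow would train the model and score it on held-out data; the grid values and parameter names are hypothetical:

```python
from itertools import product

# Hypothetical hyperparameter grid for a tree-ensemble model
grid = {
    "n_trees": [50, 100, 200],
    "max_depth": [4, 8, None],
}

def validation_score(params):
    """Placeholder: a real workflow trains the model and scores it on validation data."""
    depth = params["max_depth"] or 16        # treat None (unlimited) as deep
    return 1.0 / (1 + abs(params["n_trees"] - 100) / 100 + abs(depth - 8) / 8)

# Evaluate every combination in the grid and keep the best-scoring one
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=validation_score,
)
print(best)    # {'n_trees': 100, 'max_depth': 8}
```

Each point in the grid multiplies the training cost, which is one reason the choice of method, descriptors and data cannot be separated from the tuning step.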
However, these new problems are intrinsically easier to solve, requiring mainly greater attention to the standardisation of these methodologies and the publication of more relevant scientific work on the subject.