Solving the microplastics detection problem using machine learning: a question of speed and scalability
In recent years, FTIR imaging has established itself as one of the key methods for analyzing microplastics, and not without reason. Fourier-transform infrared (FTIR) spectroscopy yields high-resolution spectra that make it possible to distinguish between many different polymer types. With focal plane array (FPA) detectors, this approach can spatially resolve environmental filter samples and produce hyperspectral images (HSIs).
Spectral library searches
However, the full potential of FPA-based FTIR imaging is difficult to exploit, since the large amounts of spectroscopic data cannot be analyzed manually without biasing the results. A lot of effort has therefore been devoted to automating the data analysis by means of spectral reference libraries. In this approach, each spectrum of the HSI is compared to a database of known reference spectra, and the entries are ranked according to a similarity measure, also called the hit quality index (HQI).
Even though this approach is appealing at first glance, spectral reference libraries have two major issues when they are used to detect microplastics:
- Renner et al. (2018) argue that, in order to avoid misidentifications, the database should account for the rich diversity of spectra present in the measured datasets. This means that, apart from virgin polymer spectra, it should include aged and bio-fouled polymers as well as a variety of spectra from biological materials and other matrix components.
- A spectral library search is an expensive computation that can take multiple hours or even days, because each unknown spectrum has to be compared to every reference in the spectral library in order to compute the HQI.
It follows that improving the statistical performance of the database inevitably also increases the time required for the analysis.
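The cost argument can be made concrete with a minimal sketch. It assumes the Pearson correlation as the HQI, which is only one of several similarity measures in use, and all names, the toy intensities and the library size of 500 entries are hypothetical choices for illustration.

```python
import math

def pearson_hqi(spectrum, reference):
    """Pearson correlation of two equally sampled spectra (one possible HQI)."""
    n = len(spectrum)
    mean_s = sum(spectrum) / n
    mean_r = sum(reference) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(spectrum, reference))
    norm_s = math.sqrt(sum((s - mean_s) ** 2 for s in spectrum))
    norm_r = math.sqrt(sum((r - mean_r) ** 2 for r in reference))
    if norm_s == 0.0 or norm_r == 0.0:
        return 0.0  # flat spectrum: correlation is undefined, treat as no match
    return cov / (norm_s * norm_r)

def library_search(spectrum, library):
    """Compare one spectrum to every library entry and rank the hits, best first."""
    hits = [(pearson_hqi(spectrum, ref), name) for name, ref in library.items()]
    return sorted(hits, reverse=True)

def comparison_count(n_pixels, n_references):
    """Number of HQI evaluations needed to search a whole image."""
    return n_pixels * n_references

# Toy library with made-up intensities at five wavenumbers (illustration only).
library = {
    "PE": [0.1, 0.9, 0.2, 0.1, 0.8],
    "PP": [0.7, 0.2, 0.9, 0.3, 0.1],
    "cellulose": [0.4, 0.4, 0.5, 0.9, 0.6],
}
unknown = [0.15, 0.85, 0.25, 0.1, 0.75]

ranking = library_search(unknown, library)
best_hqi, best_name = ranking[0]
print(best_name, round(best_hqi, 3))

# An image of 1 million pixels searched against a hypothetical 500-entry library:
print(comparison_count(1_000_000, 500), "HQI evaluations")
```

Because the number of HQI evaluations is the product of pixels and references, doubling the library in this sketch doubles the work for every image.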
How can machine learning solve this issue?
The above problem is of course not new, since many areas of analytical chemistry have to deal with large amounts of data very quickly. An alternative to spectral library searches is model-based machine learning, a well-established methodology in chemometrics. The difference is that the analysis results are computed by a statistical model, which is usually orders of magnitude faster than calculating HQIs.
To create a machine learning model we use a training dataset, which is a collection of reference spectra. Training datasets are very similar to spectral reference libraries, with the key difference that they are usually much bigger and more diverse. This is possible because the complexity of the model need not increase when we increase the number of spectral references used to create it.
These differences have led to the success of model-based machine learning in many areas. For the detection of microplastics this has two consequences:
- We can follow the advice given by Renner et al. (2018) to the extreme and create training datasets that are much richer in diversity, which in turn should improve the results of the automatic analysis.
- Since the throughput of machine learning models is usually much higher than that of spectral library searches, more samples can be analyzed in less time.
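The point that a model can stay the same size while the training dataset grows can be sketched with a deliberately simple stand-in. The nearest-centroid classifier below is an assumption chosen for brevity, not one of the models from the literature (PLS-DA, SVM or RDF), and the templates, noise level and function names are hypothetical.

```python
import math
import random

def train_centroids(spectra, labels):
    """Average all training spectra of each class into a single centroid."""
    sums, counts = {}, {}
    for spectrum, label in zip(spectra, labels):
        if label not in sums:
            sums[label] = [0.0] * len(spectrum)
            counts[label] = 0
        sums[label] = [a + b for a, b in zip(sums[label], spectrum)]
        counts[label] += 1
    return {label: [v / counts[label] for v in total]
            for label, total in sums.items()}

def predict(model, spectrum):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], spectrum))

def model_size(model):
    """Number of stored values: classes times wavenumbers."""
    return sum(len(centroid) for centroid in model.values())

# Made-up class templates at five wavenumbers (illustration only).
templates = {"PE": [0.1, 0.9, 0.2, 0.1, 0.8], "PP": [0.7, 0.2, 0.9, 0.3, 0.1]}
rng = random.Random(0)  # fixed seed so the sketch is reproducible

def make_training_set(n_per_class, noise=0.05):
    """Simulate a training dataset as noisy copies of each template."""
    spectra, labels = [], []
    for name, template in templates.items():
        for _ in range(n_per_class):
            spectra.append([x + rng.gauss(0.0, noise) for x in template])
            labels.append(name)
    return spectra, labels

small_model = train_centroids(*make_training_set(10))
large_model = train_centroids(*make_training_set(1000))

# 100 times more training spectra, but the model stores the same number of values,
# so classifying one pixel costs the same with either model.
print(model_size(small_model), model_size(large_model))
print(predict(large_model, [0.15, 0.85, 0.25, 0.1, 0.75]))
```

In this sketch each pixel is compared to one centroid per class, however many reference spectra went into training, whereas a library search compares it to every reference.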
Current machine learning approaches
Recently, the trend of using machine learning models for chemometric tasks has reached the microplastics research community. Paul et al. (2018) developed a high-throughput approach for detecting microplastics in soil from NIR bulk spectra, based on partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM). PLS-DA is one of the most established machine learning approaches in chemometrics, with a long track record of applications in spectroscopy (Lee et al., 2018). Due to its high throughput, it has also been applied to HSI classification tasks; the model published by Serranti et al. (2018), for example, allows fast screening for three different polymers using a SWIR camera. Shan et al. (2018) introduced an SVM-based approach for NIR cameras that covers eight different polymers.
In the domain of FPA-based µFTIR imaging, Hufnagl et al. (2019) describe a preliminary study on the applicability of random decision forests (RDF) to the detection of microplastics. While this early version could only distinguish between five polymers, the first application of the approach, a study of the Roter Main river by Frei et al. (2019), already used an extended version of the model that accounts for eight different polymer types. To date, the model has been extended to cover more than twenty polymer types and can process an FTIR image of roughly 1 million pixels within 10 minutes.
Conclusion
If our goal is to prevent the uncontrolled emission of microplastics into the environment, monitoring approaches that allow fast processing of many samples are urgently needed. I believe that the model-based machine learning methodologies described above offer a new pathway for meeting these demands. In my view, further research is necessary to reach broader acceptance of machine learning in the microplastics community. I highly recommend including these approaches in the current standardization process so that they can reach their full potential.