Real-Time Analysis and Predictive Analytics for Clinical Trial Data
Abstract
Technological advancement has changed everyday life and professional practice in nearly every field. In healthcare, the introduction of artificial intelligence (AI) and data analysis has brought significant improvements across many areas of medicine. The COVID-19 pandemic, however, has disrupted normal life in countries around the world, and every country is working to put in place a system capable of dealing with the coronavirus. AI tools are among the most promising means of dealing with COVID-19.
Introduction
In December 2019, a novel coronavirus was reported to be spreading in the city of Wuhan, China. The World Health Organization (WHO) later declared the outbreak a global pandemic and called on every country to put preventive measures in place. In February 2020, as the spread accelerated, the WHO named the disease COVID-19. Coronaviruses are a family of viruses that includes the virus responsible for severe acute respiratory syndrome (SARS); severe infection can lead to acute respiratory distress syndrome (ARDS). According to the WHO, COVID-19 is transmitted mainly through the respiratory tract, when healthy people come into contact with an infected person. Other means of transmission have not yet been fully identified, and scientists are still working to establish the complete picture. The WHO also reports that symptoms appear between 2 and 14 days after infection, which corresponds to the incubation period. Technological advancement is now a central part of the global effort against the COVID-19 pandemic.
Predictive Modeling in Healthcare: Optimizing Clinical Trials
Clinical trials are a fundamental part of the medical industry. They continue to drive the development of new drugs, but they are becoming increasingly costly. Pharmaceutical drug trials have therefore become an important application area for predictive modelling in healthcare (Loginov et al., 2018). Predictive modelling uses past and present data, together with statistical techniques, to estimate the likely outcome of an event. It can be applied to any type of event, regardless of when the event occurs. Outside healthcare, for instance, predictive models are often used in crime detection to help identify suspects after a crime has occurred in a given region.
Combining machine learning with predictive modelling strengthens the pharmaceutical industry's ability to improve treatments and to build more sophisticated treatment systems. This technology has helped transform the medical sector and save more lives (Loginov et al., 2018). A statistical model of this kind applies statistical assumptions and a mathematical model to collected sample data in order to identify a given characteristic within a population.
Under current licensing standards, a new drug can take roughly 10 to 15 years to complete the three phases of clinical trials required before it is licensed and accepted for medical use. The pharmaceutical research industry has been spending millions of dollars searching for drugs to treat diseases such as COVID-19. Nevertheless, statistics from the United States Food and Drug Administration (FDA) indicate that only around one in ten drugs is approved, regardless of the money spent (Mukherjee, 2019). Shortening the time frame of the various stages while improving clinical trial outcomes would make the whole process faster and more effective. As a result, drugs could be produced sooner and at a more affordable cost, saving more lives.
Leveraging predictive modelling has greatly helped the development of the pharmaceutical industry. Even before drug trials begin, research and development (R&D) typically consumes millions or even billions of dollars over many years, as drug makers try to determine the best formulation, delivery mechanism and dosage (Mukherjee, 2019). Predictive modelling algorithms can be used to model how these drug variables affect a drug's impact and efficacy. COVID-19 is a major case for applying machine learning and artificial intelligence to optimize clinical trials for drugs and vaccines.
Researchers have used these tools to optimize the whole process, from tracking hospital capacity to identifying high-risk patients, and they expect these technologies to help prepare for similar circumstances in the future. By applying data analytics and machine learning, the world has learned a great deal about the spread of disease, particularly in the case of the COVID-19 pandemic. Data analytics has also had a marked effect on drug discovery in response to the pandemic. According to researchers in the medical industry, technology can influence outcomes even in complex areas of treatment.
On the other hand, the coronavirus has shown that the technology still needs more work and maturity before it can resolve a pandemic like this one. This is especially true of data access, data sharing and data quality, all of which affect the accuracy of the algorithms. Experts and scientists from all sectors are working to overcome the pandemic and to make better use of artificial intelligence to defeat the virus.
COVID-19 clinical trials are well suited to AI techniques that allow adaptive changes to the patient population, the dose and the schedule, through combinations of drugs and the use of several biomarkers to enable prediction. Current analytics systems in the medical industry support several capabilities (Mukherjee, 2019): fast and powerful drill-down into every data point from all clinical trials; flexible exploration of clinical data; hypothesis generation and real-time decision making; and integration of clinical data with data from other fields, which has deepened understanding across medical areas.
The Power of Real-Time Predictive Analytics for Improving Patient Care
The proliferation of electronic medical record (EMR) systems in hospitals has given administrators and clinicians access to far more data and information, which has fuelled predictive analytics in medicine (Van Calster et al., 2019). Although predictive analytics has influenced hospital care, managing it remains a critical challenge for the medical community. Acquiring raw EMR data alone is not enough to track, prevent and manage disease. Scientists have nevertheless been using predictive analytics to learn more about COVID-19. The first phase of clinical trials, carried out in humans, is the most complex and challenging stage of development, because it demands strong organization and careful interpretation.
Decision Tree
Decision tree learning is a modelling approach used in statistics, data mining and machine learning. For classification, a decision tree recursively partitions the feature space into regions and classifies each region independently. The features selected during feature engineering, together with their values, are represented as a table and supplied as input.
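To make the partitioning step concrete, the following is a minimal sketch of how a single split might be chosen, assuming Gini impurity as the split criterion (the text does not say which criterion was used). The feature table, the column meanings (age, temperature) and the labels are hypothetical illustration data, not values from the study.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure region)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = (None, None, gini(labels))
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must leave samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical feature table: [age, temperature in degrees Celsius]
rows = [[34, 36.8], [61, 38.9], [45, 39.2], [29, 37.0], [70, 38.5], [52, 36.9]]
labels = ["other", "COVID", "COVID", "other", "COVID", "other"]

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full decision tree would apply `best_split` recursively to each resulting region until the regions are pure or a depth limit is reached.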
AdaBoost
AdaBoost is an ensemble learning algorithm in which the instances of the dataset are weighted and a boosted classifier is trained on them. The selected features are represented as a table and supplied as input (Ravì et al., 2016). The outputs of the individual learning algorithms are combined into a weighted sum that forms the final output of the boosted classifier. AdaBoost is adaptive in the sense that each subsequent learner is tweaked in favour of the instances misclassified by previous classifiers. It is sensitive to noisy data and outliers, but in some problems it can be less susceptible to overfitting than other learning algorithms.
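The reweighting idea can be illustrated with a small sketch of discrete AdaBoost, assuming one-dimensional inputs, labels coded as +1 and -1, and threshold "stumps" as the weak learners. The toy data and all function names are hypothetical and chosen only so that no single stump can separate the classes.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` at or below the threshold, the opposite above."""
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    """Pick the (error, threshold, polarity) stump with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, polarity = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Misclassified instances gain weight; correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data the first stump alone misclassifies three points, while the weighted combination of three stumps classifies all ten correctly.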
Random Forest Classifier
Random forests, or random decision forests, are an ensemble learning method for classification and regression. The forest is trained using the bootstrap aggregating (bagging) technique. For regression, the overall prediction is the average of the predictions of the individual trees; for classification, it is the majority vote of the trees (Mursalin et al., 2017). Random forests also use a modified tree-learning algorithm that considers only a random subset of the features when selecting each split.
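As a rough sketch of bagging with majority voting, the code below trains one-split trees on bootstrap samples, each restricted to a single randomly chosen feature. Real random forests grow deeper trees and sample features at every split; the one-split simplification, the data and the labels are assumptions made for illustration only.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Fit a one-split tree on one feature: (feature, threshold, left_label, right_label)."""
    values = sorted({row[feature] for row in rows})
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(labels) + 1, None, majority, majority)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold midway between observed values
        left = [y for row, y in zip(rows, labels) if row[feature] <= t]
        right = [y for row, y in zip(rows, labels) if row[feature] > t]
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    return (feature,) + best[1:]

def stump_predict(stump, row):
    feature, t, l_lab, r_lab = stump
    if t is None:  # the bootstrap sample had a single distinct value
        return l_lab
    return l_lab if row[feature] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feature = rng.randrange(n_features)         # random feature subset of size one
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def forest_predict(forest, row):
    """Classification: majority vote over the trees."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature data with two well-separated classes.
rows = [[1, 1], [2, 3], [3, 2], [2, 2], [7, 8], [8, 7], [9, 9], [8, 8]]
labels = ["other"] * 4 + ["COVID"] * 4

forest = fit_forest(rows, labels)
print(forest_predict(forest, [1.5, 2.0]), forest_predict(forest, [8.5, 8.0]))
```

For a regression forest, `forest_predict` would average the trees' numeric outputs instead of counting votes.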
Stochastic Gradient Boosting
Stochastic gradient boosting is a variant of gradient boosting in which each iteration is fitted on a random subsample drawn from the training dataset. The algorithm builds an additive regression model by sequentially fitting a simple parameterized function to the current residuals by least squares at each iteration (Zhou et al., 2019). It can be applied to both classification and regression problems, and it produces a prediction model in the form of an ensemble of weak learners, typically decision trees.
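The residual-fitting loop can be sketched as follows for a one-dimensional regression problem with squared-error loss. The weak learner is a least-squares stump; the subsample fraction, learning rate, number of rounds and data are all assumed values for illustration, not settings taken from the study.

```python
import random

def fit_regression_stump(xs, residuals):
    """Least-squares stump: (threshold, left_mean, right_mean) fitted to the residuals."""
    values = sorted(set(xs))
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), None, mean, mean)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        l_mean, r_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    return best[1:]

def stump_value(stump, x):
    t, l_mean, r_mean = stump
    return l_mean if t is None or x <= t else r_mean

def fit_sgb(xs, ys, n_rounds=100, learning_rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)           # initial model: the mean target
    preds = [base] * len(xs)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)         # random subsample, no replacement
        residuals = [ys[i] - preds[i] for i in idx]  # current residuals on the subsample
        stump = fit_regression_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def sgb_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)

# Hypothetical step-shaped data.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1.2, 0.9, 1.1, 1.0, 0.8, 5.1, 4.9, 5.2, 5.0, 4.8]
model = fit_sgb(xs, ys)

def mse(predict_fn):
    return sum((y - predict_fn(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

baseline_mse = mse(lambda x: sum(ys) / len(ys))
boosted_mse = mse(lambda x: sgb_predict(model, x))
print(baseline_mse, boosted_mse)
```

The comparison at the end contrasts the boosted model's training error with that of simply predicting the mean.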
Is Machine Learning (ML) Helping to Optimize Clinical Trials?
According to a study conducted by researchers at MIT, machine learning, artificial intelligence and data analytics can help predict the outcomes of clinical trials, which could lead to faster approval of drugs and vaccines at lower cost. To predict trial outcomes, the researchers used a large dataset and developed machine learning algorithms that analyse over 140 features, such as trial status, accrual rates, duration and sponsor track record. They also applied statistical techniques to estimate missing values and so produce more accurate predictions (Battineni et al., 2019). The study achieved a predictive measure of 0.78 for forecasting transitions from phase 2 to approval and 0.81 for transitions from phase 3 to approval. Predictions this accurate can reduce the uncertainty of drug development and increase the amount that investors are willing to commit to clinical trials. The study showed that these algorithms could play a significant role in advancing new treatments from phase 2 or phase 3 to regulatory approval.
These algorithms can be run within analytical modelling and simulation technology, which can help the pharmaceutical industry bring sustainable treatments to the public in less time. Modelling and simulation can transform the entire drug development procedure and save lives. It takes up to 15 years to license a drug because it must pass through the standard clinical trial phases and procedures. According to the US Food and Drug Administration (FDA), only about one in ten drugs developed during research is approved for medicinal use, even after substantial resources have been spent on research and trials. Integrating advanced technology and optimizing clinical research and trial procedures will shorten the timeline for developing life-saving drugs at a reasonably affordable cost. The novel coronavirus emerged in December 2019; in February 2020 the World Health Organization (WHO) named the disease COVID-19, describing it as a global threat and declaring the outbreak a global health emergency. There is no specific treatment for the virus, which has affected more than 20 million people globally, including more than 4 million in the United States.
Figure: Global coronavirus statistics as of April 2020.
Machine learning supports disease identification by making use of the available image and textual data. It requires large amounts of data to classify and predict disease patterns, and it is useful for analysing the nature of COVID-19 across the globe. The pandemic has drawn many researchers and scientists to the problem; some have used X-ray image data provided by Johns Hopkins University to build models that classify images as COVID-19 or not (Zhang et al., 2020). The image data is converted into metadata and integrated with textual clinical reports so that the disease can be categorized more easily, which helps detect the type of coronavirus from its early stages and symptoms.
The Role of Advanced Analytics in the Development of a COVID-19 Vaccine
COVID-19 is a major case for applying machine learning and artificial intelligence to optimize clinical trials for drugs and vaccines. Researchers have used these tools to optimize the whole process, from tracking hospital capacity to identifying high-risk patients. The purpose of advanced analytics is to connect a variety of sources, ranging from research papers to clinical trials and drug development data, using patented natural language processing (NLP) techniques and curated classifications to extract context and broad insights and to understand the spread of the disease. Researchers expected these technologies to help prepare for similar circumstances in the future; however, the disease outpaced the technology and showed that it still needs more work and maturity to resolve a pandemic. The quality of the data, the methods of accessing it and the networks for sharing it all affect the accuracy of an algorithm. Scientists and experts across technology and health research have been working together to find a vaccine for COVID-19, incorporating artificial intelligence into their research to defeat the virus.
Artificial intelligence (AI) is being relied on in clinical trials in the hope of making a difference and learning how the virus behaves in the body. Natural language processing (NLP) is a branch of AI that allows software to read and analyse written and spoken language. In healthcare and medicine, NLP allows a computer program to search doctors' notes and pathology reports for potential participants in a clinical trial (Charles & Emrouznejad, 2019). The difficulty is that this data is unstructured: the text is usually free-flowing, and information may be implicit and require background knowledge to understand (Shi et al., 2020). Doctors also have several ways of describing the same condition. Diabetes, for example, may be recorded as diabetes mellitus, and a heart attack may be recorded as a myocardial infarction, a myocardial infarct or MI. An NLP program can be trained to map these terms and group them under a single condition, and the resulting algorithm can then be used to interpret unannotated records. Many open-source web tools help researchers and administrators search databases without needing a technical background. These tools work by translating a request into a standardized, coded query format that the database can understand, and considerable effort is going into making this task easier.
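The term-mapping idea can be sketched with a simple dictionary lookup. This is only an illustration: the synonym table below is hypothetical and hand-written, whereas a trained NLP system would learn or import such mappings from a curated clinical vocabulary, and this sketch does not handle negation ("no history of MI") or context.

```python
import re

# Hypothetical synonym table mapping one canonical condition to its surface forms.
SYNONYMS = {
    "myocardial infarction": ["heart attack", "myocardial infarction",
                              "myocardial infarct", "mi"],
    "diabetes mellitus": ["diabetes", "diabetes mellitus", "dm"],
}

def build_patterns(synonyms):
    """Compile one case-insensitive whole-word pattern per canonical condition."""
    patterns = []
    for concept, terms in synonyms.items():
        # Longest terms first so that longer phrases are tried before their prefixes.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(term) for term in ordered)
        patterns.append((concept, re.compile(r"\b(?:%s)\b" % alternation, re.IGNORECASE)))
    return patterns

def extract_concepts(note, patterns):
    """Return the sorted canonical conditions mentioned in a free-text note."""
    return sorted({concept for concept, pattern in patterns if pattern.search(note)})

PATTERNS = build_patterns(SYNONYMS)

notes = [
    "Pt admitted after heart attack; hx of DM.",
    "Prior MI in 2015.",
    "Routine follow-up visit.",
]
for note in notes:
    print(extract_concepts(note, PATTERNS))
```

Grouping by canonical condition is what lets a later query for "myocardial infarction" find records that only ever say "heart attack" or "MI".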
Methodology
The procedure comprises five steps: 1. data collection, 2. definition and refinement of the relevant datasets, 3. preprocessing, 4. feature extraction, and 5. classification with traditional and ensemble machine learning algorithms. The data in the proposed methodology is presented in charts and graphs.
1. Data collection
Research centres, hospitals and other health facilities have made pandemic data available through open-source repositories such as GitHub, and this research drew on those repositories for data collection and analysis. We obtained data on about 212 patients who showed signs and symptoms of the coronavirus and of other viruses. The data had several attributes: patient ID, age, temperature, name, lymphocyte count, neutrophil count, leukocyte count, offset, pO2_saturation, sex, finding, survival, intubated, date, location, view, folder, went_ICU, needed_supplemental_O2, extubated, modality, and DOI.
2. Relevant datasets
During data collection and analysis, several algorithms were used to define and refine the relevant datasets extracted from clinical notes and findings before preprocessing. The datasets fall into four classes: ARDS, SARS, COVID, and both (COVID and ARDS), as shown in the graph below.
3. Preprocessing
Preprocessing comprises the steps needed to refine the data, in phases, so that machine learning can be applied. This stage removes unnecessary text, punctuation, symbols, stop words and links to improve the accuracy of the data, as shown in the image below.
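A minimal sketch of such a cleaning step is given here, using only regular expressions from the standard library. The stop-word list is a short illustrative assumption rather than the list used in the study, and the sample report is invented.

```python
import re

# Small illustrative stop-word list (an assumption, not the study's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "with", "was", "is", "in", "to", "for", "on"}

def preprocess(text):
    """Lowercase, strip links, punctuation and symbols, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove punctuation and symbols
    return [token for token in text.split() if token not in STOP_WORDS]

report = ("Patient presented with fever & dry cough; "
          "see https://example.org/report#12 for the CT findings!")
tokens = preprocess(report)
print(tokens)
# ['patient', 'presented', 'fever', 'dry', 'cough', 'see', 'ct', 'findings']
```

Links are removed before punctuation so that a URL is dropped as a whole rather than broken into stray word fragments.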
After the data has been defined and refined, specific features are extracted according to predetermined semantics and converted into numerical values using the TF-IDF (term frequency-inverse document frequency) technique. In this case, 40 features were identified and then supplied as input to the machine learning algorithms.
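The TF-IDF weighting can be illustrated with a short sketch. It assumes the plain textbook definition, term frequency multiplied by log(N / document frequency); library implementations often use smoothed or normalized variants, so their numbers will differ. The three tokenized documents are invented examples.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized clinical notes.
docs = [
    ["fever", "cough", "fever"],
    ["cough", "fatigue"],
    ["cough", "pneumonia", "hypoxia"],
]
w = tf_idf(docs)
print(w[0])
```

In this example "cough" appears in every document and so receives a weight of zero, while "fever", which is frequent in the first document and absent elsewhere, receives the highest weight there.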
The Machine Learning Categorization
The text is classified into four categories: ARDS, SARS, COVID (a person with coronavirus), and both (a person with both COVID and ARDS). Several supervised ML algorithms were applied across all categories: Multinomial Naïve Bayes (MNB), support vector machine (SVM), decision tree, random forest, stochastic gradient boosting and logistic regression.
The logistic regression algorithm models the relationship between the numerical input variables and the class label, and calculates the probability of class membership with the logistic function.
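The standard logistic regression model, which is presumably the formula intended here, can be written as follows. The symbols are assumptions introduced for illustration: x_j denotes the j-th of the 40 input features, the beta terms are the learned coefficients, and the second line gives the usual multinomial (softmax) extension for the four classes.

```latex
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-z}},
\qquad z = \beta_0 + \sum_{j=1}^{40} \beta_j x_j

P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{c=1}^{4} e^{z_c}},
\qquad z_k = \beta_{k,0} + \sum_{j=1}^{40} \beta_{k,j} x_j
```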
Multinomial Naïve Bayes uses Bayes' rule to compute the class probabilities of the provided text.
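As a hedged illustration of how Bayes' rule is applied to text, the sketch below implements a small Multinomial Naïve Bayes classifier with Laplace smoothing (the smoothing constant `alpha` is an assumption). It works in log space and compares unnormalized posteriors, which is sufficient for choosing the most probable class. The documents and labels are invented.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {term for doc in docs for term in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {}
    for label in class_counts:
        total = sum(term_counts[label].values())
        log_prior = math.log(class_counts[label] / len(docs))
        log_lik = {
            term: math.log((term_counts[label][term] + alpha) / (total + alpha * len(vocab)))
            for term in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    def score(label):
        log_prior, log_lik = model[label]
        # Terms outside the training vocabulary are ignored.
        return log_prior + sum(log_lik[term] for term in doc if term in log_lik)
    return max(model, key=score)

# Hypothetical tokenized clinical notes and labels.
docs = [
    ["fever", "cough", "anosmia"],
    ["cough", "anosmia", "fatigue"],
    ["hypoxia", "ventilator", "infiltrates"],
    ["hypoxia", "infiltrates", "sedation"],
]
labels = ["COVID", "COVID", "ARDS", "ARDS"]

model = train_mnb(docs, labels)
print(predict_mnb(model, ["anosmia", "cough"]), predict_mnb(model, ["hypoxia", "sedation"]))
```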
The support vector machine (SVM) is a supervised ML algorithm that constructs a classifier for grouping text into categories. In this study, the 40 features selected during feature engineering, with their values, are represented as a table and used as input.
Results
Tables
Charts
The system used for the research ran the Microsoft Windows operating system on a 3.88 GHz processor with 6 GB of RAM. The machine learning categorization was executed with the scikit-learn toolkit, with the help of several libraries, such as STOPWORDS, to improve the accuracy and correctness of the algorithm pipeline. Deeper insights into the data were obtained from statistical computations on the datasets, with 70% of the data used for model training and the remaining 30% used for testing (Khanday et al., 2020). The classifiers were supplied with the features obtained in the feature engineering step. To assess how well the model generalizes from the training data to unseen data, and to minimize the chance of overfitting, we split the original dataset into separate training and test subsets.
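A 70/30 split of the kind described can be sketched as follows. The shuffle seed and the use of integer record identifiers are assumptions for illustration; the study's own splitting procedure is not specified beyond the proportions, and a stratified split that preserves the class balance may be preferable with only 212 records.

```python
import random

def train_test_split(items, test_fraction=0.3, seed=42):
    """Shuffle a copy of the items and split off the requested test fraction."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in identifiers for the 212 patient records.
records = list(range(212))
train, test = train_test_split(records)
print(len(train), len(test))  # 148 64
```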
Each algorithm underwent tenfold cross-validation, repeated independently five to six times, to ensure that no bias arose from the partitioning of the dataset during validation. Table 1 (above) provides a comparative analysis of all the classical ML methods used in the task, while Table 2 (above) compares the classical ML and ensemble learning methods used to classify clinical text into the four groups. After training and testing, logistic regression and Multinomial Naïve Bayes gave the best results, with 94% precision, 96% recall, a 95% F1-score and 96.2% accuracy. Random forest and gradient boosting also performed relatively well, with 94% accuracy each (Khanday et al., 2020). The model was evaluated in two stages: in stage 1, where less data was used, accuracy was 75%; in stage 2, where all the data was used, accuracy rose. It is therefore clear that the more data the model is given, the more accurate its results and the better its performance.
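The fold construction and the relationship between the reported metrics can be checked with a short sketch. The fold generator below is a plain, unshuffled tenfold partition written for illustration (in practice the indices would be shuffled before each repetition); the F1 calculation uses the standard harmonic-mean definition applied to the precision and recall reported above.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices); fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

folds = list(k_fold_indices(212, k=10))
print([len(validation) for _, validation in folds])
# [22, 22, 21, 21, 21, 21, 21, 21, 21, 21]

reported_f1 = f1_score(0.94, 0.96)
print(round(reported_f1, 2))  # 0.95
```

With 94% precision and 96% recall, the harmonic mean is about 0.9499, which is consistent with the reported 95% F1-score.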
Conclusion
The lack of a vaccine or drug to treat COVID-19 has caused great concern among researchers and scientists, and many researchers and institutions are working closely together to find a cure. After the ML algorithms were run on the clinical reports of 212 patients in four categories (COVID, SARS, ARDS, and both COVID and ARDS), logistic regression and Multinomial Naïve Bayes gave the best results, with 94% precision, 96% recall, a 95% F1-score and 96.2% accuracy. The other algorithms also showed good results, but they could not be relied on to the same extent. Increasing the amount of data can improve the models' efficiency. The analysis and classification could also be carried out by gender, to determine which gender is most affected, identify the influencing factors and decide on appropriate countermeasures. Future work should apply more feature engineering to obtain better results and should explore deep learning models.
References
Battineni, G., Chintalapudi, N., & Amenta, F. (2019). Machine learning in medicine: Performance calculation of dementia prediction by support vector machines (SVM). Informatics in Medicine Unlocked.
Charles, V., & Emrouznejad, A. (2019). Big Data for the Greater Good: An Introduction. In Big Data for the Greater Good (pp. 1-18). Springer, Cham.
Khanday, A. M. U. D., Rabani, S. T., Khan, Q. R., Rouf, N., & Din, M. M. U. (2020). Machine learning based approaches for detecting COVID-19 using clinical text data. International Journal of Information Technology, 1-9. https://doi.org/10.1007/s41870-020-00495-9
Loginov, M., Marlow, E., & Potruch, V. (2018). Predictive modeling in healthcare costs using regression techniques. ARCH 2013.1 Proceedings, 1-32.
Mukherjee, S. (2019). Predictive Analytics and Predictive Modeling in Healthcare. Available at SSRN 3403900.
Mursalin, M., Zhang, Y., Chen, Y., & Chawla, N. V. (2017). Automated epileptic seizure detection using improved correlation-based feature selection with random forest classifier. Neurocomputing, 241, 204-214.
Ravì, D., Wong, C., Deligianni, F., Berthelot, M., Andreu-Perez, J., Lo, B., & Yang, G. Z. (2016). Deep learning for health informatics. IEEE journal of biomedical and health informatics, 21(1), 4-21.
Shi, F., Wang, J., Shi, J., Wu, Z., Wang, Q., Tang, Z., ... & Shen, D. (2020). Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for covid-19. IEEE reviews in biomedical engineering.
Van Calster, B., Wynants, L., Timmerman, D., Steyerberg, E. W., & Collins, G. S. (2019). Predictive analytics in health care: how can we know it works? Journal of the American Medical Informatics Association, 26(12), 1651-1654.
Zhang, K., Liu, X., Shen, J., Li, Z., Sang, Y., Wu, X., ... & Ye, L. (2020). Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of covid-19 pneumonia using computed tomography. Cell.
Zhou, J., Li, E., Wang, M., Chen, X., Shi, X., & Jiang, L. (2019). Feasibility of stochastic gradient boosting approach for evaluating seismic liquefaction potential based on SPT and CPT case histories. Journal of Performance of Constructed Faci