A Machine Learning based framework to identify health conditions and risks in diabetes from internet data.
Over the past 3 months, I had the pleasure of collaborating with Zhongshen Zeng (Boston, USA), Jacob Odada (Johannesburg, ZAF), Guomei Wang (St. Louis, USA), and Rabab Kaheel (Cairo, EGY) on an open source project. We experimented with an innovative method of mining chronic disease information from two popular internet platforms: Google Search and Twitter.
Traditional healthcare studies based on sources such as government surveillance, vital statistics, Medicare claims, ER records, and healthcare surveys have long publishing cycles. Integrating data from these various sources also raises concerns about data quality and consistency. There is a potential for selection bias as well, since the data consist primarily of people who often seek medical treatment.
People are increasingly turning to the internet with their healthcare-related questions and concerns. Our proposed method looks to user-generated data from search engines and social networks to provide an overview of the health risks and concerns associated with a particular chronic condition. This method has an advantage over traditional studies in terms of speed and cost. It also has an advantage in terms of data recency, since it enables the collection and analysis of large, near real-time datasets. We chose diabetes as the study topic since it is a global issue. Awareness of and education about diabetes have huge implications for the people around us.
Google Trends is the platform for exploring Google Search queries. There are many ways to ask a question on Google. Our intuition was that there are correlations between these questions and the answers that we seek. In this case, we started with 2 questions of our own: “what are diabetes risks” and “what are diabetes conditions”. From there, we used the Google Trends API to extract all the related search queries over the course of 7 years. The result consisted of 36,000 pairs of queries and related queries. We represented each query as a node in a network and connected each query to its related queries. We then used a network algorithm to uncover communities within the network. Each community consisted of a group of related queries. We were able to identify common comorbidities such as heart disease, stroke, and kidney failure. In addition, we captured less well-known yet serious risks such as cerebral vascular accidents, paroxysmal, chlamydia, and hemiplegia.
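To make the network step concrete, here is a minimal sketch of how pairs of queries and related queries can be turned into a graph and split into groups. It is an illustration under assumptions, not the project's code: the write-up does not name the network algorithm, so the sketch uses connected components, the simplest possible stand-in for community detection, and the sample pairs and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(pairs):
    """Build an undirected adjacency map from (query, related_query) pairs."""
    graph = defaultdict(set)
    for query, related in pairs:
        graph[query].add(related)
        graph[related].add(query)
    return graph


def connected_groups(graph):
    """Return sets of queries reachable from one another, largest first.

    Connected components are a stand-in here; a real community detection
    algorithm would also split large components into denser sub-groups.
    """
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            group.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return sorted(groups, key=len, reverse=True)


# Hypothetical sample pairs, not taken from the project's data.
sample_pairs = [
    ("diabetes risks", "diabetes heart disease"),
    ("diabetes heart disease", "diabetes stroke"),
    ("diabetes conditions", "diabetes kidney failure"),
]

groups = connected_groups(build_graph(sample_pairs))
for group in groups:
    print(sorted(group))
```

On a dataset the size of ours, most queries would likely end up in one large component, which is why an algorithm that looks for densely connected sub-groups is the better fit in practice.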
Twitter was the other internet data source. We processed 1 PB of data, which consisted of 2 months of tweets at the minute level. Our hypothesis was that similar topics are often tweeted together. We calculated the vector distance between diabetes and other common medical conditions in the data. The smaller the distance, the stronger the correlation. We validated our findings against an official dataset from the Centers for Disease Control and Prevention. Indeed, obesity, hypertension, kidney disease, and abnormal blood pressure were among the highly correlated conditions.
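The distance idea can be sketched in a few lines. The write-up does not say how the vectors were built (word embeddings are one possibility), so this example makes a simpler assumption: each term is represented by a 0/1 vector marking which tweets mention it, and terms are compared by cosine distance. The tweets, term list, and function names are all hypothetical.

```python
import math


def term_vectors(tweets, terms):
    """Represent each term as a 0/1 vector with one entry per tweet."""
    lowered = [tweet.lower() for tweet in tweets]
    return {
        term: [1 if term in text else 0 for text in lowered]
        for term in terms
    }


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


# Hypothetical sample tweets, not taken from the project's data.
tweets = [
    "Managing diabetes and obesity together is hard",
    "New study links obesity to diabetes",
    "Diabetes and hypertension checkup today",
    "Flu season is here",
]
terms = ["diabetes", "obesity", "hypertension", "flu"]

vectors = term_vectors(tweets, terms)
distances = {
    term: cosine_distance(vectors["diabetes"], vectors[term])
    for term in terms
    if term != "diabetes"
}

# Conditions mentioned alongside diabetes most often come first.
ranked = sorted(distances, key=distances.get)
print(ranked)
```

In this toy sample, a term that never shares a tweet with diabetes gets the maximum distance of 1.0, while terms that are frequently tweeted together with it move toward 0.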
We used open source tools (e.g. Python, SQLite) and low-cost cloud platforms to derive insights that previously required large funding and long timelines to attain. With minor changes to the data collection parameters, we can apply the same framework to learn about other healthcare topics. User-generated data from search engines and social networks may not provide precise information, as they are also influenced by other factors, but they offer a quick overview of the issues, medical and otherwise, that are of concern within general populations. There is a huge opportunity to improve the analysis of online data in relation to health care, which will hopefully improve access to education about, and management of, chronic conditions.