A Machine Learning based framework to identify health conditions and risks in diabetes from internet data.

Michelle Fukunaga, MBA

Published Apr 27, 2019

Over the past 3 months, I had the pleasure of collaborating with Zhongshen Zeng (Boston, USA), Jacob Odada (Johannesburg, ZAF), Guomei Wang (St. Louis, USA), and Rabab Kaheel (Cairo, EGY) on an open source project. We experimented with an innovative method of mining chronic disease information from two popular internet platforms: Google Search and Twitter.

Traditional healthcare studies based on sources such as government surveillance, vital statistics, Medicare claims, ER, and healthcare surveys have long publishing cycles. Integrating data from these various sources raises concerns about data quality and consistency. There is also a potential for selection bias, since the data consist primarily of people who frequently seek medical treatment.

People are increasingly turning to the internet with their healthcare-related questions and concerns. Our proposed method looks to user-generated data from search engines and social networks to provide an overview of the health risks and concerns associated with a particular chronic condition. This method has an advantage over traditional studies in terms of speed and cost. It also has an advantage in terms of data recency, since it enables the collection and analysis of large, near-real-time datasets. We chose diabetes as the study topic because it is a global issue. Awareness of and education about diabetes have huge implications for the people around us.

Google Trends is the query platform for Google Search. There are many ways to ask a question on Google. Our intuition was that there are correlations between these questions and the answers that we seek. In this case, we started with 2 questions of our own: “what are diabetes risks” and “what are diabetes conditions”. From there, we used the Google Trends API to extract all the related search queries over the course of 7 years. The result consisted of 36,000 pairs of queries and related queries. We represented each query as a node in a network, with each pair linking a query to its related query. We then used a network algorithm to uncover communities within the network. Each community consisted of a group of related queries. We were able to identify common comorbidities such as heart disease, stroke, and kidney failure. In addition, we also captured less known, yet serious risks such as cerebral vascular accidents, paroxysmal, chlamydia, and hemiplegia.

Network graph of Google related search queries
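To make the community step more concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the article does not name the network algorithm we used, so this example swaps in a simple, deterministic variant of label propagation, and the query pairs in PAIRS are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (query, related query) pairs for illustration only.
# The real dataset contained about 36,000 such pairs.
PAIRS = [
    ("diabetes risks", "diabetes heart disease"),
    ("diabetes risks", "diabetes stroke"),
    ("diabetes heart disease", "diabetes stroke"),
    ("diabetes conditions", "diabetes kidney failure"),
    ("diabetes conditions", "diabetic neuropathy"),
    ("diabetes kidney failure", "diabetic neuropathy"),
    ("diabetes stroke", "diabetes conditions"),  # single bridge between groups
]


def build_graph(pairs):
    """Build an undirected graph: each query is a node, each pair an edge."""
    graph = defaultdict(set)
    for query, related in pairs:
        graph[query].add(related)
        graph[related].add(query)
    return graph


def label_propagation(graph, max_rounds=20):
    """Group nodes by repeatedly adopting the most common neighbor label.

    Assumption: ties are broken deterministically (keep the current label
    if it is among the most common, otherwise take the alphabetically
    smallest), whereas the textbook algorithm breaks ties at random.
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            counts = Counter(labels[neighbor] for neighbor in graph[node])
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            new_label = labels[node] if labels[node] in candidates else candidates[0]
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=sorted)


communities = label_propagation(build_graph(PAIRS))
for community in communities:
    print(sorted(community))
```

On this toy input the two densely linked groups of queries end up in separate communities even though one edge connects them. A production analysis would more likely rely on a graph library and a modularity-based method.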

Twitter was the other internet data source. We processed 1 PB of data, which consisted of 2 months of tweets at the minute level. Our hypothesis was that similar topics are often tweeted together. We calculated the vector distance between diabetes and other common medical conditions in the data. The smaller the distance, the stronger the correlation. We validated our findings against an official dataset from the Centers for Disease Control and Prevention, which we analyzed separately. Indeed, obesity, hypertension, kidney disease, and abnormal blood pressure were among the highly correlated conditions.
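The distance idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our pipeline: the article does not say how the vectors were built, so the example uses simple word co-occurrence counts and cosine distance, and the tweets in TWEETS and the term list TERMS are invented.

```python
import math
from collections import Counter

# Hypothetical tweets for illustration only.
TWEETS = [
    "diabetes and obesity raise heart risk",
    "obesity raises heart risk",
    "managing diabetes lowers heart risk",
    "hypertension and diabetes strain kidney",
    "flu season fever cough",
    "flu shot today",
]
TERMS = ["diabetes", "obesity", "hypertension", "flu"]


def cooccurrence_vectors(tweets, terms):
    """Represent each term by counts of the words it is tweeted with."""
    vectors = {term: Counter() for term in terms}
    for tweet in tweets:
        tokens = tweet.lower().split()
        for term in terms:
            if term in tokens:
                vectors[term].update(t for t in tokens if t != term)
    return vectors


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 means no shared context words."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(vectors, anchor):
    """Sort the other terms from closest to farthest from the anchor term."""
    distances = {
        term: cosine_distance(vectors[anchor], vec)
        for term, vec in vectors.items()
        if term != anchor
    }
    return sorted(distances.items(), key=lambda item: (item[1], item[0]))


vectors = cooccurrence_vectors(TWEETS, TERMS)
ranked = rank_by_distance(vectors, "diabetes")
for term, distance in ranked:
    print(f"{term}: {distance:.3f}")
```

In this toy data, obesity and hypertension share context words with diabetes and rank closest, while flu shares none and sits at the maximum distance. At the scale of real tweet data, learned embeddings would be a more likely choice than raw counts.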

We used open source tools (i.e., Python, SQLite) and low-cost cloud platforms to derive insights that previously required substantial funding and long timelines to attain. With minor changes in data collection parameters, we can apply the same framework to learn about other healthcare topics. User-generated data from search engines and social networks may not provide precise information, as they are also influenced by other factors, but they offer a quick overview of the issues, medical and otherwise, that are of concern within general populations. There is a huge opportunity to improve the analysis of online data in relation to health care, which will hopefully improve access to education and the management of chronic conditions.

