Analyzing Hospital Data with SQL
Introduction
As a physical therapist, I’ve spent years at patients’ bedsides—helping them recover, regain function, and navigate their hospital journeys. But I always had a quiet curiosity: What drives the decisions behind the care plans? How are hospital resources managed? Is care really equitable across patient populations?
When I stepped into this new hypothetical role as a healthcare data analyst, I was handed a long list of tasks by my overwhelmed boss. At first, it felt like stepping into a different world—spreadsheets instead of stretchers, SQL instead of SOAP notes. But soon, that initial pressure turned into a sense of excitement. This was a chance to look behind the curtain and explore the systems I’d been working within for years.
Why THIS Project?
This project became my way of connecting the dots between clinical care and data-driven decisions. By analyzing hospital stay lengths, identifying which specialties performed the most procedures, and examining patterns in lab testing and medication use, I could start to see how data tells a story—one that impacts real patients, real workflows, and real outcomes.
I was especially drawn to the questions about equitable care. As someone who has treated patients from all backgrounds, it was encouraging to see that the data showed minimal disparity in lab procedures across racial groups. That insight, along with others, reaffirmed why this work matters.
In short, this project wasn’t just about running queries—it was about blending my clinical lens with analytical tools to uncover insights that can genuinely improve patient care and help hospitals function more effectively.
Information on dataset:
The data set can be found on Kaggle and represents 10 years (1999-2008) of clinical care data from 130 US hospitals and integrated delivery networks. This extensive data provided a solid foundation for my analysis. There are 2 main tables:
Key Takeaways
The Data Analysis
Hospital Stay Length
Time in the hospital is a BIG part of hospital operations. The hospital has one limiting resource, their size. Our health care data analyst boss wants to know what the distribution of time spent in the hospital looks like. They're also curious to know if the majority stay less than 7 days. Once patients stay over 7 days, the hospital wants to ensure these patients are very acute. For this, I first created a bucket for the number of days patients stay in the hospital, and then created the bar histogram using the RPAD function.
This gave me the answer that most of the patients stay less than 7 days in the hospital.
Medical Specialties with the Highest Procedure Averages
Next, there is a hypothetical brand new Hospital Director who wants to know what medical specialties are doing the most number of procedures on average, since procedures make up the significant cost of the hospital. They've asked to get a list of all the specialties and the average total of the number of procedures.
I used a HAVING clause, limiting the focus only on specialties with over 50 patients and an average of more than 2.5 procedures. This helped highlight the departments driving the bulk of procedural activity. I also used a ROUND function, to round up my average procedures up to 1 decimal unit.
The findings indicate that specialties such as Thoracic Surgery, Cardiovascular Surgery, Radiology, Cardiology, and Vascular Surgery perform the highest average number of procedures. This suggests their critical role in the hospital, likely driven by high patient demand for both surgical interventions and diagnostic services.
Checking for Racial Disparities in Lab Procedure Access
Next, the Chief of Nursing wants to know if the hospital seems to be treating patients of different races differently, specifically with the number of lab procedures done.
To find this, I used a JOIN function, to join the demographic and health tables, using patient_num as a foreign key relationship. Then, I calculated the average number of lab procedures per racial group by grouping the data by race and sorting the results in descending order to identify any noticeable differences.
The analysis reveals that the average volume of lab procedures among different racial groups varies only slightly. This small difference indicates there is no significant disparity in how lab services are provided across races. Therefore, the data does not support the claim of unequal treatment based on race.
Exploring the Relationship Between Lab Procedures and Hospital Stay
Next, the boss asked me to give some insight on how the number of lab procedures might correlate to the number of days in hospital. They want to know do those who get a lot of lab procedures, stay in the hospital longer?
For this, I used a CASE WHEN function, that created a column called procedure_frequency that is either "few", "average", or "many". Next, by using GROUP BY and AVG functions I could evaluate how the average length of stay changes based on what procedure frequency group.
I found that there is a correlation between these; the longer someone is at the hospital the more lab procedures they have.
Recommended by LinkedIn
Patient ID List for Targeted Medical Testing
The research team issued an urgent request to identify patients with patient ID for upcoming medical tests—focusing on African American individuals and those with an increased metformin dosage.
To fulfill this, I used the UNION function to combine relevant results from the demographics and health tables, producing a unified list of patient IDs meeting either criterion.
Highlighting Emergency Patient Success Stories
The hospital administrator sought to identify notable cases involving patients admitted through the emergency department (admission type ID = 1) who had hospital stays shorter than the overall average.
To uncover these cases, I wrote a query that retrieved all patient records, applied a filter for emergency admissions, and further refined the results to include only those with a length of stay below the calculated overall average, using a subquery for comparison.
Top 50 Patients by Medication Count and Lab Procedures
Our boss has asked us to write a summary for the top 50 medication patients, make any ties with the number of lab procedures the sort the highest at the top. The output should be in the following format "Patient [patient_id] was [race] and was [readmitted/not readmitted]. They had [x] medications and [x] lab procedures."
For this, I used a query that combines patient demographic and medical data into a single summary string using SELECT and CONCAT, labels readmission status with a CASE statement, and joins the necessary tables. It ranks patients by medication count (breaking ties with lab procedures) using ORDER BY, and returns the top 50 results with LIMIT.
When I reviewed this, I realized there is a large gap in number of procedures. Further investigating, I realized that number of lab procedures were stored in VARCHAR, which I ALTER to INT (integer) to give me following result.
Ranking Common Diabetes Medications by Patient Age
Lastly, the Hospital manager asked me, "How do the top 3 medications are ranked per age group?" the top 3 medications were insulin, metformin, and glipizide. Now these 3 medications were in 3 different columns in the health table.
To get the answer, I used UNION to stack medication columns into a long format (transposing the table), then wrapped it in a CTE. I joined it with demographics to include age, grouped and counted records by age and medication, and used a window function to rank them.
Main Takeaways
Through this project, I learned that targeted data analysis can provide actionable insights that improve hospital operations and patient care. Here are some important points:
Conclusion and Personal Reflections
This project taught me a lot about the importance of data quality and the impact of analysis on real-world healthcare decisions. Cleaning the data was challenging, but it reinforced the idea that accuracy is vital for trust in healthcare analytics. I also gained insight into how data can drive equity and efficiency in patient care, shaping my perspective as I move forward in my career.
Call to Action
If you found this article insightful, I invite you to connect with me on LinkedIn! Let’s share thoughts and explore opportunities together. If you or someone you know is looking to hire a data/healthcare analyst, let's chat. I would love to hear your thoughts or answer any questions you have!
Enlighten! I like how you clearly presented the insight.
You’ve done a fantastic job with this project. Your personal voice adds real depth.
Very informative