Vision Zero is a multinational effort to address the growing problem of traffic fatalities worldwide. Its ultimate goal is zero deaths from traffic crashes. According to the National Highway Traffic Safety Administration, 36,096 people were killed in motor vehicle traffic crashes in 2019, and that number rose to 38,824 in 2020. I firmly believe that applying advanced machine learning techniques to historical crash data and other interlinked data sources may help identify potential solutions to this growing problem in the United States.
There are several publicly available data sources, including but not limited to traffic crashes, weather, and roadway infrastructure. I plan to develop new insights and post updates to this article as I progress through my analysis.
I picked a state in the U.S. and downloaded its traffic fatalities database for 2019, which has a total of 3,300 records and 114 columns. After several data quality checks, I removed only two records from the database. A quick review of traffic volumes (average daily traffic, or ADT) indicates a right-skewed distribution with a median of around 25,000 vehicles and a mean of approximately 35,000 vehicles. Although the plot shows many outliers, I retained them because they are plausible real-world values rather than data errors.
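A check like the one described above can be sketched with Python's standard library. This is only an illustration: the function name `skew_summary`, the 1.5 × IQR outlier rule, and the sample volumes are my assumptions, not the actual data or the exact method used in the analysis.

```python
import statistics

def skew_summary(values):
    """Summarize a numeric column: mean, median, and 1.5*IQR outliers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # quantiles(n=4) returns the three quartile cut points (Q1, Q2, Q3)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "mean": mean,
        "median": median,
        # A mean above the median is a simple sign of a right-skewed distribution
        "right_skewed": mean > median,
        "outliers": outliers,
    }

# Hypothetical traffic volumes, made up for illustration only
sample_adt = [4000, 9000, 15000, 22000, 25000, 27000, 31000, 40000, 52000, 143000]
summary = skew_summary(sample_adt)
print(summary)
```

A flagged value is only a candidate for review. A very high volume on a busy urban interstate is a legitimate observation, which is why flagged records can reasonably stay in the dataset.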
I ended up with 34 columns after removing several that I thought would add little value to the analysis. A correlation matrix of the remaining numerical columns shows a few apparent relationships, while most of the other columns appear largely independent of one another.
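For readers unfamiliar with a correlation matrix, here is a minimal sketch of how one can be computed with Pearson's coefficient. The column names (`adt`, `surface_width`, `speed_limit`) and their values are hypothetical stand-ins, not the real database columns.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every column pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric columns, made up for illustration only
columns = {
    "adt": [5000, 12000, 23000, 32000, 143000],
    "surface_width": [24, 30, 58, 60, 90],
    "speed_limit": [51, 63, 38, 67, 56],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near +1 or -1 suggest a strong linear relationship between two columns, while values near 0 suggest little linear relationship.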
As part of the feature engineering process, I consolidated the "Contributing Factors" column from 728 unique values to 11. Because the final database has a mix of numerical and categorical columns, I used the K-Prototypes clustering technique, which handles both types. Based on the elbow chart below, I decided to use five clusters.
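To show why K-Prototypes suits mixed data, the sketch below implements its dissimilarity measure: squared Euclidean distance on the numeric columns plus a weight `gamma` times the number of mismatched categorical values. It covers only the assignment and cost step, not the full algorithm (the prototype update is omitted), and it is not the implementation used for this analysis. The records, prototypes, and `gamma` value are hypothetical.

```python
def mixed_distance(record, prototype, gamma):
    """K-Prototypes dissimilarity between a record and a cluster prototype.

    Each argument is a pair: (numeric values, categorical values).
    """
    num_a, cat_a = record
    num_b, cat_b = prototype
    numeric_part = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    categorical_part = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric_part + gamma * categorical_part

def assign_and_cost(records, prototypes, gamma):
    """Assign each record to its nearest prototype and sum the distances."""
    labels = []
    total_cost = 0.0
    for record in records:
        distances = [mixed_distance(record, p, gamma) for p in prototypes]
        best = min(range(len(prototypes)), key=distances.__getitem__)
        labels.append(best)
        total_cost += distances[best]
    return labels, total_cost

# Hypothetical records: (standardized speed limit, standardized ADT),
# (road class, area type). Values are made up for illustration only.
records = [
    ((-1.0, -0.8), ("County", "Rural")),
    ((-0.9, -0.7), ("County", "Rural")),
    ((1.2, 1.5), ("Interstate", "Urban")),
    ((1.0, 1.4), ("Interstate", "Urban")),
]
prototypes = [
    ((-1.0, -0.75), ("County", "Rural")),
    ((1.1, 1.45), ("Interstate", "Urban")),
]
labels, cost = assign_and_cost(records, prototypes, gamma=0.5)
print(labels, cost)
```

An elbow chart plots this total cost against the number of clusters. The cost always falls as clusters are added, so the "elbow" is the point where adding another cluster stops reducing it by much.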
I have created a Tableau dashboard to depict the characteristics of each of the five clusters.
- First Cluster: It has 346 (10.5%) crashes, the lowest traffic volumes, the highest percentage of trucks, no median, narrow outside shoulders, an average speed limit of 51 mph, and the lowest average roadway surface width. It has the lowest number of crashes related to active transportation modes and the highest number of crashes that occurred on County roads. Most of the crashes occurred in rural communities.
- Second Cluster: It has 600 (18.2%) crashes, an average ADT of 32K, the second-highest percentage of trucks, the highest average median width plus inside shoulders, the second-highest average outside shoulder width, and an average speed limit of 67 mph. It ranks third in the number of pedestrian-related crashes, and most crashes occurred on U.S. and State highways and interstates. Most of the crashes occurred in rural communities.
- Third Cluster: It has 449 (13.6%) crashes, the highest average ADT of 143K, the second-lowest truck percentage, the second-highest average median width plus inside shoulders, the highest average outside shoulder width, an average speed limit of 56 mph, and the highest average surface width of 90 feet. It has the second-highest number of pedestrian-related crashes, and the majority of crashes occurred on interstates. Most of the crashes occurred in communities with over 250K population.
- Fourth Cluster: It has the second-highest number of crashes at 928 (28.1%), the second-lowest average ADT, an average truck percentage of 12%, no median, narrow outside shoulders, an average speed limit of 63 mph, and the second-lowest surface width. It has the second-highest number of bicycle-related crashes and around 60 pedestrian-related crashes, ranking fourth out of the five clusters, and the majority of crashes occurred on U.S. and State highways and Farm to Market roadways. Most of the crashes occurred in rural communities.
- Fifth Cluster: It has the highest number of crashes at 975 (29.6%), an average ADT of 23K, the lowest truck percentage of 6%, the second-lowest average median width plus inside shoulders, narrow outside shoulders, an average speed limit of 38 mph, and an average surface width of 58 feet. It has the highest number of crashes related to active transportation modes, and the majority of crashes occurred on city streets. Most of the crashes occurred in communities with over 250K population.
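As a quick sanity check, the cluster shares can be recomputed from the crash counts listed above. The counts come from the list itself; the variable names are mine.

```python
# Crash counts per cluster, taken from the list above
cluster_counts = {
    "First": 346,
    "Second": 600,
    "Third": 449,
    "Fourth": 928,
    "Fifth": 975,
}

# The total should match the 3,300 downloaded records minus the 2 removed
total = sum(cluster_counts.values())

# Share of all crashes per cluster, as a percentage rounded to one decimal
shares = {
    name: round(100 * count / total, 1)
    for name, count in cluster_counts.items()
}

print(total)
for name, share in shares.items():
    print(f"{name} Cluster: {share}%")
```

The counts sum to 3,298, which matches the number of records left after the data quality checks, and the computed shares agree with the percentages quoted for each cluster.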