Understanding the Modern Suicide Epidemic using Python
In Steven Pinker's Enlightenment Now, various metrics are used to support the idea that, despite the overwhelming negativity portrayed all around us, humanity can be optimistic as we move further into the age of science, technology and reason. One of those metrics is the trend in suicide rates. Though the data is limited to three countries (Switzerland, the United States and England), the graph points towards an expectation of fewer deaths heading into 2020 and beyond.
Can we really be that optimistic about the future, or is human progress, more often than not, an illusion built on cherry-picked data? This article explores in detail whether the trend holds for 98 other countries, including those of different economic status and geographical location. Additionally, I try to determine whether any accurate predictors of suicide deaths can be found.
Because the data-set is complex, the analysis is broken down systematically into smaller chunks, each answering one question, before proceeding to the machine learning part: a correlation heat-map and a simple linear regression. Sanity checks for data integrity, multi-collinearity, endogeneity, etc. were carried out primarily through data visualisation.
Preliminary Data understanding (About the data)
The data-set describes 32 years of data from 1985 (inclusive) to 2016 across 101 countries, 6 generations and 6 age groups. A snippet of the attributes given in the data set can be found below. HDI (Human Development Index) for year, Country-year and population columns were not used in this analysis for reasons of data-insufficiency and redundancy. For the exact details on how the data was cleaned and transformed, you may PM me.
*Note: The results for 2016 are excluded from the analysis. The 2016 figures could signify a genuinely optimistic sign for humanity, or simply incomplete data collection. For simplicity's sake, we assume the latter.*
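The preparation described above can be sketched roughly as follows. This is only an illustration, not the cleaning code actually used: the rows are made-up toy values, and the column names (`HDI for year`, `country-year`, `suicides_no`, and so on) are assumptions modelled on the attributes named in the text.

```python
import pandas as pd

# Toy rows standing in for the real file; every value here is made up.
# Column names are assumptions modelled on the attributes named in the text.
raw = pd.DataFrame({
    "country": ["A", "A", "A"],
    "year": [1985, 2015, 2016],
    "sex": ["male", "female", "male"],
    "age": ["35-54 years", "15-24 years", "35-54 years"],
    "suicides_no": [10, 2, 1],
    "population": [1000, 900, 1100],
    "country-year": ["A1985", "A2015", "A2016"],
    "HDI for year": [None, 0.7, None],
})

# Columns the article says were not used (insufficient data or redundant).
unused = ["HDI for year", "country-year", "population"]
df = raw.drop(columns=unused)

# Exclude 2016, which the article treats as incomplete.
df = df[df["year"] < 2016].reset_index(drop=True)

print(df)
```

With the real file, `raw` would presumably come from something like `pd.read_csv(...)` instead of the toy frame.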
What can Age groups tell us about Suicide numbers?
Observation: The number of suicides follows the same trajectory for almost all age groups. The exceptions are age group 5-14, which remained constant at almost 0, and age group 35-54, which dipped earlier, in 2003, rose slightly in 2010 and then decreased again.
Analysis: It seems that, regardless of age group, the number of suicides first went up, then stabilised, and decreased in 2010. Although age group 35-54 is more erratic than the other age groups, there is no definitive period in which there were more suicides across or within age groups. There is little to no correlation between the number of suicides in each age group and time.
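A per-age-group trend of the kind discussed above could be computed along these lines. This is a minimal sketch on made-up toy rows; the column names `year`, `age` and `suicides_no` are assumptions, not confirmed by the article.

```python
import pandas as pd

# Made-up toy rows; several rows per (year, age) mimic multiple countries.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "age": ["5-14 years", "35-54 years", "35-54 years",
            "5-14 years", "35-54 years", "35-54 years"],
    "suicides_no": [1, 40, 60, 0, 50, 45],
})

# Sum suicides per year and age group, then put one age group per column
# so each column is a time series.
by_age = df.groupby(["year", "age"])["suicides_no"].sum().unstack("age")

print(by_age)
```

Calling `by_age.plot()` (which needs matplotlib installed) would then draw one line per age group.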
Can Gender tell us anything?
Observation: Evidently, suicide numbers are higher for males than for females. Male suicide numbers increased from roughly 1988 to 2003 and have decreased ever since (notwithstanding a slight increase in 2009). Female suicide numbers have been rather stagnant across the board, although they doubled in 1990 compared with 1988.
Observation: There is some consistency between age and gender in terms of suicide numbers. For both genders, the age group 5-14 has the lowest suicide numbers, the age group 35-54 has the highest, and the age group 55-74 has the second highest. The main difference is in third place: the 75+ age group for females, and the 15-24 age group for males.
Analysis: Age groups do not really determine the suicide numbers between the genders, so we should expect little correlation here.
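The ranking of age groups within each gender can be checked with a small pivot. The sketch below uses made-up toy values and assumed column names (`sex`, `age`, `suicides_no`); it only illustrates the comparison, not the actual figures.

```python
import pandas as pd

# Made-up toy rows; values do not come from the real data-set.
df = pd.DataFrame({
    "sex": ["male"] * 3 + ["female"] * 3,
    "age": ["5-14 years", "35-54 years", "55-74 years"] * 2,
    "suicides_no": [1, 30, 20, 0, 10, 8],
})

# Total suicides per age group, one column per gender.
totals = df.groupby(["sex", "age"])["suicides_no"].sum().unstack("sex")

# For each gender, list the age groups from most to fewest suicides.
ranking = {
    sex: totals[sex].sort_values(ascending=False).index.tolist()
    for sex in totals.columns
}

print(ranking)
```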
What can Generational groups tell us about Suicide numbers?
Analysis: The plot suggests that there is no correlation between generation and suicide numbers across time, with each generation showing a different pattern in its suicide numbers. However, it is worth noting that suicide numbers were roughly constant for certain generations during the following periods.
For the generations plotted in black and blue, suicide numbers were almost constant from 1991 to 2000. For the generation plotted in yellow, suicide numbers were almost constant from 2007 to 2015.
Investigating Country-specific age group - Cultural effect?
Observation: It could be useful to see whether a cultural or regional effect plays a small role in determining suicide numbers between different age groups. Comparing countries, we can see at the bottom that the USA and the United Kingdom, which share similar cultural norms, have a similar ordering of age groups, with red (age group 35-54) having the most suicides, followed by yellow (55-74). Ukraine has the same ordering of age groups, even though it is a very different country in terms of culture and standard of living. Russia too has a similar ordering of age groups.
Analysis: This argues against regional and/or cultural explanations: the country-specific variable is not a good predictor of suicide numbers across people of different age groups. There is little to no correlation between country and suicide numbers. The pie-chart below confirms this and shows the top 10 countries with the highest suicide numbers per 100k population.
Are poorer countries more susceptible?
Observation: Most of the high suicide counts, that is above 5,000, occurred in countries with a GDP per capita of 60,000 or below. Countries with a GDP per capita above 60,000 have fairly low suicide numbers.
Analysis: We may expect to see a negative correlation between GDP per capita and suicide numbers.
Does population affect poorer countries more?
Observation: We use population as a normalising variable in this case (suicides per 100k population), and we see that it has very little effect on the original plot above. It does, however, smooth out the scatterplot: more data points are spread horizontally, with more suicides appearing in countries with higher GDP per capita. Most of the high rates, that is above 50 suicides per 100k population, occur in countries with a GDP per capita of 60,000 or below, whereas countries with a GDP per capita above 60,000 mainly have fewer than 50 suicides per 100k population.
Analysis: We still expect to see a negative correlation between the two variables.
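The normalisation step and the expected sign of the relationship can be sketched as below. The four rows are made-up values chosen only to show the arithmetic, and the column names are assumptions.

```python
import pandas as pd

# Made-up toy rows, one per hypothetical country.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "suicides_no": [6000, 300, 50, 1200],
    "population": [30_000_000, 2_000_000, 500_000, 40_000_000],
    "gdp_per_capita ($)": [9000, 25000, 70000, 40000],
})

# Normalise raw counts to suicides per 100k population.
df["per_100k"] = df["suicides_no"] / df["population"] * 100_000

# Pearson correlation between GDP per capita and the normalised rate.
r = df["gdp_per_capita ($)"].corr(df["per_100k"])

print(df[["country", "per_100k"]])
print("correlation:", r)
```

A negative `r` on the real data would support the expectation stated above; the toy values here are constructed so that `r` comes out negative and say nothing about the actual result.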
Machine Learning - Correlation matrix via Heatmap
It turns out that generating a heat map of all the aforementioned independent variables is a powerful way of visualising relationships in a high-dimensional space.
As we can see, there is very little correlation between the independent variables, as supported by the graphs shown earlier. However, the heatmap shows a perfect negative correlation between age_category and generation_category, where correlation = -1.00. This could be due to the conversion of the generation and age categories into numerical variables, which could cause the discrepancy in the results attained here (for example, the group numbered 1 in the age category versus the group numbered 1 in the generation category). I still maintain that this correlation matrix supports our visual interpretations and the analysis that comes with them.
One key difference is that we were expecting a negative correlation between GDP per capita ($) and suicide numbers, whereas the correlation calculations indicate a very weak positive correlation. Gender category shows the strongest positive correlation, but even that is relatively weak at about 0.2.
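The suspected encoding effect can be reproduced on toy data. The sketch below is an illustration of the hypothesis, not a confirmation of it: the generation labels `G1` to `G6` are hypothetical placeholders, the `25-34 years` label is an assumption, and the suicide counts are made up. Ages are numbered youngest first and generations oldest first, so within a single year the two codes mirror each other exactly.

```python
import pandas as pd

# Age groups ordered youngest to oldest ("25-34 years" is an assumed label).
age_order = ["5-14 years", "15-24 years", "25-34 years",
             "35-54 years", "55-74 years", "75+ years"]
# Hypothetical generation labels, ordered oldest (G1) to youngest (G6).
gen_order = ["G1", "G2", "G3", "G4", "G5", "G6"]

# One made-up row per age group in a single year: the youngest age group
# belongs to the youngest generation, and so on.
df = pd.DataFrame({
    "age": age_order,
    "generation": gen_order[::-1],
    "sex": ["male", "female"] * 3,
    "suicides_no": [1, 12, 20, 35, 25, 9],
})

# Convert the categories to integer codes following the orders above.
df["age_category"] = pd.Categorical(
    df["age"], categories=age_order, ordered=True).codes
df["generation_category"] = pd.Categorical(
    df["generation"], categories=gen_order, ordered=True).codes
df["gender_category"] = (df["sex"] == "male").astype(int)

# Pearson correlation matrix of the numeric columns.
corr = df[["age_category", "generation_category",
           "gender_category", "suicides_no"]].corr()

print(corr.round(2))
```

Passing `corr` to `seaborn.heatmap(corr, annot=True)` would render it as a heatmap, assuming seaborn is installed. The perfect negative value here is purely a consequence of the numbering direction, which is one way the -1.00 in the article's matrix could arise.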
Machine Learning - Linear Regression (Russia)
*Note: I only conducted a linear regression analysis on Russia, which is the country with the highest number of suicides and suicides/100k population. Hence, I treat this country as a representative sample to describe the population data. Other forms of regression analysis, with different combinations of independent variables, could also be done for this study, but out of sympathy for readers I have included only one type of regression analysis, a linear one.*
(Russia - Female of different age groups)
(Russia - Male of different age groups)
We see that the regression yields a relatively weak fit: some of the r^2 values are below 0.8. All age groups show a decline in suicide numbers.
Conclusion: We can tentatively agree that there is reason for optimism about the human condition looking into the future, as suicides are expected to decrease in the coming years. However, although the regression model used for this analysis generalises the data well, it was not robust enough. The anomaly in the 2016 suicide numbers was not a big issue, precisely because the attributes given were not good enough predictors of suicides. To build a more accurate and robust model for an issue as complex as this, we should look into more data, such as reasons for suicide, health conditions, etc.
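For reference, a straight-line fit and its r^2 can be computed as in the sketch below. It uses `numpy.polyfit` rather than whichever library produced the plots above, the helper `fit_line` is a hypothetical name, and the yearly counts are made-up values for a single imaginary age group.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line through (x, y); returns slope, intercept, r^2.

    Assumes y is not constant, otherwise r^2 is undefined.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2

# Made-up yearly counts for one hypothetical age group.
years = np.arange(2005, 2015)
counts = np.array([520, 505, 498, 470, 465, 440, 445, 420, 400, 390])

# Use years since the first year so the intercept is the fitted 2005 value.
slope, intercept, r2 = fit_line(years - years[0], counts)

print("slope per year:", slope)
print("r^2:", r2)
```

A negative slope corresponds to a declining trend, and r^2 close to 1 means the line explains most of the year-to-year variation. Repeating the fit per gender and age group would give one slope and one r^2 per panel.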