Using data and machine learning to improve your organisation’s phishing resilience and security education program
During COVID, the 2020 Harvey Nash KPMG CIO survey found that 40% of organisations had experienced an increase in phishing and malware. Phishing is nothing new, but it remains one of the most common security threats today. Having the right information at the right time is critical to combating phishing threats before they turn into a costly ransomware event or mass data exfiltration from your organisation.
Organisational phishing resilience comes down to two factors. The first is timely reporting of phishing threats inside an organisation, so the source can be blocked before it spreads further. Technology is not in the starring role here: people are. Your people need to understand how to recognise a phishing threat and how to report it, and that requires ongoing and engaging security education and awareness that turns the dial on a risk-based culture. Technology plays the supporting role by providing a simple and easy reporting method, such as an Outlook add-in that gives your people a 1-click way to report phishing attempts. The key words are ‘simple and easy’: asking people to attach the suspicious message to another email and send it to a reporting address will inevitably result in phishing reporting being relegated to the ‘Urgh - too hard’ basket.
The second factor is understanding what drives phishing susceptibility, through both quantitative and qualitative analysis. To achieve this, you need to gather data on your organisation’s phishing susceptibility from two sources: phishing simulations, and real phishing reports submitted by your people and analysed by your security operations team.
By collecting a large amount of this data, you can use machine learning to uncover behavioural patterns that will optimise your awareness campaigns and help you make informed changes to bring about real cultural change. It also provides you with the ability to measure and justify the impact of your awareness program and allows you to tailor reporting to different stakeholders.
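As a rough illustration of measuring impact, the sketch below compares click rate and report rate across two phishing simulation campaigns. The campaign names and counts are invented for the example, and `summarise` is a hypothetical helper rather than part of any particular simulation product.

```python
# Hypothetical simulation results (invented numbers, for illustration only):
# (campaign name, emails delivered, people who clicked, people who reported)
campaigns = [
    ("before-training", 200, 50, 20),
    ("after-training", 200, 30, 60),
]


def summarise(rows):
    """Return click rate and report rate for each campaign."""
    summary = {}
    for name, delivered, clicks, reports in rows:
        summary[name] = {
            "click_rate": clicks / delivered,
            "report_rate": reports / delivered,
        }
    return summary


summary = summarise(campaigns)
for name, stats in summary.items():
    print(f"{name}: clicked {stats['click_rate']:.0%}, reported {stats['report_rate']:.0%}")
```

A falling click rate alongside a rising report rate is the kind of trend you can show to stakeholders, although two campaigns on their own are far too few to draw firm conclusions.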
There are a few different machine learning models that can be used for data analysis, such as KNN classifiers and regression. For phishing, however, decision tree analysis works well in helping to identify potential factors behind why people in your organisation are susceptible. The following model shows an example of how decision tree analysis can be used to analyse phishing susceptibility. In this simplified example, you need to collect data on whether the person was using a mobile device when they clicked, and whether they are classed as a contractor. This comes down to correlating the data collected from the phishing simulation system with your HR data.
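To make the idea concrete, here is a minimal sketch of a decision tree built by hand in plain Python, using Gini impurity to choose each split. It is not the model from the presentation: the records are invented, and the field names (`mobile`, `contractor`, `clicked`) and helper functions are assumptions for illustration. In practice you would more likely use an established machine learning library on a much larger data set.

```python
# Hypothetical records (invented for illustration): one row per person per
# simulated phish, assumed to be already joined from the simulation system
# (mobile, clicked) and HR data (contractor).
records = [
    {"mobile": True,  "contractor": True,  "clicked": True},
    {"mobile": True,  "contractor": True,  "clicked": True},
    {"mobile": True,  "contractor": False, "clicked": True},
    {"mobile": True,  "contractor": False, "clicked": False},
    {"mobile": False, "contractor": True,  "clicked": False},
    {"mobile": False, "contractor": True,  "clicked": False},
    {"mobile": False, "contractor": False, "clicked": False},
    {"mobile": False, "contractor": False, "clicked": False},
]


def click_rate(rows):
    """Fraction of rows where the person clicked."""
    return sum(r["clicked"] for r in rows) / len(rows)


def gini(rows):
    """Gini impurity of the 'clicked' label (0.0 means everyone behaved the same)."""
    if not rows:
        return 0.0
    p = click_rate(rows)
    return 2 * p * (1 - p)


def split_impurity(rows, feature):
    """Size-weighted Gini impurity after splitting rows on a yes/no feature."""
    yes = [r for r in rows if r[feature]]
    no = [r for r in rows if not r[feature]]
    return (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)


def best_split(rows, features):
    """Feature whose split leaves the least impurity."""
    return min(features, key=lambda f: split_impurity(rows, f))


def build_tree(rows, features):
    """Recursively build a small tree as nested dictionaries."""
    leaf = {"click_rate": click_rate(rows), "n": len(rows)}
    if not features or gini(rows) == 0.0:
        return leaf
    feature = best_split(rows, features)
    yes = [r for r in rows if r[feature]]
    no = [r for r in rows if not r[feature]]
    if not yes or not no:
        return leaf  # the split separates nothing, so stop here
    remaining = [f for f in features if f != feature]
    return {
        "split_on": feature,
        "yes": build_tree(yes, remaining),
        "no": build_tree(no, remaining),
    }


tree = build_tree(records, ["mobile", "contractor"])
print(tree)
```

With this invented data the tree splits first on mobile use and then, among mobile users, on contractor status. With only eight rows it demonstrates the mechanics and says nothing about real susceptibility.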
Challenges remain, however, including data quality and richness, as well as the stability and trustworthiness of machine learning models. Data analysis and machine learning are not the solution by themselves, and this is where contextual understanding comes in. There is no substitute for both understanding the common susceptibility factors and gathering supplementary data about your people’s phishing response practices (for example, through interviews, feedback and surveys).
Ultimately, people click on phishing for a broad range of reasons that are not all easily explained by looking at data alone. These include:
- Our inherent cognitive biases
- Relationships with people who we digitally communicate with
- The design of phishing communications
- Time and timing of phishing delivery
- Curiosity and boredom
- A person’s individual risk propensity
- Their knowledge of how to recognise phishing (and what to do with it regardless of whether they have clicked or not)
Understanding the complex nature of human interactions is the first step in helping your organisation combat people-related security threats, and the data helps explain, confirm and communicate your resilience.
Acknowledgement: Thank you to Dr Yenni Tim, Senior Lecturer at University of NSW, for her collaboration with me on this article, as part of our presentation at AusCERT 2020.
Great article Bianca! Always love your engaging writing!
The amount of data you need does not always correlate with the time period it covers. Two years of data might be enough in one company and too little in another. We had an ML experiment published that predicted student exam results; it used two iterations of data and yielded good results. You want to look at the complexity of the problem at hand and the best model for the job, as some models require more data than others. Whilst not ideal, you can also create synthetic data if you have enough original data to base it on.