How I learned to Data Science
There are few things that I know for sure, and one of them is that I am not a Data Scientist.
Having said that, over the past year or so, I've been learning how to use Python and its many modules to shape stagnant data into something useful. After having spent this much time working on multiple projects, I feel like that might actually be the true definition of Data Science:
Using advanced tools to shape stagnant data into something useful
Reading the real definition, it turns out I'm not that far off.
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
The experts that shaped me
Now I'm going to spend a few minutes walking you through the steps I took to start diving into the world of Data Science.
I have to admit right out of the gate that there are a few people, both coworkers and distant educators, who helped me down the path. One guy I worked with, Dr. Petropoulos (I kid you not), really got me interested in the field. His knowledge of how Data Science and Machine Learning worked made me feel like I had no idea what he was talking about. But then some of the classes by Hugo Bowne-Anderson on DataCamp really pushed me over the edge to try new things.
To confirm my thoughts that I might be doing Data Science-y things, I reviewed some of my work with a few true Data Scientists at a bar in Reston, VA who were in town for an annual conference and visiting from the USF Diabetes Health Research Center. They confirmed my fears that I might be slowly becoming a Data Scientist.
Just to be clear, the only programming language I've used so far is Python. There are many other options like R, SAS, and Java, but I chose Python.
The Language that shaped my Data
Ok, so let's get started.
I started by relearning the basics of Python: opening files, setting variables, etc. Boring stuff. Then I moved into NumPy arrays, list comprehensions, and Pandas DataFrames. NumPy arrays store data in multi-dimensional structures, list comprehensions let you calculate and build lists from many elements in a single command, and Pandas DataFrames let you ingest, structure, view, and shape data into something humans can understand and use.
NumPy array:
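Here's a minimal sketch of creating a multi-dimensional NumPy array, using made-up placeholder values:

import numpy as np

# A 2-dimensional NumPy array: 2 rows by 3 columns of made-up values
myarray = np.array([[1, 2, 3],
                    [4, 5, 6]])

print(myarray.shape)   # (2, 3)
print(myarray * 2)     # math applies element-wise across the whole array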
List Comprehension Calculation:
# Scale each odd number from 1 to 99 and collect the results in a new list
myarray = [(arrayelement * 16) / 12 for arrayelement in range(1, 100) if arrayelement % 2]
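And a minimal Pandas DataFrame sketch, assuming a small made-up set of records (the groups, techniques, and counts below are purely illustrative):

import pandas as pd

# Made-up records of how often each group used a technique
df = pd.DataFrame({
    "group": ["APT1", "APT1", "APT29"],
    "technique": ["T1059", "T1566", "T1059"],
    "count": [4, 2, 7],
})

print(df.head())                           # view the structured data
print(df.groupby("group")["count"].sum())  # shape it into a quick summary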
The Data I shaped
Side Note: how to import the modules mentioned above and below.
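Assuming each package is already installed (for example with pip), the conventional imports and aliases look like this:

# Standard aliases for the modules used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import seaborn as sns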
However, a Pandas DataFrame on its own only lets an analyst view the data as text. I needed a way to view the data that made it easy to quickly spot trends, patterns, and groupings. Before the next section I should add that you need a working platform that supports modules that give you a canvas to work with your data. So I started learning about Jupyter, which lets you run code and view canvas-based graphs in your browser, and matplotlib.pyplot, a canvas for visualizing the data.
This lets you view trending patterns over time for groupings of data. The graph above shows shared techniques across MITRE ATT&CK hacking groups. Cool.
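I can't share the exact code behind that graph, but the general pattern looks something like this sketch, using made-up monthly counts per group:

import pandas as pd
import matplotlib.pyplot as plt

# Made-up monthly technique counts for two groups, just to show the plotting pattern
data = pd.DataFrame({
    "month": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"] * 2),
    "group": ["APT1"] * 3 + ["APT29"] * 3,
    "techniques": [5, 8, 6, 3, 7, 9],
})

# Draw one line per group over time on the same canvas
for name, grp in data.groupby("group"):
    plt.plot(grp["month"], grp["techniques"], label=name)

plt.legend()
plt.title("Techniques observed per group, by month")
plt.show()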
Other data I used was unstructured, and I had no idea what the content looked like. In situations like that you may wind up with graphs like this: Not cool.
Graphs like that are really only helpful to the analyst in showing that they've figured out how to view the data, even if they still don't actually know what the data looks like.
I then wanted a way to view link analysis of MITRE ATT&CK groups to TTPs, so I learned about NetworkX and Seaborn. This actually turned out to be helpful in a different project I can't display here.
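The gist of the link analysis, sketched with NetworkX and a handful of made-up group-to-TTP edges (the real graph is much larger):

import networkx as nx
import matplotlib.pyplot as plt

# Build a graph linking threat groups to the TTPs they use (made-up edges for illustration)
G = nx.Graph()
G.add_edges_from([
    ("APT1", "T1059"),
    ("APT1", "T1566"),
    ("APT29", "T1059"),
    ("APT29", "T1078"),
])

# TTPs shared by multiple groups show up as nodes with more than one connection
nx.draw_networkx(G, with_labels=True, node_color="lightblue")
plt.show()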
The future I'm predicting from the data my code shaped
As I continued to shape the data into something useful, I started to be able to view Cyber Threat Actor attribution of active threats across sectors in the United States.
As I started moving into visualizing the data and seeing the trends, patterns, and groupings the shaping was producing, I started to be able to predict sector-based Cyber Threat Actor attacks from month to month.
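I can't share the real data here, but at its simplest the idea starts with counting attacks per sector per month and looking at the trend. A minimal sketch with made-up incident records:

import pandas as pd

# Made-up incident records: one row per observed attack
attacks = pd.DataFrame({
    "month": pd.to_datetime(["2019-01-05", "2019-01-15", "2019-02-03", "2019-02-20"]),
    "sector": ["Finance", "Energy", "Finance", "Finance"],
})

# Count attacks per sector per month
monthly = (attacks.groupby([attacks["month"].dt.to_period("M"), "sector"])
                  .size()
                  .unstack(fill_value=0))

# A rolling average gives a naive estimate of what next month might look like
print(monthly)
print(monthly.rolling(window=2, min_periods=1).mean())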
So, in closing: by using high-quality stagnant data we just had lying around, I was able to start performing advanced analytics against structured and unstructured data to help my community better understand their current sector-based threat landscape and what might be in store for them in the near future.