Math for Data Science: Do we really need it?
In my last post, I wrote about learning statistics for data science. In this post, I want to talk about mathematics, which is often mentioned in tandem with statistics as a core data science skill. The necessity of learning statistics is clear: it is the only way to learn about correlation and causation. However, I think math has a different story. I have always loved math and have always been good at it. Here, I am going to demystify math as it is mentioned in the data science context.
Math for data science reminds me of Andrew Ng's famous course on machine learning and how he taught it from a mathematical point of view. It was the only course that taught machine learning in this way, and I guess it still is. I can also remember course assignments where students were supposed to write algorithms from scratch in MATLAB. For me, learning mathematically was joyful, but the assignments seemed totally weird and irrelevant, much like reinventing the wheel. The other disadvantage of the course, in my eyes, was that the weeks' contents were not independent. To follow week 5, you had to start from the very beginning to get an idea of the math formulas if your mind was not warmed up.
My first question is: Do we need math to get a basic understanding of what we are doing, or do we need it only for fine-tuning algorithms? Is it a basic part of data science or a fancy one? Do we need to delve deep into every algorithm we are learning, or would a general understanding and knowledge of its applications suffice? My second question is: Which math topics do we need to learn, and how? Some people might recommend taking a linear algebra course, but I doubt that course instructors teach from the perspective a data scientist needs. If you have taken a math course for data science purposes, please share your experience.
If you are a data scientist and you studied math, maybe these are serious questions for you because you are probably obsessed with thinking about how your formal education helps a data science project.
I am waiting to hear your ideas about the use of math in data science.
I've written about this on my blog (http://www.mlopt.com/?p=19) and had a subsequent discussion/debate with Francisco J Martin, the CEO of Big ML, on the topic (http://www.mlopt.com/?p=153). In short, I strongly suspect it is very important to understand the math and statistics to be able to thoughtfully configure the algorithms and interpret the results. Not understanding statistics also leads to bad interpretations and can lead to bad predictions, such as the Google Flu. Something that Jeremy Howard from Kaggle said in a panel discussion (https://www.youtube.com/watch?v=b4zr9Zx5WiE) a few years ago really resonates with me: "We’ve measured the performance of our 100,000 users as they enter predictive modelling competitions to see who is the best at making accurate predictions… I know most of the guys at the top of that community, [and] the top 10 do stuff that the top 100 can’t dream of, the top 100 do stuff the top 1000 can’t dream of, and the top 1000 do stuff 100 times faster than the top 10000 can. There is this massive curve of capability and speed. There is very little [discussion] around at the moment of how we train the next generation of those top 10, how do we identify them… of our competition, nearly all of our winners from the past year actually learnt about machine learning by watching Andrew Ng’s YouTube lectures and Coursera lectures. Literally the best place to be trained right now is in online courses." As you say, Andrew Ng teaches from the mathematical perspective. His successful students truly understand the machine learning methods they are using... and all of the best data scientists were trained this way, according to the man whose job it is to evaluate the ability of 100 thousand data scientists.
Up to a certain extent, I agree with your point that our conventional educational methods do not help a lot in the professional world. But in a data scientist role, math enables you to understand the WHY factor and creates a window that helps other people understand your work and projects.
My own background is physics and actuarial science. In both cases mathematics is useful; however, I do not think it is entirely necessary, and it may even be a barrier. To provide an analogy, in studying options pricing most actuaries get taught how to derive the Black-Scholes equation starting from first principles. This is fine, but in practice most people are pricing baskets of options, not individual options with a fixed exercise date, which is the case the Black-Scholes solution describes. Indeed, outside the small group of people building the dedicated technical models for pricing, many traders rely on stated formulae or shortcuts and typically price using heuristics. In short, the world generally favours pragmatism. Returning to your question, I sometimes think that it becomes harder to see the wood for the trees when you get embroiled in the maths. What most people need and want are high-level details on different methodologies, quick ways to deliver the solutions, and an understanding of what the downsides are and where the methods are inappropriate. Again, we all happily use the sort button in Excel. Do we care which sorting algorithm is doing this behind the scenes, or how efficient it is? Probably not. So in short, someone needs to follow the math, but it certainly isn't the bulk of people who could benefit from such methods but are currently facing a barrier to understanding. OK, that's just my view, and slightly contrary to the others!