JavaScript and the C Bug
Recently I came across one of those image posts with tips for programmers, and it brought back a topic I had been meaning to write about. The post in question was about array deduplication, and three solutions were given (as can be seen ahead). I won't reference it here, since my point isn't to pick on anyone; I just want to share some things I know, in the hope that they complement someone else's knowledge the same way other people's posts complement mine on a daily basis.
The moment I saw it I was bitten by a very annoying bug: the C bug. Let me explain: one of the first programming languages I learnt was C (in fact, it was the second one), almost 15 years ago. It has influenced the way I program ever since. The constant need to manage heap allocations, and the crude polymorphism built out of pointers to void and unions, has followed me like a ghost up the abstraction hill. When I'm short on time, I just ignore it (hey, we've all been there: a single line of expressive code for a lifetime of pain); when I have some time to spare, I make the cheap optimization it keeps asking for.
Then comes the deduplication post. The first alternative: fine, I'd write it myself to get things going and never look back as long as it stayed quiet. On solutions two and three, however, the first thing I saw was O(n²).
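The snippets in the post were shared as an image, so I won't reproduce them verbatim; roughly, though, the three alternatives had this shape (my reconstruction — the names are mine and the details may differ from the original):

// 1) Set-based: a Set keeps only unique values, then we spread it back into an array.
const dedupeSet = (arr) => [...new Set(arr)];

// 2) filter + indexOf: keep an element only when this index is its first occurrence.
//    indexOf rescans the array from the start on every callback call.
const dedupeFilter = (arr) =>
  arr.filter((item, index) => arr.indexOf(item) === index);

// 3) reduce + includes: rebuild the result one element at a time.
//    includes rescans the accumulator on every step, and the spread copies it again.
//    (The post's version reportedly also leaned on splice, which adds its own internal copying.)
const dedupeReduce = (arr) =>
  arr.reduce((acc, item) => (acc.includes(item) ? acc : [...acc, item]), []);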
We all know that Array.prototype.filter and Array.prototype.reduce must go through every element of the array, but on top of that Array.prototype.indexOf and Array.prototype.includes also have to scan, at worst, the whole array. So for each step of filter and reduce we check, at most, the entirety of the array. Also, in the third alternative, the splice itself hides a for loop that shallow-copies values from the source to the result (such is the power of abstractions). The functions were tested against an array of 100,000 random numbers; the measurements further down show how that plays out.
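To make that hidden loop concrete, here is a deliberately naive sketch of the work a splice-style removal has to do internally (an illustration only — the real engine implementation is far more involved):

// Naive model of arr.splice(start, deleteCount) with no insertions:
function naiveSplice(arr, start, deleteCount) {
  const removed = [];
  // first hidden loop: copy the removed elements out into the returned array
  for (let i = 0; i < deleteCount; i++) {
    removed.push(arr[start + i]);
  }
  // second hidden loop: shift the tail left to close the gap
  for (let i = start + deleteCount; i < arr.length; i++) {
    arr[i - deleteCount] = arr[i];
  }
  arr.length -= deleteCount;
  return removed;
}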
One could say they never have to deal with large datasets, so there is no need to worry about which solution gets implemented. Fair. However, it's worth pointing out that as code piles up, things like these can slip past simple benchmarks and end up hurting you.
Anyhow, let's not stop at the complexity analysis; let's take a look at the heap as well. Again, no deep dive into the V8 implementation or advanced algorithms, just what can be seen at first glance:
1) The solution using a set: besides the original array, we only allocate a Set holding the m unique values and the final array of length m that receives them;
2) The filter alternative: easy peasy filter squeezy. All we do is create a new array with the filtered data, which amounts to roughly 2n + m;
3) The splice solution is inspired by Pac-Man: once it starts, it will try to devour the whole RAM unless it's stopped. Each step of the reduce allocates a new array with length equal to the index of that step (plus the hidden for loop). Until the GC kicks in and, hopefully, cleans up our mess, the heap keeps piling up ever-larger partial copies of the result (see the counting sketch right below).
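To get a feel for that without a profiler, here's a small counting sketch, assuming the spread-based reduce variant from the reconstruction above (the counter and its name are mine):

// Instrumented version of the reduce + includes approach: count how many array slots
// get allocated across all the intermediate copies of the accumulator.
function dedupeReduceCounting(arr) {
  let copiedSlots = 0;
  const result = arr.reduce((acc, item) => {
    if (acc.includes(item)) return acc;
    copiedSlots += acc.length + 1; // the spread below copies the whole accumulator plus the new item
    return [...acc, item];
  }, []);
  console.log(`${copiedSlots} slots allocated to build ${result.length} unique values`);
  return result;
}

With m unique values that is 1 + 2 + … + m, roughly m²/2 slots, no matter how quickly the GC manages to reclaim them.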
Everything I said above can be seen just by looking at the code, but you can go a step further (not as far as profiling, at first) and use the tools you already have at hand to see how memory is being used. Check out the following snippet.
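The original snippet isn't reproduced here, but the idea was along these lines — time each function and peek at the heap with what Node.js already gives you (console.time and process.memoryUsage; the function names are the ones from the reconstruction above):

// Build a large input: 100,000 random integers with plenty of duplicates.
const input = Array.from({ length: 100_000 }, () => Math.floor(Math.random() * 1_000));

for (const dedupe of [dedupeSet, dedupeFilter, dedupeReduce]) {
  // heapUsed is noisy unless you force a GC between runs (e.g. node --expose-gc),
  // but the order of magnitude still comes through.
  const before = process.memoryUsage().heapUsed;
  console.time(dedupe.name);
  dedupe(input);
  console.timeEnd(dedupe.name);
  const after = process.memoryUsage().heapUsed;
  console.log(`${dedupe.name}: ~${((after - before) / 1024 / 1024).toFixed(2)} MB of heap growth`);
}

console.time only gives wall-clock magnitude and the heap delta is approximate, but that's enough to see the differences discussed next.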
If you run the lines above, don't get hung up on the numbers themselves; just pay attention to the order of magnitude of time and memory each solution takes. For instance, on my computer, the relative ordering of the three solutions was consistent across most runs.
I don't think you should optimize your code down to the last percentile. I just want to point out that you shouldn't pretend a computer is a box of limitless resources, and that by doing some simple analysis you can deliver better code, or at least earn some performance budget for the days you have to move fast or are too tired. Also, by publishing this, I hope others won't ignore these principles as I once did and later regretted.