Writing "Better" Code
Although I cannot boast of being a hardcore programmer, or a hacker for that matter, as a Data Scientist I code a lot to build and validate my models. The obvious languages of choice for fast prototyping are R and Python, and at our workplace we mostly work with R. I find R very useful for writing code quickly and with few (or no) bugs, but coming from a procedural-language background, I also find that R lacks performance when written in a procedural style. R is a functional language, yet certain tasks in our day-to-day work are either difficult to express functionally or have no built-in libraries/packages available in R.
"Adversity is the mother of inventions", but obviously without re-inventing the wheels, we try our best to write a good piece of software. When I say that the software is"good", it does not necessarily imply that it must satisfy all the software engineering principles laid down in those well written SWE bibles. It is one thing to write how a software must be written and completely another thing when we write it for real world usage. We do not write the procedural part of a code in R as it is quite time consuming and so we dedicate that workload to C++ using the Rcpp libraries. Following are some of the day to day observations and approaches that I found helps me to write a "better" software :
- Use efficient algorithms and data structures. Nothing new here: hash, hash and hash wherever possible, do not repeat any operation on parts of the input, and cache the results. Don't hesitate to use advanced data structures wherever required if they give, say, a 3x to 4x speedup or more (a memoization sketch follows this list).
- Use built-in libraries wherever possible; they are optimized for speed and memory. E.g. std::sort for sorting or std::nth_element for median finding (see the sketch after this list).
- Optimize your code as much as possible. If you are using C++, pass arguments by pointer or by (const) reference; this avoids copying memory and hence reduces memory consumption. I use C++11, which comes with lambda expressions. Lambdas are short and crisp and, paired with algorithms such as std::for_each or std::transform, are generally at least as fast as hand-written for or while loops (see the sketch after this list).
- For any matrix operations, use vectorized built-in libraries; these are orders of magnitude faster than for or while loops. E.g. matrix multiplication in R: t(m) %*% m, or an outer product of two vectors where an OR operation is used instead of multiplication: outer(u, v, function(x, y) x | y). (A vectorized Rcpp sketch follows the list.)
- Parallelize the code wherever possible. The benefit of functional programming is most visible in parallelism and concurrency, since it avoids shared state. If you are doing multiple independent operations on the input data, you can use mclapply in R or the Message Passing Interface (MPI) in C++ to distribute the operations across cores. Threading in C++ or Java can be useful, but threading generally performs worse than multi-core distributed computing, so if the number of cores is not a limitation, definitely go for distributed computing (a minimal MPI sketch follows the list).
- When working with large data sets where approximate results do no harm, use approximation (probabilistic) algorithms instead of polynomial- or exponential-time deterministic algorithms. Probabilistic algorithms generally use multiple hash functions to split the work into smaller subsets; the operation on each subset has some error bound due to the approximation, but when multiple votes are combined, the overall probability of error turns out to be quite low. These methods are much faster and well suited for real-time analytics. E.g. HyperLogLog counting, Count-Min sketches, MinHashing, Reservoir Sampling, etc. (a reservoir-sampling sketch follows the list).
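The C++ sketches below expand on the list; they are minimal illustrations with hypothetical function names and toy inputs, not production code. First, hashing and caching: memoizing a made-up expensive() computation with a hash map so repeated inputs are never recomputed.

```cpp
#include <cstdint>
#include <unordered_map>

// Stand-in for some costly deterministic computation.
std::uint64_t expensive(std::uint64_t x) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < 1000000; ++i) acc += (x ^ i);
    return acc;
}

// Same computation, but results are cached in a hash map:
// repeated calls with the same argument cost one O(1) average lookup.
std::uint64_t cached_expensive(std::uint64_t x) {
    static std::unordered_map<std::uint64_t, std::uint64_t> cache;
    auto it = cache.find(x);
    if (it != cache.end()) return it->second;
    const std::uint64_t result = expensive(x);
    cache.emplace(x, result);
    return result;
}
```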
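Next, the built-in library point: std::nth_element finds a median in average O(n) time without a full sort. The sketch handles the odd-length case for brevity; even lengths would need the mean of the two middle values.

```cpp
#include <algorithm>
#include <vector>

// Median of an odd-length sample: partially order the vector so that the
// middle element lands in its sorted position, then read it off.
double median(std::vector<double> v) {  // taken by value so the caller's data stays untouched
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    return v[mid];
}
```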
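For the point about avoiding copies and using C++11 lambdas: here a large input is passed by const reference, and std::transform with a lambda replaces the hand-written loop.

```cpp
#include <algorithm>
#include <vector>

// Scales every element of a (possibly large) vector. The input is passed by
// const reference so no copy is made; the lambda captures the scale factor.
std::vector<double> scale(const std::vector<double>& input, double factor) {
    std::vector<double> out(input.size());
    std::transform(input.begin(), input.end(), out.begin(),
                   [factor](double x) { return x * factor; });
    return out;
}
```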
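For vectorized matrix operations, the same t(m) %*% m example can be pushed into C++ with RcppArmadillo, which delegates the product to optimized BLAS routines rather than explicit loops; crossprod_cpp is a hypothetical name.

```cpp
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// Computes t(m) %*% m using Armadillo's vectorized matrix product.
// [[Rcpp::export]]
arma::mat crossprod_cpp(const arma::mat& m) {
    return m.t() * m;
}
```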
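For distributing independent operations across cores with MPI, here is a minimal sketch (assuming an MPI implementation such as Open MPI, compiled with mpic++ and launched with mpirun): each rank processes its own slice of the data and the partial results are reduced on rank 0. The workload is a toy sum, purely for illustration.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Hypothetical workload: every rank sums its own share of 0..999.
    const int n = 1000;
    double local_sum = 0.0;
    for (int i = rank; i < n; i += size)  // round-robin split across ranks
        local_sum += static_cast<double>(i);

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```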
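Finally, a sketch of one of the probabilistic techniques mentioned above, reservoir sampling: it keeps a uniform random sample of k items from a stream of unknown length using only O(k) memory.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Classic reservoir sampling (Algorithm R): every item of the stream ends up
// in the k-element reservoir with equal probability.
std::vector<double> reservoir_sample(const std::vector<double>& stream, std::size_t k) {
    std::vector<double> reservoir;
    reservoir.reserve(k);
    std::mt19937 gen(42);  // fixed seed, only for reproducibility of the sketch

    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (i < k) {
            reservoir.push_back(stream[i]);
        } else {
            // Item i replaces a random reservoir slot with probability k / (i + 1).
            std::uniform_int_distribution<std::size_t> dist(0, i);
            const std::size_t j = dist(gen);
            if (j < k) reservoir[j] = stream[i];
        }
    }
    return reservoir;
}
```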
// Writing comments on your code is just a fad.