A message from the next person to read your code: Good Programming Practices

This might be classified as a rather technical rant, but I decided to turn my frustration with having to deal with code built by others into something more productive, so I put together these guidelines for good programming practice based on my experience.

I am now a little over a month into my ML postdoc, and I am building on models created by a PhD graduate from the computer science department. I have never been a computer science major, but I was blessed with a PhD advisor who cared about the quality of results reporting, data, and code. I used to consider CS majors the gold standard of good programming practice until I stumbled upon this codebase. Do not get me wrong: the man is brilliant and he did a lot of work, but the number of programming (and scientific) sins I have found in this code deserves this post. So let's begin...

  • When you publish a study in a journal, you have to make sure that you can reproduce the results. That generally means having production-level code where all your trial and error is scrapped, leaving only the working model described in the manuscript. The next person should not have to redo your hyperparameter search because you ran other trials afterwards and changed the parameters that were published. Nor should the next person have to work out on their own what order to run your scripts in to get the same output. Usually, one would write a wrapper script that runs everything in the order it was meant to be run. If you want to build further on your work at some point (and you should), make sure to keep the production-level code saved in a separate location.
  • When reporting your results, make sure they are documented, which takes me to logging. The point of having a log folder is to know what happened at what time, so leaving timestamps out of your logs defeats the whole purpose. It also runs the terrible risk of overwriting your old logs, because every log gets saved under the same file name. The consequences can be serious: in the research world, it could look as though you fabricated your results.
  • I cannot believe that I am talking about this one, but I have seen it. When saving your working model, make sure to actually save all of its relevant parameters, and be super careful with this step. When a wrapper script feeds parameters into your model and saves them as model arguments, make sure the model actually uses those parameters. Do not hardcode different values deep inside the code because you thought it would be easier to make your changes there. This would make me have little trust in your code... and in you as a person.
  • Memoization (saving the outputs of some computations and reloading them instead of rerunning the same computations) is a very useful way to avoid repeating long computations, but I should easily be able to rerun that long computation from the production code if I want to. I am not supposed to magically realize that I need to uncomment some scattered lines of code without clear instructions.
  • This should go without saying, but properly comment your code and make your variable names meaningful. Do not name the number of hidden layers in the model nhl. It can be intuitive to some, but some coders can also be into ice hockey.
  • Put all dependencies of your code in one location that is accessible to users. You are wasting my time if I find that your code depends on a file located in your personal folder, where I have no read permission.
  • If your code takes a long time to run, do not print at every single loop iteration, or at least add a debug mode if you want to keep track of everything.
  • If you are going to use Git to track your code, actually use it. Do not just initialize the repository and then ignore it.
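
The points above about timestamped logs and saved parameters can be sketched roughly as follows. This is only a minimal illustration in Python, not the code from the project discussed here: the function names start_run and train, the runs folder, and the parameter names are all hypothetical.

```python
import json
import logging
from datetime import datetime
from pathlib import Path


def start_run(params: dict, base_dir: str = "runs") -> Path:
    """Create a timestamped run folder, save the parameters, set up logging.

    Hypothetical helper for illustration; folder and file names are assumptions.
    """
    # A timestamp in the folder name means an old run is never overwritten.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    run_dir = Path(base_dir) / stamp
    run_dir.mkdir(parents=True, exist_ok=False)

    # Save exactly the parameters that are handed to the model.
    params_file = run_dir / "params.json"
    params_file.write_text(json.dumps(params, indent=2, sort_keys=True))

    # Every log line also carries its own timestamp via %(asctime)s.
    logging.basicConfig(
        filename=str(run_dir / "run.log"),
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        force=True,  # replace handlers left over from an earlier run
    )
    logging.info("Run started with params: %s", params)
    return run_dir


def train(params: dict) -> None:
    """Stand-in for the real training code (hypothetical).

    It reads its settings from `params` only, with no hardcoded overrides.
    """
    logging.info("Training with %d hidden layers", params["num_hidden_layers"])
```

A wrapper script would call start_run(params) once and then pass the same params dictionary to train(params), so the saved params.json and the values the model actually used cannot drift apart.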

I think I managed to vent "some of" my frustration while producing something that could (hopefully) be useful to someone. Tomorrow, I need to go back to the lab and see whether the code that runs the test set on about 1000 different trained models has finished finding the one that actually achieves the performance reported in the published manuscript.

I would add a couple from my experience:

1- It's often helpful to save the git commit hash of the code version that was run for the paper, or even to keep recording it as you do more work and change your code iteratively. This generally helps track down weird results that actually stem from older versions of the code.
2- Writing useful git commit messages is helpful, even if just for yourself. Recently I also started working on a dev branch and then merging when a major milestone is achieved, making sure to describe the overall changes in the pull request. Again, this helps with frustrations.
3- If a data file is not large (e.g. a csv), I would save it as part of the git repository itself to help myself or the next person find and use it. Private git repositories make this possible even for work in progress.
4- Organizing the git repo logically saves so much time. Ideally, you'd want a "config" folder containing all the config files to be read. Avoid having hyperparameters declared at the start of the script.
5- This goes without saying, but wrapping everything into functions and classes, and using short functions (modularization), really goes a long way in helping oneself and others. It's also great for debugging.
6- Writing docstrings explaining the inputs and outputs of major functions is useful even for oneself.
7- Catching exceptions is good, but ideally one should specify which exception is being caught, and if that's not possible, at least print or log the exception. I've seen code that has try ... except ... pass (!!), meaning the exception was caught without even alerting the user that there's a problem.
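
Points 1 and 7 can be combined in one small sketch, assuming Python and a git executable on the PATH. The function name current_commit_hash is hypothetical; the only git command used is git rev-parse HEAD, which prints the hash of the checked-out commit.

```python
import logging
import subprocess
from typing import Optional


def current_commit_hash(repo_dir: str = ".") -> Optional[str]:
    """Return the git commit hash of repo_dir, or None if it cannot be read.

    Hypothetical helper for illustration only.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
    except (subprocess.CalledProcessError, FileNotFoundError) as err:
        # Name the expected failures (not a git repo, git or folder missing)
        # and log them, instead of a silent `except: pass`.
        logging.warning("Could not read git commit hash: %s", err)
        return None
    return result.stdout.strip()
```

Storing the returned hash next to each set of results (for example in the run's log or saved parameters) makes it possible to check out the exact code version later with git checkout followed by that hash.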

More articles by Mohamed Abdelhack