What does Engineering for Data-Science really mean?


I define engineering as “all the tech-work one needs to do to make a solution work in practice”. Here I will discuss only the kind of engineering relevant to a data scientist.

The most fundamental engineering skill is the ability to convert any idea or algorithm in your mind into working code in a programming language such as Python, R, or C++. The best way to acquire this skill is practice: solve a lot of algorithmic problems, or Kaggle problems. For a given problem, you can study the algorithm in detail, but you should code it up yourself without looking at the solution. Start with simple problems and slowly work your way up to more difficult ones. I would also highly recommend taking a proper course on algorithms and data structures, to train your mind to think algorithmically.
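As a concrete instance of this kind of practice, one might read a description of binary search and then implement it from scratch without peeking at a reference solution (the problem choice here is mine, just an illustration):

```python
def binary_search(arr, target):
    """Return the index of target in the sorted list arr, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2        # midpoint of the current search window
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1            # target can only be in the right half
        else:
            hi = mid - 1            # target can only be in the left half
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))   # found at index 3
print(binary_search([1, 3, 5, 7, 9], 4))   # absent: -1
```

Writing even a classic algorithm like this yourself, then checking it against edge cases (empty list, element absent), is exactly the kind of exercise that builds the idea-to-code muscle.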

Now, coming to how ML solutions are implemented in practice: one common approach is to break the implementation into the following three parts:

  1. Data pipelines: In this step you bring together all the relevant data from various sources, using tools such as SQL, Hive, or Spark. You clean the raw data and convert it into processed features.
  2. Model training: This step takes the dataframe of processed features and uses it to train a model. Its output is a model object, which is updated periodically.
  3. Model run: This is the code that runs in real time (e.g. when a user sends a request), applies the model object, and produces the final result. This step must run very fast.
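The three parts above can be sketched in miniature. Everything here is illustrative: the function names (`build_features`, `train_model`, `serve_request`) are hypothetical, and in a real system step 1 would run on SQL/Hive/Spark and step 2 would use a proper ML library rather than a toy mean-based “model”:

```python
import statistics

# 1. Data pipeline: clean raw records and turn them into processed features.
def build_features(raw_records):
    features = []
    for rec in raw_records:
        if rec.get("price") is None:   # drop rows with missing data
            continue
        features.append({
            "price": float(rec["price"]),
            "is_weekend": int(rec.get("day") in ("Sat", "Sun")),
        })
    return features

# 2. Model training: consumes features, outputs a model object.
#    Rerun periodically so the stored object stays fresh.
def train_model(features):
    return {"mean_price": statistics.mean(f["price"] for f in features)}

# 3. Model run: executed per request; must be fast, so it only *applies*
#    the pre-trained object instead of recomputing anything heavy.
def serve_request(model, request):
    return model["mean_price"] * (1.1 if request["is_weekend"] else 1.0)

raw = [{"price": 10, "day": "Mon"}, {"price": 20, "day": "Sat"}, {"price": None}]
model = train_model(build_features(raw))
print(serve_request(model, {"is_weekend": True}))   # roughly 16.5
```

The point of the separation is that each part can fail, be tested, and be scheduled independently: the pipeline and training run in batch, while the model run sits on the request path.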

A good data scientist should take end-to-end responsibility for their work (the three steps above) and ensure that all of them run smoothly.

Before writing any code, one should properly think through and plan how the implementation will be done. Unfortunately, “proper thinking” is a widely ignored skill among engineers, especially in this era of over-emphasis on being “hands-on”. The truly good engineers I have worked with put significant time and effort into thinking, and the result is bug-free code that many others can reuse and that lasts for years. In summary, better thinking is a crucial ingredient that ultimately leads to better engineering.

Another point: although it is useful to follow a list of “best engineering practices”, it is even more important to ask “what properties should my solution have?” and to anticipate “where can things go wrong?”. Some examples:

  1. How much will my model be affected if the last two days of data failed to get logged in the database? Would my model fail, or produce a bad output?
  2. What if the model-training step fails today? Do I have a backup plan?
  3. Is it easy to make small changes to my code? For example, if I want to add another feature to the model, would I need to rewrite the whole thing?
  4. How do I reduce unexpected behaviour from the model? If it happens, how do I get notified? What sanity checks should I put in place?
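A few of these checks can be sketched in code. This is only an illustration of the pattern, not a prescribed implementation; the thresholds, the fallback value, and the function names are all made up for the example:

```python
import datetime

def check_data_freshness(latest_log_date, max_lag_days=2):
    """Sanity check: is the most recent logged data within the allowed lag?

    Answers question 1 above: if this returns False, alert instead of
    silently training on stale data.
    """
    lag = (datetime.date.today() - latest_log_date).days
    return lag <= max_lag_days

def safe_predict(model_fn, request, fallback_value, lo, hi):
    """Run the model with a backup plan and an output sanity check.

    If the model run fails, return a safe fallback (question 2);
    if it produces an implausible value, clamp it to a known-plausible
    range (question 4).
    """
    try:
        pred = model_fn(request)
    except Exception:
        return fallback_value              # backup plan: e.g. yesterday's average
    return min(max(pred, lo), hi)          # clamp to a plausible range
```

In practice checks like these would feed a monitoring or alerting system, but the habit matters more than the tooling: decide up front what “wrong” looks like, and make the code detect it.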

Finally, one should not try to master a hundred different tools; there are far more valuable things to invest your time in. But it is very important to develop an “engineering attitude”: a strong inclination not just to design a solution, but to actually implement it and make it work. You should be ready to learn any new tool, or to break through any tech hurdle you encounter during work. These tasks are usually not difficult, and are often the more boring part of a data scientist’s work, but whenever required one must roll up one’s sleeves and do it.
