Transforming yourself into a Data Engineer

When I started my professional career in 2016, data engineering was a hot skill. Over the past six years, it has only become hotter. During my learning period, the hype surrounding Hadoop dominated my quest to understand why this field came into existence. At the end of the day, no matter what occupation you choose, learning should begin with a basic understanding of what you will be achieving towards making this extremely co-dependent world a better place. That should be your only motivation, and all the good things will follow.

Having felt extremely overwhelmed at the beginning of my own journey, I want to clear the air for beginners by sharing the scope of this role, the technologies to be covered, and a detailed plan to kickstart their career as a data engineer.

So, let us begin our understanding of data engineering, and its hierarchy of needs, with a relatable example. Suppose you have just started working as a data engineer at an e-commerce company selling video games. The company has become a huge hit among young people over the last five years. You are assigned to a project building a recommendation system that suggests the games your customers are most likely to buy based on their history.

The data scientists will bring all their statistical knowledge to bear on the “gathered” data, understand the relationship between the various factors that influenced a customer towards checking out a particular product and the final decision they made (bought the product or declined it), and come up with a recommendation system.

In this whole process, the data engineer’s role might seem inconsequential, but here is the catch: data gathering is the most cumbersome activity among all the steps involved, and the data engineer is the one responsible for accomplishing it. Now you can see what data engineering is all about: it is the area that deals with the collection, movement, storage, and preparation of data.

Let us get into the technologies. If you are a novice with minimal or no experience, I would suggest you start your learning with Structured Query Language (SQL). Data engineering demands a lot of communication with data, and I do not think there is a better or more easily understandable language out there for interpreting it. SQL is easy to learn, and a substantial amount of time devoted to it each day for a couple of weeks will make you familiar with the language.

I found the w3schools SQL tutorial helpful during my journey.

https://www.w3schools.com/sql/sql_intro.asp
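To make “communicating with data” concrete, here is a minimal sketch using Python’s built-in sqlite3 module, so you can run SQL without installing anything. The purchases table and its rows are invented for illustration.

```python
import sqlite3

# Build a throwaway in-memory database with a hypothetical purchases table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, game TEXT, price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "Space Raiders", 59.99), (1, "Kart Mania", 39.99), (2, "Space Raiders", 59.99)],
)

# A typical question you would answer with SQL: total spend per customer.
rows = conn.execute(
    """
    SELECT customer_id, COUNT(*) AS games_bought, SUM(price) AS total_spend
    FROM purchases
    GROUP BY customer_id
    ORDER BY total_spend DESC
    """
).fetchall()

for customer_id, games_bought, total_spend in rows:
    print(customer_id, games_bought, total_spend)
```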

Going back to our earlier example of an e-commerce company, suppose your data scientist asks you for a list of all the customers who browsed your website that day, each tagged with the product they spent the most time on. This can be achieved with SQL as well as Python. However, Python syntax makes it achievable in fewer lines of code, and Python offers a wide range of data structures that bring a lot of efficiency to the storage and accessibility of data. Besides, if you want to generate this list daily, you can wrap the logic in a Python function and schedule it to run without manual intervention. We are going to talk about the scheduling part later, but here I want you to realize the prominence of Python in interpreting data. Python also comes with a renowned library called pandas, which is highly efficient for exactly this kind of task.
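Here is a minimal pandas sketch of that exact task, assuming a hypothetical browsing log with customer_id, product, and seconds_spent columns:

```python
import pandas as pd

# Hypothetical browsing log for a single day.
log = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2],
        "product": ["Space Raiders", "Kart Mania", "Kart Mania", "Dungeon Quest", "Space Raiders"],
        "seconds_spent": [120, 340, 45, 610, 90],
    }
)

# Total time per customer per product, then keep the single product
# each customer spent the most time on.
time_per_product = log.groupby(["customer_id", "product"], as_index=False)["seconds_spent"].sum()
top_product = time_per_product.loc[
    time_per_product.groupby("customer_id")["seconds_spent"].idxmax()
]
print(top_product)
```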

Udemy has been my best buddy through my learning phase.

https://www.udemy.com/course/complete-python-bootcamp/

https://www.geeksforgeeks.org/python-programming-language/ is a great website for reference.

At this point, we have covered the most important languages we should have in our armory for dealing with data. But what about the engine that drives our processing? This question leads us to Spark, a general-purpose distributed data processing engine suitable for large-scale data processing. So, to summarize, you will deal with and communicate with huge chunks of distributed data using SQL and Python, with Spark as your engine. Just to keep the number of languages to a minimum, I am stressing Python and SQL, but with Spark as the engine, languages like Scala, R, and Java can be used as per the convenience of the user.

https://www.udemy.com/course/taming-big-data-with-apache-spark-hands-on/
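To give you a first taste, here is a minimal PySpark sketch that redoes the earlier per-customer aggregation, assuming a local Spark installation and a hypothetical purchases.csv file:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster.
spark = SparkSession.builder.appName("purchases-demo").getOrCreate()

# Hypothetical file: one row per purchase with customer_id, game, price.
purchases = spark.read.csv("purchases.csv", header=True, inferSchema=True)

# The same aggregation as the earlier SQL example, but this one can
# run distributed across many machines.
spend = (
    purchases.groupBy("customer_id")
    .agg(F.count("*").alias("games_bought"), F.sum("price").alias("total_spend"))
    .orderBy(F.desc("total_spend"))
)
spend.show()
```

Notice how it reads like SQL expressed in Python; that amalgamation is what makes PySpark a natural next step after the first two languages.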

As a data engineer, you will navigate the file system to access files on a server, and you might also be required to move files from one file system to another. These operations require a good knowledge of some basic Linux commands. Do not try to memorize these commands; instead, try to understand their application.

https://www.udemy.com/course/linux-mastery/
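As an illustration of the kind of file movement involved, here is a small Python sketch of a routine task; the comments name the equivalent Linux commands (ls and mv), and the directory paths are hypothetical:

```python
import shutil
from pathlib import Path

landing = Path("/tmp/landing")   # hypothetical directory where files arrive
archive = Path("/tmp/archive")   # hypothetical directory we move them to
archive.mkdir(parents=True, exist_ok=True)

# Equivalent of `ls /tmp/landing/*.csv`: find the files to move.
for csv_file in landing.glob("*.csv"):
    # Equivalent of `mv`: relocate each file into the archive directory.
    shutil.move(str(csv_file), str(archive / csv_file.name))
```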

Having figured out the languages and the ways to engineer the data, next we must schedule our code to execute automatically every day. Airflow is widely used for scheduling Python-based Spark jobs.

https://airflow.apache.org/docs/stable/
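Here is a minimal sketch of what an Airflow DAG looks like, assuming Airflow 2.x and a hypothetical daily_recommendations.py PySpark script (the job from our earlier example):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Define a pipeline that Airflow will trigger once per day.
with DAG(
    dag_id="daily_recommendations",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit our hypothetical PySpark job each day, with no manual intervention.
    run_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit daily_recommendations.py",
    )
```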

In addition to the tools and technologies discussed above, a good understanding of Hadoop’s basics, working knowledge of a version control tool (preferably Git), and some familiarity with at least one cloud platform can prove invaluable.

https://docs.microsoft.com/en-us/azure/?product=featured

https://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html

https://freecodecamp.org/news/understanding-git-basics-commands-tips-tricks/

Mastering all the technologies discussed above will take time, but if you are serious about kickstarting your career as a beginner, obtaining a good working knowledge is quite achievable if you dedicate twelve weeks of your time to executing the plan below.

  1. For the first couple of weeks, spend your time learning SQL. Practice as much as you can, and try to communicate with data at every opportunity.
  2. For the next three weeks, jump onto Python. Once again, it is all about practice. In addition to the Udemy course, try some of the exercises on CodingBat. Focus on conditional statements, loops, data structures, and functions.
  3. Now, move on to Linux fundamentals and spend a week on them. As a Python data engineer, you will not spend all your life writing shell scripts, so a basic understanding of the commands should suffice.
  4. Having got your hands dirty with coding for six weeks, it is time to acquire some theoretical knowledge. Try to understand the world of big data. See how Hadoop acts as a solution to the big data problems faced by clients. Read about at least one cloud platform. Do not spend more than two weeks on these theoretical concepts, to avoid burnout.
  5. The two weeks on SQL and three weeks on Python will have given you nominal expertise in Python and a decent hold on SQL. Now, for the next three weeks, proceed with the PySpark course by Frank Kane. It should teach you all the basics of Spark required for cracking the interview, along with some introductory knowledge of PySpark syntax, which looks like an amalgamation of the SQL and Python you learned earlier.
  6. During the last week of this program, focus on the tools available for scheduling the jobs you have developed so far. Also, familiarize yourself with the basic usage of a version control tool.

Do not overwhelm yourself by looking at the long list of technologies attached to the role of a data engineer.

If data engineering can be equated to driving a car, SQL is your steering wheel, Python gives you the flexibility to accelerate and apply the brakes, Spark is your engine, and all the rest of the technologies are add-ons that ensure a smooth drive.

With this, you will have given yourself preliminary training in tackling the challenges the world of data engineering is going to throw at you. All of it might seem a bit formidable, and it might sound a bit introductory, but trust me, it can offer you a really strong base moving forward. I wish you the absolute best in your journey towards becoming a data engineer.
