Modern Data Engineer

Data scientist, as we know, is the hottest job in the market. Data scientists are heavily supported by their lesser-known cousin, the 'Data Engineer'. Data engineers are the forces working tirelessly in the shadows to help data scientists and product teams bring forth shiny ML and analytics products. Data engineering has evolved over time to address the pervasive data needs of this AI age.

I will try to highlight some of the unique traits of the ‘Modern Data Engineer’, both to bring them into the spotlight and to help others who are considering this as a career choice.

Strong Data Fundamentals - Data is the lifeblood of many organizations. I have seen highly ambitious programs fail to take flight because they lacked ‘Data Driven’ decision making. Understanding how data flows among different systems, and how it is represented and persisted, is crucial to designing systems that make it available for meaningful decisions. The skills to model data stores, build pipelines that transform data from multiple systems, and provide consumption mechanisms for business teams are a must. So are the abilities to deal with multiple databases, NoSQL and document storage systems, along with knowledge of ‘Data Recovery’ mechanisms.

Programming - It is not uncommon for Data engineers to be perceived as ‘SQL Only’ coders. While ‘SQL’ is the most powerful tool in their tool chest thanks to its rich data manipulation and analytics capabilities, it is by no means the only thing they need to know. There are multiple reasons for a Data engineer to be a strong programmer.

  • Non-scalable/Inflexible Vendor tools: Most corporations invest in data platforms and tools sold by third-party vendors based on legacy needs and practices. Unfortunately, these tools neither give developers the flexibility to ingest and transform data nor scale with growing data needs. Developers either build separate systems to fill the gaps or look for workarounds to these limitations, which again fall short.
  • New Data Sources: Typical data feeds and database stores are fast being replaced with NoSQL, streaming and cloud storage systems, which need coding skills to build data pipelines.
  • Big Data Processing: Though legacy tool vendors are catching up on big data support, most systems that handle unstructured or semi-structured data involve complex processing needs that can only be solved by programming.
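
To make the last point concrete, here is a minimal Python sketch of the kind of semi-structured processing that is awkward in SQL-only tooling: flattening nested JSON events into flat rows while skipping malformed input. The function names (`flatten`, `parse_events`) and the sample events are hypothetical, made up for illustration, and a real pipeline would add schema checks and error reporting.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with dots."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            # Scalars and lists are kept as they are.
            flat[name] = value
    return flat


def parse_events(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per valid JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line: skipped here, logged in a real pipeline
        if isinstance(parsed, dict):
            yield flatten(parsed)


# Hypothetical sample input: newline-delimited JSON with one bad line.
raw_lines = [
    '{"user": {"id": 1, "geo": {"country": "US"}}, "event": "click"}',
    "not json",
    '{"user": {"id": 2}, "event": "view", "tags": ["a", "b"]}',
]

rows = list(parse_events(raw_lines))
for row in rows:
    print(row)
```

Each resulting row is a flat dictionary (for example with keys such as `user.id` and `user.geo.country`), which can then be loaded into a table or handed to a downstream tool.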

There are many other reasons beyond those above that underline the importance of programming skills for Data engineers. It is a ‘no brainer’ to pick ‘Python’ as the language to start with, given its versatility across a wide variety of solutions. There is the additional perk of working with ‘PySpark’ to leverage Spark’s capabilities. After Python, Java and Scala are the most desired languages for building data services.

Get into Big Data - In this data economy, I would put the chances at 90% that you work in financial services, healthcare, insurance or retail, which are the biggest adopters of big data ecosystems. Dealing with data at scale is an essential skill set for Data engineers. There are a myriad of choices out there, both open source and packaged by vendors like Cloudera. If I were to rank them, I would pick Spark and Hadoop/Hive with HDFS or cloud storage as a must.

Live in a Cloud - There is tremendous cloud adoption across industries. According to Gartner, around 28% of IT spend over the next 3 years is going to be on cloud infrastructure or application partners. As applications move to the cloud, so does their data. Though there is no single choice here, AWS is by far the leader, with Microsoft Azure a fast chaser and Google’s GCP lagging in third place. I would strongly bet on AWS and recommend its big data certification as a way to acquire the fundamentals of both big data and cloud concepts.

Analytics Tools - In the end, Data engineers need to provide consumable endpoints for a variety of users such as data scientists, analysts and business users, along with supporting other downstream systems. Typically, BI platforms, query tools and notebook applications are the primary applications. Basic knowledge of some of the top platforms like Tableau, Redash and Jupyter would help any data engineer provide a better experience to customers of the data.

The ‘Modern Data Engineer’ needs to juggle multiple technologies, skills and disciplines. He/she is part software engineer, part data engineer and part data analyst, with multiple secondary skills added. The strongest survival skill is to ‘Learn & Be Curious’ and keep an eye on evolving trends.

Thanks Manish. Glad you liked it. I definitely agree that 'Visualization' is a great skill that enables Data engineers to 'Tell the story' of their data and complete the end-to-end loop. But over time, I have seen engineers struggle to come up with visualizations effective enough to compel the audience to take a decision, and more of that work has moved to analysts. Without generalizing again, I believe Visualization is more of an 'Art' than a 'Science', needing different skill sets to convince an audience and produce the desired impact. In summary, it is good to have both the knowledge and the hands-on experience. I will cover my detailed viewpoint on BI/Visualization/Data analysis in a follow-up article.

Great read Panini. You covered all the important dimensions of a DE role. One point I want to bring up here is visualization of data. There will be times when a DE needs to present findings, insights or information. A good visualization skill set plays a key role when interacting with a C-level or non-technical audience.


More articles by Panini Jannabhatla
