Which programming language should I pick in the data area?

If you read the title of this article and came eager to know which is the best programming language to learn, I’m sorry, but you were just caught by clickbait! Python is clearly the most popular language, but that doesn’t mean it’s the best one. It all depends on what you are looking for in the data field.

Let’s begin with the data analyst role. Quite often, professionals in this role have never written a single line of Python. Among data analysts, SQL dominates, and it is a prerequisite in most job descriptions on the market. Usually, the most important requirement is familiarity with analytics software such as Power BI, Tableau, Qlik Sense (or QlikView), or Looker. All these tools support SQL queries, and most of them also integrate with other programming languages. Knowing a programming language such as Python, R, Java, or JavaScript can be a “plus”, but normally isn’t a “must have”.

Next, the data engineer role, which is the role in the data area closest to a software developer. Thus, strong knowledge of one, two, or even three programming languages is usually on the “must have” list. The most popular languages for this role are Python, SQL, PySpark, Scala, Java, Golang, C++, and Rust (in order from most common to very rare). Knowledge of databases is usually also a strong requirement, covering both relational databases (which accept SQL queries) and NoSQL databases. Finally, experience with tools such as Airflow, Apache Flink, Apache Kafka, Databricks, and many others may also be required. As a result, the data engineer role is probably the one with the most diverse stack in the data area. Because of this, companies are usually more open to candidates who understand the concepts behind the tools, instead of insisting on a professional who has hands-on experience with every one of them, which would be like looking for a unicorn.

Now, about the data scientist role: here we also have a bunch of programming languages that can be used, and we will divide them into four categories. The first is the Python-based category, where we have the Python language itself, with famous libraries such as pandas, scikit-learn, TensorFlow, and NumPy. We also have PySpark, the Python API for Apache Spark, which lets users write Python code that runs in the Spark environment and combine Python and Spark libraries.

For the second category, we will look at languages for running scripts in databases. Here SQL reigns supreme, being used not only for retrieving data from databases but also for executing different types of transformations on the data. The third category also has only one language, the favorite among statisticians: R. Its syntax is close to statistical notation, so it is used especially in research and academic studies.
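As a small illustration of the SQL category described above, here is a minimal sketch of SQL being used both to retrieve data and to transform it. It uses Python's built-in sqlite3 module so it can run anywhere; the `sales` table, its columns, and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical in-memory table, created only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("north", 50.0), ("south", 70.0)],
)

# Retrieval: fetch only the rows that match a filter.
north_rows = conn.execute(
    "SELECT region, amount FROM sales WHERE region = ? ORDER BY amount",
    ("north",),
).fetchall()

# Transformation: aggregate inside the database instead of in Python.
totals = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

print(north_rows)
print(totals)
conn.close()
```

The first query only reads data, while the second reshapes it into one total per region. That second kind of work is what the transformation side of SQL looks like in practice.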

Finally, the fourth category is for the Java-based languages. Starting with Java itself, it may seem odd to see it on this list, but Java and the JVM are widely used to build data science tools, such as Spark, Weka, DeepLearning4J, and many components of the Hadoop ecosystem. However, when we talk about work done in the Spark and Hadoop context, the language usually selected is Scala, a JVM language in which much of Spark itself is written. Scala is accepted by most platforms backed by Spark, like Databricks, AWS Glue, Google Dataproc, and many others. To close the list, although it is not Java-based, we have MATLAB, which is still pretty popular in academic projects and research.

Obviously, there are projects that use other languages, such as Haskell, JavaScript, C++, and others. However, I tried to focus on the languages that, in my experience, appear most often in the requirements lists of job descriptions.

Before we finish this article… You can put out your torches, because I am aware that SQL isn’t a programming language. I wrote the article this way to keep the text simple and to avoid having to explain myself every time I mention SQL. Another consideration is that Excel is still widely used, not only in business areas but in the data area too (it is a reality, and we need to face it).

But if you came all this way and are feeling frustrated because the article doesn’t give a single straight opinion, here it goes: if you are starting out in the data area and don’t have a specific project in mind, I suggest that you focus on Python and SQL. These are the most popular languages, and they are the most likely to be used at some point in your career. However, do not ignore other languages! If you feel any curiosity about them, go learn a little bit. Maybe in the future one of them will be a “nice to have” for a position you are applying to.

By Rafael Araújo, MCs