Which programming language should I pick in the data area?
If you read the title of this article and came eager to know which is the best programming language to learn, I’m sorry, but you were just caught by clickbait! Obviously, the most popular language is Python, but that doesn’t mean it’s the best one. It all depends on what you are looking for in the data field.
Let’s begin with the data analyst role. Quite often, professionals in this role have never written a single line of Python. Among data analysts SQL dominates, and it is a prerequisite in most job descriptions on the market. Usually, experience with analytics software such as Power BI, Tableau, Qlik Sense (or QlikView), or Looker is the most important requirement. All these tools support SQL queries, and most of them also integrate with other programming languages. Knowing a programming language such as Python, R, Java, or JavaScript can be a “plus”, but normally isn’t a “must have”.
Now for the data engineer role, which is the role in the data area closest to a software developer. Strong knowledge of one, two, or even three programming languages is therefore usually on the “must have” list. The most popular languages and frameworks for this role are Python, SQL, PySpark, Scala, Java, Go, C++, and Rust (in order from most common to very rare). Knowledge of databases is usually also a strong requirement, both relational databases (which accept SQL queries) and NoSQL databases. Finally, experience with tools such as Apache Airflow, Apache Flink, Apache Kafka, Databricks, and many others may also be required. The data engineer role is therefore probably the one with the most diverse stack in the data area. Because of this, companies are usually more open to candidates who understand the concepts behind the tools, rather than insisting on a professional with hands-on experience in every one of them, which would obviously be like looking for a unicorn.
Now, about the data scientist role: we also have a bunch of programming languages that can be used, and I will divide them into four categories. The first is the Python-based category, where we have Python itself, with famous libraries such as pandas, scikit-learn, TensorFlow, and NumPy. We also have PySpark, which is the Python API for Apache Spark: you write ordinary Python code and it runs in the Spark environment, so you can combine Python and Spark libraries.
For the second category, we will look at languages for running scripts in databases. Here SQL reigns supreme, being used not only to retrieve data from databases but also to apply different kinds of transformations to it. The third category also has only one language, the favorite among statisticians: R. Its syntax is close to statistical notation, so it is especially common in research and academic studies.
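Going back to the SQL category for a moment, here is a minimal sketch of what "retrieving and transforming" data with SQL can look like when it is driven from Python. It uses the sqlite3 module from Python's standard library with an in-memory database. The sales table, its columns, and the sales_by_region function are names I made up for illustration, not something taken from a real project.

```python
import sqlite3


def sales_by_region(rows):
    """Load (region, amount) rows into a throwaway table and aggregate them with SQL.

    The table and column names are hypothetical, chosen only for this example.
    """
    conn = sqlite3.connect(":memory:")  # temporary database that lives only in memory
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)", rows)
        # The transformation (grouping, summing, sorting) is done by SQL,
        # Python only sends the query and collects the result.
        query = """
            SELECT region, SUM(amount) AS total
            FROM sales
            GROUP BY region
            ORDER BY total DESC
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("north", 120.0), ("south", 80.0), ("north", 30.0)]
    for region, total in sales_by_region(sample):
        print(region, total)
```

In a real job the connection would point at a production database or a data warehouse instead of SQLite, but the division of labor is usually the same: SQL does the heavy lifting on the data, and Python orchestrates it.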
Finally, the fourth category is for the Java-based languages. Java itself may seem out of place on this list, but Java and the JVM it runs on are widely used to build data science tools, such as Spark, Weka, Deeplearning4j, and many components of the Hadoop ecosystem. However, for work done in a Spark or Hadoop context, the language usually selected is Scala, a JVM language in which most of Spark itself is written. Scala is accepted by most platforms backed by Spark, such as Databricks, AWS Glue, Google Dataproc, and many others. To finish this list, even though it is not a Java-based language, we have MATLAB, which is still pretty popular in academic projects and research.
Obviously, there are projects that use other languages, such as Haskell, JavaScript, C++, and others. However, I tried to focus on the languages that, in my experience, show up most often in the requirements lists of job descriptions.
Before we finish this article… You can put out your torches, because I am aware that SQL isn’t a programming language. I wrote the article this way to keep the text simple and to avoid having to explain myself every time I mention SQL. Another consideration is that Excel is still widely used, not only in business areas but in the data area too (it is a reality and we need to face it).
But if you came all this way and are feeling frustrated because the article doesn’t give a single straight opinion, here it goes: if you are starting in the data area and don’t have a specific project in mind, I suggest that you focus on Python and SQL. These are the most popular languages, and they are the most likely to be used at some point in your career. However, do not ignore other languages! If you feel curious about any of them, go learn a little bit. Maybe in the future one will be a “nice to have” for a position you are applying to.