Languages and Dialects (Part 1)
When I first approached SQL syntax, I felt pretty familiar with such an intuitive language. The idea of dealing with a unique and simple syntax reassured me. Soon, I found out that despite the differences were slight, each Database Management System had its own version of SQL syntax.
The same happened with Databricks: when I started using it, I felt very enthusiastic with a unique environment where to code in different languages. I literally foretasted the idea of using R and Python, as I used to know them. However, the expectation did not really match the reality! Don't get me wrong, I am still very enthusiastic, I still recognize the big value of such an environment and I definitely enjoy switching from one language to another; however, I am not coding (uniquely) with my familiar languages. I found, instead, hybrid versions of R and Python, which I would call trivially dialects (they are not proper dialects!).
It is indeed typical of programming languages to have some variations called dialects. Quoting Wikipedia, "A dialect of a programming language is a (relatively small) variation or extension of the language that does not change its intrinsic nature".
In the context of Data Science there are several possible choices, despite some languages are preferred to others. However, the toolkit of a data scientist should always include SQL (which is not a real programming language), Python (which is basically the most popular one) or any other substitute language, able to fulfil the same tasks. We know that, nowadays, R and Julia have become very popular as well and they are often included (especially R) in software capabilities (see Databricks and Tableau). Scala is another language that deserve our attention, since it is basically the elected language for Data Engineering.
The aim of this article is to present some typical situations in which a data scientist/engineer face extensions or variations of known languages and provide some useful shortcuts to make this experience less traumatic.
Structured Query Language (SQL)
The first time I heard about dialects in data science, it was referred to SQL. The Structured Query Language is a basic syntax for database management. It is generally split into some categories:
In this case I omit the Transaction Control Language.
With the previous categories of syntax, it is possible to fulfil any kind of activity related to a database, however, switching from one Relational Management Database System to another, it is likely to find some discrepancies in the SQL syntax. There some main technologies for RDBMS: we can consider for example Oracle, MySQL, SQL Server, PostgreSQL and so on. Each one of them has small variations with respect to the others, such that they could be considered different dialects without a real official language.
Let us consider, for example, the DDL. In this case, there is an interesting difference between the dialects. When creating a new table, it is possible to take the structure from another one, without copying the data. This very operation is performed this way:
Let us consider now the DML. Taking the common INSERT statement, we might be interested at manually inserting some values into a table. The syntax is as follows:
Let us consider now the DQL. In the basic SELECT statements, we might be interested in selecting only a portion of the data. This operation can be performed as follows:
Even some specific functions for different purposes might be different. Let us consider for example a function that returns the length of a string:
Recommended by LinkedIn
Let us consider a function for the extraction of a substring from a string:
Let us consider a function that eliminates all the spaces from a string:
Let us consider a function that concatenates different strings:
Let us consider a function that yields a specific value in case it finds a NULL value. There are different options:
Conclusions
This is obviously a partial view of the whole picture. The differences between RDMSs are several and each of them requires much experience to be utterly known and mastered. Nevertheless, it might be useful, switching from one to another, remember some of these small tips, in order to avoid the repetitive and sometimes boring activity of searching for the correct function in each environment.
Sources
https://troels.arvin.dk/db/rdbms/
https://www.w3schools.com/sql/sql_isnull.asp
Per anni ho usato ADODB quando sviluppavo in PHP: https://adodb.org/dokuwiki/doku.php per tenere a bada questa "noia". E' un DBAL https://en.wikipedia.org/wiki/Database_abstraction_layer Se usi Python e vuoi provare qualcosa del genere, puoi leggere questa eloquente risposta su StackOverflow https://stackoverflow.com/questions/679806/what-are-the-viable-database-abstraction-layers-for-python