Languages and Dialects (Part 1)

Languages and Dialects (Part 1)

When I first approached SQL syntax, I felt pretty familiar with such an intuitive language. The idea of dealing with a unique and simple syntax reassured me. Soon, I found out that despite the differences were slight, each Database Management System had its own version of SQL syntax.

The same happened with Databricks: when I started using it, I felt very enthusiastic with a unique environment where to code in different languages. I literally foretasted the idea of using R and Python, as I used to know them. However, the expectation did not really match the reality! Don't get me wrong, I am still very enthusiastic, I still recognize the big value of such an environment and I definitely enjoy switching from one language to another; however, I am not coding (uniquely) with my familiar languages. I found, instead, hybrid versions of R and Python, which I would call trivially dialects (they are not proper dialects!).

It is indeed typical of programming languages to have some variations called dialects. Quoting Wikipedia, "A dialect of a programming language is a (relatively small) variation or extension of the language that does not change its intrinsic nature".

In the context of Data Science there are several possible choices, despite some languages are preferred to others. However, the toolkit of a data scientist should always include SQL (which is not a real programming language), Python (which is basically the most popular one) or any other substitute language, able to fulfil the same tasks. We know that, nowadays, R and Julia have become very popular as well and they are often included (especially R) in software capabilities (see Databricks and Tableau). Scala is another language that deserve our attention, since it is basically the elected language for Data Engineering.

The aim of this article is to present some typical situations in which a data scientist/engineer face extensions or variations of known languages and provide some useful shortcuts to make this experience less traumatic.


Structured Query Language (SQL)

The first time I heard about dialects in data science, it was referred to SQL. The Structured Query Language is a basic syntax for database management. It is generally split into some categories:

  • DDL: the Data Definition Language is the syntax apt to define and describe data, tables and views.
  • DML: the Data Manipulation Language is the syntax that allows to manipulate the data stores in tables, by deleting, updating them or inserting new ones.
  • DCL: the Data Control Language is the syntax apt to grant or revoke any kind of permission across the database.
  • DQL: the Data Query Language is the syntax for interrogating the tables about their data.

In this case I omit the Transaction Control Language.

With the previous categories of syntax, it is possible to fulfil any kind of activity related to a database, however, switching from one Relational Management Database System to another, it is likely to find some discrepancies in the SQL syntax. There some main technologies for RDBMS: we can consider for example Oracle, MySQL, SQL Server, PostgreSQL and so on. Each one of them has small variations with respect to the others, such that they could be considered different dialects without a real official language.

Let us consider, for example, the DDL. In this case, there is an interesting difference between the dialects. When creating a new table, it is possible to take the structure from another one, without copying the data. This very operation is performed this way:

  • MySQL: CREATE TABLE t2 (LIKE t1)
  • SQL Server: SELECT * INTO t2 FROM t1 WHERE 1<>1
  • Oracle: CREATE TABLE t2 AS SELECT * FROM t1 WHERE 1<>1
  • PostgreSQL: CREATE TABLE t2 (LIKE t1)


Let us consider now the DML. Taking the common INSERT statement, we might be interested at manually inserting some values into a table. The syntax is as follows:

  • MySQL: INSERT INTO tablename VALUES (1, 'A'), (2, 'B'), (3, 'C')
  • SQL Server: INSERT INTO tablename VALUES (1, 'A'), (2, 'B'), (3, 'C')
  • Oracle: INSERT ALL INTO tablename VALUES (1, 'A') INTO tablename VALUES (2, 'B') INTO tablename VALUES (3, 'C') SELECT null FROM dual
  • PostgreSQL: INSERT INTO tablename VALUES (1, 'A'), (2, 'B'), (3, 'C')


Let us consider now the DQL. In the basic SELECT statements, we might be interested in selecting only a portion of the data. This operation can be performed as follows:

  • MySQL: SELECT columns FROM tablename ORDER BY key ASC LIMIT n
  • SQL Server: SELECT TOP n columns FROM tablename ORDER BY key ASC
  • Oracle: SELECT * FROM (SELECT ROW_NUMBER() OVER (ORDER BY key ASC) AS rownumber, columns FROM tablename) WHERE rownumber <= n
  • PostgreSQL: SELECT columns FROM tablename ORDER BY key ASC LIMIT n


Even some specific functions for different purposes might be different. Let us consider for example a function that returns the length of a string:

  • MySQL: LENGTH(), CHARACTER_LENGTH() or CHAR_LENGTH()
  • SQL Server: LEN() or DATALENGTH()
  • Oracle: LENGTH()
  • PostgreSQL: CHARACTER_LENGTH() or CHAR_LENGTH()


Let us consider a function for the extraction of a substring from a string:

  • MySQL: SUBSTRING(input FROM start-position [FOR length])
  • SQL Server: SUBSTRING(input, start, length)
  • Oracle: SUBSTR(input,start-pos[,length])
  • PostgreSQL: SUBSTRING(input FROM start-position [FOR length])


Let us consider a function that eliminates all the spaces from a string:

  • MySQL: TRIM(string_to_be_trimmed)
  • SQL Server: LTRIM(string_to_be_trimmed) or RTRIM(string_to_be_trimmed)
  • Oracle: TRIM(string_to_be_trimmed)
  • PostgreSQL: TRIM(string_to_be_trimmed)


Let us consider a function that concatenates different strings:

  • MySQL: CONCAT(string, string)
  • SQL Server: string1 + string2
  • Oracle: string1 || string2
  • PostgreSQL: string1 || string2


Let us consider a function that yields a specific value in case it finds a NULL value. There are different options:

  • MySQL: IFNULL(column, new_value) or COALESCE(column, new_value)
  • SQL Server: ISNULL(column, new_value) or COALESCE(column, new_value)
  • Oracle: NVL(column, new_value)
  • PostgreSQL: COALESCE(column, new_value)


Conclusions

This is obviously a partial view of the whole picture. The differences between RDMSs are several and each of them requires much experience to be utterly known and mastered. Nevertheless, it might be useful, switching from one to another, remember some of these small tips, in order to avoid the repetitive and sometimes boring activity of searching for the correct function in each environment.


Sources

https://troels.arvin.dk/db/rdbms/

https://www.w3schools.com/sql/sql_isnull.asp

Per anni ho usato ADODB quando sviluppavo in PHP: https://adodb.org/dokuwiki/doku.php per tenere a bada questa "noia". E' un DBAL https://en.wikipedia.org/wiki/Database_abstraction_layer Se usi Python e vuoi provare qualcosa del genere, puoi leggere questa eloquente risposta su StackOverflow https://stackoverflow.com/questions/679806/what-are-the-viable-database-abstraction-layers-for-python

Like
Reply

To view or add a comment, sign in

Others also viewed

Explore content categories