Data integration SQL challenges in diverse relational DBMS environments

In modern medium-to-large IT departments and companies, a mixture of different database systems typically serves different purposes and data. Legacy and modern systems crunch data 24/7, storing it in databases for further use. In a great many cases this involves data integration between database systems, where data is moved and transformed from a source database to a target database.

These movements and transformations often rely on the SQL query language over direct connections between the source and target databases, using heterogeneous gateways or ODBC tunnels. The data, residing in tables, ranges from small to huge in both the number of columns and the number of rows.

Examining the SQL dialects of the participating databases reveals that every database system follows the ANSI SQL standard in one way or another. Still, slight differences in datatypes and built-in SQL functions (scalar, aggregate, table, and row functions) can be found in every database system. This variety in SQL functions and other functionality is part of what makes each database system specialized in how it processes queries.
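As a concrete illustration, consider how three common dialects express the same "rows from yesterday" filter. The table and column names here are purely illustrative:

```sql
-- Oracle: TRUNC() and the SYSDATE pseudo-column
SELECT order_id, amount
FROM   orders
WHERE  TRUNC(order_date) = TRUNC(SYSDATE) - 1;

-- SQL Server: CAST to DATE, GETDATE(), DATEADD()
SELECT order_id, amount
FROM   orders
WHERE  CAST(order_date AS DATE) = CAST(DATEADD(day, -1, GETDATE()) AS DATE);

-- Db2: DATE() and the CURRENT DATE special register
SELECT order_id, amount
FROM   orders
WHERE  DATE(order_date) = CURRENT DATE - 1 DAY;
```

Each query asks the same question, yet none of the three would run unchanged on the other two systems.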

The heterogeneous gateways and ODBC tunnels mentioned above require certain functionality that is common to all participating databases. A vital requirement is converting a "local" SQL query (used to query a remote database) into a heterogeneous SQL statement (one the remote database understands) before it is sent over for execution.
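In Oracle, for example, such a query is written against a database link, and the gateway must translate it before shipping it to the remote system. A minimal sketch, with a hypothetical link name and table:

```sql
-- Local (Oracle) query against a remote non-Oracle table through a
-- heterogeneous gateway; "mssql_link" and "orders" are illustrative.
SELECT order_id, amount
FROM   orders@mssql_link
WHERE  order_date >= DATE '2024-01-01';

-- Before execution, the gateway must rewrite this into a statement the
-- remote database understands, roughly along the lines of:
--   SELECT "order_id", "amount" FROM "orders"
--   WHERE  "order_date" >= {d '2024-01-01'}
```

Whether a given expression in the WHERE clause can be translated this way is exactly where things start to go wrong.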

It is at this stage that the challenges of data integration with SQL begin to emerge.

Let's look at a typical scenario: the source database has a table with 100 million rows, growing by several hundred thousand rows daily. On the target database, a batch program runs daily, fetching yesterday's data from the source table and adding it to a local copy for further transformation and processing.

If the batch program's remote query contains one or more local built-in SQL functions that the remote (source) database has difficulty understanding, the optimizer may split the query in two and run each part on a different database: 1) a local query containing the original WHERE clause and the local built-in SQL functions, running on the target database, and 2) a heterogeneous query, with only a small subset of the WHERE clause or none at all, running on the source database.
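Such a split can be triggered by a query as innocent-looking as the following sketch (link, table, and column names are hypothetical), where a dialect-specific function is applied to the remote column in the WHERE clause:

```sql
-- Daily batch query of the kind described above (names illustrative).
-- TRUNC(order_date) is an Oracle-specific expression on the REMOTE column;
-- if the gateway cannot map it, the filter cannot be shipped to the source.
INSERT INTO orders_copy
SELECT *
FROM   orders@source_link
WHERE  TRUNC(order_date) = TRUNC(SYSDATE) - 1;

-- What may then be sent to the source database (no WHERE clause at all):
--   SELECT * FROM orders
-- while the TRUNC(...) filter runs afterwards on the target side.
```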

In short, this leads to pre-processing on the source and post-processing on the target. That is to say, the whole 100-million-row table is read and all of its data transferred over the network to the target database, where it is then post-processed: the several hundred thousand relevant rows are filtered out according to the query's predicates, the selected rows are inserted into the target table, and the unneeded rows are discarded. This entails throwing away over 99% of the rows on a daily basis, rows that were unnecessarily transferred from source to target, prolonging the batch execution time and occupying temporary storage and memory excessively.

For this reason, when batch programs in a multi-database environment run for hours, exhausting CPU and other resources because of data transfer between the databases involved, it can prove justifiable and financially beneficial to take the time to examine the queries and identify the problematic ones. If queries use database-specific SQL functions in their WHERE clauses, rewriting them so that both the local and the remote optimizer can process them optimally will typically speed up processing on both sides, ensure that only the necessary data is transferred over the network, and minimize all other resource usage.
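One way to sketch such a rewrite (same hypothetical names as before) is to replace the function applied to the remote column with a plain range predicate on the column itself. Expressions that involve only local constants and pseudo-columns, such as TRUNC(SYSDATE), can be evaluated on the target before the statement is shipped, so the remote side receives simple, translatable comparisons:

```sql
-- Rewritten with a bare range predicate on the remote column.
-- TRUNC(SYSDATE) involves no remote data, so it is evaluated locally to a
-- constant; the simple >= / < comparisons are far more likely to be
-- translated by the gateway and pushed down to the source database,
-- so only yesterday's rows cross the network.
INSERT INTO orders_copy
SELECT *
FROM   orders@source_link
WHERE  order_date >= TRUNC(SYSDATE) - 1
AND    order_date <  TRUNC(SYSDATE);
```

As a bonus, the rewritten predicate is sargable, so an index on order_date on the source side can be used as well.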


More articles by Tomas Helgi Johannsson
