Explore your Qlik Sense tenant and app data using SQL
I've been trying out DuckDB recently and in doing so also learning more about the parquet data file format. If you work in the world of software and data and haven't come across DuckDB yet it's probably worth having on your radar - it's causing a bit of a buzz, and with good reason from my experience so far.
In short it's an in process (with no external dependencies) OLAP optimised relational database that supports SQL. It's often compared to SQLite which is also an in process database but optimised for OLTP whereas DuckDB is optimised for analytical workloads.
You can read more here:
One of the first neat features which stood out to me is how easily you can directly query data in a CSV or Parquet file, either on disk or over http with something like:
select * from 'my_data.csv';
Or:
select * from 'http://server/my_data.parquet';
There's no need to install and configure a traditional RDBMS or import the data (although you can easily also import the data into a 'proper' table).
The CSV reader also has lots of configuration options but the auto options used above seem to work well.
The real magic comes in with parquet data though. This, being a (hybrid) columnar format with associated metadata statistics stored for each row group, allows DuckDB to effectively 'pushdown' many types of query into the file itself and load only the required parts necessary to satisfy the query. This means you can often get results from very large files surprisingly quickly!
With my background I thought it might be fun to present a Qlik tenant using the parquet data file format and open it up to exploration and querying using SQL. This may already be possible via some other route but I thought it would be a good chance to build and learn something new.
Let's dive in!
First we need DuckDB. An easy way to try it out is with their CLI (command line interface) which I run locally on Windows:
There's even a WASM build so you can just run it directly in the browser:
I am running a small webserver locally on http://localhost:7777/ which takes care of creating the parquet files the first time they are requested and then serving them up. I can probably make a download of the server available if there is any interest in the comments.
With that in place, in the CLI we need to run the following to load the httpfs extension:
load httpfs;
install httpfs;
Now, instead of querying the files directly as the examples above show, we can create a view for each of the tables (parquet files) currently supported:
create view models as select * from 'http://localhost:7777/tenant1/models.parquet'
create view reloads as select * from 'http://localhost:7777/tenant1/reloads.parquet';
create view apps as select * from 'http://localhost:7777/tenant1/apps.parquet';
create view users as select * from 'http://localhost:7777/tenant1/users.parquet';
create view audits as select * from 'http://localhost:7777/tenant1/audits.parquet';
Now we can run:
show tables;
And see:
And then explore a couple of the schemas:
Recommended by LinkedIn
describe apps;
describe models;
We can now use all the power of SQL to join, aggregate and filter the data to answer some questions.
I'll make up a few random ones to illustrate - it's been a while since I wrote much SQL so please feel free to point out any mistakes or inefficient way of doing things in the comments!
BTW - DuckDB even includes a bunch of SQL enhancements:
How many apps, reloads, tables and fields are in my tenant?
select (select count(*) from apps) as apps, (select count(*) from reloads) as reloads, (select count(DISTINCT(resourceid, tablename)) from models) as tables, (select count(DISTINCT(resourceid, tablename, fieldname)) from models) as fields;
What were the last 10 reloads on my tenant and did they succeed?
select apps.Name, reloads.Status, EndTime from reloads join apps on reloads.AppId = apps.ResourceId order by EndTime desc limit 10;
As an aside, I understand DuckDB also has full text search capability - as the reloads table includes the reload log perhaps there could be some use case there for searching for apps failing with a particular error message.
What are the top 10 audit events in my server?
select distinct eventType, count(eventType) as count from audits group by all order by count desc limit 10;
You can also query the data within a Qlik app, for example after adding the sample data app we can do:
create view sample as select * from 'http://localhost:7777/tenant1/apps/2be3e00f-defa-4038-960e-b2fb8f1adf71/SampleData.parquet';
And then for example:
select count(*) from sample;
select city, customer, round(sum(sales),2) as sales from sample where sales > 0 group by all order by sales desc limit 3;
It's been interesting dusting down my SQL again, digging into DuckDB a bit and building something new in Golang to connect it to Qlik.
DuckDB seems like an awesome project which is very active at the moment. I have even had some quick responses to questions and issues I have raised over on their discord server and GitHub repo. You can find the repo here.
This is AWSOME SAUCE on top of an AWESOME ENTREE enjoyed on an AWESOME DAY with AWESOME WEATHER and AWESOME COMPANY.
Why? Didn’t understand what is gained except the originality 🧐
Cool 😎