Presto SQL with S3

Presto SQL works with a variety of connectors. I did some experiments to connect it to AWS S3, and I struggled a bit to get Presto SQL up and running with the ability to query Parquet files on S3. This article documents the steps that worked for me.

What is Presto?

"Presto is a highly parallel and distributed query engine that is built from the ground up for efficient, low latency analytics." (Presto Documentation)

Hive Connector

Presto supports a wide variety of connectors. The Hive connector is an important one: it lets Presto talk to a Hive metastore (HMS), which manages the mapping between table definitions and the underlying file system. The file system here can be HDFS or S3. Presto SQL works with Hive metastore 3.0, and neither HDFS nor Hive itself needs to be installed to use the metastore with Presto SQL.

Hive metastore

Download the standalone Hive metastore binary and untar the tarball, for example to:

/opt/apache-hive-metastore-3.0.0-bin

Update the metastore configuration

Edit /opt/apache-hive-metastore-3.0.0-bin/conf/metastore-site.xml:


<configuration>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hmsdb?serverTimezone=PST</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hms</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hms</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>AES256</value>
  </property>
</configuration>

Since we are using a persistent metastore backed by MySQL, we need to set up MySQL so that the Hive metastore can talk to it.

Mysql configuration

Log in to MySQL and create a database and a user to manage HMS. The following DDL should suffice:

CREATE DATABASE hmsdb;

CREATE USER 'hms'@'localhost' IDENTIFIED WITH mysql_native_password BY 'hms' ;

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, INDEX, DROP, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES ON hmsdb.* TO 'hms'@'localhost';

GRANT FILE ON *.* TO 'hms'@'localhost';

After creating the database and user, make sure you can connect to the hmsdb database.
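
A quick connectivity check, using the database and user created above:

$ mysql -u hms -p hmsdb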

Initialize metastore

cd /opt/apache-hive-metastore-3.0.0-bin
./bin/schematool -initSchema -dbType MYSQL

This initializes the Hive metastore schema and creates all of its tables. You can connect to hmsdb and check the tables.
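
You can also ask schematool to report on the schema it just created; the -info option prints the schema version recorded in the database:

cd /opt/apache-hive-metastore-3.0.0-bin
./bin/schematool -info -dbType MYSQL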

Verify the tables in MySQL

$ mysql -u hms -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 35331
Server version: 8.0.21 MySQL Community Server - GPL


Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.


Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.


Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use hmsdb
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A


Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hmsdb           |
+---------------------------+
| AUX_TABLE                 |
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |

...
| WRITE_SET                 |
+---------------------------+
73 rows in set (0.00 sec)


mysql>
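
Start the metastore

Before moving on to Presto, start the metastore service itself so that it listens on the thrift URI configured in metastore-site.xml. A minimal sketch, assuming the standalone metastore distribution and a Hadoop installation reachable via HADOOP_HOME (the metastore needs the Hadoop libraries on its classpath):

cd /opt/apache-hive-metastore-3.0.0-bin
./bin/start-metastore &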

Install and configure Presto SQL

Download the Presto server tarball and extract it:

$ tar -zxf presto-server-0.237.1.tar.gz

$ cd presto-server-0.237.1

Create a configuration file etc/config.properties with the basic Presto configuration:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
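
Depending on the memory available on the machine, you may also want to cap query memory in the same file; these values are illustrative, not taken from the original setup:

query.max-memory=4GB
query.max-memory-per-node=1GB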

Create etc/jvm.config to specify the following JVM configuration.

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Create etc/node.properties to include the following lines.

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data

Configure the Presto Hive connector in etc/catalog/hive.properties, pointing it at the Hive metastore service started earlier:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Access Key>
hive.s3.aws-secret-key=<Your AWS Secret Key>
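
If the bucket is in a non-default region or you are using an S3-compatible object store, an explicit endpoint can be added as well (the value below is a placeholder):

hive.s3.endpoint=s3.us-west-2.amazonaws.com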

Start the presto server

$ ./bin/launcher start

To verify Presto server is running, open http://localhost:8080 in your browser and check the server status from its web UI.

Presto Client CLI

Download the Presto CLI executable jar, rename it, and make it executable:

$ mv presto-cli-0.237.1-executable.jar presto
$ chmod +x presto

Connect to the server and query the data:

$ ./presto --server localhost:8080  --catalog hive --debug

presto> use default;

Create a table using the Presto CLI with data backed by AWS S3:

CREATE TABLE reason5 (
  r_reason_sk varchar,
  r_reason_id varchar,
  r_reason_desc varchar
)
WITH (
  external_location = 's3a://com.test/presto/',
  format = 'PARQUET'
);

This creates table metadata in the Hive metastore that maps the table schema to files in S3. If files matching the schema exist at that location, we can query them from the Presto CLI. Note that the file format is PARQUET.
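
Before loading any data, you can confirm from the Presto CLI that the table metadata landed in the metastore:

presto:default> show tables;
presto:default> describe reason5;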

To do so, you need a Parquet file in S3 matching the above schema. There are various tools for this; I used PySpark with Jupyter to create a Parquet file from a CSV and then copied the file to S3.

Pyspark script

Install PySpark and use the following Jupyter script to create the Parquet file.

Input file:

$ cat /tmp/reason5.csv
r_reason_sk, r_reason_id, r_reason_desc
"100", "this is the reason", "good way"
"101", "this is the reason", "good way"
"102", "this is the reason", "good way"
"103", "this is the reason", "good way"
"104", "this is the reason", "good way"
"105", "this is the reason", "good way"

Jupyter script:

import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# Schema matching the reason5 table created in Presto
schema = StructType([
    StructField("r_reason_sk", StringType(), True),
    StructField("r_reason_id", StringType(), True),
    StructField("r_reason_desc", StringType(), True)])

# Split each CSV line on commas and skip the header row
rdd = sc.textFile('/tmp/reason5.csv') \
    .map(lambda line: line.split(",")) \
    .filter(lambda cols: cols[0] != 'r_reason_sk')

df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/tmp/reason5.parquet')

Copy the file to S3:

aws s3 sync /tmp/reason5.parquet s3://com.test/presto/
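
You can list the prefix to confirm the Parquet part files were uploaded:

aws s3 ls s3://com.test/presto/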

Query using the Presto CLI. Ta-da! You can query the files on S3 mapped to a SQL table.

presto:default> select * from reason5;
 r_reason_sk |      r_reason_id      | r_reason_desc
-------------+-----------------------+----------------
 "100"       |  "this is the reason" |  "good way"
 "101"       |  "this is the reason" |  "good way"
 "102"       |  "this is the reason" |  "good way"
 "103"       |  "this is the reason" |  "good way"
 "103"       |  "this is the reason" |  "good way"
 "104"       |  "this is the reason" |  "good way"
 "105"       |  "this is the reason" |  "good way"
 "104"       |  "this is the reason" |  "good way"
 "105"       |  "this is the reason" |  "good way"
 "100"       |  "this is the reason" |  "good way"
 "101"       |  "this is the reason" |  "good way"
 "102"       |  "this is the reason" |  "good way"
(12 rows)

Query 20201101_214203_00027_prps3, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:01 [12 rows, 1.29KB] [17 rows/s, 1.57KB/s]

The diagram below shows the high-level components used in the workflow. Query performance in Presto can be tuned by adding more worker nodes and adjusting various other tuning parameters; a sketch of a worker configuration follows the diagram.

[Diagram: Presto, Hive metastore and S3]
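
For example, each additional worker node gets its own etc/config.properties that disables the coordinator role and points discovery at the coordinator. A minimal sketch, with a placeholder hostname:

coordinator=false
http-server.http.port=8080
discovery.uri=http://<coordinator-host>:8080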


References

  1. Presto Database
  2. Presto with hive metastore
  3. Running pyspark
  4. Image source

