Presto SQL with S3
Presto SQL works with a variety of connectors. I ran some experiments to connect it to AWS S3, and struggled a bit to get Presto SQL up and running with the ability to query Parquet files on S3.
What is Presto?
"Presto is a highly parallel and distributed query engine that is built from the ground up for efficient, low latency analytics." — Presto Documentation
Hive Connector
Presto supports a wide variety of connectors. The Hive connector is an important one: it lets Presto talk to a Hive metastore (HMS), which manages the mapping between table definitions and the underlying file system. The file system here can be HDFS or S3. Presto SQL works with Hive metastore 3.0, and neither HDFS nor Hive itself needs to be installed to use the metastore with Presto SQL.
Hive metastore
Download the Hive standalone metastore binary and untar the tarball, for example to:
/opt/apache-hive-metastore-3.0.0-bin
Update the metastore configuration by editing /opt/apache-hive-metastore-3.0.0-bin/conf/metastore-site.xml:
<configuration>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask,org.apache.hadoop.hive.metastore.MaterializationsCacheCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hmsdb?serverTimezone=PST</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hms</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hms</value>
  </property>
  <property>
    <name>fs.s3a.server-side-encryption-algorithm</name>
    <value>AES256</value>
  </property>
</configuration>
Since we are using a persistent metastore backed by MySQL, we need to set up MySQL so that the Hive metastore can talk to it.
Mysql configuration
Log in to MySQL and create a database and a user to manage HMS. The following DDL should suffice:
CREATE DATABASE hmsdb;
CREATE USER 'hms'@'localhost' IDENTIFIED WITH mysql_native_password BY 'hms';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, INDEX, DROP, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES ON hmsdb.* TO 'hms'@'localhost';
GRANT FILE ON *.* TO 'hms'@'localhost';
After creating the database and user, make sure you can connect to hmsdb.
Initialize metastore
cd /opt/apache-hive-metastore-3.0.0-bin
./bin/schematool -initSchema -dbType mysql
This initializes the hive metastore and creates all its tables. You can connect to hmsdb and check the tables:
$ mysql -u hms -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 35331
Server version: 8.0.21 MySQL Community Server - GPL

mysql> use hmsdb
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hmsdb           |
+---------------------------+
| AUX_TABLE                 |
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |
...
| WRITE_SET                 |
+---------------------------+
73 rows in set (0.00 sec)
Install and configure Presto SQL
Download the Presto server tarball, then unpack and configure it:
$ tar -zxf presto-server-0.237.1.tar.gz
$ cd presto-server-0.237.1
Create a configuration file etc/config.properties to include the basic Presto configuration
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
Create etc/jvm.config to specify the following JVM configuration.
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
Create etc/node.properties to include the following lines.
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/tmp/presto/data
Configure the Presto Hive connector in etc/catalog/hive.properties, pointing it at the Hive metastore service we just started:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.aws-access-key=<Your AWS Access Key>
hive.s3.aws-secret-key=<Your AWS Secret Key>
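Depending on your S3 setup, a few additional hive.s3.* properties may be needed. The values below are a sketch, not part of the original setup: the endpoint and path-style settings only apply if you are pointing at an S3-compatible store (such as MinIO), and the SSE flag mirrors the AES256 encryption configured in metastore-site.xml.

```
# Only for S3-compatible stores; real AWS S3 does not need these.
# hive.s3.endpoint=http://localhost:9000
# hive.s3.path-style-access=true

# Enable S3 server-side encryption for data written by Presto.
hive.s3.sse.enabled=true
```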
Start the presto server
$ ./bin/launcher start
To verify Presto server is running, open http://localhost:8080 in your browser and check the server status from its web UI.
Presto Client CLI
Download the Presto CLI executable jar, rename it, and make it executable:
$ mv presto-cli-0.237.1-executable.jar presto
$ chmod +x presto
Connect to server and query the data
$ ./presto --server localhost:8080 --catalog hive --debug
presto> use default;
Create a table using the Presto CLI, with data backed by AWS S3:
CREATE TABLE reason5 (
  r_reason_sk varchar,
  r_reason_id varchar,
  r_reason_desc varchar
)
WITH (
  external_location = 's3a://com.test/presto/',
  format = 'PARQUET'
);
This creates table metadata in the hive metastore, mapping the table schema to the files under the S3 prefix. If we have files in S3 matching the schema, we should be able to query them using the Presto CLI. Note that the file format is PARQUET.
In order to do so you need a Parquet file in S3 matching the above schema. There are various tools to create one. I used pyspark with Jupyter to create a Parquet file from CSV and then copied the file to S3.
Pyspark script
Install pyspark and use the following jupyter script to create parquet file.
input file
$ cat /tmp/reason5.csv
r_reason_sk, r_reason_id, r_reason_desc
"100", "this is the reason", "good way"
"101", "this is the reason", "good way"
"102", "this is the reason", "good way"
"103", "this is the reason", "good way"
"104", "this is the reason", "good way"
"105", "this is the reason", "good way"
jupyter script
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# All three columns are strings, matching the varchar table definition.
schema = StructType([
    StructField("r_reason_sk", StringType(), True),
    StructField("r_reason_id", StringType(), True),
    StructField("r_reason_desc", StringType(), True)])

rdd = sc.textFile('/tmp/reason5.csv').map(lambda line: line.split(","))

# Drop the CSV header row so it does not end up as a data row.
header = rdd.first()
rdd = rdd.filter(lambda row: row != header)

df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/tmp/reason5.parquet')
Copy file to s3
$ aws s3 sync /tmp/reason5.parquet s3://com.test/presto/
Query using Presto CLI. Ta-da! You can query the files on S3 mapped to a SQL table.
presto:default> select * from reason5;
 r_reason_sk |     r_reason_id      | r_reason_desc
-------------+----------------------+----------------
 "100"       | "this is the reason" | "good way"
 "101"       | "this is the reason" | "good way"
 "102"       | "this is the reason" | "good way"
 "103"       | "this is the reason" | "good way"
 "103"       | "this is the reason" | "good way"
 "104"       | "this is the reason" | "good way"
 "105"       | "this is the reason" | "good way"
 "104"       | "this is the reason" | "good way"
 "105"       | "this is the reason" | "good way"
 "100"       | "this is the reason" | "good way"
 "101"       | "this is the reason" | "good way"
 "102"       | "this is the reason" | "good way"
(12 rows)

Query 20201101_214203_00027_prps3, FINISHED, 1 node
Splits: 20 total, 20 done (100.00%)
0:01 [12 rows, 1.29KB] [17 rows/s, 1.57KB/s]
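The same query can also be issued programmatically. A minimal sketch using the presto-python-client package (pip install presto-python-client); the host, port, and user values are assumptions matching the local setup above:

```python
def fetch_reason5(host="localhost", port=8080, user="presto"):
    """Run `SELECT * FROM reason5` against the hive catalog
    and return the rows as a list of tuples."""
    import prestodb  # pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user,
        catalog="hive", schema="default",
    )
    cur = conn.cursor()
    cur.execute("SELECT * FROM reason5")
    return cur.fetchall()
```

Calling fetch_reason5() against the running server should return the same twelve rows the CLI printed.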
High-level components used in the workflow. Query performance on Presto can be tuned by adding more worker nodes and adjusting various other tuning parameters.