SQOOP Performance tuning - Partitioning,Controlling Parallelism & Batch
Sqoop Performance Tuning on AWS EMR - Marth

OraOop can create a partitioned table for you. The table is partitioned by the date of the load (one new partition per export), then sub-partitioned by mapper (so 24 mappers would produce 24 sub-partitions). The easiest way is to use the template-table parameter. Here is the command I used: in this example, TMARTH is my existing non-partitioned table, and TMARTH_EXP is a table that does not yet exist and will be created by OraOop. It will have the same columns as TMARTH, plus the columns mentioned above for partitioning.

sqoop export \
  -Dmapred.map.tasks.speculative.execution=false \
  -Doraoop.template.table=tmarth \
  -Doraoop.partitioned=true \
  --connect jdbc:oracle:thin:@//hostname1:1521/service \
  --username username --password password \
  --export-dir tmarth \
  --table tmarth_exp \
  --num-mappers 24


To optimize performance, set the number of map tasks to a value lower than the maximum number of concurrent connections that the database supports.
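As a sanity check before raising the mapper count, you can look up the connection ceiling on the database side. A minimal sketch for MySQL; the hostname and user are placeholders, not values from this article:

```shell
# Check the server-side connection limit (MySQL example; host/user are placeholders).
mysql -h marth.example.com -u sqoop -p \
  -e "SHOW VARIABLES LIKE 'max_connections';"

# Keep --num-mappers comfortably below the reported value, leaving
# headroom for the database's other clients.
```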

Using more mappers leads to a higher number of concurrent data-transfer tasks, which can result in faster job completion.

sqoop import \
  --connect jdbc:mysql://marth.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table employee \
  --num-mappers 10
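Parallelism only pays off when the splits are balanced. Sqoop splits on the table's primary key by default; when that key is missing or skewed, you can pick the split column explicitly with `--split-by`. A hedged sketch building on the command above; the column name `emp_id` is an assumption, not from the article:

```shell
# Sketch: choose an evenly distributed, indexed column for splitting.
# emp_id is a hypothetical column; substitute one from your schema.
sqoop import \
  --connect jdbc:mysql://marth.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table employee \
  --split-by emp_id \
  --num-mappers 10
```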

Inserting Data in Batches

The JDBC interface exposes an API for executing a prepared statement with multiple sets of values as a batch. With the --batch parameter, Sqoop can take advantage of this. The API is present in all JDBC drivers because the JDBC specification requires it.

Enable JDBC batching using the --batch parameter.

sqoop export \
  --connect <<JDBC_URL>> \
  --username <<SQOOP_USER_NAME>> \
  --password <<SQOOP_PASSWORD>> \
  --table <<TABLE_NAME>> \
  --export-dir <<FOLDER_URI>> \
  --batch
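Batching can be tuned further with two Sqoop export properties, sqoop.export.records.per.statement and sqoop.export.statements.per.transaction, which control how many rows go into each INSERT and how many statements form one transaction. A sketch only; the values of 100 are illustrative starting points, not recommendations from this article:

```shell
# Sketch: combine --batch with the export batching knobs.
# 100 rows per statement and 100 statements per transaction are
# illustrative; tune them against your database's limits.
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect <<JDBC_URL>> \
  --username <<SQOOP_USER_NAME>> \
  --password <<SQOOP_PASSWORD>> \
  --table <<TABLE_NAME>> \
  --export-dir <<FOLDER_URI>> \
  --batch
```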



