Sqoop Import in Avro Files

Today I will show you how to sqoop data into the Avro file format.

Yes, we know it looks very simple: just add --as-avrodatafile to your sqoop import command, as the Apache documentation says. But in real life, do documented commands always work as simply as written?

Definitely not…

And the same happened to me… so no worries, here I am going to show you the issues you are likely to face, how to debug them, and the resolutions. If you hit a different issue, please comment and we will try to solve it together.

But before getting to those issues, let’s see what Avro is and why we might want to sqoop data into Avro…

What is AVRO?

An .avro file is created by the Apache Avro data serialization system.

1. Data is stored in the .avro file in a compact binary format, and the schema is stored in an .avsc file in JSON format (see the sketch after this list).

2. It is very useful for exchanging extremely large datasets, as you can split the data into multiple .avro part files.

3. It is a language-neutral data serialization system, but it is schema based.

4. It is quite useful when you need to deal with a dynamic schema, e.g. when you want to remain unaffected by changes in the source schema over time and still want to access older ingested data after a schema change.

5. It can handle schema changes such as missing fields, newly added fields, and changed fields.

6. It can handle line breaks (\n) inside values, which is awesome if you have ever had to deal with them in a .csv file.
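
To make the data/schema split concrete, here is a minimal, hypothetical .avsc for the emp_details table used later in this post; the field names and types are purely illustrative, not what Sqoop will actually generate:

# write a tiny, illustrative Avro schema to a local .avsc file
cat > emp_details.avsc <<'EOF'
{
  "type": "record",
  "name": "emp_details",
  "namespace": "com.example",
  "fields": [
    {"name": "empid",   "type": ["null", "int"],    "default": null},
    {"name": "empname", "type": ["null", "string"], "default": null}
  ]
}
EOF

The matching .avro file then holds only the binary-encoded records; a reader can pick up the schema either from the .avsc or from the header of the .avro data file itself.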

Now I guess you must be thinking: why sqoop to Avro and not to Parquet…

As we know, in open source there are several ways and tools to do a single thing, and we find the best one by weighing all the pros and cons against our requirements…

Here too, we must select the best file format based on our needs.

Before comparing Avro and Parquet, here are a few points about Parquet.

1. It uses columnar storage of data, as opposed to a row-based format like Avro.

2. It is self-describing: the schema, including complex nested structures, is carried within the data file.

3. It is good for read-intensive use cases as it supports indexing.

4. Parquet and Avro both support schema evolution (changing the schema of the data over time), but comparatively Avro provides much richer schema evolution.

5. Parquet gives better compression ratios compared to Avro.

Now here are the key factors for choosing Avro over Parquet.

1. Avro suits write-intensive workloads whereas Parquet suits read-intensive ones. So, for a sqoop import I would prefer the write-friendly Avro.

2. Avro provides much richer schema evolution (see the sketch after this list).

3.     It can handle line breaks (\n).
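
As a small sketch of point 2 (again with illustrative field names), Avro schema evolution largely comes down to adding fields with defaults: a reader using the newer emp_details_v2.avsc below can still decode records written with the older two-field schema, because the missing dept field simply falls back to its default.

# hypothetical evolved schema: one new field with a default value
cat > emp_details_v2.avsc <<'EOF'
{
  "type": "record",
  "name": "emp_details",
  "namespace": "com.example",
  "fields": [
    {"name": "empid",   "type": ["null", "int"],    "default": null},
    {"name": "empname", "type": ["null", "string"], "default": null},
    {"name": "dept",    "type": ["null", "string"], "default": null}
  ]
}
EOF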

Now let’s look at the sqoop import, the related issues, and their resolutions.

Environment

1.     HDP 3.0.1

2.     Sqoop version: 1.4.7

Sqoop Import Command

Here I will show you the step-wise execution along with the related errors and resolutions.

Step 1:

sqoop import -Dmapreduce.job.queuename=xyz \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

Error : 

19/01/20 23:38:18 INFO mapreduce.Job: Running job: job_1546234110553_0499
19/01/20 23:38:29 INFO mapreduce.Job: Job job_1546234110553_0499 running in uber mode : false
19/01/20 23:38:29 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 23:38:34 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_0, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:39 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:44 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_2, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:51 INFO mapreduce.Job:  map 100% reduce 0%
19/01/20 23:38:51 INFO mapreduce.Job: Job job_1546234110553_0499 failed with state FAILED due to: Task failed task_1546234110553_0499_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

Debug:

yarn logs -applicationId application_1546448110553_0499

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546234110553_0499/container_e46_1546448235553_0499_01_000004/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0499/container_e46_1546448110553_0499_01_000004/avro-1.8.1.jar

If you google the above error (org.apache.avro.reflect.ReflectData.addLogicalTypeConversion), you will find that this is a known Sqoop issue related to the Avro version.

Sqoop uses Avro 1.8.0, while other Hadoop components use Avro 1.7.5 or 1.7.4.

But if you look at the YARN application log above, you will see that the classpath loads avro-1.7.7.jar before avro-1.8.1.jar, and that is what creates the Avro type conversion issue.
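
A quick way to check this load order yourself (a sketch; substitute the application id from your own run) is to pull the avro jar entries out of the YARN log and see which one is listed first:

# the first jar printed is the one the task JVM will resolve Avro classes from
yarn logs -applicationId application_1546448110553_0499 \
  | grep -o '[^:,[:space:]]*avro-[0-9.]*\.jar' \
  | uniq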

Solution: -Dmapreduce.job.user.classpath.first=true

https://community.hortonworks.com/questions/60890/sqoop-import-to-avro-failing-which-jars-to-be-used.html

Step 2: 

sqoop import -Dmapreduce.job.user.classpath.first=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

I ran it with -Dmapreduce.job.user.classpath.first=true, since the URL above states that this resolved their issue.

But it was not that easy for me: it resolved the Sqoop Avro version issue, but then my MapReduce job started failing.

That made me wonder why my issue is always different from everyone else’s, and why I always get trapped in problems that others do not.

But no worries… I feel that if errors are your enemy, then logs are your friend…

Error :

[2019-01-21 00:02:29.593]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/avro-1.8.1.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar

Solution:

If you look at the log above after adding -Dmapreduce.job.user.classpath.first=true, the Sqoop Avro version issue is resolved: avro-1.8.1.jar is now loaded before avro-1.7.7.jar. But now the MapReduce job itself fails, because MapReduce has a dependency on avro-1.7.7.jar that avro-1.8.1.jar does not satisfy.

To resolve this, we have to use a separate class loader for each.

Use -Dmapreduce.job.classloader=true instead of -Dmapreduce.job.user.classpath.first=true

Step 3: 

sqoop import -Dmapreduce.job.classloader=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar

INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Creating job classloader

INFO [main] org.apache.hadoop.util.ApplicationClassLoader: classpath:
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/avro-1.8.1.jar,

Output: the data and schema files get created:

/user/riveriq/sqoop/emp_details/part_****.avro

/home/riveriq/emp_details.avsc
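
To sanity-check the output, you can inspect one of the part files with the avro-tools jar (a sketch: the jar location and the part file name below are illustrative; pick any real part file from the target dir):

# pull one part file down from HDFS
hdfs dfs -get /user/riveriq/sqoop/emp_details/part-m-00000.avro .
# print the schema embedded in the data file
java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro
# dump the first few records as JSON
java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | head -5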

Conclusion: for me, -Dmapreduce.job.classloader=true was the final solution to this issue.

