Sqoop Import in Avro Files

Today I will show you how to sqoop data into the Avro file format.

Yes, we know it looks very simple: just add --as-avrodatafile to your sqoop import command, as the Apache documentation says. But in real life, do documented commands always work as simply as written?

Definitely not…

And the same happened to me… so no worries, here I am going to show you the issues you are likely to face, how to debug them, and the resolutions. If you hit a different issue, please comment and we will try to solve it together.

But before getting to those issues, let’s see what Avro is and why we might want to sqoop data into Avro…

What is AVRO?

An .avro file is created by the Apache Avro data serialization system.

1. Data is stored in the .avro file in a compact binary format, and the schema is stored in an .avsc file in JSON format (see the sketch after this list).

2. It is very useful for exchanging extremely large datasets, as you can split the data into multiple .avro part files.

3. It is a language-neutral data serialization system, but it is schema based.

4. It is quite useful when you need to deal with a dynamic schema, e.g. when you want to remain unaffected by changes in the source schema over time and still want to access older ingested data after a schema change.

5. It can handle schema changes such as missing fields, newly added fields, and changed fields.

6. It can handle line breaks (\n) inside values, which is awesome if you have ever had to deal with them in a .csv file.
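
To make the data/schema split concrete, here is a minimal, hypothetical .avsc for the emp_details table used later in this post; the field names and types are purely illustrative, not what Sqoop will actually generate:

# write a tiny, illustrative Avro schema to a local .avsc file
cat > emp_details.avsc <<'EOF'
{
  "type": "record",
  "name": "emp_details",
  "namespace": "com.example",
  "fields": [
    {"name": "empid",   "type": ["null", "int"],    "default": null},
    {"name": "empname", "type": ["null", "string"], "default": null}
  ]
}
EOF

The matching .avro file then holds only the binary-encoded records; a reader can pick up the schema either from the .avsc or from the header of the .avro data file itself.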

Now I guess you must be thinking: why sqoop to Avro and not to Parquet…

As we know, in open source there are several ways and tools to do a single thing, and we find the best one by weighing all the pros and cons against our requirements…

Here too, we must select the best file format based on our needs.

Before comparing Avro and Parquet, here are a few points about Parquet.

1. It uses columnar storage of data, as opposed to a row-based format like Avro.

2. It is self-describing: the schema, including complex nested structures, is carried within the data file.

3. It is good for read-intensive use cases as it supports indexing.

4. Parquet and Avro both support schema evolution (changing the schema of the data over time), but comparatively Avro provides much richer schema evolution.

5. Parquet gives better compression ratios compared to Avro.

Now here are the key factors for choosing Avro over Parquet.

1. Avro suits write-intensive workloads whereas Parquet suits read-intensive ones. So, for a sqoop import I would prefer the write-friendly Avro.

2. Avro provides much richer schema evolution (see the sketch after this list).

3.     It can handle line breaks (\n).
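
As a small sketch of point 2 (again with illustrative field names), Avro schema evolution largely comes down to adding fields with defaults: a reader using the newer emp_details_v2.avsc below can still decode records written with the older two-field schema, because the missing dept field simply falls back to its default.

# hypothetical evolved schema: one new field with a default value
cat > emp_details_v2.avsc <<'EOF'
{
  "type": "record",
  "name": "emp_details",
  "namespace": "com.example",
  "fields": [
    {"name": "empid",   "type": ["null", "int"],    "default": null},
    {"name": "empname", "type": ["null", "string"], "default": null},
    {"name": "dept",    "type": ["null", "string"], "default": null}
  ]
}
EOF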

Now let’s look at the sqoop import, the related issues, and their resolutions.

Environment

1.     HDP 3.0.1

2.     Sqoop version: 1.4.7

Sqoop Import Command

Here I will show you the step-wise execution along with the related errors and resolutions.

Step 1:

sqoop import -Dmapreduce.job.queuename=xyz \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

Error : 

19/01/20 23:38:18 INFO mapreduce.Job: Running job: job_1546234110553_0499
19/01/20 23:38:29 INFO mapreduce.Job: Job job_1546234110553_0499 running in uber mode : false
19/01/20 23:38:29 INFO mapreduce.Job:  map 0% reduce 0%
19/01/20 23:38:34 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_0, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:39 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:44 INFO mapreduce.Job: Task Id : attempt_1546234110553_0499_m_000000_2, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
19/01/20 23:38:51 INFO mapreduce.Job:  map 100% reduce 0%
19/01/20 23:38:51 INFO mapreduce.Job: Job job_1546234110553_0499 failed with state FAILED due to: Task failed task_1546234110553_0499_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces: 0

Debug:

yarn logs -applicationId application_1546448110553_0499

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546234110553_0499/container_e46_1546448235553_0499_01_000004/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0499/container_e46_1546448110553_0499_01_000004/avro-1.8.1.jar

If you google the above error (org.apache.avro.reflect.ReflectData.addLogicalTypeConversion), you will find that this is a known Sqoop issue related to the Avro version.

Sqoop uses Avro 1.8.0, while other Hadoop components use Avro 1.7.5 or 1.7.4.

But if you look at the YARN application log above, you will see that the classpath loads avro-1.7.7.jar before avro-1.8.1.jar, and that is what creates the Avro type conversion issue.
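
A quick way to check this load order yourself (a sketch; substitute the application id from your own run) is to pull the avro jar entries out of the YARN log and see which one is listed first:

# the first jar printed is the one the task JVM will resolve Avro classes from
yarn logs -applicationId application_1546448110553_0499 \
  | grep -o '[^:,[:space:]]*avro-[0-9.]*\.jar' \
  | uniq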

Solution: -Dmapreduce.job.user.classpath.first=true

https://community.hortonworks.com/questions/60890/sqoop-import-to-avro-failing-which-jars-to-be-used.html

Step 2: 

sqoop import -Dmapreduce.job.user.classpath.first=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

I ran it with -Dmapreduce.job.user.classpath.first=true, since the URL above states that this resolved their issue.

But it was not that easy for me: it resolved the Sqoop Avro version issue, but then my MapReduce job started failing.

That made me wonder why my issue is always different from everyone else’s, and why I always get trapped in problems that others do not.

But no worries… I feel that if errors are your enemy, then logs are your friend…

Error :

[2019-01-21 00:02:29.593]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/avro-1.8.1.jar
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0500/container_e46_1546448110553_0500_01_000001/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar

Solution:

If you look at the log above after adding -Dmapreduce.job.user.classpath.first=true, the Sqoop Avro version issue is resolved: avro-1.8.1.jar is now loaded before avro-1.7.7.jar. But now the MapReduce job itself fails, because MapReduce has a dependency on avro-1.7.7.jar that avro-1.8.1.jar does not satisfy.

To resolve this, we have to use a separate class loader for each.

Use -Dmapreduce.job.classloader=true instead of -Dmapreduce.job.user.classpath.first=true

Step 3: 

sqoop import -Dmapreduce.job.classloader=true -Dmapreduce.job.queuename=BDSS \
--connect "jdbc:sqlserver://*******:1433;databaseName=employee" \
--username riveriq \
--password ***** \
--query 'select * from employee.dbo.emp_details where $CONDITIONS' \
--m 4 \
--split-by empid \
--target-dir /user/riveriq/sqoop/emp_details \
--as-avrodatafile

Debug:

yarn logs -applicationId application_1546448110553_0500

java.class.path:

/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/mr-framework/hadoop/share/hadoop/common/lib/avro-1.7.7.jar

INFO [main] org.apache.hadoop.mapreduce.v2.util.MRApps: Creating job classloader

INFO [main] org.apache.hadoop.util.ApplicationClassLoader: classpath:
/data/yarn/local/usercache/ashishk/appcache/application_1546448110553_0503/container_e46_1546448110553_0504_01_000002/avro-1.8.1.jar,

Output: the data and schema files get created:

/user/riveriq/sqoop/emp_details/part_****.avro

/home/riveriq/emp_details.avsc
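
To sanity-check the output, you can inspect one of the part files with the avro-tools jar (a sketch: the jar location and the part file name below are illustrative; pick any real part file from the target dir):

# pull one part file down from HDFS
hdfs dfs -get /user/riveriq/sqoop/emp_details/part-m-00000.avro .
# print the schema embedded in the data file
java -jar avro-tools-1.8.1.jar getschema part-m-00000.avro
# dump the first few records as JSON
java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | head -5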

Conclusion: for me, -Dmapreduce.job.classloader=true was the final solution to this issue.

