CSV SerDe

CSV SerDe

Hello All,

In this post I will going to cover one small but important concept of SerDe w.r.t hive table.

Most of us as a new user to hive, at some point in your assignments or job task might have faced issue loading csv file to hive tables. Particularly, when record is delimited by comma and one of the field contains comma in-between the quotes.

Here I will going to explain one simple solution to handle such csv file.

Hope you guys like it and may find helpful.

 

What is a SerDe?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

CSV SerDe

All TEXTFILE tables have a SerDe (Serializer/Deserializer) assigned to them that tells them how to extract fields from a line of text. By default a TEXTFILE table uses the lazy serde: the lazy serde will split each line of text at the defined character delimiter and The Serde is set at the CREATE TABLE statement. If not specified, the lazy serde is used. If your text file is a csv with quotes and commas, the lazy serde will not intepret the quotes and they will be part of the field value, which can cause issue if the column value initially contained commas as commas are also the field delimiters

e.g consider this line in a csv file: "text , text", 123
It contains 2 fields :
1: "text, field" and
2: 123
The lazy serde will wronly assume it has 3 fields as there is a comma in the first field (delimited by quotes)

The CSV serde allows Hive to interpret these CSV files correctly. 

Note: 1) All fields extracted will be of type "string"       
2) Impala does NOT support SerDes
3) This does NOT support multi-line fields.

Rows are produced by the record reader in the inputformat when a line break character is present (\n). Only then the CSV serde can extract fields from rows.

Two or more CSV serdes are available for Hive:
use either now built-in org.apache.hadoop.hive.serde2.OpenCSVSerde
or com.bizo.hive.serde.csv.CSVSerde

Example:
Here is how to use the CSV serde in the CREATE TABLE statement:
serde properties is optional as the standard CSV properties will be used by default:


CREATE EXTERNAL TABLE my_csv_table
(c1 string, c2 string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (    
"separatorChar" = "\t",    
"quoteChar"     = "'",    
"escapeChar"    = "\\" )  
STORED AS TEXTFILE
LOCATION '/user/hive/landing/csvfolder/' ;

Result:

Given a line in a csv file containing 2 fields:
​"this is \ta csv \ttest , field", "this is another field"

Correct output using CSV serde:

 Incorrect output using default lazy serde:

Because the first csv field contains a comma between quotes, it is not a comma indicated a new field, but it's part of the field. If we load this data in a standard textfile Hive table, it will be misinterpreted: the quotes will be considered as text and the comma in the first field will cause the text after it to be in the next field. we will lose the data from the original 2nd field

One query though - how exactly is issue of null values or empty string resolved. I ve tried null serialization format and apparently it isnt helping.

Like
Reply

Excellent.. Thanks for sharing..

Like
Reply

To view or add a comment, sign in

More articles by Vishal Loharkar

Others also viewed

Explore content categories