CSV SerDe

Vishal Loharkar

Published Sep 26, 2016

Hello All,

In this post I will going to cover one small but important concept of SerDe w.r.t hive table.

Most of us as a new user to hive, at some point in your assignments or job task might have faced issue loading csv file to hive tables. Particularly, when record is delimited by comma and one of the field contains comma in-between the quotes.

Here I will going to explain one simple solution to handle such csv file.

Hope you guys like it and may find helpful.

What is a SerDe?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

CSV SerDe

All TEXTFILE tables have a SerDe (Serializer/Deserializer) assigned to them that tells them how to extract fields from a line of text. By default a TEXTFILE table uses the lazy serde: the lazy serde will split each line of text at the defined character delimiter and The Serde is set at the CREATE TABLE statement. If not specified, the lazy serde is used. If your text file is a csv with quotes and commas, the lazy serde will not intepret the quotes and they will be part of the field value, which can cause issue if the column value initially contained commas as commas are also the field delimiters

e.g consider this line in a csv file: "text , text", 123
It contains 2 fields :
1: "text, field" and
2: 123
The lazy serde will wronly assume it has 3 fields as there is a comma in the first field (delimited by quotes)

The CSV serde allows Hive to interpret these CSV files correctly.

Note: 1) All fields extracted will be of type "string"
2) Impala does NOT support SerDes
3) This does NOT support multi-line fields.

Rows are produced by the record reader in the inputformat when a line break character is present (\n). Only then the CSV serde can extract fields from rows.

Two or more CSV serdes are available for Hive:
use either now built-in org.apache.hadoop.hive.serde2.OpenCSVSerde
or com.bizo.hive.serde.csv.CSVSerde

Example:
Here is how to use the CSV serde in the CREATE TABLE statement:
serde properties is optional as the standard CSV properties will be used by default:

CREATE EXTERNAL TABLE my_csv_table
(c1 string, c2 string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar"     = "'",
"escapeChar"    = "\\" )
STORED AS TEXTFILE
LOCATION '/user/hive/landing/csvfolder/' ;

Result:

Given a line in a csv file containing 2 fields:
"this is \ta csv \ttest , field", "this is another field"

Correct output using CSV serde:

Incorrect output using default lazy serde:

Because the first csv field contains a comma between quotes, it is not a comma indicated a new field, but it's part of the field. If we load this data in a standard textfile Hive table, it will be misinterpreted: the quotes will be considered as text and the comma in the first field will cause the text after it to be in the next field. we will lose the data from the original 2nd field

Supreeth Shankar 8y

One query though - how exactly is issue of null values or empty string resolved. I ve tried null serialization format and apparently it isnt helping.

Amol Kale 9y

Nice Article

Nakul Deshmukh 9y

This is great Vishal.

1 Reaction

Praveen Misra 9y

Excellent.. Thanks for sharing..

See more comments

To view or add a comment, sign in

CSV SerDe

Vishal Loharkar

Hello All,

What is a SerDe?

CSV SerDe

More articles by Vishal Loharkar

Others also viewed

Apache Arrow - The Missing Deep Dive Series - Part 1

Bulk Insert via python to insert over 4 Million+ rows to MariaDB at localhost [Project-Based]

Content ETL made even easier

Choosing Between ORM and Direct SQL Queries: A Balanced Approach (Part 1)

SQL Server to MinIO: GUI-Based Data Ingestion Tool

David Schachter’s Blog: ORMs vs. SQL

Spring Batch - Read from CSV and Write to Database

XML Encode using SQL on #IBMi

Explore content categories

Hello All,

What is a SerDe?

CSV SerDe

More articles by Vishal Loharkar

Turning ML Models into Real-World Applications with Streamlit 🚀

Breaking 70 Years of Algorithmic Limits: The New Shortest Path Algorithm That Outperforms Dijkstra

Mastering PySpark Transformations: A Simple Guide to a Complex Concept

Unlocking the Power of Apache Spark with Python for Next-Level ETL

Driving the Future: Key Data Engineering Innovations in 2025

Memory Management in Spark - I

Caching In Spark

Hive Optimization Parameters

What is Apache Tez?

Using Match() & Wildmatch() in QlikView script

Others also viewed

Apache Arrow - The Missing Deep Dive Series - Part 1

Bulk Insert via python to insert over 4 Million+ rows to MariaDB at localhost [Project-Based]

Content ETL made even easier

Choosing Between ORM and Direct SQL Queries: A Balanced Approach (Part 1)

SQL Server to MinIO: GUI-Based Data Ingestion Tool

David Schachter’s Blog: ORMs vs. SQL

Spring Batch - Read from CSV and Write to Database

XML Encode using SQL on #IBMi

Explore content categories