ORC vs RC file format

Rahul Kumar

Published Sep 7, 2017

ORC offers a number of features not available in RC files:

* Better encoding of data. Integer values are run-length encoded. Strings and dates are stored in a dictionary, and the resulting pointers are then run-length encoded.

* Internal indexes and statistics on the data. These allow more efficient reads and let the reader skip sections of the data that are not relevant to a given query. These indexes can also be used by the Hive optimizer to help plan query execution.

* Predicate pushdown for some predicates. For example, in the query "select * from user where state = 'ca'", ORC can look at a group of rows and use the indexes to see that no row in that group has that value, and thus skip the group altogether.

* Tight integration with Hive's vectorized execution, which processes rows much faster.

* Support for the new ACID features in Hive (transactional insert, update, and delete).

* Much faster read times than RCFile, and much more efficient compression.
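To make the encoding and row-skipping ideas above concrete, here is a minimal sketch in plain Python. It is a toy model, not ORC's actual on-disk format or reader: the function names, the dictionary-based row groups, and the group size are all assumptions made for illustration, and real ORC uses more elaborate run-length schemes and index structures.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with an integer id, assigned in first-seen order."""
    dictionary = {}
    ids = []
    for s in strings:
        ids.append(dictionary.setdefault(s, len(dictionary)))
    return list(dictionary), ids


def build_row_groups(values, group_size):
    """Split a column into row groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan_equals(groups, target):
    """Return the matching values and how many groups had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # the statistics prove no row here can match
        groups_read += 1
        matches.extend(v for v in group["rows"] if v == target)
    return matches, groups_read


# Hypothetical "state" column, as in the query example above.
states = ["ca", "ca", "ca", "ny", "ny", "tx", "tx", "tx"]

dictionary, ids = dictionary_encode(states)
encoded_ids = run_length_encode(ids)

groups = build_row_groups(states, group_size=4)
matches, groups_read = scan_equals(groups, "ca")
```

In this toy data the eight strings shrink to a three-entry dictionary plus three runs, and the lookup for 'ca' reads only one of the two row groups, because the second group's minimum is already 'ny'. Skipping works best when the data is sorted or clustered on the filtered column; on randomly ordered data the min/max ranges overlap and fewer groups can be ruled out.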

Whether ORC is the best format for what you're doing depends on the data you're storing and how you are querying it. If you are storing data whose schema you know and you are running analytic queries, it's the best choice (in fairness, some would dispute this and choose Parquet, though much of what I said above about ORC vs RC applies to Parquet as well). If your queries select the whole row each time, columnar formats like ORC won't be your friend. Also, if you are storing self-describing data such as JSON or Avro, you may find text or Avro storage to be a better fit.
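The trade-off between analytic queries and whole-row queries can be sketched with a toy in-memory columnar layout. This is only an illustration under assumed names and sample data (the `rows` table and the helper functions are hypothetical); it does not model ORC's real file structure.

```python
# Hypothetical sample table, stored row by row.
rows = [
    {"id": 1, "state": "ca", "amount": 10},
    {"id": 2, "state": "ny", "amount": 20},
    {"id": 3, "state": "ca", "amount": 30},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column(columns, name):
    """Analytic-style query: touches a single column only."""
    return sum(columns[name])


def fetch_row(columns, index):
    """Whole-row lookup: has to touch every column to rebuild the row."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columns(rows)
total_amount = sum_column(columns, "amount")
second_row = fetch_row(columns, 1)
```

The aggregate reads one list and ignores the rest, which is the access pattern columnar formats are built for. Rebuilding a full row visits every column, so a workload made up mostly of such lookups gets little benefit from the columnar layout.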
