Apache Hive : RCFile
RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0.
RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part.
RCFile combines the advantages of both row-store and column-store to satisfy the need for fast data loading and query processing, efficient use of storage space, and adaptability to highly dynamic workload patterns.
- As row-store, RCFile guarantees that data in the same row are located in the same node.
- As column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.
A shell utility is available for reading RCFile data and metadata: see RCFileCat.
For details about the RCFile format, see:
- Javadoc for RCFile.java
- the 2011 ICDE conference paper “RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems” by Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu.