HCatalog Manual

13 documents

Apache Hive : HCatalog

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid.

This is the HCatalog manual.  

The old HCatalog wiki page has many other documents including additional user documentation, further information on HBase integration, and resources for contributors.

Apache Hive : HCatalog Authorization

Storage Based Authorization

Default Authorization Model of Hive

The default authorization model of Hive supports a traditional RDBMS style of authorization: users, groups, and roles are granted permissions to perform operations on a database or table. It is described in more detail in Hive Authorization and in Hive Deprecated Authorization Mode / Legacy Mode.
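
For example, the legacy model lets an administrator grant privileges in the familiar RDBMS fashion (the role, table, and user names below are hypothetical):

-- Legacy-mode (deprecated) authorization statements; names are illustrative.
CREATE ROLE analyst;
GRANT SELECT ON TABLE sales TO ROLE analyst;
GRANT ROLE analyst TO USER alice;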

This RDBMS style of authorization is not well suited to typical Hadoop use cases because of the following differences in implementation:

Apache Hive : HCatalog Command Line Interface

Set Up

The HCatalog command line interface (CLI) can be invoked as HIVE_HOME=hive_home hcat_home/bin/hcat where hive_home is the directory where Hive has been installed and hcat_home is the directory where HCatalog has been installed.

If you are using Apache Bigtop's rpms or debs, you can invoke the CLI as /usr/bin/hcat.
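
For example, with hypothetical installation paths:

# Illustrative paths; substitute your own hive_home and hcat_home.
HIVE_HOME=/usr/lib/hive /usr/lib/hcatalog/bin/hcat -e "show tables;"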

HCatalog CLI

The HCatalog CLI supports these command line options:

Option   Usage                                        Description
-g       hcat -g mygroup ...                          Tells HCatalog that the table which needs to be created must have group “mygroup”.
-p       hcat -p rwxr-xr-x ...                        Tells HCatalog that the table which needs to be created must have permissions “rwxr-xr-x”.
-f       hcat -f myscript.hcatalog ...                Tells HCatalog that myscript.hcatalog is a file containing DDL commands to execute.
-e       hcat -e 'create table mytable(a int);' ...   Tells HCatalog to treat the following string as a DDL command and execute it.
-D       hcat -Dkey=value ...                         Passes the key-value pair to HCatalog as a Java System Property.
(none)   hcat                                         Prints a usage message.
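
Several of these options can be combined in a single invocation; for example (the group name, permissions, and DDL below are illustrative):

# Create a table whose storage gets group "analysts" and permissions rwxr-x---.
hcat -g analysts -p rwxr-x--- -e "create table mytable (a int);"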

Note the following:

Apache Hive : HCatalog Configuration Properties

Apache HCatalog’s behaviour can be modified through a few configuration parameters specified in the jobs submitted to it. This document details the knobs available to users and what they accomplish.

Setup

The properties described on this page are job-level properties, set on HCatalog through the jobConf passed into it. They are relevant for Pig users of HCatLoader/HCatStorer and for MapReduce users of HCatInputFormat/HCatOutputFormat. A MapReduce user must supply them as key-value pairs in the Configuration (JobConf/Job/JobContext) used to instantiate HCatInputFormat or HCatOutputFormat. A Pig user sets them with the Pig “set” command before invoking HCatLoader/HCatStorer.
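
As a minimal MapReduce-side sketch (the property name, database, and table are placeholders; package names follow the post-merge org.apache.hive.hcatalog layout):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ConfiguredJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The job-level knob must be in place before HCatalog reads the configuration.
        conf.set("some.hcat.property", "value");  // placeholder property name
        Job job = Job.getInstance(conf, "hcat-configured-job");
        HCatInputFormat.setInput(job, "default", "mytable");  // hypothetical db/table
    }
}

The Pig equivalent is set some.hcat.property value; issued before the load or store statement.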

Apache Hive : HCatalog Dynamic Partitioning

Overview

When writing data in HCatalog it is possible to write all records to a single partition. In this case the partition column(s) need not be in the output data.

The following Pig script illustrates this:

A = load 'raw' using HCatLoader(); 
... 
split Z into for_us if region=='us', for_eu if region=='eu', for_asia if region=='asia';
store for_us into 'processed' using HCatStorer("ds=20110110, region=us"); 
store for_eu into 'processed' using HCatStorer("ds=20110110, region=eu"); 
store for_asia into 'processed' using HCatStorer("ds=20110110, region=asia"); 

To write data to multiple partitions simultaneously, keep the partition columns in the data and do not specify partition values when storing it.
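
A sketch of the dynamic-partitioning variant of the script above: ds and region remain columns of Z, and HCatStorer receives no partition values, so one partition is created per distinct (ds, region) pair found in the data:

A = load 'raw' using HCatLoader();
...
store Z into 'processed' using HCatStorer();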

Apache Hive : HCatalog Input and Output Interfaces

Set Up

No HCatalog-specific setup is required for the HCatInputFormat and HCatOutputFormat interfaces.

Note: HCatalog is not thread safe.

HCatInputFormat

The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables.

HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.
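
A minimal driver and mapper sketch (the table default.mytable and the output path are hypothetical; package names follow the post-merge org.apache.hive.hcatalog layout, and the exact setInput signature has varied across releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ReadExample {
    // Each input record arrives as an HCatRecord; fields can be fetched by position.
    public static class ReadMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(new Text(String.valueOf(value.get(0))), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hcat-read-example");
        // Point the job at an HCatalog-managed table; "default"/"mytable" are placeholders.
        HCatInputFormat.setInput(job, "default", "mytable");
        job.setInputFormatClass(HCatInputFormat.class);
        job.setJarByClass(ReadExample.class);
        job.setMapperClass(ReadMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}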

API

The API exposed by HCatInputFormat is shown below. It includes:

Apache Hive : HCatalog Installation from Tarball

HCatalog Installed with Hive

Version

HCatalog is installed with Hive, starting with Hive release 0.11.0.
Hive installation is documented in the Hive installation guide.

HCatalog Command Line

If you install Hive from the binary tarball, the hcat command is available in the hcatalog/bin directory.

The hcat command line is similar to the hive command line; the main difference is that it restricts the queries that can be run to metadata-only operations: DDL commands and queries that read metadata (for example, “show tables”).
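
For example, a metadata-only query runs directly:

hcat -e "show tables;"

A query that must launch a MapReduce job, such as an aggregation, should be run through the hive command line instead.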

Apache Hive : HCatalog Load and Store Interfaces

Set Up

The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.

Note: HCatalog is not thread safe.

Running Pig

The -useHCatalog Flag

To bring in the appropriate jars for working with HCatalog, simply include the following flag when running Pig from the shell, Hue, or other applications:
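
For example (the script name is hypothetical):

pig -useHCatalog myscript.pig

Running pig -useHCatalog with no script starts the Grunt shell with the HCatalog jars loaded.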

Apache Hive : HCatalog Notification

Overview

Since version 0.2, HCatalog provides notifications for certain events happening in the system, so that applications such as Oozie can wait for those events and schedule work that depends on them. The current version of HCatalog supports two kinds of events:

  • Notification when a new partition is added
  • Notification when a set of partitions is marked done

No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message.
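
Notifications are delivered as JMS messages. A minimal consumer sketch, assuming an ActiveMQ broker at an illustrative URL; the topic name below is also illustrative (each table's actual topic name is recorded in its hcat.msgbus.topic.name table property):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PartitionEventListener {
    public static void main(String[] args) throws JMSException {
        // Broker URL is an illustrative assumption.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // Hypothetical topic; look up the real name in the table's
        // hcat.msgbus.topic.name property.
        Topic topic = session.createTopic("hcat.default.mytable");
        MessageConsumer consumer = session.createConsumer(topic);
        // Block until an add-partition event arrives, then hand off dependent work.
        Message message = consumer.receive();
        System.out.println("Received partition event: " + message.getJMSMessageID());
        connection.close();
    }
}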

Apache Hive : HCatalog Reader and Writer Interfaces

Overview

HCatalog provides a data transfer API for parallel input and output without using MapReduce. This API provides a way to read data from a Hadoop cluster or write data into a Hadoop cluster, using a basic storage abstraction of tables and rows.

The data transfer API has three essential classes (a usage sketch follows the list):

  • HCatReader – reads data from a Hadoop cluster
  • HCatWriter – writes data into a Hadoop cluster
  • DataTransferFactory – generates reader and writer instances
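
A minimal sketch of the read path (database and table names are hypothetical; package names follow the post-merge org.apache.hive.hcatalog layout). The master prepares a ReaderContext, which is shipped to each worker so it can read its own slice:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class ReaderSketch {
    public static void main(String[] args) throws Exception {
        // Master side: describe what to read and prepare a shippable context.
        ReadEntity entity = new ReadEntity.Builder()
                .withDatabase("default")   // hypothetical database
                .withTable("mytable")      // hypothetical table
                .build();
        Map<String, String> config = new HashMap<String, String>();
        HCatReader master = DataTransferFactory.getHCatReader(entity, config);
        ReaderContext context = master.prepareRead();

        // Worker side: each worker is handed the context plus its own slice number
        // (0 here stands in for "this worker's number").
        HCatReader worker = DataTransferFactory.getHCatReader(context, 0);
        Iterator<HCatRecord> records = worker.read();
        while (records.hasNext()) {
            System.out.println(records.next());
        }
    }
}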

Auxiliary classes in the data transfer API include:

Apache Hive : HCatalog Storage Formats

SerDes and Storage Formats

HCatalog uses Hive’s SerDe class to serialize and deserialize data. SerDes are provided for RCFile, CSV text, JSON text, and SequenceFile formats. Check the SerDe documentation for additional SerDes that might be included in new versions. For example, the Avro SerDe was added in Hive 0.9.1, the ORC file format was added in Hive 0.11.0, and Parquet was added in Hive 0.10.0 (plug-in) and Hive 0.13.0 (native).
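
The storage format is declared per table at creation time; for example (table and column names are hypothetical):

CREATE TABLE mytable (a INT, b STRING) STORED AS ORC;
CREATE TABLE legacy_table (a INT, b STRING) STORED AS RCFILE;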

Apache Hive : HCatalog Streaming Mutation API

The Streaming Mutation API is a Java API focused on mutating (inserting, updating, deleting) records in transactional tables using Hive’s ACID feature. It was introduced in Hive 2.0.0 (HIVE-10165).

Background

In certain data processing use cases it is necessary to modify existing data when new facts arrive. An example is the classic ETL merge, where a copy of a data set is kept in sync with a master by the frequent application of deltas; the deltas describe the mutations (inserts, updates, deletes) that have occurred on the master since the previous sync. Implementing such a case on Hadoop has traditionally required rewriting the partitions that contain the records targeted by the mutations. This is a coarse approach: a partition holding millions of records might be rebuilt because of a single record change. Moreover, these partitions cannot be restated atomically; at some point the old partition data must be swapped with the new. While this swap occurs, usually as an HDFS rm followed by a mv, there is a window in which the data appears unavailable, and downstream jobs consuming it may fail unexpectedly. Data processing patterns that restate raw data on HDFS therefore cannot operate robustly without some external mechanism to orchestrate concurrent access to changing data.

Apache Hive : HCatalog Usage

Version information

HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
Hive version 0.11.0 is the first release that includes HCatalog.

Overview

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.