HCatalog Manual
Apache Hive : HCatalog
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid.
This is the HCatalog manual.
- Using HCatalog
- Installation from Tarball
- HCatalog Configuration Properties
- Load and Store Interfaces
- Input and Output Interfaces
- Reader and Writer Interfaces
- Command Line Interface
- Storage Formats
- Dynamic Partitioning
- Notification
- Storage Based Authorization
The old HCatalog wiki page has many other documents including additional user documentation, further information on HBase integration, and resources for contributors.
Apache Hive : HCatalog Authorization
Storage Based Authorization
Default Authorization Model of Hive
The default authorization model of Hive supports a traditional RDBMS style of authorization based on users, groups, and roles, and on granting them permissions to perform operations on a database or table. It is described in more detail in Hive Authorization and Hive Deprecated Authorization Mode / Legacy Mode.
This RDBMS style of authorization is not well suited to typical Hadoop use cases because of differences in implementation between a traditional RDBMS and Hadoop.
Apache Hive : HCatalog CLI
Set Up
The HCatalog command line interface (CLI) can be invoked as HIVE_HOME=hive_home hcat_home/bin/hcat where hive_home is the directory where Hive has been installed and hcat_home is the directory where HCatalog has been installed.
If you are using Apache Bigtop’s RPM or Debian packages, you can invoke the CLI as /usr/bin/hcat.
HCatalog CLI
The HCatalog CLI supports these command line options:
| Option | Usage | Description |
|---|---|---|
| -g | hcat -g mygroup ... | Tells HCatalog that the table to be created must have group “mygroup”. |
| -p | hcat -p rwxr-xr-x ... | Tells HCatalog that the table to be created must have permissions “rwxr-xr-x”. |
| -f | hcat -f myscript.hcatalog ... | Tells HCatalog that myscript.hcatalog is a file containing DDL commands to execute. |
| -e | hcat -e 'create table mytable(a int);' ... | Tells HCatalog to treat the following string as a DDL command and execute it. |
| -D | hcat -Dkey=value ... | Passes the key-value pair to HCatalog as a Java System Property. |
| (none) | hcat | Prints a usage message. |
Note the following:
- The -g and -p options are not required.
- Only one -e or -f option can be provided, not both.
- The order of the options is immaterial; you can specify them in any order.
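For example, the following invocation (the group, permission string, and script name are all illustrative) runs the DDL commands in a script and gives the resulting tables a specific group and permission set:

hcat -g mygroup -p rwxr-xr-x -f myscript.hcatalog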
Apache Hive : HCatalog Configuration Properties
Apache HCatalog’s behavior can be modified through a number of configuration parameters specified in the jobs submitted to it. This document describes those parameters and what they control.
Setup
The properties described on this page are job-level properties set on HCatalog through the jobConf passed into it. This means this page is relevant for Pig users of HCatLoader/HCatStorer and for MapReduce users of HCatInputFormat/HCatOutputFormat. For a MapReduce user of HCatalog, these must be present as key-value pairs in the Configuration (JobConf/Job/JobContext) used to instantiate HCatInputFormat or HCatOutputFormat. For Pig users, these parameters are set with the Pig “set” command before the HCatLoader or HCatStorer is invoked.
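As a minimal MapReduce-side sketch, assuming the property hcat.desired.partition.num.splits (one of the job-level properties this page covers) and an illustrative table default.mytable, the key-value pair is simply set on the configuration before HCatInputFormat is instantiated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Job-level HCatalog property, passed through the job configuration:
    conf.set("hcat.desired.partition.num.splits", "2");
    Job job = Job.getInstance(conf, "hcat-config-example");
    // "default" and "mytable" are illustrative database/table names.
    HCatInputFormat.setInput(job, "default", "mytable");
  }
}

The Pig equivalent would be set hcat.desired.partition.num.splits 2; issued before the load statement.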
Apache Hive : HCatalog DynamicPartitions
Overview
When writing data in HCatalog it is possible to write all records to a single partition. In this case the partition column(s) need not be in the output data.
The following Pig script illustrates this:
A = load 'raw' using HCatLoader();
...
split Z into for_us if region='us', for_eu if region='eu', for_asia if region='asia';
store for_us into 'processed' using HCatStorer("ds=20110110, region=us");
store for_eu into 'processed' using HCatStorer("ds=20110110, region=eu");
store for_asia into 'processed' using HCatStorer("ds=20110110, region=asia");
To write data to multiple partitions simultaneously, place the partition columns in the data itself and do not specify partition values when storing; dynamic partitioning then routes each record to the right partition, as shown below.
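For example, the script above could be collapsed to a single store (a sketch: Z must now carry ds and region as columns):

store Z into 'processed' using HCatStorer();
-- or pin ds statically and let region be assigned dynamically:
store Z into 'processed' using HCatStorer("ds=20110110");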
Apache Hive : HCatalog InputOutput
Set Up
No HCatalog-specific setup is required for the HCatInputFormat and HCatOutputFormat interfaces.
Note: HCatalog is not thread safe.
HCatInputFormat
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables.
HCatInputFormat exposes a Hadoop 0.20 MapReduce API for reading data as if it had been published to a table.
API
The API exposed by HCatInputFormat includes methods such as setInput, setOutputSchema, and getTableSchema.
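A minimal driver sketch follows; the class names and the table default.mytable are illustrative, not part of the API:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ReadExample {
  // HCatInputFormat delivers each row to the mapper as an HCatRecord value.
  public static class ReadMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx)
        throws IOException, InterruptedException {
      // Field 0 of the illustrative table is assumed to be a string column.
      ctx.write(new Text(value.get(0).toString()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "hcat-read-example");
    job.setJarByClass(ReadExample.class);
    HCatInputFormat.setInput(job, "default", "mytable"); // database, table
    job.setInputFormatClass(HCatInputFormat.class);
    job.setMapperClass(ReadMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}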
Apache Hive : HCatalog InstallHCat
HCatalog Installed with Hive
Version
HCatalog is installed with Hive, starting with Hive release 0.11.0.
Hive installation is documented in the Hive installation guide.
HCatalog Command Line
If you install Hive from the binary tarball, the hcat command is available in the hcatalog/bin directory.
The hcat command line is similar to the hive command line; the main difference is that it restricts the queries that can be run to metadata-only operations: DDL commands and queries that read metadata (for example, “show tables”).
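For example, a metadata-only query works from the hcat command line:

hcat -e 'show tables;'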
Apache Hive : HCatalog LoadStore
Set Up
The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.
Note: HCatalog is not thread safe.
Running Pig
The -useHCatalog Flag
To bring in the appropriate jars for working with HCatalog, simply include the following flag when running Pig from the shell, Hue, or other applications:

pig -useHCatalog
Apache Hive : HCatalog Notification
Overview
Since version 0.2, HCatalog provides notifications for certain events happening in the system. This way applications such as Oozie can wait for those events and schedule the work that depends on them. The current version of HCatalog supports two kinds of events:
- Notification when a new partition is added
- Notification when a set of partitions is added
No additional work is required to send a notification when a new partition is added: the existing addPartition call will send the notification message.
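Notifications are delivered as JMS messages. The following is a minimal consumer sketch, assuming an ActiveMQ broker at an illustrative URL and a topic name taken from the table’s hcat.msgbus.topic.name property:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Destination;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PartitionListener implements MessageListener {
  public void onMessage(Message msg) {
    // Inspect the message to learn which partition was added.
    System.out.println("Got notification: " + msg);
  }

  public static void main(String[] args) throws Exception {
    // Illustrative broker URL and topic name.
    ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection conn = factory.createConnection();
    conn.start();
    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
    Destination topic = session.createTopic("mydb.mytable"); // from hcat.msgbus.topic.name
    MessageConsumer consumer = session.createConsumer(topic);
    consumer.setMessageListener(new PartitionListener());
  }
}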
Apache Hive : HCatalog ReaderWriter
Overview
HCatalog provides a data transfer API for parallel input and output without using MapReduce. This API provides a way to read data from a Hadoop cluster or write data into a Hadoop cluster, using a basic storage abstraction of tables and rows.
The data transfer API has three essential classes:
- HCatReader – reads data from a Hadoop cluster
- HCatWriter – writes data into a Hadoop cluster
- DataTransferFactory – generates reader and writer instances
Auxiliary classes in the data transfer API include ReadEntity, ReaderContext, WriteEntity, and WriterContext.
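A minimal read sketch follows (the table default.mytable is illustrative): the master obtains a ReaderContext via prepareRead, and each slave then reads one split of the data:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hive.hcatalog.data.transfer.HCatReader;
import org.apache.hive.hcatalog.data.transfer.ReadEntity;
import org.apache.hive.hcatalog.data.transfer.ReaderContext;

public class ReaderExample {
  public static void main(String[] args) throws Exception {
    // Master: describe what to read and prepare the read.
    ReadEntity entity = new ReadEntity.Builder()
        .withDatabase("default")   // illustrative database
        .withTable("mytable")      // illustrative table
        .build();
    Map<String, String> config = new HashMap<String, String>();
    HCatReader master = DataTransferFactory.getHCatReader(entity, config);
    ReaderContext context = master.prepareRead();

    // Slave: read one split of the data (split 0 here).
    HCatReader reader = DataTransferFactory.getHCatReader(context, 0);
    Iterator<HCatRecord> records = reader.read();
    while (records.hasNext()) {
      System.out.println(records.next());
    }
  }
}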
Apache Hive : HCatalog StorageFormats
SerDes and Storage Formats
HCatalog uses Hive’s SerDe class to serialize and deserialize data. SerDes are provided for RCFile, CSV text, JSON text, and SequenceFile formats. Check the SerDe documentation for additional SerDes that might be included in new versions. For example, the Avro SerDe was added in Hive 0.9.1, the ORC file format was added in Hive 0.11.0, and Parquet was added in Hive 0.10.0 (plug-in) and Hive 0.13.0 (native).
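The format of a table is chosen when it is created; for example, the STORED AS clause selects a built-in format, and HCatalog then uses the corresponding SerDe when reading and writing. The table and columns below are illustrative:

hcat -e 'create table mytable (id int, msg string) stored as rcfile;'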
Apache Hive : HCatalog Streaming Mutation API
The Streaming Mutation API is a Java API for mutating (inserting, updating, and deleting) records in transactional tables using Hive’s ACID feature. It was introduced in Hive 2.0.0 (HIVE-10165).
Background
In certain data processing use cases it is necessary to modify existing data when new facts arrive. An example is the classic ETL merge, where a copy of a data set is kept in sync with a master by the frequent application of deltas. The deltas describe the mutations (inserts, updates, deletes) that have occurred on the master since the previous sync. Implementing such a case on Hadoop has traditionally required rewriting the partitions that contain the records targeted by the mutations. This is a coarse approach: a partition containing millions of records might be rebuilt because of a single record change. Additionally, these partitions cannot be restated atomically; at some point the old partition data must be swapped with the new partition data. When this swap occurs, usually by issuing an HDFS rm followed by a mv, there is a window in which the data appears to be unavailable, so any downstream jobs consuming the data might unexpectedly fail. Therefore, data processing patterns that restate raw data on HDFS cannot operate robustly without some external mechanism to orchestrate concurrent access to changing data.
Apache Hive : HCatalog UsingHCat
Version information
HCatalog graduated from the Apache incubator and merged with the Hive project on March 26, 2013.
Hive version 0.11.0 is the first release that includes HCatalog.
Overview
HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools — Pig, MapReduce — to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored — RCFile format, text files, SequenceFiles, or ORC files.