Apache Hive : Replication

Overview

Hive Replication builds on the metastore event and ExIm features to provide a framework for replicating Hive metadata and data changes between clusters. The source cluster and the replica do not need to run the same Hadoop distribution, Hive version, or metastore RDBMS. The replication system has a fairly ‘light touch’: it is loosely coupled and uses the Hive metastore Thrift service as its integration point. However, the current implementation is not an ‘out of the box’ solution. In particular, you must provide some kind of orchestration service that is responsible for requesting replication tasks and executing them.

See HiveReplicationDevelopment for information on the design of replication in Hive.

A more advanced replication mechanism is being implemented in Hive to address some of the limitations of this mode. See HiveReplicationv2Development for details.

Potential Uses

Prerequisites

Limitations

Configuration

To persist metastore notification events, set the following [hive-site.xml](#hive-site-xml) properties on the source cluster. The metastore service must be restarted for the settings to take effect.

hive-site.xml Configuration for Replication

  <property>
    <name>hive.metastore.event.listeners</name>
    <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
  </property>
  <property>
    <name>hive.metastore.event.db.listener.timetolive</name>
    <value>86400s</value>
  </property>

By default, the system uses org.apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory, which relies on EXPORT and IMPORT commands to capture, move, and ingest the metadata and data that need to be replicated. A custom implementation can be supplied instead by setting the hive.repl.task.factory Hive configuration property.
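
As a rough sketch of how the task factory property might be set, the fragment below names the default factory explicitly. Placing the property in hive-site.xml is an assumption here; it may equally belong in whatever Hive configuration your orchestration service loads. A custom implementation would put its own fully qualified class name in the value.

```xml
  <property>
    <name>hive.repl.task.factory</name>
    <!-- Default factory named in the text above; replace with a custom class if needed -->
    <value>org.apache.hive.hcatalog.api.repl.exim.EximReplicationTaskFactory</value>
  </property>
```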

Typical Mode of Operation

Replication to AWS/EMR/S3

At this time it is not possible to replicate to tables on EMR that have a path location in S3. This is due to a bug in a dependency of the IMPORT command in the EMR distribution (last checked with AMI-4.2.0). Also, if you are using the EximReplicationTaskFactory, you may need to add the relevant S3 URI schemes to your Hive configuration:

HiveConf Configuration for ExIm on S3

  <property>
    <name>hive.exim.uri.scheme.whitelist</name>
    <value>hdfs,s3a</value>
  </property>
