Apache Hive : Hive across Multiple Data Centers (Physical Clusters)

This project has been abandoned. We’re leaving the design doc here in case someone decides to attempt this project in the future.

Use Cases

Inside facebook, we are running out of power inside a data center (physical cluster), and we have a need to have a bigger cluster.

We can divide the cluster into multiple clusters - multiple hive instances, multiple mr and multiple dfs. This will put a burden on

the user - he needs to know which cluster to use. It will be very difficult to support joins across tables in different clusters, and

will lead to a lot of duplication of data in the long run. To get around these problems, we are planning to extend hive to span

multiple data centers, and make the existence of multiple clusters transparent to the end users in the long run. Note that, even

today, different partitions/tables can span multiple dfs’s, and hive does not enforce any restrictions. Those dfs’s can be in different

data centers also. However, a single table/partition can only have a single location. We need to enhance this. We will not be able to

partition our warehouse cluster into multiple disjoint clusters, and therefore some tables/partitions will be present in multiple clusters.

Requirements

In order to do so, we need to make some changes in hive, and this document primarily talks about those. The changes should be generic

enough, so that they can be used by others (outside Facebook) also, if they have such a requirement. The following restrictions have

been imposed to simplify the problem:

secondary clusters.

However, an object (unpartitioned table/partition) is either fully present or not present at all in the secondary cluster.

It is not possible to have partial data of a partition in the secondary cluster.

T11’s secondary cluster is C2 (and all the data for T11 is also present in C2). + The query ‘select .. from T11 .. ' will be processed in C1 + The query ‘select .. from T11 join T12 .. ' will be processed in C1 + The query ‘select .. from T21 .. ' will be processed in C2 + The query ‘select .. from T11 join T21 .. ' will be processed in C2 + The query ‘select .. from T11 join T31 .. ' will fail + ‘Insert .. T13 select .. from T11 ..’ will be processed in C1 and the T13 will be created in C1 + ‘Insert .. T21 select .. from T11 ..’ will be processed in C2, and T21 will remain in C2

The same idea can be extended for partitioned tables.

in the . If the query can run in this cluster, it will succeed. Otherwise, it will fail.

In the first cut, the user needs to do this operation outside hive (one simple way to do so, is distcp the partition from the

primary to the secondary cluster, and then update the metadata directly - via the thrift api).

PrimaryCluster - ClusterStorageDescriptor

and SecondaryClusters - Set

The ClusterStorageDescriptor contains the following:

ClusterName

Location

location will be removed from the StorageDescriptor.

In order to support the above, hive metastore needs to be enhanced to have the concept of a cluster. The following parameters will

be added:

All new tables belong to this cluster.

The existing thrift API’s will continue to work as if the user is trying to access the default cluster.

New APIs will be added which take the cluster as a new parameter. Almost all the existing APIs will be

enhanced to support this. The behavior will be the same as if, the user issued the command ‘USE CLUSTER

If the user does not intend of use multiple clusters, there should be no change in the behavior of hive commands, and all the

existing thrift APIs should continue to work.

An upgrade script will be provided to upgrade the metastore with the patch.