Apache Hive : Enabling gRPC in Hive/Hive Metastore (Proposal)

Contacts: Cameron Moberg (Google), Zhou Fang (Google), Feng Lu (Google), Thejas Nair (Cloudera), Vihang Karajgaonkar (Cloudera), Naveen Gangam (Cloudera)

Last updated: 7/31/2020

Objective

Background

Hive Metastore is the central metadata repository for Apache Hive and for other engines such as Presto and Spark. It stores metadata for tables (e.g., schema, location, and statistics) and partitions in a relational database. It provides client access to this information through a Thrift Metastore API.

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, and many other languages.

gRPC is a modern open source high performance RPC framework that can run in any environment. It can efficiently connect services in and across data centers with pluggable support for load balancing, tracing, health checking and authentication. It is also applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services.

Providing gRPC as an option to access Metastore brings many benefits. Compared to Thrift, gRPC supports streaming, which provides better performance for large requests. In addition, it is extensible to more advanced authentication features and is fully compatible with Google’s IAM service, which supports fine-grained permission checks. This proposal sketches out a path to integrate gRPC with Hive Metastore.

Design

Overview

The overall design of the gRPC support in Hive Metastore is illustrated in Fig. 1. On the server side, based on user configuration, the Hive Metastore Server can listen on a port for Thrift or gRPC requests. The lifecycle of a Thrift request is unchanged. For a gRPC request, the new HiveMetaStoreGrpcServer translates the incoming gRPC request into a Thrift request, transparently passes it to HiveMetaStoreThriftServer, and translates the response back into gRPC.

On the client side, a similar design is used to support converting outgoing requests from Thrift to gRPC. 

Figure 1. An overview of the new gRPC endpoint of Hive Metastore. Clarification: The only network I/O that occurs is between the user and serving processes in HiveMetaStore (gRPC, Thrift, or both).

The implementation details are described in the following sections.

Implementation

Pluggable gRPC Support

To keep Hive Metastore loosely coupled from the gRPC layer, we propose a pluggable design: only a hook is implemented in the Hive Metastore repository, while the gRPC proxy library is implemented in a separate repository. To enable the gRPC server, a user sets “metastore.custom.server.class” in the Hive configuration to the class path of the server in the gRPC library. Hive Metastore then instantiates this class and starts the gRPC server described below. Here is an example of a similar pluggable library in Hive.
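
The hook described above could be sketched roughly as follows. This is a minimal illustration and not the proposed implementation: the interface CustomMetastoreServer, the stand-in class ExampleGrpcProxyServer, and the loader CustomServerLoader are hypothetical names, and the sketch assumes the value of metastore.custom.server.class is a fully qualified class name with a no-argument constructor. A plain java.util.Properties object stands in for the Hive configuration.

```java
import java.util.Properties;

/** Hypothetical hook interface; the name is an assumption made for this sketch. */
interface CustomMetastoreServer {
    void start() throws Exception;
    boolean isRunning();
}

/** Stand-in for a server class that would live in the separate gRPC library. */
class ExampleGrpcProxyServer implements CustomMetastoreServer {
    private boolean running;

    @Override
    public void start() {
        // A real implementation would bind a port and start serving here.
        running = true;
    }

    @Override
    public boolean isRunning() {
        return running;
    }
}

/** Hypothetical loader showing how the hook might instantiate the configured class. */
class CustomServerLoader {
    static final String CUSTOM_SERVER_CLASS_KEY = "metastore.custom.server.class";

    /** Returns the configured server, or null when the property is unset or blank. */
    static CustomMetastoreServer load(Properties conf) throws ReflectiveOperationException {
        String className = conf.getProperty(CUSTOM_SERVER_CLASS_KEY);
        if (className == null || className.trim().isEmpty()) {
            return null; // No custom server requested; only Thrift would be started.
        }
        Class<?> clazz = Class.forName(className.trim());
        if (!CustomMetastoreServer.class.isAssignableFrom(clazz)) {
            throw new IllegalArgumentException(
                className + " does not implement CustomMetastoreServer");
        }
        return (CustomMetastoreServer) clazz.getDeclaredConstructor().newInstance();
    }
}
```

With this shape, the Hive Metastore repository only needs to know the hook interface and the configuration key; everything gRPC-specific stays in the external library.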

The gRPC layer on the client side is implemented similarly in the separate repository. Changes are needed in the Hive repository to load the gRPC Hive client when it is enabled by config. For example, both SessionHiveMetaStoreClient.java and RetryingMetaStoreClient.java can be amended to dynamically load the HiveMetastore gRPC client if the metastore.uris value starts with “grpc://”.
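
The scheme check mentioned above might look like the following sketch. The class MetastoreClientSelector and its Transport enum are hypothetical names used only for illustration, and the sketch assumes a single URI is passed in rather than a comma-separated list.

```java
import java.net.URI;

/** Hypothetical helper that picks a transport from a metastore URI. */
class MetastoreClientSelector {
    enum Transport { THRIFT, GRPC }

    /** Returns GRPC when the URI scheme is "grpc", otherwise THRIFT. */
    static Transport transportFor(String metastoreUri) {
        String scheme = URI.create(metastoreUri.trim()).getScheme();
        return "grpc".equalsIgnoreCase(scheme) ? Transport.GRPC : Transport.THRIFT;
    }
}
```

A caller such as the retrying client wrapper could use the result to decide which client class to load, leaving the existing Thrift path as the default.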

Hive Metastore Server

Class Change

The following assumes modification of the standalone-metastore package.

To let the Hive Metastore receive and process gRPC requests, additional Java classes need to be created. Because application logic and startup logic are currently coupled in HiveMetaStore.java, the logic must first be separated: the Thrift implementation is broken out into a HiveMetaStoreThriftServer.java class that is then instantiated by the HiveMetaStore.java class (the main driver program). After that, the new gRPC implementation class, HiveMetaStoreGrpcServer.java, can be created and referenced in the same way.

The Thrift RPC definition files must be translated into gRPC protobuf files. Although there are some direct incompatibilities (such as sets, for which protobuf has no native type), these can be handled on a per-method basis in the implementation rather than in the protobuf files. An example service protobuf definition is shown below; keep in mind that GetTableRequest and GetTableResponse are both also defined in protobuf files. There is no reference to Thrift files.

proto

service HiveMetaStoreGrpc {
    rpc getTable(GetTableRequest) returns (GetTableResponse);
}
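
As one illustration of handling the set incompatibility in the per-method code, a Thrift set could be carried in a protobuf repeated field and turned back into a set on the other side. The helper below is a sketch under that assumption; SetFieldConverter is a hypothetical name and is not part of Hive or of the proposed library.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for moving set-valued fields through repeated fields. */
class SetFieldConverter {
    /** Thrift set to the value list of a protobuf repeated field. */
    static <T> List<T> toRepeated(Set<T> thriftSet) {
        return new ArrayList<>(thriftSet);
    }

    /** Repeated field values back to a set; duplicates collapse, first-seen order is kept. */
    static <T> Set<T> toThriftSet(List<T> repeatedValues) {
        return new LinkedHashSet<>(repeatedValues);
    }
}
```

Because a repeated field does not enforce uniqueness, the receiving side deduplicates when rebuilding the set rather than trusting the sender.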

Once the service methods are defined, the gRPC server can be created so that it can be instantiated by the driver class. The HiveMetaStoreGrpcServer.java class would extend the generated gRPC service base class, as shown below:

class

private class HiveMetaStoreGrpcServer extends HiveMetaStoreGrpc.HiveMetaStoreImplBase

An example implementation of the getTable translation layer is shown below.

sample

// Takes a gRPC GetTableRequest and returns a gRPC GetTableResponse
public GetTableResponse getTable(GetTableRequest grpcTableRequest) {
  // convertToThriftRequest is implemented by a different library, not in Hive code
  Table thriftGetTableRequest = convertToThriftRequest(grpcTableRequest);
  // Result is a Thrift API object
  GetTableResult result = HiveMetaStoreThriftServer.get_table(thriftGetTableRequest);
  // Translate the Thrift result back into a gRPC response
  return convertToGrpcResponse(result);
}
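
Since every proxied method follows the same three steps (translate the request, delegate to the Thrift handler, translate the result), the pattern can be sketched generically. The class TranslatingHandler below is a hypothetical illustration and not part of the proposal; the type parameters stand in for the gRPC and Thrift request and response types of any one method.

```java
import java.util.function.Function;

/** Hypothetical generic form of the translate, delegate, translate-back pattern. */
final class TranslatingHandler<GReq, TReq, TRes, GRes> {
    private final Function<GReq, TReq> toThrift;
    private final Function<TReq, TRes> thriftHandler;
    private final Function<TRes, GRes> toGrpc;

    TranslatingHandler(Function<GReq, TReq> toThrift,
                       Function<TReq, TRes> thriftHandler,
                       Function<TRes, GRes> toGrpc) {
        this.toThrift = toThrift;
        this.thriftHandler = thriftHandler;
        this.toGrpc = toGrpc;
    }

    /** Runs one request through the three steps in order. */
    GRes handle(GReq grpcRequest) {
        TReq thriftRequest = toThrift.apply(grpcRequest);       // gRPC to Thrift
        TRes thriftResult = thriftHandler.apply(thriftRequest); // existing Thrift logic
        return toGrpc.apply(thriftResult);                      // Thrift back to gRPC
    }
}
```

Keeping the two converters separate from the handler means method-specific quirks, such as set-valued fields, stay inside the converters while the Thrift logic is reused unchanged.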

As shown in Figure 1, green elements are newly added classes, while yellow elements are modified from the current design.

Config Changes

Because a new reachable endpoint may be started, additional hive-site.xml configs are required. The currently proposed configuration values are shown below.

Parsing metastore.custom.server.class is implemented in the Hive Metastore repository, whereas parsing the gRPC-specific configs can be offloaded to the gRPC library:

Hive Metastore Client

Class Change

While a Hive Metastore that supports gRPC requests is still useful without any Hive clients, it would be helpful for the Hive client to support gRPC communication with Hive Metastore as well. This is fairly similar to the previous section, but is described separately for clarity.

The class IMetaStoreClient is a thin wrapper on top of the Thrift interface and is implemented by HiveMetaStoreClient; however, this is an entirely Thrift-based implementation. In contrast to the server side, the client takes in a Thrift request (usually generated by the code itself), converts the request to gRPC, and then sends it over the wire to the listening gRPC server.

In this case, however, the easiest way to add gRPC support while keeping backwards compatibility is to create a new class, HiveMetaStoreGrpcClient, that implements the Thrift interface IMetaStoreClient. Instead of instantiating and opening a Thrift client, it creates a new gRPC client, converts the input Thrift requests to the relevant gRPC requests, and sends them over the wire with the generated gRPC client.

Example instantiation and usage of the gRPC client:

client

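import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// target is the address of the gRPC-enabled metastore, e.g. "host:port"
ManagedChannel channel = ManagedChannelBuilder.forTarget(target)
    // Channels are secure by default (via SSL/TLS). For the example we disable TLS to avoid
    // needing certificates.
    .usePlaintext()
    .build();
HiveMetaStoreGrpc.HiveMetaStoreBlockingStub blockingStub = HiveMetaStoreGrpc.newBlockingStub(channel);
// The response is a gRPC object that the client converts back to Thrift
GetTableResponse response = blockingStub.getTable(getTableRequest);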

The getTable method is defined in the metastore server spec, so the client only needs to handle converting the Thrift object to gRPC and choosing which gRPC method to call.

Configuration Changes

Similar to the changes to the server config, a user can populate the following fields to use a gRPC-enabled client:

Summary

Future Work
