Apache Hive : Links

Motivation

Today, the infrastructure provided by Hive allows for the setup of a single shared warehouse and the authorization model allows for access control within this warehouse if needed. Growth beyond a single warehouse (when datacenter capacity limits are reached) OR separation of capacity usage and allocation requires the creation of multiple warehouses with each warehouse mapping to it’s own Hive metastore. Let’s define the term physical warehouse to map to a single Hive metastore, the Hadoop cluster it maps to and the data in it.

In organizations with a large number of teams needing a warehouse, there is a need to be able to:

One way to do this is to use a single shared warehouse as we do today, but this has the below issues:

An alternative is to create a new physical warehouse per team (1:1 mapping), but this is not optimal since the physical resources are not shared across teams, and the operational cost is high. Further, data may not be cleanly partition-able and end up being replicated in multiple physical warehouses.

To provide context, in Facebook, we expect to have 20+ partitions of the warehouse, so operating each in their own physical warehouse will be impractical from an operational perspective.

Requirements

Introduce the notion of a virtual warehouse (namespace) in Hive with the below key properties:

Mapping many namespaces to 1 physical warehouse keeps the operational cost low. If a physical warehouse reaches capacity limits, portability will allow seamless migration of the namespace to another physical warehouse.

Note that users can operate on multiple namespaces simultaneously although they are likely to most often operate within one namespace. So namespaces are not trying to solve the problem of ensuring that users only have access to a subset of data in the warehouse.

From Hive, therefore the requirements are:

Design

The design that is proposed is:

Let’s take a concrete example:

There are 3 alternatives to implementing these ideas in open-source hive and Facebook extensions:

Links to JIRAS for these features:

A basic tenet of our design is that a Hive instance does not operate across physical warehouses. We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication.

Alternate design options

Modeling Namespace as a Role in Hive (using the authorization model)

The idea here is to create a Role for each namespace and users operating in that namespace belong to that Role. Access to data outside the namespace is made possible by granting permissions to the foreign table/view to the Role for the namespace.

Issues

Modeling Namespace by tagging objects

The idea here is to tag tables/partitions with the namespaces that they belong to. To handle the requirements:

None of these are insurmountable problems, but using databases to model namespaces is a cleaner approach.

(Taking this idea further, a database in Hive could itself have been implemented using tags in a single global namespace which would have not been as elegant as the current implementation of a database being a first class concept in Hive.)

Modeling Namespace as a database but using views for imports

Using partitioned views:

By definition, there isn’t a one-one mapping between a view partition and a table partition. In fact, hive today does not even know about this dependency between view partitions and table partitions. Partitioned views is just a metadata concept - it is not something that the query layer understands. For e.g: if a view V partitioned on ds had 2 partitions: 1 and 2, then a query like select … from V where ds = 3 may still give valid results if the ds=3 is satisfied by the table underlying V. This means that:

The above notes make it clear that what we are trying to build is a very special case of a degenerate view, and it would be cleaner to introduce a new concept in Hive to model these ‘imports’.

Comments:

Questions from Ashutosh Chauhan (with inline responses):

Posted by sambavi at May 22, 2012 02:10 | | The first draft of this proposal is very hard to decipher because it relies on terms that aren’t well defined. For example, here’s the second sentence from the motivations section: Growth beyond a single warehouse (or) separation of capacity usage and allocation requires the creation of multiple physical warehouses, i.e., separate Hive instances. What’s the difference between a warehouse and a physical warehouse? How do you define a Hive instance? In the requirements section the term virtual warehouse is introduced and equated to a namespace, but clearly it’s more than that because otherwise DBs/Schemas would suffice. Can you please update the proposal to include definitions of these terms?

Posted by cwsteinbach at May 22, 2012 18:35 | | Prevent access using two part name syntax (Y.T) if namespaces feature is “on” in a Hive instance. This ensures the database is self-contained. The cross-namespace Hiveconf ACL proposed in HIVE-3016 doesn’t prevent anyone from doing anything because there is no way to keep users from disabling it. I’m surprised to see this ticket mentioned here since three committers have already gone on record saying that this is the wrong approach, and one committer even -1’d it. If preventing cross-db references in queries is a requirement for this project, then I think Hive’s authorization mechanism will need to be extended to support this privilege/restriction.

Posted by cwsteinbach at May 22, 2012 18:48 | | From the design section: We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication. Does the work proposed in HIVE-2989 also include adding this Db/Table replication infrastructure to Hive?

Posted by cwsteinbach at May 22, 2012 18:53 | | We mention the JIRA here for the sake of completeness. We are implementing this as a pre-execution hook for now, but support for namespaces will be incomplete without this control (since you can’t guarantee self-contained namespaces unless you prevent two-part name access). What extensions to the authorization system are you thinking of? One idea would be to set role for a session (corresponding to the namespace the user is operating in), so that a user operating in the context of that role can only see the data available to that role.

Posted by sambavi at May 22, 2012 19:42 | | What extensions to the authorization system are you thinking of? Add a new privilege named something like “select_cross_db” and GRANT it to specific users as follows: GRANT select_cross_db ON DATABASE db TO USER x; GRANT select_cross_db ON DATABASE db to ROLE x; This privilege would be provided by default, but if absent, then the user would be prevented from referencing DBs outside of ‘db’ while using ‘db’ as the primary database.

Posted by cwsteinbach at May 22, 2012 20:01 | | Thanks Carl - this is an interesting suggestion - for installations with namespaces, we would need to turn this privilege off by default and have no users. groups or roles be granted this privilege. We’ll discuss internally.

Posted by sambavi at May 22, 2012 20:25 | | I’ve edited the main page to include a definition for physical warehouse and removed the term Hive instance to reduce ambiguity.

Posted by sambavi at May 22, 2012 20:26 | | No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year.

Posted by sambavi at May 23, 2012 10:30 | | I have opened a JIRA for adding a new privilege for cross database commands and resolved HIVE-3016. Thanks for the suggestion!

Posted by sambavi at May 23, 2012 12:36 | | This proposal describes the DDL and metadata changes that are necessary to support DB Links, but it doesn’t include any details about the mechanics of replicating data across clusters (it’s probably a little more complicated than just running distcp). I think the proposal needs to include these details before it can be considered complete. No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year. The metadata extensions described in this proposal will require users to run metastore upgrade scripts, and the DDL extensions will become part of the public API. The former imposes a burden on users, and the latter constitutes a continuing maintenance burden on the people who contribute to this project. Taking this into account I think we need to be able to demonstrate that the new code tangibly benefits users before it appears in a Hive release. I don’t think it will be possible to demonstrate a tangible benefit to users until the replication/update mechanism is implemented and integrated into Hive.

Posted by cwsteinbach at May 25, 2012 19:21 | | Some thoughts on this from our team:

  1. Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team’s data/workload. This is independent of security and replication considerations.
  2. Further, replication as a feature is pretty big and will take a while to be built out. We can hide the table link feature behind a config parameter so that its not exposed to users who don’t need it until its completed. The only piece we cannot hide is the metastore changes, but the upgrade script for the metastore will just add few columns in few tables, and should not take more than a few minutes even for a pretty large warehouse (few thousand tables + ~100,000 partitions). In the meanwhile, users who do need this feature (like Facebook) can use it if they want to. If there was links support to start with in hive, we would have used it from the very beginning, and not gotten into the mess of one large warehouse with difficulty in dealing with multi-tenancy. We seriously believe that this is the right direction for the community, and all new users can design the warehouse in the right way from the very start, and learn from Facebook’s experience.

Posted by sambavi at May 25, 2012 21:37 | | Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team’s data/workload. This is independent of security and replication considerations. Hive already provides support for namespaces in the form of databases/schemas. As far as I can tell the table link feature proposed here actually weakens the isolation guarantees provided by databases/schemas, and consequently will make quota and workload management between teams more complicated. In order to resolve this disagreement I think it would help if you provided some concrete examples of how the table link feature improves this situation.

Posted by cwsteinbach at May 29, 2012 13:54 | | Suppose there are 2 teams which want to use the same physical warehouse. Team 1 wants to use the following: (let us say that each table is partitioned by date) T1 (all partitions) T2 (all partitions) T3 (all partitions) Team 2 wants to use the following: T1 (partitions for the last 3 days) T2 (partition for a fixed day: say May 29' 2012) T4 (all partitions) Using the current hive architecture, we can perform the following:

The problems with these approaches is as follows:

Links is a degenerate case of views. With links, the above use case can be solved very easily. This is a real use case at Facebook today, and I think, there will be similar use cases for other users. Maybe, they are not solving it in the most optimal manner currently.

Posted by namit.jain at May 29, 2012 18:33 | | Furthermore, databases don’t provide isolation today since two part name access is unrestricted. By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases.

Posted by sambavi at May 30, 2012 01:22 | | @Namit: Thanks for providing an example. I have a couple followup questions: Use a single database and have scripts for quota Can you provide some more details about how quota management works? For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) I’m not sure why this is listed as an option. What does this actually accomplish? Use 2 databases, and use views in database 2 (to be used to team 2). This seems like the most reasonable approach given my current understanding of your use case. You listed three problems with these approaches. Most of them don’t seem applicable to views: Table Discovery etc.becomes very messy. You can do that via tags, but then, all the functionality that is provided by databases can also be provided via tags. It’s hard for me to evaluate this claim since I’m not totally sure what is meant by “table discovery”. Can you please provide an example? However, my guess is that this is not a differentiator if you’re comparing the table links approach to views. Duplication of data Not applicable for views and table links. The user will have to perform the management himself. When a partition gets added to DB1.T1, the corresponding partition needs to be added to DB2.View1, and the 3 day old partition from DB2.View1 needs to be dropped. Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. This has to be done outside hive, and makes the task of maintaining these partitions very difficult - how do you make sure this is atomic etc. User has to do lot more scripting. I don’t think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. Links is a degenerate case of views. I agree that table links are a degenerate case of views. Since that’s the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views?

Posted by cwsteinbach at May 31, 2012 16:45 | | Furthermore, databases don’t provide isolation today since two part name access is unrestricted. DBs in conjunction with the authorization system provide strong isolation between different namespaces. Also, it should be possible to extend the authorization system to the two-part-name-access case that you described above (e.g. HIVE-3047). By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases. I think you already get that by using views. If I’m wrong can you please explain how the view approach falls short? Thanks.

Posted by cwsteinbach at May 31, 2012 16:59 | | Carl: I’ve addressed your questions below.

Can you provide some more details about how quota management works? For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? If the partition is shared, it is accounted towards both their quota (base quota for the team that owns the partition, and imported quota for the team that imports it via a link). The reason for this is that when a namespace is moved to another datacenter, we have to account for all the quota (both imported and base) as belonging to the namespace (the data can no longer be shared directly via a link, and we will need to replicate it). Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) I’m not sure why this is listed as an option. What does this actually accomplish? It was just one way of achieving the same data being available in the two namespaces. You can ignore this one (smile) It’s hard for me to evaluate this claim since I’m not totally sure what is meant by “table discovery”. Can you please provide an example? However, my guess is that this is not a differentiator if you’re comparing the table links approach to views. I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. Maybe the description was unclear. Table links have two types: static and dynamic. Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. I don’t think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. Addressed this above - table links do keep the links updated when the source of the link has partitions added or dropped. This will be atomic since it is done in one metastore operation during an ALTER TABLE ADD/DROP PARTITION command. I agree that table links are a degenerate case of views. Since that’s the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views? Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. Given this, it seems ideal to introduce table links as a concept rather than overload views.

Posted by sambavi at May 31, 2012 18:09 | | Hi Carl, I explained how views fall short in the post below (response to your comments on Namit’s post). Please add any more questions you have - I can explain further if unclear.

Posted by sambavi at May 31, 2012 18:11 | | I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. Views are not opaque. DESCRIBE FORMATTED currently includes the following information:

# View Information	 	 
View Original Text: 	SELECT value FROM src WHERE key=86	 
View Expanded Text: 	SELECT `src`.`value` FROM `src` WHERE `src`.`key`=86	 

Currently the metastore only tracks the original and expanded text view query, but it would be straightforward to also extract and store the list of source tables that are referenced in the query when the view is created (in fact, there’s already a JIRA ticket for this (HIVE-1073), and the information is already readily available internally as described here). If we assume that it’s possible to make this change then from a quota management standpoint I don’t think the table links makes quota management easier. Maybe the description was unclear. Table links have two types: static and dynamic. Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. I don’t think dynamic table links satisfy the use case covered by Team 2’s access requirements for table T1. Team 2 wants to see only the most recent three partitions in table T1, and my understanding of dynamic table links is that once the link is created, Team 2 will subsequently see every new partition that is added to the source table. In order to satisfy Team 2’s requirements I think you’re going to have to manually add and drop partitions from the link using the ALTER LINK ADD/DROP PARTITION command, which doesn’t sound any easier than using partitioned views. The functionality provided by dynamic links does make sense in some contexts, but the same is true for dynamic partitioned views. Why not extend the partitioned view feature to support dynamic partitions? Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. Given this, it seems ideal to introduce table links as a concept rather than overload views. I addressed the “views are opaque” argument above. I’m having trouble following the rest of the sentence. What does “single instancing of imports” mean? If possible can you provide an example in terms of table links and partitioned views?

Posted by cwsteinbach at May 31, 2012 22:27 | | Going back to my example: Suppose there are 2 teams which want to use the same physical warehouse. Team 1 wants to use the following: (let us say that each table is partitioned by date) T1 (all partitions) T2 (all partitions) T3 (all partitions) Team 2 wants to use the following: T1 (partitions for the last 3 days) T2 (partition for a fixed day: say May 29' 2012) T4 (all partitions) Using the current hive architecture, we can perform the following:

We have discarded the first 2 approaches above, so let us discuss how will we use approach 3 (specifically for T1). Team 2 will create the view: create view V1T1 as select * from DB1.T1 Now, whenever a partition gets added in DB1.T1, someone (a hook or something - outside hive) needs to add the corresponding partition in V1T1. That extra layer needs to make sure that the new partition in V1T1 is part of the inputs (may be necessary for auditing etc.) Hive metastore has no knowledge of this dependency (view partition -> table partition), and it is maintained in multiple places (for possibly different teams). The same argument applies when a partition gets dropped from DB1.T1. By design, there is no one-to-one dependency between a table partition and a view partition, and we do not want to create such a dependency. The view may depend on multiple tables/partitions. The views in hive are not updatable. By design, the schema of the view and the underlying table(s) can be different. Links provide the above functionality. If I understand right, you are proposing to extend views to support the above functionality. We will end up with a very specific model for a specific type of views, which are not like normal hive views. That would be more confusing, in my opinion.

Posted by namit.jain at Jun 01, 2012 14:25 | | Please comment - we haven’t gotten any updates on the wiki as well as the jira https://issues.apache.org/jira/browse/HIVE-2989

Posted by namit.jain at Jun 02, 2012 19:35 |