Apache Hive : Links

Motivation

Today, the infrastructure provided by Hive allows for the setup of a single shared warehouse and the authorization model allows for access control within this warehouse if needed. Growth beyond a single warehouse (when datacenter capacity limits are reached) OR separation of capacity usage and allocation requires the creation of multiple warehouses with each warehouse mapping to it’s own Hive metastore. Let’s define the term physical warehouse to map to a single Hive metastore, the Hadoop cluster it maps to and the data in it.

In organizations with a large number of teams needing a warehouse, there is a need to be able to:

Maximize sharing of physical clusters to keep operational costs low
Clearly identify and track capacity usage by teams in the data warehouse

One way to do this is to use a single shared warehouse as we do today, but this has the below issues:

When the warehouse reaches datacenter capacity limits, it is hard to identify self-contained pieces that can be migrated out.
Capacity tracking and management becomes an issue.

An alternative is to create a new physical warehouse per team (1:1 mapping), but this is not optimal since the physical resources are not shared across teams, and the operational cost is high. Further, data may not be cleanly partition-able and end up being replicated in multiple physical warehouses.

To provide context, in Facebook, we expect to have 20+ partitions of the warehouse, so operating each in their own physical warehouse will be impractical from an operational perspective.

Requirements

Introduce the notion of a virtual warehouse (namespace) in Hive with the below key properties:

Can be housed in the same physical warehouse with other virtual warehouses (multi-tenancy).
Portable (so it can be moved from one physical warehouse to another). Being self-contained is a necessary condition for portability (all queries on this namespace operate only on data available in the namespace).
Unit of capacity tracking and capacity allocation. This is a nice side effect of creating self-contained namespaces and allows capacity planning based on the virtual warehouse growth.

Mapping many namespaces to 1 physical warehouse keeps the operational cost low. If a physical warehouse reaches capacity limits, portability will allow seamless migration of the namespace to another physical warehouse.

Note that users can operate on multiple namespaces simultaneously although they are likely to most often operate within one namespace. So namespaces are not trying to solve the problem of ensuring that users only have access to a subset of data in the warehouse.

From Hive, therefore the requirements are:

Provide metadata to identify tables and queries that belong to one namespace.
Provide controls to prevent operating on tables outside the namespace.
Provide commands to explicitly request that tables/partitions in namespace1 be made available in namespace2 (since some tables/partitions may be needed across multiple namespaces). Avoid making copies of tables/partitions for this.

Design

The design that is proposed is:

Modeling namespaces as databases. No explicit accounting/tracking of tables/partitions/views that belong to a namespace is needed since a database provides that already.
Prevent access using two part name syntax (Y.T). This ensures the database is self-contained.
Modeling table/partition imports across namespaces using a new concept called Links in Hive. There will be commands to create Links to tables in other databases, alter and drop them. Links do not make copies of the table/partition and hence avoid data duplication in the same physical warehouse.

Let’s take a concrete example:

Namespace A resides in database A, namespace B in database B.
Access across these namespace using A.T or B.T syntax is disabled in ‘namespace’ mode.
The user is importing table T1 from B into A .
The user issues a CREATE LINK command, which creates metadata in the target namespace A for the table + metadata to indicate which object is linked.
The ALTER LINK ADD PARTITION command is used to add partitions to the link.
These partitions are modeled by replicating partition-level metadata in the target database A for the accessible partitions.
The Link can be dynamic, which means it is kept updated as the source table gets new partitions or drops partitions.

There are 3 alternatives to implementing these ideas in open-source hive and Facebook extensions:

Implement links as a first-class concept in Hive, and use a Facebook hook to disable Y.T access unless there is a link to the table Y.T.
Implement links as a first-class concept, and introduce a new syntax T@Y to access linked content. Use a Facebook hook to disable Y.T access.
Implement links as a first-class concept, and introduce a new syntax T@Y to access linked content. Disable cross database access using a new privilege. All these changes will be in Open Source Hive.

Links to JIRAS for these features:

A basic tenet of our design is that a Hive instance does not operate across physical warehouses. We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication.

Alternate design options

Modeling Namespace as a Role in Hive (using the authorization model)

The idea here is to create a Role for each namespace and users operating in that namespace belong to that Role. Access to data outside the namespace is made possible by granting permissions to the foreign table/view to the Role for the namespace.

Issues

A user who belongs to multiple namespaces (and hence multiple roles) will be able to operate on all data across those namespaces at any point in time, so namespaces are no longer self-contained. Imagine the situation of user A who has access to namespaces N1 and N2 running a query on both simultaneously. Either of those queries will be able to access data across both N1 and N2 although this shouldn’t be allowed.
Capacity tracking is more complex
Operations like show tables, and show partitions do not work without changes.

Modeling Namespace by tagging objects

The idea here is to tag tables/partitions with the namespaces that they belong to. To handle the requirements:

Tags will have to be explicitly created by a user of the warehouse (through a command)
Commands like show tables and show partitions will need changes to understand tags.
Capacity tracking and management will be more complex than using databases for this purpose.
Data migration is more complex since the data is not contained with the root folder of one database

None of these are insurmountable problems, but using databases to model namespaces is a cleaner approach.

(Taking this idea further, a database in Hive could itself have been implemented using tags in a single global namespace which would have not been as elegant as the current implementation of a database being a first class concept in Hive.)

Modeling Namespace as a database but using views for imports

The view would be a simple select * using Y.T syntax. It’s a degenerate case of view.
We would need a registry of all views which import tables/partitions from other databases for namespace accounting. This requires adding metadata to these views to distinguish them from other user-created views.
It would be harder to single instance imports using views (same table/partitions imported twice into the same namespace). Views are too opaque.

Using partitioned views:

By definition, there isn’t a one-one mapping between a view partition and a table partition. In fact, hive today does not even know about this dependency between view partitions and table partitions. Partitioned views is just a metadata concept - it is not something that the query layer understands. For e.g: if a view V partitioned on ds had 2 partitions: 1 and 2, then a query like select … from V where ds = 3 may still give valid results if the ds=3 is satisfied by the table underlying V. This means that:

View metadata doesn’t stay in sync with source partitions (as partitions get added and dropped). The user has to explicitly do this, which won’t work for our case.
We would like to differentiate the set of partitions that are available for the same imported tables across namespaces. This would require partition pruning based on the view partitions in query rewrite which is not how it works today.

The above notes make it clear that what we are trying to build is a very special case of a degenerate view, and it would be cleaner to introduce a new concept in Hive to model these ‘imports’.

Comments:



Questions from Ashutosh Chauhan (with inline responses):

What exactly is contained in tracking capacity usage. One is disk space. That I presume you are going to track via summing size under database directory. Are you also thinking of tracking resource usage in terms of CPU/memory/network utilization for different teams? Right now the capacity usage in Hive we will track is the disk space (managed tables that belong to the namespace + imported tables). We will track the mappers and reducers that the namepace utilizes directly from Hadoop.
Each namespace (ns) will have exactly one database. If so, then users are not allowed to create/use databases in such deployment? Not necessarily a problem, just trying to understand design. This is correct – this is a limitation of the design. Introducing a new concept seemed heavyweight, so we re-used databases for namespaces. But this approach means that a given namespace cannot have sub-databases in it.
How are you going to keep metadata consistent across two ns? If metadata gets updated in remote ns, will it get automatically updated in user’s local ns? If yes, how will this be implemented? If no, then every time user need to use data from remote ns, she has to bring metadata uptodate in her ns. How will she do it? Metadata will be kept in sync for linked tables. We will make alter table on the remote table (source of the link) cause an update to the target of the link. Note that from a Hive perspective, the metadata for the source and target of a link is in the same metastore.
Is it even required that metadata of two linked tables to be consistent? Seems like user has to run “alter link add partition” herself for each partition. She can choose only to add few partitions. In this case, tables in two ns have different number of partitions and thus data. What you say above is true for static links. For dynamic links, add and drop partition on the source of the link will cause the target to get those partitions as well (we trap alter table add/drop partition to provide this behavior).
Who is allowed to create links? Any user on the database who has create/all privileges on the database. We could potentially create a new privilege for this, but I think create privilege should suffice. We can similarly map alter, drop privileges to the appropriate operations.
Once user creates a link, who can use it? If everyone is allowed to access, then I don’t see how is it different from the problem that you are outlining in first alternative design option, wherein user having an access to two ns via roles has access to data on both ns. The link creates metadata in the target database. So you can only access data that has been linked into this database (access is via the T@Y or Y.T syntax depending on the chosen design option). Note that this is different than having a role that a user maps to since in that case, there is no local metadata in the target database specifying if the imported data is accessible from this database.
If links are first class concepts, then authorization model also needs to understand them? I don’t see any mention of that. Yes, we need to account for the authorization model.
I see there is a hdfs jira for implementing hard links of files in hdfs layer, so that takes care of linking physical data on hdfs. What about tables whose data is stored in external systems. For example, hbase. Does hbase also needs to implement feature of hard-linking their table for hive to make use of this feature? What about other storage handlers like cassandra, mongodb etc. The link does not create a link on HDFS. It just points to the source table/partitions. One can think of it as a Hive-level link so there is no need for any changes/features from the other storage handlers.
Migration will involve two step process of distcp’ing data from one cluster to another and then replicating one mysql instance to another. Are there any other steps? Do you plan to (later) build tools to automate this process of migration. We will be building tools to enable migration of a namespace to another cluster. Migration will involve replicating the metadata and the data as you mention above.
When migrating ns from one datacenter to another, will links be dropped or they are also preserved? We will preserve them – by copying the data for the links to the other datacenter.

Posted by sambavi at May 22, 2012 02:10 | | The first draft of this proposal is very hard to decipher because it relies on terms that aren’t well defined. For example, here’s the second sentence from the motivations section: Growth beyond a single warehouse (or) separation of capacity usage and allocation requires the creation of multiple physical warehouses, i.e., separate Hive instances. What’s the difference between a warehouse and a physical warehouse? How do you define a Hive instance? In the requirements section the term virtual warehouse is introduced and equated to a namespace, but clearly it’s more than that because otherwise DBs/Schemas would suffice. Can you please update the proposal to include definitions of these terms?

Posted by cwsteinbach at May 22, 2012 18:35 | | Prevent access using two part name syntax (Y.T) if namespaces feature is “on” in a Hive instance. This ensures the database is self-contained. The cross-namespace Hiveconf ACL proposed in HIVE-3016 doesn’t prevent anyone from doing anything because there is no way to keep users from disabling it. I’m surprised to see this ticket mentioned here since three committers have already gone on record saying that this is the wrong approach, and one committer even -1’d it. If preventing cross-db references in queries is a requirement for this project, then I think Hive’s authorization mechanism will need to be extended to support this privilege/restriction.

Posted by cwsteinbach at May 22, 2012 18:48 | | From the design section: We are building a namespace service external to Hive that has metadata on namespace location across the Hive instances, and allows importing data across Hive instances using replication. Does the work proposed in HIVE-2989 also include adding this Db/Table replication infrastructure to Hive?

Posted by cwsteinbach at May 22, 2012 18:53 | | We mention the JIRA here for the sake of completeness. We are implementing this as a pre-execution hook for now, but support for namespaces will be incomplete without this control (since you can’t guarantee self-contained namespaces unless you prevent two-part name access). What extensions to the authorization system are you thinking of? One idea would be to set role for a session (corresponding to the namespace the user is operating in), so that a user operating in the context of that role can only see the data available to that role.

Posted by sambavi at May 22, 2012 19:42 | | What extensions to the authorization system are you thinking of? Add a new privilege named something like “select_cross_db” and GRANT it to specific users as follows: GRANT select_cross_db ON DATABASE db TO USER x; GRANT select_cross_db ON DATABASE db to ROLE x; This privilege would be provided by default, but if absent, then the user would be prevented from referencing DBs outside of ‘db’ while using ‘db’ as the primary database.

Posted by cwsteinbach at May 22, 2012 20:01 | | Thanks Carl - this is an interesting suggestion - for installations with namespaces, we would need to turn this privilege off by default and have no users. groups or roles be granted this privilege. We’ll discuss internally.

Posted by sambavi at May 22, 2012 20:25 | | I’ve edited the main page to include a definition for physical warehouse and removed the term Hive instance to reduce ambiguity.

Posted by sambavi at May 22, 2012 20:26 | | No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year.

Posted by sambavi at May 23, 2012 10:30 | | I have opened a JIRA for adding a new privilege for cross database commands and resolved HIVE-3016. Thanks for the suggestion!

Posted by sambavi at May 23, 2012 12:36 | | This proposal describes the DDL and metadata changes that are necessary to support DB Links, but it doesn’t include any details about the mechanics of replicating data across clusters (it’s probably a little more complicated than just running distcp). I think the proposal needs to include these details before it can be considered complete. No Hive-2989 will not include the replication infrastructure. We plan to provide replication in the second half of the year. The metadata extensions described in this proposal will require users to run metastore upgrade scripts, and the DDL extensions will become part of the public API. The former imposes a burden on users, and the latter constitutes a continuing maintenance burden on the people who contribute to this project. Taking this into account I think we need to be able to demonstrate that the new code tangibly benefits users before it appears in a Hive release. I don’t think it will be possible to demonstrate a tangible benefit to users until the replication/update mechanism is implemented and integrated into Hive.

Posted by cwsteinbach at May 25, 2012 19:21 | | Some thoughts on this from our team:

Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team’s data/workload. This is independent of security and replication considerations.
Further, replication as a feature is pretty big and will take a while to be built out. We can hide the table link feature behind a config parameter so that its not exposed to users who don’t need it until its completed. The only piece we cannot hide is the metastore changes, but the upgrade script for the metastore will just add few columns in few tables, and should not take more than a few minutes even for a pretty large warehouse (few thousand tables + ~100,000 partitions). In the meanwhile, users who do need this feature (like Facebook) can use it if they want to. If there was links support to start with in hive, we would have used it from the very beginning, and not gotten into the mess of one large warehouse with difficulty in dealing with multi-tenancy. We seriously believe that this is the right direction for the community, and all new users can design the warehouse in the right way from the very start, and learn from Facebook’s experience.

Posted by sambavi at May 25, 2012 21:37 | | Even in a single physical warehouse, namespaces allow better quota management and isolation between independent team’s data/workload. This is independent of security and replication considerations. Hive already provides support for namespaces in the form of databases/schemas. As far as I can tell the table link feature proposed here actually weakens the isolation guarantees provided by databases/schemas, and consequently will make quota and workload management between teams more complicated. In order to resolve this disagreement I think it would help if you provided some concrete examples of how the table link feature improves this situation.

Posted by cwsteinbach at May 29, 2012 13:54 | | Suppose there are 2 teams which want to use the same physical warehouse. Team 1 wants to use the following: (let us say that each table is partitioned by date) T1 (all partitions) T2 (all partitions) T3 (all partitions) Team 2 wants to use the following: T1 (partitions for the last 3 days) T2 (partition for a fixed day: say May 29' 2012) T4 (all partitions) Using the current hive architecture, we can perform the following:

Use a single database and have scripts for quota
Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively)
Use 2 databases, and use views in database 2 (to be used to team 2).

The problems with these approaches is as follows:

Table Discovery etc.becomes very messy. You can do that via tags, but then, all the functionality that is provided by databases can also be provided via tags.
Duplication of data
The user will have to perform the management himself. When a partition gets added to DB1.T1, the corresponding partition needs to be added to DB2.View1, and the 3 day old partition from DB2.View1 needs to be dropped. This has to be done outside hive, and makes the task of maintaining these partitions very difficult - how do you make sure this is atomic etc. User has to do lot more scripting.

Links is a degenerate case of views. With links, the above use case can be solved very easily. This is a real use case at Facebook today, and I think, there will be similar use cases for other users. Maybe, they are not solving it in the most optimal manner currently.

Posted by namit.jain at May 29, 2012 18:33 | | Furthermore, databases don’t provide isolation today since two part name access is unrestricted. By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases.

Posted by sambavi at May 30, 2012 01:22 | | @Namit: Thanks for providing an example. I have a couple followup questions: Use a single database and have scripts for quota Can you provide some more details about how quota management works? For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) I’m not sure why this is listed as an option. What does this actually accomplish? Use 2 databases, and use views in database 2 (to be used to team 2). This seems like the most reasonable approach given my current understanding of your use case. You listed three problems with these approaches. Most of them don’t seem applicable to views: Table Discovery etc.becomes very messy. You can do that via tags, but then, all the functionality that is provided by databases can also be provided via tags. It’s hard for me to evaluate this claim since I’m not totally sure what is meant by “table discovery”. Can you please provide an example? However, my guess is that this is not a differentiator if you’re comparing the table links approach to views. Duplication of data Not applicable for views and table links. The user will have to perform the management himself. When a partition gets added to DB1.T1, the corresponding partition needs to be added to DB2.View1, and the 3 day old partition from DB2.View1 needs to be dropped. Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. This has to be done outside hive, and makes the task of maintaining these partitions very difficult - how do you make sure this is atomic etc. User has to do lot more scripting. I don’t think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. Links is a degenerate case of views. I agree that table links are a degenerate case of views. Since that’s the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views?

Posted by cwsteinbach at May 31, 2012 16:45 | | Furthermore, databases don’t provide isolation today since two part name access is unrestricted. DBs in conjunction with the authorization system provide strong isolation between different namespaces. Also, it should be possible to extend the authorization system to the two-part-name-access case that you described above (e.g. HIVE-3047). By introducing a trackable way of accessing content outside the current database (using table links), we get isolation for a namespace using Hive databases. I think you already get that by using views. If I’m wrong can you please explain how the view approach falls short? Thanks.

Posted by cwsteinbach at May 31, 2012 16:59 | | Carl: I’ve addressed your questions below.

Can you provide some more details about how quota management works? For example, Team 1 and 2 both need access to a single partition in table T2, so who pays for the space occupied by this partition? Are both teams charged for it? If the partition is shared, it is accounted towards both their quota (base quota for the team that owns the partition, and imported quota for the team that imports it via a link). The reason for this is that when a namespace is moved to another datacenter, we have to account for all the quota (both imported and base) as belonging to the namespace (the data can no longer be shared directly via a link, and we will need to replicate it). Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively) I’m not sure why this is listed as an option. What does this actually accomplish? It was just one way of achieving the same data being available in the two namespaces. You can ignore this one (smile) It’s hard for me to evaluate this claim since I’m not totally sure what is meant by “table discovery”. Can you please provide an example? However, my guess is that this is not a differentiator if you’re comparing the table links approach to views. I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. Based on the description of table links in HIVE-2989 it sounds like the user will still have to perform manual management even with table links, e.g. dropping the link that points to the partition from four days ago and adding a new link that points to the most recent partition. In this case views may actually work better since you can embed the filter condition (last three days) in the view definition instead of relying on external tools to update the table links. Maybe the description was unclear. Table links have two types: static and dynamic. Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. I don’t think table links make this process atomic, and as I mentioned above the process of maintaining this linked set of partitions actually seems easier if you use views instead. Addressed this above - table links do keep the links updated when the source of the link has partitions added or dropped. This will be atomic since it is done in one metastore operation during an ALTER TABLE ADD/DROP PARTITION command. I agree that table links are a degenerate case of views. Since that’s the case, why is it necessary to implement table links? Why not leverage the functionality that is already provided with views? Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. Given this, it seems ideal to introduce table links as a concept rather than overload views.

Posted by sambavi at May 31, 2012 18:09 | | Hi Carl, I explained how views fall short in the post below (response to your comments on Namit’s post). Please add any more questions you have - I can explain further if unclear.

Posted by sambavi at May 31, 2012 18:11 | | I think Namit meant this in reference to the design option of using a single database and using scripts for quota management. In the case of views, due to views being opaque, it will be hard to see which tables are imported into the namespace. Views are not opaque. DESCRIBE FORMATTED currently includes the following information:

# View Information	 	 
View Original Text: 	SELECT value FROM src WHERE key=86	 
View Expanded Text: 	SELECT `src`.`value` FROM `src` WHERE `src`.`key`=86

Currently the metastore only tracks the original and expanded text view query, but it would be straightforward to also extract and store the list of source tables that are referenced in the query when the view is created (in fact, there’s already a JIRA ticket for this (HIVE-1073), and the information is already readily available internally as described here). If we assume that it’s possible to make this change then from a quota management standpoint I don’t think the table links makes quota management easier. Maybe the description was unclear. Table links have two types: static and dynamic. Static links behave the way you describe, but dynamic links will have addition and drop of partitions when the source table (of the link) has partitions added or removed from it. I don’t think dynamic table links satisfy the use case covered by Team 2’s access requirements for table T1. Team 2 wants to see only the most recent three partitions in table T1, and my understanding of dynamic table links is that once the link is created, Team 2 will subsequently see every new partition that is added to the source table. In order to satisfy Team 2’s requirements I think you’re going to have to manually add and drop partitions from the link using the ALTER LINK ADD/DROP PARTITION command, which doesn’t sound any easier than using partitioned views. The functionality provided by dynamic links does make sense in some contexts, but the same is true for dynamic partitioned views. Why not extend the partitioned view feature to support dynamic partitions? Table links allow for better accounting of imported data (views are opaque), single instancing of imports and partition pruning when the imports only have some of the partitions of the source table of the link. Given this, it seems ideal to introduce table links as a concept rather than overload views. I addressed the “views are opaque” argument above. I’m having trouble following the rest of the sentence. What does “single instancing of imports” mean? If possible can you provide an example in terms of table links and partitioned views?

Posted by cwsteinbach at May 31, 2012 22:27 | | Going back to my example: Suppose there are 2 teams which want to use the same physical warehouse. Team 1 wants to use the following: (let us say that each table is partitioned by date) T1 (all partitions) T2 (all partitions) T3 (all partitions) Team 2 wants to use the following: T1 (partitions for the last 3 days) T2 (partition for a fixed day: say May 29' 2012) T4 (all partitions) Using the current hive architecture, we can perform the following:

Use a single database and have scripts for quota
Use 2 databases and copy the data in both the databases (say, the databases are DB1 and DB2 respectively)
Use 2 databases, and use views in database 2 (to be used to team 2).

We have discarded the first 2 approaches above, so let us discuss how will we use approach 3 (specifically for T1). Team 2 will create the view: create view V1T1 as select * from DB1.T1 Now, whenever a partition gets added in DB1.T1, someone (a hook or something - outside hive) needs to add the corresponding partition in V1T1. That extra layer needs to make sure that the new partition in V1T1 is part of the inputs (may be necessary for auditing etc.) Hive metastore has no knowledge of this dependency (view partition -> table partition), and it is maintained in multiple places (for possibly different teams). The same argument applies when a partition gets dropped from DB1.T1. By design, there is no one-to-one dependency between a table partition and a view partition, and we do not want to create such a dependency. The view may depend on multiple tables/partitions. The views in hive are not updatable. By design, the schema of the view and the underlying table(s) can be different. Links provide the above functionality. If I understand right, you are proposing to extend views to support the above functionality. We will end up with a very specific model for a specific type of views, which are not like normal hive views. That would be more confusing, in my opinion.

Posted by namit.jain at Jun 01, 2012 14:25 | | Please comment - we haven’t gotten any updates on the wiki as well as the jira https://issues.apache.org/jira/browse/HIVE-2989

Posted by namit.jain at Jun 02, 2012 19:35 |