Apache Hive : DynamicPartitions

Dynamic Partitions

Documentation

This is the design document for dynamic partitions in Hive. Usage information is also available:

References:

Terminology

Syntax

DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.

In INSERT … SELECT … queries, the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.


   INSERT OVERWRITE TABLE T PARTITION (ds, hr) 
   SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10;


   INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr) 
   SELECT key, value, /*ds,*/ hr FROM srcpart WHERE ds is not null and hr>10;


   -- throw an exception
   INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11) 
   SELECT key, value, ds/*, hr*/ FROM srcpart WHERE ds is not null and hr=11;


   FROM S
   INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr) 
   SELECT key, value, ds, hr FROM srcpart WHERE ds is not null and hr>10
   INSERT OVERWRITE TABLE R PARTITION (ds='2010-03-03, hr=12)
   SELECT key, value, ds, hr from srcpart where ds is not null and hr = 12;


   CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS 
   SELECT key, value, ds, hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;

The above example shows the case of all DP columns in CTAS. If you want put some constant for some partitioning column, you can specify it in the select-clause. e.g,


   CREATE TABLE T (key int, value string) PARTITIONED BY (ds string, hr int) AS 
   SELECT key, value, "2010-03-03", hr+1 hr1 FROM srcpart WHERE ds is not null and hr>10;

Design

Design issues

  1. Data type of the dynamic partitioning column:

A dynamic partitioning column could be the result of an expression. For example:


 -- part\_col is partitioning column
 create table T as select a, concat("part\_", part\_col) from S where part\_col is not null; 

Although currently there is not restriction on the data type of the partitioning column, allowing non-primitive columns to be partitioning column probably doesn’t make sense. The dynamic partitioning column’s type should be derived from the expression. The data type has to be able to be converted to a string in order to be saved as a directory name in HDFS.

  1. Partitioning column value to directory name conversion:

After converting column value to string, we still need to convert the string value to a valid directory name. Some reasons are:

We need to define a UDF (say hive_qname_partition(T.part_col)) to take a primitive typed value and convert it to a qualified partition name.

  1. Due to 2), this dynamic partitioning scheme qualifies as a hash-based partitioning scheme, except that we define the hash function to be as close as

the input value. We should allow users to plugin their own UDF for the partition hash function. Will file a follow up JIRA if there is sufficient interests.

  1. If there are multiple partitioning columns, their order is significant since that translates to the directory structure in HDFS: partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2. In a DML or DDL involving partitioned table, So if a subset of partitioning columns are specified (static), we should throw an error if a dynamic partitioning column is lower. Example:

 create table nzhang\_part(a string) partitioned by (ds string, dept int);
 insert overwrite nzhang\_part (dept=1) select a, ds, dept from T where dept=1 and ds is not null;