Apache Hive : DeveloperGuide

Developer Guide

Code Organization and a Brief Architecture

Introduction

Hive has 3 main components: Serializers/Deserializers (trunk/serde), the MetaStore (trunk/metastore), and the Query Processor (trunk/ql); each is described in more detail below.

Apart from these major components, Hive also contains a number of other components. These are as follows:

The following top level directories contain helper libraries, packaged configuration files, etc.:

Hive SerDe

What is a SerDe?

SerDe is a short name for “Serializer and Deserializer.” Hive uses the SerDe (together with the FileFormat) to read and write table rows:

HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

Note that the “key” part is ignored when reading, and is always a constant when writing. Basically the row object is stored in the “value”.

One principle of Hive is that Hive does not own the HDFS file format. Users should be able to directly read the HDFS files in Hive tables using other tools, or use other tools to directly write HDFS files that can be loaded into Hive through “CREATE EXTERNAL TABLE” or through “LOAD DATA INPATH,” which just moves the files into Hive’s table directory.

Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

Hive currently uses these FileFormat classes to read and write HDFS files:

Hive currently uses these SerDe classes to serialize and deserialize data:

For example, LazySimpleSerDe supports reading and writing data with a specified character encoding through the serialization.encoding table property:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');

LazySimpleSerDe can treat 'T', 't', 'F', 'f', '1', and '0' as extended, legal boolean literals if the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later). The default is false, which means only 'TRUE' and 'FALSE' are treated as legal boolean literals.


See SerDe for detailed information about input and output processing. Also see Storage Formats in the HCatalog manual, including CTAS Issue with JSON SerDe. For information about how to create a table with a custom or native SerDe, see Row Format, Storage Format, and SerDe.

How to Write Your Own SerDe

Some important points about SerDe:
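
For illustration, below is a minimal sketch of a custom SerDe built on the serde2 interfaces. The package and class names are hypothetical and the serializer side is only stubbed out; it exposes each input record as a single string column named "line", whereas a real SerDe would parse the table properties passed to initialize() to build its column list and ObjectInspector.

    package com.example.hive.serde;  // hypothetical package

    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.AbstractSerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.SerDeStats;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class SingleColumnTextSerDe extends AbstractSerDe {

      private ObjectInspector inspector;
      private final List<Object> row = Arrays.asList(new Object[1]);

      @Override
      public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // One string column; a real SerDe would read "columns"/"columns.types" from tbl.
        inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
            Arrays.asList("line"),
            Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector));
      }

      @Override
      public ObjectInspector getObjectInspector() {
        return inspector;
      }

      @Override
      public Object deserialize(Writable blob) throws SerDeException {
        // blob is the "value" part of the <key, value> pair produced by the InputFormat.
        row.set(0, blob.toString());
        return row;
      }

      @Override
      public Class<? extends Writable> getSerializedClass() {
        return Text.class;
      }

      @Override
      public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        // A complete implementation would walk objInspector to turn the row object
        // back into its on-disk representation.
        return new Text(String.valueOf(obj));
      }

      @Override
      public SerDeStats getSerDeStats() {
        return null;  // statistics are optional
      }
    }

Once compiled into a jar, a SerDe like this can be used from the CLI with ADD JAR followed by ROW FORMAT SERDE 'com.example.hive.serde.SingleColumnTextSerDe' in a CREATE TABLE statement.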

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.

ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including an instance of a Java class (Thrift or native Java), a standard Java object (Hive uses java.util.List to represent Struct and Array, and java.util.Map to represent Map), and a lazily-initialized object (for example, a Struct of string fields stored in a single Java string with a starting offset for each field).

A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
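
As an example, the following sketch (a hypothetical helper, assuming the field being read is a string column) shows how a StructObjectInspector is used to pull a field value out of a row object without knowing the row's concrete Java class:

    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

    public class ObjectInspectorExample {

      /**
       * Reads the String value of a named field from a row object. The row may be a
       * Thrift object, a java.util.List, or a lazy object; the ObjectInspector hides
       * that difference from the caller.
       */
      static String readStringField(Object row, StructObjectInspector rowOI, String fieldName) {
        StructField field = rowOI.getStructFieldRef(fieldName);
        Object fieldData = rowOI.getStructFieldData(row, field);
        ObjectInspector fieldOI = field.getFieldObjectInspector();
        // Assumes the field is a string column for the purpose of this sketch.
        return ((StringObjectInspector) fieldOI).getPrimitiveJavaObject(fieldData);
      }
    }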

NOTE: Apache Hive recommends that custom ObjectInspectors created for use with custom SerDes have a no-argument constructor in addition to their normal constructors for serialization purposes. See HIVE-5380 for more details.

Registration of Native SerDes

As of Hive 0.14, a registration mechanism has been introduced for native Hive SerDes. This allows a “STORED AS” keyword to be used in CREATE TABLE statements in place of a triplet of {SerDe, InputFormat, OutputFormat} specifications, with the binding resolved dynamically.

The following mappings have been added through this registration mechanism:

STORED AS AVRO / STORED AS AVROFILE
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'

STORED AS ORC / STORED AS ORCFILE
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

STORED AS PARQUET / STORED AS PARQUETFILE
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

STORED AS RCFILE
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'

STORED AS SEQUENCEFILE
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'

STORED AS TEXTFILE
    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'

To add a new native SerDe with a STORED AS keyword, follow these steps:

  1. Create a storage format descriptor class extending from AbstractStorageFormatDescriptor.java that returns a “stored as” keyword and the names of InputFormat, OutputFormat, and SerDe classes.
  2. Add the name of the storage format descriptor class to the StorageFormatDescriptor registration file.
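
As a sketch, a descriptor for a hypothetical MYFORMAT keyword might look like the following; all format, class, and package names here are made up for illustration:

    package com.example.hive;  // hypothetical package

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.hive.ql.io.AbstractStorageFormatDescriptor;

    public class MyFormatStorageFormatDescriptor extends AbstractStorageFormatDescriptor {

      @Override
      public Set<String> getNames() {
        // The "STORED AS" keywords that map to this storage format.
        return new HashSet<String>(Arrays.asList("MYFORMAT", "MYFORMATFILE"));
      }

      @Override
      public String getInputFormat() {
        return "com.example.hive.MyFormatInputFormat";   // hypothetical class
      }

      @Override
      public String getOutputFormat() {
        return "com.example.hive.MyFormatOutputFormat";  // hypothetical class
      }

      @Override
      public String getSerde() {
        return "com.example.hive.MyFormatSerDe";         // hypothetical class
      }
    }

With the descriptor's class name added to the registration file, a statement such as CREATE TABLE t (...) STORED AS MYFORMAT would then expand to the SerDe, InputFormat, and OutputFormat classes it returns.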

MetaStore

The MetaStore contains metadata regarding tables, partitions, and databases. This is used by the Query Processor during plan generation.
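
This metadata can also be read programmatically through the metastore client API. The following is a minimal sketch, assuming a hive-site.xml on the classpath that points at a reachable metastore:

    import java.util.List;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetaStoreExample {
      public static void main(String[] args) throws Exception {
        // HiveConf picks up hive-site.xml (e.g. hive.metastore.uris) from the classpath.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
          List<String> tables = client.getAllTables("default");
          System.out.println("Tables in 'default': " + tables);
          if (!tables.isEmpty()) {
            Table t = client.getTable("default", tables.get(0));
            System.out.println(t.getTableName() + " is stored at " + t.getSd().getLocation());
          }
        } finally {
          client.close();
        }
      }
    }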

Query Processor

The following are the main components of the Hive Query Processor:

Compiler

Parser

TypeChecking

Semantic Analysis

Plan generation

Task generation

Execution Engine

Plan

Operators

UDFs and UDAFs

A helpful overview of the Hive query processor can be found in this Hive Anatomy slide deck.

Compiling and Running Hive

Ant to Maven

As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date.

See the Hive Developer FAQ for updated instructions.

Hive can be made to compile against different versions of Hadoop.

Default Mode

From the root of the source tree:

ant package

will make Hive compile against Hadoop version 0.19.0. Note that:

Advanced Mode

To build with a custom distribution directory:

ant -Dtarget.dir=<my-install-dir> package

To compile against a specific version of Hadoop:

ant -Dhadoop.version=0.17.1 package

To compile against a custom version of the Hadoop tree (for example, a tree with your own patches applied):

ant -Dhadoop.root=~/src/hadoop-19/build/hadoop-0.19.2-dev -Dhadoop.version=0.19.2-dev

Note that in this particular example, ~/src/hadoop-19 is a checkout of the Hadoop 19 branch that uses 0.19.2-dev as its default version and creates a distribution directory in build/hadoop-0.19.2-dev by default.

Run Hive from the command line with ‘$HIVE_HOME/bin/hive’, where $HIVE_HOME is typically build/dist under your Hive repository top-level directory.

$ build/dist/bin/hive

If Hive fails at runtime, try ‘ant very-clean package’ to delete the Ivy cache before rebuilding.

Running Hive Without a Hadoop Cluster

From Thejas:

export HIVE_OPTS='--hiveconf mapred.job.tracker=local --hiveconf fs.default.name=file:///tmp \
    --hiveconf hive.metastore.warehouse.dir=file:///tmp/warehouse \
    --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/tmp/metastore_db;create=true'

Then you can run ‘build/dist/bin/hive’ and it will work against your local file system.

Unit tests and debugging

Layout of the unit tests

Hive uses JUnit for unit tests. Each of the 3 main components of Hive has its unit test implementations in the corresponding src/test directory, e.g., trunk/metastore/src/test has all the unit tests for the metastore, trunk/serde/src/test has all the unit tests for the serde, and trunk/ql/src/test has all the unit tests for the query processor. The metastore and serde unit tests provide TestCase implementations for JUnit. The query processor tests, on the other hand, are generated using Velocity. The main directories under trunk/ql/src/test that contain these tests and the corresponding results are as follows:

Running unit tests

Ant to Maven

As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date.

See the Hive Developer FAQ for updated instructions.

Run all tests:

ant package test

Run all positive test queries:

ant test -Dtestcase=TestCliDriver

Run a specific positive test query:

ant test -Dtestcase=TestCliDriver -Dqfile=groupby1.q

The above test produces the following files:

Run the set of unit tests matching a regex, e.g. partition_wise_fileformat tests 10-16:

ant test -Dtestcase=TestCliDriver -Dqfile_regex=partition_wise_fileformat1[0-6]

Note that this option matches against the basename of the test without the .q suffix.

Apparently the Hive tests do not run successfully after a clean unless you run ant package first. Not sure why build.xml doesn’t encode this dependency.

Adding new unit tests

Ant to Maven

As of version 0.13 Hive uses Maven instead of Ant for its build. The following instructions are not up to date.

See the Hive Developer FAQ for updated instructions. See also Tips for Adding New Tests in Hive and How to Contribute: Add a Unit Test.

First, write a new myname.q in ql/src/test/queries/clientpositive.

Then, run the test with the query and overwrite the result (useful when you add a new test).

ant test -Dtestcase=TestCliDriver -Dqfile=myname.q -Doverwrite=true

Then we can create a patch by:

svn add ql/src/test/queries/clientpositive/myname.q ql/src/test/results/clientpositive/myname.q.out
svn diff > patch.txt

Similarly, to add negative client tests, write a new query input file in ql/src/test/queries/clientnegative and run the same command, this time specifying the testcase name as TestNegativeCliDriver instead of TestCliDriver. Note that for negative client tests, the output file created with the overwrite flag can be found in the directory ql/src/test/results/clientnegative.

Debugging Hive Code

Hive code includes both client-side code (e.g., compiler, semantic analyzer, and optimizer of HiveQL) and server-side code (e.g., operator/task/SerDe implementations). Debugging is different for client-side and server-side code, as described below.

Debugging Client-Side Code

The client-side code runs on your local machine so you can easily debug it using Eclipse the same way you debug any regular local Java code. Here are the steps to debug code within a unit test.

Debugging Server-Side Code

The server-side code is distributed and runs on the Hadoop cluster, so debugging server-side Hive code is a little bit complicated. In addition to printing to log files using log4j, you can also attach a debugger to a different JVM under unit test (single machine mode). Below are the steps for debugging server-side code.

First, compile Hive with debug information enabled. From the Hive checkout directory:

    > ant -Djavac.debug=on package

If you have already built Hive without javac.debug=on, you can clean the build and then run the above command.

    > ant clean  # not necessary if the first time to compile
    > ant -Djavac.debug=on package

Next, set the debug parameters before running the unit test:

    > export HIVE_DEBUG_PORT=8000
    > export HIVE_DEBUG="-Xdebug -Xrunjdwp:transport=dt_socket,address=${HIVE_DEBUG_PORT},server=y,suspend=y"

In particular HIVE_DEBUG_PORT is the port number that the JVM is listening on and the debugger will attach to. Then run the unit test as follows:

    > export HADOOP_OPTS=$HIVE_DEBUG
    > ant test -Dtestcase=TestCliDriver -Dqfile=<mytest>.q

The unit test will run until it shows:

     [junit] Listening for transport dt_socket at address: 8000

Now you can attach a debugger to port 8000, for example with jdb:

    > jdb -attach 8000

Alternatively, if you are running Eclipse and the Hive projects are already imported, you can debug with Eclipse. Under Eclipse Run -> Debug Configurations, find “Remote Java Application” at the bottom of the left panel. There should be a MapRedTask configuration already. If there is no such configuration, you can create one with the following properties:

+ Name: any task such as MapRedTask
+ Project: the Hive project that you imported.
+ Connection Type: Standard (Socket Attach)
+ Connection Properties:
	- Host: localhost
	- Port: 8000

Then hit the "Debug" button and Eclipse will attach to the JVM listening on port 8000 and continue running till the end. If you define breakpoints in the source code before hitting the "Debug" button, it will stop there. The rest is the same as debugging client-side Hive.

Debugging without Ant (Client and Server Side)

There is another way of debugging Hive code without going through Ant.
You need to install Hadoop and point the environment variable HADOOP_HOME to your Hadoop installation directory.

    > export HADOOP_HOME=<your hadoop home>

Then, start Hive:

    >  ./build/dist/bin/hive --debug

It will then act similarly to the debugging steps outlined in Debugging Hive Code. It is faster since there is no need to compile Hive code and go through Ant. It can be used to debug both client-side and server-side Hive.

If you want to debug a particular query, start Hive and perform the steps needed before that query. Then start Hive again in debug mode to debug that query.

    >  ./build/dist/bin/hive
    >  perform steps before the query

    >  ./build/dist/bin/hive --debug
    >  run the query

Note that the local file system will be used, so the space on your machine will not be released automatically (unlike debugging via Ant, where the tables created in the test are automatically dropped at the end of the test). Make sure to either drop the tables explicitly, or drop the data from /user/hive/warehouse.

Pluggable interfaces

File Formats

Please refer to Hive User Group Meeting August 2009 Page 59-63.

SerDe - how to add a new SerDe

Please refer to Hive User Group Meeting August 2009 Page 64-70.

Map-Reduce Scripts

Please refer to Hive User Group Meeting August 2009 Page 71-73.

UDFs and UDAFs - how to add new UDFs and UDAFs

Please refer to Hive User Group Meeting August 2009 Page 74-87.
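
As a quick illustration (not taken from the slides above), a simple UDF extends org.apache.hadoop.hive.ql.exec.UDF and defines one or more evaluate() methods; the package, class, and function names below are hypothetical:

    package com.example.hive.udf;  // hypothetical package

    import org.apache.hadoop.hive.ql.exec.Description;
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    @Description(name = "my_lower", value = "_FUNC_(str) - returns str in lower case")
    public class MyLower extends UDF {
      public Text evaluate(Text input) {
        if (input == null) {
          return null;  // stay null-safe: NULL in, NULL out
        }
        return new Text(input.toString().toLowerCase());
      }
    }

Such a UDF can then be registered from the Hive CLI with ADD JAR followed by CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower'. For complex types or better control over argument handling, a GenericUDF can be written instead.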