User Manual
Capture Lineage Information In Hive Hooks
Jul 29, 2025
Background
In Hive, lineage information is captured in the form of LineageInfo object. This object is created in the SemanticAnalyzer and is passed to the HookContext object. Users can use the following existing Hooks or implement their own custom hooks to capture this information and utilize it.
Existing Hooks
- org.apache.hadoop.hive.ql.hooks.PostExecutePrinter
- org.apache.hadoop.hive.ql.hooks.LineageLogger
- org.apache.atlas.hive.hook.HiveHook
To facilitate the capture of lineage information in a custom hook or in a use case where the existing hooks are not set in hive.exec.post.hooks, a new configuration hive.lineage.hook.info.enabled was introduced in HIVE-24051. This configuration is set to false by default.
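For example, a minimal sketch of enabling this for the current session (the property can also be set in hive-site.xml):
-- allow lineage information to be captured even when no lineage-aware post-execution hook is configured
SET hive.lineage.hook.info.enabled=true;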
Apache Hive : Accumulo Integration
Dec 12, 2024
Overview
Apache Accumulo is a sorted, distributed key-value store based on the Google BigTable paper. The API methods that Accumulo provides are in terms of Keys and Values which present the highest level of flexibility in reading and writing data; however, higher-level query abstractions are typically an exercise left to the user. Leveraging Apache Hive as a SQL interface to Accumulo complements its existing high-throughput batch access and low-latency random lookups.
Apache Hive : AuthDev
Dec 12, 2024
This is the design document for the original Hive authorization mode. See Authorization for an overview of authorization modes, which include storage based authorization and SQL standards based authorization.
1. Privilege
1.1 Access Privilege
Admin privilege, DB privilege, Table level privilege, column level privilege
1.1.1 Admin privileges are global privileges, and are used to perform administration.
1.1.2 DB privileges are database specific, and apply to all objects inside that database.
Apache Hive : AvroSerDe
Dec 12, 2024
Availability
Earliest version AvroSerde is available
The AvroSerde is available in Hive 0.9.1 and greater.
Overview – Working with Avro from Hive
The AvroSerde allows users to read or write Avro data as Hive tables. The AvroSerde’s key features:
- Infers the schema of the Hive table from the Avro schema. Starting in Hive 0.14, the Avro schema can be inferred from the Hive table schema.
- Reads all Avro files within a table against a specified schema, taking advantage of Avro’s backwards compatibility abilities
- Supports arbitrarily nested schemas.
- Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don’t exist in Hive and are automatically converted by the AvroSerde.
- Understands compressed Avro files.
- Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.
- Writes any Hive table to Avro files.
- Has worked reliably against our most convoluted Avro schemas in our ETL process.
- Starting in Hive 0.14, columns can be added to an Avro backed Hive table using the Alter Table statement.
For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.
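As an illustration of the schema inference and ALTER TABLE support mentioned above, a minimal sketch (table and column names are made up for the example):
-- Hive 0.14+: the Avro schema is inferred from the Hive table definition
CREATE TABLE episodes (title STRING, air_date STRING, doctor INT) STORED AS AVRO;
-- Hive 0.14+: columns can be added later with ALTER TABLE
ALTER TABLE episodes ADD COLUMNS (season INT);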
Apache Hive : CompressedStorage
Dec 12, 2024
Compressed Data Storage
Keeping data compressed in Hive tables has, in some cases, been known to give better results than uncompressed storage, both in terms of disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
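A minimal sketch (the table layout and file path are illustrative):
CREATE TABLE raw (line STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;
-- the .gz file is stored as-is and decompressed on the fly at query time
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;
SELECT COUNT(*) FROM raw;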
Apache Hive : Configuration Properties
Dec 12, 2024
This document describes the Hive user configuration properties (sometimes called parameters, variables, or options), and notes which releases introduced new properties.
The canonical list of configuration properties is managed in the HiveConf Java class, so refer to the HiveConf.java file for a complete list of configuration properties available in your Hive release.
For information about how to use these configuration properties, see Configuring Hive. That document also describes administrative configuration properties for setting up Hive in the Configuration Variables section. Hive Metastore Administration describes additional configuration properties for the metastore.
Apache Hive : Cost-based optimization in Hive
Dec 12, 2024
Abstract
Apache Hadoop is a framework for the distributed processing of large data sets using clusters of computers typically composed of commodity hardware. Over the last few years, Apache Hadoop has become the de facto platform for distributed data processing on commodity hardware. Apache Hive is a popular SQL interface for data processing using Apache Hadoop.
A user-submitted SQL query is converted by Hive into a physical operator tree, which is optimized and converted into Tez jobs and then executed on the Hadoop cluster. Distributed SQL query processing in Hadoop differs from conventional relational query engines when it comes to the handling of intermediate result sets. Hive query processing often requires sorting and reassembling of intermediate result sets; this is called shuffling in Hadoop parlance.
Apache Hive : CSV Serde
Dec 12, 2024
Availability
Earliest version CSVSerde is available
The CSVSerde is available in Hive 0.14 and greater.
Background
The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
Limitation
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output will show string column types, since the type information is retrieved from the SerDe. To convert columns to the desired type, you can create a view over the table that CASTs them to the desired type.
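For instance, a sketch of the view approach (assuming an OpenCSVSerde-backed table; table and column names are illustrative):
CREATE TABLE sales_csv (id STRING, amount STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS TEXTFILE;
-- expose properly typed columns through a view that casts the string columns
CREATE VIEW sales AS
  SELECT CAST(id AS INT) AS id, CAST(amount AS DOUBLE) AS amount
  FROM sales_csv;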
Apache Hive : Data Connector for Hive and Hive-like engines
What is a Data connector?
Data connectors (referred to as “connector” in Hive Query Language) are top-level objects in Hive where users can define a set of properties required to connect to an external datasource from Hive. This document illustrates how the data connector framework can be used to do SQL query federation between two distinct Hive clusters/installations, or between Hive and another Hive-like compute engine (e.g. EMR).
Apache Hive : Data Connectors in Hive
Dec 12, 2024
What is a Data connector?
Data connectors (referred to as “connector” in Hive Query Language) are top-level objects in Hive where users can define a set of properties required to connect to a datasource from Hive. A connector has a type (a closed, enumerated set) that allows Hive to determine the driver class (for JDBC) and other URL parameters, a URL, and a set of properties that can include the default credentials for the remote datasource. Once defined, users can use the same connector object to map multiple databases from the remote datasource into the local Hive metastore.
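As an illustration, a hedged sketch of defining a connector and mapping a remote database through it (the host, credentials, and database names are placeholders; exact syntax may vary by release):
CREATE CONNECTOR mysql_conn
  TYPE 'mysql'
  URL 'jdbc:mysql://localhost:3306'
  WITH DCPROPERTIES ("hive.sql.dbcp.username"="hive", "hive.sql.dbcp.password"="hive");
-- map a database from the remote datasource into the local metastore
CREATE REMOTE DATABASE sales_db USING mysql_conn WITH DBPROPERTIES ("connector.remoteDbName"="sales");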
Apache Hive : Druid Integration
Dec 12, 2024
This page documents the work done for the integration between Druid and Hive introduced in Hive 2.2.0 (HIVE-14217). Initially it was compatible with Druid 0.9.1.1, the latest stable release of Druid at that time.
Objectives
Our main goal is to be able to index data from Hive into Druid, and to be able to query Druid datasources from Hive. Completing this work will bring benefits to the Druid and Hive systems alike:
Apache Hive : FileFormats
Dec 12, 2024
File Formats and Compression
File Formats
Hive supports several file formats:
- Text File
- SequenceFile
- RCFile
- Avro Files
- ORC Files
- Parquet
- Custom INPUTFORMAT and OUTPUTFORMAT
The hive.default.fileformat configuration parameter determines the format to use if it is not specified in a CREATE TABLE or ALTER TABLE statement. Text file is the parameter’s default value.
For more information, see the sections Storage Formats and Row Formats & SerDe on the DDL page.
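For example, a short sketch of overriding the default versus specifying the format explicitly (table names are illustrative):
-- make ORC the default format for tables created in this session
SET hive.default.fileformat=ORC;
CREATE TABLE events_orc (id INT, name STRING);
-- or state the format explicitly, regardless of the default
CREATE TABLE events_parquet (id INT, name STRING) STORED AS PARQUET;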
File Compression
Apache Hive : HBase Integration
Dec 12, 2024
This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.
A presentation is available from the HBase HUG10 Meetup
This feature is a work in progress, and suggestions for its improvement are very welcome.
Apache Hive : Hive Aws EMR
Dec 12, 2024
Amazon Elastic MapReduce and Hive
Amazon Elastic MapReduce is a web service that makes it easy to launch managed, resizable Hadoop clusters on the web-scale infrastructure of Amazon Web Services (AWS). Elastic Map Reduce makes it easy for you to launch a Hive and Hadoop cluster, provides you with flexibility to choose different cluster sizes, and allows you to tear them down automatically when processing has completed. You pay only for the resources that you use with no minimums or long-term commitments.
Apache Hive : Hive Configurations
Dec 12, 2024
Hive has more than 1600 configuration properties across its services. The hive-site.xml file contains the default configuration for the service, and you can change configuration values in this file. Every configuration change requires a restart of the affected service(s).
Here you can find the most important configurations and default values.
| Config Name | Default Value | Description | Config file |
|---|---|---|---|
| hive.metastore.client.cache.v2.enabled | true | This property enables a Caffeine cache for the Metastore client | MetastoreConf |
More configs can be found in the MetastoreConf.java file.
Apache Hive : Hive deprecated authorization mode / Legacy Mode
This document describes Hive security using the basic authorization scheme, which regulates access to Hive metadata on the client side. This was the default authorization mode used when authorization was enabled. The default was changed to SQL Standard authorization in Hive 2.0 (HIVE-12429).
Disclaimer
Hive authorization is not completely secure. The basic authorization scheme is intended primarily to prevent good users from accidentally doing bad things, but makes no promises about preventing malicious users from doing malicious things. See the Hive authorization main page for the secure options.
Apache Hive : Hive HPL/SQL
Dec 12, 2024
Hive Hybrid Procedural SQL On Hadoop (HPL/SQL) is a tool that implements procedural SQL for Hive. It is available in Hive 2.0.0 (HIVE-11055).
HPL/SQL is an open source tool (Apache License 2.0) that implements procedural SQL language for Apache Hive, SparkSQL, Impala as well as any other SQL-on-Hadoop implementation, any NoSQL and any RDBMS.
HPL/SQL is a hybrid and heterogeneous language that understands syntaxes and semantics of almost any existing procedural SQL dialect, and you can use it with any database, for example, running existing Oracle PL/SQL code on Apache Hive and Microsoft SQL Server, or running Transact-SQL on Oracle, Cloudera Impala or Amazon Redshift.
Apache Hive : Hive Metrics
Dec 12, 2024
The metrics that Hive collects can be viewed in the HiveServer2 Web UI by using the “Metrics Dump” tab.
The metrics dump displays any metric available over JMX, encoded in JSON.

Alternatively, the metrics can be written directly to HDFS, to a JSON file on the local file system where the HS2 instance is running, or to the console by enabling the corresponding metric reporters. By default, only the JMX and JSON file reporters are enabled.
Apache Hive : Hive on Spark
Dec 12, 2024
1. Introduction
We propose modifying Hive to add Spark as a third execution backend (HIVE-7292), parallel to MapReduce and Tez.
Spark is an open-source data analytics cluster computing framework that’s built outside of Hadoop’s two-stage MapReduce paradigm but on top of HDFS. Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. By applying a series of transformations such as groupBy and filter, or actions such as count and save that are provided by Spark, RDDs can be processed and analyzed to accomplish what MapReduce jobs can do without having intermediate stages.
Apache Hive : Hive Transactions
Dec 12, 2024
Apache Hive : ACID Transactions
Upgrade to Hive 3+
Any transactional tables created by a Hive version prior to Hive 3 require Major Compaction to be run on every partition before upgrading to 3.0. More precisely, any partition that has had any update/delete/merge statements executed on it since the last Major Compaction has to undergo another Major Compaction. No further update/delete/merge may happen on such a partition until after Hive is upgraded to Hive 3.
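As an illustration, a minimal sketch of requesting a major compaction on one such partition before the upgrade (the table name and partition spec are placeholders):
ALTER TABLE acid_tbl PARTITION (ds='2016-01-01') COMPACT 'major';
-- compaction progress can be checked with:
SHOW COMPACTIONS;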
Apache Hive : Hive Transactions (Hive ACID)
Dec 12, 2024
What is ACID and why should you use it?
ACID stands for four traits of database transactions: Atomicity (an operation either succeeds completely or fails, it does not leave partial data), Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation), Isolation (an incomplete operation by one user does not cause unexpected side effects for other users), and Durability (once an operation is complete it will be preserved even in the face of machine or system failure). These traits have long been expected of database systems as part of their transaction functionality.
Apache Hive : Hive-Iceberg Integration
Dec 12, 2024
Starting from version 4.0, Apache Hive supports the Iceberg table format out of the box. Iceberg tables can be created like regular Hive external or ACID tables, without adding any extra jars.
Creating an Iceberg Table
An Iceberg table can be created by adding the STORED BY ICEBERG clause to a CREATE TABLE statement.
- Creating an Iceberg table using normal create command
CREATE TABLE TBL_ICE (ID INT) STORED BY ICEBERG;
The above creates an Iceberg table named ‘TBL_ICE’.
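- Creating a partitioned Iceberg table (a minimal sketch; the table and column names are illustrative)
CREATE TABLE TBL_ICE_PART (ID INT, NAME STRING) PARTITIONED BY (DEPT STRING) STORED BY ICEBERG;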
Apache Hive : HiveAws HivingS3nRemotely
Dec 12, 2024
Querying S3 files from your PC (using EC2, Hive and Hadoop)
Usage Scenario
The scenario being covered here goes as follows:
- A user has data stored in S3 - for example Apache log files archived in the cloud, or databases backed up into S3.
- The user would like to declare tables over the data sets here and issue SQL queries against them
- These SQL queries should be executed using computed resources provisioned from EC2. Ideally, the compute resources can be provisioned in proportion to the compute costs of the queries
- Results from such queries that need to be retained for the long term can be stored back in S3
This tutorial walks through the steps required to accomplish this. Please send email to the hive-users mailing list in case of any problems with this tutorial.
Apache Hive : HiveClient
Dec 12, 2024
This page describes the different clients supported by Hive. The command line client currently only supports an embedded server. The JDBC and Thrift-Java clients support both embedded and standalone servers. Clients in other languages only support standalone servers.
For details about the standalone server see Hive Server or HiveServer2.
Command Line
Operates in embedded mode only, that is, it needs to have access to the Hive libraries. For more details see Getting Started and Hive CLI.
Apache Hive : HiveCounters
Dec 12, 2024
Task counters created by Hive during query execution
For Tez execution, %context is set to the mapper/reducer name. For other execution engines it is not included in the counter name.
| Counter Name | Description |
|---|---|
| RECORDS_IN[_%context] | Input records read |
| RECORDS_OUT[_%context] | Output records written |
| RECORDS_OUT_INTERMEDIATE[_%context] | Records written as intermediate records to ReduceSink (which become input records to other tasks) |
| CREATED_FILES | Number of files created |
| DESERIALIZE_ERRORS | Deserialization errors encountered while reading data |
Apache Hive : HiveServer2 Clients
Dec 12, 2024
This page describes the different clients supported by HiveServer2.
Version information
Introduced in Hive version 0.11. See HIVE-2935.
Beeline – Command Line Shell
HiveServer2 supports Beeline, a command shell that works with HiveServer2. It’s a JDBC client that is based on the SQLLine CLI (http://sqlline.sourceforge.net/). SQLLine’s detailed documentation is applicable to Beeline as well.
Replacing the Implementation of Hive CLI Using Beeline
The Beeline shell works in both embedded mode as well as remote mode. In the embedded mode, it runs an embedded Hive (similar to Hive CLI) whereas remote mode is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is used with HiveServer2, it also prints the log messages from HiveServer2 for queries it executes to STDERR. Remote HiveServer2 mode is recommended for production use, as it is more secure and doesn’t require direct HDFS/metastore access to be granted for users.
Apache Hive : HiveServer2 Overview
Dec 12, 2024
Introduction
HiveServer2 (HS2) is a service that enables clients to execute queries against Hive. HiveServer2 is the successor to HiveServer1 which has been deprecated. HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
HS2 is a single process running as a composite service, which includes the Thrift-based Hive service (TCP or HTTP) and a Jetty web server for web UI.
Apache Hive : JDBC Storage Handler
Dec 12, 2024
Syntax
JdbcStorageHandler supports reading from a JDBC data source in Hive. Currently, writing to a JDBC data source is not supported. To use JdbcStorageHandler, you need to create an external table using JdbcStorageHandler. Here is a simple example:
CREATE EXTERNAL TABLE student_jdbc
(
name string,
age int,
gpa double
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
"hive.sql.database.type" = "MYSQL",
"hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
"hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
"hive.sql.dbcp.username" = "hive",
"hive.sql.dbcp.password" = "hive",
"hive.sql.table" = "STUDENT",
"hive.sql.dbcp.maxActive" = "1"
);
You can also alter table properties of the JDBC external table using an ALTER TABLE statement, just like any other non-native Hive table.
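For instance, a minimal sketch against the table created above (the new connection-pool size is an arbitrary illustrative value):
ALTER TABLE student_jdbc SET TBLPROPERTIES ("hive.sql.dbcp.maxActive" = "2");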
Apache Hive : Kudu Integration
Dec 12, 2024
Overview
Apache Kudu is an open source data storage engine that makes fast analytics on fast and changing data easy.
Implementation
The initial implementation was added to Hive 4.0 in HIVE-12971 and is designed to work with Kudu 1.2+.
There are two main components which make up the implementation: the KuduStorageHandler and the KuduPredicateHandler. The KuduStorageHandler is a Hive StorageHandler implementation. The primary roles of this class are to manage the mapping of a Hive table to a Kudu table and to configure Hive queries. The KuduPredicateHandler is used to push down filter operations to Kudu for more efficient IO.
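For illustration, a hedged sketch of mapping a Hive table onto an existing Kudu table through the KuduStorageHandler (the Kudu master address and table name are placeholders; exact property names may vary by release):
CREATE EXTERNAL TABLE kudu_users
STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
TBLPROPERTIES (
  "kudu.table_name" = "default.users",
  "kudu.master_addresses" = "kudu-master-1:7051"
);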
Apache Hive : Materialized views in Hive
Dec 12, 2024
Objectives
Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the pre-computation of relevant summaries or materialized views.
The initial implementation focuses on introducing materialized views and automatic query rewriting based on those materializations in the project. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit exciting new Hive features such as LLAP acceleration. The optimizer then relies on Apache Calcite to automatically produce full and partial rewritings for a large set of query expressions comprising projections, filters, joins, and aggregation operations.
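A brief sketch of defining a materialized view that the optimizer can use for automatic rewriting (the table, column, and view names are illustrative):
CREATE MATERIALIZED VIEW mv_sales_by_day AS
SELECT ds, SUM(amount) AS total
FROM sales
GROUP BY ds;
-- rewriting can be toggled per materialized view
ALTER MATERIALIZED VIEW mv_sales_by_day ENABLE REWRITE;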
Apache Hive : MultiDelimitSerDe
Dec 12, 2024
Introduction
Introduced in HIVE-5871, MultiDelimitSerDe allows users to specify a multiple-character string as the field delimiter when creating a table.
Version
Hive 0.14.0 and later.
Hive QL Syntax
You can use MultiDelimitSerDe in a create table statement like this:
CREATE TABLE test (
id string,
hivearray array<binary>,
hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="[,]","collection.delim"=":","mapkey.delim"="@");
where field.delim is the field delimiter, and collection.delim and mapkey.delim are the delimiters for collection items and key-value pairs, respectively.
Apache Hive : Parquet
Dec 12, 2024
Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.
Introduction
Parquet (http://parquet.io/) is an ecosystem-wide columnar format for Hadoop. Read Dremel made simple with Parquet for a good introduction to the format, while the Parquet project has an in-depth description of the format including motivations and diagrams. At the time of this writing Parquet supports the following engines and data description languages:
Apache Hive : Permission Inheritance in Hive
Dec 12, 2024
This document describes how attributes (permission, group, extended ACLs) of files representing Hive data are determined.
HDFS Background
- When a file or directory is created, its owner is the user identity of the client process, and its group is inherited from its parent (the BSD rule). Permissions are taken from the default umask. Extended ACLs are taken from the parent unless they are set explicitly.
Goals
To reduce the need to set fine-grained file security properties after every operation, users may want Hive warehouse files/directories to auto-inherit security properties from their parent directories:
Apache Hive : Query ReExecution
Dec 12, 2024
Query re-execution provides a facility to re-run a query multiple times in case an unfortunate event happens. It was introduced in Hive 3.0 (HIVE-17626).
ReExecution strategies
Overlay
Enables changing the Hive settings for all re-executions that will happen. It works by adding a configuration subtree (reexec.overlay.*) as an overlay on top of the actual Hive settings.
Example
set zzz=1;
set reexec.overlay.zzz=2;
set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay;
create table t(a int);
insert into t values (1);
select assert_true(${hiveconf:zzz} > a) from t group by a;
Every hive setting which has a prefix of “reexec.overlay” will be set for all reexecutions.
Apache Hive : RCFile
Dec 12, 2024
RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0.
RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part.
Apache Hive : RCFileCat
Dec 12, 2024
$HIVE_HOME/bin/hive --rcfilecat is a shell utility which can be used to print data or metadata from RC files.
Data
Prints out the rows stored in an RCFile; columns are tab-separated and rows are newline-separated.
Usage:
hive --rcfilecat [--start=start_offset] [--length=len] [--verbose] fileName
--start=start_offset Start offset to begin reading in the file
--length=len Length of data to read from the file
--verbose Prints periodic stats about the data read,
how many records, how many bytes, scan rate
Metadata
New in 0.11.0
Apache Hive : Rebalance compaction
Dec 12, 2024
In order to improve performance, Hive under the hood creates bucket files even for non-explicitly bucketed tables. Depending on the usage, the data loaded into these non-explicitly bucketed full-ACID ORC tables may end up unevenly distributed, where some of the buckets are much larger (> 100 times) than the others. Unbalanced tables carry a performance penalty, as larger buckets take more time to read. Rebalance compaction addresses this issue by equally redistributing the data among the implicit bucket files.
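A hedged sketch of manually requesting a rebalance compaction (the table name is illustrative; exact syntax may vary by release):
ALTER TABLE orders_acid COMPACT 'rebalance';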
Apache Hive : Replacing the Implementation of Hive CLI Using Beeline
Why Replace the Existing Hive CLI?
Hive CLI is a legacy tool which had two main use cases. The first is that it served as a thick client for SQL on Hadoop and the second is that it served as a command line tool for Hive Server (the original Hive server, now often referred to as “HiveServer1”). Hive Server has been deprecated and removed from the Hive code base as of Hive 1.0.0 (HIVE-6977) and replaced with HiveServer2 (HIVE-2935), so the second use case no longer applies. For the first use case, Beeline provides or is supposed to provide equal functionality, yet is implemented differently from Hive CLI.
Apache Hive : SerDe
Dec 12, 2024
SerDe Overview
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
A SerDe allows Hive to read in data from a table, and write it back out to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
Apache Hive : StarRocks Integration
Dec 12, 2024
StarRocks has the ability to set up a Hive catalog, which enables you to query data from Hive without loading data into StarRocks or creating external tables. See here for more information.
Apache Hive : Storage Based Authorization in the Metastore Server
The metastore server security feature with storage based authorization was added to Hive in release 0.10.0 (HIVE-3705). This feature was previously introduced in HCatalog.
- For additional information about storage based authorization in the metastore server, see the HCatalog document Storage Based Authorization.
- For an overview of Hive authorization models and other security options, see the Authorization document.
The Need for Metastore Server Security
When multiple clients access the same metastore in a backing database, such as MySQL, the database connection credentials may be visible in the hive-site.xml configuration file. A malicious or incompetent user could cause serious damage to metadata even though the underlying data is protected by HDFS access controls.
Apache Hive : Streaming Data Ingest
Dec 12, 2024
Hive 3 Streaming API
Hive 3 Streaming API Documentation - new API available in Hive 3
Hive HCatalog Streaming API
Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.
Apache Hive : Streaming Data Ingest V2
Dec 12, 2024
Starting in release Hive 3.0.0, Streaming Data Ingest is deprecated and is replaced by the newer V2 API (HIVE-19205).
Hive Streaming API
Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”.
Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.
Apache Hive : TeradataBinarySerde
Dec 12, 2024
Availability
Earliest version TeradataBinarySerDe is available
The TeradataBinarySerDe is available in Hive 2.4 or greater.
Overview
Teradata can use TPT (Teradata Parallel Transporter) or BTEQ (Basic Teradata Query) to export and import data files compressed by gzip at very high speed. However, such binary files are encoded in Teradata’s proprietary format and can’t be directly consumed by Hive without a customized SerDe.
The TeradataBinarySerde enables users to read or write Teradata binary data as Hive Tables:
Apache Hive : Transitivity on predicate pushdown
Dec 12, 2024
Before Hive 0.8.0, the query
set hive.mapred.mode=strict;
create table invites (foo int, bar string) partitioned by (ds string);
create table invites2 (foo int, bar string) partitioned by (ds string);
select count(*) from invites join invites2 on invites.ds=invites2.ds where invites.ds='2011-01-01';
would give the error
Error in semantic analysis: No Partition Predicate Found for Alias "invites2" Table "invites2"
Here, the filter invites.ds='2011-01-01' is applied to the table invites, but the corresponding filter invites2.ds='2011-01-01' is not applied to invites2. This causes Hive to reject the query in strict mode to prevent scanning all the partitions of invites2. This can be seen by using explain plan on the query without strict mode on:
Apache Hive : Tutorial
Dec 12, 2024
Concepts
What Is Hive
Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive’s SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).
Apache Hive : Union Optimization
Dec 12, 2024
Consider the query
select * from
(subq1
UNION ALL
subq2) u;
If the inputs to the union are map-reduce jobs, they will write their output to temporary files. The union will then read the rows from these temporary files and write them to a final directory. In effect, the results are read and written twice unnecessarily. We can avoid this by writing directly to the final directory.
Apache Hive : User FAQ
Dec 12, 2024
General
I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
Run the following commands:
cd ~/.ant/cache/hadoop/core/sources
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
How to change the warehouse.dir location for older tables?
To change the base location of Hive tables, edit the hive.metastore.warehouse.dir parameter. This will not affect older tables; their metadata needs to be changed in the backing database (MySQL or Derby). The location of Hive tables is stored in the SDS table, in the LOCATION column.
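A heavily hedged sketch of that metadata change, run directly against a MySQL-backed metastore (back up the metastore first; the old and new location prefixes are placeholders):
-- rewrite stored table locations in the metastore's SDS table
UPDATE SDS SET LOCATION = REPLACE(LOCATION, 'hdfs://namenode/old/warehouse', 'hdfs://namenode/new/warehouse');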
Apache Hive : Using TiDB as the Hive Metastore database
Why use TiDB in Hive as the Metastore database?
TiDB is a distributed SQL database built by PingCAP and its open-source community. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. It’s a one-stop solution for both Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) workloads.
In scenarios with enormous amounts of data, due to TiDB’s distributed architecture, query performance is not limited to the capability of a single machine. When the data volume reaches the bottleneck, you can add nodes to improve TiDB’s storage capacity.