Apache Hive

Apache Hive™ is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.

What is Hive?

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale. The Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data-driven decisions, which makes it a critical component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage on S3, ADLS, GS, etc. through HDFS. Hive allows users to read, write, and manage petabytes of data using SQL.

Key Features

HiveServer2 (HS2)

HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

    beeline -u "jdbc:hive2://host:10001/default"
    Connected to: Apache Hive
    jdbc:hive2://host:10001/> select count(*) from test_t1;

Hive Metastore Server (HMS)

The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions, stored in a relational database. It gives clients (including Hive, Impala, and Spark) access to this information through the metastore service API. It has become a building block for data lakes that draw on the diverse world of open-source software, such as Apache Spark and Presto. A whole ecosystem of tools, open-source and otherwise, is built around the Hive Metastore.
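As a rough illustration of the metadata HMS serves, the statements below could be run from a Beeline session. The table name test_t1 is borrowed from the other examples on this page, and the assumption that it is partitioned is hypothetical.

```sql
-- List the databases registered in the metastore.
SHOW DATABASES;

-- List the tables in the current database.
SHOW TABLES;

-- Show columns, storage location, file format, and table properties for one table.
DESCRIBE FORMATTED test_t1;

-- List partitions (assumes test_t1 is partitioned; this fails on an unpartitioned table).
SHOW PARTITIONS test_t1;
```

Each of these statements is answered from metastore metadata rather than by scanning table data.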

Hive ACID

Hive provides full ACID support for ORC tables and insert-only support for all other formats.
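The difference between the two modes can be sketched as follows. The table names acid_demo and insert_only_demo are hypothetical, and the sketch assumes a session already configured for transactions (hive.support.concurrency=true and hive.txn.manager set to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager).

```sql
-- Full ACID table: ORC storage plus the transactional table property.
CREATE TABLE acid_demo (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO acid_demo VALUES (1, 'alpha'), (2, 'beta');
UPDATE acid_demo SET name = 'gamma' WHERE id = 2;  -- row-level update
DELETE FROM acid_demo WHERE id = 1;                -- row-level delete

-- Insert-only transactional table on a non-ORC format (Parquet chosen as an example).
CREATE TABLE insert_only_demo (id INT, name STRING)
STORED AS PARQUET
TBLPROPERTIES ('transactional'='true',
               'transactional_properties'='insert_only');

INSERT INTO insert_only_demo VALUES (1, 'alpha');
```

UPDATE and DELETE apply only to the full ACID table. The insert-only table accepts transactional inserts but not row-level changes.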

Hive Data Compaction

Query-based and MapReduce-based (MR-based) data compactions are supported out of the box.

    jdbc:hive2://> alter table test_t1 compact "MAJOR";
    Done!
    jdbc:hive2://> alter table test_t1 compact "MINOR";
    Done!
    jdbc:hive2://> show compactions;

Hive Iceberg

Hive provides out-of-the-box support for Apache Iceberg tables, a cloud-native, high-performance open table format, via the Hive StorageHandler.
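A minimal sketch of creating and querying an Iceberg table from Hive is shown below. The table name ice_demo is hypothetical, and the short STORED BY ICEBERG form assumes a recent Hive release (4.x). Older setups may need the storage handler class name spelled out, as noted in the comment.

```sql
-- Create an Iceberg table through the Hive storage handler.
-- On older releases the long form may be required instead:
--   STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
CREATE TABLE ice_demo (id INT, name STRING)
STORED BY ICEBERG;

INSERT INTO ice_demo VALUES (1, 'alpha'), (2, 'beta');

SELECT id, name FROM ice_demo;
```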

Security and Observability

Apache Hive supports Kerberos authentication and integrates with Apache Ranger and Apache Atlas for security and observability.

Hive LLAP

Apache Hive enables interactive, sub-second SQL through Low Latency Analytical Processing (LLAP), introduced in Hive 2.0, which makes Hive faster by using persistent query infrastructure and optimized data caching.
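As a hedged sketch, a session might opt in to LLAP execution with settings like the ones below. It assumes LLAP daemons are already running in the cluster and that Tez is the execution engine. The right values depend on the deployment, so treat them as illustrative.

```sql
-- Assumption: LLAP daemons are deployed and reachable from this HiveServer2 instance.
SET hive.execution.engine=tez;
SET hive.execution.mode=llap;       -- run query fragments in LLAP rather than plain containers
SET hive.llap.execution.mode=all;   -- send all eligible work to the LLAP daemons

SELECT count(*) FROM test_t1;
```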

Query planner and Cost based Optimizer

Hive uses Apache Calcite's cost-based query optimizer (CBO) and query execution framework to optimize SQL queries.

    jdbc:hive2://> explain cbo select ss.ss_net_profit, sr.sr_net_loss from store_sales ss join store_returns sr on (ss.ss_item_sk=sr.sr_item_sk) limit 5 ;
    +---------------------------------------------+
                       Explain
    +---------------------------------------------+
    CBO PLAN:
    HiveSortLimit(fetch=[5])
      HiveProject(ss_net_profit=[$1], sr_net_loss=[$3])
        HiveJoin(condition=[=($0, $2)], joinType=[inner])
          HiveProject(ss_item_sk=[$2], ss_net_profit=[$22])
            HiveFilter(condition=[IS NOT NULL($2)])
              HiveTableScan(table=[[tpcds_text_10, store_sales]], table:alias=[ss])
          HiveProject(sr_item_sk=[$2], sr_net_loss=[$19])
            HiveFilter(condition=[IS NOT NULL($2)])
              HiveTableScan(table=[[tpcds_text_10, store_returns]], table:alias=[sr])
    +---------------------------------------------+
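A cost-based optimizer can only estimate costs from the statistics it has, so plans for tables such as store_sales and store_returns generally improve once table and column statistics are collected. The sketch below assumes those two tables exist in the current database.

```sql
-- Make sure the cost-based optimizer is enabled for this session.
SET hive.cbo.enable=true;

-- Collect basic table-level statistics (row counts, sizes).
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_returns COMPUTE STATISTICS;

-- Collect column-level statistics used for selectivity and join-size estimates.
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;
ANALYZE TABLE store_returns COMPUTE STATISTICS FOR COLUMNS;
```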

Hive Replication

Hive supports bootstrap and incremental replication for backup and recovery.

    jdbc:hive2://> repl dump src with (
    . . .> 'hive.repl.dump.version'= '2',
    . . .> 'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1'
    . . .> );
    Done!
    jdbc:hive2://> repl load src into tgt with (
    . . .> 'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1'
    . . .> );
    Done!
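After an initial bootstrap, later replication cycles usually carry only the changes made since the last load. The sketch below reuses the src and tgt database names from the example on this page. It assumes a recent Hive release in which a repeated REPL DUMP against the same root directory produces an incremental dump. Older releases may instead require an explicit starting event ID, so check the behavior of your version.

```sql
-- On the target: check the last event ID that has been replicated into tgt.
REPL STATUS tgt;

-- On the source: dump again. Assumption: once a bootstrap has been loaded,
-- this produces an incremental dump of the changes made since then.
REPL DUMP src WITH (
  'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1'
);

-- On the target: apply the new dump.
REPL LOAD src INTO tgt WITH (
  'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1'
);
```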