What is Hive?
Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale. Hive Metastore(HMS) provides a central repository of metadata that can easily be analyzed to make informed, data driven decisions, and therefore it is a critical component of many data lake architectures. Hive is built on top of Apache Hadoop and supports storage on S3, adls, gs etc though hdfs. Hive allows users to read, write, and manage petabytes of data using SQL.
Key Features
Hive Metastore Server (HMS)
The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions in a relational database, and provides clients (including Hive, Impala and Spark) access to this information using the metastore service API. It has become a building block for data lakes that utilize the diverse world of open-source software, such as Apache Spark and Presto. In fact, a whole ecosystem of tools, open-source and otherwise, are built around the Hive Metastore, some of which this diagram illustrates.
Hive ACID
Hive provides full ACID support for ORC tables and insert only support to all other formats.
Security and Observability
Apache Hive supports kerberos auth and integrates with Apache Ranger and Apache Atlas for security and observability.
Query planner and Cost based Optimizer
Hive uses Apache Calcite's cost based query optimizer (CBO) and query execution framework to optimize sql queries.