Apache Hive

Distributed Data Warehouse at Massive Scale

The Apache Hive™ is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.

Get Started View on GitHub

Docker Mailing Lists Community Documentation

18+

Years of Development

1000+

Enterprise Deployments

Petabytes

Data Processed Daily

Global

Community

Why Choose Apache Hive?

Trusted by enterprises worldwide for mission-critical data analytics

Used by Industry Leaders

Battle-Tested Performance

Over 18 years of development and optimization, handling petabytes of data in production environments across the globe.

Vibrant Ecosystem

Seamlessly integrates with Spark, Presto, Impala, and hundreds of other tools in the modern data stack.

SQL-First Approach

Familiar SQL interface makes it easy for data analysts and engineers to work with big data without learning new languages.

Cloud-Native Ready

Native support for S3, Azure Data Lake, Google Cloud Storage, and other cloud storage systems.

Enterprise Security

Comprehensive security features including Kerberos authentication, fine-grained access control, and audit logging.

Apache Foundation

Backed by the Apache Software Foundation with a strong commitment to open source principles and community governance.

What is Apache Hive?

Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale.

Central Metadata Repository

Hive Metastore (HMS) provides a central repository of metadata that can easily be analyzed to make informed, data driven decisions, making it a critical component of many data lake architectures.

SQL Analytics at Scale

Built on top of Apache Hadoop with support for S3, ADLS, GS and more. Hive allows users to read, write, and manage petabytes of data using familiar SQL syntax.

Key Features & Capabilities

Powerful tools for modern data analytics and management

beeline -u "jdbc:hive2://host:10001/default" Connected to: Apache Hive jdbc:hive2://host:10001/>select count(*) from test_t1;

HiveServer2 (HS2)

HS2 supports multi-client concurrency and authentication with better support for open API clients like JDBC and ODBC, enabling seamless integration with business intelligence tools and applications.

Learn More

Hive Metastore Server (HMS)

The central repository of metadata for Hive tables and partitions, providing clients including Hive, Impala, and Spark access through the metastore service API. A fundamental building block for modern data lakes.

Learn More

ACID Transactions

Full ACID support for ORC tables and insert-only support for all other formats, ensuring data consistency and reliability in concurrent environments.

Learn More

jdbc:hive2://> alter table test_t1 compact "MAJOR"; Done! jdbc:hive2://> alter table test_t1 compact "MINOR"; Done! jdbc:hive2://> show compactions;

Data Compaction

Query-based and MapReduce-based data compactions are supported out-of-the-box, optimizing storage efficiency and query performance.

Learn More

Apache Iceberg Support

Out-of-the-box support for Apache Iceberg tables, a cloud-native, high-performance open table format, via Hive StorageHandler for modern data lake architectures.

Learn More

Security & Observability

Enterprise-grade security with Kerberos authentication and seamless integration with Apache Ranger for authorization and Apache Atlas for data lineage and governance.

Learn More

Low Latency Analytics (LLAP)

Interactive and sub-second SQL queries through persistent query infrastructure and optimized data caching, making Hive suitable for real-time analytics workloads.

Learn More

jdbc:hive2://> explain cbo select ss.ss_net_profit, sr.sr_net_loss from store_sales ss join store_returns sr on (ss.ss_item_sk=sr.sr_item_sk) limit 5 ; +---------------------------------------------+ Explain +---------------------------------------------+ CBO PLAN: HiveSortLimit(fetch=[5]) HiveProject(ss_net_profit=[$1], sr_net_loss=[$3]) HiveJoin(condition=[=($0, $2)], joinType=[inner]) HiveProject(ss_item_sk=[$2], ss_net_profit=[$22]) HiveFilter(condition=[IS NOT NULL($2)]) HiveTableScan(table=[[tpcds_text_10, store_sales]], table:alias=[ss]) HiveProject(sr_item_sk=[$2], sr_net_loss=[$19]) HiveFilter(condition=[IS NOT NULL($2)]) HiveTableScan(table=[[tpcds_text_10, store_returns]], table:alias=[sr]) +---------------------------------------------+

Cost-Based Optimizer

Apache Calcite's cost-based query optimizer (CBO) and execution framework automatically optimize SQL queries for optimal performance and resource utilization.

Learn More

jdbc:hive2://> repl dump src with ( 'hive.repl.dump.version'= '2', 'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1' ); Done! jdbc:hive2://> repl load src into tgt with ( 'hive.repl.rootdir'= 'hdfs://<host>:<port>/user/replDir/d1' ); Done!

Data Replication

Bootstrap and incremental replication capabilities for robust backup and disaster recovery, ensuring business continuity and data protection.

Learn More

Ready to Get Started with Apache Hive?

Join thousands of organizations using Apache Hive to power their data analytics and build modern data lakes.

Start Your Journey Try with Docker

Documentation Source Code Mailing Lists Contribute