Apache Hive : Query ReExecution

Query reexecution provides a facility to re-run the query multiple times in case of an unfortunate event happens.

Introduced in Hive 3.0 (HIVE-17626)

ReExecition strategies

Overlay

Enables to change the hive settings for all reexecutions which will be happening. It works by adding a configuration subtree as an overlay to the actual hive settings(reexec.overlay.*)

Example

set zzz=1;
set reexec.overlay.zzz=2;

set hive.query.reexecution.enabled=true;
set hive.query.reexecution.strategies=overlay;

create table t(a int);
insert into t values (1);
select assert\_true(${hiveconf:zzz} > a) from t group by a;

Every hive setting which has a prefix of “reexec.overlay” will be set for all reexecutions.

A more real life example would be to disable join auto conversion for all reexecutions:

set reexec.overlay.hive.auto.convert.join=false;

Reoptimize

During query execution; the actual number passing rows in every operator is tracked. This information is reused during re-planning which could result in a better plan.

Situation in which this would be needed:

It’s not that easy to craft queries which will lead to OOM situations; but to enable it:

set hive.query.reexecution.strategies=overlay,reoptimize;

Operator Matching

Operator level statistics are matched to the new plan using operator subtree matching this also enables to match the information to a query which have “similar” parts.

Configuration

Configuration default
hive.query.reexecution.enabled true Feature enabler
hive.query.reexecution.strategies overlay,reoptimize reexecution plugins; currently overlay and reoptimize is supported
hive.query.reexecution.stats.persist.scope query runtime statistics can be persisted:* query: - only used during the reexecution