Capture Lineage Information In Hive Hooks
Background
In Hive, lineage information is captured in the form of LineageInfo
object. This object is created in the SemanticAnalyzer
and is passed to the HookContext
object. Users can use the following existing Hooks or implement their own custom hooks to capture this information and utilize it.
Existing Hooks
- org.apache.hadoop.hive.ql.hooks.PostExecutePrinter
- org.apache.hadoop.hive.ql.hooks.LineageLogger
- org.apache.atlas.hive.hook.HiveHook
To facilitate the capture of lineage information in a custom hook or in a use case where the existing hooks are not set in hive.exec.post.hooks
, a new configuration hive.lineage.hook.info.enabled
was introduced in HIVE-24051. This configuration is set to false
by default.
To provide filtering capability on query type in the lineage information, a new configuration hive.lineage.hook.info.query.type
was introduced in HIVE-28409, with default value as “ALL”. Users can tune the configuration accordingly to capture lineage information only for the required query types. In HIVE-28409, the previously introduced configuration hive.lineage.hook.info.enabled
was marked as deprecated.
NOTE: HIVE-28409, will be available in Hive-4.1.0 release.
Usage example:
hive.lineage.hook.info.query.type=ALL -- will capture lineage info for all the queries.
hive.lineage.hook.info.query.type=CREATE_VIEW,CREATE_TABLE_AS_SELECT -- will capture lineage info related to these 2 particulare query types only.
hive.lineage.hook.info.query.type=NONE -- will not capture lineage info for any query.
Previously, to capture lineage information, users has 2 ways:
- Set any of the above mentioned existing hooks in
hive.exec.post.hooks
configuration. - Set
hive.lineage.hook.info.enabled
as true in cluster and restart HiveServer2 service. (Valid since Hive-4.0.0 release).
NOTE: Just by enabling hive.lineage.hook.info.enabled
, lineage information for “Create View” query type won’t be captured, user has to set the existing hooks in hive.exec.post.hooks
along with their custom hook class name.
Changes done in HIVE-28768
The hardcoded values of the existing hooks that capture lineage information in SemanticAnalyzer
and Optimizer
code has been removed and to determine, whether lineage information should be captured or not, the value of hive.lineage.hook.info.query.type
configuration is checked. The default value of hive.lineage.hook.info.query.type
has been set to “NONE”.
Implications of HIVE-28768 on users
- Users migrating directly from Hive-3.x to HIVE-4.1.0 will observe breaking changes in the way lineage information is captured. Setting
hive.exec.post.hooks
to any of the existing hooks will not capture lineage information anymore. Users will have to make use ofhive.lineage.hook.info.query.type
configuration to capture lineage information. - Users migrating from Hive-4.0.x to Hive-4.1.0 who don’t have
hive.lineage.hook.info.enabled
set to true, will also observe breaking changes in the way lineage information is captured.
NOTE: Recommended way to capture lineage information is though hive.lineage.hook.info.query.type
configuration as hive.lineage.hook.info.enabled
is marked as deprecated and is subjected to be removed in future release