Apache Hive : Subqueries in SELECT

Problem

Currently Hive doesn’t support subqueries in a SELECT statement, for example, the following query will not run on Hive:

SELECT customer.customer\_num,
	(SELECT SUM(ship\_charge) 
		FROM orders
		WHERE customer.customer\_num = orders.customer\_num
	) AS total\_ship\_chg
FROM customer 

Recently a lot of work has been done to extend support for subqueries (HIVE-15456). But this work primarily targeted extending subquery support in WHERE and HAVING clauses. We plan to continue the work done in HIVE-15456 to support subqueries in a select list (see HIVE-16091).

Assumptions

We plan to limit the scope with the following assumptions and limitations.

Not Supported

-- subquery in non-simple expression
SELECT 1 + (SELECT SUM(ship\_charge) FROM orders), customer.customer\_num FROM customer
 
-- subquery in CASE
SELECT CASE WHEN (select count(*) from store\_sales 
                  where ss\_quantity between 1 and 20) > 409437
            THEN (select avg(ss\_ext\_list\_price) 
                  from store\_sales 
                  where ss\_quantity between 1 and 20) 
            ELSE (select avg(ss\_net\_paid\_inc\_tax)
                  from store\_sales
                  where ss\_quantity between 1 and 20) end bucket1
FROM reason
WHERE r\_reason\_sk = 1

Not Supported

SELECT customer.customer\_num,
	(SELECT ship\_charge 
		FROM orders
		WHERE customer.customer\_num = orders.customer\_num
	) AS total\_ship\_chg
FROM customer 

Not Supported

SELECT customer.customer\_num,
	(SELECT ship\_charge, customer\_num
		FROM orders LIMIT 1
	) AS total\_ship\_chg
FROM customer

Not Supported

SELECT customer.customer\_num,
	(SELECT customer.customer\_num 
		FROM orders
		WHERE customer.customer\_num = orders.customer\_num
	) AS total\_ship\_chg
FROM customer 

Design

Given the assumptions above, the following kind of subqueries could be used in SELECT. 

SELECT customer.customer\_num,
	(SELECT SUM(ship\_charge) 
		FROM orders
		WHERE customer.customer\_num = orders.customer\_num
	) AS total\_ship\_chg
FROM customer 
SELECT p\_size IN (
		SELECT MAX(p\_size) FROM part)
FROM part
SELECT EXISTS(SELECT p\_size FROM part)
FROM part

All of the above queries could be correlated or uncorrelated.

Design for this will be similar to the work done in HIVE-15456.

 HIVE-16091 covers the initial work for supporting subqueries in SELECT.