Apache Hive : Subqueries in SELECT

Problem

Currently Hive doesn’t support subqueries in a SELECT statement, for example, the following query will not run on Hive:

SELECT customer.customer_num,
	(SELECT SUM(ship_charge) 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 

Recently a lot of work has been done to extend support for subqueries (HIVE-15456). But this work primarily targeted extending subquery support in WHERE and HAVING clauses. We plan to continue the work done in HIVE-15456 to support subqueries in a select list (see HIVE-16091).

Assumptions

We plan to limit the scope with the following assumptions and limitations.

Not Supported

-- subquery in non-simple expression
SELECT 1 + (SELECT SUM(ship_charge) FROM orders), customer.customer_num FROM customer
 
-- subquery in CASE
SELECT CASE WHEN (select count(*) from store_sales 
                  where ss_quantity between 1 and 20) > 409437
            THEN (select avg(ss_ext_list_price) 
                  from store_sales 
                  where ss_quantity between 1 and 20) 
            ELSE (select avg(ss_net_paid_inc_tax)
                  from store_sales
                  where ss_quantity between 1 and 20) end bucket1
FROM reason
WHERE r_reason_sk = 1

Not Supported

SELECT customer.customer_num,
	(SELECT ship_charge 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 

Not Supported

SELECT customer.customer_num,
	(SELECT ship_charge, customer_num
		FROM orders LIMIT 1
	) AS total_ship_chg
FROM customer

Not Supported

SELECT customer.customer_num,
	(SELECT customer.customer_num 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 

Design

Given the assumptions above, the following kind of subqueries could be used in SELECT. 

SELECT customer.customer_num,
	(SELECT SUM(ship_charge) 
		FROM orders
		WHERE customer.customer_num = orders.customer_num
	) AS total_ship_chg
FROM customer 
SELECT p_size IN (
		SELECT MAX(p_size) FROM part)
FROM part
SELECT EXISTS(SELECT p_size FROM part)
FROM part

All of the above queries could be correlated or uncorrelated.

Design for this will be similar to the work done in HIVE-15456.

 HIVE-16091 covers the initial work for supporting subqueries in SELECT.