How to cache subquery result in WITH clause in Spark SQL

Question

I wonder whether Spark SQL supports caching the result of a query defined in a WITH clause. The Spark SQL query looks something like this:

with base_view as
(
 select some_columns from some_table
WHERE 
 expensive_udf(some_column) = true
)
... multiple query join based on this view

While this query works with Spark SQL, I noticed that the UDF was applied to the same data set multiple times. In this use case, the UDF is very expensive, so I'd like to cache the result of base_view so that the subsequent queries benefit from the cached result.

P.S. I know you can create and cache a table with the given query and then reference it in the subqueries. In this specific case, though, I can't create any tables or views.

[Mark your UDF as nondeterministic](https://stackoverflow.com/q/42367464/10465355)? — 10465355, Feb 17 '19 at 23:37
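
The suggestion in the comment above could be sketched in Scala roughly as follows. This is a minimal illustration under stated assumptions, not the asker's code: the UDF body and the sample rows are hypothetical stand-ins, and a local SparkSession is assumed. `asNondeterministic()` is available on `UserDefinedFunction` since Spark 2.3. Whether it actually removes the repeated evaluation for a CTE that is referenced several times depends on the Spark version and the resulting plan, so the sketch only prints the plan for inspection.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NondeterministicUdfSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for illustration.
    val spark = SparkSession.builder()
      .appName("nondeterministic-udf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for some_table from the question.
    Seq("a", "", "b").toDF("some_column").createOrReplaceTempView("some_table")

    // Hypothetical stand-in for the expensive predicate.
    val expensive = udf((s: String) => s != null && s.nonEmpty)

    // Register the UDF for SQL use, flagged as nondeterministic.
    spark.udf.register("expensive_udf", expensive.asNondeterministic())

    // Reference the CTE twice and print the plan to see how often the UDF appears.
    spark.sql("""
      WITH base_view AS (
        SELECT some_column FROM some_table
        WHERE expensive_udf(some_column) = true
      )
      SELECT l.some_column
      FROM base_view l
      JOIN base_view r ON l.some_column = r.some_column
    """).explain()

    spark.stop()
  }
}
```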

score 2 · Accepted Answer · answered Feb 17 '19 at 23:14

That is not possible. The WITH result cannot be persisted after execution or substituted into a new Spark SQL invocation.

thebluephantom


score 1 · Answer 2 · answered Feb 17 '19 at 23:26

The WITH clause allows you to give a name to a temporary result set so it can be reused several times within a single query. I believe what he's asking for is a materialized view.

Jim Castro


Please provide an example and I will withdraw my answer and stand corrected. – thebluephantom Feb 18 '19 at 09:59

score 0 · Answer 3 · answered Jun 22 '21 at 08:47

This can be done by executing several SQL queries.

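// First, cache the result of the expensive query under a name.
spark.sql("""
  CACHE TABLE base_view AS
  SELECT some_columns
  FROM some_table
  WHERE expensive_udf(some_column) = true
""")

// Then reference the cached result in the follow-up query.
spark.sql("""
  ... multiple query join based on this view
""")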
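
As a rough, self-contained illustration of the `CACHE TABLE ... AS SELECT` approach above, the sketch below runs it against a tiny in-memory data set and checks the cache state with `spark.catalog.isCached`. The sample rows, the column names `k` and `v`, and the plain `v > 1` filter (standing in for the expensive UDF) are all assumptions made for the example. Note that `CACHE TABLE` is eager by default; `CACHE LAZY TABLE` defers materialization until first use.

```scala
import org.apache.spark.sql.SparkSession

object CacheTableSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only for illustration.
    val spark = SparkSession.builder()
      .appName("cache-table-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for some_table.
    Seq(("a", 1), ("b", 2), ("c", 3)).toDF("k", "v")
      .createOrReplaceTempView("some_table")

    // Creates a temporary view named base_view and caches its result.
    // The v > 1 filter is a placeholder for the expensive UDF predicate.
    spark.sql("""
      CACHE TABLE base_view AS
      SELECT k, v FROM some_table WHERE v > 1
    """)
    assert(spark.catalog.isCached("base_view"))

    // Follow-up queries read from the cached result.
    spark.sql("""
      SELECT l.k, r.v
      FROM base_view l
      JOIN base_view r ON l.k = r.k
    """).show()

    // Release the cached data when it is no longer needed.
    spark.sql("UNCACHE TABLE base_view")
    assert(!spark.catalog.isCached("base_view"))

    spark.stop()
  }
}
```

This still creates a session-scoped temporary view, so it only applies where temporary views are permitted.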

marie20 · Answer 4 · 2022-04-11T19:45:54.230

Not sure if you are still interested in a solution, but the following workaround accomplishes the same thing:

spark.sql("""
         | create temp view my_view
         | as
         | WITH base_view as
         | (
         | select some_columns 
         | from some_table
         | WHERE 
         | expensive_udf(some_column) = true
         | )
         | SELECT *
         | from base_view 
      """.stripMargin);

spark.sql("""CACHE TABLE my_view""");

Now you can use the my_view temp view to join to other tables, as shown below:

spark.sql("""
         | select mv.col1, t2.col2, t3.col3
         | from my_view mv
         | join tab2 t2
         | on mv.col2 = t2.col2 
         | join tab3 t3
         | on mv.col3 = t3.col3 
      """.stripMargin);

Remember to uncache the view once you are done with it:

spark.sql("""UNCACHE TABLE my_view""");

Hope this helps.
