How do I eliminate a second seq scan over a table when deriving a new table?

Question

Let's say I have some sample data, 100 million rows.

CREATE TEMP TABLE foo
AS
  SELECT id, md5(id::text), trunc(random()*1e6)
  FROM generate_series(1,1e6) AS t(id);

This will generate a table like this..

 id |               md5                | trunc  
----+----------------------------------+--------
  1 | c4ca4238a0b923820dcc509a6f75849b | 159632
  2 | c81e728d9d4c2f636f067f89cc14862c | 182952
  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3 | 438287
  4 | a87ff679a2f3e71d9181a67b7542122c |  78240
  5 | e4da3b7fbbce2345d7772b0674a318d5 |  20293
  6 | 1679091c5a880faf6fb5e6087eb1b2dc | 909742
  7 | 8f14e45fceea167a5a36dedd4bea2543 | 926496
  8 | c9f0f895fb98ab9159f51fd0297e236d | 463718
  9 | 45c48cce2e2d7fbdea1afc51c7c6ad26 |  65842
 10 | d3d9446802a44259755d38e6d163e820 |  81791

How can I then generate a table with one scan that resembles this..

SELECT id, md5::text AS x
FROM foo
UNION ALL
  SELECT id, trunc::text
  FROM foo;

 id |                x                 
----+----------------------------------
  1 | c4ca4238a0b923820dcc509a6f75849b
  1 | 961453
  2 | c81e728d9d4c2f636f067f89cc14862c
  2 | 842364
  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
  3 | 784693
  4 | a87ff679a2f3e71d9181a67b7542122c
  4 | 602039
  5 | e4da3b7fbbce2345d7772b0674a318d5
  5 | 176938
...

But that generates a query plan like this,

                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..33832.52 rows=1514052 width=64) (actual time=0.025..1034.740 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.025..173.272 rows=1000000 loops=1)
   ->  Seq Scan on foo foo_1  (cost=0.00..16916.26 rows=757026 width=64) (actual time=0.016..715.279 rows=1000000 loops=1)
 Planning time: 0.128 ms
 Execution time: 1103.499 ms
(5 rows)

What would it look like to have one sec scan, and would it be faster if the table was only read once?

Question inspired from this conversation

score 2 · Accepted Answer · edited Apr 13 '17 at 12:42

Answer inspired from this

A few methods, all test with PostgreSQL 9.5.

`CROSS JOIN LATERAL ... VALUES`

This is actually slower, but it seems good for a first attempt..

SELECT id, x
FROM foo
CROSS JOIN LATERAL (VALUES (md5),(trunc::text))
  AS t(x);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..50982.43 rows=1514052 width=64) (actual time=0.035..1934.061 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.027..114.655 rows=1000000 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=32) (actual time=0.000..0.001 rows=2 loops=1000000)
 Planning time: 0.115 ms
 Execution time: 2027.840 ms
(5 rows)

`CROSS JOIN ... VALUES CASE`

SELECT id, CASE WHEN x THEN md5 ELSE trunc::text END AS x 
FROM foo
CROSS JOIN (VALUES (true),(false))
  AS t(x);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..43412.20 rows=1514052 width=73) (actual time=0.036..1318.494 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.026..108.375 rows=1000000 loops=1)
   ->  Materialize  (cost=0.00..0.04 rows=2 width=1) (actual time=0.000..0.000 rows=2 loops=1000000)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=1) (actual time=0.002..0.003 rows=2 loops=1)
 Planning time: 0.104 ms
 Execution time: 1381.685 ms
(6 rows)

Row duplication with `ARRAY/unnest`

SELECT id, x
FROM foo
CROSS JOIN LATERAL unnest(ARRAY[md5,trunc::text])
  AS t(x);

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.01..1530968.27 rows=75702600 width=64) (actual time=0.036..3329.324 rows=2000000 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16916.26 rows=757026 width=72) (actual time=0.015..156.087 rows=1000000 loops=1)
   ->  Function Scan on unnest t  (cost=0.01..1.01 rows=100 width=32) (actual time=0.002..0.003 rows=2 loops=1000000)
 Planning time: 0.054 ms
 Execution time: 3439.064 ms
(5 rows)

tldr;

Neither of these methods are faster. They're slower, and more complex. Stick with the UNION ALL.

That is unexpected. Because in SQL Server, the opposite is true. Using CROSS APPLY to double up the rows and do a single scan is about half the time of doing two scans. — ErikE, Feb 16 '17 at 23:17
I'm not sure what exactly is happening under the hood here, or why it has to be so or what optimizations are missing. — Evan Carroll, Feb 16 '17 at 23:19
Maybe we should make another question as to why it's slower tag it with database-internals and wait for Erwin to answer it. Out of my pay grade past the work I've done here. — Evan Carroll, Feb 16 '17 at 23:20
I gave you the credit under the question for your post in the chat. =) Your inspiration for the chat is aside from my inspiration for the question. I already answered that question anyway. Every time I mention it a special someone wants to to have me banned for brining it up. So I leave it as-is. — Evan Carroll, Feb 16 '17 at 23:29
What is faster or not might depend on whether the second scan finds everything already in memory, or a table sufficiently large (that will not fit in memory) ends up with minimal data cached. In the second case, the second scan will need to access disk a second time and be (possibly) more expensive than the CROSS JOIN LATERAL. 1 M rows fits in memory of contemporary computers... but there is always a quantity that doesn't fit. — joanolo, Feb 17 '17 at 00:16
I found a simpler way which is SELECT id, unnest(ARRAY[md5, trunc::text]) from foo;. I did some testing and found that @joanolo is correct. If the entire table fits in shared buffers, the UNION ALL is faster. If it doesn't, the other solutions are faster as they only have to read from disk once. — malisper, Feb 17 '17 at 06:35

How do I eliminate a second seq scan over a table when deriving a new table?

1 Answers1

CROSS JOIN LATERAL ... VALUES

CROSS JOIN ... VALUES CASE

Row duplication with ARRAY/unnest

tldr;

`CROSS JOIN LATERAL ... VALUES`

`CROSS JOIN ... VALUES CASE`

Row duplication with `ARRAY/unnest`