Query plan for a dynamically built `query_int` using `format()`

Question

To keep things simple, here is a statically built version of the query that runs perfectly fast:

SELECT * FROM media WHERE tag_ids @@ '123&321'::query_int ORDER BY rating DESC LIMIT 20;

The catch is that the tag IDs are kept in another table and of course the app's UI accepts tags as strings. So the actual query looks something like this:

SELECT * FROM media WHERE tag_ids
  @@ (SELECT format('%s', (
    SELECT ARRAY_TO_STRING((
      SELECT ARRAY(
        WITH tag_ids as (SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')
      )
      SELECT id FROM tag_ids
      UNION
      SELECT 0 FROM tag_ids WHERE NOT EXISTS (SELECT 1 FROM tag_ids)
    )), '&')
  )))::query_int
ORDER BY rating DESC LIMIT 20;

The problem is that the query planner always estimates that the tag_ids @@ ... clause will match around 32k rows. Given the LIMIT of 20, an index scan on rating that filters on tag_ids as it goes then makes total sense. But some, indeed most, tag filters match far fewer rows than that, which makes the index scan painfully slow, e.g.:

->  Index Scan using media_rating_index on media  (cost=0.43..2741321.93 rows=32509 width=11) (actual time=70.816..24270.968 rows=20 loops=1)
         Filter: (tag_ids @@ ($5)::query_int)
         Rows Removed by Filter: 1565931
 Planning Time: 0.514 ms
 Execution Time: 24271.155 ms

Of course, when using a statically built tag_ids @@ ... clause for tags that are known to be rare, the query planner chooses a tag-based index, e.g.:

Limit  (cost=9163.00..9163.05 rows=20 width=11) (actual time=3.838..3.841 rows=3 loops=1)
   ->  Sort  (cost=9163.00..9168.96 rows=2384 width=11) (actual time=3.834..3.836 rows=3 loops=1)
         Sort Key: rating DESC
         Sort Method: quicksort  Memory: 25kB
         ->  Bitmap Heap Scan on media  (cost=38.48..9099.57 rows=2384 width=11) (actual time=2.780..3.796 rows=3 loops=1)
               Recheck Cond: (tag_ids @@ '654&321'::query_int)
               Heap Blocks: exact=3
               ->  Bitmap Index Scan on media_tag_ids_gin__int_ops_index  (cost=0.00..37.88 rows=2384 width=0) (actual time=2.023..2.023 rows=3 loops=1)
                     Index Cond: (tag_ids @@ '654&321'::query_int)
 Planning Time: 2.218 ms
 Execution Time: 3.952 ms

One way to solve this is to send two queries to the DB from the application; the format() step could then even be done in application code. But that feels a bit like giving up. The other option is an SQL function, but that seems like overkill. Is there a more idiomatic way to give the query planner the tag IDs, so that it can better estimate the number of returned rows and pick a better plan?

What's the query_int cast here '123&321'::query_int - is it a user defined type? — Vérace, Oct 28 '19 at 05:31
Ah yes, that's part of the intarray extension: https://www.postgresql.org/docs/11/intarray.html — tombh, Oct 28 '19 at 07:20
Postgres version, row count, table and index definitions would be useful for the best solution. Also typical array length and typical query_int length. Is the query always built with & as your query suggests? — Erwin Brandstetter, Oct 29 '19 at 01:17

Erwin Brandstetter · Accepted Answer · 2019-11-01T16:35:02.350

First of all, you can simplify the query you have. format() is not needed, among other things:

SELECT *
FROM   media
WHERE  tag_ids @@ array_to_string(ARRAY (
         WITH  tag_ids AS (SELECT id FROM tags WHERE tag IN ('kittens', 'puppies'))
         TABLE tag_ids
         UNION ALL
         SELECT 0 WHERE NOT EXISTS (TABLE tag_ids)
         ), '&')::query_int
ORDER  BY rating DESC
LIMIT  20;

As long as your query_int value only uses &, you can simplify further with the array "contains" operator @>:

SELECT *
FROM   media
WHERE  tag_ids @> COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')), '{}'), '{0}')
ORDER  BY rating DESC
LIMIT  20;
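The simplification rests on two things: with only & in the query_int, @@ asks the same question as @>, and the COALESCE(NULLIF(...)) wrapper substitutes {0} when no tag matches (presumably because no tag has ID 0, so the filter then matches nothing). A minimal sketch to check both on literal values, assuming the intarray extension is installed; the arrays are made-up sample data:

```sql
-- Both columns should be true: every listed ID is present in the array.
SELECT '{123,321,654}'::int[] @@ '123&321'::query_int AS match_query_int,
       '{123,321,654}'::int[] @> '{123,321}'::int[]   AS contains_array;

-- An empty lookup result is replaced by {0} instead of staying {}.
SELECT COALESCE(NULLIF(ARRAY(SELECT 1 WHERE false), '{}'), '{0}') AS ids;
```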

This should be faster already. But the core problem remains, and it is very similar to the one discussed in detail here:

Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N

Basically this: Postgres (necessarily) chooses the query plan based on incomplete information. Column statistics cannot contain all frequencies for many rare values. So it's very hard to estimate which plan will be faster:

plan A: traverse the index media_rating_index and filter
plan B: bitmap index scan on media_tag_ids_gin__int_ops_index, sort by ratings, limit

Your case adds complication with a convoluted subquery, which makes it even harder to estimate what might come out of it.
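To see how far apart the two plans are for a given set of tags, one option is to time the query as planned, then again with plain index scans discouraged for the current transaction only. A sketch using the simplified query from above; enable_indexscan is a standard planner setting, but whether the planner then lands on plan B depends on your data and version:

```sql
BEGIN;

-- Plan as chosen by the planner (plan A in the problematic cases).
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   media
WHERE  tag_ids @> COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')), '{}'), '{0}')
ORDER  BY rating DESC
LIMIT  20;

-- Discourage plain index scans; bitmap index scans remain available.
SET LOCAL enable_indexscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   media
WHERE  tag_ids @> COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')), '{}'), '{0}')
ORDER  BY rating DESC
LIMIT  20;

ROLLBACK;  -- SET LOCAL ends with the transaction
```

This is a diagnostic only, not something to leave switched off in production sessions.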

Whatever else you do, make sure you run the most recent version of Postgres available to you, as your particular problem may well profit from a number of recent improvements around this.

Depending on undisclosed information, one of these may work:

1. Rewrite your query with LEFT JOIN LATERAL as demonstrated in the referenced answer.

2. Completely disable plan A by obfuscating the ORDER BY. Assuming it's a numerical column:

SELECT ...
ORDER  BY rating + 0 DESC
LIMIT  20;

Like Laurenz suggested:

Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N
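Applied to the simplified query from above, the obfuscated ORDER BY would look like the sketch below. For a numeric column, rating + 0 sorts the same way as rating, but the expression no longer matches the index on rating, so the planner has to sort explicitly:

```sql
SELECT *
FROM   media
WHERE  tag_ids @> COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')), '{}'), '{0}')
ORDER  BY rating + 0 DESC  -- expression does not match media_rating_index
LIMIT  20;
```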

3. Even more radical, delete the index media_rating_index - unless it's needed for other queries!
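Before dropping the index for good, the effect can be tried inside a transaction, since DROP INDEX can be rolled back in Postgres. A sketch, with the caveat that DROP INDEX holds an exclusive lock on media until the transaction ends, so this belongs on a test system or in a quiet moment:

```sql
BEGIN;

DROP INDEX media_rating_index;

EXPLAIN ANALYZE
SELECT *
FROM   media
WHERE  tag_ids @> COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag IN ('kittens', 'puppies')), '{}'), '{0}')
ORDER  BY rating DESC
LIMIT  20;

ROLLBACK;  -- the index is restored as if nothing happened
```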

4. Jeff suggested a manual setting for n_distinct in the related case, but I am not sure how to achieve the same for array elements ...

Postgres sometimes uses inferior index for WHERE a IN (...) ORDER BY b LIMIT N

5. Here is a related case from the pgsql-performance list. Your case is simpler; you may improve your situation by just raising the statistics target for tag_ids:

ALTER TABLE media ALTER COLUMN tag_ids SET STATISTICS 1000;

And then run ANALYZE media.
This may be useful in addition to other options, especially the following.
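After the ANALYZE, the per-element statistics the planner collected for the array column can be inspected in the pg_stats view. A sketch; whether a given rare tag ID shows up there depends on your data and on the statistics target:

```sql
ANALYZE media;

-- Most common array elements and their frequencies, as seen by the planner.
SELECT most_common_elems, most_common_elem_freqs
FROM   pg_stats
WHERE  tablename = 'media'
AND    attname   = 'tag_ids';
```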

6. In a related case I had success with a fake-IMMUTABLE function:

CREATE FUNCTION f_tag_id_array(VARIADIC text[])
  RETURNS int[] IMMUTABLE PARALLEL SAFE LANGUAGE plpgsql AS
$func$
BEGIN
   -- Look up the IDs for the given tags; fall back to {0} if none match
   RETURN COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag = ANY($1)), '{}'), '{0}');
END
$func$;

Then use it in your query:

SELECT *
FROM   media
WHERE  tag_ids @> f_tag_id_array('kittens', 'puppies')
ORDER  BY rating DESC
LIMIT  20;

Since Postgres now assumes that f_tag_id_array() is immutable it may choose to evaluate it early and continue with the returned constant - thereby achieving the different query plan for known rare IDs you were after.
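One way to check whether the call was indeed evaluated at planning time is to look at the plan without running the query. A sketch; if the call was folded, the condition on tag_ids should show a literal integer array rather than the function call, though the exact plan shape depends on your data and version:

```sql
EXPLAIN
SELECT *
FROM   media
WHERE  tag_ids @> f_tag_id_array('kittens', 'puppies')
ORDER  BY rating DESC
LIMIT  20;
```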

Note that we are lying about IMMUTABLE - it is not. So don't build an index on it. And it's not safe to use with concurrent write access on table tags.

And since we are lying there, the function cannot be inlined, so using a simpler SQL function won't buy anything in this case. Better to benefit from the saved query plans of PL/pgSQL.

The same worked with STABLE on a Postgres 11 instance, while a Postgres 10 instance required the IMMUTABLE hack ...
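For reference, the STABLE variant differs only in the volatility label, which is also the honest declaration for a function that reads from a table. A sketch; the name f_tag_id_array_stable is made up here to keep it apart from the function above, and whether it yields the better plan on your version is something to test:

```sql
CREATE FUNCTION f_tag_id_array_stable(VARIADIC text[])
  RETURNS int[] STABLE PARALLEL SAFE LANGUAGE plpgsql AS
$func$
BEGIN
   -- Same lookup as f_tag_id_array(), declared STABLE instead of IMMUTABLE
   RETURN COALESCE(NULLIF(ARRAY(SELECT id FROM tags WHERE tag = ANY($1)), '{}'), '{0}');
END
$func$;
```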

Thanks so much for your answer. It doesn't just solve my issue, but I feel like I've levelled up my Postgres knowledge in general. It's given me so much to think about and explore. — tombh, Nov 01 '19 at 06:40
