
I have a question about improving sorting performance on geospatial data in PostgreSQL. I want to sort restaurants by a combination of their distance and price, but it's noticeably slower than an equivalent test I ran on MongoDB.

My Query:

SELECT (distance + price) AS score, name 
FROM (
    SELECT geom <-> 'SRID=4326;POINT(0.0 0.0)' AS distance, price, name 
    FROM restaurants 
    WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(0.0, 0.0), 4326)::geography, 10000) 
) AS t 
ORDER BY score DESC NULLS LAST 
LIMIT 1000;

My Index:

CREATE INDEX geom_idx on restaurants using GIST(geom);
ANALYZE;

When I run EXPLAIN ANALYZE, I get this result:

                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=457318.37..457393.17 rows=204 width=35) (actual time=564.937..724.273 rows=1000 loops=1)
   ->  Gather Merge  (cost=457318.37..457393.17 rows=204 width=35) (actual time=564.935..724.241 rows=1000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Result  (cost=456318.35..456371.13 rows=102 width=35) (actual time=498.930..499.574 rows=614 loops=3)
               ->  Sort  (cost=456318.35..456318.60 rows=102 width=59) (actual time=498.927..498.988 rows=614 loops=3)
                     Sort Key: (((restaurants.geom <-> '01010101010101010101010101010101010101010101010101'::geography) + (restaurants.price)::double precision)) DESC NULLS LAST
                     Sort Method: top-N heapsort  Memory: 271kB
                     Worker 0:  Sort Method: top-N heapsort  Memory: 274kB
                     Worker 1:  Sort Method: top-N heapsort  Memory: 272kB
                     ->  Parallel Bitmap Heap Scan on restaurants  (cost=2774.30..456314.94 rows=102 width=59) (actual time=266.590..495.592 rows=9616 loops=3)
                           Filter: st_dwithin(geom, '01010101010101010101010101010101010101010101010101'::geography, '10000'::double precision, true)
                           Rows Removed by Filter: 9647
                           Heap Blocks: exact=18183
                           ->  Bitmap Index Scan on geom_idx  (cost=0.00..2774.24 rows=33293 width=0) (actual time=307.298..307.298 rows=57789 loops=1)
                                 Index Cond: (geom && _st_expand('01010101010101010101010101010101010101010101010101'::geography, '10000'::double precision))
 Planning Time: 0.218 ms
 Execution Time: 725.379 ms
(18 rows)

While it is quite fast, I'm surprised by how big the gap is between sorting by distance only and sorting by a calculated value. It's also slower than my test on MongoDB.

Is there any way to reduce latency when sorting the rows returned by an ST_DWithin query using calculated values, such as a combination of a restaurant's distance and price?

Vince
  • This seems slightly too slow for the given load - a quick test on 10M rows and about three times your bitmap heap size (so many more hits to filter through) resolves in less than 200ms on a small default Docker install; since the number of pages scanned (Heap Blocks) is rather large, consider CLUSTER restaurants USING <spatial_index_name>; and definitely run a VACUUM FULL ANALYZE! – geozelot Dec 02 '22 at 13:14
  • Hello @geozelot, I've read somewhere that CLUSTER is only available when no rows contain NULL, and not all of my crawled restaurant data has a geotag. This makes me wonder if I could split the table into restaurants_with_geotag and restaurants_without_geotag and use a UNION in other queries without a performance penalty. Thank you – Yukha Dharmeswara Dec 03 '22 at 10:26
  • NULL handling is primarily managed by the index implementation used to cluster. Pages with NULLs get appended on disk. – geozelot Dec 03 '22 at 13:35
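For reference, the physical reordering geozelot suggests could be sketched as follows, assuming the geom_idx index name from the question. Note that CLUSTER rewrites the whole table under an exclusive lock, so it is not something to run casually on a live system:

```sql
-- Rewrite the table so rows are stored in GiST index order; rows whose
-- geom is NULL (and therefore not in the index) are appended at the end.
CLUSTER restaurants USING geom_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE restaurants;
```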

1 Answer


A few things that I see:

  • As you can see in the bottom rows of the plan, the first filter is done on the index, which works with rectangles: the planner does a st_expand(your_point, 10000) to prefilter rows, then filters with st_dwithin to get the round shape, and that second step takes about 200ms. If you can accept a square zone, you can remove this step by replacing your st_dwithin with geom && st_expand(your_point, 10000); or, if you really want a circle, see whether filtering on the distance yourself afterwards (since you need to compute it anyway) improves things a little.
  • Your computation is done on 2 workers, and the results are merged afterwards (the merge takes about 200ms). Parallelization across 2 workers is not always worth it for small jobs, so I would look at 2 things:
    • 2 workers is not much; make sure your PostgreSQL parameters match your hardware. You can take a look at an old answer of mine for an idea of the parameters to start with when tuning for performance. Default PostgreSQL is very conservative, and having more workers and more memory to work with can make a big difference, especially with geometries, which are usually quite large in memory. If your indexes can be fully loaded into cache, for example, that can help response time a lot.
    • You can run SET max_parallel_workers_per_gather = 0; before your small query to see if it improves performance. I'm not sure it's always a good idea, but it can help you see whether parallelism is part of the problem.
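Putting both suggestions together, the square-zone variant might look like the sketch below. The 0.1 radius is a hypothetical placeholder: ST_Expand on a SRID 4326 geometry works in degrees, not metres, so you would need to convert your 10 km radius (or work in a projected SRID) for real use.

```sql
-- Optional: test the query without parallelism for this session.
SET max_parallel_workers_per_gather = 0;

-- Box-only prefilter: accept a rectangular search zone and skip the
-- exact ST_DWithin distance test entirely.
SELECT (distance + price) AS score, name
FROM (
    SELECT geom <-> 'SRID=4326;POINT(0.0 0.0)' AS distance, price, name
    FROM restaurants
    WHERE geom && ST_Expand('SRID=4326;POINT(0.0 0.0)'::geometry, 0.1)  -- 0.1° is a placeholder radius
) AS t
ORDER BY score DESC NULLS LAST
LIMIT 1000;
```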
robin loche
  • Hello @robin-loche, first of all, thank you for your answer. Your old answer really helped me, as I've only just started learning about PostgreSQL server configuration.

    I prefer a circular zone to a square one, as it better reflects the input distance. I've also tested querying with KNN search, geom <-> point, but sometimes it's slower than ST_DWithin (I'm not sure how; I can't replicate it). Compared with geom && st_expand(your_point, 10000), which one do you think is better? Thank you.

    – Yukha Dharmeswara Dec 03 '22 at 10:36
  • geom && st_expand(your_point, 10000) should be faster than geom <-> point (which is another way of saying 'distance'), because the first one only uses what is stored in the index, whereas the other one needs to read the point position to compute the distance. Also, KNN search is not just <->; it is when you have ORDER BY geom <-> point LIMIT N to get only the N closest, which does not seem to be what you want... – robin loche Dec 05 '22 at 09:30
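If the circular zone is kept, the "filter yourself" idea from the answer could be sketched like this: use && for the cheap index prefilter, compute the distance once in the subquery, and apply the circular cutoff on that same value. As above, the 0.1-degree ST_Expand radius is a hypothetical placeholder that would need to match the intended 10 km:

```sql
SELECT (distance + price) AS score, name
FROM (
    -- <-> on geography yields the distance in metres
    SELECT geom <-> 'SRID=4326;POINT(0.0 0.0)'::geography AS distance, price, name
    FROM restaurants
    WHERE geom && ST_Expand('SRID=4326;POINT(0.0 0.0)'::geometry, 0.1)  -- cheap box prefilter
) AS t
WHERE distance <= 10000  -- exact circular cutoff, reusing the already-computed distance
ORDER BY score DESC NULLS LAST
LIMIT 1000;
```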