
We have a projects UI page where users can search projects. I noticed that search queries are slow on the first run (up to 10 seconds) and faster on subsequent runs (around 2 seconds). Could this explain the 2.28% disk hit? This seems like a high percentage, right? If we want to ensure these queries are always fast, could we use pg_prewarm to keep them in memory? Would we need to run pg_prewarm periodically? This projects table is 15 GB and we have over 100 GB of server memory on Heroku.

           table name            | disk hits | % disk hits | % cache hits |  total hits
---------------------------------+-----------+-------------+--------------+--------------
 all                             | 730484143 |        0.32 |        99.68 | 225556991278
 projects                        |  34777074 |        2.28 |        97.72 |   1526319404
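
For context, numbers like these usually come from PostgreSQL's pg_statio_user_tables view. A rough sketch of an equivalent query (the view and its columns are standard; the exact formula the monitoring tool uses is an assumption):

    SELECT relname,
           heap_blks_read AS disk_hits,
           heap_blks_hit  AS cache_hits,
           round(100.0 * heap_blks_read
                 / nullif(heap_blks_read + heap_blks_hit, 0), 2) AS pct_disk_hits
    FROM   pg_statio_user_tables
    WHERE  relname = 'projects';

Note that "disk hits" here are reads that missed Postgres's shared buffers; they may still have been served from the OS page cache rather than from physical disk.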

I'm also trying to understand the pg_prewarm setting autoprewarm_interval. What is it used for? Does it mean the projects table will be unloaded from memory, for example?
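
For reference, a minimal sketch of how pg_prewarm is invoked manually (the index name is hypothetical, and whether the extension and the autoprewarm background worker are enabled on a given Heroku plan is an assumption):

    -- Load the projects table (and optionally an index) into shared buffers.
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;

    SELECT pg_prewarm('projects');                -- default mode is 'buffer'
    -- SELECT pg_prewarm('projects_search_idx');  -- hypothetical index name

    -- The autoprewarm worker's settings (pg_prewarm.autoprewarm,
    -- pg_prewarm.autoprewarm_interval) control how often the list of cached
    -- blocks is dumped to disk so it can be reloaded after a restart;
    -- they do not evict anything from memory.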

Here are the memory stats from the logs (https://devcenter.heroku.com/articles/heroku-postgres-metrics-logs):

6 GB - sample#memory-postgres: Approximate amount of memory used by your database’s Postgres processes in kB. This includes shared buffer cache as well as memory for each connection.

125 GB - sample#memory-total: Total amount of server memory available.

875 MB - sample#memory-free: Amount of free memory available in kB.

114 GB - sample#memory-cached: Amount of memory being used by the OS for page cache, in kB.

Todd
  • I'd say that the only way to avoid this is to have the whole database cached in RAM. But perhaps you can tune the query to be faster and hit fewer 8kB blocks. – Laurenz Albe Nov 18 '21 at 07:33
  • Thanks! It turns out users typically query for projects created within the last three or six months. Three months of projects is 1 GB. Could it make sense to query the most recent three or six months of projects every hour or so to push them into the cache? – Todd Nov 19 '21 at 05:48
  • That seems excessive. – Laurenz Albe Nov 19 '21 at 06:24
  • Okay, thank you! Sorry, just to clarify: is it the idea of running the query every hour to cache them that seems excessive? Otherwise, if we don't cache, we could occasionally run into cold searches of older data? – Todd Nov 19 '21 at 18:07
  • Maybe not excessive - it just seems complicated to do that, and it is uncertain - the tables could still drop from cache. More memory seems to be the safer way. Is there a chance to reduce the footprint of the query, so that it has to read less and is faster? – Laurenz Albe Nov 19 '21 at 20:41
  • We think the footprint is optimized, but we agree this isn't the best solution. We now think the issue is the search query itself. We have code that makes the search really broad. Are there best practices for address/name search? Performance improves when we simplify the search, removing the :* prefix from one- or two-letter terms or removing those terms altogether. The broad query currently looks like this:

    3204:* & 70th:* & St:* & W:* & Lehigh:* & Acres:* & FL:* & 33971:*

    – Todd Nov 19 '21 at 22:16
  • Searching for something like W:* can never be efficient. Usually, you are doing yourself and the user a favor if you force them to use at least one selective criterion that is well indexed. – Laurenz Albe Nov 20 '21 at 05:52
  • Thank you! We're going to investigate this! We're also wondering about websearch_to_tsquery and will see how it compares (see the sketch after these comments). – Todd Nov 21 '21 at 08:21
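
A minimal sketch of the two approaches mentioned in the comments above; the text search configuration ('simple') and the search_vector column are assumptions:

    -- Broad hand-built prefix query (every term gets a :* prefix match):
    SELECT to_tsquery('simple',
        '3204:* & 70th:* & St:* & W:* & Lehigh:* & Acres:* & FL:* & 33971:*');

    -- websearch_to_tsquery (PostgreSQL 11+) parses free-form user input itself,
    -- but does not add prefix matching:
    SELECT websearch_to_tsquery('simple', '3204 70th St W Lehigh Acres FL 33971');

    -- Hypothetical usage against a tsvector column:
    -- SELECT * FROM projects
    -- WHERE  search_vector @@ websearch_to_tsquery('simple', '3204 70th Lehigh Acres')
    -- LIMIT  20;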

1 Answer


Could this explain the 2.28% disk hit?

Certainly could. Disk hits are the reason for the slow first run each time.

could we use pg_prewarm to keep them in memory? Would we need to run pg_prewarm periodically?

Yes, pg_prewarm can populate the cache. But nothing keeps the OS (or Postgres; those are two different caches) from evicting pages from cache when the space is needed for other data. So how long pages remain in cache depends on competing activity.
If your queries don't have side effects (purely SELECT) and don't involve expensive computations, you might also just run the queries of interest from time to time. It may be even better to pull in only the relevant data pages from the table and its indexes. (Neither has to be cached as a whole.) That depends on what you know about the expected queries.
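
A minimal sketch of that idea, assuming users mostly query recent rows (the created_at column and the index name are assumptions):

    -- Touch only the slice users actually hit, e.g. the last 90 days:
    SELECT count(*)
    FROM   projects
    WHERE  created_at > now() - interval '90 days';

    -- Or warm a specific index as a whole with pg_prewarm:
    -- SELECT pg_prewarm('projects_created_at_idx');  -- hypothetical index name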


This projects table is 15 GB and we have over 100 GB of server memory on Heroku.

That's not the whole picture.

  • How much of the 100 GB RAM is actually available for caching? (Postgres buffer cache and OS page cache.) work_mem, maintenance_work_mem and other things compete for the same resource.
  • The table is 15 GB. What about its indexes? (See the size check after this list.)
  • Surely, this will not be the only table in use?
  • Heroku may be doing Heroku-things to cache memory that I am not aware of.
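
A quick way to check the first two points (sketch; the relation name is taken from the question):

    -- Table size alone vs. table plus indexes and TOAST:
    SELECT pg_size_pretty(pg_relation_size('projects'))       AS table_only,
           pg_size_pretty(pg_total_relation_size('projects')) AS with_indexes_and_toast;

    -- How much RAM Postgres itself reserves for its buffer cache:
    SHOW shared_buffers;
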
Erwin Brandstetter
  • Thank you! I'm an application developer trying to learn more about postgres.

    I'm thinking of periodically querying recent data, like the past 90 days. I'm going to investigate the date range of projects that users actually access.

    – Todd Nov 17 '21 at 23:00
  • I added the memory stats to the question body – Todd Nov 17 '21 at 23:02
  • The query is a SELECT that includes a GIN search query. We're seeing if we can add a date range to see if it can also use a date or customer_id index (see the sketch after these comments). It's paginated, so it only returns 20 results. – Todd Nov 17 '21 at 23:06
  • @Todd: The "2 seconds" for a warm cache don't exactly call for a speeding ticket, either. You might be able to pre-warm only the relevant parts of the table & index if you narrow down the query. Plus, optimizing the query (in combination with the corresponding index) might help in addition to the cache issue. Might be worth a separate question ... – Erwin Brandstetter Nov 17 '21 at 23:12
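
A minimal sketch of that combination; all column and index names here are assumptions:

    -- A date-range predicate lets the planner consider a btree index on
    -- created_at (or customer_id) in addition to the GIN index on search_vector.
    SELECT *
    FROM   projects
    WHERE  search_vector @@ to_tsquery('simple', 'lehigh & acres')
    AND    created_at > now() - interval '90 days'
    ORDER  BY created_at DESC
    LIMIT  20;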