Assume a database full of products. A product belongs to exactly one collection and is created by one user. Rough scale of the database:
- Products: 52,000,000
- Collections: 9,000,000
- Users: roughly 9,000,000 as well
I am trying to retrieve the number of products and collections each user has, and the number of products inside each collection (this information is regenerated every x days and indexed in Elasticsearch).
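For the per-collection counts, a plain aggregate over products should cover it; a minimal sketch, assuming products carries a collection_id foreign key (implied above but not shown):

SELECT
    collection_id,
    count(*) AS product_count
FROM products
GROUP BY collection_id;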
For the user query, I am currently doing something like this:
SELECT
    "user".*,
    (SELECT count(*)
     FROM products product
     WHERE product.user_id = "user".id
    ) AS product_count,
    (SELECT count(*)
     FROM collections collection
     WHERE collection.user_id = "user".id
    ) AS collection_count
FROM users "user"
LIMIT 100;
All *_id fields are indexed. Output of EXPLAIN (ANALYZE, VERBOSE) (sensitive information removed):
Limit  (cost=0.00..156500.97 rows=100 width=41) (actual time=0.064..28345.363 rows=100 loops=1)
  Output: (...), ((SubPlan 1)), ((SubPlan 2))
  ->  Seq Scan on public.users user  (cost=0.00..14549429167.11 rows=9296702 width=41) (actual time=0.064..28345.241 rows=100 loops=1)
        Output: (...), (SubPlan 1), (SubPlan 2)
        SubPlan 1
          ->  Aggregate  (cost=1415.84..1415.85 rows=1 width=0) (actual time=261.101..261.102 rows=1 loops=100)
                Output: count(*)
                ->  Bitmap Heap Scan on public.products product  (cost=7.32..1414.95 rows=355 width=0) (actual time=0.282..260.767 rows=382 loops=100)
                      Output: (...)
                      Recheck Cond: (product.user_id = user.id)
                      Heap Blocks: exact=32882
                      ->  Bitmap Index Scan on products_user_id_index  (cost=0.00..7.23 rows=355 width=0) (actual time=0.165..0.165 rows=382 loops=100)
                            Index Cond: (product.user_id = user.id)
        SubPlan 2
          ->  Aggregate  (cost=149.13..149.14 rows=1 width=0) (actual time=22.333..22.333 rows=1 loops=100)
                Output: count(*)
                ->  Index Only Scan using collections_user_id_index on public.collections collection  (cost=0.43..149.02 rows=44 width=0) (actual time=0.610..22.300 rows=28 loops=100)
                      Output: collection.user_id
                      Index Cond: (collection.user_id = user.id)
                      Heap Fetches: 2850
Planning time: 0.214 ms
Execution time: 28345.508 ms
When timing the same query with different LIMIT values:
- LIMIT 1: 0.695 ms
- LIMIT 10: 10434 ms
- LIMIT 100: 150471 ms
Both subplans run once per returned row (roughly 261 ms + 22 ms per user in the plan above, i.e. ~28 s for 100 rows), so the query becomes unusably slow when retrieving more than a couple of rows. I am wondering if it is possible to speed this up a bit.
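One direction that might help for the bulk-export case (a sketch, not tested against this schema): since the counts are regenerated for all users every x days anyway, aggregate each table once and join the results, instead of running two subplans per returned user row:

SELECT
    u.*,
    COALESCE(p.product_count, 0)    AS product_count,
    COALESCE(c.collection_count, 0) AS collection_count
FROM users u
LEFT JOIN (
    SELECT user_id, count(*) AS product_count
    FROM products
    GROUP BY user_id           -- one pass over products instead of one probe per user
) p ON p.user_id = u.id
LEFT JOIN (
    SELECT user_id, count(*) AS collection_count
    FROM collections
    GROUP BY user_id
) c ON c.user_id = u.id;

This trades per-row index lookups for two sequential aggregate passes, which tends to win when most of the 9,000,000 users are exported in one run, though not for a small LIMIT.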
If I were to beef up the DB machine, would adding more CPUs help? AFAIK Postgres doesn't execute a single query on multiple cores, so I am not sure how much that would help.
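(For context, and hedged since it depends on the server version: PostgreSQL 9.6 introduced parallel sequential scans and parallel aggregation, controlled by max_parallel_workers_per_gather, but correlated subplans like the two above do not run in parallel, so extra cores would mainly help concurrent queries here.)

-- Assuming PostgreSQL 9.6 or later; a per-session sketch, not a verified fix:
SET max_parallel_workers_per_gather = 4;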
(Also slightly related: how come the count() for collections uses an index-only scan, while products uses a bitmap heap scan instead?)
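(A hedged guess at that last point: the planner only picks an index-only scan when the visibility map marks most heap pages as all-visible, and even here Heap Fetches: 2850 shows collections is not fully visible. If products receives heavy writes, its visibility map is likely too stale for an index-only scan. A sketch of what might change that, not verified on this database:)

-- Updates the visibility map (and, with ANALYZE, planner statistics),
-- which may allow an index-only scan on products_user_id_index:
VACUUM ANALYZE products;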
LIMIT the query to a sane size. However, even with a small size like LIMIT 10 and an offset of 1,000,000, the query becomes unusably slow. Would you happen to have any advice for this particular case? – patchrail Nov 02 '16 at 00:09

OFFSET is typically almost as expensive as if you would increase LIMIT by the same amount. It depends on details of your query, tables and indexes. Related answers here or here. But this is hardly related to your original question any more. Please ask a *new question* if you still need it. – Erwin Brandstetter Nov 02 '16 at 00:22
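For the OFFSET problem raised in the comments, keyset pagination is the usual alternative; a sketch, where $last_seen_id is a hypothetical placeholder for the highest users.id returned by the previous batch:

SELECT u.*                   -- plus the two count columns as above
FROM users u
WHERE u.id > $last_seen_id   -- seek via the primary-key index instead of OFFSET
ORDER BY u.id
LIMIT 100;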