I've got a simple example books table with an integer id primary key ("books_pkey" PRIMARY KEY, btree (id)) and 100,000,000 random rows.
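For reproduction purposes, a table like this can be set up roughly as follows; the title column and its md5 payload are placeholders, since only the integer primary key and the row count matter here:

CREATE TABLE books (
    id    integer PRIMARY KEY,  -- backed by the "books_pkey" btree index on (id)
    title text                  -- placeholder for the table's other columns
);

-- Placeholder data: 100,000,000 rows with an arbitrary text payload.
INSERT INTO books (id, title)
SELECT g, md5(g::text)
FROM generate_series(1, 100000000) AS g;

If I run: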
EXPLAIN
SELECT *
FROM books
ORDER BY id
OFFSET 99999999
LIMIT 1;
I see a query plan like:
Limit  (cost=3137296.54..3137296.57 rows=1 width=14)
  ->  Index Scan using books_pkey on books  (cost=0.57..3137296.57 rows=100000000 width=14)
Do I understand correctly that PostgreSQL is loading all 100,000,000 rows into memory, only for the OFFSET to discard all but one? If so, why can't it do the "load and discard" step using the index and load only one row into memory?
I understand that the typical solution to this is keyset pagination, i.e. WHERE id > x. I'm just trying to understand why an index alone doesn't solve the problem. Adding another index that is explicitly sorted the same way as the query (CREATE INDEX books_id_ordered ON books (id ASC)) makes no difference to the EXPLAIN output.
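For completeness, the keyset-pagination form I have in mind looks like this; the literal value is just a placeholder for the last id seen on the previous page:

SELECT *
FROM books
WHERE id > 99999999  -- "x": the last id fetched on the previous page
ORDER BY id
LIMIT 1;

This lets the planner descend the btree directly to the first matching id instead of walking through and discarding all the preceding entries.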
Comments:
Use EXPLAIN (ANALYZE, BUFFERS) to see a clearer picture. – mustaccio Aug 13 '21 at 19:29
LIMIT/OFFSET typically does not work correctly at all under concurrent write load. Related: https://dba.stackexchange.com/a/205286/3684 – Erwin Brandstetter Aug 14 '21 at 16:43
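For reference, mustaccio's suggestion applied to the query in the question would be run like this (output omitted, since the actual timings and buffer counts depend on the machine and cache state):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM books
ORDER BY id
OFFSET 99999999
LIMIT 1;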