
I have a unique constraint on a character varying column that will mostly have lengths of approximately 600 but could get to 10 million.

I have read about B-trees, but I can't determine how much disk space the index will consume, or whether such large values will keep the index from performing well.

What is the impact on disk space and query time of a B-tree index on character varying columns with such potentially large values?

1 Answer


In Postgres, a unique constraint is implemented with a unique B-tree index. The manual:

Adding a unique constraint will automatically create a unique B-tree index on the column or group of columns used in the constraint.
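For example (hypothetical table, column, and constraint names), the explicit constraint and a plain unique index end up with essentially the same B-tree underneath:

```sql
CREATE TABLE doc (body varchar);

-- Adding the constraint implicitly creates a unique B-tree index:
ALTER TABLE doc ADD CONSTRAINT doc_body_uni UNIQUE (body);

-- Largely equivalent at the index level, just without the table constraint:
-- CREATE UNIQUE INDEX doc_body_idx ON doc (body);
```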

Indexes use the same basic storage mechanism as tables: arrays of data pages, with some minor additional features. The manual:

In a table, an item is a row; in an index, an item is an index entry.

Every table and index is stored as an array of pages of a fixed size (usually 8 kB, although a different page size can be selected when compiling the server).

The maximum size of an index entry is roughly a third of a data page. See:

The maximum size was reduced to 2704 bytes in Postgres 12, and is still the same in Postgres 15. Meaning your UNIQUE constraint is not possible for values anywhere near 10 million characters.
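A minimal sketch of where the limit bites, with a hypothetical table. Note that the constraint itself is created without complaint; the size check happens per value, at insert time, and index entries are compressed first, so a highly repetitive long value may still fit while an incompressible one of the same length does not:

```sql
CREATE TABLE doc (body varchar UNIQUE);

-- ~12800 characters of effectively incompressible text; a value like this
-- raises an error along the lines of:
--   ERROR: index row size ... exceeds btree version 4 maximum 2704
--          for index "doc_body_key"
INSERT INTO doc
SELECT string_agg(md5(random()::text), '')
FROM   generate_series(1, 400);
```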

Consider a unique index on the hash value instead, either after adding a redundant column holding the hash, or just an index on the hash expression. You could use the built-in function md5() or something more efficient for huge values. Related:
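A sketch of both variants, assuming the hypothetical table doc(body varchar) from above; index, constraint, and column names are made up. md5() returns 32 hex characters, which conveniently fit into a uuid (16 bytes):

```sql
-- Variant 1: unique index on the hash expression, no extra column:
CREATE UNIQUE INDEX doc_body_md5_idx ON doc (md5(body));

-- Variant 2: redundant hash column, stored as a compact 16-byte uuid
-- (generated columns require Postgres 12+):
ALTER TABLE doc ADD COLUMN body_md5 uuid
    GENERATED ALWAYS AS (md5(body)::uuid) STORED;
ALTER TABLE doc ADD CONSTRAINT doc_body_md5_uni UNIQUE (body_md5);
```

Either way, equality of the hash stands in for equality of the value; md5 collisions are astronomically unlikely, but not strictly impossible.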

Erwin Brandstetter
  • Thank you Erwin Brandstetter, incredible advice as usual! For performance, should I use bytea? If so, should I let Postgres do the conversions with md5 and decode or sha512() with this library http://www.zedwood.com/article/cpp-sha512-function? I'm fairly certain I'll never need to extract the hash, but I will need to compare with external data. Thank you so very much for sharing your great knowledge! –  Jun 27 '14 at 01:04
  • @Gracchus: I suggest you put that in new questions (one topic per piece). Comments are not the place. You can always link to this one for reference. – Erwin Brandstetter Jun 27 '14 at 01:16
  • Thank you Erwin Brandstetter! I have posted per your recommendation: http://dba.stackexchange.com/questions/69170/optimal-hash-technique-to-index-large-text –  Jun 27 '14 at 01:45
  • I wonder if something has changed in newer Postgres versions, as I can insert text values with 10000 (random) characters into a column using a unique index: http://rextester.com/ZDYS66763 –  Sep 05 '18 at 12:41
  • @a_horse_with_no_name: Interesting, indeed. Worthy of a new question: https://dba.stackexchange.com/q/217087/3684 (I think I found the answer.) – Erwin Brandstetter Sep 08 '18 at 16:22
  • What about using hash indexes (on Postgresql 14)? Is that a better solution than using b-trees to index the md5 of the column? – Gili Oct 06 '22 at 07:12