
I am developing an application in Ruby on Rails with a PostgreSQL (9.4) database. For my use case, columns in tables will be looked up very frequently, as the whole point of the application is searching for very specific attributes on a model.

I am currently deciding whether to use an integer type or a typical string type (e.g. character varying(255), which is the default in Rails) for these columns, as I'm not sure how the two compare in index performance.

These columns are enums: each has a fixed set of possible values. Most of the enums have no more than 5 possible values, so the index would stay more or less fixed over the lifetime of the application; thus, the integer and string indexes would be identical in their number of nodes.

However, the string being indexed could be around 20 characters long, which is roughly 5x the size of an integer (assuming a 4-byte integer and pure-ASCII strings at 1 byte per character). I don't know how database engines perform index lookups, but if the engine needs to "scan" the value byte by byte until it matches exactly, then in essence a string lookup would be 5x slower than an integer lookup: the "scan" until match would cover 20 bytes instead of 4. This is what I'm imagining:

The lookup value is (integer) 4:

scanning............................ FOUND | getting records... |BYTE_1|BYTE_2|BYTE_3|BYTE_4|BYTE_5|BYTE_6|BYTE_7|BYTE_8|...|

The lookup value is (string) "some_val" (8 bytes):

scanning..................................................................................... FOUND | getting records... |BYTE_1|BYTE_2|BYTE_3|BYTE_4|BYTE_5|BYTE_6|BYTE_7|BYTE_8|...|

I hope that makes sense. Basically, because the integer takes up less space, it can be "matched on" faster than its string counterpart. Perhaps this is a completely wrong guess, but I'm no expert, so that's why I'm asking you guys! I suppose that this answer I just found seems to support my hypothesis, but I want to be sure.
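The "scan until match" picture can be sketched as a toy byte-wise comparison in plain Ruby. This is purely illustrative (real B-tree lookups compare keys with C-level memcmp-style routines, and per-comparison cost is only one factor), but it shows the intuition that an exact match on a longer value touches more bytes:

```ruby
# Count how many byte comparisons an exact-match check performs.
# For equal values, every byte must be inspected.
def bytes_scanned(key, probe)
  count = 0
  key.bytes.zip(probe.bytes).each do |a, b|
    count += 1
    break if a != b   # stop early on the first mismatch
  end
  count
end

bytes_scanned([4].pack("l<"), [4].pack("l<"))                 # 4-byte integer key -> 4
bytes_scanned("some_enum_value_here", "some_enum_value_here") # 20 ASCII chars -> 20
```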

The number of possible values in the column wouldn't change with either type, so the index itself would not change (unless I added a new value to the enum). In that case, would there be a performance difference between integer and varchar(255), or does using an integer type simply make more sense?


The reason I am asking is that Rails' enum type maps integers to string keys, but enum attributes aren't meant to back user-facing columns. Essentially, you can't validate that the enum value is a valid one, because assigning an invalid value raises an ArgumentError before any validations can run. Using a string type would allow validations, but if there's a performance cost I'd rather just hack my way around the validation problem.
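A minimal stand-in for that behavior in plain Ruby (no Rails; the class and value names are made up for illustration): the integer-backed enum rejects an unknown value with an ArgumentError at assignment time, before any validation could run, while a string-backed column can defer the check to an inclusion-style validation:

```ruby
# Mimics Rails' integer-backed enum: assigning an unknown value raises
# immediately, so model validations never get a chance to run.
class IntegerEnumModel
  STATUSES = { "draft" => 0, "published" => 1 }.freeze

  def status=(value)
    raise ArgumentError, "'#{value}' is not a valid status" unless STATUSES.key?(value)
    @status = STATUSES.fetch(value)
  end
end

# String-backed column: the bad value is accepted at assignment, and an
# inclusion-style validation can report the problem gracefully afterwards.
class StringEnumModel
  VALID_STATUSES = %w[draft published].freeze
  attr_accessor :status

  def valid?
    VALID_STATUSES.include?(status)
  end
end
```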

Chris Cirefice
  • If you've only got 5 enum values, it might not be worth adding an index to that column. I'm no expert, but the cardinality is quite low. For example, it would be like having a phone book with index entries for everyone whose surname is "Singh", "Mary", or "Mohammed": you'd get millions upon millions of results, adding significant overhead for storing those index values without any corresponding benefit. I'll let a database expert comment if I've erred. – BenKoshy Oct 01 '20 at 02:21

1 Answer


Short answer: integer is faster than varchar or text in every aspect. It won't matter much for small tables and/or short keys, but the difference grows with the length of the keys and the number of rows.

string ... 20 characters long, which in memory is roughly 5x that of the integer (if an integer is 4 bytes, and the strings are pure ASCII at 1 byte per character, then this holds)

To be precise, text or varchar occupy exactly 21 bytes for 20 ASCII characters on disk and 24 bytes in RAM. Detailed assessment:
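The arithmetic behind those figures, assuming pure ASCII at 1 byte per character: Postgres stores short varlena values with a 1-byte length header on disk, but expands them to the regular 4-byte header in memory. A quick sketch:

```ruby
# Short varlena values (under 127 bytes) get a 1-byte length header on disk;
# in memory they are expanded to the regular 4-byte header.
DISK_HEADER = 1
RAM_HEADER  = 4
ASCII_CHARS = 20                          # 20 ASCII characters, 1 byte each

DISK_BYTES = DISK_HEADER + ASCII_CHARS    # 21 bytes on disk
RAM_BYTES  = RAM_HEADER + ASCII_CHARS     # 24 bytes in RAM
```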

Also important: COLLATION rules can make sorting character data more expensive, unlike numeric data types:

Index size is probably responsible for the lion's share of the performance difference in most cases. Consider the overhead per index tuple (basically the same as for a table): 4 bytes for the item identifier and 8 bytes for the index tuple header. So the index tuple for an integer would amount to 20 bytes (including 4 bytes of alignment padding), and for varchar(20) with 20 ASCII characters it would be 36 bytes (also incl. padding). Details:
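Those per-tuple figures can be reproduced with a back-of-the-envelope calculation, assuming the usual 8-byte MAXALIGN of 64-bit builds:

```ruby
MAXALIGN  = 8  # typical alignment quantum on 64-bit builds
ITEM_ID   = 4  # item identifier in the page's line-pointer array
TUPLE_HDR = 8  # index tuple header

# Round a byte count up to the next multiple of MAXALIGN.
def maxalign(bytes)
  ((bytes + MAXALIGN - 1) / MAXALIGN) * MAXALIGN
end

def index_tuple_size(data_bytes)
  ITEM_ID + maxalign(TUPLE_HDR + data_bytes)
end

index_tuple_size(4)   # integer: 4 + 16 = 20 bytes (incl. 4 bytes of padding)
index_tuple_size(21)  # 20 ASCII chars + 1-byte varlena header: 4 + 32 = 36 bytes
```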

All the theory aside: it's best to just test:
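Nothing replaces a benchmark against your real Postgres tables, but even a toy in-memory comparison gives a feel for the shape of such a test (Ruby Hash lookups with integer vs. roughly 20-character string keys; all names and values here are made up for illustration):

```ruby
require "benchmark"

n = 200_000

# Five "enum" entries, once keyed by integer, once by a ~20-char string.
int_index = (0...5).to_h { |i| [i, "row_#{i}"] }
str_index = (0...5).to_h { |i| ["enum_value_%02d_xxxxx" % i, "row_#{i}"] }

t_int = Benchmark.realtime { n.times { int_index[3] } }
t_str = Benchmark.realtime { n.times { str_index["enum_value_03_xxxxx"] } }
```

Interpreter overhead dominates at this scale, so treat it only as a template; the meaningful test is EXPLAIN ANALYZE against realistically sized tables in Postgres itself.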

Postgres 9.5 introduced an optimization for sorting long strings of character data (keyword: "abbreviated keys"). But a bug in some C library functions on Linux forced the project to disable the feature for non-C collations in Postgres 9.5.2. Details in the release notes.

However, if you actually use Postgres enum types, most of these considerations are irrelevant, since those are implemented with plain 4-byte values internally anyway. The manual:

An enum value occupies four bytes on disk.

Aside: the oddly specific length restriction of 255 characters in varchar(255) has no special impact on performance in Postgres at all. Its popularity typically stems from a misunderstanding, or from habits carried over from other database systems.

Erwin Brandstetter
    There is no hidden optimization in SQL Server for varchar(255) vs. e.g. varchar(260). There might have been such a thing with SQL Server 6.x, but it has not been true for a long time. –  May 09 '16 at 05:45
  • @a_horse_with_no_name: thanks, I clarified accordingly. – Erwin Brandstetter May 09 '16 at 06:06
  • Sorry for taking so long to accept this, I've been slow on the development of that project ;) – Chris Cirefice Aug 15 '16 at 16:02
  • Is this answer still valid for Postgres 10, please? – Matty May 19 '18 at 09:01
    @Matty: Still valid. And I don't see anything changing for pg 11 yet, either. – Erwin Brandstetter May 19 '18 at 13:34
  • @Erwin Brandstetter: thank you very much! – Matty May 19 '18 at 21:03
  • Your answer seems to thoroughly explain that strings (varchar or text) take up significantly more space than integers. I don't really see any explanation for why that translates to: "integer is faster than varchar or text in every aspect". Are you basically just implicitly saying that longer indices take more time to compare than shorter ones? – Akaisteph7 Aug 08 '23 at 15:46
    @Akaisteph7: Size, collation rules, sorting, encoding and decoding, programming, differences between Postgres versions and with other RDBMS, potential for bugs, misunderstandings, and points of failure (think of deceiving UTF lookalike characters). Everything is simpler and faster with plain integer, unless you actually need the varchar value. Size being the most prominent item. – Erwin Brandstetter Aug 08 '23 at 23:39
  • @ErwinBrandstetter I mean I get that theoretically, but that's not the same as an actual analysis of the time taken. If you only start seeing noticeable differences because of these points above 2 million rows, that's a very important point. I guess what I'm saying is that an actual timing analysis is missing, like the one you did for "Working of indexes in PostgreSQL". It would be good if a caveat like that was made explicit rather than not mentioned at all. – Akaisteph7 Aug 09 '23 at 15:11