Context
I'm designing a database (on PostgreSQL 9.6) which will store data from a distributed application. Due to the application's distributed nature, I cannot use auto-incrementing integers (SERIAL) as my primary keys because of potential race conditions.
The natural solution is to use a UUID, i.e. a globally unique identifier. Postgres comes with a built-in UUID type, which is a perfect fit.
The problem I have with UUIDs is related to debugging: they are not human-friendly. The identifier ff53e96d-5fd7-4450-bc99-111b91875ec5 tells me nothing, whereas ACC-f8kJd9xKCd, while not guaranteed to be unique, tells me I'm dealing with an ACC object.
From a programming perspective, it is common to debug application queries involving several different objects. Suppose a programmer mistakenly searches for an ACC (account) object in the ORD (order) table. With a human-readable identifier, the programmer spots the problem instantly, whereas with UUIDs they would spend some time figuring out what was wrong.
I do not need the "guaranteed" uniqueness of UUIDs; I do need some room for generating keys without conflicts, but UUID is overkill. Also, in the worst case, it wouldn't be the end of the world if a collision happened (the database rejects it and the application can recover). So, trade-offs considered, a smaller but human-friendly identifier would be the ideal solution for my use case.
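To make the recovery idea concrete, here is a minimal sketch (the objects table is hypothetical; on a duplicate key no row comes back, so the application generates a fresh key and retries):

```sql
-- Hypothetical table just for the example:
CREATE TABLE objects (id text PRIMARY KEY);

-- On a duplicate key the insert is skipped; RETURNING yields no row,
-- signalling the application to generate a fresh key and retry.
INSERT INTO objects (id) VALUES ('ACC-f8kJd9xKCd')
ON CONFLICT (id) DO NOTHING
RETURNING id;
```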
Identifying application objects
The identifier I came up with has the following format: {domain}-{string}, where {domain} is replaced with the object domain (account, order, product) and {string} is a randomly generated string. In some cases, it might even make sense to insert a {sub-domain} before the random string. Let's ignore the length of {domain} and {string} for the purpose of guaranteeing uniqueness.
The format can have a fixed size if it helps the indexing/querying performance.
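As a concrete sketch, such a key could even be generated in Postgres itself; the function name gen_key, the alphabet, and the default length below are arbitrary choices for illustration, not a fixed part of the design:

```sql
-- Hypothetical generator for {domain}-{string} keys; alphabet and length are arbitrary.
CREATE OR REPLACE FUNCTION gen_key(domain text, len int DEFAULT 10)
RETURNS text AS $$
SELECT domain || '-' || string_agg(
    substr('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
           (floor(random() * 62))::int + 1, 1), '')
FROM generate_series(1, len);
$$ LANGUAGE sql VOLATILE;

-- SELECT gen_key('ACC');  --> e.g. 'ACC-f8kJd9xKCd'
```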
The problem
Knowing that:
- I want to have primary keys with a format like ACC-f8kJd9xKCd.
- These primary keys will be used in several tables.
- All these keys will be used in several joins/relationships, in a 6NF database.
- Most tables will have a medium to large-ish size (averaging ~1M rows; largest ones with ~100M rows).
Regarding performance, what is the best way to store this key?
Below are four possible solutions, but since I have little experience with databases I'm unsure which (if any) is the best.
Considered solutions
1. Store as string (VARCHAR)
(Performance-wise, Postgres makes no distinction between CHAR(n) and VARCHAR(n), so I'm ignoring CHAR.)
After some research, I've found that string comparison with VARCHAR, especially in join operations, is slower than with INTEGER. This makes sense, but is it something I should worry about at this scale?
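One way to answer that would be to measure it directly. A rough benchmark sketch along these lines (the row count and key shape are just assumptions for illustration):

```sql
-- Two copies of the same data, keyed by int and by a text key of similar shape:
CREATE TABLE t_int  AS SELECT g AS id FROM generate_series(1, 1000000) g;
CREATE TABLE t_text AS SELECT 'ACC-' || md5(g::text) AS id FROM generate_series(1, 1000000) g;
CREATE UNIQUE INDEX ON t_int (id);
CREATE UNIQUE INDEX ON t_text (id);
ANALYZE t_int; ANALYZE t_text;

-- Compare the join timings:
EXPLAIN ANALYZE SELECT count(*) FROM t_int  a JOIN t_int  b USING (id);
EXPLAIN ANALYZE SELECT count(*) FROM t_text a JOIN t_text b USING (id);
```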
2. Store as binary (bytea)
Unlike Postgres, MySQL does not have a native UUID type. There are several posts explaining how to store a UUID in a 16-byte BINARY field instead of a 36-byte VARCHAR one. These posts gave me the idea of storing the key as binary (bytea in Postgres).
This saves space, but I'm more concerned with performance. I had little luck finding an explanation of which comparison is faster: binary or string. I believe binary comparisons are faster. If they are, then bytea is probably better than VARCHAR, even though the programmer now has to encode/decode the data every time.
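A minimal sketch of the bytea approach; here the text↔bytea conversion is done in SQL for readability, though in practice the application would do the encoding itself:

```sql
CREATE TABLE orders (id bytea PRIMARY KEY);

-- Encoding on the way in...
INSERT INTO orders (id) VALUES (convert_to('ORD-f8kJd9xKCd', 'UTF8'));

-- ...and decoding on the way out:
SELECT convert_from(id, 'UTF8') AS key FROM orders;
```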
I might be wrong, but I think both bytea and VARCHAR compare (for equality) byte by byte (or character by character). Is there a way to "skip" this step-by-step comparison and simply compare "the whole thing"? (I don't think so, but it doesn't hurt to check.)
I think storing as bytea is the best solution, but I wonder if there are other alternatives I'm overlooking. Also, the same concern I raised for solution 1 holds here: is the overhead of these comparisons enough that I should worry?
"Creative" solutions
I came up with two very "creative" solutions that could work; I'm just unsure to what extent (i.e., whether I would have trouble scaling them beyond a couple of thousand rows per table).
3. Store as UUID but with a "label" attached to it
The main reason not to use UUIDs is so that programmers can better debug the application. But what if we could have both: the database stores all keys purely as UUIDs, but wraps them before/after queries are made.
For example, the programmer asks for ACC-{UUID}, the database ignores the ACC- part, fetches the results, and returns all of them as {domain}-{UUID}.
Maybe this would be possible with some hackery with stored procedures or functions, but some questions come to mind:
- Is this (removing/adding the domain at each query) a substantial overhead?
- Is this even possible?
I've never used stored procedures or functions before, so I'm not sure whether this is even possible. Can someone shed some light? If I can add a transparent layer between the programmer and the stored data, it seems like the perfect solution.
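To make the idea concrete, here is a rough sketch of what I imagine such a layer could look like, using a plain SQL function and a view (the names acc_key_to_uuid and account_labeled are made up):

```sql
CREATE TABLE account (id uuid PRIMARY KEY, name text);

-- Strip the 4-character 'ACC-' label and cast the remainder to uuid on the way in:
CREATE FUNCTION acc_key_to_uuid(key text) RETURNS uuid AS $$
SELECT substring(key from 5)::uuid;
$$ LANGUAGE sql IMMUTABLE;

-- Re-attach the label on the way out:
CREATE VIEW account_labeled AS
SELECT 'ACC-' || id::text AS key, name
FROM account;

-- Lookups should go through the function so the uuid index is still used:
-- SELECT * FROM account WHERE id = acc_key_to_uuid('ACC-ff53e96d-5fd7-4450-bc99-111b91875ec5');
```

The prefix manipulation itself is cheap string work, so I'd expect the overhead to be negligible next to the query; the bigger risk seems to be that filtering on the view's computed key column would bypass the uuid index, hence the function for lookups.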
4. (My favorite) Store as IPv6 cidr
Yes, you read that right. It turns out that the IPv6 address format solves my problem perfectly.
- I can add domains and sub-domains at the first few octets, and use the remaining ones as the random string.
- The collision odds are OK. (I wouldn't be using 2^128 though, but it is still OK.)
- Equality comparisons are (hopefully) optimized, so I might get better performance than simply using bytea.
- I can actually perform some interesting comparisons, like contains, depending on how the domains and their hierarchy are represented.
For example, suppose I use code 0000 to represent the domain "products". Key 0000:0db8:85a3:0000:0000:8a2e:0370:7334 would represent the product 0db8:85a3:0000:0000:8a2e:0370:7334.
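To make that concrete, a small sketch using the containment operator that the inet/cidr types already provide (the 0000 prefix for "products" is the one from the example above):

```sql
CREATE TABLE product (id cidr PRIMARY KEY);

INSERT INTO product
VALUES ('0000:0db8:85a3:0000:0000:8a2e:0370:7334'::cidr);  -- domain 0000 = "products"

-- "Contains"-style query: every key in the "products" domain:
SELECT * FROM product WHERE id <<= '0000::/16'::cidr;
```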
The main question here is: compared to bytea, are there any major advantages or disadvantages to using the cidr data type?
Comments
- …varchar among many other problems. I did not know about pg's domains, which is great to learn about. I see domains being used to validate whether a given query is using the correct object, but it would still rely on having a non-integer index. Not sure if there is a "secure" way of using serial here (without one lock step). – Renato Massaro Oct 02 '16 at 13:06
- …varchar. Consider making it an FK integer type and adding a lookup table for it. That way you can have both human readability and you'll protect your composite PK from insert/update anomalies (putting in a non-existent domain). – yemet Oct 02 '16 at 13:37
- …SELECT & WHERE clauses, but it may be good enough for my use case. I'll keep the question open as I'm still wondering what the best option is if I absolutely had to store non-integer keys. I'll also run some benchmarks and post them here (I'm especially curious about solution 4 :). – Renato Massaro Oct 02 '16 at 14:24
- text is preferable over varchar. Look at https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/ and https://www.postgresql.org/docs/current/static/datatype-character.html – pietrop Oct 02 '16 at 22:52
- …bytea regarding the operands involved. I like jsonb even more, as it will also allow you to efficiently index subkeys ;). It somehow reminds me of the Group Replication basics for MySQL. Still, I'm curious how you expect to find the key across the servers (I expect a global hash table?). Also, if you want a distributed solution, I think you should stick to stable solutions like Citus Data, Greenplum, PostgreSQL XL, etc. – 3manuek Oct 03 '16 at 02:42
- "…ACC-f8kJd9xKCd." ← That appears to be a job for the good old composite PRIMARY KEY. – MDCCL Oct 04 '16 at 15:19