Context
I'm designing a database (on PostgreSQL 9.6) which will store data from a distributed application. Due to the application's distributed nature, I cannot use auto-incrementing integers (SERIAL) as my primary keys because of potential race conditions.
The natural solution is to use a UUID, i.e. a globally unique identifier. Postgres comes with a built-in UUID type, which is a perfect fit.
The problem I have with UUIDs is related to debugging: they are not human-friendly. The identifier ff53e96d-5fd7-4450-bc99-111b91875ec5 tells me nothing, whereas ACC-f8kJd9xKCd, while not guaranteed to be unique, tells me I'm dealing with an ACC object.
From a programming perspective, it is common to debug application queries involving several different objects. Suppose a programmer mistakenly searches for an ACC (account) object in the ORD (order) table. With a human-readable identifier, the programmer spots the problem instantly, whereas with UUIDs they would spend some time figuring out what was wrong.
I do not need the "guaranteed" uniqueness of UUIDs; I do need some room for generating keys without conflicts, but UUID is overkill. Also, in the worst case, it wouldn't be the end of the world if a collision happened (the database rejects it and the application can recover). So, trade-offs considered, a smaller but human-friendly identifier would be the ideal solution for my use case.
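To make the recovery idea concrete, here is a minimal sketch (the objects table is hypothetical; on a duplicate key no row comes back, so the application generates a fresh key and retries):

```sql
-- Hypothetical table just for the example:
CREATE TABLE objects (id text PRIMARY KEY);

-- On a duplicate key the insert is skipped; RETURNING yields no row,
-- signalling the application to generate a fresh key and retry.
INSERT INTO objects (id) VALUES ('ACC-f8kJd9xKCd')
ON CONFLICT (id) DO NOTHING
RETURNING id;
```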
Identifying application objects
The identifier I came up with has the following format: {domain}-{string}, where {domain} is replaced with the object domain (account, order, product) and {string} is a randomly generated string. In some cases, it might even make sense to insert a {sub-domain} before the random string. Let's ignore the length of {domain} and {string} for the purpose of guaranteeing uniqueness.
The format can have a fixed size if it helps the indexing/querying performance.
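As a concrete sketch, such a key could even be generated in Postgres itself; the function name gen_key, the alphabet, and the default length below are arbitrary choices for illustration, not a fixed part of the design:

```sql
-- Hypothetical generator for {domain}-{string} keys; alphabet and length are arbitrary.
CREATE OR REPLACE FUNCTION gen_key(domain text, len int DEFAULT 10)
RETURNS text AS $$
SELECT domain || '-' || string_agg(
    substr('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
           (floor(random() * 62))::int + 1, 1), '')
FROM generate_series(1, len);
$$ LANGUAGE sql VOLATILE;

-- SELECT gen_key('ACC');  --> e.g. 'ACC-f8kJd9xKCd'
```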
The problem
Knowing that:
- I want to have primary keys with a format like ACC-f8kJd9xKCd.
- These primary keys will be used in several tables.
- All these keys will be used in several joins/relationships, in a 6NF database.
- Most tables will have a medium to large-ish size (averaging ~1M rows; largest ones with ~100M rows).
Regarding performance, what is the best way to store this key?
Below are four possible solutions, but since I have little experience with databases I'm unsure which (if any) is the best.
Considered solutions
1. Store as string (VARCHAR)
(Performance-wise, Postgres makes no distinction between CHAR(n) and VARCHAR(n), so I'm ignoring CHAR.)
After some research, I've found that string comparison with VARCHAR, especially in join operations, is slower than with INTEGER. This makes sense, but is it something I should worry about at this scale?
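One way to answer that would be to measure it directly. A rough benchmark sketch along these lines (the row count and key shape are just assumptions for illustration):

```sql
-- Two copies of the same data, keyed by int and by a text key of similar shape:
CREATE TABLE t_int  AS SELECT g AS id FROM generate_series(1, 1000000) g;
CREATE TABLE t_text AS SELECT 'ACC-' || md5(g::text) AS id FROM generate_series(1, 1000000) g;
CREATE UNIQUE INDEX ON t_int (id);
CREATE UNIQUE INDEX ON t_text (id);
ANALYZE t_int; ANALYZE t_text;

-- Compare the join timings:
EXPLAIN ANALYZE SELECT count(*) FROM t_int  a JOIN t_int  b USING (id);
EXPLAIN ANALYZE SELECT count(*) FROM t_text a JOIN t_text b USING (id);
```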
2. Store as binary (bytea)
Unlike Postgres, MySQL does not have a native UUID type. There are several posts explaining how to store a UUID in a 16-byte BINARY field instead of a 36-byte VARCHAR one. These posts gave me the idea of storing the key as binary (bytea in Postgres).
This saves space, but I'm more concerned with performance. I had little luck finding an explanation of which comparison is faster: binary or string. I believe binary comparisons are faster. If they are, then bytea is probably better than VARCHAR, even though the programmer now has to encode/decode the data every time.
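A minimal sketch of the bytea approach; here the text↔bytea conversion is done in SQL for readability, though in practice the application would do the encoding itself:

```sql
CREATE TABLE orders (id bytea PRIMARY KEY);

-- Encoding on the way in...
INSERT INTO orders (id) VALUES (convert_to('ORD-f8kJd9xKCd', 'UTF8'));

-- ...and decoding on the way out:
SELECT convert_from(id, 'UTF8') AS key FROM orders;
```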
I might be wrong, but I think both bytea and VARCHAR compare (for equality) byte by byte (or character by character). Is there a way to "skip" this step-by-step comparison and simply compare "the whole thing"? (I don't think so, but it doesn't hurt to check.)
I think storing as bytea is the best solution, but I wonder if there are other alternatives I'm overlooking. Also, the same concern I raised for solution 1 holds here: is the overhead of these comparisons enough that I should worry?
"Creative" solutions
I came up with two very "creative" solutions that could work; I'm just unsure to what extent (i.e., whether I would have trouble scaling them beyond a couple of thousand rows per table).
3. Store as UUID but with a "label" attached to it
The main reason not to use UUIDs is so that programmers can better debug the application. But what if we could have both: the database stores all keys purely as UUIDs, but wraps them before/after queries are made.
For example, the programmer asks for ACC-{UUID}, the database ignores the ACC- part, fetches the results, and returns all of them as {domain}-{UUID}.
Maybe this would be possible with some hackery with stored procedures or functions, but some questions come to mind:
- Is this (removing/adding the domain at each query) a substantial overhead?
- Is this even possible?
I've never used stored procedures or functions before, so I'm not sure whether this is even possible. Can someone shed some light? If I can add a transparent layer between the programmer and the stored data, it seems like the perfect solution.
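To make the idea concrete, here is a rough sketch of what I imagine such a layer could look like, using a plain SQL function and a view (the names acc_key_to_uuid and account_labeled are made up):

```sql
CREATE TABLE account (id uuid PRIMARY KEY, name text);

-- Strip the 4-character 'ACC-' label and cast the remainder to uuid on the way in:
CREATE FUNCTION acc_key_to_uuid(key text) RETURNS uuid AS $$
SELECT substring(key from 5)::uuid;
$$ LANGUAGE sql IMMUTABLE;

-- Re-attach the label on the way out:
CREATE VIEW account_labeled AS
SELECT 'ACC-' || id::text AS key, name
FROM account;

-- Lookups should go through the function so the uuid index is still used:
-- SELECT * FROM account WHERE id = acc_key_to_uuid('ACC-ff53e96d-5fd7-4450-bc99-111b91875ec5');
```

The prefix manipulation itself is cheap string work, so I'd expect the overhead to be negligible next to the query; the bigger risk seems to be that filtering on the view's computed key column would bypass the uuid index, hence the function for lookups.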
4. (My favorite) Store as IPv6 cidr
Yes, you read that right. It turns out that the IPv6 address format solves my problem perfectly.
- I can add domains and sub-domains at the first few octets, and use the remaining ones as the random string.
- The collision odds are OK. (I wouldn't be using 2^128 though, but it is still OK.)
- Equality comparisons are (hopefully) optimized, so I might get better performance than simply using bytea.
- I can actually perform some interesting comparisons, like contains, depending on how the domains and their hierarchy are represented.
For example, suppose I use code 0000 to represent the domain "products". Key 0000:0db8:85a3:0000:0000:8a2e:0370:7334 would represent the product 0db8:85a3:0000:0000:8a2e:0370:7334.
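To make that concrete, a small sketch using the containment operator that the inet/cidr types already provide (the 0000 prefix for "products" is the one from the example above):

```sql
CREATE TABLE product (id cidr PRIMARY KEY);

INSERT INTO product
VALUES ('0000:0db8:85a3:0000:0000:8a2e:0370:7334'::cidr);  -- domain 0000 = "products"

-- "Contains"-style query: every key in the "products" domain:
SELECT * FROM product WHERE id <<= '0000::/16'::cidr;
```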
The main question here is: compared to bytea, are there any major advantages or disadvantages to using the cidr data type?
Comments
- …varchar among many other problems. I did not know about pg's domains, which is great to learn about. I see domains being used to validate whether a given query is using the correct object, but it would still rely on having a non-integer index. Not sure if there is a "secure" way of using serial here (without one lock step). – Renato Massaro Oct 02 '16 at 13:06
- …varchar. Consider making it an FK integer type and adding a lookup table for it. That way you can have both human readability and you'll protect your composite PK from insert/update anomalies (putting in a non-existent domain). – yemet Oct 02 '16 at 13:37
- …SELECT & WHERE clauses, but it may be good enough for my use case. I'll keep the question open as I'm still wondering what the best option is if I absolutely had to store non-integer keys. I'll also run some benchmarks and post them here (I'm especially curious about solution 4 :). – Renato Massaro Oct 02 '16 at 14:24
- text is preferable over varchar. Look at https://www.depesz.com/2010/03/02/charx-vs-varcharx-vs-varchar-vs-text/ and https://www.postgresql.org/docs/current/static/datatype-character.html – pietrop Oct 02 '16 at 22:52
- …bytea regarding the operands involved. I like jsonb even more, as it will also allow you to efficiently index subkeys ;). It somehow reminds me of the Group Replication basics for MySQL. Still, I'm curious how you expect to find the key across the servers (I expect a global hash table?). Also, if you want a distributed solution, I think you should stick to stable solutions like Citus Data, Greenplum, PostgreSQL XL, etc. – 3manuek Oct 03 '16 at 02:42
- "…ACC-f8kJd9xKCd." ← That appears to be a job for the good old composite PRIMARY KEY. – MDCCL Oct 04 '16 at 15:19