Suppose I have a table such as this, for logging all web requests made to my website:
CREATE TABLE weblog
(
    id bigserial,
    moment timestamp with time zone NOT NULL DEFAULT now(),
    ua text, -- user agent
    ref text, -- referer
    "IP-addy" inet NOT NULL,
    PRIMARY KEY (id)
);
Most visitors will send the same user-agent, such as "MOSAIC 1.0", and many will send the same referer as well. This means that, although the id, moment, and "IP-addy" columns will differ, a huge number of rows in the table will end up having identical values in "ua" and "ref".
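Just to quantify that, the amount of duplication is easy to check with something like:

SELECT count(*) AS total_rows,
       count(DISTINCT ua) AS distinct_uas,
       count(DISTINCT ref) AS distinct_refs
FROM weblog;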
This really got me thinking about storage space and efficiency, but after experimenting with it for a few weeks now, I've basically come "full circle" and am back to thinking that the original, duplicated layout makes sense after all. Allow me to explain.
Although it seems like a gigantic waste, storing these repeated values on disk may well be necessary for performance. I'm not sure about that, but it seems likely.
My biggest worry was never the disk space taken up while the database is "live", but rather the size of the backups. When I back up the database, I run pg_dump with maximum compression (level 9), which I assume means that each of those duplicated values gets stored as a single string in the archive and is then just referenced with a few bits by every record containing the same cell value. That seems to be the case, because the difference in the size of the resulting archive between level 0 (no compression) and level 9 is pretty huge (using the oddly named "custom" (binary) format).
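Concretely, the comparison I mean is between dumps made roughly like this (the database name and output file names are just placeholders):

# no compression vs. maximum compression, both in the "custom" (binary) format
pg_dump --format=custom --compress=0 --file=weblog_z0.dump mydb
pg_dump --format=custom --compress=9 --file=weblog_z9.dump mydb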
I didn't originally think about this, but since pg_dump applies compression anyway, the backups are probably already about as small as they can get, regardless of my table structure.
Currently, I have the table structured like this:
CREATE TABLE weblog
(
    id bigserial,
    moment timestamp with time zone NOT NULL DEFAULT now(),
    ua bigint, -- user agent
    ref bigint, -- referer
    "IP-addy" inet NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY blabla,
    FOREIGN KEY blabla
);
And then I have one lookup table for each of "ua" and "ref", where each unique string is stored once with a bigserial id, and the weblog columns ua/ref (now bigint instead of text) refer to those ids. This makes both SELECT and INSERT queries cumbersome: every SELECT needs a join, and every INSERT needs an extra query first to put in the new string (if it doesn't already exist in the lookup table).
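Roughly, it looks like this (names simplified here, and the ON CONFLICT trick needs PostgreSQL 9.5 or newer; before that it takes a SELECT plus a conditional INSERT):

-- one of the two lookup tables (the "ref" one looks the same)
CREATE TABLE useragent
(
    id bigserial,
    val text NOT NULL UNIQUE,
    PRIMARY KEY (id)
);

-- every INSERT into weblog now needs an extra step:
-- make sure the string exists, then look up its id
INSERT INTO useragent (val) VALUES ('MOSAIC 1.0')
ON CONFLICT (val) DO NOTHING;

SELECT id FROM useragent WHERE val = 'MOSAIC 1.0';

-- then insert the weblog row using that id (say it came back as 1)
INSERT INTO weblog (ua, ref, "IP-addy") VALUES (1, NULL, '10.0.0.1');

-- and every SELECT that wants the actual strings needs a join
SELECT w.moment, u.val AS ua, w."IP-addy"
FROM weblog w
LEFT JOIN useragent u ON u.id = w.ua;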
Basically, it's an "ugly" (not very elegant) solution from my point of view as the user.
And it makes me wonder whether this method even saves that much disk space. I have only done this for my biggest tables -- not for ones that I know will stay a "reasonable" size. It wouldn't be worth doing for every single text column, and it would feel like I'm really fighting PostgreSQL and probably doing something wrong.
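For what it's worth, the actual on-disk sizes of the two layouts can at least be compared directly, e.g.:

-- total on-disk size of a table, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('weblog'));
-- (and the same for each lookup table, to see what the normalized layout really costs)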
What do you all say about this? Have I fundamentally misunderstood something? Does PG just "expand" my table structure back into the original structure internally, in order to boost performance at the expense of "live" disk space taken up by my database?
I feel as if I have a poor understanding of just how much (or how little) PG actually "abstracts away", and of how "clever" I have to be with my table structure these days. I may be mentally stuck in the 1970s or something by thinking this way. Maybe I've gained nothing from all this work with extra tables, FKs, and extra/complicated queries?