Suppose I have a table such as this, for logging all web requests made to my website:
CREATE TABLE weblog
(
    id bigserial,
    moment timestamp with time zone NOT NULL DEFAULT now(),
    ua text, -- user agent
    ref text, -- referer
    "IP-addy" inet NOT NULL,
    PRIMARY KEY (id)
);
Most visitors will send the same user-agent, such as "MOSAIC 1.0", and many will send the same referer as well. This means that, although the id, moment, and "IP-addy" columns will differ, a huge number of rows in the table will end up having identical values in "ua" and "ref".
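Just to quantify that, the amount of duplication is easy to check with something like:

SELECT count(*) AS total_rows,
       count(DISTINCT ua) AS distinct_uas,
       count(DISTINCT ref) AS distinct_refs
FROM weblog;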
This really got me thinking about storage space and efficiency, but after experimenting with it for a few weeks now, I've basically come "full circle" and am back to thinking that the original, duplicated layout makes sense after all. Allow me to explain.
Although it seems like a gigantic waste, storing these repeated values on disk may well be necessary for performance. I'm not sure about that, but it seems likely.
My biggest worry was never the disk space taken up while the database is "live", but rather the size of the backups. When I back up the database, I run pg_dump with maximum compression (level 9), which I assume means that each of those duplicated values gets stored as a single string in the archive and is then just referenced with a few bits by every record containing the same cell value. That seems to be the case, because the difference in the size of the resulting archive between level 0 (no compression) and level 9 is pretty huge (using the oddly named "custom" (binary) format).
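Concretely, the comparison I mean is between dumps made roughly like this (the database name and output file names are just placeholders):

# no compression vs. maximum compression, both in the "custom" (binary) format
pg_dump --format=custom --compress=0 --file=weblog_z0.dump mydb
pg_dump --format=custom --compress=9 --file=weblog_z9.dump mydb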
I didn't originally think about this, but since pg_dump applies compression anyway, the backups are probably already about as small as they can get, regardless of my table structure.
Currently, I have the table structured like this:
CREATE TABLE weblog
(
    id bigserial,
    moment timestamp with time zone NOT NULL DEFAULT now(),
    ua bigint, -- user agent
    ref bigint, -- referer
    "IP-addy" inet NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY blabla,
    FOREIGN KEY blabla
);
And then I have one lookup table for each of "ua" and "ref", where each unique string is stored once with a bigserial id, and the weblog columns ua/ref (now bigint instead of text) refer to those ids. This makes both SELECT and INSERT queries cumbersome: every SELECT needs a join, and every INSERT needs an extra query first to put in the new string (if it doesn't already exist in the lookup table).
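Roughly, it looks like this (names simplified here, and the ON CONFLICT trick needs PostgreSQL 9.5 or newer; before that it takes a SELECT plus a conditional INSERT):

-- one of the two lookup tables (the "ref" one looks the same)
CREATE TABLE useragent
(
    id bigserial,
    val text NOT NULL UNIQUE,
    PRIMARY KEY (id)
);

-- every INSERT into weblog now needs an extra step:
-- make sure the string exists, then look up its id
INSERT INTO useragent (val) VALUES ('MOSAIC 1.0')
ON CONFLICT (val) DO NOTHING;

SELECT id FROM useragent WHERE val = 'MOSAIC 1.0';

-- then insert the weblog row using that id (say it came back as 1)
INSERT INTO weblog (ua, ref, "IP-addy") VALUES (1, NULL, '10.0.0.1');

-- and every SELECT that wants the actual strings needs a join
SELECT w.moment, u.val AS ua, w."IP-addy"
FROM weblog w
LEFT JOIN useragent u ON u.id = w.ua;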
Basically, it's an "ugly" (not very elegant) solution from my point of view as the user.
And it makes me wonder whether this method even saves that much disk space. I have only done this for my biggest tables -- not for ones that I know will stay a "reasonable" size. It wouldn't be worth doing for every single text column, and it would feel like I'm really fighting PostgreSQL and probably doing something wrong.
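For what it's worth, the actual on-disk sizes of the two layouts can at least be compared directly, e.g.:

-- total on-disk size of a table, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('weblog'));
-- (and the same for each lookup table, to see what the normalized layout really costs)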
What do you all say about this? Have I fundamentally misunderstood something? Does PG just "expand" my table structure back into the original structure internally, in order to boost performance at the expense of "live" disk space taken up by my database?
I feel as if I have a poor understanding of just how much (or how little) PG actually "abstracts away", and of how "clever" I have to be with my table structure these days. I may be mentally stuck in the 1970s or something by thinking this way. Maybe I've gained nothing from all this work with extra tables, FKs, and extra/complicated queries?