Using PostgreSQL 9.3 on Debian 7.
I have a system that searches for text tokens inside PDF files and store where it was found for indexation.
Basically the table structure is:
Table "capture.process_publication"
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| Column | Type | Modifiers |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
| id_process_publication | bigint | not null default nextval('process_publication_id_process_publication_seq'::regclass) |
| search_token | character varying(25) | not null |
| publication_date | date | not null |
| id_publication_site | smallint | not null |
| id_publication_book | smallint | not null |
| search_token_page | integer | not null |
| search_token_page_position | integer | not null |
| publication_snippet | text | not null |
| insert_time | timestamp without time zone | not null default now() |
| integration_consume_time | timestamp without time zone | |
| elastic_consume_time | timestamp without time zone | |
+----------------------------+-----------------------------+--------------------------------------------------------------------------------------+
Indexes:
"process_publication_pkey" PRIMARY KEY, btree (id_process_publication)
"idx_pp_publication_date_desc_for_elastic_consume" btree (publication_date DESC) WHERE integration_consume_time IS NOT NULL AND elastic_consume_time IS NULL
"idx_pp_publication_date_desc_integration_consume_time_null" btree (publication_date DESC) WHERE integration_consume_time IS NULL
"idx_uniq_pp" btree (publication_date, id_publication_book, search_token_page, search_token_page_position, search_token)
Foreign-key constraints:
"fk_pp_pubb_id_publication_book" FOREIGN KEY (id_publication_book) REFERENCES publication_book(id_publication_book)
"fk_pp_pubs_id_publication_site" FOREIGN KEY (id_publication_site) REFERENCES publication_site(id_publication_site)
Has OIDs: no
The problem was that certain tokens used to appear multiple times on the same page and it was generating a lot of redundant information, so I've changed the table structure to store only one record per token per page storing the positions and snippets as arrays as follows:
Table "capture.process_publication_cp"
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| Column | Type | Modifiers |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
| id_process_publication_cp | bigint | not null default nextval('process_publication_cp_id_process_publication_cp_seq'::regclass) |
| search_token | character varying(25) | |
| publication_date | date | |
| id_publication_site | smallint | |
| id_publication_book | smallint | |
| search_token_page | integer | |
| search_token_page_positions | integer[] | |
| publication_snippets | text[] | |
| insert_time | timestamp without time zone | |
| integration_consume_time | timestamp without time zone | |
| elastic_consume_time | timestamp without time zone | |
+-----------------------------+-----------------------------+--------------------------------------------------------------------------------------------+
Indexes:
"process_publication_cp_pkey" PRIMARY KEY, btree (id_process_publication_cp)
"idx_uniq_pp_cp" UNIQUE, btree (publication_date, id_publication_book, search_token_page, search_token_page_positions, search_token)
Has OIDs: no
The old table has 440 million records while its new version has 432 million, 8 million less. The weird thing is that the new version is using 1GB more disk space than the old one as follows:
capture.process_publication 188 GB
capture.process_publication_cp 189 GB
My question is very simple: Why? How does PostgreSQL store and manipulates array columns? It doesn't make sense to me at all.
NOT NULLconstraints. What you get in psql with\d capture.token_publication_v2. – Erwin Brandstetter Aug 18 '16 at 19:09