I tried to import a subset of OpenStreetMap data into a Postgres database using the standard tool (osm2pgsql). I then tried to rebuild the indices on the table with plain SQL (dropping them, then recreating them), and they end up taking a massive amount of space.

The table is 9 GB, with an int id, 70 text columns for attributes (for example, the column access_type has 250 distinct text values and is NULL in 90% of cases), and a possibly large geometry column. If I try:

CREATE INDEX planet_osm_polygon_accesstype_index
  ON planet_osm_polygon
  USING BTREE (access_type);

I end up with an index that is 1 GB in size for 140k rows, which is massive considering how little data there is. This doesn't seem to happen on other systems, and I have already tried ANALYZE followed by VACUUM.

I don't know much about Postgres, but any hints on how to reduce the index size would be much appreciated.

(OS is Ubuntu 12.04, PostgreSQL is version 9.1)

raph

1 Answer

You seem to expect that rows with NULL values are excluded from a B-tree index automatically, but that's not the case. Those are indexed as well and can be searched for. However, since:

access_type ... is null in 90% of cases

that's hardly useful in your case. Such common values hardly ever make sense in an index to begin with, be it NULL or any other value. Exclude NULL values from the index with a partial index.

CREATE INDEX planet_osm_polygon_accesstype_index ON planet_osm_polygon (access_type)
WHERE access_type IS NOT NULL;

The resulting index should be much smaller and thus faster.
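
If you want to verify the effect, you can compare sizes with the built-in size functions, for example:

SELECT pg_size_pretty(pg_relation_size('planet_osm_polygon_accesstype_index'));  -- the (partial) index
SELECT pg_size_pretty(pg_relation_size('planet_osm_polygon'));                    -- the table's heap, for comparison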

Remember that you may have to include the same WHERE condition in queries to make Postgres realize it can apply the partial index.
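
For example, a query that repeats the index condition verbatim can use the partial index, as can one that filters on a concrete value (column names and the value 'private' are just for illustration):

SELECT osm_id, way
FROM   planet_osm_polygon
WHERE  access_type IS NOT NULL;       -- matches the partial index condition exactly

SELECT osm_id, way
FROM   planet_osm_polygon
WHERE  access_type = 'private';       -- implies access_type IS NOT NULL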

Related answers with more details:

Erwin Brandstetter
  • Just for completeness: for a partial index using IS NOT NULL you don't have to use IS NOT NULL explicitly in the query. If you compare against a value, WHERE access_type = 'foo', the optimizer is smart enough to recognize that. – a_horse_with_no_name Jul 20 '14 at 11:39
  • @a_horse_with_no_name: You still may have to include the condition explicitly, for instance in a prepared statement or function where the parameter can be NULL. The explicit condition would allow Postgres to use the index. – Erwin Brandstetter Jul 20 '14 at 12:12
  • To Ypercube: the table actually had 26 million rows waiting for a commit, which explains the index size (40 bytes per row, so it looks reasonable). – raph Jul 20 '14 at 17:36
  • @a_horse_with_no_name: solved it for me; the index is 750 kB per column. Adding the indexes takes a little while, but the resulting Mapnik map renderings are blazingly fast. Thanks for the help. – raph Jul 20 '14 at 17:39
  • Will try to post to the osm2pgsql devs. Most map rendering calls are geometry selections, so they try to find columns with specific non-null attributes; having a cheap index on those might help. – raph Jul 20 '14 at 17:59
  • @raph: I didn't mean that the solution was wrong, just that a WHERE access_type IS NOT NULL condition is not required in the query (even if that is used in the index), as long as you compare the column to a non-null value: WHERE access_type = 'foo'. The planner will still make use of the index. – a_horse_with_no_name Jul 20 '14 at 18:09
  • Sorry, I meant the solution is Erwin's with both your added precisions. I didn't test the parameterized prepared statements; Mapnik uses statements that are 'hardcoded' as far as the DB is concerned (editable in an XML file with no ? params). – raph Jul 20 '14 at 23:35
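
Following up on the prepared-statement point in the comments above, a minimal sketch of what including the condition explicitly can look like (statement name is made up; table and column names assume the usual osm2pgsql schema):

-- The redundant IS NOT NULL repeats the partial index condition explicitly,
-- so Postgres can use the index even though the parameter could be NULL.
PREPARE polygons_by_access(text) AS
SELECT osm_id, way
FROM   planet_osm_polygon
WHERE  access_type = $1
AND    access_type IS NOT NULL;

EXECUTE polygons_by_access('private');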