7

Possible Duplicate:
Acquiring ArcGIS-like speed in Postgis

I have been using PostGIS 2.0 for all of my GIS operations. PostGIS is an excellent GIS database, but no matter how I format my queries or design my tables I still have queries that take many hours or days to run. These same operations, when carried out in ArcGIS would take orders of magnitude less time to run comparatively. It is very depressing and stressful to just sit there and wait hours for simple queries to complete when you know they could be completed in much less time and with fewer chances for error. ArcGIS itself is not an option because I don't just have 75,000 dollars to spare for a windows-only program with strict restrictions and an ancient version of python.

Consider a simple point in polygon operation on a parcel dataset of a few hundred thousand parcels with a few hundred thousand points. This can take 45 minutes to an hour (with everything correctly indexed and using st_intersects); whereas in ArcGIS it would take less than a minute.

I have been told that this is because postgis is meant to handle operations on individual features rather than layers of features. In my experience though, QGIS does not perform operations much faster than in postgis and is much more subject to random crashes and generally seems unstable. The other thing that bothers me is that I have 3 local machines, each with an i7 processor and 8-16 GB of ram just sitting around doing nothing when they really are needed for helping out with GIS processing --but nothing I have at the moment can utilize them and I seriously doubt for that matter that my system even uses my single machine efficiently for these purposes.

I am very interested in clusters and distributed map reduce for GIS with NoSQL --from what I can tell it seems like these technologies are probably what I need to build anything that is significantly faster than what is out there. The problem is I really have almost no experience in these areas and am a very new programmer with only a working knowledge of python sql and a little java. Furthermore, the only GIS programs I have ever used are postgis, QGIS and ArcGIS.

The various options seem infinite so I am looking to the GIS community for some direction on where a beginner with no networking experience like myself should start and what technologies I should use to build a faster --or improve an existing-- GIS platform.

My requirements are as follows:

  1. Significantly faster than postgis for raw data processing on large datasets.
  2. Is supported in that it uses standard modules like GEOS or JTS, mongoDB etc. --I don't really want to build complex mathematical/geometric algorithms on my own as I lack the background to do this (at present, although I am slowly making an effort to get better at such things). In other words I want to use software that isn't extremely obscure and has a vibrant user/dev community.
  3. Works on Ubuntu server 12.04+ (I suppose debian would be fine too).
  4. Uses my unused machines on a local network. --This is not strictly a requirement as ArcGIS doesn't use multiple machines to get blazing speed; however, since they are just sitting there I might as well use them for something if at all possible.

I anticipate that with any of these options I will have the following two questions based on my research:

  1. How do I store geometries in a standard way that I can easily convert to postgis or shapefiles.
  2. How do I create an efficient index on the data.

Thanks.

EDIT: Due to all of the follow up questions in the comments being about things I could have possibly done wrong in postGIS, I have posted a question that more specially addresses the problems I have with postgis, so that this question can specially refer to what it would take to create a better platform

THX1138
  • 1,621
  • 3
  • 15
  • 18
  • 11
    Although I have some thoughts about how to leverage MapReduce for doing geo-cluster computation, I am highly skeptical about some operation that is taking 45 minutes in PostGIS and less than a minute in ArcGIS. That just tells me that there is something that is being expressed very differently in PostGIS and that can be highly improved to match the speed of ArcGIS. Can you please elaborate with the exact query you are executing in PostGIS and what tool you are using in ArcGIS? – Ragi Yaser Burhum Aug 04 '12 at 04:47
  • 1
    I agree that sounds very strange. – Nathan W Aug 04 '12 at 04:52
  • 3
    I'm not sure I understand exactly what you are trying to do, but it looks like there is something badly wrong with the query or indexes for this to occur. You must be doing multiple full table scans for this to happen.

    Can you show the queries that you are running?

    – BradHards Aug 04 '12 at 07:21
  • 1
    I also agree that it seems unlikely that PostGIS should be slower than ArcGIS when it comes to point in polygon. PostGIS is absolutely not only optimized for single feature operations. Actually, the real benefits of using PostGIS use to show when doing complex nested calculations on bigger datasets. That is because the PostgreSQL query-planner in a majority of cases makes very good decisions about in what order to do things. In Posnt in POlygon example POstGIS also uses a prepared geoemetry algorithm. – Nicklas Avén Aug 04 '12 at 10:59
  • Also take a look at this question and answer (even if the answer is a little bit messy because of editing) http://gis.stackexchange.com/questions/24554/looking-for-fastest-solution-for-point-in-polygon-analysis-of-200-million-points/24575#24575 – Nicklas Avén Aug 04 '12 at 10:59
  • 1
    I am using Postigs 1.5.2 on Windows filled with Openstreetmap data by their original installer and it works quiet fast, compared to a self-made spatialite database. But it makes a great difference if the tables, queries and spatial indexes are not in the manner they are designed in the OSM toolchain.

    So I think your qgis or postgis is missing something.

    – AndreJ Aug 04 '12 at 16:44
  • My intuition tells me that PostGIS should be quicker than ArcGIS if anything, so I'm with everyone else here in thinking that you need to look at your data structure and your queries. It is definitely worth using the EXPLAIN command to figure out what takes your queries so long. It can take a bit of work to figure out what the stats are telling you, but there are several web pages that you can use to help you decode the output. – MerseyViking Aug 04 '12 at 22:56
  • 1
    I'll agree with others that the solution is probably solved by optimizing your query/indexes and that distributed computing might not be necessary. For point-in-polygon, why do you use st_intersects and not st_contains? Are your points individual geometries or are some stored as multipoint? – Sean Aug 05 '12 at 17:59
  • Have you tweaked your shared_buffers setting. http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server If you have got a lot of space to spare, that would be the first thing I'd do as that will cause a lot of your database to be loaded in hot motherboard memory where it's more easily accessible. Something that I suspect ArcGIS just does without permission. – Regina Obe Aug 06 '12 at 12:11
  • @THX1138 you are going to answer some of these question if you want some answers. – Nathan W Aug 07 '12 at 00:45
  • This is the first time I have heard of ArcGIS shp being slower than postgis. Very disconcerting to hear as I am just lobbying and trying to convert our large datasets to postgis for access from mapinfo/esri/qgis without retaining tab/shp/fgdb copies of them but if the processing is slower ...hmm. – GeorgeC Aug 07 '12 at 03:24
  • Running PostGIS 2.0 64bit in Windows 7 on a 3 year old Dual Xeon 2.93GHz 72Gb RAM machine with the main PostGIS tuning options maxed out (see http://workshops.opengeo.org/postgis-intro/tuning.html) - I got a 312 million row result from 2m points and 55k polygons in 45 minutes. No GIS could've done that in 45 mins. Lastly, I was using geometry types, not geography types as they are currently much slower. I would recommend you have a look at the link and play with your tuning options – minus34 Aug 07 '12 at 22:48
  • I posted a new question about the problems I am having with postgis as this question would become way to ambiguous otherwise. – THX1138 Aug 11 '12 at 22:06
  • 1
    @THX1138 I think it would be more helpful if you would clean up this question and address the follow-up questions which were raised in the comments. I don't think spreading the discussion over two threads will improve the results. What's the core question of this thread if it's not about your problems with PostGIS? – underdark Aug 11 '12 at 22:27

0 Answers0