12

This is a simple yet possibly controversial question: why do most (if not all) GIS packages require that a given layer have a unique, non-nullable numeric identifier?

Why is there a need for such a surrogate key instead of a natural one?

Examples:

  • ArcGIS enforces OBJECTID (or a GlobalID)

  • QGIS does not load layers when they don't have a numeric id.

whuber
George Silva

5 Answers

6

Because they need an optimized, indexable field. Indexing a string field over and over again requires more overhead and in the end is just not as efficient.

ESRI actually supports, in the SDE world, the 'GLOBALID', which is a GUID field. A GUID is 32 hexadecimal digits long, yet the field is still indexed to increase performance.

D.E.Wright
  • 3
    That's a good explanation for the efficiency advantage of a numeric id. But I think @George is probing more deeply than this. Technically, RDBMSes do not need their identifiers to be numeric, so why should GISes? – whuber Jul 11 '11 at 21:49
  • 2
    The problem here is not performance. A non-nullable unique key would do. But why must it be numeric? I once heard or read that it needs to be numeric because the key is used to control rendering... was it in Modelling Our World from ESRI? – George Silva Jul 11 '11 at 22:05
  • 2
    Because a GIS is not an RDBMS, although it can make use of one. A GIS will usually have some rules and assumptions, such as the assumption that the primary key will be an indexed integer or GUID, for the sake of performance and coding sanity. – blah238 Jul 11 '11 at 22:07
  • 1
    OK, but why assume a numeric key? Why can't we choose our own key when creating a layer? – George Silva Jul 11 '11 at 22:09
  • 1
    I'd imagine the main reason is that those assumptions make the job of writing the code that makes a GIS package work much, much easier. – blah238 Jul 11 '11 at 22:21
  • 1
    In the end we are left with what the vendor/author feels is best for their platform. If I could create my own platform from the ground up and saw that this was a good option, I might give the user a choice. But I think most vendors, especially those that use SHP as their foundation, stick to the known conventions; it is a common, predictable format that gives users a known response and gives other vendors compatibility. – D.E.Wright Jul 12 '11 at 05:37
4

If you start adding records to a layer, you could rely on a user entering a unique alphanumeric code for every new feature just before writing it to disk...

...or you could implement a simple autoincrementing integer field.
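To make the second option concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for whatever storage a GIS package actually uses. The table name `features` and the columns `fid` and `name` are made up for illustration; they are not taken from any particular GIS.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# fid is a system-assigned surrogate key; name is a user-editable attribute.
conn.execute("""
    CREATE TABLE features (
        fid  INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT
    )
""")

# The user supplies only the attribute; the database supplies the key,
# even when the attribute is duplicated or missing.
for name in ["Main St", "Main St", None]:
    conn.execute("INSERT INTO features (name) VALUES (?)", (name,))

rows = conn.execute("SELECT fid, name FROM features ORDER BY fid").fetchall()
print(rows)  # [(1, 'Main St'), (2, 'Main St'), (3, None)]
```

Nothing the user types (or forgets to type) can break the uniqueness of `fid`, which is the whole appeal of this approach.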

geographika
4

As many people have suggested, it is a question of convenience; but perhaps more profoundly, it is convention.

As a programmer, my first instinct would be to use a numeric key for a layer ID because that is the way it has always been done. Indeed, it may not even occur to me, on a conscious level at least, that I should do it any other way. Of course, if there is a technical reason not to use integers, say if there's a possibility of there being more layers than can be stored in 32 bits (a very unlikely proposition!), or if there is a business reason for it, then alternatives would be considered.

There are also algorithmic considerations with numeric keys. Sorting and searching a list of sorted values ultimately boil down to a comparison between two numbers, even if it is a list of strings or complex objects; they merely get turned into numbers with a hashing function. Having said that, on modern computers, searching a list of say 100 or even 1000 items is usually as quick with a brute-force approach as it is with a highly optimized algorithm. In the case of layers in a GIS, I can't see even the most complex of maps having more than 1000 or so, and even if it did, the other associated computations would take orders of magnitude longer than any small gain from an optimized search of a short list.

Integer keys "just make sense" to a programmer, and as Brad says, there is more effort in using non-numeric keys. Maybe not more code, but more mental effort, and we are lazy creatures of habit. Also, the key that uniquely identifies something like a layer in a GIS is considered "hidden" from the user, to make sure they don't mess about with it and break code that relies on its uniqueness (DB UNIQUE keywords notwithstanding). Because if you give a user enough rope, sooner or later someone will hang themselves with it. By all means enforce uniqueness on a user-editable field, but the underlying system must assume its key is unique and untampered with.

MerseyViking
  • OpenStreetMap is one example of a project that needs more than 32-bit integers. They use bigint for their primary keys. – Mike T Jul 12 '11 at 10:15
  • For ways/nodes, yes. But the original question was about layers in a GIS. – MerseyViking Jul 12 '11 at 10:25
  • OpenStreetMap stores GIS layers. – George Silva Jul 12 '11 at 11:43
  • OSM just stores ways and nodes which have key/value tags. It is up to the presentation system (e.g. OpenLayers) and the rendering backend (e.g. Mapnik, Osmarender) to determine its notion of layers based on those tags or something else. But Mike is right, it uses bigints for all its tables' primary keys. – MerseyViking Jul 12 '11 at 12:24
  • +1 for mentioning it's about convention. It's a convention because it equals better performance. – CaptDragon Jul 12 '11 at 14:04
3

This question has been a confusing one to people (like me) that develop the geodatabase-side of things.

It's not a limitation of database storage: PostgreSQL can define tables with composite primary keys of different data types. However, these tables cannot be loaded into programs like QGIS. On a related historical note, PostgreSQL used to require an OID column as an internal key, which was also a 32-bit integer. This was required until version 7.2.

The 32-bit integer ID requirement is really a programming limitation. It is much simpler to index a set of records with a fixed data type (a 32-bit integer), and it is convenient for this to also be the PRIMARY KEY for that record. It is more challenging to make a program accept a composite primary key and retrieve a unique record based on multiple and/or varying data types. However, like PostgreSQL's OID, this limitation can be overcome with development time. For QGIS, the [now] 5-year-old bug might be resolved some day (here is some recent discussion on the topic).
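A rough sketch of the difference in client code, with loud assumptions: it uses SQLite through Python's `sqlite3` module rather than PostgreSQL, the `parcels` table and both helper functions are hypothetical, and SQLite's implicit `rowid` plays the role of a fixed-type internal key (loosely analogous to the old PostgreSQL OID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table with a composite natural key of mixed types.
conn.execute("""
    CREATE TABLE parcels (
        county    TEXT    NOT NULL,
        parcel_no INTEGER NOT NULL,
        owner     TEXT,
        PRIMARY KEY (county, parcel_no)
    )
""")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?, ?)",
    [("A", 1, "Smith"), ("A", 2, "Jones"), ("B", 1, "Lee")],
)

def fetch_by_key(key_columns, key_values):
    """Generic lookup: the client must track which columns form the key
    and pass one value of the right type for each of them."""
    where = " AND ".join(f"{col} = ?" for col in key_columns)
    sql = f"SELECT owner FROM parcels WHERE {where}"
    return conn.execute(sql, key_values).fetchone()

def fetch_by_rowid(fid):
    """Fixed-type lookup: a single integer identifies the record."""
    return conn.execute(
        "SELECT owner FROM parcels WHERE rowid = ?", (fid,)
    ).fetchone()

print(fetch_by_key(("county", "parcel_no"), ("B", 1)))  # ('Lee',)
print(fetch_by_rowid(3))                                # ('Lee',)
```

Both lookups return the same row, but every part of a client that caches, selects, or highlights features has to carry a variable-length, variable-type tuple in the first case and a plain integer in the second.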

Mike T
  • +1 Well said. As further evidence that this is a programming limitation, note that ESRI did not require (or use) any internal identifier fields in ArcView before ArcGIS 8.x came out. The old ArcView was capable of all the database operations that ArcGIS performs (and actually was faster at many of them). – whuber Jul 12 '11 at 13:40
2

In ESRI and other GIS software, it is common for a folder or set of files to make up one feature class or dataset,
e.g. ArcInfo coverage, shapefile, file geodatabase.
These "sets" of files need to be "joined" by the software to allow for many GIS functions:
attribute tables, networks, topological controls.
That is the purpose of the OID, and also the reason for making it non-nullable, hidden, and software-controlled.
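As a toy illustration of that "join" (not how any real format lays out its files), the sketch below models two separate stores as Python dictionaries keyed by the same integer OID. All names and values are invented.

```python
# Hypothetical stand-ins for the separate files or tables of one dataset.
# Each store is keyed by the same software-controlled integer OID.
geometry = {1: (0.0, 0.0), 2: (1.0, 2.0), 3: (5.0, 5.0)}
attributes = {1: {"name": "well"}, 2: {"name": "pump"}, 3: {"name": "tank"}}

def feature(oid):
    """Reassemble one feature by looking up the same OID in every store."""
    return {"oid": oid, "geometry": geometry[oid], **attributes[oid]}

def delete_feature(oid):
    """Keep the stores in sync: remove the OID from all of them."""
    geometry.pop(oid)
    attributes.pop(oid)

print(feature(2))  # {'oid': 2, 'geometry': (1.0, 2.0), 'name': 'pump'}
```

If users could edit or null out the OID in just one of the stores, the pieces of a feature could no longer be matched up, which is one way to read the "hidden, software-controlled" requirement.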

Brad Nesom
  • I think the GIS operations may really have something to do with this: intersect, (spatial) union, difference, etc. Can anyone confirm this or explain it in more detail? – George Silva Jul 11 '11 at 22:07
  • Take a look at how a single SDE feature class is actually stored in a database such as Oracle. There is one table for the attributes, one table for the geometry, one table for the spatial index, one or more tables for the attribute indexes, etc. If ESRI had to support every code page/character encoding for a string PKEY we'd all still be on ArcView 3.x. – blah238 Jul 11 '11 at 22:16
  • @George - as noted by blah238, there are very few GIS applications that use one single file to store both (all) the data, which can consist of coordinates, measures, attributes, rules, relationships, and more depending on the package. It is more to do with being able to keep track of which spatial row goes with which attribute row, which network row, and so on. – Brad Nesom Jul 11 '11 at 22:22
  • 1
    I'm sorry blah238, I really don't think the amount of code was the determining factor in this issue. The encoding has nothing to do with this. The database will do the "math" and decide whether two sequences of chars are equal or not, thereby enforcing the PKEY. It's not in the software layer. @Brad Nesom: that also makes sense. But in Oracle and PostGIS you can store all your attributes in a single table. I agree that shapefiles needed the dreaded ObjectID... and that may have set the standard? – George Silva Jul 11 '11 at 22:22
  • @George Shapefiles neither needed nor, as a general rule, used an ObjectID. That OID field was introduced with ArcGIS 8. Therefore I doubt that shapefiles have anything to do with the question. – whuber Jul 12 '11 at 13:43
  • @whuber respectfully, are you sure? I can remember as far back as ArcView 1 having an OID. They were completely hidden back then and there weren't a lot of other tools that would expose them. Just wondering if my memory is bad? Probably! – Brad Nesom Jul 12 '11 at 14:05
  • hmm maybe not so much... http://webhelp.esri.com/arcgisserver/9.3/java/index.htm#geodatabases/a_short_hi9898730.htm – Brad Nesom Jul 12 '11 at 14:08
  • @Brad That link shows an ArcGIS 8+ table, not one from ArcView 1, 2, or 3, which never had an OID field. (Via programming, a user could reference records by number; the number was based on the physical offset of the dBase records in the .dbf part of the shapefile.) Any shapefile OID has to be created on the fly; it probably corresponds to the physical order in which features are stored. It does not exist as a field anywhere in the shapefile. – whuber Jul 12 '11 at 14:15
  • Told you my memory was bad – Brad Nesom Jul 12 '11 at 14:32