
We are creating a website that will collect location information (points) from users. We are exploring techniques to preserve users' location privacy (e.g., often users will share their home address, which is sensitive). One option that came to mind is to obfuscate or "hash" the points before storing them in the database, eliminating the need to store these sensitive data at all.

Our basic requirements are, I believe:

  1. Given a single obfuscated point, it is not possible to derive the original point within (say) a kilometer or so, even given all the metadata associated with the point (i.e., assume the entire database is compromised).

  2. Given an arbitrarily large set of obfuscated points corresponding to the same original point, it is still not possible to derive the original point. (For example, an easy technique would be to add a random vector to the original point, but if you do this enough times, the obfuscated points will cluster around the original point.)

It would be nice if various statistical properties were preserved, though I don't know which properties are important at this stage. For example, I'd rather that obfuscated points scatter in a "natural" way rather than accumulating into a grid. However, privacy is more important than this.

Reid
  • Your requirements don't mention what sort of accuracy you wish to maintain; you only focus on the obfuscation requirement. The following algorithm trivially satisfies the requirements you listed, but is rather worthless: map each point to 0° N, 0° E. Presumably you also want to satisfy some criterion, like the obfuscated point being within x km of the actual point. – Llaves Jun 20 '12 at 04:14
  • A second question: you mention metadata and being able to reconstruct the true point if the entire database is compromised. If the metadata doesn't allow you to identify obfuscated points associated with the same "true point", then how can someone reconstruct the "true point" from repeated random samples if you can't associate them with each other? On the other hand, if the metadata does allow you to associate the points, then when you are asked to report again the location of some already obfuscated point, just return the same obfuscated value returned all the previous times. – Llaves Jun 20 '12 at 04:29
  • Do you need to be able to recreate the actual location from the hashed data, or will it just be used for confirming a person is where they say they are? If it's the latter, a one-way hash, hashing a salt + the WKT of the geometry would suffice. If it's the former, then you'll have to have some function somewhere to do the inverse transformation of your hash function - a two-way hash. – MerseyViking Jun 20 '12 at 08:56
  • Will the points be compared with other users data/other datasets as a part of the service? – Matthew Snape Jun 20 '12 at 13:34
  • @Llaves, I do actually: "within a kilometer or so". But I would hope the obfuscation level is a parameter to the algorithm. Regarding your second comment, yes, the metadata allows association of points (e.g., one user might enter the same point many times). And an algorithm that results in the same obfuscated point given the same original point is fine; but if the algorithm doesn't do that, I can't recover the original point (that's the whole reason for the question) in order to test if the same obfuscated point should be used. – Reid Jun 20 '12 at 17:24
  • @Mersey, I don't need to recreate the original point. But it does need to be a geographic point that I can use in later analysis. (The points are citizen science observations, but we may be able to get away with a reduction in precision that comes with obfuscation.) – Reid Jun 20 '12 at 17:27
  • @Matthew, that sort of analysis is feasible, yes, but we don't need to join the points with points in another dataset (I envision questions more like "how many points are in this polygon from another dataset?"). – Reid Jun 20 '12 at 17:30
  • One problem with that kind of accuracy: If you can determine if it's within "a KM" of a point, I only have to brute-force 500M points to cover the surface of the earth. That's totally easy. Since you can't do better than that, you might as well just round to the nearest minute of lat/long, and store that. – zebediah49 Jun 20 '12 at 20:29
  • @zebediah49, I don't understand your issue. The point is that a user might input the location of their home, and I don't want to associate that location with their other metadata or even with the fact that someone at a particular location participated in the site. How is your approach relevant to that? – Reid Jun 20 '12 at 21:03
  • I may have phrased that badly. I'm saying that at some level if you can use the information to, for example, determine how close the user is to a point, that can be used to locate the user. The earth is small enough that you need to discard data--most likely by losing enough accuracy that it's not a concern any more. Your other option is to encode it such that it cannot be used for anything useful ("within area X" is no longer an option). – zebediah49 Jun 20 '12 at 21:15
  • At the risk of belaboring the point, my comment asks how accurate the obfuscated point must be. Your reference to "one kilometer" states "it is not possible to derive the original point within (say) a kilometer or so". This is an entirely different concept. – Llaves Jun 21 '12 at 02:57

4 Answers


Have a look at:

Armstrong MP, Rushton G, Zimmerman DL. Geographically masking health data to preserve confidentiality. Stat Med. 1999; 18:497–525.


They discuss different 'geo-masks' for point data, including displacement, rotation, random perturbation and aggregation. Although they don't discuss specific technical solutions for implementing them, there are useful pointers to information on what you gain/lose with each approach.
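
For a concrete flavour, here is a minimal sketch of the simplest of those masks, random perturbation: displace each point by a random bearing and distance. This is my own illustration, not code from the paper, and (as the question itself notes) repeated independent perturbations of the same point will cluster around the original.

```python
import math
import random

def random_perturbation(lat, lon, min_km=0.1, max_km=1.0):
    """Displace a point by a random bearing and a random distance."""
    bearing = random.uniform(0.0, 2.0 * math.pi)
    distance_km = random.uniform(min_km, max_km)
    # Approximate conversion: one degree of latitude is ~111 km.
    dlat = (distance_km / 111.0) * math.cos(bearing)
    dlon = (distance_km / (111.0 * math.cos(math.radians(lat)))) * math.sin(bearing)
    return lat + dlat, lon + dlon
```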

For more theoretical considerations, have a look at my answer to a question on a similar topic.

radek
  • Nice reference; it is an active field, so many are available. I've recommended an overview article (Mathews & Harel, 2011) in another question. I also believe the International Journal of Health Geographics has papers on it from time to time (see my citeulike library with the geomask tag). I haven't come across any tools to do the job though, probably a useful endeavor. – Andy W Jun 20 '12 at 17:48
  • @AndyW Thanks for the pointers, Andy. Indeed, with the growing amount of high-resolution geodata used in public health / spatial epidemiology, the problem becomes more and more relevant. I had the same feeling that practical solutions are still far behind theoretical ones - definitely a place where some nice developments can be made! – radek Jun 21 '12 at 10:38

You could try using Perlin noise to shift your points by a random amount, with the advantage that points close to each other remain close to each other, though this similarity falls off with distance. If the noise function is centred around 0, statistical analysis should still return results similar to those from the source data, as Perlin noise (especially the 2002 version) has a roughly Gaussian distribution.
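
A minimal sketch of this idea, assuming the third-party `noise` package (its `pnoise2` function samples 2D Perlin noise); the scale and shift parameters here are illustrative choices, not values the answer prescribes:

```python
import noise  # third-party package: pip install noise

def perlin_shift(lat, lon, scale=100.0, max_shift_deg=0.01):
    """Shift a point using 2D Perlin noise sampled at the point itself:
    identical inputs map to identical outputs, and nearby points get
    similar shifts that decorrelate with distance."""
    # Two decorrelated noise samples, obtained by offsetting the lookup.
    dx = noise.pnoise2(lat * scale, lon * scale)
    dy = noise.pnoise2(lat * scale + 1000.0, lon * scale + 1000.0)
    # pnoise2 returns values roughly in [-1, 1], centred on 0.
    return lat + dx * max_shift_deg, lon + dy * max_shift_deg
```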

MerseyViking
  • If I shift many copies of the same point, could the original point then be recovered by analyzing the shifted points? – Reid Jun 20 '12 at 20:20
  • The way I imagined it, you would use the coordinates of the point as a lookup into the noise function. So two identical points would remain coincident. You could use a third value, say the date the point was created, as a lookup into a 3D Perlin noise function. Then (and I'm no statistician), it would be impractical to reconstruct the source data unless the random seed and the scale of the noise you chose were known. Even then I'm not sure it'd be practically workable. – MerseyViking Jun 20 '12 at 21:03
  • Ah, so you're making it into a hash function. It may be unsafe to assume that the random seed and scale remain secret, though; I'm assuming that the server has been entirely compromised. – Reid Jun 20 '12 at 21:05
  • Phew! OK then, I like a challenge :) Now you're really talking about physical security. You have a separate off-site machine to generate the hashes and send them over a secure connection with something like SSL. You could set up a watchdog on one or both of the servers such that if one goes down, or you press a big red button, the other automatically shuts down. If you used cloud instances, then there would be no practical way of getting anything from the other instance, short of breaking into Amazon's datacentres... – MerseyViking Jun 20 '12 at 21:15
  • As a corollary, you should only spend as much on data security as the data is worth. There are many layers you could add to your security model, but at some point you have to say enough. It'd be worth perhaps fielding this question to one of the other SE sites. – MerseyViking Jun 20 '12 at 21:20

This is perhaps more convoluted and involved than it need be; however, it may be a route to take:

Create a simple Python script that takes your original input points, buffers them by an acceptable obfuscating distance, creates some number of random points (100, for example) using the buffers as a feature constraint, and then selects one of those points using a pseudo-random number generator as the new obfuscated point. A new pseudo-random number would be drawn for each obfuscation.
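
A rough sketch of that idea in plain Python, with no ArcGIS dependency (the buffer-and-constrain step is approximated by sampling candidate offsets within the buffer radius directly; all names and parameters are illustrative):

```python
import math
import random

def obfuscate(lat, lon, buffer_km=1.0, n_candidates=100):
    """Generate candidate points inside a buffer around the input,
    then pick one pseudo-randomly as the obfuscated point."""
    candidates = []
    for _ in range(n_candidates):
        bearing = random.uniform(0.0, 2.0 * math.pi)
        distance_km = buffer_km * math.sqrt(random.random())  # uniform over the disc
        dlat = (distance_km / 111.0) * math.cos(bearing)
        dlon = (distance_km / (111.0 * math.cos(math.radians(lat)))) * math.sin(bearing)
        candidates.append((lat + dlat, lon + dlon))
    # A fresh pseudo-random draw is made for each obfuscation.
    return random.choice(candidates)
```

(Sampling a single candidate directly would be statistically equivalent; the two-step draw just mirrors the described procedure.)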

Depending on your scenario, this could be packaged in a Toolbox and accessed as a GPService with a REST endpoint, so the obfuscation occurs in memory and only the obfuscated point is posted to your physical database.

AHigh
  • This assumes an ArcGIS implementation, but none was mentioned in the OP. Still, an interesting solution! – blah238 Jun 20 '12 at 09:32
  • This natural solution has some potential flaws upon examination: (1) several distinct points may get mapped to the same point. (2) It's easy to unmask points, as the OP shows. (3) Often points need to stand in some geographic relationship to related features: e.g., house locations ought to be near streets and not in lakes or in rail yards. Issues such as these make the problem genuinely difficult, interesting, and worthy of GIS analysis (for otherwise one could just jitter the original coordinates randomly when they are first entered into the database and be done with it). – whuber Jun 20 '12 at 14:03

OK, so the algorithm we are considering is as follows:

  1. Round the point to a 200-meter grid (to compensate for vagaries in geocoding).
  2. Hash the text of the point's coordinates using a cryptographic hash algorithm (e.g., SHA-2).
  3. Replace the lower-order bits of the point's coordinates (up to the desired obfuscation level of 1 km) with bits taken from the hash output.
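
A minimal sketch of this scheme in Python, interpreting "replace the lower-order bits" as deriving the sub-kilometre part of each coordinate from the hash output; the grid size, offset range, and names are my own illustrative choices:

```python
import hashlib

GRID_DEG = 200.0 / 111000.0         # ~200 m expressed in degrees (latitude approximation)
MAX_OFFSET_DEG = 1000.0 / 111000.0  # ~1 km obfuscation level

def obfuscate(lat, lon):
    # 1. Snap to a ~200 m grid to absorb geocoding jitter.
    glat = round(lat / GRID_DEG) * GRID_DEG
    glon = round(lon / GRID_DEG) * GRID_DEG
    # 2. Hash the text of the gridded coordinates.
    digest = hashlib.sha256(f"{glat:.6f},{glon:.6f}".encode()).digest()
    # 3. Derive a deterministic offset within the obfuscation radius
    #    from the first hash bytes.
    off_lat = (digest[0] / 255.0 - 0.5) * 2.0 * MAX_OFFSET_DEG
    off_lon = (digest[1] / 255.0 - 0.5) * 2.0 * MAX_OFFSET_DEG
    return glat + off_lat, glon + off_lon
```

Because the mapping is deterministic, repeated submissions of the same point always yield the same obfuscated point, which sidesteps the clustering problem in requirement 2.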
Reid