4

I have a GeoDataFrame (gdf) of several thousand point locations. I want to drop duplicate records from the gdf - that is, records with the same attribute information and the same location. However, the coordinates in the geometry column have way more precision than I need for my analysis, which means that coordinates that are "functionally the same" for my purposes (e.g., the same up to the 5th decimal place) don't have identical geometries.

Sample data for reproducibility:

import pandas as pd
import geopandas as gpd

df = pd.DataFrame({'fid': [0, 1, 2, 3, 4, 5], 'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'], 'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator'] })

coords = ['POINT (-68.85052703049803 -46.03444179295434)', 'POINT (-68.85052703049802 -46.03443956295743)', 'POINT (-68.60401999999993 -37.49876999999998)', 'POINT (-69.17996992199994 -38.91214629699994)', 'POINT (-69.29235725099994 -38.55542628499995)', 'POINT (-69.29235725099992 -38.5554262849999)']

gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)

print(gdf)
   fid location_name  equipment                     geometry
>>    0           ABC       tank  POINT (-68.85053 -46.03444)  # fid 0 and 1 have same attribs and nearly-identical geometry, so a duplicate should be removed
>>    1           ABC       tank  POINT (-68.85053 -46.03444)
>>    2           DEF  generator  POINT (-68.60402 -37.49877)  # fid 2 and 3 have the same attribs but different geometry, so both should be kept
>>    3           DEF  generator  POINT (-69.17997 -38.91215)
>>    4           JKL       tank  POINT (-69.29236 -38.55543)  # fid 4 and 5 have the same geometry but not identical attributes, so both are kept
>>    5           JKL  generator  POINT (-69.29236 -38.55543)

If I only wanted to remove exact duplicates, I could use pandas' drop_duplicates() method: gdf.drop_duplicates(subset=['location_name', 'equipment', 'geometry'], keep='first'). However, because the values in the geometry column are not exactly identical to one another, no records are dropped.

If I were just comparing two sets of coordinates to see if they're close enough to be considered identical, I could use something like np.isclose() and define my threshold for "sameness", but I don't know how I'd apply this sort of analysis across a gdf where I don't know in advance which rows might be similar to one another.

How can I identify records in my gdf with similar enough (according to a threshold) geometries and attributes, and drop those records?

Desired result:

   fid location_name  equipment                     geometry
>>    0           ABC       tank  POINT (-68.85053 -46.03444)
>>    2           DEF  generator  POINT (-68.60402 -37.49877)
>>    3           DEF  generator  POINT (-69.17997 -38.91215)
>>    4           JKL       tank  POINT (-69.29236 -38.55543)
>>    5           JKL  generator  POINT (-69.29236 -38.55543)
Ratislaus
neirbom9

4 Answers

5

You can round the coordinates of the geometries first using shapely.set_precision:

# Round the coordinates to 5 decimals
gdf.geometry = shapely.set_precision(gdf.geometry, grid_size=0.00001)

Note: from GeoPandas 1.0 onward, set_precision is also available as a GeoSeries method:

# Round the coordinates to 5 decimals
gdf.geometry = gdf.geometry.set_precision(grid_size=0.00001)
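
To see what the snapping does to the two nearly identical points from the question (fid 0 and 1), here is a minimal sketch, assuming shapely 2.x, where set_precision, from_wkt and to_wkb are top-level functions. The variable names are only illustrative.

```python
import shapely

# fid 0 and fid 1 from the question
p = shapely.from_wkt("POINT (-68.85052703049803 -46.03444179295434)")
q = shapely.from_wkt("POINT (-68.85052703049802 -46.03443956295743)")

# Before snapping, the y-coordinates differ at the 6th decimal,
# so the binary representations are not identical
same_before = shapely.to_wkb(p) == shapely.to_wkb(q)

# Snap both points to a grid with a spacing of 0.00001
p_snapped = shapely.set_precision(p, grid_size=0.00001)
q_snapped = shapely.set_precision(q, grid_size=0.00001)
same_after = shapely.to_wkb(p_snapped) == shapely.to_wkb(q_snapped)

print(same_before, same_after)
```

Once both points land on the same grid node they compare as identical, which is what lets drop_duplicates() treat them as one record.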

Full code sample using shapely:

import pandas as pd
import geopandas as gpd
import shapely

df = pd.DataFrame( { "fid": [0, 1, 2, 3, 4, 5], "location_name": ["ABC", "ABC", "DEF", "DEF", "JKL", "JKL"], "equipment": ["tank", "tank", "generator", "generator", "tank", "generator"], } )

coords = [ "POINT (-68.85052703049803 -46.03444179295434)", "POINT (-68.85052703049802 -46.03443956295743)", "POINT (-68.60401999999993 -37.49876999999998)", "POINT (-69.17996992199994 -38.91214629699994)", "POINT (-69.29235725099994 -38.55542628499995)", "POINT (-69.29235725099992 -38.5554262849999)", ]

gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)

print(gdf)

# Round the coordinates to 5 decimals
gdf.geometry = shapely.set_precision(gdf.geometry, grid_size=0.00001)
print(gdf.drop_duplicates(["location_name", "equipment", "geometry"]))

This results in the output being the "desired result":

   fid location_name  equipment                     geometry
0    0           ABC       tank  POINT (-68.85053 -46.03444)
2    2           DEF  generator  POINT (-68.60402 -37.49877)
3    3           DEF  generator  POINT (-69.17997 -38.91215)
4    4           JKL       tank  POINT (-69.29236 -38.55543)
5    5           JKL  generator  POINT (-69.29236 -38.55543)
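
If the full-precision coordinates should be kept in the result, one option is to snap only a copy of the geometries and use it as a comparison key. The following is a sketch under that assumption, not part of the original answer; the names snapped, key and deduplicated are made up for illustration, and it assumes shapely 2.x.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
import shapely

df = pd.DataFrame(
    {
        "fid": [0, 1, 2, 3, 4, 5],
        "location_name": ["ABC", "ABC", "DEF", "DEF", "JKL", "JKL"],
        "equipment": ["tank", "tank", "generator", "generator", "tank", "generator"],
    }
)
coords = [
    "POINT (-68.85052703049803 -46.03444179295434)",
    "POINT (-68.85052703049802 -46.03443956295743)",
    "POINT (-68.60401999999993 -37.49876999999998)",
    "POINT (-69.17996992199994 -38.91214629699994)",
    "POINT (-69.29235725099994 -38.55542628499995)",
    "POINT (-69.29235725099992 -38.5554262849999)",
]
gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)

# Snap a copy of the geometries; gdf itself is left untouched
snapped = shapely.set_precision(np.asarray(gdf.geometry), grid_size=0.00001)

# Comparison key: the attributes plus the WKB bytes of the snapped geometry
key = gdf[["location_name", "equipment"]].copy()
key["geom_key"] = shapely.to_wkb(snapped)

# Keep the first row of every group of identical keys
deduplicated = gdf[~key.duplicated(keep="first")]
print(deduplicated)
```

The rows that survive are the same as in the output above, but their geometries still carry the original coordinates.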
Pieter
4

You could just create new columns based on your point geometry, round it, and then use drop_duplicates(), e.g.:

gdf["x"] = round(gdf.geometry.x, 5)
gdf["y"] = round(gdf.geometry.y, 5)

gdf.drop_duplicates(["location_name", "equipment", "x", "y"])
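
One thing to keep in mind with any rounding-based approach: two coordinates that are extremely close can still round to different values when they sit on opposite sides of a rounding boundary. The sketch below illustrates this with made-up coordinates and a hypothetical helper, near_duplicate_mask, which compares against a tolerance instead (the kind of np.isclose-style check mentioned in the question). It is a plain pairwise loop within each attribute group, so it assumes the groups are small.

```python
import pandas as pd


def near_duplicate_mask(df, attr_cols, x_col, y_col, tol):
    """Flag rows within `tol` (in x and in y) of an earlier kept row with equal attributes."""
    is_dup = pd.Series(False, index=df.index)
    for _, group in df.groupby(attr_cols, sort=False):
        kept = []  # coordinates of the rows kept so far in this group
        for idx, x, y in zip(group.index, group[x_col], group[y_col]):
            if any(abs(x - kx) <= tol and abs(y - ky) <= tol for kx, ky in kept):
                is_dup.loc[idx] = True
            else:
                kept.append((x, y))
    return is_dup


# Made-up data: the first two x values are only 2e-9 apart,
# but they straddle the rounding boundary at the 5th decimal
pts = pd.DataFrame(
    {
        "location_name": ["ABC", "ABC", "DEF"],
        "equipment": ["tank", "tank", "tank"],
        "x": [-68.850525001, -68.850524999, -68.60402],
        "y": [-46.03444, -46.03444, -37.49877],
    }
)

# Rounding to 5 decimals does not make the first two rows equal
rounded = pts.assign(x=pts["x"].round(5), y=pts["y"].round(5))
n_found_by_rounding = int(
    rounded.duplicated(["location_name", "equipment", "x", "y"]).sum()
)

# The tolerance-based check flags the second row as a near-duplicate
mask = near_duplicate_mask(pts, ["location_name", "equipment"], "x", "y", tol=0.00001)
print(n_found_by_rounding, mask.tolist())
```

With a GeoDataFrame of points, the x and y columns could be filled from gdf.geometry.x and gdf.geometry.y as in the snippet above, and the rows kept with gdf[~mask].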

Ratislaus
NielsFlohr
2

You can round off the coordinates to a desired precision level, which is not obvious with geopandas. See here.

from shapely.ops import transform

def round_coordinates(geom, ndigits=2):
    """Return a copy of geom with every coordinate rounded to ndigits decimals."""

    def _round_coords(x, y, z=None):
        x = round(x, ndigits)
        y = round(y, ndigits)

        if z is not None:
            # round z itself (not x) when the geometry is 3D
            z = round(z, ndigits)
            return (x, y, z)
        else:
            return (x, y)

    return transform(_round_coords, geom)

import pandas as pd
import geopandas as gpd

df = pd.DataFrame({'fid': [0, 1, 2, 3, 4, 5], 'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'], 'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator'] })

coords = ['POINT (-68.85052703049803 -46.03444179295434)', 'POINT (-68.85052703049802 -46.03443956295743)', 'POINT (-68.60401999999993 -37.49876999999998)', 'POINT (-69.17996992199994 -38.91214629699994)', 'POINT (-69.29235725099994 -38.55542628499995)', 'POINT (-69.29235725099992 -38.5554262849999)']

gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)

# Retains all original entries

gdf.drop_duplicates(subset=['location_name','equipment','geometry'])

gdf['geometry'] = gdf.geometry.apply(round_coordinates, ndigits=4)

# Now drops the duplicate entry in the beginning after changing precision

gdf.drop_duplicates(subset=['location_name','equipment','geometry'])
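
As a vectorized alternative to applying a function per geometry, shapely 2.x offers shapely.transform, which passes all coordinates to the callback as one NumPy array. This is only a sketch, assuming shapely 2.0 or later; by default it handles 2D coordinates only and drops any z values (include_z=False).

```python
import numpy as np
import shapely

# The first three points from the question
geoms = shapely.from_wkt(
    [
        "POINT (-68.85052703049803 -46.03444179295434)",
        "POINT (-68.85052703049802 -46.03443956295743)",
        "POINT (-68.60401999999993 -37.49876999999998)",
    ]
)

# The callback receives an (N, 2) array holding the coordinates of all geometries
rounded = shapely.transform(geoms, lambda coords: np.round(coords, 4))

# After rounding, the first two points are identical; the third is not
first_two_equal = shapely.to_wkb(rounded[0]) == shapely.to_wkb(rounded[1])
third_differs = shapely.to_wkb(rounded[0]) != shapely.to_wkb(rounded[2])
print(first_two_equal, third_differs)
```

For a GeoDataFrame, the input could be np.asarray(gdf.geometry), with the result assigned back to gdf.geometry before calling drop_duplicates().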

Shawn
0

As an alternative to the solution suggested in NielsFlohr's answer, I would like to suggest an option that does not require splitting the point geometry into x- and y-coordinates.

It relies on two functions of the shapely library, loads() and dumps(), which can be applied when creating the GeoSeries.

import pandas as pd
import geopandas as gpd
from shapely.wkt import loads, dumps

df = pd.DataFrame({ 'fid': [0, 1, 2, 3, 4, 5], 'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'], 'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator'], 'geometry': [ 'POINT (-68.85052703049803 -46.03444179295434)', 'POINT (-68.85052703049802 -46.03443956295743)', 'POINT (-68.60401999999993 -37.49876999999998)', 'POINT (-69.17996992199994 -38.91214629699994)', 'POINT (-69.29235725099994 -38.55542628499995)', 'POINT (-69.29235725099992 -38.5554262849999)' ] })

geoms = gpd.GeoSeries.from_wkt(dumps(loads(df['geometry']), rounding_precision=5))
gdf = gpd.GeoDataFrame(data=df, geometry=geoms, crs="EPSG:4326")

print(gdf.drop_duplicates(subset=['location_name', 'equipment', 'geometry'], keep='first'))

Afterwards, the output should look like this:

   fid location_name  equipment                     geometry
0    0           ABC       tank  POINT (-68.85053 -46.03444)
2    2           DEF  generator  POINT (-68.60402 -37.49877)
3    3           DEF  generator  POINT (-69.17997 -38.91215)
4    4           JKL       tank  POINT (-69.29236 -38.55543)
5    5           JKL  generator  POINT (-69.29236 -38.55543)

How does the combination of loads and dumps actually work?

from shapely.wkt import loads, dumps
# set a string variable as a WKT representation of a point
point_wkt = 'POINT (-68.85052703049803 -46.03444179295434)' # <class 'str'>
# create a Point from its WKT representation
point = loads(point_wkt) # <class 'shapely.geometry.point.Point'>
print(point) # POINT (-68.85052703049803 -46.03444179295434)
# convert the Point back to a WKT string, rounded to 5 decimals
point_ = dumps(point, rounding_precision=5) # <class 'str'>
print(point_) # POINT (-68.85053 -46.03444)

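Also, keep in mind that GeoPandas displays the coordinates of data in a geographic CRS with 5 decimals by default, so the printed output can look identical even when the stored geometries differ. See the thread "Getting more precision with GeoPandas?" for details.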
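
A small sketch of that display behaviour, assuming a GeoPandas version that provides gpd.options.display_precision; the series s is only illustrative and uses fid 0 and 1 from the question.

```python
import geopandas as gpd

s = gpd.GeoSeries.from_wkt(
    [
        "POINT (-68.85052703049803 -46.03444179295434)",
        "POINT (-68.85052703049802 -46.03443956295743)",
    ],
    crs=4326,
)

# Printed with the default display precision, both rows look alike
print(s)

# The stored coordinates are not affected by how they are displayed
print(s.y.tolist())

# Optionally raise the display precision for subsequent prints
gpd.options.display_precision = 9
print(s)

# Restore the default behaviour
gpd.options.display_precision = None
```

Only the printed representation changes here. Whether two geometries count as duplicates still depends on the stored coordinates.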



Taras