Dropping nearly-identical point locations with GeoPandas

Question

I have a GeoDataFrame (gdf) of several thousand point locations. I want to drop duplicate records from the gdf - that is, records with the same attribute information and the same location. However, the coordinates in the geometry column have way more precision than I need for my analysis, which means that coordinates that are "functionally the same" for my purposes (e.g., the same up to the 5th decimal place) don't have identical geometries.

Sample data for reproducibility:

import pandas as pd
import geopandas as gpd
df = pd.DataFrame({'fid': [0, 1, 2, 3, 4, 5],
                   'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'],
                   'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator']
                   })
coords = ['POINT (-68.85052703049803 -46.03444179295434)',
          'POINT (-68.85052703049802 -46.03443956295743)',
          'POINT (-68.60401999999993 -37.49876999999998)',
          'POINT (-69.17996992199994 -38.91214629699994)',
          'POINT (-69.29235725099994 -38.55542628499995)',
          'POINT (-69.29235725099992 -38.5554262849999)']
gdf = gpd.GeoDataFrame(data=df,
                       geometry=gpd.GeoSeries.from_wkt(coords),
                       crs=4326)
print(gdf)
   fid location_name  equipment                     geometry
>>    0           ABC       tank  POINT (-68.85053 -46.03444)  # fid 0 and 1 have same attribs and nearly-identical geometry, so a duplicate should be removed
>>    1           ABC       tank  POINT (-68.85053 -46.03444)
>>    2           DEF  generator  POINT (-68.60402 -37.49877)  # fid 2 and 3 have the same attribs but different geometry, so both should be kept
>>    3           DEF  generator  POINT (-69.17997 -38.91215)
>>    4           JKL       tank  POINT (-69.29236 -38.55543)  # fid 4 and 5 have the same geometry but not identical attributes, so both are kept
>>    5           JKL  generator  POINT (-69.29236 -38.55543)

If I only wanted to remove duplicate records based on attributes, I could use Pandas's drop_duplicates() function: gdf.drop_duplicates(subset=['location_name', 'equipment', 'geometry'], keep='first'). However, because the values in the geometry column are not exactly identical to one another, no records are dropped.

If I were just comparing two sets of coordinates to see if they're close enough to be considered identical, I could use something like np.isclose() and define my threshold for "sameness", but I don't know how I'd apply this sort of analysis across a gdf where I don't know in advance which rows might be similar to one another.

How can I identify records in my gdf with similar enough (according to a threshold) geometries and attributes, and drop those records?

Desired result:

   fid location_name  equipment                     geometry
>>    0           ABC       tank  POINT (-68.85053 -46.03444)
>>    2           DEF  generator  POINT (-68.60402 -37.49877)
>>    3           DEF  generator  POINT (-69.17997 -38.91215)
>>    4           JKL       tank  POINT (-69.29236 -38.55543)
>>    5           JKL  generator  POINT (-69.29236 -38.55543)

Can you add a screenshot showing how the points are distributed on a map? — BERA, Feb 18 '24 at 09:11

Pieter · Answer 1 · 2024-02-18T08:35:03.830

You can round the coordinates of the geometries first using shapely.set_precision:

# Round the coordinates to 5 decimals
gdf.geometry = shapely.set_precision(gdf.geometry, grid_size=0.00001)

Note: in GeoPandas 1.0, planned for release on 31 March 2024, set_precision will also be available as a GeoSeries method:

# Round the coordinates to 5 decimals
gdf.geometry = gdf.geometry.set_precision(grid_size=0.00001)

Full code sample using shapely:

import pandas as pd
import geopandas as gpd
import shapely
df = pd.DataFrame(
    {
        "fid": [0, 1, 2, 3, 4, 5],
        "location_name": ["ABC", "ABC", "DEF", "DEF", "JKL", "JKL"],
        "equipment": ["tank", "tank", "generator", "generator", "tank", "generator"],
    }
)
coords = [
    "POINT (-68.85052703049803 -46.03444179295434)",
    "POINT (-68.85052703049802 -46.03443956295743)",
    "POINT (-68.60401999999993 -37.49876999999998)",
    "POINT (-69.17996992199994 -38.91214629699994)",
    "POINT (-69.29235725099994 -38.55542628499995)",
    "POINT (-69.29235725099992 -38.5554262849999)",
]
gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)
print(gdf)
# Round the coordinates to 5 decimals
gdf.geometry = shapely.set_precision(gdf.geometry, grid_size=0.00001)
print(gdf.drop_duplicates(["location_name", "equipment", "geometry"]))

This produces the desired result:

   fid location_name  equipment                     geometry
0    0           ABC       tank  POINT (-68.85053 -46.03444)
2    2           DEF  generator  POINT (-68.60402 -37.49877)
3    3           DEF  generator  POINT (-69.17997 -38.91215)
4    4           JKL       tank  POINT (-69.29236 -38.55543)
5    5           JKL  generator  POINT (-69.29236 -38.55543)
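If the original full-precision coordinates need to be kept, the snapped geometries can serve purely as a comparison key instead of overwriting the geometry column. The following is a minimal sketch under that assumption, requiring Shapely 2.0 or later; the names snapped and dedup_key are hypothetical and chosen only for illustration.

```python
import geopandas as gpd
import pandas as pd
import shapely

df = pd.DataFrame(
    {
        "fid": [0, 1, 2, 3, 4, 5],
        "location_name": ["ABC", "ABC", "DEF", "DEF", "JKL", "JKL"],
        "equipment": ["tank", "tank", "generator", "generator", "tank", "generator"],
    }
)
coords = [
    "POINT (-68.85052703049803 -46.03444179295434)",
    "POINT (-68.85052703049802 -46.03443956295743)",
    "POINT (-68.60401999999993 -37.49876999999998)",
    "POINT (-69.17996992199994 -38.91214629699994)",
    "POINT (-69.29235725099994 -38.55542628499995)",
    "POINT (-69.29235725099992 -38.5554262849999)",
]
gdf = gpd.GeoDataFrame(data=df, geometry=gpd.GeoSeries.from_wkt(coords), crs=4326)

# Snap a copy of the geometries to a 0.00001 degree grid; gdf is not modified
snapped = shapely.set_precision(gdf.geometry.to_numpy(), grid_size=0.00001)

# Build a plain DataFrame of comparison keys (attributes + snapped geometry as WKB)
dedup_key = gdf[["location_name", "equipment"]].copy()
dedup_key["snapped_wkb"] = shapely.to_wkb(snapped)

# Keep the first row of every group of identical keys
deduped = gdf[~dedup_key.duplicated(keep="first")]
print(deduped)
```

The rows that remain carry their original, unrounded coordinates, which may matter if the data is reused later at a finer precision.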

score 4 · Answer 2 · edited Feb 18 '24 at 03:55

You could just create new columns based on your point geometry, round it, and then use drop_duplicates(), e.g.:

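# Store the rounded coordinates as helper columns
gdf["x"] = gdf.geometry.x.round(5)
gdf["y"] = gdf.geometry.y.round(5)
# drop_duplicates() returns a new frame, so keep the result
gdf = gdf.drop_duplicates(["location_name", "equipment", "x", "y"])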
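One caveat that applies to any rounding-based key: two values that are closer together than the chosen precision can still round to different results when they fall on opposite sides of a rounding boundary. The sketch below illustrates this in plain Python with made-up coordinate values (a and b are not taken from the sample data) and contrasts rounding with an absolute-tolerance comparison.

```python
import math

# Two hypothetical coordinates that differ by only 0.000002
a = 10.000004
b = 10.000006

# Rounding to 5 decimals puts them on different sides of the boundary
rounded_a = round(a, 5)
rounded_b = round(b, 5)
same_after_rounding = rounded_a == rounded_b

# An absolute-tolerance check treats them as equal at the same threshold
within_tolerance = math.isclose(a, b, rel_tol=0.0, abs_tol=1e-5)

print(rounded_a, rounded_b, same_after_rounding, within_tolerance)
```

For data like the sample above, where near-duplicates differ far below the rounding precision, this edge case is unlikely to matter, but it is worth checking when points sit close to the threshold.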

answered Feb 17 '24 at 07:00 by NielsFlohr · edited Feb 18 '24 at 03:55 by Ratislaus

score 2 · Answer 3 · answered Feb 17 '24 at 19:31

You can round off the coordinates to a desired precision level, which is not obvious with geopandas. See here.

from shapely.ops import transform

def round_coordinates(geom, ndigits=2):
    # Round every coordinate of geom to ndigits decimal places
    def _round_coords(x, y, z=None):
        x = round(x, ndigits)
        y = round(y, ndigits)
        if z is not None:
            z = round(z, ndigits)
            return (x, y, z)
        else:
            return (x, y)

    return transform(_round_coords, geom)

import pandas as pd
import geopandas as gpd
df = pd.DataFrame({'fid': [0, 1, 2, 3, 4, 5],
                   'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'],
                   'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator']
                   })
coords = ['POINT (-68.85052703049803 -46.03444179295434)',
          'POINT (-68.85052703049802 -46.03443956295743)',
          'POINT (-68.60401999999993 -37.49876999999998)',
          'POINT (-69.17996992199994 -38.91214629699994)',
          'POINT (-69.29235725099994 -38.55542628499995)',
          'POINT (-69.29235725099992 -38.5554262849999)']
gdf = gpd.GeoDataFrame(data=df,
                       geometry=gpd.GeoSeries.from_wkt(coords),
                       crs=4326)
# Retains all original entries
gdf.drop_duplicates(subset=['location_name','equipment','geometry'])
gdf['geometry'] = gdf.geometry.apply(round_coordinates, ndigits=4)
# Now drops the duplicate entry in the beginning after changing precision
gdf.drop_duplicates(subset=['location_name','equipment','geometry'])
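As a quick sanity check, the helper can be tried on individual points before applying it to a whole GeoDataFrame. The sketch below is self-contained and therefore repeats the helper, with z rounded from z; the 3D test point is a made-up value used only for illustration.

```python
from shapely.geometry import Point
from shapely.ops import transform


def round_coordinates(geom, ndigits=2):
    """Return a copy of geom with every coordinate rounded to ndigits decimals."""

    def _round_coords(x, y, z=None):
        x = round(x, ndigits)
        y = round(y, ndigits)
        if z is not None:
            return (x, y, round(z, ndigits))
        return (x, y)

    return transform(_round_coords, geom)


# 2D point taken from the sample data
p2 = round_coordinates(Point(-68.85052703049803, -46.03444179295434), ndigits=4)

# Hypothetical 3D point to exercise the z branch
p3 = round_coordinates(Point(1.23456, 2.34567, 3.45678), ndigits=2)

print(p2)
print(p3)
```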

Taras · Answer 4 · 2024-02-17T20:01:10.897

Alongside the solution suggested in NielsFlohr's answer, I would like to suggest an option that does not require splitting the point geometry into x and y coordinates.

It relies on two functions, loads() and dumps(), from the shapely.wkt module. They can be applied at the stage of creating the GeoSeries.

import pandas as pd
import geopandas as gpd
from shapely.wkt import loads, dumps
df = pd.DataFrame({
    'fid': [0, 1, 2, 3, 4, 5],
    'location_name': ['ABC', 'ABC', 'DEF', 'DEF', 'JKL', 'JKL'],
    'equipment': ['tank', 'tank', 'generator', 'generator', 'tank', 'generator'],
    'geometry': [
        'POINT (-68.85052703049803 -46.03444179295434)',
        'POINT (-68.85052703049802 -46.03443956295743)',
        'POINT (-68.60401999999993 -37.49876999999998)',
        'POINT (-69.17996992199994 -38.91214629699994)',
        'POINT (-69.29235725099994 -38.55542628499995)',
        'POINT (-69.29235725099992 -38.5554262849999)'
    ]
    })
geoms = gpd.GeoSeries.from_wkt(dumps(loads(df['geometry']), rounding_precision=5))
gdf = gpd.GeoDataFrame(data=df, geometry=geoms, crs="EPSG:4326")
print(gdf.drop_duplicates(subset=['location_name', 'equipment', 'geometry'], keep='first'))

Your output should then look like this:

   fid location_name  equipment                     geometry
0    0           ABC       tank  POINT (-68.85053 -46.03444)
2    2           DEF  generator  POINT (-68.60402 -37.49877)
3    3           DEF  generator  POINT (-69.17997 -38.91215)
4    4           JKL       tank  POINT (-69.29236 -38.55543)
5    5           JKL  generator  POINT (-69.29236 -38.55543)

How does the combination of loads and dumps actually work?

from shapely.wkt import loads, dumps
# set a string variable as a WKT representation of a point
point_wkt = 'POINT (-68.85052703049803 -46.03444179295434)' # <class 'str'>
# create a Point from its WKT representation
point = loads(point_wkt) # <class 'shapely.geometry.point.Point'>
print(point) # POINT (-68.85052703049803 -46.03444179295434)
# bring the Point back to its WKT representation, rounded to 5 decimals
point_ = dumps(point, rounding_precision=5) # <class 'str'>
print(point_) # POINT (-68.85053 -46.03444)
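The same round trip can also be written with the vectorized functions shapely.from_wkt and shapely.to_wkt, which accept whole arrays of geometries. This is a minimal sketch, assuming Shapely 2.0 or later, that uses only the first two sample points:

```python
import shapely

wkts = [
    "POINT (-68.85052703049803 -46.03444179295434)",
    "POINT (-68.85052703049802 -46.03443956295743)",
]

# Parse the WKT strings into an array of Point geometries
geoms = shapely.from_wkt(wkts)

# Write them back to WKT with 5 decimals, then parse the rounded strings again
rounded_wkts = shapely.to_wkt(geoms, rounding_precision=5)
rounded = shapely.from_wkt(rounded_wkts)

print(rounded_wkts)
# After the round trip both points share the same coordinates
print((rounded[0].x, rounded[0].y) == (rounded[1].x, rounded[1].y))
```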

Also, keep in mind that the default display precision of coordinates is 5; see the thread "Getting more precision with GeoPandas?" for details.

References:

Rounding all coordinates in shapely?

Please give me a hint, why is it -1? I am keen to educate myself :) — Taras, Feb 19 '24 at 17:40
