3

I have on the order of 2M Geometry Collections with N (median: 2, stddev: 6.6, 75th: 6, 95th: 21) Point objects (repeated points allowed and important) in a PostGIS table. I want to identify clusters within each collection, with a specified maximum diameter. From these I would identify a main cluster (which I assume exists), based on the number of points it contains (and in case of a tie, choose the cluster with the smallest area).

Since I will be using these clusters further in the database, I'd rather use a technique within PostgreSQL/PostGIS.

K-means clustering is not ideal because it requires specifying a priori the number of clusters, which is impossible due to the heterogeneity of the data. And specifying 1 cluster would return the entire collection.

raphael
  • 3,407
  • 23
  • 61
  • 1
    glancing at this question, I can't easily understand the structure of your data. Are you saying you have 100K rows, each with a column that contains a GeometryCollection with 20 Point objects? Or do you mean you have 100K multipoints, each with 20 coordinates? Can you clarify the structure of your data? – BenjaminGolder Sep 12 '14 at 03:00
  • 1
    Why won't k-means do what you need? If you already tried that path you should explain so others won't try to suggest it anyway. – Jakub Kania Sep 13 '14 at 11:38

1 Answers1

5

I've written a bottom-up hierarchical clustering algorithm, it has extra parameters that might not be useful to other users, but those should be easy to remove in implementation. First, create a new type to have point ids and geometries.

CREATE TYPE pt AS (
gid character varying(32),
the_geom geometry(Point))

and a type with cluster id

CREATE TYPE clustered_pt AS (
collection_id character varying(16),
gid character varying(32),
the_geom geometry(Point)
cluster_id int)

Next the algorithm function

CREATE OR REPLACE FUNCTION buc(collection_id character varying, points pt[], radius integer)
RETURNS SETOF clustered_pt AS
$BODY$

DECLARE
srid int;
joined_clusters int[];

BEGIN

--If there's only 1 point, don't bother with the loop.
IF array_length(points,1)<2 THEN
    RETURN QUERY SELECT collection_id::character varying(16), gid, the_geom, 1 FROM unnest(points);
    RETURN;
END IF;

CREATE TEMPORARY TABLE IF NOT EXISTS points2 (LIKE pt) ON COMMIT DROP;

BEGIN
    ALTER TABLE points2 ADD COLUMN cluster_id serial;
EXCEPTION
    WHEN duplicate_column THEN --do nothing. Exception comes up when using this function multiple times
END;

TRUNCATE points2;
    --inserting points in
INSERT INTO points2(gid, the_geom)
    (SELECT (unnest(points)).* ); 

--Store the srid to reconvert points after, assumes all points have the same SRID
srid := ST_SRID(the_geom) FROM points2 LIMIT 1;

UPDATE points2 --transforming points to a UTM coordinate system so distances will be calculated in meters.
SET the_geom =  ST_TRANSFORM(the_geom,26986);

LOOP
    --If the smallest maximum distance between two clusters is greater than 2x the desired cluster radius, then there are no more clusters to be formed
    IF (SELECT ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom))  FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id 
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) LIMIT 1)
        > 2 * radius
    THEN
        EXIT;
    END IF;

    joined_clusters := ARRAY[a.cluster_id,b.cluster_id]
        FROM points2 a, points2 b
        WHERE a.cluster_id <> b.cluster_id
        GROUP BY a.cluster_id, b.cluster_id
        ORDER BY ST_MaxDistance(ST_Collect(a.the_geom),ST_Collect(b.the_geom)) 
        LIMIT 1;

    UPDATE points2
    SET cluster_id = joined_clusters[1]
    WHERE cluster_id = joined_clusters[2];

    --If there's only 1 cluster left, exit loop
    IF (SELECT COUNT(DISTINCT cluster_id) FROM points2) < 2 THEN
        EXIT;

    END IF;

END LOOP;

RETURN QUERY SELECT collection_id::character varying(16), gid, ST_TRANSFORM(the_geom, srid)::geometry(point), cluster_id FROM points2;
END;
$BODY$
LANGUAGE plpgsql

Because of the function syntax in psql, implementation goes as

WITH subq AS(
    SELECT collection_id, ARRAY_AGG((gid, the_geom)::pt) AS points
    FROM data
    GROUP BY collection_id)
SELECT (clusters).* FROM 
    (SELECT buc(collection_id, points, radius) AS clusters FROM subq
) y;

If one were clustering one gigantic point cloud rather than many little clouds, it would be a good idea to add a GiST index on the temporary table. This takes 30 min to process 1.8M collections with 3M total points (i7-3930K @3.2GHz w/ 64GB ram). Any other suggestions for optimization welcome!

raphael
  • 3,407
  • 23
  • 61