2

I've been learning Postgres (coming from SQL Server), and this error really confuses me.

Here is the code with some sample data:

create table T (
ID serial primary key,
A varchar(1),
B varchar(1),
C varchar(1)
)

↑ Testing table.

insert into T (A, B, C)
values('A', 'B', 'C'), ('A', 'B', 'C')

↑ Insert duplicates

delete from T
where ID in (
    select t.ID
    from (  select ID, row_number() over (partition by A,B,C order by A,B,C) as rn
            from T) as t
    where t.rn < (select max(t.rn) from t)
    )

↑ Delete duplicate keeping the last entry.

The problem is in the (select max(t.rn) from t) I am assuming this is a noob mistake related to not knowing postgres syntax when it comes to referencing columns with aliases?

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
Chessbrain
  • 1,193
  • 2
  • 15
  • 23
  • 1
    Unrelated, but: you can simplify this by using dense_rank() over (... order by a desc, b desc, c desc) as rn and then use where t.rn > 1 removing the need for the max() subquery –  Sep 06 '19 at 12:45

1 Answers1

6

First off, partition by A,B,C order by A,B,C makes no sense. Since you intent to keep the "last" row (meaning the one with the greatest ID), you probably meant:

partition by A,B,C order by ID

Still, the syntax is invalid on top of this. This subquery expression contains a reference to a column of the outer query: (select max(t.rn) from t). The scope of the subquery does not include columns from the outer query so rn is not visible in there. Only columns of the table t are.

You could use a CTE to allow the reference and make the syntax valid:

WITH cte AS (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id) AS rn FROM t)
DELETE FROM t
WHERE  id IN (
    SELECT id
    FROM   cte
    WHERE  rn < (SELECT max(rn) FROM cte)
    )

Still, the query is dangerous nonsense. Do not use this!
Comparing to the greatest row number is logical nonsense as each group of peers might have a different number of dupes. Would delete a lot more than it should.

Simpler, and correct:

DELETE FROM t
WHERE  id IN (
    SELECT t1.id
    FROM  (SELECT id, row_number() OVER (PARTITION by a,b,c ORDER BY id DESC) AS rn FROM t) t1
    WHERE  t1.rn > 1  --  all but the latest
    );

Which, in turn, can be had more cheaply as (assuming all columns NOT NULL!):

DELETE FROM t
WHERE  EXISTS (
   SELECT FROM t AS t1
   WHERE t1.a = t.a
   AND   t1.b = t.b
   AND   t1.c = t.c
   AND   t1.id > t.id
   );

"... where a dupe exists with a greater ID".

Related:

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600