Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)?

Question

I have a table us_customers that looks like this (with hundreds of thousands of rows):

+----------+----------+
|    id    | us_state |
+----------+----------+
| 12345678 | MA       |
| 23456781 | AL       |
| 34567812 | GA       |
| 45678123 | FL       |
| 56781234 | AZ       |
| 67812345 | MA       |
| 78123456 | CO       |
| 81234567 | FL       |
+----------+----------+

... and I want to SELECT a sample of n customers from each us_state.

Is there a way to do that cleanly in PostgreSQL 9.3?

I can get one customer from each us_state easily with:

SELECT DISTINCT ON (us_state) id
FROM us_customers
ORDER BY us_state;

But if I want, say, three customers from each state, is there a way I can do this without running the same query multiple times?

Julien Vavasseur · Accepted Answer · 2016-03-25T18:26:33.400

You can number the rows within each us_state (ordered by id) using the ROW_NUMBER() window function and keep only the first n rows per state:

SELECT * 
FROM (
  SELECT *
    , ROW_NUMBER() OVER(PARTITION BY us_state ORDER BY id) as n
  FROM data
) as ord
WHERE n <= 2
ORDER BY us_state
;

Or you can use CROSS JOIN LATERAL with a subquery:

SELECT l.*
FROM (
  SELECT DISTINCT us_state FROM data
) as s
CROSS JOIN LATERAL (
  SELECT * 
  FROM data d
  WHERE d.us_state = s.us_state
  ORDER BY id
  LIMIT 2
) as l
ORDER BY l.us_state
;

Sample SQL Fiddle here.
I used a small sample of 1 to 3 rows per state, so I limited the result to 2 rows per state.
I ordered by id, but you can change that and order by whatever works best for you.

Output with my small sample:

       id | us_state | n
      123 |       AL | 1 
      456 |       AL | 2 
 56781234 |       AZ | 1 
 78123456 |       CO | 1 
 45678123 |       FL | 1 
 81234567 |       FL | 2 
 34567812 |       GA | 1 
      123 |       MA | 1 
      456 |       MA | 2

Note that n is the result of ROW_NUMBER() and doesn't exist in the second query. On a big table, an index on the partition column (us_state) and the order column (id here) will help.
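Since the question asks for a sample, one way to apply the "order by whatever works best" remark is to order randomly within each state. This is only a sketch against the sample table data shown in this answer, using the built-in random() function. Note that a random order cannot be read from an index, so every partition has to be sorted in full.

```sql
-- Sketch: up to 2 randomly chosen rows per state from the sample table "data".
-- Raise the limit in "WHERE n <= 2" to get more rows per state.
SELECT id, us_state
FROM (
  SELECT id, us_state
       , ROW_NUMBER() OVER (PARTITION BY us_state ORDER BY random()) AS n
  FROM data
) AS ord
WHERE n <= 2
ORDER BY us_state;
```

Each run can return different rows; states with fewer than 2 rows simply return what they have.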

Sample used:

CREATE TABLE data
    ("id" int, "us_state" varchar(2))
;

INSERT INTO data
    ("id", "us_state")
VALUES
    (12345678, 'MA'),
    (123, 'MA'),
    (456, 'MA'),
    (23456781, 'AL'),
    (123, 'AL'),
    (456, 'AL'),
    (34567812, 'GA'),
    (45678123, 'FL'),
    (56781234, 'AZ'),
    (67812345, 'MA'),
    (78123456, 'CO'),
    (81234567, 'FL')
;

Both queries have to read the whole table (repeatedly) in Postgres and are much slower than necessary since Postgres does not have loose index scans (yet). There are ways around it. — Erwin Brandstetter, Mar 26 '16 at 03:43
@ErwinBrandstetter if there is a us_states table and we substitute it for s in the 2nd query, it will not read the whole table, correct? (I mean there is not much difference between Julien's CROSS JOIN LATERAL and your LEFT JOIN LATERAL, always assuming there is a states table we can use instead of the SELECT DISTINCT subquery.) — ypercubeᵀᴹ, Mar 26 '16 at 08:49
@ypercubeᵀᴹ: That's right. The expensive part in the 2nd query is the SELECT DISTINCT us_state FROM data because there are few distinct values with many rows, as explained here. The rest is cheap either way assuming we have the mentioned index(es), best a multicolumn index on (us_state, id). Else the (uncalled-for) ORDER BY in the LATERAL subquery gets expensive, too. I only chose LEFT JOIN to include all states in the result either way, which probably makes sense once we have a us_states table. — Erwin Brandstetter, Mar 26 '16 at 14:04

Erwin Brandstetter · Answer 2 · 2021-11-09T22:12:52.650

With `us_states` table

If you don't have another table with a complete set of possible values for us_state then create it:

CREATE TABLE us_states (
  us_state varchar(2) PRIMARY KEY
  -- ... more columns?
);

For one, you may want to add a FK constraint on us_customers.us_state to enforce legal values. But more to the point, it allows for a substantially faster query:

SELECT u.us_state, d.id  -- more columns?
FROM   us_states u
LEFT   JOIN LATERAL (
   SELECT id             -- more columns?
   FROM   us_customers c
   WHERE  c.us_state = u.us_state
   LIMIT  3              -- 3 customers per state
   ) d ON true
ORDER  BY u.us_state, d.id;

Extracting 50 distinct values from "hundreds of thousands of rows" with DISTINCT every time can be more expensive than the rest of the query and would be a big waste of time.

While it's undefined which rows to pick, an arbitrary selection is good enough. So there is no ORDER BY in the subquery, which is cheaper.

Make that a LEFT JOIN to include every state at least once, even with no matching rows in us_customers.
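The FK constraint mentioned above could be set up roughly like this. It is a sketch that assumes us_states was just created empty, that us_customers.us_state has a type compatible with varchar(2), and that the constraint name (made up here) is free to use.

```sql
-- One-time fill of the lookup table from existing data.
-- This pays the full DISTINCT cost once instead of on every query.
INSERT INTO us_states (us_state)
SELECT DISTINCT us_state
FROM   us_customers
WHERE  us_state IS NOT NULL;

-- Enforce legal values from now on (constraint name is hypothetical).
ALTER TABLE us_customers
  ADD CONSTRAINT us_customers_us_state_fkey
  FOREIGN KEY (us_state) REFERENCES us_states (us_state);
```

Filling the table this way only captures states that already have customers; any other legal values would have to be inserted separately.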

Without `us_states` table

If you don't have a us_states table, there are still substantially faster ways than the techniques suggested by @Julien. Emulate a loose index scan with a recursive CTE. You need the index outlined below.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY 1
   LIMIT  1
   )
UNION ALL
   SELECT (SELECT c.us_state 
           FROM   us_customers c
           WHERE  c.us_state > u.us_state
           ORDER  BY 1
           LIMIT  1)                    -- correlated subquery
   FROM   us_states u
   WHERE  u.us_state IS NOT NULL
   )
SELECT us_state
FROM   us_states
WHERE  us_state IS NOT NULL;

This query can be a drop-in replacement for the missing us_states table, or it can be used to create the missing table.

WITH RECURSIVE us_states AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY us_state
   LIMIT  1
   )
UNION ALL
   SELECT (SELECT c.us_state
           FROM   us_customers c
           WHERE  c.us_state > u.us_state
           ORDER  BY 1
           LIMIT  1)
   FROM   us_states u
   WHERE  u.us_state IS NOT NULL
   )

SELECT u.us_state, c.id  -- more columns?
FROM   us_states u
CROSS  JOIN LATERAL (
   SELECT c.id           -- more columns?
   FROM   us_customers c
   WHERE  c.us_state = u.us_state  -- eliminates NULL value from CTE
   LIMIT  3              -- 3 customers per state
   ) c
ORDER  BY u.us_state;

This time I use CROSS JOIN, since the CTE only finds existing values anyway, and this way we don't need to add WHERE us_state IS NOT NULL.
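As mentioned above, the same recursive CTE can also be used to create the missing table. A minimal sketch, assuming an empty us_states table defined as in the first section; the CTE is named cte here (my choice) to avoid confusion with the target table:

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT us_state
   FROM   us_customers
   ORDER  BY us_state
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT c.us_state
           FROM   us_customers c
           WHERE  c.us_state > u.us_state
           ORDER  BY c.us_state
           LIMIT  1)           -- next greater state, or NULL at the end
   FROM   cte u
   WHERE  u.us_state IS NOT NULL
   )
INSERT INTO us_states (us_state)
SELECT us_state
FROM   cte
WHERE  us_state IS NOT NULL;   -- drop the terminating NULL row
```

After that, the faster query from the first section can be used directly.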

Or, if you can guarantee there are at least 3 rows per us_state, this would be very fast:

WITH RECURSIVE us_states AS (
   (
   SELECT us_state, id
   FROM   us_customers
   ORDER  BY us_state
   LIMIT  3
   )
UNION ALL
   SELECT c.*
   FROM  (SELECT us_state FROM us_states LIMIT 1) u
   ,      LATERAL (
      SELECT c.us_state, c.id
      FROM   us_customers c
      WHERE  c.us_state > u.us_state
      ORDER  BY c.us_state
      LIMIT  3
      ) c
   )
TABLE us_states
ORDER BY us_state;

The last query fails if any state has fewer than three rows.

Index

Either way, create a multicolumn index like this!

CREATE INDEX us_customers_covering_idx ON us_customers (us_state, id);
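To check whether the planner actually uses the multicolumn index on (us_state, id), you could inspect the plan of one of the queries above. This sketch assumes the us_states and us_customers tables from this answer exist. Keep in mind that EXPLAIN with ANALYZE really executes the query.

```sql
-- Show the actual plan, timings and buffer usage for the LATERAL query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.us_state, d.id
FROM   us_states u
LEFT   JOIN LATERAL (
   SELECT id
   FROM   us_customers c
   WHERE  c.us_state = u.us_state
   LIMIT  3
   ) d ON true
ORDER  BY u.us_state, d.id;
```

In the output, look for an index scan or index-only scan on us_customers inside the lateral subquery rather than a sequential scan.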

db<>fiddle here
Old sqlfiddle

Very cool, thanks for the detailed explanation! I'll try this on Monday and see how it goes. — Shaun Scovil, Mar 26 '16 at 11:40
