With us_states table
If you don't have another table with a complete set of possible values for us_state then create it:
CREATE TABLE us_states (
us_state varchar(2) PRIMARY KEY
-- ... more columns?
);
For one, you may want to add a FK constraint on us_customers.us_state to enforce legal values. But more to the point, it allows for a substantially faster query:
SELECT u.us_state, d.id -- more columns?
FROM us_states u
LEFT JOIN LATERAL (
SELECT id -- more columns?
FROM us_customers c
WHERE c.us_state = u.us_state
LIMIT 3 -- 3 customers per state
) d ON true
ORDER BY u.us_state, d.id;
Extracting 50 distinct values from "hundreds of thousands of rows" with DISTINCT every time can be more expensive than the rest of the query and would be a big waste of time.
While it's undefined which rows to pick, an arbitrary selection is good enough. So no ORDER BY in he subquery, that's cheaper.
Make that a LEFT JOIN to include every state at least once, even with no matching rows in us_customers.
Without us_states table
If you don't have a us_states table, there are still substantially faster ways than with the techniques suggested by @Julien. Emulate a loose index scan with a CTE. You need the index outlined below.
WITH RECURSIVE us_states AS (
( -- parentheses required
SELECT us_state
FROM us_customers
ORDER BY 1
LIMIT 1
)
UNION ALL
SELECT (SELECT c.us_state
FROM us_customers c
WHERE c.us_state > u.us_state
ORDER BY 1
LIMIT 1) -- correlated subquery
FROM us_states u
WHERE u.us_state IS NOT NULL
)
SELECT us_state
FROM us_states
WHERE us_state IS NOT NULL;
This query can be a drop-in replacement for the missing us_states table, or it can be used to create the missing table.
WITH RECURSIVE us_states AS (
( -- parentheses required
SELECT us_state
FROM us_customers
ORDER BY us_state
LIMIT 1
)
UNION ALL
SELECT (SELECT c.us_state
FROM us_customers c
WHERE c.us_state > u.us_state
ORDER BY 1
LIMIT 1)
FROM us_states u
WHERE u.us_state IS NOT NULL
)
SELECT u.us_state, c.id -- more columns?
FROM us_states u
CROSS JOIN LATERAL (
SELECT c.id -- more columns?
FROM us_customers c
WHERE c.us_state = u.us_state -- eliminates NULL value from CTE
LIMIT 3 -- 3 customers per state
) c
ORDER BY u.us_state;
Using CROSS JOIN this time since the CTE only found existing values anyway and we don't need to add WHERE us_state IS NOT NULL this way.
Or, if you can guarantee there are at least 3 rows per us_state, this would be very fast:
WITH RECURSIVE us_states AS (
(
SELECT us_state, id
FROM us_customers3
ORDER BY us_state
LIMIT 3
)
UNION ALL
SELECT c.*
FROM (SELECT us_state FROM us_states LIMIT 1) u
, LATERAL (
SELECT c.us_state, c.id
FROM us_customers3 c
WHERE c.us_state > u.us_state
ORDER BY c.us_state
LIMIT 3
) c
)
TABLE us_states
ORDER BY us_state;
The last query fails if any state has less than three rows.
Index
Either way, create a multicolumn index like this!
CREATE INDEX data_covering_idx ON data (us_state, id);
db<>fiddle here
Old sqlfiddle
Related:
us_statestable and we replace it forsin the 2nd query, it will not read the whole table, corerct? (I mean there is not much difference between Julien'sCROSS JOIN LATERALand yoursLEFT JOIN LATERAL(always assuimng there is a states tables we can use instead of the select distinct subquery). – ypercubeᵀᴹ Mar 26 '16 at 08:49SELECT DISTINCT us_state FROM databecause there are few distinct values with many rows, as explained here. The rest is cheap either way assuming we have the mentioned index(es), best a multicolumn index on(us_state, id). Else the (uncalled)ORDER BYin theLATERALsubquery gets expensive, too. I only choseLEFT JOINto include all states in the result either way, which probably makes sense once we have aus_statestable. – Erwin Brandstetter Mar 26 '16 at 14:04