0

I am trying to optimize a text search that involves two large tables, persons and emails, and currently uses ILIKE operators. I have those two indexes:

CREATE INDEX idx_persons_full_name ON persons 
USING gin (full_name gin_trgm_ops, to_iso_date(date_created) gin_trgm_ops);

CREATE INDEX idx_emails_gin_trgm_idx ON emails USING gin (email gin_trgm_ops);

This query uses the GIN index:

explain analyze select p.id, e.id from persons p left join emails e on p.email_id=e.id 
where email ilike '%john%';
->  Bitmap Heap Scan on emails e  (cost=97.80..1000.32 rows=1265 width=4) (actual time=0.884..2.662 rows=89 loops=1)
      Recheck Cond: (email ~~* '%john%'::text)
      Heap Blocks: exact=81
      ->  Bitmap Index Scan on idx_emails_gin_trgm_idx  (cost=0.00..97.49 rows=1265 width=0) (actual time=0.829..0.829 rows=89 loops=1)
            Index Cond: (email ~~* '%john%'::text)

... however when I add a second ilike clause I end up with a sequential scan on emails:

explain analyze select p.id, e.id from persons p left join emails e on p.email_id=e.id 
where p.full_name ilike '%john%' or e.email ilike '%john%';

Hash Left Join (cost=2254.88..4527.57 rows=1598 width=8) (actual time=30.053..97.757 rows=138 loops=1) Hash Cond: (p.email_id = e.id) Filter: ((p.full_name ~~* '%john%'::text) OR (e.email ~~* '%john%'::text)) Rows Removed by Filter: 78569 -> Seq Scan on persons p (cost=0.00..2066.07 rows=78707 width=23) (actual time=0.016..4.569 rows=78707 loops=1) -> Hash (cost=1472.17..1472.17 rows=62617 width=27) (actual time=29.364..29.364 rows=62459 loops=1) Buckets: 65536 Batches: 1 Memory Usage: 4148kB -> Seq Scan on emails e (cost=0.00..1472.17 rows=62617 width=27) (actual time=0.049..13.182 rows=62459 loops=1)

I can think a a few ways to address the issue, what would you recommend?

  1. Changing something about the existing queries or indexes to get the indexes to be used
  2. bringing the e.email into the persons table (using triggers, or in a separate materialized view) and using a single gin index on that denormalized table
  3. Replacing the ILIKEs with trigram searches

The final query looks like this, with pagination on date_created & id

SELECT
  p.id AS id, p.is_staff AS is_staff, p.first_name AS first_name, p.last_name AS last_name, p.email_id AS email_id, p.address_id AS address_id, p.is_instructor AS is_instructor, p.date_created AS date_created, p.birthdate AS birthdate, p.phone AS phone, p.full_name AS full_name, p.external_id AS external_id, p.venue_id AS venue_id, p.ssn AS ssn, p.gender AS gender, e.id AS id, e.email AS email, e.confirmed AS confirmed, e.owner_id AS owner_id, e.account_id AS account_id
FROM
  persons p 
  LEFT JOIN emails e ON p.email_id = e.id
WHERE
  p.date_archived IS NULL
  AND p.date_created < '2023-06-13T08:51:00Z'::timestampwithtimezone
  AND p.id < 321482
  AND p.is_staff = false
  AND (
    p.full_name ILIKE UNACCENT('%john%')
    OR e.email ILIKE '%john%'
    OR TO_ISO_DATE(p.date_created) ILIKE '%john%'
  )
ORDER BY
  p.date_created DESC
LIMIT 40

Thanks! Franck

Franck
  • 13
  • 6
  • 1
    That is in the nature of OR; avoid it. See this question. – Laurenz Albe Feb 28 '24 at 09:11
  • I know, but there is no easy alternative. I gave up on materialized view - too much data duplication and I forgot they have to be refreshed manually. Triggers to maintain joined columns + a single multi-column GIN index work pretty well for my use case. – Franck Feb 28 '24 at 15:11
  • If you know, why don't you use that knowledge? – Laurenz Albe Feb 28 '24 at 15:24
  • uhh... because there is no easy alternative? If you have a suggestion I'll take it, I do need to match on multiple columns, and AFAIK unions won't play well with pagination. – Franck Feb 28 '24 at 18:06
  • This bit made me laugh: OR TO_ISO_DATE(p.date_created) ILIKE '%john%'. Maybe differentiate between string and date filters? That should buy quite a bit already. – Erwin Brandstetter Feb 28 '24 at 18:09
  • There is a single search field for an entire table, that's a user interface choice. So it's "john" in my example, it could be "2023" in the next query. – Franck Feb 28 '24 at 20:19
  • Multiple columns are no problem -> multiple UNION ALLs. Pagination won't work worse than with your current query: if few rows are found, the query is fast. If many rows are found, performance will suck no matter what. – Laurenz Albe Feb 28 '24 at 20:28
  • Thanks, I tried splitting the query above into a union of 3 selects, there was no improvements however I don't have the right indexes anymore after adding my cloned columns. I will give that a try when I get a chance. I assume the pattern is to move the order by & the limit outside the union, and Postgres will be smart enough to retrieve just enough results from each result set to meet the global "limit 40" clause? – Franck Feb 29 '24 at 09:37

0 Answers0