Get partial match from GIN indexed TSVECTOR column

Question

I would like to get results by query this:

SELECT *
FROM  (
   SELECT id, subject
   FROM mailboxes
   WHERE tsv @@ plainto_tsquery('avail')
   ) AS t1
ORDER BY id DESC;

This works and return rows with tsv containing 'Available'. But if I use 'avai' (truncating '...lable') it does not find anything.

Do all queries have to be in dictionary? Can't we just query such letters? I have a table that contains e-mail bodies (content) and I would like to make queries fast as it grows rapidly. Currently, I am using:

... WHERE content ~* 'avail'

Erwin Brandstetter · Accepted Answer · 2023-08-01T18:07:13.970

Do all queries have to be in dictionary?

No. Because only word stems (according to the used text search configuration) are in the index to begin with. But more importantly:

No. Because, on top of that Full Text Search is also capable of prefix matching:

This would work:

SELECT id, subject
FROM   mailboxes
WHERE  tsv @@ to_tsquery('simple', 'avail:*')
ORDER  BY id DESC;

Note 3 things:

Use to_tsquery(), not plainto_tsquery(), in this case because (quoting the manual):

... plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input
Use the 'simple' text search configuration to generate the tsquery since you obviously want to take the word 'avail' as is and not apply stemming.
Append :* to make it a prefix search, i.e find all lexemes starting with 'avail'.

Important: This is a prefix search on lexemes (word stems) in the document. A regular expression match without wildcards (content ~* 'avail') is not exactly the same! The latter is not left-anchored (to the start of lexemes) and would also find 'FOOavail' etc.

It's unclear whether you want the behavior outlined in your query or the equivalent of the added regular expression. Trigram indexes using the additional module (pg_trgm) are the right tool for that. There are many related questions on dba.SE, try a search.

Overview:

Pattern matching with LIKE, SIMILAR TO or regular expressions

Demo

SELECT *
FROM  (
   VALUES
     ('Zend has no framework')
   , ('Zend Framework')
   ) sub(t), to_tsvector(t) AS tsv
WHERE  tsv @@ to_tsquery('zend <-> fram:*');

 id |       t        |          tsv
----+----------------+------------------------
  2 | Zend Framework | 'framework':2 'zend':1

Related answer (chapter "Different approach to optimize search"):

How can I generate all trailing substrings following a delimeter?

Emails?

Since you mentioned emails, be aware that the text search parser identifies emails and does not split them into separate words / lexemes. Consider:

SELECT ts_debug('english', 'xangr@some.domain.com')

(email,"Email address",xangr@some.domain.com,{simple},simple,{xangr@some.domain.com})

I would replace the separators @ and . in your emails with space (' ') to index contained words.

Also, since you are dealing with names in emails, not with English (or some other language) words, I would use the 'simple' text search configuration to disable stemming and other language features:

Build the ts_vector column with:

SELECT to_tsvector('simple', translate('joe.xangr@some.domain.com', '@.', '  ')) AS tsv;

I'm deleting my answer for this because either way because I'm clearly wrong for the first time ever and I'd rather not be reminded of it. I have two questions for you 1) where is :* documented, and 2) shouldn't a mention to build to_tsvector('simple'..) come hand-in-hand with instructions that future querying of that tsv will require the 'simple' configuration too to tsquery? I think you should clarify the ramifications of disabling stemming on a tsvector/tsquery. — Evan Carroll, Dec 13 '16 at 03:53
@EvanCarroll: Using the 'simple' configuration is not required. It just avoids stemming (like 'rats' to 'rat') which may or may not be desirable. Not desirable for the given example. Manual: I've added links above... — Erwin Brandstetter, Dec 13 '16 at 04:35
@EvanCarroll: Aside: Thinking you're wrong the first time ever, would be the second time. And that would be wrong, recursively. ;) — Erwin Brandstetter, Dec 13 '16 at 04:42
@ErwinBrandstetter, Wow, your way just gave me a full speed search. Before your way it tooks 0.380ms to get result. After your way it took 0.079 ms. — xangr, Dec 13 '16 at 06:17
@ErwinBrandstetter, Is there any way to wildcard before any word? Like *vail*...? Or something like *:vail:* I couldn't find any proper doc. about this. Maybe looking into wrong pages and don't want to open other question as its regarding to same question. Thanks. — xangr, Dec 14 '16 at 08:56
@xangr: No, FTS only offers prefix matching for lexemes. For anything more, look to pg_trgm. FTS is faster (with a smaller index). You can even combine both indexes ... — Erwin Brandstetter, Dec 14 '16 at 10:36
@ErwinBrandstetter is still unable to make suffix/middle matching fror lexems? — P.A, Oct 23 '17 at 14:48
@HmidiSlim: There were no functional changes for text search in Postgres 10 (that I am aware of) except that it now also works for JSON and JSONB. — Erwin Brandstetter, Oct 23 '17 at 16:50
begin with postgres11, we can use starts_with, might be faster than full text search. if just for prefix search. — jian, Jun 19 '23 at 04:10

Get partial match from GIN indexed TSVECTOR column

1 Answers1

Demo

Emails?

Linked