4

I am facing a situation where users SHOULD submit items that are written in plain English. However, it happens that names have accents - coming from various languages. I have a tsvector column in my table which is indexed for fulltext search.

Here is the info about this column, and relevant info about the table:

                                                         Table "public.companies"
     Column     |          Type          |                       Modifiers                        | Storage  | Stats target | Description 
----------------+------------------------+--------------------------------------------------------+----------+--------------+-------------
 id             | integer                | not null default nextval('companies_id_seq'::regclass) | plain    |              | 
 name           | character varying(128) | not null                                               | extended |              | 
 description    | text                   |                                                        | extended |              | 
 tsmeta         | tsvector               |                                                        | extended |              | 

Indexes:
    "companies_pkey" PRIMARY KEY, btree (id) CLUSTER
    "companies_search_idx" gin (tsmeta)
Triggers:
    companies_tsmeta_update BEFORE INSERT OR UPDATE ON companies FOR EACH ROW EXECUTE PROCEDURE companies_tsmeta_trigger()

Now the problem comes when the user creates a name like français. Not all the users have the ç character at hand and those who don't will have troubles finding this entry - meaning that it will be stored with the cedilla-c in the tsvector column. So my thought was to use the unaccent extension in order to strip these accents. However now users that just type français will have problems and will not find the entry. So I came up with a function that just concatenates both versions (accented and accent-stripped) of the name and descriptions. However, descriptions can get pretty large and I worry the index will take too much storage.

create or replace function companies_tsmeta_trigger() returns trigger as $$
begin
  new.tsmeta :=
    setweight(to_tsvector('english', coalesce(new.name, '')), 'A') ||
    setweight(to_tsvector('english', unaccent(coalesce(new.name, ''))), 'B') ||
    setweight(to_tsvector('english', coalesce(new.description, '')), 'C') ||
    setweight(to_tsvector('english', unaccent(coalesce(new.description, ''))), 'D');
  return new;
end;
$$ language plpgsql;

What are your suggestions?

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
Victor
  • 436
  • 1
  • 5
  • 14
  • Related, yes: https://dba.stackexchange.com/questions/177020/creating-a-case-insensitive-and-accent-diacritics-insensitive-search-on-a-field But not a duplicate, weights are not covered. – Erwin Brandstetter May 30 '18 at 15:16
  • @Victor do you have any objections to confining the question to one of weight, myself or Erwin will be happy to do it. There is something unique here -- Erwin is right about that, but I don't 80% the question and answer are already covered on this site if you don't target weight specifically. – Evan Carroll May 30 '18 at 17:38

1 Answers1

3

My first impulse was this: Use an expression index on the unaccented strings only. Don't store tsmeta in the table redundantly. So you also don't need the trigger. And certainly don't bloat the index with original and unaccented string. This way your table and index are smaller and faster on read and write.

However, The index expression needs to be IMMUTABLE. setweight() and to_tsvector(regconfig,text) are IMMUTABLE, that's good. But unaccent() is only STABLE. Read the detailed assessment here first:

If you have multiple use cases, create the function wrapper f_unaccent() as instructed over there.

For just the case at hand, we can wrap the whole tsvector creation in a single IMMUTABLE function:

CREATE OR REPLACE FUNCTION f_tsmeta(_a text, _b text)
  RETURNS tsvector AS
$func$
  SELECT setweight(to_tsvector('english', public.unaccent('public.unaccent', _a)), 'A')
      || setweight(to_tsvector('english', public.unaccent('public.unaccent', COALESCE(_b, ''))), 'B')
$func$  LANGUAGE sql IMMUTABLE;

Assuming that the additional module unaccent is installed in the schema public as explained in the linked answer.

Since name is defined NOT NULL, we don't need COALESCE for that column.

Then the index can look like this:

CREATE INDEX companies_search_idx ON companies USING gin (f_tsmeta(name, description));

Used in queries like:

SELECT * FROM companies
WHERE f_tsmeta(name, description) @@ to_tsquery('english', 'foo');

On second thought, weights are not included in a GIN index. So it may pay after all to include the pre-computed value in the table. The manual:

GIN indexes store only the words (lexemes) of tsvector values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights.

With the demonstrated expression index, the expression has to be re-evaluated when using weights, which adds extra cost.

Redundant column or expression index, it should be an improvement to integrate the unaccent dictionary into a custom TEXT SEARCH CONFIGURATION directly and build index (and table column) using that, as demonstrated by Evan in this related answer:

Simplifies things and should be more efficient.

Erwin Brandstetter
  • 175,982
  • 27
  • 439
  • 600
  • Everything you are proposing seems reasonable and looks like it will help me implement what I need. So I will accept your answer, although I have not tested this yet! Thank you very much! Is it ok if I come up with any later question in a comment here (related to your answer)? – Victor May 31 '18 at 08:08
  • @Victor: It's best to post follow-up questions as questions in their own right. You can always link to this question / answer for context and leave a comment here to link back (and get my attention). Except for minor details, or course, that would not warrant a new question. – Erwin Brandstetter May 31 '18 at 16:23