I have a table that contains data that is extracted from text documents. The data is stored in a column called "CONTENT" for which I have created this index using GIN:
CREATE INDEX "File_contentIndex"
ON "File"
USING gin
(setweight(to_tsvector('english'::regconfig
, COALESCE("CONTENT", ''::character varying)::text), 'C'::"char"));
I use the following query to perform a full text search on the table:
SELECT "ITEMID",
ts_rank(setweight(to_tsvector('english', coalesce("CONTENT",'')), 'C') ,
plainto_tsquery('english', 'searchTerm')) AS "RANK"
FROM "File"
WHERE setweight(to_tsvector('english', coalesce("CONTENT",'')), 'C')
@@ plainto_tsquery('english', 'searchTerm')
ORDER BY "RANK" DESC
LIMIT 5;
The File table contains 250 000 rows and each "CONTENT" entry consists of one random word and a text string that is the same for all rows.
Now, when I search for a random word (1 hit in the whole table) the query runs very fast (<100 ms). However, when I search for a word that is present in all the rows, the query runs extremely slow (10 minutes or more).
EXPLAIN ANALYZE shows that for the 1-hit search a Bitmap Index Scan followed by a Bitmap Heap Scan is performed. For the slow search a Seq Scan is performed instead, which is what is taking so long.
Granted, it is not realistic to have the same data in all rows. But since I can't control the text documents that's uploaded by the users, nor the searches they perform it is possible that a similar scenario arises (search on terms with very high occurrence in DB). How can I increase the performance of my search query for such a scenario?
Running PostgreSQL 9.3.4
Query plans from EXPLAIN ANALYZE:
Quick search (1 hit in DB)
"Limit (cost=2802.89..2802.90 rows=5 width=26) (actual time=0.037..0.037 rows=1 loops=1)"
" -> Sort (cost=2802.89..2806.15 rows=1305 width=26) (actual time=0.037..0.037 rows=1 loops=1)"
" Sort Key: (ts_rank(setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char"), '''wfecg'''::tsquery))"
" Sort Method: quicksort Memory: 25kB"
" -> Bitmap Heap Scan on "File" (cost=38.12..2781.21 rows=1305 width=26) (actual time=0.030..0.031 rows=1 loops=1)"
" Recheck Cond: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''wfecg'''::tsquery)"
" -> Bitmap Index Scan on "File_contentIndex" (cost=0.00..37.79 rows=1305 width=0) (actual time=0.012..0.012 rows=1 loops=1)"
" Index Cond: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''wfecg'''::tsquery)"
"Total runtime: 0.069 ms"
Slow search (250k hits in DB)
"Limit (cost=14876.82..14876.84 rows=5 width=26) (actual time=519667.404..519667.405 rows=5 loops=1)"
" -> Sort (cost=14876.82..15529.37 rows=261017 width=26) (actual time=519667.402..519667.402 rows=5 loops=1)"
" Sort Key: (ts_rank(setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char"), '''cyberspace'''::tsquery))"
" Sort Method: top-N heapsort Memory: 25kB"
" -> Seq Scan on "File" (cost=0.00..10541.43 rows=261017 width=26) (actual time=2.097..519465.953 rows=261011 loops=1)"
" Filter: (setweight(to_tsvector('english'::regconfig, (COALESCE("CONTENT", ''::character varying))::text), 'C'::"char") @@ '''cyberspace'''::tsquery)"
" Rows Removed by Filter: 6"
"Total runtime: 519667.429 ms"
ORDER BY "RANK" DESC. I would investigatepg_trgmwith GiST index and the similarity / distance operators as alternative. Consider: http://dba.stackexchange.com/questions/56224/postgresql-fts-and-trigram-similarity-query-optimization/56232#56232. Might even produce "better" results (besides being faster). – Erwin Brandstetter Sep 11 '15 at 00:47explain (analyze, buffers), preferably with track_io_timing set toON? There is no way it should take 520 seconds to seq scan that table, unless you have it stored on a RAID of floppy disks. Something is definitely pathological there. Also, what is your setting forrandom_page_cost, and the other cost parameters? – jjanes Sep 14 '15 at 18:11