I have two tables like this:
CREATE TABLE cmap5 (
name varchar(2000),
lexemes tsquery
);
and
CREATE TABLE IF NOT EXISTS synonyms_all_gin_tsvcolumn (
cid int NOT NULL, -- REFERENCES pubchem_compounds_index(cid)
name varchar(2000) NOT NULL,
synonym varchar(2000) NOT NULL,
tsv_syns tsvector,
PRIMARY KEY (cid, name, synonym)
);
My current query is:
SELECT s.cid, s.synonym, c.name, ts_rank(s.tsv_syns,c.lexemes,16)
FROM synonyms_all_gin_tsvcolumn s, cmap5 c
WHERE c.lexemes @@ s.tsv_syns
And the output is:
cid | synonym | name (query) | rank
5474706 | 10-Methoxyharmalan | 10-methoxyharmalan | 0.0901673
1416 | (+/-)12,13-EODE | 12,13-EODE | 0.211562
5356421 | LEUKOTOXIN B (12,13-EODE) | 12,13-EODE | 0.211562
180933 | 1,4-Chrysenequinone | 1,4-chrysenequinone | 0.211562
5283035 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-delta 12 14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-deoxy-Delta(12,14)-prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta-12,14-prostaglandin J2 | 15-delta prostaglandin J2 | 0.304975
5311211 | 15-Deoxy-delta 12, 14-Prostaglandin J2| 15-delta prostaglandin J2 | 0.304975
I would like to return the name matches of all rows in cmap5 in my main table ranked by the ts_rank() function but for each row in cmap5 I want to:
- select only the best X
cids to each query (group by cid) - or
ORDER BY my results as 1+ts_rank/count(cid)
To get the best match I tried to add select distinct on c.name, but when the rank is the same I want to get the cid with more matches to the query. I have tried adding a simple group by at the end of the query but I get an error, how could I do this?
Added comments:
On one hand for those results whose rank is the same, eg. above 5283035 and 5311211, get 5311211 as top result because that cid has more hits than 5283035, so I sort of want to take into account the number of hits / cid in the rank, like final_rank = 1+ts_rank(cid)/no. of hits(cid).
On the other hand I want to get the first X cids per query name. If I use LIMIT X it returns the first X results of the entire query table, not the first X per name (row) of the query table as I want.