5

I have a column in my Postgresql database named Description which contains text. For some analytics purpose, I want to find total number of words (seperated by spaces) - all words and unique words - in that description and set those counts in columns word_count and unique_word_count. For eg.

id Description
1 What a great day
2 This is a product. It is useful

I want to have following output:

id word_count unique_word_count Description
1 4 4 What a great day
2 7 6 This is a product. It is useful

The unique_word_count for id = 2 is 6 because the word is has been repeated 2 times.

Chan Guan Yu
  • 49
  • 1
  • 6
Pranjal
  • 155
  • 1
  • 5

1 Answers1

6

EDIT:

Not being 100% happy with my answer, I awarded a bonus for a question of my own on this topic here.

Erwin Brandstetter (as usual) provided a sublime answer - please use that instead of what I wrote below - unless your use case is very simple and even then...

==============================================================

To solve your issue I did the following:

CREATE TABLE wordcount (id SERIAL NOT NULL PRIMARY KEY, description TEXT NOT NULL);

Populate it:

INSERT INTO wordcount (description) VALUES ('What a great day');
INSERT INTO wordcount (description) VALUES ('This is a product. It is useful');

Then performed the following query (fiddle available here):

WITH arranged AS
(
  SELECT id, 
  UNNEST
  (
    STRING_TO_ARRAY
    (
      REGEXP_REPLACE(description,  '[^\w\s]', '', 'g'), ' '
    )
  ) AS word, 
  description 
  FROM wordcount
)  
SELECT a.id, COUNT(a.word), COUNT(DISTINCT(a.word)), a.description
FROM arranged a
GROUP BY a.id, a.description;

Result:

id Word_Count Unique_Word_Count Description
1 4 4 What a great day
2 7 6 This is a product. It is useful

If you're new to all this, I would advise you to go through the fiddle here which shows the development of my thought processes which led to the query above.

I would advise you to put this logic into an onInsert/onUpdate trigger - it's probably fairly computationally intensive.

Vérace
  • 29,825
  • 9
  • 70
  • 84
  • @Pranjal - I would advise you to put this logic into an onInsert/onUpdate trigger - it's probably fairly computationally intensive. Version 12 will allow generated columns which could be a good solution. – Vérace Jun 05 '19 at 19:45
  • Thank you so much for this solution. I tried this solution and it worked like a charm. Only problem is that if the string array is like ['a', ' ', ' ', 'b'], then I get total count as 4 because it count spaces as well. Do you know how can I remove it? – Pranjal Jun 06 '19 at 07:03
  • Hi Pranjal, and thanks for the correct answer! I've looked at your problem string, and it appears very tricky - I've got as far as here but I'm out of ideas as the minute. I do have to put bread on the table! :-) Maybe you could ask another question pointing back to this one? I think the dollar quotes may be the key? BTW, +1 for an interesting and thought-provoking question - I learnt a lot answering it. Oh, and if I didn't say it before, welcome to the forum! :-) – Vérace Jun 06 '19 at 08:34
  • Hi again Pranjal - I've solved it! I'm like a dog with a bone once I get my teeth into a problem! :-) I'm actually going to post it as a new question, because I don't feel that my solution is very elegant and I want to ask the "real" gurus here if it's an optimal approach. I'll let you know when it's asked. – Vérace Jun 06 '19 at 10:09
  • Hi again Pranjal - you might be interested to know that I've posted a solution to your ['a', ' ', ' ', 'b'] string problem here in case you're interested! In fact, I got into it so much, I'm going to offer a bonus - as I said, I really learnt a lot. :-) – Vérace Jun 06 '19 at 12:47
  • Thanks for the supplementary question. All this is pretty much new to me but I am so much enjoying learning these new things. Thank you once again for such a wonder solution and the follow up question :) – Pranjal Jun 06 '19 at 19:36