How to find the most common phrases in a varchar field

Question

I have a table which contains support tickets. Each ticket has a title and many of them have common phrases in them (XXX cannot connect to YYY, XXX requires a new YYY, XXX requires upgrading, etc. etc.)

I would like to do some analysis on the data and extract these common phrases along with a count of their usage.

Repeated character sequences would be acceptable, but ideally I would want to split the title into words and identify repeated sequences of words. Here is a simple example:

Title
------------------------------------------------
John cannot connect to SERVER01
The backup service cannot connect to the client
Mary cannot connect her laptop to the network

From this data I would like to be able to extract the following:

Phrase                                             Count
-------------------------------------------------- -------
cannot connect                                     3
cannot connect to                                  2

I have a plan to do this using a c# application and brute force, but can anyone suggest a more elegant solution?

I like the question, time ago I had to do the same on a Metadata catalog to generate a list of potential 'keywords' from a varchar field. At the time I didn't even thought of doing it by querying; instead I extract the contents of the field, parse the data, count and remove duplicates on a '5-time' rule (less than 5 times doesn't count) and finally remove non related words (a, the, from,, etc). Would like to see if there's a way of doing it with SQL. — Nelson, Apr 08 '15 at 11:14
Build a fulltext index and query it for the matches of every single token or connections of the tokens. Solving this in SQL will be far too complicated imho. — Falcon, Apr 08 '15 at 11:51
The sample results should also include: 'to', 3, 'to the', 2 and 'the', 3, right? — ypercubeᵀᴹ, Apr 08 '15 at 12:52
Yes @ypercube, just trying to keep it brief. I wouldn't mind if it contained 'John', 1, 'SERVER01', 1, etc. too. — Darren, Apr 08 '15 at 12:54
I don't see how a fulltext index would help me @Falcon; I don't know what the phrases will be to start with. — Darren, Apr 08 '15 at 12:56
It at least tokenizes for you and can give you token or 'hit' counts for phrases. Makes things much easier. What you are trying to do is in fact very computational intensive and quite academic. You can compare it to searching for substrings in human DNA. Google "String matching algorithms" and you will realize, that it won't be feasible to implement this in SQL imho. — Falcon, Apr 08 '15 at 13:47
I suspect you are right @Falcon, but there are people on here who are a lot smarter than me so I thought it was worth asking... — Darren, Apr 08 '15 at 13:49

score 0 · Accepted Answer · answered Apr 09 '15 at 13:21

So I ended up solving this with a C# application. I thought I would share my code in case anyone else needed something similar. It's using Entity Framework to do the data access.

        using (var ticketsContext = new CMS.Data.Tickets.TicketsContext("name=CMS"))
        {
            using (var analysisContext = new TicketTitleAnalysis.Analysis())
            {
                int lastTicketID = analysisContext.Phrases.Max(p => (int?)p.LastTicketID).GetValueOrDefault();

                var tickets = (from t in ticketsContext.Tickets where t.TicketID > lastTicketID orderby t.TicketID select new { t.TicketID, t.Title });

                foreach (var ticket in tickets)
                {
                    var titleWords = ticket.Title.ToLower().Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Where(w => char.IsLetter(w[0])).ToArray();

                    for (int startingWord = 0; startingWord < titleWords.Length; startingWord++)
                    {
                        for (int numberOfWords = 1; numberOfWords <= titleWords.Length - startingWord; numberOfWords++)
                        {
                            var words = string.Join(" ", titleWords.Skip(startingWord).Take(numberOfWords));

                            if (words.Length > 5)
                            {
                                var phrase = analysisContext.Phrases.FirstOrDefault(p => p.Words == words);

                                if (phrase == null)
                                {
                                    phrase = new Phrase();
                                    phrase.Words = words;

                                    analysisContext.Phrases.Add(phrase);
                                }

                                phrase.LastTicketID = ticket.TicketID;
                                phrase.Count++;
                            }
                        }
                    }

                    analysisContext.SaveChanges();
                }
            }
        }

Could you possibly show this in vb.net? I tried getting it converted but it wouldn't work — schizoid04, Sep 22 '18 at 22:37
Ran into needing this again today, if someone could get this in vb.net... would super appreciate it..... :/ — schizoid04, Jan 31 '19 at 22:17
Sorry, I don't have the time or the skills to convert this for you, but you could try using a convert like this one: http://converter.telerik.com/ — Darren, Feb 01 '19 at 11:31
The telerik converter fails, but I understand what you mean with the time thing. Was just hopeful :) — schizoid04, Feb 01 '19 at 14:40

How to find the most common phrases in a varchar field

1 Answers1