Find the most frequent phrases in a text string

  • I have a list of text strings and I want to know which two phrases occurred the most often in all of my text strings. For example:


    i want to purchase a new phone
    considering the purchase of phone
    want to buy some tools


    The results would be:
    purchase phone = occurred 2 times
    want to = occurred 2 times


    Can anyone help please?

  • Re: Find the most frequent phrases in a text string


    This is a pretty big ask, and not a lot to go on - does the code need to be case sensitive? Does it include words such as "an, a, the, or, and" ?


    Please provide more detail on what exactly you are trying to achieve and supply a sample workbook with some data in to test.


    [sw]*[/sw]

  • Re: Find the most frequent phrases in a text string


    The code does not have to be case sensitive and does not have to exclude stop words such as "an, a, the, or, and". If stop words can be excluded, then this is a plus!


    Please find attached a sample workbook. Tab 1 includes raw text strings and Tab 2 includes desired output/results.

  • Re: Find the most frequent phrases in a text string


    There's an old adage - Horses for courses.


    This is not Excels' forté and a bit like using a thoroughbred to pull a brewers dray - hang on, that implies Excel is a thoroughbred, so maybe it would be more accurate to say it's a bit like entering a Suffolk Punch (still a thoroughbred though :)) in the Grand National... it'll plod along and eventually get to the end, but it'll have to skirt around those big jumps.


    That implies it can be done - and it can, in a manner of speaking.


    A simple UDF to count the occurrances of a word or phase in a range is easy. You simply take the length of the original string and the length of the original string with the characters/words to count removed; subtract the second from the first and divide by the length of the original string.


    It won't be efficient and will be limited in what it can count (your 'Purchase phone' Vs the actual 'purchase a new phone' cannot be handled without more code - excluding 'new' & 'a', for example), but it can be refined (improved, if you like) as much as you want at the expense of speed - the more you put in, the slower it gets.


    Your workbook with the UDF attached, this calculates the results (and they correspond to your sample results). The entire block can then be sorted manually to order the list.

Participate now!

Don’t have an account yet? Register yourself now and be a part of our community!