Counting Word Instances in a String

Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used and it definitely falls into the "manual" (and probably slow) category.

First I made my string:

<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>

I then used some regex to get an array of words:

<cfset words = reMatch("[[:word:]]+", string)>
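
For anyone unfamiliar with reMatch(), it returns an array holding every substring that matches the pattern. Here is a quick sketch using a made-up sample string (the variable names sample and sampleWords are mine, purely for illustration):

```cfml
<!--- Hypothetical sample string, not the paragraph from the post --->
<cfset sample = "Let's count: the cat, the hat.">

<!--- Each run of word characters becomes one array element --->
<cfset sampleWords = reMatch("[[:word:]]+", sample)>

<cfdump var="#sampleWords#">
```

Punctuation is not a word character, so it is dropped. The apostrophe is not one either, which means "Let's" comes back as two entries, "Let" and "s".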

Next I created a structure:

<cfset wordCount = structNew()>

And then looped over the array and inserted the words into the structure:

<cfloop index="word" array="#words#">
   <cfif structKeyExists(wordCount, word)>
      <cfset wordCount[word]++>
   <cfelse>
      <cfset wordCount[word] = 1>
   </cfif>
</cfloop>

Note that this will be inherently case-insensitive, since ColdFusion structure keys ignore case, which I think is a good thing. At this point we are done, but I added some display code as well:

<cfset sorted = structSort(wordCount, "numeric", "desc")>

<table border="1" width="400">
<tr>
   <th width="50%">Word</th>
   <th>Count</th>
</tr>

<cfloop index="word" array="#sorted#">
   <cfoutput>
   <tr>
      <td>#word#</td>
      <td>#wordCount[word]#</td>
   </tr>
   </cfoutput>
</cfloop>
</table>
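
If you want to reuse this, the counting steps could be rolled into a UDF. This is only a sketch: countWords and its argument name str are made up for the example, and it assumes ColdFusion 8 or later for reMatch() and array looping.

```cfml
<cffunction name="countWords" returntype="struct" output="false">
   <cfargument name="str" type="string" required="true">

   <!--- Local variables: the result struct, the matched words, the loop index --->
   <cfset var result = structNew()>
   <cfset var matches = reMatch("[[:word:]]+", arguments.str)>
   <cfset var w = "">

   <!--- Same counting logic as in the post, just scoped to the function --->
   <cfloop index="w" array="#matches#">
      <cfif structKeyExists(result, w)>
         <cfset result[w] = result[w] + 1>
      <cfelse>
         <cfset result[w] = 1>
      </cfif>
   </cfloop>

   <cfreturn result>
</cffunction>

<!--- Example call with a throwaway string --->
<cfset counts = countWords("the cat and the hat")>
<cfoutput>#counts["the"]#</cfoutput>
```

The function returns the same kind of structure as wordCount above, so the sorting and display code would work on it unchanged.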

Comments

If "Paris Point" becomes part of the daily lexicon, you can officially coin it. Nice code work too.
# Posted By nick | 8/2/07 12:31 PM
REMatch() makes me happy :)
# Posted By Ben Nadel | 8/2/07 12:32 PM
Probably not faster, but you could create a query with a single column and use qoq to get the count with a group by.
# Posted By Quan Tran | 8/2/07 12:32 PM
Couldn't you do something like
#ListLen(string, " #Chr(13)##Chr(10)#")#

(it seems to work with the string variable you posted)
# Posted By Gareth | 8/2/07 12:54 PM
Gareth, that counts the words. We need a count for each individual word, i.e., the string has "the" ten times, and so on.
# Posted By Raymond Camden | 8/2/07 12:58 PM
Whoops, unique instances...

Let me try that again :)

<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z ]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>

<cfscript>
// had to use this as CF does not allow lookbehind in regular expressions, but JAVA does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>

#ListLen(new_string, " ")#
# Posted By Gareth | 8/2/07 1:02 PM
OK, I'm going to stop now :)
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )
# Posted By Gareth | 8/2/07 1:05 PM
Issue: the word "Let's" gets broken into "Let" and "s" because of your RE. Solution? Still working on it... ;)
# Posted By todd sharp | 8/2/07 1:49 PM
@Todd,

I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why, these are all single "words":

hatin'
let's
sweet-ass
cf.objective()
O'connell

... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!
# Posted By Ben Nadel | 8/2/07 1:56 PM
Although, one could argue that sweet-ass wouldn't be so bad as two words. hatin' is slang, and would become hatin, which is ok.

I think if you could just get single quotes to work, you would get most "real" words.

I wonder - maybe switch from [[:word:]] to

(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)

Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.
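
A rough sketch of that second idea, reusing the string variable from the post (cleaned is just a name I made up for the example):

```cfml
<!--- Drop apostrophes first so "Let's" is matched as the single word "Lets" --->
<cfset cleaned = replace(string, "'", "", "all")>

<!--- Then match words exactly as before --->
<cfset words = reMatch("[[:word:]]+", cleaned)>
```

The rest of the counting loop would stay the same.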
# Posted By Raymond Camden | 8/2/07 2:09 PM
But, isn't the next word signified by a space? So everything between a space, is a word?
# Posted By noname | 8/2/07 2:26 PM
Yeah, stripping out the single quotes is probably the easiest thing to do. Least amount of damage for the best results.
# Posted By Ben Nadel | 8/2/07 2:27 PM
So why not just change from "[[:word:]]" to "[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+"

That seems to do the trick for let's lets. But it still doesn't make the counting any more elegant.
# Posted By Ron Alexander | 8/2/07 3:23 PM
That's pretty cool there, Ron.
# Posted By Raymond Camden | 8/2/07 3:27 PM
There's a CF IRC channel floating around somewhere? Anyone feel like sharing the info? :)
# Posted By Jonathon | 8/2/07 3:47 PM
The one I use is #coldfusion on Dalnet.
# Posted By Raymond Camden | 8/2/07 3:50 PM
This is better:

(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+

It matches method chains like myarray.dedup().sort()
# Posted By Ron Alexander | 8/2/07 3:59 PM
And just to match Ben's "hatin'" example:
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"

Don't forget (like I did) to throw in the \'\- into the last non-capturing group.

Ben does that meet your needs?
# Posted By Ron Alexander | 8/2/07 4:02 PM
This is probably how I would do it:

<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
   <cfset queryAddRow(wordQry) />
   <cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
   SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">
# Posted By Dustin | 8/2/07 4:07 PM
this seems to work for me:
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
   <cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>
# Posted By db | 8/3/07 8:59 AM
no, that's not right - it's including punctuation as part of the word. so i tried with the "hatin'" list and got this working:
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>
# Posted By db | 8/3/07 10:28 AM
Wouldn't this work?

<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"' index="word">
   <cfset word = replaceList(word,"',.","")/>
   <cfif structKeyExists(wordCount, word)>
      <cfset wordCount[word] = wordCount[word] + 1/>
   <cfelse>
      <cfset wordCount[word] = 1/>
   </cfif>
</cfloop>
# Posted By Johnny | 8/4/07 2:31 PM