Counting Word Instances in a String
Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used and it definitely falls into the "manual" (and probably slow) category.
First I made my string:
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
I then used some regex to get an array of words:
<cfset words = reMatch("[[:word:]]+", string)>
Next I created a structure:
<cfset wordCount = structNew()>
And then looped over the array and inserted the words into the structure:
<cfloop index="word" array="#words#">
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word]++>
<cfelse>
<cfset wordCount[word] = 1>
</cfif>
</cfloop>
Note that this is inherently case-insensitive, because ColdFusion structure keys are case-insensitive, which I think is a good thing. At this point we are done, but I added some display code as well:
<cfset sorted = structSort(wordCount, "numeric", "desc")>
<table border="1" width="400">
<tr>
<th width="50%">Word</th>
<th>Count</th>
</tr>
<cfloop index="word" array="#sorted#">
<cfoutput>
<tr>
<td>#word#</td>
<td>#wordCount[word]#</td>
</tr>
</cfoutput>
</cfloop>
</table>
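For what it's worth, the counting steps above could be rolled into a single function. This is only a sketch: countWords is a made-up name that does not appear in the code above, and it uses the same [[:word:]] pattern, so it has the same limits. The last line shows the case-insensitive struct keys in action.

```cfm
<cfscript>
// Hypothetical helper: returns a struct mapping each word to its count.
function countWords(text) {
    var counts = structNew();
    var words = reMatch("[[:word:]]+", text);
    var i = 0;
    var word = "";
    for (i = 1; i <= arrayLen(words); i++) {
        word = words[i];
        if (structKeyExists(counts, word)) {
            counts[word] = counts[word] + 1;
        } else {
            counts[word] = 1;
        }
    }
    return counts;
}

result = countWords("The cat and the hat");
// "The" and "the" share one key because struct keys ignore case.
writeOutput(result["THE"]); // outputs 2
</cfscript>
```

The function form mostly makes it easier to reuse the count on more than one string; the work is still one pass over the array.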
Comments
#ListLen(string, " #Chr(13)##Chr(10)#")#
(it seems to work with the string variable you posted)
Let me try that again :)
<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z\s]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>
<cfscript>
// had to use this because CF does not allow lookbehind in regular expressions, but Java does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>
#ListLen(new_string, " ")#
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )
I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why: these are all single "words":
hatin'
let's
sweet-ass
cf.objective()
O'connell
... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!
I think if you could just get single quotes to work, you would get most "real" words.
I wonder - maybe switch from [[:word:]] to
(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)
Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.
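Both ideas can be tried quickly. The sketch below is an approximation, not a tested solution: the pattern only allows an apostrophe between two runs of letters, so it simplifies the rule described above, and the sample sentence and variable names are made up for illustration.

```cfm
<cfscript>
sample = "Let's see what Ray lets O'connell do.";

// Option 1: letters, optionally followed by an apostrophe and more letters.
words = reMatch("[A-Za-z]+('[A-Za-z]+)?", sample);
writeOutput(arrayToList(words, "|"));
// Let's|see|what|Ray|lets|O'connell|do

// Option 2: drop the apostrophes first, then match plain runs of letters.
stripped = reMatch("[A-Za-z]+", replace(sample, "'", "", "all"));
writeOutput(arrayToList(stripped, "|"));
// Lets|see|what|Ray|lets|Oconnell|do
</cfscript>
```

Note that option 1 still drops a trailing apostrophe (hatin' comes back as hatin) and splits hyphenated words, so it only covers the single-quote case.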
That seems to do the trick for "let's" versus "lets". But it still doesn't make the counting any more elegant.
(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+
It matches method chains like myarray.dedup().sort()
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"
Don't forget (like I did) to throw in the \'\- into the last non-capturing group.
Ben, does that meet your needs?
<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
<cfset queryAddRow(wordQry) />
<cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
<cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>
<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"#chr(10)##chr(13)#' index="word">
<cfset word = replaceList(word,"',.","")/>
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word] = wordCount[word] + 1/>
<cfelse>
<cfset wordCount[word] = 1/>
</cfif>
</cfloop>

