Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used and it definitely falls into the "manual" (and probably slow) category.
First I made my string:
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
I then used some regex to get an array of words:
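The snippet for this step is missing here. A minimal sketch of what it might look like, assuming reMatch (new in ColdFusion 8) with the [[:word:]] character class that the comments refer to. The array name words is an assumption, not taken from the original:

```cfml
<!--- Assumes "string" holds the paragraph saved above; "words" is an assumed variable name --->
<!--- reMatch returns an array containing every substring that matches the pattern --->
<cfset words = reMatch("[[:word:]]+", string)>
```

With this pattern an apostrophe ends a match, so "Let's" would come back as the two entries "Let" and "s".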
Next I created a structure:
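The snippet for this step is missing as well. It was presumably a one-liner along these lines. The name wordCount is the one the loop code uses:

```cfml
<!--- Empty structure that will map each word to the number of times it was seen --->
<cfset wordCount = structNew()>
```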
And then looped over the array and inserted the words into the structure:
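<!--- "words" is an assumed name for the array produced by the regex step --->
<cfloop index="word" array="#words#">
  <cfif structKeyExists(wordCount, word)>
    <cfset wordCount[word]++>
  <cfelse>
    <cfset wordCount[word] = 1>
  </cfif>
</cfloop>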
Note that this will be inherently case-insensitive, because ColdFusion structure keys are case-insensitive, which I think is a good thing. At this point we are done, but I added some display code as well:
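The display code loops over an array named sorted, but the line that creates it is missing here. A minimal sketch, assuming structSort was used with the "textnocase" sort type that a reader quotes in the comments:

```cfml
<!--- Assumes wordCount was filled by the loop above; "sorted" is the name the display loop expects --->
<!--- structSort returns an array of keys ordered by their values; "textnocase" compares the counts as text --->
<cfset sorted = structSort(wordCount, "textnocase")>
```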
<table border="1" width="400">
  <tr>
    <th width="50%">Word</th>
    <th>Count</th>
  </tr>

  <cfloop index="word" array="#sorted#">
    <cfoutput>
    <tr>
      <td>#word#</td>
      <td>#wordCount[word]#</td>
    </tr>
    </cfoutput>
  </cfloop>
</table>
Comment 1 written by nick on 2 August 2007, at 12:31 PM
Comment 2 written by Ben Nadel on 2 August 2007, at 12:32 PM
Comment 3 written by Quan Tran on 2 August 2007, at 12:32 PM
Comment 4 written by Gareth on 2 August 2007, at 12:54 PM
#ListLen(string, " #Chr(13)##Chr(10)#")#
(it seems to work with the string variable you posted)
Comment 5 written by Raymond Camden on 2 August 2007, at 12:58 PM
Comment 6 written by Gareth on 2 August 2007, at 1:02 PM
Let me try that again :)
<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z ]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>
<cfscript>
// had to use this as CF does not allow lookbehind in regular expressions, but JAVA does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>
#ListLen(new_string, " ")#
Comment 7 written by Gareth on 2 August 2007, at 1:05 PM
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )
Comment 8 written by todd sharp on 2 August 2007, at 1:49 PM
Comment 9 written by Ben Nadel on 2 August 2007, at 1:56 PM
I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why: these are all single "words":
hatin'
let's
sweet-ass
cf.objective()
O'connell
... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!
Comment 10 written by Raymond Camden on 2 August 2007, at 2:09 PM
I think if you could just get single quotes to work, you would get most "real" words.
I wonder - maybe switch from [[:word:]] to
(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)
Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.
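A rough sketch of the first idea, covering only the middle two parts of that description (one or more letters, then an optional apostrophe that must be followed by more letters). The sample sentence and variable names are made up for illustration:

```cfml
<!--- Hypothetical sample text; "sample" and "quotedWords" are illustrative names --->
<cfset sample = "Let's see if Ray lets Paris call him.">

<!--- Letters, optionally followed by an apostrophe and more letters, count as one word --->
<cfset quotedWords = reMatch("[[:alpha:]]+('[[:alpha:]]+)?", sample)>

<cfdump var="#quotedWords#">
```

This should keep "Let's" as a single entry instead of splitting it into "Let" and "s". It does not attempt hyphens, periods, or parentheses inside words.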
Comment 11 written by noname on 2 August 2007, at 2:26 PM
Comment 12 written by Ben Nadel on 2 August 2007, at 2:27 PM
Comment 13 written by Ron Alexander on 2 August 2007, at 3:23 PM
That seems to do the trick for let's lets. But it still doesn't make the counting any more elegant.
Comment 14 written by Raymond Camden on 2 August 2007, at 3:27 PM
Comment 15 written by Jonathon on 2 August 2007, at 3:47 PM
Comment 16 written by Raymond Camden on 2 August 2007, at 3:50 PM
Comment 17 written by Ron Alexander on 2 August 2007, at 3:59 PM
(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+
It matches method chains like myarray.dedup().sort()
Comment 18 written by Ron Alexander on 2 August 2007, at 4:02 PM
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"
Don't forget (like I did) to throw in the \'\- into the last non-capturing group.
Ben, does that meet your needs?
Comment 19 written by Dustin on 2 August 2007, at 4:07 PM
<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
<cfset queryAddRow(wordQry) />
<cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">
Comment 20 written by db on 3 August 2007, at 8:59 AM
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
<cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>
Comment 21 written by db on 3 August 2007, at 10:28 AM
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>
Comment 22 written by Johnny on 4 August 2007, at 2:31 PM
<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"' index="word">
<cfset word = replaceList(word,"',.","")/>
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word] = wordCount[word] + 1/>
<cfelse>
<cfset wordCount[word] = 1/>
</cfif>
</cfloop>
Comment 23 written by Mike Cohen on 25 August 2008, at 11:06 AM
<cfset myString = "blah blah blah bldfadsff fd ">
<cfset mCounter = stringToArray(myString,"a")>
<cfset numberOfAs = arraylen(mcounter)>
?
Comment 24 written by todd sharp on 25 August 2008, at 11:25 AM
Comment 25 written by todd sharp on 25 August 2008, at 11:26 AM
Comment 26 written by D. Davis on 23 February 2009, at 4:54 PM
Wanted to note: the sort on this ("textnocase") needs to be "numeric","desc", otherwise you're not getting your top numbers right (i.e., a textnocase sort would look like 4,3,20,17).
Great code on this as a first step to making a word cloud, looping it on DB-pulled text fields.
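A small sketch of that suggestion, using made-up counts that mirror the 4, 3, 20, 17 example. The structure and variable names here are hypothetical:

```cfml
<!--- Hypothetical sample counts; in the post these would come from the wordCount structure --->
<cfset sampleCount = structNew()>
<cfset sampleCount["words"] = 4>
<cfset sampleCount["repeated"] = 3>
<cfset sampleCount["the"] = 20>
<cfset sampleCount["point"] = 17>

<!--- Sort the keys by their values as numbers, largest first --->
<cfset byCount = structSort(sampleCount, "numeric", "desc")>

<cfdump var="#byCount#">
```

Sorted numerically, the keys should come back in the order the, point, words, repeated, so the most frequent words land at the top.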
Comment 27 written by Raymond Camden on 25 February 2009, at 4:10 PM