Right off the bat, there is nothing baked into ColdFusion that will do this automatically, so we have to look at other possible solutions. For some cases, you could use straight substitutions. You could have a list of common alternative spellings ("st. for st") and loop over them and do replacements. That will work for simple things but probably isn't going to scale up. You could also compare strings and see how close they are. I did some Googling on that, and kept coming across the Levenshtein distance formula. I was actually working to rebuild the pseudo-code on Wikipedia when I remembered CFLib - turns out it has a udf, levDistance, that can use this logic. This is still somewhat of a slow process as you would need to compare your search against all the words in the database. This is probably a case where something like Verity, or the full text searching feature of your database, may be more handy. Anyone done anything like this in ColdFusion?Is there a way to compare the similarities of variable strings in ColdFusion? For instance, lets say variable.foo = "15 St Mungo Place" and variable.coo = "15 St. Mungo Place" Sql will not return a match but you and I both know that these two strings mean the same thing AND that there is only a one character difference. My question is - Is there a way to get Coldfusion to recognise that there is only a one character difference (similarities) between the two strings. Obviously the potential applications for search queries etc.. could be enormous here so I'm VERY curious to see if this is possible.
Martin asks a pretty interesting question about comparing strings:
Comment 1 written by Scott P on 3 January 2008, at 2:00 PM
Comment 2 written by David Herman on 3 January 2008, at 2:22 PM
Comment 3 written by Joshua Curtiss on 3 January 2008, at 2:34 PM
Using verity is a great idea...Flag it as a possible match if one of Verity's suggested phrases is a match...
Comment 4 written by Rick O on 3 January 2008, at 2:59 PM
Note that I am in no way saying that this is a *good* idea, just something that might be possible.
Comment 5 written by Ben Forta on 3 January 2008, at 3:23 PM
--- Ben
Comment 6 written by joel on 3 January 2008, at 3:32 PM
Comment 7 written by Scott Fitchet on 3 January 2008, at 3:34 PM
Comment 8 written by orangepips on 3 January 2008, at 8:04 PM
Query Syntax: http://lucene.apache.org/java/docs/queryparsersynt...
JavaDocs: http://lucene.apache.org/java/2_2_0/api/org/apache...
This would require making a Lucene index of values to compare against. In my own experience, Lucene is fast, and scales well as the record count of an index increases.
Comment 9 written by Gus on 3 January 2008, at 8:20 PM
Comment 10 written by theo on 3 January 2008, at 9:18 PM
the results are reasonably fast, however a full table scan is requires for each individual search. So performance could be an issue if run quite often.
Theo
Comment 11 written by Adam Fairbanks on 4 January 2008, at 12:55 AM
Web Service info:
http://www.serviceobjects.com/products/dots_addres...
Demo:
http://www.serviceobjects.com/demos/addressvalidat...
It standardizes the address, similar to how usps.com does it (but server-side):
http://zip4.usps.com/zip4/welcome.jsp
If addresses are standardized before saving them to the database, it's an easy SQL Query to compare a new address to the other addresses in the database.
(Not to mention all the other benefits of having standardized, validated addresses -- minimal returned mail, professional looking labels, etc.)
Comment 12 written by Martin on 4 January 2008, at 7:21 AM
My reason for asking is that my boss has asked me to write a back-end WEB BASED app to compare closely matching addresses for manual verification. We get two sets of data from different sources and the address is the only key we can use to compare.
Saint (St. St) is a good example because it's easy to confuse with Street (St St.) so look ups aren't a perfect solution.
MySql also has "Full Text Searching" so you could add FULLTEXT(Address) to your table and do a reReplace on the address string to transform "15 St Mungo Place" into a Sql statement such as :
SELECT foo.Address, coo.Address
FROM Properties as foo, Muggles as coo
WHERE MATCH(foo.Address) AGAINST('+15 +St +Mungo +Place')
This is fine for comparing one address at a time (as you would get ranked multiple results) but what I really want to do is end up with a single table of ONLY the most likely matches in descending order.
Levenshtein could do this but could score poorly with flats: eg. Flat 1, 16 St Mungo Place & 1/16 St. Mungo Place (Levenshtein score of 9).
Using an Address Validation Service like Adam suggested (I will need to find a UK one but excellent idea Adam) will certainly help this.
And finally... I could also use some proprietary software to do this and just press a button but then.... a) I might not be able to afford that and b)what would I actually learn from doing that?
Obviously a combination of methods will have to be used for matching and filtering, I'm just glad I "Asked a Jedi." I will definitely reply here again if I find anything else!!
Thanks again to all.
Martin
Comment 13 written by Martin on 31 October 2008, at 11:53 AM
I don't know if this has been changed in MySql 5 but
SELECT SOUNDEX("10 St Andrews"),SOUNDEX("10 St. Andrews"),SOUNDEX("10 St Andrew's"),SOUNDEX("10, St-Andrews") ALL return the same result.
This is VERY cool!
[Add Comment] [Subscribe to Comments]