Ask a Jedi: Flexible string comparisons
Martin asks a pretty interesting question about comparing strings:
Is there a way to compare the similarities of variable strings in ColdFusion? For instance, lets say variable.foo = "15 St Mungo Place" and variable.coo = "15 St. Mungo Place"
Sql will not return a match but you and I both know that these two strings mean the same thing AND that there is only a one character difference.
My question is - Is there a way to get Coldfusion to recognise that there is only a one character difference (similarities) between the two strings. Obviously the potential applications for search queries etc.. could be enormous here so I'm VERY curious to see if this is possible.
Right off the bat, there is nothing baked into ColdFusion that will do this automatically, so we have to look at other possible solutions.
For some cases, you could use straight substitutions. You could have a list of common alternative spellings ("st. for st") and loop over them and do replacements. That will work for simple things but probably isn't going to scale up.
You could also compare strings and see how close they are. I did some Googling on that, and kept coming across the Levenshtein distance formula. I was actually working to rebuild the pseudo-code on Wikipedia when I remembered CFLib - turns out it has a udf, levDistance, that can use this logic.
This is still somewhat of a slow process as you would need to compare your search against all the words in the database. This is probably a case where something like Verity, or the full text searching feature of your database, may be more handy.
Anyone done anything like this in ColdFusion?
Comments
Using verity is a great idea...Flag it as a possible match if one of Verity's suggested phrases is a match...
Note that I am in no way saying that this is a *good* idea, just something that might be possible.
--- Ben
Query Syntax: http://lucene.apache.org/java/docs/queryparsersynt...
JavaDocs: http://lucene.apache.org/java/2_2_0/api/org/apache...
This would require making a Lucene index of values to compare against. In my own experience, Lucene is fast, and scales well as the record count of an index increases.
the results are reasonably fast, however a full table scan is requires for each individual search. So performance could be an issue if run quite often.
Theo
Web Service info:
http://www.serviceobjects.com/products/dots_addres...
Demo:
http://www.serviceobjects.com/demos/addressvalidat...
It standardizes the address, similar to how usps.com does it (but server-side):
http://zip4.usps.com/zip4/welcome.jsp
If addresses are standardized before saving them to the database, it's an easy SQL Query to compare a new address to the other addresses in the database.
(Not to mention all the other benefits of having standardized, validated addresses -- minimal returned mail, professional looking labels, etc.)
My reason for asking is that my boss has asked me to write a back-end WEB BASED app to compare closely matching addresses for manual verification. We get two sets of data from different sources and the address is the only key we can use to compare.
Saint (St. St) is a good example because it's easy to confuse with Street (St St.) so look ups aren't a perfect solution.
MySql also has "Full Text Searching" so you could add FULLTEXT(Address) to your table and do a reReplace on the address string to transform "15 St Mungo Place" into a Sql statement such as :
SELECT foo.Address, coo.Address
FROM Properties as foo, Muggles as coo
WHERE MATCH(foo.Address) AGAINST('+15 +St +Mungo +Place')
This is fine for comparing one address at a time (as you would get ranked multiple results) but what I really want to do is end up with a single table of ONLY the most likely matches in descending order.
Levenshtein could do this but could score poorly with flats: eg. Flat 1, 16 St Mungo Place & 1/16 St. Mungo Place (Levenshtein score of 9).
Using an Address Validation Service like Adam suggested (I will need to find a UK one but excellent idea Adam) will certainly help this.
And finally... I could also use some proprietary software to do this and just press a button but then.... a) I might not be able to afford that and b)what would I actually learn from doing that?
Obviously a combination of methods will have to be used for matching and filtering, I'm just glad I "Asked a Jedi." I will definitely reply here again if I find anything else!!
Thanks again to all.
Martin

