Quick example of cleaning up Verity results
Christian Ready pinged me a few days ago about an interesting problem he was having at one of his web sites. His search (Verity-based on CFMX7) was returning HTML. The HTML was escaped so the user literally saw stuff like this in the results:
Hi, my name is <b>Bob</b> and I'm a rabid developer!
I pointed out that the regex normally used to remove HTML would also work for escaped HTML, as long as the angle brackets are written as entities:
<code>
<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>
</code>
In English, this regex matches the escaped less-than sign (&lt;), then any characters (non-greedy, more on that in a bit), and then the escaped greater-than sign (&gt;). The "non-greedy" part means to match as little as possible. Without it, the regex would run from the first &lt; all the way to the last &gt; and remove the tags and everything between them! We just want to remove the tags themselves.
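To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample string and the variable names (lazy, greedy) are made up for illustration; the only difference between the two calls is the ? after .* in the pattern.
<code>
<!--- Illustrative input: escaped HTML, as it would come back from the search. --->
<cfset str = "Hi, my name is &lt;b&gt;Bob&lt;/b&gt; and I'm a rabid developer!">
<!--- Non-greedy: each escaped tag is matched and removed on its own. --->
<cfset lazy = rereplace(str, "&lt;.*?&gt;", "", "all")>
<!--- Greedy: a single match runs from the first &lt; to the last &gt;. --->
<cfset greedy = rereplace(str, "&lt;.*&gt;", "", "all")>
<cfoutput>
lazy: #lazy#<br>
greedy: #greedy#
</cfoutput>
</code>
The non-greedy version should leave "Hi, my name is Bob and I'm a rabid developer!", while the greedy version also swallows "Bob" because it treats everything between the first &lt; and the last &gt; as one match.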
This worked - but then exposed another problem. Verity was returning text with incomplete HTML tags. As an example, consider this text block:
ul>This is some <b>bold</b> html with <i>markup</i> in it.
Here is <b
Notice the incomplete HTML tags at the beginning and end of the string. Luckily, regex gives us a simple way to look for patterns at either the beginning or the end of a string. Consider these two lines:
<code>
<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>
</code>
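The first line matches an escaped < (&lt;) and everything after it, up to the end of the string. The second line matches everything from the beginning of the string up to and including the first escaped > (&gt;). Between them they catch a partial tag at either end, along with whatever bits of the tag survived. Both are meant to run after the first regex, once the complete tags are already gone; otherwise they would also eat the text sitting in front of, or behind, a complete tag.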
So all together this is the code I gave him:
<code>
<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>
<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>
</code>
Most likely this could be done in one regex instead.
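For what it's worth, here is one way a single-regex version might look. Treat it as an untested sketch rather than a drop-in replacement: the sample string is made up, and the pattern assumes the regex engine supports negative lookahead, which ColdFusion MX's rereplace() should. The lookahead is there so the "leading fragment" branch only fires when no &lt; appears before the first &gt;; without it, plain text in front of a complete tag would be removed too.
<code>
<!--- Illustrative input: escaped HTML with a partial tag at each end. --->
<cfset str = "ul&gt;This is some &lt;b&gt;bold&lt;/b&gt; html with &lt;i&gt;markup&lt;/i&gt; in it. Here is &lt;b">
<!---
  First alternative: a leading fragment, meaning everything from the start of
  the string up to the first &gt;, provided no &lt; comes before it.
  Second alternative: &lt; followed by as little as possible, ending at either
  &gt; (a complete tag) or the end of the string (a trailing fragment).
--->
<cfset cleaned = rereplace(str, "^((?!&lt;).)*?&gt;|&lt;.*?(&gt;|$)", "", "all")>
<cfoutput>#cleaned#</cfoutput>
</code>
With that input, the result should be "This is some bold html with markup in it. Here is " (note the trailing space, which trim() would take care of). The three separate calls are arguably easier to read and debug, so the one-liner is more of a curiosity than an improvement.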
Comments
That said, I've never used the spider. Does that come with the OEM version of Verity in CF?
For some people it works easily - others have issues. Peter Bell has a good blog post if you are interested:
http://www.pbell.com/index.cfm/2006/10/21/What-is-...
And I have a few things on my wiki (which I need to update)
http://www.thecrumb.com/wiki/code/coldfusion
It works exactly like the Google or Yahoo spiders - it hits your homepage (or wherever you direct it) and 'spiders' each page by following the links on each page - so it only sees what your visitors see.
text = ReReplace(text, "<[^>]*>", "", "all");
For example in my tests, when I search on "job" I could receive the following back as a result:
"Select position, job, from table where id = 5"
Any ideas?
Speaking of which - I didn't hear about any changes in CF8 to search? Nothing new or nothing announced yet? It would be really nice if they made vspider easier to work with...
So it _sounds_ like you don't need the spider at all.
Also, we have multiple sites within IIS on a single server.
Since VSPIDER can only search on localhost, I am having difficulty figuring out how this will work with our current setup. When I create the collection, it embeds the "http://localhost/..." URL within the collection, so how does one get it to reflect the true site URL of, say, "http://mysite.com"? Sorry for the ignorance here, I must be doing something wrong...
In the past I set up virtual domains off localhost:
http://localhost/siteone
http://localhost/sitetwo
Depending on how you have things pathed - sometimes images and CSS wouldn't work - but the spider doesn't care. Just as long as the links work you are set.
I think you can also spider file system paths and use -prefixmap to replace it with a URL.
Check out Peter Bell's site (http://www.pbell.com); he has a few good vspider posts.
vspider is great IF you can get it working and that unfortunately seems very hit or miss.
Maybe now that ColdFusion has an image tag (after an eternity!) they can fix up search in ColdFusion 9! :)
Jim, like you said, when it works, it works great!
I ended up mixing and matching vspider collections with regular collections to meet my needs.
Ray, I took your advice and used the replace()...works great.
Thank you all for your help; see you at CFUNITED?

