Not many people know that ColdFusion ships with an HTTP spider that integrates with Verity. Unfortunately, this spider only works with localhost as the server. This means that if you want to spider multiple sites, you can't. Well, not without playing with your host headers. (More information on the Verity Spider and ColdFusion may be found here.)
What I worked on today was a way to work around this limitation. It turns out - if you have a sitemap, you already have a "spider" of your site. BlogCFC supports sitemaps out of the box, and I've blogged in the past about a simple UDF to generate sitemaps. Let's look at how we can convert a sitemap into Verity data.
To begin with - let's take a look at some very simple sitemap data.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://www.foo.com/index.cfm</loc>
    </url>
    <url>
        <loc>http://www.foo.com/index2.cfm</loc>
    </url>
    <url>
        <loc>http://www.foo.com/index3.cfm</loc>
    </url>
</urlset>

<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>

<cfset request.data = structNew()>

<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">

    <cfset tname = "thread#x#">
    <cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
        <cfhttp url="#attributes.url#" result="result">
        <cfset request.data[attributes.url] = structNew()>
        <cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
        <cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
        <!--- remove all html from body --->
        <cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
        <cfset headers = getMetaHeaders(result.filecontent)>
        <cfset request.data[attributes.url].keywords = "">
        <cfset request.data[attributes.url].description = "">

        <cfset request.data[attributes.url].x = headers>

        <!--- find description and keywords --->
        <cfloop index="x" from="1" to="#arrayLen(headers)#">
            <cfif structKeyExists(headers[x], "name")>
                <cfif headers[x].name is "description">
                    <cfset request.data[attributes.url].description = headers[x].content>
                <cfelseif headers[x].name is "keywords">
                    <cfset request.data[attributes.url].keywords = headers[x].content>
                </cfif>
            </cfif>
        </cfloop>
    </cfthread>

</cfloop>

<cfthread action="join" name="#structKeyList(cfthread)#" />

<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
    <cfset queryAddRow(info)>
    <cfset querySetCell(info, "url", c)>
    <cfset querySetCell(info, "body", request.data[c].body)>
    <cfset querySetCell(info, "title", request.data[c].title)>
    <cfset querySetCell(info, "keywords", request.data[c].keywords)>
    <cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>

<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">

<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>

<cfdump var="#status#">

- First off - you can really tell that Verity wasn't 100% sure what to do with my data. That's why I removed the HTML. I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing.
- Second - my code, ignoring the min(), will suck down every URL and index it. As I mentioned, sitemaps can store more than just URLs. They can also store the last time each URL was modified. If I were reading my XML data once a day, then it would make sense to only suck down URLs that were modified today. This would greatly improve the speed of the indexing.
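
The second note can be sketched roughly as follows. This is a minimal illustration, not part of the code in this post: it assumes each url node may carry an optional lastmod child whose value starts with a YYYY-MM-DD date (as the sitemap protocol allows), and the variable names today, toFetch, stamp, and modified are made up for the example. Entries without a usable lastmod are kept, since there is no way to tell whether they changed.

```cfml
<!--- Sketch: build a list of sitemap URLs whose lastmod is today or later. --->
<!--- Assumption: sitemap.xml sits next to this template, as in the post. --->
<cfset myxml = xmlParse(fileRead(expandPath("./sitemap.xml")))>
<cfset today = createDate(year(now()), month(now()), day(now()))>
<cfset toFetch = arrayNew(1)>

<cfloop index="i" from="1" to="#arrayLen(myxml.urlset.url)#">
    <cfset node = myxml.urlset.url[i]>
    <cfif structKeyExists(node, "lastmod") and len(trim(node.lastmod.xmlText)) gte 10>
        <!--- Keep only the date part (YYYY-MM-DD) and turn it into a date object. --->
        <cfset stamp = left(trim(node.lastmod.xmlText), 10)>
        <cfset modified = createDate(listGetAt(stamp, 1, "-"), listGetAt(stamp, 2, "-"), listGetAt(stamp, 3, "-"))>
        <cfif dateCompare(modified, today, "d") gte 0>
            <cfset arrayAppend(toFetch, node.loc.xmlText)>
        </cfif>
    <cfelse>
        <!--- No usable lastmod: fetch it anyway. --->
        <cfset arrayAppend(toFetch, node.loc.xmlText)>
    </cfif>
</cfloop>

<cfoutput><p>#arrayLen(toFetch)# of #arrayLen(myxml.urlset.url)# URLs need indexing.</p></cfoutput>
```

The thread loop could then iterate over toFetch instead of myxml.urlset.url. Note that action="refresh" clears the collection before repopulating it, so an incremental run like this would presumably want action="update" on the cfindex call instead.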

<cfset thetime = getTickCount()>

<cfscript>
/**
 * Parses an HTML page and returns the title.
 *
 * @param str The HTML string to check.
 * @return Returns a string.
 * @author Raymond Camden (ray@camdenfamily.com)
 * @version 1, December 3, 2001
 */
function GetHTMLTitle(str) {
    var matchStruct = reFindNoCase("<[[:space:]]*title[[:space:]]*>([^<]*)<[[:space:]]*/title[[:space:]]*>",str,1,1);
    if(arrayLen(matchStruct.len) lt 2) return "";
    return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}

function GetHTMLBody(str) {
    var matchStruct = reFindNoCase("<.*?body.*?>(.*?)<[[:space:]]*/body[[:space:]]*>",str,1,1);
    if(arrayLen(matchStruct.len) lt 2) return "";
    return Mid(str,matchStruct.pos[2],matchStruct.len[2]);
}

function GetMetaHeaders(str) {
    var matchStruct = structNew();
    var name = "";
    var content = "";
    var results = arrayNew(1);
    var pos = 1;
    var regex = "<meta[[:space:]]*(name|http-equiv)[[:space:]]*=[[:space:]]*(""|')([^""]*)(""|')[[:space:]]*content=(""|')([^""]*)(""|')[[:space:]]*/{0,1}>";

    matchStruct = REFindNoCase(regex,str,pos,1);
    while(matchStruct.pos[1]) {
        results[arrayLen(results)+1] = structNew();
        results[arrayLen(results)][ Mid(str,matchStruct.pos[2],matchStruct.len[2])] = Mid(str,matchStruct.pos[4],matchStruct.len[4]);
        results[arrayLen(results)].content = Mid(str,matchStruct.pos[7],matchStruct.len[7]);
        pos = matchStruct.pos[6] + matchStruct.len[6] + 1;
        matchStruct = REFindNoCase(regex,str,pos,1);
    }
    return results;
}
</cfscript>

<!--- create collection if needed --->
<cfcollection action="list" name="mycollections">

<cfif not listFindNoCase(valueList(mycollections.name), "sitemaptest")>
    <cfoutput><p>Creating collection.</p></cfoutput>
    <cfcollection action="create" collection="sitemaptest" path="#server.coldfusion.rootdir#/collections">
</cfif>

<!--- read in xml --->
<cfset myxml = fileRead(expandPath("./sitemap.xml"))>
<!--- convert to xml --->
<cfset myxml = xmlParse(myxml)>
<!--- place to store data --->
<cfset request.data = structNew()>

<!--- now loop through.... --->
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">

    <cfset tname = "thread#x#">
    <cfthread name="#tname#" url="#myxml.urlset.url[x].loc.xmltext#">
        <cfhttp url="#attributes.url#" result="result">
        <cfset request.data[attributes.url] = structNew()>
        <cfset request.data[attributes.url].title = getHTMLTitle(result.filecontent)>
        <cfset request.data[attributes.url].body = getHTMLBody(result.filecontent)>
        <!--- remove all html from body --->
        <cfset request.data[attributes.url].body = rereplace(request.data[attributes.url].body, "<.*?>", "", "all")>
        <cfset headers = getMetaHeaders(result.filecontent)>
        <cfset request.data[attributes.url].keywords = "">
        <cfset request.data[attributes.url].description = "">

        <cfset request.data[attributes.url].x = headers>

        <!--- find description and keywords --->
        <cfloop index="x" from="1" to="#arrayLen(headers)#">
            <cfif structKeyExists(headers[x], "name")>
                <cfif headers[x].name is "description">
                    <cfset request.data[attributes.url].description = headers[x].content>
                <cfelseif headers[x].name is "keywords">
                    <cfset request.data[attributes.url].keywords = headers[x].content>
                </cfif>
            </cfif>
        </cfloop>
    </cfthread>

</cfloop>

<!--- join the threads --->
<cfthread action="join" name="#structKeyList(cfthread)#" />

<!--- make a query for the data --->
<cfset info = queryNew("url,body,title,keywords,description")>
<cfloop item="c" collection="#request.data#">
    <cfset queryAddRow(info)>
    <cfset querySetCell(info, "url", c)>
    <cfset querySetCell(info, "body", request.data[c].body)>
    <cfset querySetCell(info, "title", request.data[c].title)>
    <cfset querySetCell(info, "keywords", request.data[c].keywords)>
    <cfset querySetCell(info, "description", request.data[c].description)>
</cfloop>

<!--- insert data --->
<cfindex collection="sitemaptest" action="refresh" query="info" title="title" key="url" body="body" urlpath="url" custom1="keywords" custom2="description" status="status">

<cfset totaltime = getTickCount() - thetime>

<cfoutput>
<p>
Done indexing. Did #info.recordCount# rows. Took #totaltime# ms.
</p>
</cfoutput>

<cfdump var="#status#">

Comment 1 written by Mike Henke on 2 October 2007, at 1:55 PM
Comment 2 written by Raymond Camden on 2 October 2007, at 3:55 PM
Comment 3 written by Mike Henke on 2 October 2007, at 5:05 PM
Comment 4 written by Raymond Camden on 3 October 2007, at 5:16 PM
Comment 5 written by Michael Evangelista on 28 November 2007, at 10:37 AM
Hi Ray -
I am trying to implement this on a site that sits on a CF7 server. I managed to remove the cfthread references and replaced fileRead with a cffile action='read'... and it seems to work, sorta but not quite.
My sitemap.xml has about 80 nodes.
When running the modified indexing page, I get a success message saying 20 rows have been added - but nothing at all in search results, even searching for words that should be in every document. (side note: is there any way to view or dump the contents of a verity collection?)
In your notes you say 'this should be easy to downgrade to cf7' , but I am wondering what I have missed. I know this is vague but... any ideas?
Comment 6 written by Raymond Camden on 28 November 2007, at 10:58 AM
The only simple way to dump an entire collection is to search for *.
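
A minimal sketch of that suggestion, assuming the collection is the sitemaptest collection from the post. The query name everything is only an illustrative choice.

```cfml
<!--- Search for everything in the collection, then dump the result query. --->
<cfsearch collection="sitemaptest" criteria="*" name="everything">

<cfoutput><p>Found #everything.recordCount# documents.</p></cfoutput>
<cfdump var="#everything#">
```

The dump should show the key, title, summary, custom1, and custom2 columns for each indexed row, which is usually enough to see what Verity actually stored.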
Comment 7 written by Michael Evangelista on 28 November 2007, at 11:35 AM
thanks for the quick reply.
yes, the dump shows 80 rows (dumping out the same 'myxml' variable that is being parsed).
I hacked this in two ways for cf7
Took out the fileRead like this, adding cffile instead
<cffile action="read" file="#siteMapPath#" variable="myxml">
<!--- <cfset myxml = fileRead(expandPath("./sitemap.xml"))> --->
And commented out the cfthread references in 3 places.
I just sent you a direct email with links to the pages and a bit of explanation, and I am playing with the cfdump now.
I'd love to get this working for CF7, and hope it might be useful to others too.
Comment 8 written by Michael Evangelista on 28 November 2007, at 12:40 PM
http://garkaneenergy.com/verity/veritysitemap.cfm
For the moment, I have this page dumping out each of the request.data... structs for each page, along with the page name.
Super cool to see all that meaty text content in there!
below that i am dumping the full xml variable.
it goes through the first 20 in the file, and then chokes.
The fact that I am getting the first sequential 20 makes me curious. I deleted number 20 and 21 in case it was bad data... then i deleted the first 20 and still only got 20.
DOH!!!!
<smacking head>
Your example code has 20 rows limited in the cfloop.
< ashamed... >
<cfloop index="x" from="1" to="#min(20,arrayLen(myxml.urlset.url))#">
Changing that to the more obvious value
<cfloop index="x" from="1" to="#arrayLen(myxml.urlset.url)#">
did the trick. Duh, double duh.
So... ok ... we are past the 'why only 20' hurdle.
The page shows I am getting 58 rows from the xml file which sounds perfect, and I can see , in my lovely cfdumps, all the meaty text content and perfectly organized meta values... awesome!
Listing my verity collection info
I see I have a DocCount of 58.
But... still no searchy.
http://garkaneenergy.com/search.cfm
Running a search here for "garkane" should bring up 58 pages at the current time.
Now I am back to wishing I could 'dump' the contents of a verity collection. I want to see what those 58 DocCount records actually contain!
I can only think that I screwed it up somewhere by simply hacking out the cfthread tags. More investigation!
Comment 9 written by Michael Evangelista on 28 November 2007, at 12:48 PM
I had my cfsearch set up to find content only '<in> body'... but there's no more body tag to search through.
Ray this is AWESOME, and it has been the catalyst for quite the spontaneous learning experience.
Thanks for being a sounding board... I think it works!!
Comment 10 written by Raymond Camden on 28 November 2007, at 1:37 PM
Comment 11 written by Michael Evangelista on 28 November 2007, at 3:47 PM
Now that I have struggled with it, I *get* it and , Ray, this rocks!
Taking the content filtering one step further, I added some regular expression code to strip out everything from the retrieved 'body' code that is not inside of my "mainCol" div, so the first thing shown in the site summary from the cfindex is the heading of the main part of the page, then the page text - no messy menu, or preceding-column junk to contend with.
This full circle code-trip means that I can now
- generate an xml site map for any visible site
- edit the sitemap.xml as an easy no-frills way to limit or extend which pages verity spiders
- use this code to create a cfcollection in *minutes*,
filtering the retained content according to any inserted tag or comment in my pages' code (i.e. only main column, etc)
- run a cfsearch against the collection, resulting in a super-fast lightweight homegrown in-site magic ColdFusion search!
thanks again... I am really psyched about having this entire code set in my collection.
Comment 12 written by Raymond Camden on 28 November 2007, at 3:50 PM
Comment 13 written by Michael Evangelista on 28 November 2007, at 4:07 PM
I think I will put together a demo and blog post on the full circle trip using the sitemap creator and this code, plus how I restricted the search to specific parts of the page markup... If I do that, can I include a modified copy of your code as a downloadable file (with credit given, of course)?
I didnt have a clue about any of this until yesterday... now I feel like I've been handed a shiny new toolbox that lots of folks are constantly looking for... this could be really useful to a lot of people once they see it in action!
Comment 14 written by Raymond Camden on 28 November 2007, at 4:09 PM
ONE MILLION DOLLARS! (finger by mouth and evil laugh)
Comment 15 written by Michael Evangelista on 28 November 2007, at 5:50 PM
http://mredesign.com/cfdev/one-million-dollars.cfm...
Comment 16 written by Raymond Camden on 29 November 2007, at 8:23 AM
Comment 17 written by Michael Evangelista on 1 December 2007, at 6:56 PM
http://mredesign.com/demos/verity-sitemap/index.cf...
This is pretty neat - I'm using the sitemap generator to make the xml, then feeding it to your verity writer, all with a neat little skin.
Coolest part - download the zip, and drop the files into any site, then browse to the index page - walk through the steps and presto chango, instant site search!
Comment 18 written by Brent on 3 December 2007, at 6:10 PM
First off, I visit your blog almost daily and usually find exactly what I am looking for. Thanks for this great resource.
My question about this script is:
While crawling my local website with this script, it works perfectly, except that I often get timeout errors. I have adjusted the timeout to 1200, but it still occurs even on smaller crawls.
Is there any way to debug this to find out which cfhttp calls are getting hung up and perhaps just skip them?
Comment 19 written by Raymond Camden on 4 December 2007, at 5:59 AM
<cfset t = getTickCount()>
<cfhttp .....>
<cfset duration = getTickCount() - t>
<cflog file="test" text="To get url #x#, it took #duration# ms">
This may flag the culprit.
Worst comes to worst - feed the code the XML a portion at a time.
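
One way to both log and skip slow pages, sketched under assumptions: the 10 second timeout, the log file name test, and the theURL variable are illustrative values rather than part of the original code. The timeout and throwonerror attributes of cfhttp let a hung request fail on its own, so the catch block can log it and move on.

```cfml
<!--- Hypothetical URL for illustration; in the indexer this would be attributes.url. --->
<cfset theURL = "http://www.foo.com/index.cfm">
<cfset t = getTickCount()>

<cftry>
    <!--- Give up after 10 seconds and raise an error on failure. --->
    <cfhttp url="#theURL#" result="result" timeout="10" throwonerror="true">
    <cfset duration = getTickCount() - t>
    <cflog file="test" text="To get url #theURL#, it took #duration# ms">

    <cfcatch type="any">
        <!--- Log the culprit and carry on with the next URL. --->
        <cfset duration = getTickCount() - t>
        <cflog file="test" text="Skipped url #theURL# after #duration# ms: #cfcatch.message#">
    </cfcatch>
</cftry>
```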
Comment 20 written by Brad on 24 January 2008, at 11:34 AM
You mentioned "I could have considered taking the data I sucked down, saving it to an HTML file, and then running a file based index. While this would be slower, it could have resulted in better indexing."
What would be the easy way to modify this code to work that way instead? I currently save all my dynamic pages off as HTML files using cfhttp and have Verity index them for my search. I would like to switch to this instead.
Comment 21 written by Raymond Camden on 24 January 2008, at 11:37 AM
Comment 22 written by Connie DeCinko on 13 March 2009, at 5:26 PM
How would you recommend telling the function to exclude certain links? Would you just have to maintain a no follow list? Or is it kinda pointless since Google is going to follow the link no matter what?
Comment 23 written by Raymond Camden on 13 March 2009, at 5:29 PM