Sitemap Generator

Earlier today Yahoo and Google announced their collaboration on Sitemaps.org. Sitemaps give you a way to tell a search engine which pages make up your web site. I've had sitemap support in BlogCFC for a while, but today I wrote a little UDF you can use to generate sitemap XML. It takes either a list of URLs or a query of URLs. Enjoy. I'll post it to CFLib later in the week.

<cffunction name="generateSiteMap" output="false" returnType="xml">
   <cfargument name="data" type="any" required="true">
   <cfargument name="lastmod" type="date" required="false">
   <cfargument name="changefreq" type="string" required="false">
   <cfargument name="priority" type="numeric" required="false">
   
   <cfset var header = "<?xml version=""1.0"" encoding=""UTF-8""?><urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">">
   <cfset var result = header>
   <cfset var aurl = "">
   <cfset var item = "">
   <cfset var validChangeFreq = "always,hourly,daily,weekly,monthly,yearly,never">
   <cfset var newDate = "">
   <cfset var tz = getTimeZoneInfo().utcHourOffset>
   
   <cfif structKeyExists(arguments, "changefreq") and not listFindNoCase(validChangeFreq, arguments.changefreq)>
      <cfthrow message="Invalid changefreq (#arguments.changefreq#) passed. Valid values are #validChangeFreq#">
   </cfif>

   <cfif structKeyExists(arguments, "priority") and (arguments.priority lt 0 or arguments.priority gt 1)>
      <cfthrow message="Invalid priority (#arguments.priority#) passed. Must be between 0.0 and 1.0">
   </cfif>
   
   <!--- reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-datetime --->
   <cfif structKeyExists(arguments, "lastmod")>         
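      <cfset newDate = dateFormat(arguments.lastmod, "yyyy-mm-dd") & "T" & timeFormat(arguments.lastmod, "HH:mm")>
      <!--- utcHourOffset is already signed (positive west of UTC), so flip the sign and zero-pad the hours --->
      <cfif tz gt 0>
         <cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
      <cfelse>
         <cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
      </cfif>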
   </cfif>
   
   <!--- Support either a query or list of URLs --->
   <cfif isSimpleValue(arguments.data)>
      <cfloop index="aurl" list="#arguments.data#">
         <cfsavecontent variable="item">
<cfoutput>
<url>
   <loc>#xmlFormat(aurl)#</loc>
   <cfif structKeyExists(arguments,"lastmod")>
   <lastmod>#newDate#</lastmod>
   </cfif>
   <cfif structKeyExists(arguments,"changefreq")>
   <changefreq>#arguments.changefreq#</changefreq>
   </cfif>
   <cfif structKeyExists(arguments,"priority")>
   <priority>#arguments.priority#</priority>
   </cfif>
</url>
</cfoutput>
         </cfsavecontent>
         <cfset item = trim(item)>
         <cfset result = result & item>
      </cfloop>
      
   <cfelseif isQuery(arguments.data)>
      <cfloop query="arguments.data">
         <cfsavecontent variable="item">
<cfoutput>
<url>
   <loc>#xmlFormat(url)#</loc>
   <cfif listFindNoCase(arguments.data.columnlist,"lastmod")>
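      <cfset newDate = dateFormat(arguments.data.lastmod, "yyyy-mm-dd") & "T" & timeFormat(arguments.data.lastmod, "HH:mm")>
      <!--- utcHourOffset is already signed (positive west of UTC), so flip the sign and zero-pad the hours --->
      <cfif tz gt 0>
         <cfset newDate = newDate & "-" & numberFormat(tz, "00") & ":00">
      <cfelse>
         <cfset newDate = newDate & "+" & numberFormat(abs(tz), "00") & ":00">
      </cfif>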
      <lastmod>#newDate#</lastmod>
   </cfif>
   <cfif listFindNoCase(arguments.data.columnlist,"changefreq")>
   <changefreq>#arguments.data.changefreq#</changefreq>
   </cfif>
   <cfif listFindNoCase(arguments.data.columnlist,"priority")>
   <priority>#arguments.data.priority#</priority>
   </cfif>
</url>
</cfoutput>
         </cfsavecontent>
         <cfset item = trim(item)>
         <cfset result = result & item>
      
      </cfloop>
   </cfif>
   
   <cfset result = result & "</urlset>">
   
   <cfreturn result>
   
</cffunction>
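
Here is a quick usage sketch, assuming the UDF above is already defined on (or included into) the page. The example.com URLs and the variable names are made up for illustration. A query passed in needs a url column and may optionally carry lastmod, changefreq, and priority columns.

```cfml
<!--- Hypothetical list of URLs; example.com is a placeholder --->
<cfset urls = "http://www.example.com/,http://www.example.com/about.cfm">

<!--- List form: the optional arguments apply to every URL in the list --->
<cfset siteMapXML = generateSiteMap(data=urls, lastmod=now(), changefreq="daily", priority=0.8)>

<!--- Query form: per-row values come from the query columns --->
<cfset qurls = queryNew("url,changefreq,priority", "varchar,varchar,decimal")>
<cfset queryAddRow(qurls)>
<cfset querySetCell(qurls, "url", "http://www.example.com/blog/")>
<cfset querySetCell(qurls, "changefreq", "weekly")>
<cfset querySetCell(qurls, "priority", 0.5)>
<cfset siteMapFromQuery = generateSiteMap(qurls)>

<!--- Inspect the results as parsed XML --->
<cfdump var="#xmlParse(siteMapXML)#">
<cfdump var="#xmlParse(siteMapFromQuery)#">
```

Since the list form loops over a comma-delimited list, a URL that itself contains a comma would be split in two. The query form is the safer choice for those.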

Comments

How does it actually work?
# Posted By aleksandar | 11/20/06 6:50 AM
You pass in either a list of URLs or a query. I added it to CFLib last night and there is a bit more documentation there.

http://www.cflib.org/udf.cfm?id=1596
# Posted By Raymond Camden | 11/20/06 8:02 AM
Do you think it would be hard to build a site crawler and link parser in CF to use with this UDF?
# Posted By BL | 11/28/06 10:41 AM
BL: Sure, I'll make it a Friday test. ;)
# Posted By Raymond Camden | 11/28/06 2:31 PM
nice. you feelin a little regexy?
# Posted By BL | 11/28/06 6:21 PM
Can I offer a couple of amendments in the light of my experience of using this UDF to submit to Google.

The code changes go after the comment "reformat datetime as w3c datetime / http://www.w3.org/TR/NOTE-datetime".

1. Change the test of tz to be "gt" rather than "gte". To be honest this is really just a personal style thing: +00:00 looks better than -00:00 to me, and it doesn't seem to affect Google.

2. Make the hour number format "00" for the newDate offset, e.g. numberFormat(tz,"00"). So the lines should read newDate = newDate & "-" & numberFormat(tz,"00") & ":00" and newDate = newDate & "+" & numberFormat(tz,"00") & ":00"

HTH
# Posted By dickbob | 3/3/07 7:07 AM
I've made both changes. I've also changed the UDF to use stringbuffer, this makes it quicker. Unfortunately - CFLIB is causing me fits now - so it's not hooked up yet. I will refresh it later.
# Posted By Raymond Camden | 3/3/07 4:20 PM
I know you most likely thought of it but I should have mentioned that the same changes need to be applied when the data is supplied as a query.
# Posted By dickbob | 3/6/07 9:13 AM
Hi there, I was just tasked to produce an XML sitemap. I noticed that you mention that you would make a Friday test out of the idea of making a site crawler. I searched for that term in your blog and didn't see any results. Did this ever occur?
# Posted By Ruth | 4/19/07 2:45 PM
Not yet - no.
# Posted By Raymond Camden | 4/19/07 5:32 PM
I started a solution to create a map on the server side - but it isn't a crawler. I'll post the code after I get it cleaned up enough and some of the kinks worked out.

I am having issues with cfdirectory w/recursion at webroot. I get the pesky null pointer error, which I am attributing to archived directories, etc bloating the query. I still need to prove that is the cause.
# Posted By Ruth | 4/20/07 3:04 PM
Ruth, don't forget that if you confirm a bug, you can report it at:

http://www.adobe.com/go/wish
# Posted By Raymond Camden | 4/20/07 3:11 PM
Ray, the udf on cflib although dated March 9 2007 doesn't have the changes you mentioned you'd added to timezone and the use of stringbuffer. Also the getTimeZoneInfo().utcHourOffset assigned to tz returns (for me at least) an offset value, eg "-1" for UK, so the test which adds the "+" or "-" later is unnecessary and makes the date format invalid (eg 2007-08-04T19:24+-1:00).
# Posted By Jeremy Halliwell | 8/14/07 7:06 AM
I'm updating it in 10 seconds. Will you please give it a try?
# Posted By Raymond Camden | 8/14/07 8:00 AM
Yes that works great, thanks Ray.
# Posted By Jeremy Halliwell | 8/14/07 8:22 AM
You say BlogCFC has had sitemap support for some time. In what way?
Is it supposed to generate a sitemap.xml file?
And if so, how? I can't find any option to do this.
Or do I need to update, as I am on BlogCFC 5.5?
# Posted By Snake | 9/4/07 12:40 PM
There should be a file named sitemap or googlesitemap.cfm in the root directory.
# Posted By Raymond Camden | 9/4/07 12:53 PM
Hi, I can't figure out how to combine these values into one XML output:

<cfset siteMapXML = generateSiteMap(data=urls,changefreq="daily",priority="1.0", lastmod=now())>
<cfdump var="#xmlParse(siteMapXML)#">
<cfset siteMapXML = generateSiteMap(qurls)>
<cfdump var="#xmlParse(siteMapXML)#">

I want these combined, as I need to put it all into one XML sitemap. The .cfm sitemap takes too long to load because the sitemap is big.

thanks
# Posted By marco | 10/12/07 9:17 PM
Well, I think you can just combine both XML files. You would want to remove the <?xml ... ?> declaration and the opening <urlset> tag from the second one, and the closing </urlset> tag from the first, though. Not exactly sure - but it's definitely possible.
# Posted By Raymond Camden | 10/23/07 5:03 PM
of course, i was thinking the hard way as usual, thx
# Posted By m van den oever | 10/23/07 6:18 PM
Could someone break this down for me? I've read this post over and over and looked at the CFLib documentation. I don't know if I'm missing something or (most likely) I just don't know what I'm doing. Any shove in the right direction is greatly appreciated.
# Posted By Adam | 1/18/08 1:49 PM