When is XML not XML?
Here is a mystery for folks. I've updated my parsing engine for coldfusionbloggers.org. I'm using CFHTTP now so I can check ETag-type stuff. I take the result text and save it to a file to be parsed by CFFEED.
But before I do that I check to ensure it's valid XML. Here is where it gets weird. Charlie Griefer's blog works with CFFEED directly, but isXML on the result returns false. But - I can xmlParse the string no problem. Simple example:
<cfset f= "http://cfblog.griefer.com/feeds/rss2-0.cfm?blogid=30">
<cfhttp url="#f#">
<cfset text = cfhttp.filecontent>
<cfif isXml(text)>
yes
<cfelse>
no
<cfset z = xmlParse(text)>
<cfdump var="#z#">
</cfif>
If you run this, you will see "no" output, and then an XML object. If you use CFFEED on the URL directly, that works as well. So it seems like isXML is being strict about something. I can update my code to try/catch an xmlParse obviously, but I'd rather figure out why the above is happening first.
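For reference, the try/catch fallback mentioned above might look roughly like this. It is only a sketch: the variable names feedUrl, parsedOk, and feedXml are made up for illustration, and it assumes xmlParse throws when the text is not well-formed.

<cfset feedUrl = "http://cfblog.griefer.com/feeds/rss2-0.cfm?blogid=30">
<cfhttp url="#feedUrl#">
<cfset text = cfhttp.filecontent>
<!--- Assume the text is parseable until xmlParse says otherwise --->
<cfset parsedOk = true>
<cftry>
    <!--- Rely on xmlParse instead of isXml to decide whether the feed is usable --->
    <cfset feedXml = xmlParse(text)>
    <cfcatch type="any">
        <cfset parsedOk = false>
    </cfcatch>
</cftry>
<cfif parsedOk>
    parsed
<cfelse>
    not parseable
</cfif>

This treats "xmlParse succeeded" as the definition of valid, which sidesteps the isXML question rather than answering it.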
Comments
http://www.validome.org/rss-atom/validate
http://validator.w3.org
http://feedvalidator.org
This is really neat! I can't see a reason this would fail to be valid XML.
The problem is <![CDATA[]]> in the xmlNode. I always use the W3C validator which escapes malformed HTML within the CDATA. I used CDATA on purpose because xmlFormat() doesn't always re-format correctly for valid RSS Feeds - especially when non-technical users are providing the input.
ColdFusion's isXML() doesn't appear to escape the CDATA content, however. For example, the following feed using xmlFormat() with Charlie's content returns isXML() true:
http://cfblog.griefer.com/feeds/rss_test.cfm?blogi...
Whereas the original does not:
http://cfblog.griefer.com/feeds/rss2-0.cfm?blogid=...
Interesting stuff!
<b>foo
In my CDATA, CF would consider it bad because I never closed the B?
did jon just call me a 'non-technical user'? :)
"did jon just call me a 'non-technical user'? :)"
Errr..... :-O No actually, the original change to using CDATA was from a couple of non-technically oriented blog portals like pieceoftexas.com. Users were pasting from Word and even with the WYSIWYG, xmlFormat() wasn't cleaning it up enough. There were also intermittent problems with feed readers decoding inline JavaScript like YouTube posts, etc. from users' content.
@Ray
I'm going to play around with it, but it appears that any raw HTML in CDATA will cause isXML() to fail - which is the reason for using CDATA in the first place.
In my testing, it didn't appear to be looking for parity/balance of tags, so much as it was looking for parity of brackets. That is, <b> without </b> is okay, as is <a>foo</b>, but <b (no closing bracket) is not. His feed at this moment has an A tag that has been chopped off between the tagName and its first attribute.
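A small test page along these lines could be used to check the bracket-parity theory. This is a sketch, not a confirmed repro: the sample strings and the names samples, label, and parsed are invented for illustration, and no particular result is assumed for any of them.

<!--- Three hypothetical CDATA payloads matching the cases described above --->
<cfset samples = structNew()>
<cfset samples["unclosed tag"] = "<item><description><![CDATA[<b>foo]]></description></item>">
<cfset samples["mismatched tags"] = "<item><description><![CDATA[<a>foo</b>]]></description></item>">
<cfset samples["missing bracket"] = "<item><description><![CDATA[<b foo]]></description></item>">
<cfloop collection="#samples#" item="label">
    <!--- Show what isXML reports for this sample --->
    <cfoutput>#label#: isXML=#isXML(samples[label])#</cfoutput>
    <cftry>
        <!--- Then see whether xmlParse accepts the same string --->
        <cfset parsed = xmlParse(samples[label])>
        <cfoutput>, xmlParse ok<br></cfoutput>
        <cfcatch type="any">
            <cfoutput>, xmlParse failed<br></cfoutput>
        </cfcatch>
    </cftry>
</cfloop>

Comparing the two columns for each sample would show whether isXML and xmlParse disagree only on the missing-bracket case or on any raw HTML inside CDATA.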


i sent this entry to jon clausen, the big brain behind cfblog. i know you said you think it's a "cf thing" more than a "cfblog thing", but i figured jon might have some insights he can offer up.