ColdFusion 8: Working with PDFs (Part 7)

In today's entry I'll be discussing the processDDX action of the CFPDF tag. I have to admit that I wasn't looking forward to this entry. Every time I had looked at the documentation, it just didn't make sense. I didn't see the point. But now that I've looked at it again more in depth, I'm almost in awe at how cool this feature is. I'm definitely just scratching the surface in this blog post, but hopefully it will encourage others to look into DDX and how it works with ColdFusion.

So as you can probably guess, CFPDF's processDDX action lets ColdFusion work with DDX. OK, so what in the heck is DDX? DDX stands for Document Description XML. You can think of it as a template for a PDF file. At a basic level, it lets you lay out PDF files (like the Merge option does) and add special commands (generating a table of contents, for example). DDX is used by Adobe's LiveCycle Assembler product, and ColdFusion ships with a stripped-down version of that product. The exact XML tags not allowed in ColdFusion are listed in the documentation. As far as I can see, there is no way to enter a serial number and enable the full power of LiveCycle Assembler. But even with the restrictions, there is an incredible amount of power built in.

As I mentioned above, this entry is only going to talk about DDX at a high level. You can find the DDX reference here. Also, as Charlie Arehart mentioned in a comment on my PDF series, the ColdFusion documentation is excellent. I want to credit them for my examples below, as all are either direct copies or modified versions of their examples. Also note that this is a very complex topic. There is a good chance I will screw something up, so please let me know if I do.

Let's begin by talking about how you use DDX in ColdFusion. ColdFusion 8 adds an isDDX() function. This function takes either a relative/absolute path to a filename or an actual string of DDX tags. Don't worry too much about the XML just yet, but here is a simple example of checking a string to see if it is valid DDX:

<cfsavecontent variable="myddx">
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Title"/>
<TableOfContents/>
<PDF source="Doc1"/>
<PDF source="Doc2"/>
</PDF>
</DDX>
</cfsavecontent>
<cfset myddx = trim(myddx)>

<cfif isDDX(myddx)>
Yes, it's DDX.
<cfelse>
No, it's not.
</cfif>

In this example I've just used the CFSAVECONTENT tag to wrap my DDX XML. I trim it and then check to see if it is DDX. Now that I've shown you a bit of DDX, let me talk a bit about what that example does. Ignoring the DDX tag, there are 2 XML tags in use here, PDF and TableOfContents. The first PDF tag uses result="Out1" and wraps the other tags. This basically says the result of everything on the inside should be put into a result named Out1. On the inside there are 3 PDF tags with a source. You can think of this like a merge. The tags specify an order based on names: Title, Doc1, and Doc2. So far so good. But then note that a TableOfContents tag exists right after the Title PDF. This particular tag can do a lot - but at a basic level, it just says, "Create a table of contents using the PDFs following me."

So let me repeat what I said above. This is partially for my sake to ensure I'm describing it right (remember what I said, I'm new to this!). What we have is a template that takes 3 PDFs. It puts the Title PDF first. It defines a page as a Table of Contents. It then lays down two more PDFs. Let's take a look at how ColdFusion can work with this DDX.

First note that the DDX works with PDF names. Notice I don't have any real file names. Nor do I have ColdFusion variables. Instead I have labels like Out1, Title, Doc1, and Doc2. So we need a way to pass real values so that LiveCycle Assembler can use them when processing the DDX. The CFPDF tag takes two related attributes, inputFiles and outputFiles. Each of these is a structure that maps the names used in the DDX to file names. So using our sample DDX above, I can define my 3 input PDFs like so:

<cfset inputStruct=StructNew()>
<cfset inputStruct.Title="title.pdf">
<cfset inputStruct.Doc1="paris.pdf">
<cfset inputStruct.Doc2="booger.pdf">

Defining the output file is also struct based:

<cfset outputStruct=StructNew()>
<cfset outputStruct.Out1="output1.pdf">

OK, so at this point I've defined all the names used in the DDX file. Now let's use CFPDF to run the process:

<cfpdf action="processddx" ddxfile="#myddx#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">

Pretty trivial I think. I passed in my structs and DDX. At this point I now have a result. If I dump ddxVar, I will see a structure. Each key of the structure maps to the output key from my DDX. I had used this tag:

<PDF result="Out1">

So ddxVar.out1 will contain a status message for my result. It will either be "successful" or "failed" followed by a reason. One quick note. You will notice I used paths for my PDFs. In order to use DDX, you have to work with real files. You can't pass in a PDF created in memory. Obviously you can make the PDF on the fly and save it in the same request.
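To tie those pieces together, here is a minimal sketch that validates the DDX, runs it, and then checks the status for each result. It assumes the myddx, inputStruct, and outputStruct variables from the earlier snippets, and it assumes the status text is exactly "successful" on success, as described above. Anything else is simply treated as a failure and printed as is.

<!--- Assumes myddx, inputStruct, and outputStruct were built as shown earlier. --->
<cfif isDDX(myddx)>
<cfpdf action="processddx" ddxfile="#myddx#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">

<!--- One status entry per result name in the DDX. Struct keys are not case sensitive. --->
<cfloop collection="#ddxVar#" item="resultName">
<cfif ddxVar[resultName] eq "successful">
<cfoutput>#resultName#: wrote #outputStruct[resultName]#<br></cfoutput>
<cfelse>
<!--- Anything other than "successful" is shown as a failure message. --->
<cfoutput>#resultName#: #htmlEditFormat(ddxVar[resultName])#<br></cfoutput>
</cfif>
</cfloop>
<cfelse>
The DDX string did not validate.
</cfif>

Looping over the result structure rather than hard-coding ddxVar.Out1 means the same check keeps working if the DDX later defines more than one result.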

If you view your PDF now (remember it was named output1.pdf), you may notice that you don't have a table of contents. Turns out that the TableOfContents tag looks for a bookmark. I had to switch this code:

<cfdocument format="pdf" filename="paris.pdf" overwrite="true">
<h2>Paris Hilton</h2>

<p>
Here is the collected wisdom of Paris Hilton.
</p>
</cfdocument>

To this:

<cfdocument format="pdf" filename="paris.pdf" overwrite="true" bookmark="true">
<cfdocumentsection name="Paris Section">
<h2>Paris Hilton</h2>

<p>
Here is the collected wisdom of Paris Hilton.
</p>
</cfdocumentsection>
</cfdocument>

Note the use of bookmark=true and a cfdocumentsection that wraps the entire page. That was slightly confusing at first, but the end result is perfect. What is great is that my ColdFusion Cookbook site will be able to benefit from this. Right now I have something like 120+ pages in a PDF with no easy way to navigate. By using DDX I'll be able to add a real table of contents to the document!
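If a document has more than one logical section, the same idea should extend naturally: give each cfdocumentsection its own name so that each one becomes a bookmark the table of contents can pick up. This is a sketch rather than one of the packaged test files. The file name cookbook.pdf and the section names are made up for illustration.

<cfdocument format="pdf" filename="cookbook.pdf" overwrite="true" bookmark="true">
<!--- Each named section should produce one bookmark in the generated PDF. --->
<cfdocumentsection name="Getting Started">
<h2>Getting Started</h2>

<p>
First chapter content goes here.
</p>
</cfdocumentsection>

<cfdocumentsection name="Working with Dates">
<h2>Working with Dates</h2>

<p>
Second chapter content goes here.
</p>
</cfdocumentsection>
</cfdocument>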

So what else can you do with DDX? As I mentioned some features were removed from the bundled product, but what is left is still pretty awesome. Charlie Arehart added a comment to another of my blog articles saying that he wished it were simpler to add a watermark to a PDF. I.e., just add "Foo" to the PDF without needing to make a new PDF or an image. Turns out DDX supports that as well. Here is some sample DDX that demonstrates how to apply a watermark. Again - check the LiveCycle Assembler DDX documentation for explicit documentation on each tag.

<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Doc1">
<Watermark rotation="30" opacity="50%">
<StyledText><p font-size="85pt" font-weight="bold" color="gray" font="Arial">FINAL</p></StyledText>
</Watermark>
</PDF>
</PDF>
</DDX>

Nothing too terribly complex here. Frankly I find this a bit easier than the approach in my earlier PDF and watermarks blog article. Maybe not easier per se, but I find it to be more direct. And in case it isn't obvious: since the DDX is completely abstracted, you can pass in any PDF you want and specify any output. One thing I'm not sure of is whether the value of the watermark, the text, can be dynamic as well. Obviously I can generate my DDX in ColdFusion, so yes, it can be dynamic, but I'm curious to know if DDX itself supports variables for values like the text between the P tags.
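Since the DDX is just a string, building it in ColdFusion is one way to make the watermark text dynamic. Here is a minimal sketch of that approach. The watermarkText variable and the output file name are hypothetical, and xmlFormat() is used so that characters like & or < in the text don't break the XML. It doesn't answer the question of whether DDX itself supports variables.

<!--- Hypothetical dynamic value. Any string would do. --->
<cfset watermarkText = "DRAFT #dateFormat(now(), 'mm/dd/yyyy')#">

<!--- Build the DDX as a string, escaping the dynamic text for XML. --->
<cfsavecontent variable="watermarkDDX"><cfoutput>
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<PDF result="Out1">
<PDF source="Doc1">
<Watermark rotation="30" opacity="50%">
<StyledText><p font-size="85pt" font-weight="bold" color="gray" font="Arial">#xmlFormat(watermarkText)#</p></StyledText>
</Watermark>
</PDF>
</PDF>
</DDX>
</cfoutput></cfsavecontent>
<cfset watermarkDDX = trim(watermarkDDX)>

<cfset inputStruct = structNew()>
<cfset inputStruct.Doc1 = "paris.pdf">

<cfset outputStruct = structNew()>
<cfset outputStruct.Out1 = "paris_watermarked.pdf">

<cfpdf action="processddx" ddxfile="#watermarkDDX#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar">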

One more example. I always wondered why there wasn't a way to read the text of a PDF. Turns out there is: DDX. Consider this simple DDX example:

<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="doc1"/>
</DocumentText>
</DDX>

Here is the source PDF I used:

<cfdocument format="pdf" filename="paristoberead.pdf" overwrite="true">
<h2>Paris Hilton</h2>

<p>
<cfoutput>
This is the text of a PDF. It has a bit of randomness (#randRange(1,100)#) in it.
</cfoutput>
</p>

<cfdocumentitem type="pagebreak" />

<h2>Fetch Adams</h2>

<p>
<cfoutput>
This is the second page. It has a bit of randomness (#randRange(1,100)#) in it.
</cfoutput>
</p>

</cfdocument>

When processed, you get an XML file. The result will look something like so:

<?xml version="1.0" encoding="UTF-8"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
<TextPerPage>
<Page pageNumber="1">Paris Hilton This is the text of a PDF . It has a bit of randomness ( 67 ) in it .</Page>
<Page pageNumber="2">Fetch Adams This is the second page . It has a bit of randomness ( 7 ) in it .</Page>
</TextPerPage>
</DocText>

Notice how the HTML was removed. What's cool about this is that if you need to index PDF data and you don't want to use Verity, you could use this instead. (I think tonight I'll write a quick UDF just for this.)
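As a rough idea of what such a UDF could look like, here is a minimal sketch. The function name getPDFText, its arguments, and the XML file it writes are all hypothetical. It assumes both paths are absolute, that the output XML has the DocText/TextPerPage/Page shape shown above, and that the status string is "successful" on success. The XPath uses local-name() so the default namespace on DocText doesn't get in the way.

<cffunction name="getPDFText" returntype="array" output="false" hint="Returns an array with one string of text per page.">
<cfargument name="pdfPath" type="string" required="true" hint="Absolute path to the source PDF.">
<cfargument name="xmlPath" type="string" required="true" hint="Absolute path where the DocText XML is written.">

<cfset var ddx = "">
<cfset var inputStruct = structNew()>
<cfset var outputStruct = structNew()>
<cfset var ddxResult = "">
<cfset var rawXml = "">
<cfset var pages = "">
<cfset var result = arrayNew(1)>
<cfset var i = 0>

<cfsavecontent variable="ddx">
<?xml version="1.0" encoding="UTF-8"?>
<DDX xmlns="http://ns.adobe.com/DDX/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
<DocumentText result="Out1">
<PDF source="Doc1"/>
</DocumentText>
</DDX>
</cfsavecontent>

<cfset inputStruct.Doc1 = arguments.pdfPath>
<cfset outputStruct.Out1 = arguments.xmlPath>

<cfpdf action="processddx" ddxfile="#trim(ddx)#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxResult">

<!--- Stop early if the DDX run did not report success. --->
<cfif ddxResult.Out1 neq "successful">
<cfthrow message="DDX text extraction failed" detail="#ddxResult.Out1#">
</cfif>

<!--- Read the generated XML and collect the text of every Page element. --->
<cffile action="read" file="#arguments.xmlPath#" variable="rawXml" charset="utf-8">
<cfset pages = xmlSearch(xmlParse(rawXml), "//*[local-name()='Page']")>

<cfloop index="i" from="1" to="#arrayLen(pages)#">
<cfset arrayAppend(result, trim(pages[i].xmlText))>
</cfloop>

<cfreturn result>
</cffunction>

You could then call it like this and feed each array element to whatever index you are building:

<cfset pageText = getPDFText(expandPath("./paristoberead.pdf"), expandPath("./paristoberead.xml"))>
<cfdump var="#pageText#">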

That's it for this blog entry. I want to remind folks - DDX is a big topic and I didn't cover much at all. I also used a lot of code in this example so I've taken all my test CFMs and PDFs and packaged them as a zip attached to this article.

Comments

Ray,
Do you know, are the CF8 PDF functions still using iText at all, or is it all Adobe technology now?
# Posted By RobW | 7/25/07 8:20 AM
Sorry, I don't. Hopefully an Adobe reader can chime in here.
# Posted By Raymond Camden | 7/25/07 8:25 AM
Is it possible to use this to enter positioned text into an existing PDF document or tabular data?
# Posted By Daniel Budde II | 7/25/07 11:56 PM
I'm not sure. I'd check the DDX docs.
# Posted By Raymond Camden | 7/26/07 8:14 AM
Great series Ray. I've scoured the documentation and for the life of me can't see how I can read a pdf file's table of contents. The ddx documentation appears to only show how to inject ddx info and not how to extract it from an existing pdf file. Am I just missing something?
# Posted By Dave Hoff | 7/27/07 11:46 AM
Well, I know DDX can do extractions, as that is how I got the text out. In theory you could get the text from the page that had the TOC. This text wouldn't be structured though.
# Posted By Raymond Camden | 7/27/07 11:49 AM
It appears that CF8 does not support the Bookmarks DDX element that allows you to extract bookmarks from a pdf. In the CF8 docs, it lists the restricted DDX elements but "Bookmarks" is not in the list.

Too bad, I was really hoping for this functionality.
# Posted By Dave Hoff | 7/27/07 3:26 PM
If it isn't listed in the restricted list, can you please file a bug report?
# Posted By Raymond Camden | 7/27/07 4:05 PM
Anyone ever seen this error message:

failed: DDXM_S18005: An error occurred in the PrepareTOC phase while building <TableOfContents>. Cause given.

I only get this error when using the TableOfContents element in the DDX:

<TableOfContents maxBookmarkLevel="infinite" bookmarkTitle="Table of Contents" includeInTOC="false">
<Footer styleReference="CatalogueFooter" />
</TableOfContents>

Any thoughts?
# Posted By Brian | 8/8/07 2:17 PM
Reading through the livecycle docs I found a neat little parameter that can be added to the <header> and <footer> tags called replaceExisting=true SEE: http://livedocs.adobe.com/livecycle/es/sdkHelp/pro...

Unfortunately I have not been able to get it to work yet. I have added a comment to the CF8 docs but would also love to hear if anybody else has used this successfully.
# Posted By Martin | 2/11/08 9:42 AM
do most people use verity on cf8 to search though pdf's?
or do they parse out pdf's into text files and search through those?

what are the differences in resources?
# Posted By -paul | 3/20/08 9:28 AM
I don't know if there is a right answer to that. I don't think a lot of people use Verity in CF, even though they should.
# Posted By Raymond Camden | 3/20/08 10:23 AM
@my own comment about replaceExisting="true"
I HAVE been able to use this with the PDFs I have created with ColdFusion 8. My initial confusion came from (existing) PDFs that "looked" as if they had a header and footer, BUT when I converted those PDFs to text I found that it was actually body text stretched out to the edge of the page.

@Verity
I have been reluctant to use Verity because a) it does consume quite a bit of RAM and b) databases like MySQL come with full-text searching built in. Fair enough, MySQL doesn't search PDF documents though.
# Posted By Martin | 3/20/08 10:58 AM
I agree with Ray, I would really like to use Verity more, but I also agree with Martin and I tend to shy away from it when it comes to the resources used. I tend to use it more when I absolutely need full search capabilities with document context and scoring.

I tried using the Adobe docs to split Verity off onto its own server, but I was never able to make it work successfully on CF7 or CF8. If anyone has ever completed it successfully, I certainly would be interested.
# Posted By Daniel Budde II | 3/20/08 11:18 AM
I've got an odd error when running this ddx watermark example. The text "FINAL" sometimes appears as random nonsense characters like #&$&^%*. This happens maybe 1 out of 5 times I run the code. I'm using CF8 on OS X Leopard and opening the files in Apple's Preview program. I have not seen this problem with PDF's generated from my OS X server and viewing on Windows. Have any other Mac users out there noticed this intermittent watermark text problem? Has anyone solved the issue?

Thanks!
# Posted By Matt MacDougall | 5/27/08 12:27 PM
Did you change the hard coded font to something else?

Also - note that in 8.0.1, you can now supply HTML for watermarks. This means you don't need to use DDX for it anymore.
# Posted By Raymond Camden | 5/27/08 12:56 PM
Thanks Ray. I used your ddxpdf.zip example as is. In my last test I generated one copy of output2.pdf from ddx3.cfm. When I browse to the ddxpdf folder to open and close output2.pdf, at least 1 in 5 times, Preview renders FINAL as junk text.

I've confirmed this same behavior on another Mac running Leopard too. You've got a Mac right? You don't see this behavior?

I appreciate the heads up on 8.0.1 allowing watermark text outside of ddx, I'll give that a shot.
# Posted By Matt MacDougall | 5/27/08 1:39 PM
Tell me this - in the other 4 times, do you see font changes? I mean - still readable, but random fonts?
# Posted By Raymond Camden | 5/27/08 1:48 PM
When I see the word FINAL, it's always Arial. In fact when it doesn't read FINAL but something like &^&*%&* it looks like Arial as well.
# Posted By Matt MacDougall | 5/27/08 1:51 PM
You got me on that one. _Are_ you using 8.0.1 yet?
# Posted By Raymond Camden | 5/27/08 1:55 PM
Thanks for bouncing around some ideas Ray. I am using 8.0.1. I tried using the new addwatermark text functionality and ran into the same problem. It looks like the problem though is with the Arial font. I don't see an issue on my main machine or others when using Verdana or Courier.
# Posted By Matt MacDougall | 5/27/08 2:38 PM
Looks like it's time to log a bug. :)

http://www.adobe.com/go/wish
# Posted By Raymond Camden | 5/27/08 2:43 PM
Hi Ray,

I'm running your example code for pdf generation using a DDX files. Specifically, the ddx2.cfm

I'm getting the same error as Brian above.

failed: DDXM_S18005: An error occurred in the PrepareTOC phase while building <TableOfContents>. Cause given.

I've narrowed it down. When you add bookmark="true" to cfdocument you get the error. If you don't have bookmark="true" it works but no TOC. But I saw your output2.pdf HAS a TOC. Any idea why your code won't run on my copy of CF8? I've tried it on the developer edition and a standard version.

Thanks!
# Posted By Sid Maestre | 6/10/08 11:13 PM
Are you running 801 along with the cumulative hot fix?
# Posted By Raymond Camden | 6/11/08 8:18 AM
Hi Ray,

Installing the 8.01 update solved the problem. That will teach me to not run the latest version of CF.

Did a post (http://www.designovermatter.com/blog/index.cfm/200...) for anyone who does a Google search on the error message. (Which is what I did.)

Cheers
# Posted By Sid Maestre | 6/12/08 7:40 AM
Hi Ray, Obviously being a bit stupid here but it appears the only way I can create a List of Contents is by setting the bookmark is true and giving the bookmark a name using the <cfdocument> tag. Is there no other way?
# Posted By Terry Collinson | 8/14/08 5:44 AM
Terry - the short answer is yes. The long answer is that it appears, MAYBE, that you can do it via DDX as well:

http://livedocs.adobe.com/livecycle/8.2/ddxRef/wwh...

But certainly it's easier doing it in CFML (imho).
# Posted By Raymond Camden | 8/14/08 8:18 AM