Yesterday I blogged about ColdFusion and DDX, a way to do some fancy-pants neato transformations of PDF documents. One of the cooler examples was that DDX could be used to grab the text from a PDF file. For those who thought it might be too difficult to use the DDX, I've wrapped up the code in a new ColdFusion Component I'm calling PDF Utils. (Coming to a RIAForge near you soon. Watch the skies...)
Right now the CFC has one method, getText. You pass in the path to a PDF and you get back an array of pages. Each item in the array is the text on that particular page. I've included two sample PDFs with this blog post. One is a normal PDF with simple text. As you can imagine, the function works great with it. The other one is a highly graphical, wacky-looking PDF. OK, it isn't wacky-looking per se, but it isn't a simple letter. When the method is run on this PDF, the text does come back, but it is a bit crazy looking. I think this is to be expected though. And what's cool is that if your intent is to get the text out for searching/indexing purposes, you can still find it useful.
Anyway, here is a sample:
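<!--- Assumption: the line that created the component was lost from this listing; the name "pdfutils" is a guess --->
<cfset pdf = createObject("component", "pdfutils")>
<cfset mypdf = expandPath("./paristoberead.pdf")>

<!--- getText returns an array with one item of text per page --->
<cfset results = pdf.getText(mypdf)>
<cfdump var="#results#">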
Which gives this result:

The zip includes 2 PDFs, the component, and my test script.


Comment 1 written by Ben Nadel on 25 July 2007, at 3:58 PM
Comment 2 written by Raymond Camden on 25 July 2007, at 4:01 PM
I've actually got an entire online PDF Editor I worked on for the CF8 tour. I'm going to load it up when I wrap the CFPDF series.
Comment 3 written by Ben Nadel on 25 July 2007, at 4:07 PM
Comment 4 written by Lola LB on 26 July 2007, at 6:55 AM
Comment 5 written by Oğuz Demirkapı on 10 September 2007, at 12:56 PM
Comment 6 written by Raymond Camden on 10 September 2007, at 1:00 PM
Comment 7 written by Mark on 4 April 2008, at 10:52 PM
test.cfm seems OK
test2.cfm dumps the PDFDocument structure? Is the cfpdf write failing silently?
genpdf.cfm works (just try paristoread_new.pdf or whatever)
xmptest.cfm returns [empty string] no matter what I try. . ...
Comment 8 written by Raymond Camden on 5 April 2008, at 12:36 PM
As for xmptest.cfm, it will be empty if your PDF doesn't use XMP. Not all PDFs do. If you think yours does and it doesn't work, email me the PDF.
Comment 9 written by Johnny on 22 July 2008, at 10:07 AM
Can I do the same in MX7?
Thanks
Comment 10 written by Raymond Camden on 24 July 2008, at 9:39 AM
Comment 11 written by Tim on 29 September 2008, at 1:37 PM
Comment 12 written by Raymond Camden on 29 September 2008, at 1:43 PM
Comment 13 written by Raymond Camden on 29 September 2008, at 2:09 PM
Comment 14 written by Armando on 12 February 2009, at 4:53 PM
Processing seems to stop as soon as the cfpdf tag calls processddx.
This is on a shared ColdFusion 8 server at crystaltech.
Comment 15 written by Raymond Camden on 14 February 2009, at 10:08 AM
Comment 16 written by Virginia Neal on 28 May 2010, at 6:24 PM
Thanks
Comment 17 written by Raymond Camden on 30 May 2010, at 8:54 AM
Comment 18 written by Virginia Neal on 1 June 2010, at 9:18 AM
I have tried both .doc and .docx. Code is simple
<cfpdf useStructure="true" addquads="false" honourspaces="true" type="string" action="extracttext" source="test.docx.pdf" name="pdfToText" />
The document is simple and contains about 7 lines, just enough to test line breaks, indents, and centering.
Thanks for any help you can provide.
Comment 19 written by Raymond Camden on 1 June 2010, at 9:21 AM
Comment 20 written by Virginia Neal on 1 June 2010, at 9:34 AM
Comment 21 written by Raymond Camden on 1 June 2010, at 9:35 AM
Comment 22 written by Virginia Neal on 1 June 2010, at 9:38 AM
Comment 23 written by Raymond Camden on 1 June 2010, at 9:40 AM
Comment 24 written by Raymond Camden on 1 June 2010, at 10:21 AM
This may be a bit much - but you could use Google Docs. I have a wrapper CFC for it. You could upload to Google Docs, then download as HTML or text. It would be slow (well, slowish), but if you needed it for one-time conversions, it would be acceptable I think.
Comment 25 written by Virginia Neal on 1 June 2010, at 10:30 AM
I had also tried the DDX approach and the quads option. I had briefly considered using the coordinates, but the docs I need to convert are gigantic and I get as many as 40 a day. Given that, I don't think the Google approach will work either. I have 3rd party software that currently converts Word to PDF and to Text, but I had hoped to get rid of the software and let CF do both. With CF 9, I am now able to convert the docs to PDF, but thus far no easy way to go to text. Maybe the POI will work.
Comment 26 written by Raymond Camden on 1 June 2010, at 10:32 AM
As to your 3rd party tool - shoot, if it works, use it! I'd brush my teeth with ColdFusion if I could, but at the end of the day, you want to use what works best.
Comment 27 written by Virginia Neal on 1 June 2010, at 10:42 AM
Anyway, thanks again for your help!! It is good to know it is not just me and banging my head for another week won't change the fact that the tag won't work!