Have you ever needed to get information about Office documents with ColdFusion? In this blog entry I'll talk about how we can use some open source tools along with ColdFusion to read from various types of Microsoft Office documents. I'll be focusing on retrieving the text of the documents only. You can use this code to allow users to upload Word documents and provide basic search features against the actual contents of the documents.
Now you may ask - doesn't the built in Verity engine do this? It does, but if you have any other need for the text, you are out of luck. Verity sucks in the text and stores it in it's own format. You can't ask Verity to make a copy of the complete text. Let's get started!
Our code will make use of two open source products. The first is the Apache POI product. This is a set of Java libraries that allow deep integration into Office documents, both reading and writing. Our use of the library will be fairly simple - just reading. I downloaded the latest 3.2 version which provides support for Office formats from the old 97 versions up to the Office 2007 release. For OOXML you can use POI 3.5, which is in beta. I had trouble playing with this so decided to focus on the non-OOXML version. After you download the zip, open up the archive and copy the 3 JAR files you find in the root. I copied these to a folder called 'jars' under my web root. The next thing we need is JavaLoader. JavaLoader is a ColdFusion project created by Mark Mandel (the creator of Transfer). This code lets you load any random JAR file on the fly. Normally you have to copy new JARs to a specific location under your ColdFusion install and you have to follow this up with a server restart. JavaLoader is a much simpler way to handle this. It also helps with another problem - class conflicts. ColdFusion itself makes use of POI (you can see a few POI jar files in the lib folder) but the version bundled is older than what we need to use to extra our text. By using JavaLoader, we can make sure everything comes directly from the JARs in play. I downloaded JavaLoader and extracted the code to a folder named javaloader under my web root. Alright, so let's look at some code. The first thing I want to do is get JavaLoader initialized with the JAR files from POI. When you initialize JavaLoader you pass in an array of JAR files. Since I put everything in a folder, I wrote some code to simply iterate over the directory and create a list of all the JARs. I then pass this to JavaLoader:2 <cfset jarpath = expandPath("./jars")>
3 <cfset paths = []>
4 <cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
5
6 <cfloop query="files">
7 <cfset arrayAppend(paths, directory & "/" & name)>
8 </cfloop>
9
10 <!--- load javaloader --->
11 <cfset variables.loader = createObject("component", "javaloader.JavaLoader").init(paths)>
2 <cfset myfile = createObject("java","java.io.FileInputStream")>
3
4 <!--- get our required things loaded --->
5
6 <!--- Word --->
7 <cfset doc = loader.create("org.apache.poi.hwpf.HWPFDocument")>
8 <cfset wordext = loader.create("org.apache.poi.hwpf.extractor.WordExtractor")>
9
10 <!--- Excel --->
11 <cfset excel = loader.create("org.apache.poi.hssf.usermodel.HSSFWorkbook")>
12 <cfset xlsext = loader.create("org.apache.poi.hssf.extractor.ExcelExtractor")>
13
14 <!--- Powerpoint --->
15 <cfset ppt = loader.create("org.apache.poi.hslf.HSLFSlideShow")>
16 <cfset pptext = loader.create("org.apache.poi.hslf.extractor.PowerPointExtractor")>
2 <cfset filePath = expandPath("./testdocs")>
3 <cfdirectory action="list" name="files" directory="#filePath#">
4
5
6 <cfoutput query="files">
7 <cfset theFile = filePath & "/" & name>
8 <cfset myfile.init(theFile)>
9
10 Reading: #theFile#<br/>
11
12 <cfswitch expression="#listLast(name,".")#">
13
14 <cfcase value="doc">
15 <cfset doc = doc.init(myfile)>
16 <cfset wordext.init(doc)>
17 <cfoutput>
18 <pre>
19 #wordext.getText()#
20 </pre>
21 </cfoutput>
22 </cfcase>
23
24 <cfcase value="xls">
25 <cfset excel = excel.init(myfile)>
26 <cfset xlsext = xlsext.init(excel)>
27 <cfoutput>
28 <pre>
29 #xlsext.getText()#
30 </pre>
31 </cfoutput>
32 </cfcase>
33
34 <cfcase value="ppt">
35 <cfset ppt = ppt.init(myfile)>
36 <cfset pptext = pptext.init(ppt)>
37 <cfoutput>
38 <pre>
39 #pptext.getText(true,true)#
40 </pre>
41 </cfoutput>
42 </cfcase>
43 </cfswitch>
44
45 <p><hr/></p>
46
47 </cfoutput>
Once you have the text you can store it in a database, email it, or do anything else you would like with it.
Pretty simple and powerful! Also note that you can get other data then just the text. Each extractor has it's own options and each Office object itself can do it's own unique thing. For example, with the Powerpoint support you can choose to get the text of slides or the notes, or both. Check the docs at Apache for more information.
How are people using POI in the wild? Ben Nadel has his POI Utility which enables for reading/writing Excel docs. Todd Sharp used to use it for SlideSix but has since moved on to using OpenOffice.


Comment 1 written by todd sharp on 4 February 2009, at 1:10 PM
Good post Ray!
Comment 2 written by Edward Beckett on 4 February 2009, at 1:25 PM
I see envision lot's of opportunities with this ...
Great post Ray ... :)
Comment 3 written by Dave DuPlantis on 4 February 2009, at 1:27 PM
So it may simply be that there isn't that much for POI to hook into.
Comment 4 written by todd sharp on 4 February 2009, at 2:19 PM
Comment 5 written by Cyril Hanquez on 4 February 2009, at 2:26 PM
Now I'm trying to switch to Tika (http://lucene.apache.org/tika/) to support more types of document.
Comment 6 written by Martijn van der Woud on 4 February 2009, at 2:31 PM
Comment 7 written by Leigh on 4 February 2009, at 2:49 PM
> I had trouble playing with this
Just curious, but what problems did you have? I just started using the beta this week and have had pretty good results so far. Though, it does require more jars than previous versions.
But I am very glad you posted this - because it made discover there is a handy factory class for extracting text. You can pass in a file and the factory returns the correct type of extractor. I have not tried it with PPT, but it seems to be working well with excel and word.
ExtractorFactory = javaLoader.create("org.apache.poi.extractor.ExtractorFactory");
inputFile = createObject("java", "java.io.File").init( "c:\myFiles\testExcel2007.xlsx");
excelExt = ExtractorFactory.createExtractor(inputFile);
WriteOutput("excelExt text=<hr>"& excelExt.getText() &"<hr>");
Leigh
Comment 8 written by Leigh on 4 February 2009, at 2:50 PM
Darn fat fingers.
Comment 9 written by Raymond Camden on 4 February 2009, at 2:52 PM
Comment 10 written by Raymond Camden on 4 February 2009, at 2:58 PM
Comment 11 written by Raymond Camden on 4 February 2009, at 3:11 PM
So my next step is to delete your comment, ban your IP, and take credit! Muahahahahahahahahhahahahaha!
(Sorry! ;)
Comment 12 written by Mikkel Johansen on 4 February 2009, at 3:12 PM
Reading Office 2007 files is very usefull. Especially when Verity does not handle docx and xlsx files.
Now I can read the text from the Office 2007 files and index the text with CFINDEX.
- Mikkel
Comment 13 written by Raymond Camden on 4 February 2009, at 3:15 PM
Comment 14 written by Raymond Camden on 4 February 2009, at 3:19 PM
So question: Does it make sense for me to build a generic reader CFC for Office files. Something that will get text and get MD properties? I don't want to duplicate Ben's work, but his is focused on Excel only. Would a CFC to just read various office formats be worth the time/effort? Should I ping Ben about possibly expanding his POIUtils to cover more Office formats?
Comment 15 written by Mark on 4 February 2009, at 3:25 PM
Comment 16 written by Mikkel Johansen on 4 February 2009, at 3:29 PM
And to me it is out-dated. Latest Word format is Word 2002, PDF version is 1.4 (acrobat 5).
If Adobe wants to use Verity as a sales parameter. Then they have to update the supported file types.
Comment 17 written by Raymond Camden on 4 February 2009, at 3:29 PM
You could wrap it in <pre> and put that in cfdocument to get started.
Comment 18 written by Raymond Camden on 4 February 2009, at 3:31 PM
Comment 19 written by todd sharp on 4 February 2009, at 3:37 PM
You could also use OpenOffice to do it, but that requires OpenOffice running as a service on the server.
I've been meaning to blog on that topic for about 6 months now. Maybe some day...
Comment 20 written by todd sharp on 4 February 2009, at 3:38 PM
Comment 21 written by Mark on 4 February 2009, at 3:43 PM
I would be very interested in the OpenOffice method.
Comment 22 written by Eric Hoffman on 4 February 2009, at 5:28 PM
Very much so.
Comment 23 written by Matt Williams on 5 February 2009, at 8:05 AM
Comment 24 written by Raymond Camden on 5 February 2009, at 8:16 AM
Comment 25 written by Steve Moore on 5 February 2009, at 8:36 AM
Comment 26 written by Leigh on 5 February 2009, at 3:25 PM
> I also have another update showing getting some basic metadata.
Cool. I look forward it. I have not gotten around to exploring metadata with the newer format yet.
PS: My response was delayed because I think my ip was banned (okay, I forgot to subscribe .. but being banned makes me sounds dangerous, like an outlaw ;-)
Comment 27 written by Raymond Camden on 5 February 2009, at 3:28 PM
Comment 28 written by Raymond Camden on 6 February 2009, at 11:44 AM
Comment 29 written by Leigh on 7 February 2009, at 7:23 PM
Comment 30 written by john on 9 March 2009, at 5:01 PM
I am trying to use a Word document as a template, read it in using POIFS, modify some content, and create a new Word document WITH the styles from the original document.
Can you clue me in as to how to do this or where there is good example code for ColdFusion/POIFS Word document creation?
Thanks!
Comment 31 written by Raymond Camden on 9 March 2009, at 9:12 PM
Comment 32 written by Todd Rafferty on 13 July 2009, at 3:52 PM
Comment 33 written by todd sharp on 13 July 2009, at 3:55 PM
Comment 34 written by Todd Rafferty on 13 July 2009, at 3:58 PM
Comment 35 written by Josh on 16 December 2009, at 1:15 PM
I am trying to convert PPTX files to PPT files because CF doesn't like to convert PPTX files to a PDF. I am using the Open Office ODFConverter to convert my PPTX files to an ODP file before I convert it to a PDF. This works great.
When I try to use CFDocument to convert that ODP file to a PDF I get a bunch of gibberish on the page.
Do you know how to convert these ODP files to a PDF?
<cfdocument
format="pdf"
srcfile="#directory#\sourceFile.odp"
filename="#directory#\destinationFile.pdf"
overwrite="yes">
</cfdocument>
Thanks for the help
Comment 36 written by Raymond Camden on 16 December 2009, at 1:22 PM
Comment 37 written by todd sharp on 16 December 2009, at 1:42 PM
Comment 38 written by Josh on 16 December 2009, at 3:27 PM
@Todd... I'll check out the JODConverter and see how it goes. Thank you for the suggestion.
Comment 39 written by Clarence on 23 August 2010, at 1:36 PM
<cfset docxExt = extractorFactory.createExtractor(fileIo)>
Any output after this tag doesn't show, I can confirm everything works until then, and doc files work fine.
fileIo is a fileInputStream, I've tried a java.io.file too, no difference.
After running my test script, the docx file is still locked since any attempts to change the filename returns an error saying Jrun still has it.
So basically it goes into this method and never returns, any ideas?
[Add Comment] [Subscribe to Comments]