Yesterday I wrote a blog entry on reading Microsoft Office documents with ColdFusion, Apache POI, and JavaLoader. One of the commenters, Leah, shared some code that made use of the latest beta of POI. This makes the reading quite a bit simpler. I had tried this myself but ran into trouble. Thanks to Leah, I'm now able to demonstrate a new version that is quite a bit simpler.
First, make sure you have read the previous entry, as some of this won't make sense without the background information. The next thing you want to do is grab POI 3.5 (List of Mirror) and unzip it. Copy all the JARs, all the lib contents, and the ooxml-lib files, into a new subfolder called jars2. "jars2" as a name isn't required of course. My previous version of this code used the jars folder for the 3.2 files so I figured I'd use jars2 for the 3.5 code. Our initialization code is virtually the same as before:1 <!--- where the poi files are --->
2 <cfset jarpath = expandPath("./jars2")>
3 <cfset paths = []>
4 <cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
5
6 <cfloop query="files">
7 <cfset arrayAppend(paths, directory & "/" & name)>
8 </cfloop>
9
10 <!--- load javaloader --->
11 <cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>
Now for the cool part. Remember how we had around 8 or so specific Java classes to do our parsing? This was because each Office type we worked with (Word, Excel, Powerpoint) had their own code and APIs to get at the text. POI 3.5 makes this a bit simpler with a factory called the ExtractorFactory. Here is the rest of the file:
2 <cfset jarpath = expandPath("./jars2")>
3 <cfset paths = []>
4 <cfdirectory action="list" name="files" directory="#jarpath#" filter="*.jar" recurse="true">
5
6 <cfloop query="files">
7 <cfset arrayAppend(paths, directory & "/" & name)>
8 </cfloop>
9
10 <!--- load javaloader --->
11 <cfset loader = createObject("component", "javaloader.JavaLoader").init(paths)>
1 <!--- generic file reader doohicky --->
2 <cfset myfile = createObject("java","java.io.File")>
3
4 <!--- get our required things loaded --->
5 <cfset extractorFactory = loader.create("org.apache.poi.extractor.ExtractorFactory")>
6
7 <!--- get files --->
8 <cfset filePath = expandPath("./testdocs")>
9 <cfdirectory action="list" name="files" directory="#filePath#" filter="*.doc*|*.ppt*|*.xls*">
10
11 <cfloop query="files">
12 <cfset theFile = filePath & "/" & name>
13 <cfset myfile.init(theFile)>
14
15
16 <cfoutput>Reading: #theFile#<br/></cfoutput>
17
18 <cfset extractor = extractorFactory.createExtractor(myFile)>
19 <cfoutput><pre>#extractor.getText()#</pre></cfoutput>
20
21 <p><hr/></p>
22
23 </cfloop>
I made one File object and one instance of the ExtractorFactory. Once I've done that, look how darn simple the code is!
2 <cfset myfile = createObject("java","java.io.File")>
3
4 <!--- get our required things loaded --->
5 <cfset extractorFactory = loader.create("org.apache.poi.extractor.ExtractorFactory")>
6
7 <!--- get files --->
8 <cfset filePath = expandPath("./testdocs")>
9 <cfdirectory action="list" name="files" directory="#filePath#" filter="*.doc*|*.ppt*|*.xls*">
10
11 <cfloop query="files">
12 <cfset theFile = filePath & "/" & name>
13 <cfset myfile.init(theFile)>
14
15
16 <cfoutput>Reading: #theFile#<br/></cfoutput>
17
18 <cfset extractor = extractorFactory.createExtractor(myFile)>
19 <cfoutput><pre>#extractor.getText()#</pre></cfoutput>
20
21 <p><hr/></p>
22
23 </cfloop>
1 <cfset extractor = extractorFactory.createExtractor(myFile)>
The factory takes care of all the sniffing and ensuring the right extractor is returned. I then just run getText() and we're done. Simpler than a debate with Lindsey Lohan!
I've attached the code to the blog entry. Later today I'll talk about how to get at some of the metadata for Office documents. (Note, the attached zip does not have the jars from POI 3.5, they were a bit too big.)


Comment 1 written by Andrius on 8 July 2009, at 5:56 AM
However, you could save the javaloader into the server scope and use it from there to avoid memory leaks (please refer to readme.txt within the javaloader folder).
There is a known memory leak issue when using a java.net.URLClassLoader to load external jar files. Please find more info here: http://www.compoundtheory.com/?action=displayPost&...
Comment 2 written by Andrew Duvall on 1 October 2009, at 3:04 PM
I am using the now release final poi-3.5-final.zip file and now have a total of 7 jar files (16.7 MB)
\jars2\lib\commons-logging-1.1.jar
\jars2\lib\junit-3.8.1.jar
\jars2\lib\log4j-1.2.13.jar
\jars2\ooxml-lib\dom4j-1.6.1.jar
\jars2\ooxml-lib\geronimo-stax-api_1.0_spec-1.0.jar
\jars2\ooxml-lib\ooxml-schemas-1.0.jar
\jars2\ooxml-lib\xmlbeans-2.3.0.jar
Question: is that all i need from the zip file?
Issue: I get Error Occurred While Processing Request
org.apache.poi.extractor.ExtractorFactory
The error occurred in C:\Dev\libs\contactsImporter\javaloader\JavaLoader.cfc: line 94
92 : <cfargument name="className" hint="The name of the class to create" type="string" required="Yes">
93 : <cfscript>
94 : var class = getURLClassLoader().loadClass(arguments.className);
95 :
96 : return createJavaProxy(class);
Comment 3 written by Andrew Duvall on 1 October 2009, at 4:54 PM
I didn't know where to find the FINAL versions jar files, but used the missing jars from beta6 found here:
http://www.jarvana.com/jarvana/browse?path=/org/ap...
folders:
4. poi
5. poi_contrib
6. poi_ooxml
7. poi-scratchpad
It seems like the POI project is using POM files; not sure if that is why certains JAR files are gone? not sure what POM files are used for either :)
Anyway, I don't doubt I'll be schooled here; but this is what I did to get my app working for now.
Comment 4 written by Raymond Camden on 1 October 2009, at 4:56 PM
Comment 5 written by Don on 18 February 2010, at 2:24 PM
See, my problem is that I have THOUSANDS of RTF files on a server. The server is dieing so we moved the application to a new server. Now we have to read/edit the rtf files somehow. Previously it was done by simply using cfcontent. This opened Word and they did their thing. That doesn't work now since the files are on 1 server and the application on another. The app server does not have office at all on it.
Fun fun fun.
And now "for security" they want to put these files in an Oracle db. I can do that, but getting them out in an editable format is kicking my butt.