I just pushed up an update to Seeker, my ColdFusion Lucene project. I added support for MS Word documents and MS Excel files. This was incredibly easy using JavaLoader from Mark Mandel and the POI project.
Todd Sharp gets credit for pushing both of these ideas to me. He also made a good suggestion for how to use JavaLoader within Seeker. Seeker makes use of various "reader" CFCs. Each CFC is responsible for one or more file types. A CFC 'registers' itself using metadata. So here is what plaintext.cfc looks like:

<cfcomponent output="false" hint="Plain text reader." extensions="xml,txt,html,htm,cfm,cfc" extends="reader">

<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">

	<cffile action="read" file="#arguments.file#" variable="result">
	<cfreturn result>

</cffunction>

</cfcomponent>
Note the extensions attribute. It says that this reader will be used for all of the listed plain text file types. What Todd suggested was using a similar method for the Java classes. I'm not terribly happy with the names, but this is what I did.
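To make the registration idea concrete, here is a minimal sketch of how code could read the extensions attribute from a reader CFC. This is not Seeker's actual core code. The function name buildReaderMap and the readerNames argument are hypothetical. The sketch relies only on the fact that custom attributes on cfcomponent show up in the struct returned by getMetaData().

```cfml
<!--- Hypothetical helper: maps each file extension to the reader CFC that claims it. --->
<cffunction name="buildReaderMap" access="public" returnType="struct" output="false">
	<!--- Assumption: the caller passes dot paths of reader CFCs, e.g. "readers.plaintext". --->
	<cfargument name="readerNames" type="array" required="true">
	<cfset var readerMap = structNew()>
	<cfset var reader = "">
	<cfset var md = "">
	<cfset var ext = "">
	<cfset var i = 0>

	<cfloop index="i" from="1" to="#arrayLen(arguments.readerNames)#">
		<cfset reader = createObject("component", arguments.readerNames[i])>
		<!--- Custom attributes on <cfcomponent> appear as keys in the metadata struct. --->
		<cfset md = getMetaData(reader)>
		<cfif structKeyExists(md, "extensions")>
			<!--- One map entry per extension in the comma-delimited list. --->
			<cfloop index="ext" list="#md.extensions#">
				<cfset readerMap[lCase(trim(ext))] = reader>
			</cfloop>
		</cfif>
	</cfloop>

	<cfreturn readerMap>
</cffunction>
```

With a map like this, picking a reader for a file comes down to looking up the file's extension, and a requires attribute could be read from the same metadata struct in the same way.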
When you add requires= to your reader CFC, you specify a list of Java classes. Like so:
<cfcomponent output="false" hint="MS Office format reader." extensions="doc,xls" requires="org.apache.poi.hwpf.HWPFDocument, org.apache.poi.hwpf.extractor.WordExtractor, org.apache.poi.hssf.extractor.ExcelExtractor, org.apache.poi.hssf.usermodel.HSSFWorkbook" extends="reader">
(Spaces were added by me.) When Seeker runs, it will notice these requirements and use JavaLoader to load them. There is a JARs folder that is autoloaded, and it is expected that if your CFC needs a JAR, you will put it in that folder. Since I'm using JavaLoader, all of these JARs are plug and play, with no need to restart ColdFusion. Working with the classes is simple as well:
<cfset var doc = getRequirement("org.apache.poi.hwpf.HWPFDocument")>
This calls a method in the inherited CFC that gets the class that was loaded by JavaLoader and injected by the core Seeker code. I'm not happy with that method name there, but it works.
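As a rough illustration, a read method for Word documents could look something like the sketch below. It is not the code that ships with Seeker. It assumes that getRequirement() returns the class as created by JavaLoader, so that calling init() on it invokes the Java constructor. The POI calls used here (HWPFDocument built from an input stream, WordExtractor built from that document, and getText()) are part of POI's HWPF API.

```cfml
<cffunction name="read" access="public" returnType="string" output="false">
	<cfargument name="file" type="string" required="true">
	<cfset var result = "">
	<!--- java.io.FileInputStream ships with the JDK, so a plain createObject call is enough. --->
	<cfset var fis = createObject("java", "java.io.FileInputStream").init(arguments.file)>
	<cfset var doc = "">
	<cfset var extractor = "">

	<cftry>
		<!--- Assumption: getRequirement() hands back the JavaLoader-created class; init() calls its constructor. --->
		<cfset doc = getRequirement("org.apache.poi.hwpf.HWPFDocument").init(fis)>
		<cfset extractor = getRequirement("org.apache.poi.hwpf.extractor.WordExtractor").init(doc)>
		<!--- getText() returns the document body as one string. --->
		<cfset result = extractor.getText()>
		<cfset fis.close()>
		<cfcatch type="any">
			<!--- Close the stream before passing the error on. --->
			<cfset fis.close()>
			<cfrethrow>
		</cfcatch>
	</cftry>

	<cfreturn result>
</cffunction>
```

This only covers .doc files. An Excel branch would presumably follow the same pattern with HSSFWorkbook and ExcelExtractor, picking the branch based on the file extension.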
Comment 1 written by Scott P on 20 June 2008, at 2:32 PM
Comment 2 written by Peter Hoopes on 20 June 2008, at 3:07 PM
Comment 3 written by nick tong on 21 June 2008, at 4:05 AM
Comment 4 written by Raymond Camden on 21 June 2008, at 6:45 AM
@Nick - Rough thoughts:
So obviously if you aren't on a Mac, then Verity is built in. No need to 'install' Seeker. I also like the category and suggestions support for Verity. You could probably duplicate category support, but it would be more difficult to do suggestions. (As far as I know, I'm still learning Lucene.)
The big plus for Lucene is index size. You have no license limits like Verity (250k). As I blogged about earlier, I tested w/ an index of 25 million records.
Comment 5 written by nick tong on 21 June 2008, at 11:04 AM
Comment 6 written by Raymond Camden on 23 June 2008, at 2:55 PM
Comment 7 written by Medman on 23 June 2008, at 4:20 PM
Does Seeker work with ColdFusion 7?
Comment 8 written by Raymond Camden on 23 June 2008, at 4:23 PM
Comment 9 written by Sami Hoda on 30 June 2008, at 1:51 PM
Comment 10 written by Raymond Camden on 30 June 2008, at 2:02 PM
Comment 11 written by Sami Hoda on 30 June 2008, at 2:04 PM
<a href="seeker/index.cfm" target="content">Seeker</a><br> to it. Is the XML the new approach?
Comment 12 written by Raymond Camden on 30 June 2008, at 2:05 PM
Comment 13 written by Sami Hoda on 30 June 2008, at 2:06 PM
Comment 14 written by Sami Hoda on 30 June 2008, at 2:07 PM
Comment 15 written by Raymond Camden on 30 June 2008, at 2:11 PM
Comment 16 written by Sami Hoda on 30 June 2008, at 3:53 PM
Comment 17 written by Will Wilson on 27 April 2009, at 3:09 PM
Any chance you could add snippets on the next version (similar to verity where it highlights text etc).
Would also be cool if you could link to pages within a framework...although I'm baffled how one would accomplish this.
Keep up the good work! Being on a mac, I'm finding this tool invaluable!
Comment 18 written by Raymond Camden on 27 April 2009, at 3:59 PM
Also, you need to clearly differentiate between snippets and context. A snippet could be from anywhere in the document, but helps identify the document, whereas context shows you the match. So I'm sure you mean context.
Comment 19 written by Will Wilson on 28 April 2009, at 2:29 PM
Comment 20 written by janusz on 17 June 2009, at 4:20 AM
I'm wondering if Seeker allows for more than one index to be created?
I currently need to index lots of different tables with different columns. Instead of trying to collate them into one index, I thought it might be easier to create more than one index.
Thanks
Comment 21 written by Raymond Camden on 17 June 2009, at 6:16 AM
Comment 22 written by janusz on 17 June 2009, at 6:38 AM
If I have this...
<cf_indexquery directory="#index_folder#" indexdirectory="#index_folder#"
query="#arguments.index_qry#"
storecolumns="id,title,content,type,link" indexcolumns="id,title,content">
where would the name of the index file go?
Seeker has been a lifesaver as I'm on a Mac.
Thanks
Comment 23 written by Raymond Camden on 17 June 2009, at 6:39 AM
Comment 24 written by janusz on 17 June 2009, at 6:45 AM
Comment 25 written by rchinoy on 28 January 2010, at 3:48 PM
It doesn't seem like stemming is working when I use Seeker. Is there something I need to do to get it working?
Thanks
Comment 26 written by Raymond Camden on 1 February 2010, at 12:49 PM
Comment 27 written by farshid on 8 February 2010, at 8:38 AM
What are you doing with MS Word math equations? Could you insert a formula into a database via a rich text box, or by reading it from doc files?
Please help me.
regards
farshid