Armatiek BV
Nieuwendammerdijk 40
1025 LN Amsterdam
The Netherlands
phone: +31612998045
Infofuze Technical Overview
Maarten Kroon, Arjan Loeffen
Armatiek BV
Version 1.0, March 2013
Table of Contents
1.1 Java API for XML Processing (JAXP)
1.2 Source classes
1.3 Result classes
1.5.1 Merging source streams
1.5.2 Splitting result streams
1.6.1 Use the Infofuze Command Line Interface (CLI)
1.6.2 Write (and schedule) an Ant Task
1.6.3 Write your own Java code
1.7 Full and delta transformations
1.8.1 infofuze-config.xml
1.8.2 jobs.xml
1.8.3 tasks.xml
2.1 Filesystem based source classes
2.1.1 File types
2.1.1.1 XML files
2.1.1.2 Binary files
2.1.1.3 JSON files
2.1.1.4 CSV files
2.1.1.5 Compressed files
2.1.1.6 Unparseable files
2.1.2 Filesystem based sources
2.1.2.1 LocalFileSystemSource
2.1.2.2 CIFSSource
2.1.2.3 WebDAVSource
2.1.2.4 WebCrawlSource
2.1.2.5 FTPSource
2.2.1 DirectJDBCSource
2.2.2 PooledJDBCSource
2.2.3 JNDIJDBCSource
2.3 SolrSource
2.4 MongoDBSource
2.5 HTTPSource
2.6 LDAPSource
2.7 NullSource
3.1 FileResult
3.2 SolrjResult
3.3 MongoDBResult
3.4.1 DirectJDBCResult
3.4.2 PooledJDBCResult
3.4.3 JNDIJDBCResult
3.5 JDBCResult
3.6 HTTPResult
3.7 NullResult
6.2 infofuze-cli jar
Infofuze is a Java library and server application that can be used to transform and combine XML stream representations of various sources into a specific XML output stream that can be stored or indexed. These transformations can be fully configured and scheduled. Infofuze is based on the XML transformation interface of the Java API for XML Processing (JAXP) which is bundled with standard Java (J2SE). Infofuze is written in 100% pure Java and will run on all systems for which a Java 1.6 Virtual Machine is available.
Diagram 1 shows this XML transformation interface in action. A TransformerFactory object is instantiated and used to create a Transformer. The source object is the input to the transformation process and must implement the interface javax.xml.transform.Source; it can be, among others, an instance of the standard classes SAXSource, DOMSource, or StreamSource. Similarly, the result object is the result of the transformation process and must implement the interface javax.xml.transform.Result; it can be, among others, an instance of the standard classes SAXResult, DOMResult, or StreamResult.
Diagram 1: JAXP XML transformation interface
A transformer may be created from a set of transformation instructions, in which case the specified transformations are carried out. These instructions can be written in the XSLT or STX language, depending on the implementation of the Transformer object (STX is a transformation language for streaming XML transformations). If the transformer is created without any specific instructions, it simply copies the source to the result.
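A minimal sketch of this interface in plain Java (no Infofuze classes involved; the file names are only illustrative) looks like this:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SimpleTransform {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Created from transformation instructions (here an XSLT stylesheet);
        // calling factory.newTransformer() without arguments returns an
        // identity transformer that simply copies the source to the result.
        Transformer transformer =
            factory.newTransformer(new StreamSource(new File("transform.xsl")));
        transformer.transform(
            new StreamSource(new File("input.xml")),
            new StreamResult(new File("output.xml")));
    }
}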
In Infofuze, the Source object can be of any class that implements the interface javax.xml.transform.Source or any of the Source classes that are provided by Infofuze. These Source classes include:
Filesystem based sources*:
LocalFileSystemSource
CIFSSource
WebDAVSource
WebCrawlSource
FTPSource
HTTPSource
JDBC based sources:
DirectJDBCSource
PooledJDBCSource
JNDIJDBCSource
SolrSource
MongoDBSource
HTTPSource
LDAPSource
NullSource
(*) The filesystem based sources have specific support for structured transformation of XML, JSON, CSV, binary and compressed files (see section 2.1.1)
The task of these Source classes is to provide as much structured data as possible from specific sources in the form of an XML stream (of any form). Therefore these classes all extend from StreamSource. The Infofuze Source classes are further described in chapter 2.
The Result object can be of any class that implements the interface javax.xml.transform.Result (like StreamResult which can be used to write the output to a file) or any of the Result classes that are provided by Infofuze. These Result classes include:
FileResult
SolrjResult
MongoDBResult
JDBC based results:
DirectJDBCResult
PooledJDBCResult
JNDIJDBCResult
HTTPResult
NullResult
The task of these Result classes is to stream XML data (of a specific form, like the Solr update format) to a specific data storage or index (like a Solr instance, a relational database or a webservice). Therefore these classes all extend from StreamResult or SAXResult. The Infofuze Result classes are further described in chapter 3.
The Transformer object can be an XSLT transformer, like the one provided by the XSLT processors Saxon or Xalan, or an STX transformer, like the one provided by Joost.
Although the transformation used by Infofuze can be an XSLT transformation, in most cases this is not the right choice for XML sources that are bigger than, say, a couple of tens of megabytes. An XSLT transformation requires that the whole stream is read into memory before the transformation can start.
Therefore in most cases it is preferred to do an STX transformation. STX (Streaming Transformations for XML) resembles XSLT on the syntactic level, but where XSLT template rules match on DOM nodes, STX template rules match on SAX events. You could say that STX combines SAX processing with the XSLT syntax. STX has one big drawback though: where in XSLT you have full random access to all XML nodes in the complete document, in STX you can only access the current node and its ancestors. Luckily, for non-trivial transformations of large XML streams STX and XSLT can be combined, so you can use XSLT to do more complex “local” transformations in the context of an STX transformation.
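Because STX is exposed through the same JAXP interface, an STX transformation can be set up in exactly the same way as an XSLT one. The sketch below assumes Joost is on the classpath and uses its standard TrAX factory class (net.sf.joost.trax.TransformerFactoryImpl); the file names are illustrative:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Instantiate Joost's TrAX factory directly instead of relying on the
// platform default factory (which would be an XSLT processor).
TransformerFactory stxFactory = new net.sf.joost.trax.TransformerFactoryImpl();
Transformer stx = stxFactory.newTransformer(new StreamSource("transform.stx"));
// The input is processed as a stream of SAX events, so it is never
// read into memory as a whole.
stx.transform(new StreamSource("large-input.xml"), new StreamResult("output.xml"));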
During the transformation of the main Source, other Sources can be read and merged into the main stream using XPath's document() function (XSLT) or the stx:process-document instruction (STX). The parameter uri or the attribute href must have the following format:
source://<sourcename>[?param_1=value_1&param_n=value_n]
where sourcename is the name of a source configured in infofuze-config.xml (see section 1.8.1).
XSLT example:
<xsl:apply-templates
select="document(concat('source://my-jdbcsource?id=',$id))/resultset/row"/>
STX example:
<stx:process-document
href="concat('source://my-jdbcsource?id=',$id)/resultset/row"/>
For instance, during the transformation of the primary source, an Oracle database view, one can query the same database for n-to-n relations, or query a Microsoft SQL Server database, directory service or CSV file for additional data. These secondary sources can be read with parameters; in a JDBC source, for instance, these parameters are used to fill in the parameters of a parameterized query.
Diagram 2: Merging streams
Using the xsl:result-document (XSLT) or stx:result-document (STX) instructions, it is possible to conditionally redirect the output stream to a specific Result class. For instance, during the transformation of a JDBC source of a table that contains data that is not fully normalized, one could fill one table in the JDBC result with the primary data of the source, and another table with normalized “lookup data”. Data from one database source can also be output as multiple smaller XML and/or CSV files.
The attribute href must have the following format:
result://<resultname>[?param_1=value_1&param_n=value_n]
where resultname is the name of a result configured in infofuze-config.xml.
XSLT example:
<xsl:result-document href="result://my-solrresult"/>
STX example:
<stx:result-document href="result://my-solrresult"/>
Diagram 3: Merging and splitting streams
An Infofuze transformation can be executed in three ways:
Use the Infofuze Command Line Interface (CLI).
Write (and schedule) an Apache Ant Task.
Write your own Java code.
Diagram 4: Executing transformations
The Infofuze command-line interface is implemented in the Java class com.armatiek.infofuze.cli.Transform. The interface has the following syntax:
Infofuze Transform
==================
usage: transform -s <name> -r <name> [-t <file>] [-m <mode>] [-i <id>]
[-u <classname>] [-o <classname>] [-x <classname>]
Options:
-h,--help prints this message
-s,--sourcename <name> name of the configured Source
-r,--resultname <name> name of the configured Result
-t,--transformationfile <file> path to the transformation file (xsl
or stx) to transform Source to
Result (optional)
-m,--mode <mode> mode of the transformation; "full",
"full_no_delete", "delta" or
"no_index" (optional)
-i,--transformationid <id> transformation identifier (optional)
-u,--uriresolver <classname> name of URIResolver class (optional)
-o,--outputuriresolver <classname> name of OutputURIResolver class
(optional)
-x,--xsltfactory <classname> name of TransformerFactory class for
XSLT transformations in context of
STX transformation (optional)
See also Deploying Infofuze
Apache Ant is a tool originally designed to automate software build processes. These processes are defined by writing a build file which contains a set of tasks (instructions) expressed in XML. Besides invoking the Java compiler, Ant provides a set of other standard tasks that include file and directory tasks (copying, moving, deleting), archiving tasks (compressing and decompressing files), logging tasks, mail tasks, remote tasks (sending and receiving files via ftp and scp, executing tasks via ssh), version control tasks (getting files from and committing files to CVS or Subversion) and so on. Infofuze provides a custom task transform that can be combined with all the standard Ant tasks to define versatile transformation jobs, where the actual transformation task can be preceded and/or followed by a set of other tasks without any programming. The Ant tasks can be defined in the configuration file tasks.xml.
Using the Terracotta Quartz scheduler, the server process of Infofuze can schedule the execution of Ant tasks and therefore transformation jobs. These executions have their own thread and can be executed in parallel (concurrently). The scheduling of the jobs can be defined in the configuration file jobs.xml. In this configuration file, one can define jobs and triggers. A job is the definition of what must be done, a trigger is the definition of when it must be done. In a typical scenario, the job definition will refer to the job class org.quartz.jobs.AntJob, which will invoke the execution of an Ant target within tasks.xml. The trigger can be a SimpleTrigger, for a “one-shot” execution of a job at a specific time, or a CronTrigger for job execution based on calendar-like schedules, such as “every Friday at noon” or “at 10:15 on the 10th day of every month”.
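The jobs.xml configuration shown in section 1.8.2 maps onto ordinary Quartz objects. Purely as an illustration (Infofuze configures this declaratively, so you normally never write this code yourself), the equivalent programmatic Quartz 1.8 calls look roughly like this; the job class is taken from the example configuration and the target name is illustrative:

import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ScheduleExample {
    public static void main(String[] args) throws Exception {
        // A job defines what must be done: run the Ant target "index-solr".
        JobDetail job = new JobDetail("job-index-solr", Scheduler.DEFAULT_GROUP,
                com.armatiek.infofuze.job.AntJob.class);
        job.getJobDataMap().put("target", "index-solr");

        // A trigger defines when it must be done: at 10:15 from Monday to Friday.
        CronTrigger trigger = new CronTrigger("trigger-index-solr",
                Scheduler.DEFAULT_GROUP, "0 15 10 ? * MON-FRI");

        Scheduler scheduler = new StdSchedulerFactory().getScheduler();
        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}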
You can write your own Java code, using the Infofuze Java API and JAXP, that creates a Transformer, pulls a Source and Result object from the SourcePool and ResultPool, and provides the Transformer with an XSLT or STX file. You can also use the class com.armatiek.infofuze.transformer.Transformer, which does all that work for you (and some additional work regarding the transaction handling of the Result classes). You can use that class directly, or use its source as an example. In fact, the command line interface of Infofuze is no more than a thin wrapper around that class.
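A minimal sketch of this approach is shown below. The pool lookups are hypothetical (the getInstance/getSource/getResult names are illustrative and not verified against the actual SourcePool and ResultPool API), the stylesheet path and configuration names are examples, and a real application would typically delegate to com.armatiek.infofuze.transformer.Transformer instead:

import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

public class MyTransformation {
    public static void main(String[] args) throws Exception {
        // Hypothetical pool lookups; the actual Infofuze API may differ.
        Source source = SourcePool.getInstance().getSource("filesystem-network");
        Result result = ResultPool.getInstance().getResult("solr-core-filesystem");

        // Joost's TrAX factory for an STX transformation
        // (use the default JAXP factory for an XSLT transformation).
        TransformerFactory factory = new net.sf.joost.trax.TransformerFactoryImpl();
        Transformer transformer =
            factory.newTransformer(new StreamSource("filesystem-network.stx"));
        transformer.transform(source, result);
    }
}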
In many scenarios, especially when transforming large amounts of data, it makes sense not to transform data that is known to be unchanged compared to the data that is already in the data store or index the transformation is writing to. This can avoid a lot of unnecessary network traffic, disk I/O and processing time. Infofuze executes a transformation in one of four modes:
NORMAL: all data is transformed from the source to the result unconditionally and without any extra operations. This is the default mode.
FULL: all data is transformed from the source to the result unconditionally. After the transformation, the execution date/time is stored along with the identifier of the transformation. This transformation identifier can be specified as a parameter on the command line or as an attribute of the Ant task. If no attribute is specified, the name of the target will be used as the identifier.
FULL_WITH_DELETE: all data is transformed from the source to the result unconditionally and the last transformation date/time is stored. Infofuze will try to remove all data (documents, records) from the result (data store or index) that no longer exists in the source. Not all Result classes support this mode (the SolrjResult, for instance, does).
DELTA: only data that is known to have changed compared to the data in the data store or index is transformed; no attempt is made to remove data from the data store or index that no longer exists in the source. Not all Source classes support this mode.
In some scenarios the FULL mode performs much better than the FULL_WITH_DELETE mode; in such situations a frequently scheduled FULL job can be combined with a FULL_WITH_DELETE job scheduled at a lower frequency. Furthermore, sometimes the FULL_WITH_DELETE mode is simply not feasible because of the sheer amount of data.
The mode can be specified as a parameter on the command line interface or in the attribute "mode" of a transformation task in tasks.xml.
Infofuze can be configured by a set of configuration files that are stored under the Infofuze home directory. The most important configuration files are:
infofuze-config.xml : defines the sources and the results
jobs.xml : defines jobs (which execute tasks) and triggers (which trigger jobs)
tasks.xml : defines transformations
The location of the home directory can be specified by JNDI or a Java system property.
Diagram 5: Configure a transformation
The configuration file infofuze-config.xml contains the definitions of the Source and Result objects. The format of infofuze-config.xml is defined by the XML Schema infofuze-config.xsd. The configuration file is used both by the server process and when running Infofuze from the command line. A simple configuration file looks like this:
<config>
<sources>
<localFileSystemSource name="filesystem-network">
<location>/var/files/myfiles</location>
<location>/var/files/mydocuments</location>
<directoryFilter>
<andFileFilter>
<hiddenFileFilter>false</hiddenFileFilter>
<notFileFilter>
<nameFileFilter>
<name>.svn</name>
</nameFileFilter>
</notFileFilter>
</andFileFilter>
</directoryFilter>
<compressedFileFilters>
<tarFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.tar</wildcard>
</wildcardFileFilter>
</tarFileFilter>
<zipFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.zip</wildcard>
</wildcardFileFilter>
</zipFileFilter>
</compressedFileFilters>
<binaryFileFilters>
<binaryFileFilter
extractText="true" includeBinary="false" extractMetadata="true" preParse="true">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.doc</wildcard>
<wildcard>*.pdf</wildcard>
</wildcardFileFilter>
</binaryFileFilter>
</binaryFileFilters>
<xmlFileFilters>
<xmlFileFilter includeBinary="false" preParse="true">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.xml</wildcard>
<!-- HTML which will be converted to well-formed XML: -->
<wildcard>*.htm?</wildcard>
<wildcard>*.xhtm?</wildcard>
</wildcardFileFilter>
</xmlFileFilter>
</xmlFileFilters>
<csvFileFilters>
<csvFileFilter
includeBinary="false" delimiter="," encapsulator="&quot;"
ignoreLeadingWhitespace="true" ignoreTrailingWhitespace="true"
interpretUnicodeEscapes="false" ignoreEmptyLines="true"
hasHeading="false" defaultCharsetName="UTF-8">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.csv</wildcard>
</wildcardFileFilter>
</csvFileFilter>
</csvFileFilters>
</localFileSystemSource>
<jndiJdbcSource name="database-addresses">
<jndiDataSource>jdbc/addresses</jndiDataSource>
<queryFull>select * from addresses</queryFull>
<queryDelta>
select * from addresses where lastmodified > :last_transformed_time
</queryDelta>
</jndiJdbcSource>
</sources>
<results>
<solrjResult name="solr-core-filesystem">
<uri>http://localhost:8080/solr/filesystem</uri>
<uniqueKeyFieldName>id</uniqueKeyFieldName>
<sourceFieldName>source</sourceFieldName>
<defaultFieldName>binary</defaultFieldName>
<commit>true</commit>
<optimize>true</optimize>
<waitFlush>true</waitFlush>
<waitSearcher>true</waitSearcher>
</solrjResult>
<fileResult name="result-debug">
<path>/var/files/debug/result-debug.xml</path>
<append>false</append>
</fileResult>
</results>
</config>
For the filesystem based sources there are file filters available that filter on age, size, hidden status, mimetype, prefix, suffix, wildcard, regular expression and so on. These file filters can be combined within AND and OR filters to construct complex and powerful filters.
The configuration file jobs.xml contains the definitions of jobs (what task to execute) and triggers (when to execute). The format of jobs.xml is that of the Terracotta Quartz scheduler and defined by the XML Schema job_scheduling_data_1_8.xsd. A simple configuration file looks like this:
<job-scheduling-data
xmlns="http://www.quartz-scheduler.org/xml/JobSchedulingData"
version="1.8">
<schedule>
<job>
<name>job-index-solr</name>
<job-class>com.armatiek.infofuze.job.AntJob</job-class>
<job-data-map>
<entry>
<key>target</key>
<value>index-solr</value>
</entry>
</job-data-map>
</job>
<trigger>
<cron>
<name>trigger-index-solr</name>
<job-name>job-index-solr</job-name>
<!-- Fire at 10:15am from Monday to Friday: -->
<cron-expression>0 15 10 ? * MON-FRI</cron-expression>
</cron>
</trigger>
</schedule>
</job-scheduling-data>
The configuration file is not used when running Infofuze from the command line; it is only used within the server process.
The configuration file tasks.xml contains the definitions of the transformation tasks. This file is a standard Apache Ant build file in which a custom task transform can be used. A simple configuration file looks like this:
<project name="InfofuzeTasks" basedir="." default="index-solr">
<target name="index-solr" description="Creates Solr index of network drive">
<transform
source="filesystem-network"
result="solr-core-filesystem"
transformation="../stx/filesystem-network.stx"
mode="delta"/>
</target>
</project>
The source "filesystem-network" and result "solr-core-filesystem" are defined in infofuze-config.xml. These transformation jobs can be executed like any other Ant project directly from the command line (using Ant itself), but can also be scheduled and executed in the context of the Infofuze server process. The configuration file is not used when running Infofuze from the command line; it is only used within the server process.
The filesystem based source classes include:
LocalFileSystemSource
CIFSSource
WebDAVSource
WebCrawlSource
HTTPSource
FTPSource
The task of these sources is to read a (filtered) subset of the files on a filesystem and provide as much structured information as possible to the XML stream. During the traversal of the filesystem, files and directories can be filtered on age, size, hidden status and name (using wildcards, suffixes, prefixes and regular expressions). Filters can be combined with AND and OR operators to create complex filters. Optionally, files that have not changed since the last traversal can be skipped as well (see section 1.7).
All filesystem based sources can provide for every file (regardless of its type) at least the following XML structure:
<file parent="" name="" last-modified="" length="" content-type="" hidden="">
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
</file>
parent: the full path of the parent directory
name: the name of the file including the extension
last-modified: the last modification time in ISO 8601 format
length: the file size in bytes
content-type: mime type of the file
hidden: true/false
For specific file types and source classes more data is provided, as described in the following sections.
XML structure
All sources can provide at least the following XML structure to the stream for XML files:
<file parent="" name="" last-modified="" length="" content-type="" hidden="">
<xml>
<!-- The full XML without prolog in the character encoding of stream -->
</xml>
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
</file>
The XML enclosed by the xml tag does not have to be equal to the binary data within the bin tag. The character encoding of the XML within the xml tag will always be converted to that of the stream and this XML will not contain a prolog (any xml declaration, DOCTYPE statement or other constructs before the start element). For XML that relies on named ENTITY declarations in a Document Type Definition (DTD) a DOCTYPE statement can be configured for the entire stream.
HTML files (both well-formed and non-well-formed) can also be defined as XML files. Infofuze will convert any HTML to well-formed XML and feed this to the stream. This enables the following functionality:
Only specific text can be extracted from the HTML. For instance, only text can be extracted that is part of a particular div element with a specific id or class attribute, or that occurs between two special comment constructs.
All hyperlinks and/or image links can be extracted.
Configuration
Below is an example of how to filter local XML files that are smaller than 5 MB and stored in the directory /var/files/myfiles. HTML files will also be included and converted to well-formed XML. The binary file is not written to the stream, and the source XML is checked for well-formedness before it is offered to the stream.
<config>
<sources>
<localFileSystemSource name="my-local-source">
<location>/var/files/myfiles</location>
<!-- ... -->
<xmlFileFilters>
<xmlFileFilter includeBinary="false" preParse="true">
<andFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.xml</wildcard>
<!-- HTML which will be converted to well-formed XML: -->
<wildcard>*.htm?</wildcard>
<wildcard>*.xhtm?</wildcard>
</wildcardFileFilter>
<!-- must be smaller than 5Mb: -->
<sizeFileFilter acceptLarger="false">5242880</sizeFileFilter>
</andFileFilter>
</xmlFileFilter>
</xmlFileFilters>
<!-- ... -->
</localFileSystemSource>
</sources>
<!-- ... -->
</config>
XML structure
Through the use of Apache Tika, all sources can provide at least the following XML structure to the stream for specific binary files:
<file parent="" name="" last-modified="" length="" content-type="" hidden="">
<metadata>
<meta name="" value=""/>
[<meta> ... </meta>[<meta> ... </meta>]]
</metadata> <!-- optional -->
<text><!-- XHTML representation of the contents --></text> <!-- optional -->
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
</file>
The most important binary file types that are supported are (X)HTML, Microsoft Office (Word, Excel, PowerPoint and Visio; both OLE2 and OOXML), OpenDocument/OpenOffice, PDF, EPUB, RTF, plain text, email (mbox) and several audio, video and image formats (see http://tika.apache.org/0.9/formats.html).
Configuration
Below is an example of how to filter local MS Word and PDF files whose last modification time is after January 13th 2001, 8 o'clock PM, and that are stored in the directory /var/files/myfiles. The binary file is not written to the stream, but the metadata is. The XHTML representation of the content is also checked for well-formedness before it is offered to the stream.
<config>
<sources>
<localFileSystemSource name="my-local-source">
<location>/var/files/myfiles</location>
<location>/var/files/mydocuments</location>
<!-- ... -->
<binaryFileFilters>
<binaryFileFilter
extractText="true" extractMetadata="true" includeBinary="false" preParse="true">
<andFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.doc?</wildcard>
<wildcard>*.pdf</wildcard>
</wildcardFileFilter>
<!-- must be newer than specified datetime: -->
<ageFileFilter acceptOlder="false">2001-01-13T20:00:00</ageFileFilter>
</andFileFilter>
</binaryFileFilter>
</binaryFileFilters>
<!-- ... -->
</localFileSystemSource>
</sources>
<!-- ... -->
</config>
XML structure
For JavaScript Object Notation (JSON) files, Infofuze will convert all data within the JSON file to an XML stream. This XML stream has a structure that is a literal XML representation of the JSON data. All filesystem based sources can provide at least the following XML structure to the stream for JSON files:
<file parent="" name="" last-modified="" length="" content-type="" hidden="">
<json>
<!-- XML representation of JSON input stream -->
</json>
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
</file>
Configuration
Below is an example of how to filter JSON data that is streamed from an HTTP source.
<config>
<sources>
<httpSource name="my-http-source">
<location>http://myserver/myjsonstream</location>
<jsonFileFilters>
<jsonFileFilter>
<trueFileFilter/>
</jsonFileFilter>
</jsonFileFilters>
</httpSource>
</sources>
<!-- ... -->
</config>
XML structure
For Comma Separated Values (CSV) files, Infofuze will convert all data within the CSV file to an XML stream. This XML stream has the same structure as that of the JDBC sources. All filesystem based sources can provide at least the following XML structure to the stream for CSV files:
<file parent="" name="" last-modified="" length="" content-type="" hidden="">
<csv>
<resultset>
<row>
<col name=""></col> <!-- value of cell of row 1, col 1 -->
[<col> ... </col>[<col> ... </col>]]
</row>
[<row> ... </row>[<row> ... </row>]]
</resultset>
</csv>
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
</file>
Because CSV is not a well defined standard, several options can be configured for the interpretation of the CSV file, such as the delimiter, encapsulator and comment characters, whether the first line of the CSV file must be treated as a heading, and so on.
Configuration
Below is an example of how to filter local CSV files that are stored in the directory /var/files/myfiles.
<config>
<sources>
<localFileSystemSource name="my-local-source">
<location>/var/files/myfiles</location>
<!-- ... -->
<csvFileFilters>
<csvFileFilter
includeBinary="false"
delimiter=","
encapsulator="&quot;"
ignoreLeadingWhitespace="true"
ignoreTrailingWhitespace="true"
interpretUnicodeEscapes="false"
ignoreEmptyLines="true"
hasHeading="false"
defaultCharsetName="UTF-8">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.csv</wildcard>
</wildcardFileFilter>
</csvFileFilter>
</csvFileFilters>
<!-- ... -->
</localFileSystemSource>
</sources>
<!-- ... -->
</config>
ZIP, TAR, GZIP and BZIP2 files are treated by Infofuze as directories and are traversed transparently. This also applies to compressed files that are nested within other compressed files. The filtering of files within compressed files works the same as in normal directories. The XML that is provided to the stream is the same as that of uncompressed files.
Configuration
Below is an example of how to filter local zip and tar files that are stored in the directory /var/files/myfiles.
<config>
<sources>
<localFileSystemSource name="my-local-source">
<location>/var/files/myfiles</location>
<!-- ... -->
<compressedFileFilters>
<tarFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.tar</wildcard>
</wildcardFileFilter>
</tarFileFilter>
<zipFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.zip</wildcard>
</wildcardFileFilter>
</zipFileFilter>
</compressedFileFilters>
<!-- ... -->
</localFileSystemSource>
</sources>
<!-- ... -->
</config>
XML structure
Unparseable files are files whose content cannot or should not be parsed or interpreted. Only the minimal structure that is described in section 2.1 will be included in the stream.
Configuration
Below is an example of how to filter unparseable files with the extensions .dat and .bin that are stored in the directory /var/files/myfiles. The binary file content will be included in the XML Stream.
<config>
<sources>
<localFileSystemSource name="my-local-source">
<location>/var/files/myfiles</location>
<!-- ... -->
<unparseableFileFilter includeBinary="true">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.dat</wildcard>
<wildcard>*.bin</wildcard>
</wildcardFileFilter>
</unparseableFileFilter>
<!-- ... -->
</localFileSystemSource>
</sources>
<!-- ... -->
</config>
The LocalFileSystemSource can be used to stream local files that can be accessed by Java's java.io.File class.
XML structure
The LocalFileSystemSource provides the minimal XML structure that is described in section 2.1.
Configuration
Examples of the configuration of the LocalFileSystemSource are included in the section 2.1.1.
XML structure
The CIFSSource can be used to stream files that can be accessed by the CIFS (Common Internet File System) or SMB (Server Message Block) protocol. This includes files on a Microsoft Windows network and on Unix based systems running Samba. In addition to the minimal XML structure, this source provides extra information about the file's creation time, UNC path, security descriptor and share permissions. From the security descriptor and the share permissions, the read permissions (the SIDs of the accounts that are allowed to read the file) are derived. The CIFSSource can provide for every file (regardless of its type) at least the following XML structure to the stream:
<file parent="" name="" last-modified="" length="" content-type=""
hidden="" created="">
<security>
<ace is-allow="" is-inherited="" access-mask="" flags="">
<sid account-name="" domain-name="" rid="" type="" numeric="" display=""/>
</ace>
[<ace> ... </ace>[<ace> ... </ace>]]
</security>
<share-security>
<ace is-allow="" is-inherited="" access-mask="" flags="">
<sid account-name="" domain-name="" rid="" type="" numeric="" display=""/>
</ace>
[<ace> ... </ace>[<ace> ... </ace>]]
</share-security>
<read-permissions>
<sid account-name="" domain-name="" rid="" type="" numeric="" display=""/>
[<sid> ... </sid>[<sid> ... </sid>]]
</read-permissions>
<bin><![CDATA[base64 encoded binary file]]></bin> <!-- optional -->
<!-- other data depending on type of file -->
</file>
created: the creation time in ISO 8601 format.
security: contains the Access Control Entries (ACE) representing the security descriptor associated with this file or directory.
share-security: contains the Access Control Entries (ACE) representing the share permissions on the share exporting this file or directory.
sid: the Windows SID (a numeric identifier - Security Identifier - used to represent Windows accounts).
account-name: the sAMAccountName of this SID unless it could not be resolved, in which case the element contains the numeric RID.
domain-name: the domain name of this SID unless it could not be resolved, in which case the element contains the numeric representation.
type: the type of this SID indicating the state or type of account (for instance user (1), domain (3)).
numeric: the numeric representation of the SID such as S-1-5-21-1496946806-2192648263-3843101252-1029.
display: the string representing this SID ideal for display to users.
The CIFSSource fully supports XML, CSV, binary and compressed files.
Configuration
Below is an example of the configuration of a CIFSSource. The configuration of CIFSSource supports the following additional properties:
extractSecurity: extract the security descriptors associated with the files.
extractShareSecurity: include the share permissions on the share exporting the files.
ldapSourceRef: a reference to an existing LDAPSource in the configuration that is used to retrieve the group membership of users and groups.
ldapResultXslPath: a reference to an xsl stylesheet that is used to determine the group membership of users and groups using the LDAPSource defined by ldapSourceRef. (see the example windows2008.xsl).
<config>
<sources>
<cifsSource name="my-cifs-source" extractSecurity="true" extractShareSecurity="true" ldapSourceRef="..." ldapResultXslPath="...">
<location>smb://user:password@myserver.nl/myshare/files/</location>
<!-- ... -->
<binaryFileFilters>
<binaryFileFilter extractText="true" extractMetadata="true">
<wildcardFileFilter caseSensitive="false">
<wildcard>*.doc?</wildcard>
<wildcard>*.pdf</wildcard>
</wildcardFileFilter>
</binaryFileFilter>
</binaryFileFilters>
<!-- ... -->
</cifsSource>
</sources>
<!-- ... -->
</config>
XML structure
The WebDAVSource can be used to stream files that are accessible by the WebDAV protocol. This protocol is for instance implemented by the Apache HTTP server module mod_dav, Microsoft Internet Information Server 5 (and higher), Microsoft Exchange server and others. The source provides the minimal XML structure that is described in section 2.1, supports HTTP, HTTPS, proxy access, authentication and non standard ports, and fully supports XML, CSV, binary and compressed files. The WebDAVSource uses the WebDAV client library of the Apache Jackrabbit project.
Configuration
Below is an example of a configuration of a WebDAVSource:
<config>
<sources>
<webDAVSource
name="my-webdav-source"
username="john"
password="secret"
timeout="5000">
<location>http://localhost/webdav/</location>
<directoryFilter>
<hiddenFileFilter>false</hiddenFileFilter>
</directoryFilter>
<xmlFileFilters>
<xmlFileFilter includeBinary="false">
<andFileFilter>
<wildcardFileFilter caseSensitive="false">
<wildcard>*.xml</wildcard>
</wildcardFileFilter>
<!-- must be smaller than 5Mb: -->
<sizeFileFilter acceptLarger="false">5242880</sizeFileFilter>
</andFileFilter>
</xmlFileFilter>
</xmlFileFilters>
</webDAVSource>
</sources>
<!-- ... -->
</config>
XML structure
The WebCrawlSource can be used to crawl or spider through all or a subset of the pages and files of websites, given one or more so-called seed URLs. The source provides the minimal XML structure that is described in section 2.1, but adds an extra attribute uri to the file element. It supports HTTP, HTTPS, proxy access, authentication and non standard ports, and fully supports XML, CSV, binary and compressed files. The crawler has a number of other options that can be configured:
Wait time (in ms) between requests (to avoid overloading webservers).
Time out of the HTTP requests
Whether or not to follow links to images, scripts and css stylesheets
Maximum depth to crawl
The crawler supports most of the redirect status codes. It performs extensive URL normalization to avoid multiple transformations of the same page (but with different URLs). It does not yet support the Robots Exclusion Standard (robots.txt) but does support the robots meta tag (specifically the values NOINDEX and NOFOLLOW). The crawler does not follow external links (links to other hosts than the seed URL) and does not follow links that are higher in the URL hierarchy than the seed URL.
Configuration
Below is the configuration of a WebCrawlSource:
<config>
<sources>
<webCrawlSource
name="my-webcrawl-source"
timeout="5000"
maxDepth="5"
wait="200"
userAgent="Mozilla/5.0 (compatible; Infofuze/1.0; Windows NT 5.1)"
followImages="true"
followScripts="false"
followLinks="false">
<location>http://www.armatiek.nl</location>
<binaryFileFilters>
<binaryFileFilter extractText="true" extractMetadata="true" preParse="true">
<mimeTypeFileFilter>
<mimeType>application/msword</mimeType>
<mimeType>application/pdf</mimeType>
</mimeTypeFileFilter>
</binaryFileFilter>
</binaryFileFilters>
</webCrawlSource>
</sources>
<!-- ... -->
</config>
XML structure
The FTPSource can be used to stream files that are accessible by the FTP protocol. The FTPSource provides the minimal XML structure that is described in section 2.1, supports FTP, FTPS, proxy access, authentication and non standard ports, and fully supports XML, CSV, binary and compressed files.
Configuration
{To be written}
XML structure
The JDBC based sources can be used to stream data from a relational database using a JDBC or ODBC driver (via Java's JDBC-ODBC bridge). Infofuze provides plain JDBC connections, but also supports connection pooling for drivers that implement ConnectionPoolDataSource and PooledConnection, or the connection pooling of a Java application server via JNDI (like Tomcat or JBoss). JDBC sources can be read with parameters, in which case the source is read from within an STX or XSLT transformation and a parameterized query is executed to obtain additional data.
The XML that is provided to the stream has the following structure:
<resultset>
<row>
<col name=""></col> <!-- value of cell of row 1, col 1 -->
[<col> ... </col>[<col> ... </col>]]
</row>
[<row> ... </row>[<row> ... </row>]]
</resultset>
The DirectJDBCSource can be used to connect to a database using a direct unpooled JDBC connection providing the following configuration properties:
The JDBC driver class name
The JDBC connection string
The username to connect to the database (optional)
The password to connect to the database (optional)
The SQL query that is used to query the data to transform
The SQL query that is used to query the data that has changed since a specific date/time (optional)
Configuration
Below is an example of a DirectJDBCSource:
<config>
<sources>
<directJdbcSource name="my-directjdbc">
<driver>com.mysql.jdbc.Driver</driver>
<url>jdbc:mysql://localhost:3306/database</url>
<username>user</username>
<password>password</password>
<queryFull>select * from addresses</queryFull>
</directJdbcSource>
</sources>
<!-- ... -->
</config>
This source class is particularly useful in scenarios where the transformation is executed outside the context of a Java application server, the source is the primary source of the transformation, and only one connection is used during the entire transformation.
The PooledJDBCSource can be used to connect to a database using a connection pool. This source can only be used for JDBC drivers that provide implementations of Java's ConnectionPoolDataSource and PooledConnection (like the ones for Oracle, Microsoft SQL Server and MySQL). The following configuration properties must be provided:
A piece of Javascript code that creates and initializes the datasource
The SQL query that is used to query the data to transform
The SQL query that is used to query the data that has changed since a specific date/time (optional)
Configuration
Below is an example of a PooledJDBCSource:
<config>
<sources>
<pooledJdbcSource name="my-pooledjdbc">
<connectionpoolDataSource>
var dataSource = new org.h2.jdbcx.JdbcDataSource();
dataSource.setURL("jdbc:h2:/database");
dataSource.setUser("sa");
dataSource.setPassword("");
</connectionpoolDataSource>
<queryFull>select * from addresses</queryFull>
</pooledJdbcSource>
</sources>
<!-- ... -->
</config>
This source class is particularly useful in scenarios where the transformation is executed outside the context of a Java application server, the source is used as a secondary source from within the transformation, and multiple connections are necessary during the transformation.
The JNDIJDBCSource can be used to connect to a database using a datasource that is configured in a Java application server with JNDI. The following configuration properties must be provided:
The name of the JNDI datasource that is configured in the Java application server
The SQL query that is used to query the data to transform
The SQL query that is used to query the data that has changed since a specific date/time (optional)
Configuration
Below is an example of a JNDIJDBCSource:
<config>
<sources>
<jndiJdbcSource name="my-jndijdbc">
<jndiDataSource>jdbc/mydatabase</jndiDataSource>
<queryFull>select * from addresses</queryFull>
<queryDelta>
select * from addresses where lastmodified > :last_transformed_time
</queryDelta>
</jndiJdbcSource>
</sources>
<!-- ... -->
</config>
This source class is particularly useful in scenarios where the transformation is executed within the context of a Java application server, the source is used as a secondary source from within the transformation, and multiple connections are necessary during the transformation.
The SolrSource can be used to connect to an Apache Solr core and get the XML stream that is the result of executing a Solr query. The following configuration properties can be provided:
The uri of the Solr core (required).
The name of the unique field as defined in the Solr schema (default “id”).
The name of the field that contains the complete source XML (default “source”).
The name of the default field as defined in the Solr schema (default “binary”).
The name of the Solr request (default “select”).
The user name (optional, in case the requests must be authenticated)
The password (optional, in case the requests must be authenticated).
The proxy host (optional).
The proxy port (optional).
A timeout of the request to Solr (default -1, meaning no timeout).
Configuration
Below is an example of a SolrSource:
<config>
<sources>
<solrSource name="my-solr-source">
<uri>http://localhost:8080/solr/mycore</uri>
<uniqueKeyFieldName>id</uniqueKeyFieldName>
<sourceFieldName>xml</sourceFieldName>
<defaultFieldName>text</defaultFieldName>
</solrSource>
</sources>
<!-- ... -->
</config>
The MongoDBSource can be used to connect to a MongoDB database and get an XML stream representation of the JSON that is the result of a query on the MongoDB database. The following configuration properties can be provided:
The uri of the MongoDB database (required).
The database name (required).
The collection name (required).
The user name (optional, in case authentication is configured).
The password (optional, in case authentication is configured).
The limit (maximum number of results) (optional, default is all).
The sort method (optional, default is no sorting).
The query to execute (optional; if left out, it must be specified when the MongoDB source is read from within an XSLT or STX transformation).
Configuration
Below is an example of a MongoDBSource:
<config>
<sources>
<mongoDBSource name="my-mongodb-source">
<uri>mongodb://localhost</uri>
<uniqueKeyFieldName>myid</uniqueKeyFieldName>
<database>mydatabase</database>
<collection>mycollection</collection>
<queryFull>{ }</queryFull>
<limit>1</limit>
<sort>{ creationdatetime : 1 }</sort>
</mongoDBSource>
</sources>
<!-- ... -->
</config>
The HTTPSource can be used to stream the result of an HTTP POST or GET request and is particularly suited to stream the result of a SOAP request to a webservice. In this scenario the source is not used as the primary source of a transformation, but is read during a transformation using XPath's document() function (XSLT) or the stx:process-document instruction (STX).
Configuration
{To be written}
XML structure
The LDAPSource can be used to stream the result of a search operation on a directory service using the Lightweight Directory Access Protocol (LDAP). Examples of directory services that can be searched using LDAP are the Microsoft Active Directory, Novell eDirectory, Apache Directory Service (ApacheDS), Oracle Internet Directory (OID), and slapd, part of OpenLDAP.
The result of a search operation is written to the stream in the Directory Services Markup Language (DSML) format. Below is an example of such a result:
<batchResponse xmlns="urn:oasis:names:tc:DSML:2:0:core">
<searchResponse>
<searchResultEntry dn="CN=jdoe,CN=Users,DC=corp,DC=armatiek,DC=com">
<attr name="memberOf">
<value>Administrators</value>
<value>Developers</value>
</attr>
<attr name="sAMAccountName">
<value>jdoe</value>
</attr>
</searchResultEntry>
<searchResultDone>
<resultCode code="0"/>
</searchResultDone>
</searchResponse>
</batchResponse>
Configuration
Below is an example of the configuration of a LDAPSource:
<ldapSource name="source-ldap-group-membership-jdoe">
<host>localhost</host>
<bindDN>CN=Administrator,CN=Users,DC=corp,DC=armatiek,DC=com</bindDN>
<bindPassword>mypassword</bindPassword>
<baseDN>DC=corp,DC=armatiek,DC=com</baseDN>
<scope>sub</scope>
<filter>(&amp;(objectCategory=user)(sAMAccountName=jdoe))</filter>
<attributes>
<attribute>memberOf</attribute>
<attribute>sAMAccountName</attribute>
</attributes>
</ldapSource>
When an LDAPSource is read using XPath's document() function (XSLT) or the stx:process-document instruction (STX), the properties baseDN, scope, filter and attributes can be specified as parameters in the query string and take precedence over any of these properties that are defined in the configuration.
The NullSource provides an empty XML stream. This source is particularly useful in a scenario where all data is read using XPath's document() function (XSLT) or the stx:process-document instruction (STX) and the primary source of the transformation has no purpose.
Configuration
Below is an example of the configuration of a NullSource:
<config>
<sources>
<nullSource name="my-null-source"/>
</sources>
<!-- ... -->
</config>
Result classes are classes that implement the interface javax.xml.transform.Result. Their task is to write the resulting stream of the transformation to a particular datastore or index. The following result classes are provided by Infofuze:
The FileResult can be used to write the result of a transformation to a local file. Below is an example of the configuration of a FileResult in infofuze-config.xml:
<config>
<!-- ... -->
<results>
<fileResult name="my-file-result">
<path>/tmp/debug-result.xml</path>
<append>false</append>
</fileResult>
</results>
</config>
The SolrjResult can be used to write XML data that complies with the Apache Solr update format to a Solr instance using the Solrj client library. The Solr update format has the following form:
<add>
<doc>
<field name="employeeId">05991</field>
<field name="office">Bridgewater</field>
<field name="skills">Perl</field>
<field name="skills">Java</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
Below is an example of the configuration of a SolrjResult in infofuze-config.xml:
<config>
<!-- ... -->
<results>
<solrjResult name="my-solrj-result">
<uri>http://localhost:8080/solr/demo</uri>
<uniqueKeyFieldName>id</uniqueKeyFieldName>
<sourceFieldName>source</sourceFieldName>
<defaultFieldName>binary</defaultFieldName>
<commit>true</commit>
<optimize>true</optimize>
<waitFlush>true</waitFlush>
<waitSearcher>true</waitSearcher>
</solrjResult>
</results>
</config>
uri: the uri of the Solr Server (including any specific core)
uniqueKeyFieldName: the name of the field in the index which should be unique for all documents.
sourceFieldName: the name of the field in which the name of the source of the transformation is stored.
defaultFieldName: currently not used.
commit: whether to commit the changes after the transformation is finished
optimize: whether to optimize the index (merge all segments into one) after the transformation is finished.
waitFlush: whether to wait for the indexed data to flush to disk after the transformation is finished.
waitSearcher: whether to wait for a new Solr searcher to be ready to respond to changes after the transformation is finished.
The MongoDBResult can be used to write XML data that complies with a specific format to a MongoDB database using the MongoDB Java driver. This XML data format must have the following form:
<add>
<doc>
<field name="employeeid" type="xsd:string">05991</field>
<field name="name" type="xsd:string">John Doe</field>
<field name="characteristics">
<subdoc>
<field name="age" type="xsd:integer">25</field>
<field name="skills" type="xsd:string">
<array>
<value>Perl</value>
<value>Java</value>
</array>
</field>
</subdoc>
</field>
</doc>
[<doc> ... </doc>[<doc> ... </doc>]]
</add>
Below is an example of the configuration of a MongoDBResult in infofuze-config.xml:
<config>
<!-- ... -->
<results>
<mongoDBResult name="my-mongodb-result">
<uri>mongodb://localhost</uri>
<uniqueKeyFieldName>employeeid</uniqueKeyFieldName>
<database>mydatabase</database>
<collection>mycollection</collection>
</mongoDBResult>
</results>
</config>
uri: the uri of the MongoDB database
uniqueKeyFieldName: the name of the field in the index which should be unique for all documents (optional). If this field is specified, Infofuze will always perform an upsert instead of an insert.
database: the name of the MongoDB database (required).
collection: the name of the collection within the database (required).
username: the username to use when authentication is required (optional).
password: the password to use when authentication is required (optional).
XML structure
The JDBC based results can be used to insert or update records in a relational database using a JDBC or ODBC driver (via Java's JDBC-ODBC bridge). Infofuze provides plain JDBC connections, but also supports connection pooling for drivers that implement ConnectionPoolDataSource and PooledConnection, or the connection pooling of the Java Application Server via JNDI (like Tomcat or JBoss).
The result of the transformation must have the following structure:
<resultset>
<row>
<col name=""></col> <!-- value of cell of row 1, col 1 -->
[<col> ... </col>[<col> ... </col>]]
</row>
[<row> ... </row>[<row> ... </row>]]
</resultset>
The DirectJDBCResult can be used to connect to a database using a direct unpooled JDBC connection providing the following configuration properties:
The JDBC driver class name
The JDBC connection string
The username to connect to the database (optional)
The password to connect to the database (optional)
The SQL insert statement that is used to insert new records
The SQL update statement that is used to update existing records (optional)
The parameters that are bound to the insert and update statements
Configuration
Below is an example of a DirectJDBCResult:
<config>
<results>
<directJdbcResult name="my-directjdbc">
<driver>com.mysql.jdbc.Driver</driver>
<url>jdbc:mysql://localhost:3306/database</url>
<username>user</username>
<password>password</password>
<updateQuery>
UPDATE phone_book SET number = :number WHERE name = :name
</updateQuery>
<insertQuery>
INSERT INTO phone_book (name, number) VALUES (:name, :number)
</insertQuery>
<parameters>
<parameter name="name" type="string"/>
<parameter name="number" type="string"/>
</parameters>
</directJdbcResult>
</results>
<!-- ... -->
</config>
This result class is particularly useful in scenarios where the transformation is executed outside the context of a Java application server.
The PooledJDBCResult can be used to connect to a database using a connection pool. This result class can only be used for JDBC drivers that provide implementations of Java's ConnectionPoolDataSource and PooledConnection (like the ones for Oracle, Microsoft SQL Server and MySQL). The following configuration properties must be provided:
A piece of Javascript code that creates and initializes the datasource
The SQL insert statement that is used to insert new records
The SQL update statement that is used to update existing records (optional)
The parameters that are bound to the insert and update statements
Configuration
Below is an example of a PooledJDBCResult:
<config>
<results>
<pooledJdbcResult name="my-pooledjdbc">
<connectionpoolDataSource>
var dataSource = new org.h2.jdbcx.JdbcDataSource();
dataSource.setURL("jdbc:h2:/database");
dataSource.setUser("sa");
dataSource.setPassword("");
</connectionpoolDataSource>
<updateQuery>
UPDATE phone_book SET number = :number WHERE name = :name
</updateQuery>
<insertQuery>
INSERT INTO phone_book (name, number) VALUES (:name, :number)
</insertQuery>
<parameters>
<parameter name="name" type="string"/>
<parameter name="number" type="string"/>
</parameters>
</pooledJdbcResult>
</results>
<!-- ... -->
</config>
This result class is particularly useful in scenarios where the transformation is executed outside the context of a Java application server.
The JNDIJDBCResult can be used to connect to a database using a datasource that is configured in a Java application server with JNDI. The following configuration properties must be provided:
The name of the JNDI datasource that is configured in the Java application server
The SQL insert statement that is used to insert new records
The SQL update statement that is used to update existing records (optional)
The parameters that are bound to the insert and update statements
Configuration
Below is an example of a JNDIJDBCResult:
<config>
<results>
<jndiJdbcResult name="my-jndijdbc">
<jndiDataSource>jdbc/mydatabase</jndiDataSource>
<updateQuery>
UPDATE phone_book SET number = :number WHERE name = :name
</updateQuery>
<insertQuery>
INSERT INTO phone_book (name, number) VALUES (:name, :number)
</insertQuery>
<parameters>
<parameter name="name" type="string"/>
<parameter name="number" type="string"/>
</parameters>
</jndiJdbcResult>
</results>
<!-- ... -->
</config>
This result class is particularly useful in scenarios where the transformation is executed within the context of a Java application server.
The JDBCResult classes can be used to write XML data to a table in a relational database using a JDBC or ODBC driver (via Java's JDBC-ODBC bridge). The same connection pooling mechanisms can be used as with the JDBC source classes. The transformed XML data must have the following form:
<resultset>
<metadata>
<col name="" type=""/>
[<col name="" type=""/>[<col name="" type=""/>]]
</metadata>
<rowset>
<row>
<col name=""></col> <!-- value of cell of row 1, col 1 -->
[<col name=""></col>[<col name=""></col>]]
</row>
[<row> ... </row>[<row> ... </row>]]
</rowset>
</resultset>
Configuration
{To be written}
Output that is written to a NullResult has no destination and is the equivalent of the famous /dev/null. All bytes or characters are ignored and lost in cyberspace. This result is particularly useful in a scenario where all data is written using the xsl:result-document or stx:result-document instructions and the primary result has no purpose.
Below is an example of the configuration of a NullResult in infofuze-config.xml:
<config>
<!-- ... -->
<results>
<nullResult name="my-null-result"/>
</results>
</config>
The following excellent 3rd party open source libraries are used by Infofuze:
Apache Ant: build tool
Apache Commons libraries:
CLI: command line argument parser
Codec: library of common encoders and decoders
Collections: library of container classes like maps, lists and sets.
Compress: library for working with ar, cpio, tar, zip, gzip and bzip2 streams
CSV: library for working with Comma Separated Values streams
Discovery: library for discovering, or finding, implementations for pluggable interfaces
Exec: library for executing external processes from Java
HttpClient: HTTP client library
IO: library of utilities to assist with developing IO functionality
Lang: extra methods for Java core classes
Logging: logging framework
Apache Lucene: full text search engine library
Apache Log4J: logging framework
Apache Maven: build tool
Apache Solr and Solrj: enterprise search platform
Apache Tika: toolkit for detecting and extracting metadata and structured text from files
Apache Xalan: XSLT processor (XSLT version 1.0)
Apache Xerces: XML parser
iHarder.net Base64: Base64 encoding and decoding library
JCIFS: CIFS/SMB networking protocol client
Joost: STX processor
MiniPoolConnectionManager: JDBC connection pool
Quartz Scheduler: job scheduling service
Saxon: XQuery and XSLT processor (XSLT version 2.0)
Slf4J: logging framework
UnboundID: LDAP client
URLRewrite: URL rewriting library
Waffle: Windows Authentication Framework
After checking out the source code from the Subversion repository at:
https://infofuze.svn.sourceforge.net/svnroot/infofuze
Infofuze can be built using Maven. Before Infofuze can be built, three libraries have to be installed in your local Maven repository, because the versions in the central Maven repository are way too old or do not exist. These three libraries are Saxon (HE), JCIFS and Waffle.
Saxon HE can be installed using the following two commands:
mvn install:install-file -DgroupId=net.sf.saxon -DartifactId=saxonhe -Dversion=9.3.0_2j -Dpackaging=pom -Dfile=saxonhe-9.3.0_2j.pom
mvn install:install-file -DgroupId=net.sf.saxon -DartifactId=saxonhe -Dversion=9.3.0_2j -Dpackaging=jar -Dfile=saxonhe-9.3.0_2j.jar
JCIFS can be installed using the following two commands:
mvn install:install-file -DgroupId=org.samba.jcifs -DartifactId=jcifs -Dversion=1.3.15 -Dpackaging=pom -Dfile=jcifs-1.3.15.pom
mvn install:install-file -DgroupId=org.samba.jcifs -DartifactId=jcifs -Dversion=1.3.15 -Dpackaging=jar -Dfile=jcifs-1.3.15.jar
Waffle can be installed using the following commands:
mvn install:install-file -DgroupId=net.java.dev.jna -DartifactId=jna -Dversion=3.2.7 -Dpackaging=pom -Dfile=jna-3.2.7.pom
mvn install:install-file -DgroupId=net.java.dev.jna -DartifactId=jna -Dversion=3.2.7 -Dpackaging=jar -Dfile=jna.jar
mvn install:install-file -DgroupId=net.java.dev.jna -DartifactId=platform -Dversion=3.2.7 -Dpackaging=pom -Dfile=platform-3.2.7.pom
mvn install:install-file -DgroupId=net.java.dev.jna -DartifactId=platform -Dversion=3.2.7 -Dpackaging=jar -Dfile=platform.jar
mvn install:install-file -DgroupId=waffle-jna -DartifactId=waffle-jna -Dversion=1.3 -Dpackaging=pom -Dfile=waffle-jna-1.3.pom
mvn install:install-file -DgroupId=waffle-jna -DartifactId=waffle-jna -Dversion=1.3 -Dpackaging=jar -Dfile=waffle-jna.jar
The pom files can be found under the directory <infofuze>/trunk/local-repository and the jar files can be downloaded from http://saxon.sourceforge.net/, http://jcifs.samba.org/ and http://waffle.codeplex.com/. For newer versions, the commands and the pom files must be changed along with the version numbers of the artifacts jcifs, saxonhe and waffle-jna in the master pom.
After installing these libraries in your local Maven repository, the sources can be built by running the command:
mvn clean install
in your local subversion working directory <infofuze>/trunk.
After building Infofuze with Maven three .jar and two .war files are generated:
infofuze-core-X.X-SNAPSHOT.jar
infofuze-cli-X.X-SNAPSHOT-executable.jar
infofuze-web-core-X.X-SNAPSHOT.jar
infofuze-web-backend-X.X-SNAPSHOT.war
infofuze-web-frontend-X.X-SNAPSHOT.war
(X.X is the version number).
This jar contains the core classes of Infofuze and is used by all the other artifacts. This library is a “J2SE only” jar; there are no dependencies between this jar and any J2EE application server specific classes.
This is a runnable “uber jar” containing all 3rd party dependencies and the infofuze-core jar. The jar has the command line interface that is described in section 1.6.1. When running this jar, two system properties must be specified: infofuze.infofuze.home, pointing to the infofuze home directory containing the configuration files, and a JCIFS specific system property java.protocol.handler.pkgs with the value jcifs.
The following is a valid transformation command when run from the directory <infofuze>/trunk/infofuze-cli with Java version 1.6:
java -jar -Dinfofuze.infofuze.home=..\infofuze-home -Djava.protocol.handler.pkgs=jcifs -Xmx512m ./target/infofuze-cli-0.1-SNAPSHOT-executable.jar -s my-source -r my-result
providing the configuration file contains a definition for a source my-source and a result my-result.
This jar contains all the core classes that have dependencies with J2EE application server specific classes like the abstract servlet and web application listener classes.
This war is the web application in which the scheduler is started and the transformation jobs are executed. The war can be deployed to any Java application server or servlet container, but so far it has only been tested on Apache Tomcat version 6.0.32. The following steps are necessary to deploy the war to Tomcat 6.0.32:
Download the latest binary version of Apache Xalan, and extract the libraries resolver.jar, serializer.jar, xercesImpl.jar and xml-apis.jar from the download. Create a new directory <tomcat>/endorsed and copy the libraries to this directory. This is necessary to make sure that these versions of Xalan and Xerces are used by the application and not the versions that are part of the Java JRE.
Download the latest binary version of JCIFS, and extract the library jcifs-X.X.X.jar from the download. Create a new directory <tomcat>/server-lib and copy the library to this directory.
Download binary version 1.3 of Waffle (Windows Authentication Framework) and extract waffle-jna.jar, commons-logging-1.1.1.jar, jna.jar and platform.jar to <tomcat>/lib.
In <tomcat>/conf/catalina.properties:
specify: server.loader=${catalina.home}/server-lib/*.jar
Add to the property common.loader the classpath to the infofuze-home directory, for instance:
common.loader=${catalina.base}/../infofuze-home, ${catalina.base}/lib, ${catalina.base}/lib/*.jar, ${catalina.home}/lib,${catalina.home}/lib/*.jar
Add the following configuration XML within the Engine tag after the Host tag:
<Context path="/infofuze" docBase="infofuze"/>
<Context path="/solr" docBase="solr">
<Valve className="org.apache.catalina.valves.RemoteAddrValve" allow="127\.0\.0\.1"/>
</Context>
TODO: URIEncoding="UTF-8" in server.xml
Add a text file setenv.bat (Windows) or setenv.sh (Unix/Linux) to the directory <tomcat>/bin (make sure the file setenv.sh has the proper execute rights under Unix/Linux). The contents of the file should be:
On Windows:
set JAVA_OPTS=-server -Xmx256m -Dinfofuze.infofuze.home=..\..\infofuze-home -Djava.protocol.handler.pkgs=jcifs -Dlog4j.configuration=config/log4j.properties
On Unix/Linux:
export JAVA_OPTS="-server -Xmx256m -Dinfofuze.infofuze.home=../../infofuze-home -Djava.protocol.handler.pkgs=jcifs -Dlog4j.configuration=config/log4j.properties"
Adjust the maximum amount of memory and/or the path of the infofuze home directory to suit your situation. In this example the infofuze home directory is a sibling directory of the Tomcat base directory.
Optionally, edit the file <infofuze-home>/config/log4j.properties to change the location of the log file.
Tomcat can now be started by running the command:
catalina start
in the directory <tomcat>/bin.
This war is not yet in use.