The data structure and algorithms have been implemented in Java. The implementation has been focused on clearity and generality rather than computational speed. Although the main application is likely to be on DNA sequences, this is not assumed anywhere in the implementation.
Note that the Java executable should be executed using the Java option
java -Xmx4g -jar WordTable.jar, to set the maximal amount of memory Java is allowed to allocate: the default is usuall very low. Executing it with
java rather than
javaw, which may be the default file association of JAR files, also ensures that critical errors get reported on the console.
The Java implementation can run either on the command line or in a GUI. It can process sequence data in Fasta or Fastq formats into a kFM-index, which may then be pre-assembled or various statistics may be generated. The main computational parts of the kFM-index generation has been parallelised for increased speed.
The latest available version of the kFM-index implementation is WordIndex version 0.8 (build 4).
Compared to the version discussed in the article, the implementation has been considerably improved. The following features have been added:
These improvement all influence the construction of the kFM-index from the sequence data.
Removing the memory limitation, enables processing of arbitrarily large blocks of k-mers in the first step of kFM-index construction: when the sequence data is split into blocks, kFM-indexes produced from each block, and these kFM-indexes then merged pairwise until a final kFM-index results. By increasing the block size, limited only by the available memory, the number of merge operations is reduced.
During the initial kFM-index construction from lists of in-edges (k-mers), the in-edge lists are generated from sequence data and then sorted. The sort procedure has been replaced by a faster one, and parallelised. The operation for merging two kFM-indexes has also been parallelised.
Updated benchmarks are available from the benchmarks page.