The kFM-index is a data structure, similar to the FM-index based on the Burrows-Wheeler transform, for storing the k-subwords from a set of strings in a compact manner. The intended use is the representation of the k-mer composition of sequence reads from high throughput sequencing.
The key ingredient of the data structure is an ordered table of nodes listing initial letters of the in-coming edges to each node. For DNA, the letters are A,
C, G and T, so for each node 4 bits are required to indicate which of these form in-coming edges. The k-words of the nodes are not stored, but are
reconstructed from the data. In addition to the 4 bits per node (or
σ bits per node if the alphabet consists of
σ letters), the nodes have to be grouped to indicate which nodes differ only by the last letter: this takes one more bit per node.The index of the
in-coming nodes can be computed from the permanent data, but in order to be able to retreive the index pointing to the previous node more efficiently a
subset of them are stored.
The details of the data structure and algorithms for using it are included in the manuscript.