Compact representation of de Bruijn subgraphs of k-mers


The kFM-index requires that there is only one final node, i.e. without out-going edges, and so the graph is first completed by adding additional nodes with $ signs appended. The data table contains one row per node, with nodes in lexicographic order. For each node, bit flags indicate which in-coming edges exist. Another bit per node is used to indicate groups of nodes which differ by only the last letter. The previous node values, which correspond to the FM-index, point to the index of the in-coming nodes, and can be computed from the previous data: the in-coming edge and group end flags, but for computational speed some of these are stored in memory as starting points for the computations.

The kFM-index is a data structure, similar to the FM-index based on the Burrows-Wheeler transform, for storing the k-subwords from a set of strings in a compact manner. The intended use is the representation of the k-mer composition of sequence reads from high throughput sequencing.

The key ingredient of the data structure is an ordered table of nodes listing initial letters of the in-coming edges to each node. For DNA, the letters are A, C, G and T, so for each node 4 bits are required to indicate which of these form in-coming edges. The k-words of the nodes are not stored, but are reconstructed from the data. In addition to the 4 bits per node (or σ bits per node if the alphabet consists of σ letters), the nodes have to be grouped to indicate which nodes differ only by the last letter: this takes one more bit per node.The index of the in-coming nodes can be computed from the permanent data, but in order to be able to retreive the index pointing to the previous node more efficiently a subset of them are stored.

The details of the data structure and algorithms for using it are included in the manuscript.

Last modified December 04, 2013.