Compressed Indexes for Regular Languages - REGINDEX

Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics.

People
Nicola Prezza, Alessio Campanelli, Daniel Puttini, Carlo Tosoni, Riccardo Maso, Davide Cenzato, Davide Tonetto, Ruben Becker

Grants
European Research Council (ERC) grant agreement No. 101039208.

Description

Sorting is, arguably, the most powerful algorithmic building block when it comes to indexing data. At the same time, the regularities exposed by sorting are precisely those enabling data compression. In the last two decades, this fascinating duality has led researchers to the design of compressed full-text indexes: data structures supporting fast pattern matching queries over compressed text. In this project, we revisit the natural generalization of the problem to labeled graphs from a new perspective: we interpret graphs as finite-state automata and investigate the connections existing between their propensity to be sorted and the regular languages they recognize. This novel language-theoretic approach makes it possible to transfer fundamental results between the mature fields of regular language theory and compressed text indexing. The project aims at building this bridge by developing a new theory of compressed regular language indexing.

The project finds important applications to the rapidly-expanding field of computational pangenomics, where the goal is to study the variations contained in the genomes of an entire population. Recent research has shown that representing pan-genomes as labeled graphs is an important step to reduce reference allele bias. Existing approaches, however, can index only restricted classes of graphs, thereby limiting the practical applicability of such powerful pan-genome representations.

The project’s approach, based on sorting regular languages by partial co-lexicographic orders (see figure), changes the perspective from which the compressed indexing problem has been tackled in the literature. This project aims at developing a theory of graph indexing and compression based on the natural interplay between sorting and regular language theory.

Software

1 - Indexing

Minimum-WDFA-Constructor - Software to construct the minimum Wheeler deterministic finite automaton (WDFA) equivalent to a given minimum DFA of a Wheeler language.
NFA-index - Software to construct an index for a sorted NFA, supporting efficient count, locate, and membership queries.

2 - Sorting

finite-automata-partition-refinement - Software that provides automata-sorting functionalities based on the partition refinement algorithm.
DFA-suffix-doubling - Software that computes the co-lexicographical ordering of DFA states using a suffix-doubling algorithm generalized to automata.
Labeled-Graph-LCP - Software implementing a software for computing the Longest Common Prefix (LCP) of arbitrary labeled graphs.

3 - Generation

Wheeler-DFA-generation - Software to sample random uniform Wheeler DFAs (WDFAs) with a given number of nodes, number of edges, and alphabet size.
RegexpToAutomaton - Software to compute an automaton out of a regular expression.

4 - Other

DeterministicWidth - Software implementing an algorithm to compute the deterministic co-lexicographical width of a regular language.

RAVEN — Research on Algorithms VENice

Compressed Indexes for Regular Languages - REGINDEX

Description

Software