Posts by Collection

people

Alessio Campanelli

Alessio graduated in October 2024 in Computer Science at Ca’ Foscari University of Venice (Italy), with a thesis on Bioinformatics. He then started his PhD programme at the same university, with the REGINDEX research group. He is a competitive programming enthusiast and his main area of interest is compact data structures.

Carlo Tosoni

After graduating in 2021 with a degree in computer engineering from the University of Perugia, Carlo enrolled in the master’s degree programme in computer science of the University of Pisa, where he graduated in July 2023. His chosen curriculum, ‘Big Data Technologies’, aimed to develop the skills needed to manage large collections of data in the most efficient possible manner. Two months later, he started a PhD programme at Ca’ Foscari University in Venice, under the supervision of Professors Nicola Prezza and Ruben Becker. More information about his research can be found in the following website.

Daniel Puttini

Daniel earned his master’s degree in Computer Science from the University of Parma in March 2023, having completed a thesis in bioinformatics. Following graduation, he collaborated with the Department of Food and Drugs at the University of Parma for six months. In November 2023, he joined the REGINDEX research group at Ca’ Foscari University of Venice (Venice, Italy), where he is currently engaged as a (“pre-doc”) research fellow, eagerly anticipating the commencement of his doctoral studies.

Davide Cenzato

After graduating in Medical bioinformatics in 2019, Davide received his Ph.D. in Computer Science at the University of Verona (Verona, Italy). He then joined Ca’ Foscari University of Venice (Venice, Italy) in February 2023 as a postdoc researcher in the REGINDEX group. His main research interests are algorithms and data structures on strings and graphs; in particular, he focuses on algorithms for processing and indexing genomic string collections and pangenomes, with a special emphasis on providing efficient implementations.

Davide Cologni

Davide earned his master’s degree in Computer Science from the University of Milan in April 2025, with a thesis on graph compression. He then joined the RAVEN research group at Ca’ Foscari University of Venice as a Ph.D. student, under the supervision of Professor Giulio Ermanno Pibiri. His primary research interests are in the field of compressed data structures.

Davide Tonetto

Davide earned his master’s degree in Computer Science and Information Technology from Ca’ Foscari University of Venice in October 2025, where he completed a thesis on the compression of finite languages. He has since joined the REGINDEX research group at Ca’ Foscari University as a doctoral student, continuing his academic journey in computer science and pursuing his passion for study and research.

Giulio Ermanno Pibiri

Giulio holds a PhD degree in Computer Science from the University of Pisa (2019). He has been a Research Fellow in Computer Science at ISTI-CNR in Pisa from late 2018 to May 2022 (three times “Young Researcher Award” for the years 2019, 2020, 2021). He then joined Ca’ Foscari University in Venice in June 2022 as Assistant Professor and, as of June 2025, he is Associate Professor. His research activity focuses on the design and implementation of compressed data structures to index large datasets; currently, with applications to Bioinformatics. You can find more details on his personal website.

Nicola Prezza

Nicola received his PhD in Computer Science from the University of Udine (Italy) in 2017. He spent research periods (as a postdoc) at the universities of Pisa (Italy) and DTU (Denmark), and later has been Assistant professor at LUISS (Rome). He is now Associate professor at Ca’ Foscari university of Venice. Nicola Prezza’s research is focused on the interplay between data structures, data compression, and regular language theory. In 2018 he received the “best italian young researcher in Theoretical Computer Science” award from the italian chapter of the European Association for Theoretical Computer Science (IC-EATCS). He is PI of an ERC Starting grant (REGINDEX, grant nr. 101039208, 01/09/2022 - 31/08/2027). Personal webpage: https://nicolaprezza.github.io/

Riccardo Maso

Riccardo graduated in Computer Science from Ca’ Foscari University of Venice (Italy) in March 2025. He later joined the REGINDEX research group at the same university as a pre-doctoral research fellow, and in September 2025 he began his PhD.

Ruben Becker

Ruben holds a PhD degree from Saarland University (Germany). While being a doctoral student (2014-2018) he was a member of the Algorithms and Complexity department at Max Planck Institute for Informatics in Saarbrücken, Germany. He was then a Postdoctoral researcher in the Computer Science department of the Gran Sasso Science Institute in L’Aquila, Italy (2018-2022). He joined Ca’ Foscari University of Venice in November 2022 as an assistant professor and since December 2025 he is tenure-track assistant professor (RTT). Ruben’s research revolves around the topics algorithms, graphs, and randomness. His PhD thesis received the Dr. Eduard Martin Preis for the best dissertation within the Faculty of Computer Science and Mathematics at Saarland University in 2018. You can find more details on his personal website.

portfolio

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

projects

Compressed Indexes for k-mer Sets

We research compact and fast data structures to represent large sets of genomic k-mers (strings of length k over the DNA alphabet).

Compressed Indexes for Strings

We study compressed space algorithms and data structures to process and index large genomic textual datasets

Algorithmic Fairness

We study computational problems with a focus on fairness guarantees, predominantly problems related to social networks.

Minimal Perfect Hashing

This research line investigates the design of minimal perfect hash functions that enable constant-time, collision-free access to large static datasets while compressing storage requirements near the information-theoretic lower bound.

Compressed Indexes for Regular Languages - REGINDEX

Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics.

Streaming and Sketching

We study streaming and sketching algorithms for processing large datasets

publications

Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Published in Journal of the ACM 67(1): 2:1-2:54, 2020

Indexing highly repetitive texts—such as genomic databases, software repositories and versioned text collections—has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w/ log σ) factor, where σ is the alphabet size and w = Ω (log n) is the RAM machine size in bits, we support count and locate in O(⌈ m log (σ)/w ⌉) and O(⌈ m log (σ)/w ⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log (n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log (n/r)+ℓ log (σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log (n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log (n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1–2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1–2 orders of magnitude in space and/or 2–3 in time.

Download Paper

Near-Optimal Approximate Shortest Paths and Transshipment in Distributed and Streaming Models

Published in SIAM J. Comput. 50(3): 815-856, 2021

We present a method for solving the transshipment problem — also known as uncapacitated minimum cost flow — up to a multiplicative error of 1 +𝜀 in undirected graphs with nonnegative edge weights using a tailored gradient descent algorithm. Using ˜𝑂⁡( ⋅) to hide polylogarithmic factors in 𝑛 (the number of nodes in the graph), our gradient descent algorithm takes ˜𝑂⁡(𝜀^{−2}) iterations, and in each iteration it solves an instance of the transshipment problem up to a multiplicative error of polylog 𝑛. In particular, this allows us to perform a single iteration by computing a solution on a sparse spanner of logarithmic stretch. Using a randomized rounding scheme, we can further extend the method to finding approximate solutions for the single-source shortest paths (SSSP) problem. As a consequence, we improve upon prior works by obtaining the following results: (1) Broadcast CONGEST model: (1 +𝜀)-approximate SSSP using ˜𝑂⁡((√𝑛 +𝐷)⁢𝜀^{−3}) rounds, where 𝐷 is the (hop) diameter of the network. (2) Broadcast Congested Clique model: (1 +𝜀)-approximate transshipment and SSSP using ˜𝑂⁡(𝜀^{−2}) rounds. (3) Multipass Streaming model: (1 +𝜀)-approximate transshipment and SSSP using ˜𝑂⁡(𝑛) space and ˜𝑂⁡(𝜀^{−2}) passes. The previously fastest SSSP algorithms for these models leverage sparse hop sets. We bypass the hop set construction; computing a spanner is sufficient with our method. The above bounds assume nonnegative edge weights that are polynomially bounded in 𝑛; for general nonnegative weights, there is an additional multiplicative overhead equal to the logarithm of the maximum ratio between nonzero weights. Our algorithms can also handle asymmetric costs for traversing edges in opposite directions. In this case, we obtain an additional multiplicative dependence of the maximum ratio between the two costs on some edge.

Download Paper

Sparse and skew hashing of kmers

Published in Bioinformatics, Volume 38, 2022

A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the hearth of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings—in such cases, the memory consumption and query efficiency of the data structure is a concrete challenge.

Download Paper

Parallel and External-Memory Construction of Minimal Perfect Hash Functions with PTHash

Published in IEEE Transactions on Knowledge and Data Engineering, Volume 36, Issue 3, 2023

A function f : U → {0,…,n−1} is a minimal perfect hash function for a set S ⊆ U of size n, if f bijectively maps S into the first n natural numbers. These functions are important for many practical applications in computing, such as search engines, computer networks, and databases. Several algorithms have been proposed to build minimal perfect hash functions that: scale well to large sets, retain fast evaluation time, and take very little space, e.g., 2 – 3 bits/key. PTHash is one such algorithm, achieving very fast evaluation in compressed space, typically many times faster than other techniques.

Download Paper

Co-lexicographically Ordering Automata and Regular Languages - Part I

Published in Journal of the ACM 70.4: 1-73, 2023

The states of a finite-state automaton 𝒩 can be identified with collections of words in the prefix closure of the regular language accepted by 𝒩. But words can be ordered, and among the many possible orders a very natural one is the co-lexicographic order. Such naturalness stems from the fact that it suggests a transfer of the order from words to the automaton’s states. This suggestion is, in fact, concrete and in a number of articles automata admitting a total co-lexicographic (co-lex for brevity) ordering of states have been proposed and studied. Such class of ordered automata — Wheeler automata — turned out to require just a constant number of bits per transition to be represented and enable regular expression matching queries in constant time per matched character.Unfortunately, not all automata can be totally ordered as previously outlined. In the present work, we lay out a new theory showing that all automata can always be partially ordered, and an intrinsic measure of their complexity can be defined and effectively determined, namely, the minimum width p of one of their admissible co-lex partial orders–dubbed here the automaton’s co-lex width. We first show that this new measure captures at once the complexity of several seemingly-unrelated hard problems on automata. Any NFA of co-lex width p: (i) has an equivalent powerset DFA whose size is exponential in p rather than (as a classic analysis shows) in the NFA’s size; (ii) can be encoded using just Θ(log p) bits per transition; (iii) admits a linear-space data structure solving regular expression matching queries in time proportional to p2 per matched character. Some consequences of this new parameterization of automata are that PSPACE-hard problems such as NFA equivalence are FPT in p, and quadratic lower bounds for the regular expression matching problem do not hold for sufficiently small p.Having established that the co-lex width of an automaton is a fundamental complexity measure, we proceed by (i) determining its computational complexity and (ii) extending this notion from automata to regular languages by studying their smallest-width accepting NFAs and DFAs. In this work we focus on the deterministic case and prove that a canonical minimum-width DFA accepting a language ℒ–dubbed the Hasse automaton ℋ of ℒ–can be exhibited. ℋ provides, in a precise sense, the best possible way to (partially) order the states of any DFA accepting ℒ, as long as we want to maintain an operational link with the (co-lexicographic) order of ℒ’s prefixes. Finally, we explore the relationship between two conflicting objectives: minimizing the width and minimizing the number of states of a DFA. In this context, we provide an analogue of the Myhill-Nerode Theorem for co-lexicographically ordered regular languages.

Download Paper

Where the patterns are: repetition-aware compression for colored de Bruijn graphs

Published in Journal of Computational Biology, Vol. 31, No. 10, 2024

We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing.

Download Paper

A survey of BWT variants for string collections

Published in Bioinformatics, Volume 40, Issue 7, 2024

In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows–Wheeler transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT used has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order. We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on eight real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2.

Download Paper

Decentralized Low-Stretch Trees via Low Diameter Graph Decompositions

Published in SIAM J. Comput. 53(2): 247-286, 2024

We study the problem of approximating the distances in an undirected weighted graph 𝐺 by the distances in trees based on the notion of stretch. Focusing on decentralized models of computation such as the 𝖢𝖮𝖭𝖦𝖤𝖲𝖳, 𝖯𝖱𝖠𝖬, and semi-streaming models, our main results are as follows: (1) We develop a simple randomized algorithm that constructs a spanning tree such that the expected stretch of every edge is 𝑂⁡(log^3⁡𝑛), where 𝑛 is the number of nodes in 𝐺. If 𝐺 is unweighted, then this algorithm can be implemented to run in 𝑂⁡(hop⁡(𝐺)) rounds in the 𝖢𝖮𝖭𝖦𝖤𝖲𝖳 model, where hop⁡(𝐺) is the hop-diameter of 𝐺; thus our algorithm is asymptotically optimal in this case. In the weighted case, the run-time of the algorithm matches the currently best known bound for exact single source shortest path (SSSP) computations, which despite recent progress is still separated from the lower bound of Ω⁢(√𝑛 +hop⁡(𝐺)) by polynomial factors. A naive attempt to replace exact SSSP computations with approximate ones in order to improve the complexity in the weighted case encounters a fundamental challenge, as the underlying decomposition technique fails to work under distance approximation. (2) We overcome this obstacle by developing a technique termed blurry ball growing. This technique, in combination with a clever algorithmic idea of Miller, Peng, and Xu (SPAA 2013), allows us to obtain low diameter graph decompositions with small edge cutting probabilities based solely on approximate SSSP computations. (3) Using these decompositions, we in turn obtain metric tree embedding algorithms in the vein of the celebrated work of Bartal (FOCS 1996), whose computational complexity is optimal up to polylogarithmic factors not only in the 𝖢𝖮𝖭𝖦𝖤𝖲𝖳 model but also in the 𝖯𝖱𝖠𝖬 and semi-streaming models. Our embeddings have the additional useful property that the tree can be mapped back to the original graph such that each edge is “used” only logarithmically many times. This property is of interest for capacitated problems and for simulating 𝖢𝖮𝖭𝖦𝖤𝖲𝖳 algorithms on the tree into which the graph is embedded.

Download Paper

Analysing New Entropy Measures for Tries

Published in International Symposium on String Processing and Information Retrieval, 2025

We introduce new entropy measures for tries taking into account the distribution of the edge labels. To do that, we study the combinatorial problem of counting the number of tries with a given symbol distribution. We provide an alternative proof for the closed formula counting the tries belonging to such a class. This formula allows us to directly deﬁne the worst-case entropy for the aforementioned set of tries. Moreover, we propose a new notion of k-th order empirical entropy for tries and we show that the relationships between these two entropy measures are similar to those between the corresponding well-known measures for strings. Contrary to the label entropy [FOCS ’05], which was designed to compress the labels of a node-labeled ordered tree, our empirical entropy considers not only the labels, but the entire structure of the trie. Finally, we relate our empirical entropy to the repetitiveness measure r proposed by Prezza [SODA ’21], which counts the number of runs in the XBWT of the trie. We show that these measures exhibit relations analogous to those of their string counterparts.

Download Paper

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.

RAVEN — Research on Algorithms VENice

Posts by Collection

people

portfolio

projects

publications

talks

teaching