Strings are fundamental objects in many important computer science
applications, such as text/speech processing, bioinformatics, data
storage and access, and communication systems. However, processing
strings presents various challenges in computation and communication.
For example, many of today's applications involving strings are
implemented on large data sets, and the known algorithms are often too
costly in both time and space. Similarly, when strings are transmitted
between systems, errors like insertions and deletions occur frequently,
which can lead to problems from a simple misunderstanding to a severe
system failure. These challenges are usually addressed by high cost
solutions, where one either requires a large amount of resources for the
algorithms, or expensive hardware that keeps systems strictly
synchronized. The overarching goal of this project is to understand the
fundamental question of how to more efficiently compute different
string measures, and transmit strings reliably in the presence of
insertions and deletions. This will lead to a deeper theoretical
understanding of the nature of various string measures and errors, as
well as potentially much more efficient solutions in practice. Examples
of benefits include algorithms that use much less resources for handling
large data sets, and more reliable communication systems in adversarial
environments. The project also has plans for mentoring PhD students,
integration of the research topics into courses taught to students from
a variety of different backgrounds, and support of under-represented
groups of students in computer science.
The project contains two sets of specific goals. The first set of goals
seeks to understand how to design efficient error-correcting codes for
insertions and deletions. These include codes and document-exchange
protocols over the binary alphabet with asymptotically optimal
parameters; codes that form a linear subspace or have a low density
parity check matrix, which allow fast encoding and decoding; list
decodable codes that can correct a larger fraction of errors with
outputting a small list of possibly correct messages; and locally
decodable codes that can correctly recover any target message bit with
only a few queries of the codeword. The second set of goals investigates
the space complexity of various string measures, such as edit distance,
longest common subsequence, and longest increasing subsequence.
Specifically, the goal is to both design algorithms that require much
smaller space to compute or approximate these string measures, and to
prove better space lower bounds in various models. These two sets of
goals are also naturally connected, since the codes in the first part
can often be used to give space lower bounds in the second part. The
study of these topics will be based on techniques from several different
areas such as probability theory, information theory, combinatorics,
algorithm design, communication complexity, and pseudorandomness, and
will further foster the interactions among these areas towards
breakthroughs.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.