1999 — 2002
Orlitsky, Alon
Vector Quantization: Theoretical Limits and Practical Constructions @ University of California-San Diego
The investigators perform a systematic study of issues related to the tradeoffs between dimension and rate in vector quantization. They investigate the compression gains achievable by simultaneous quantization of a block of data over scalar quantization. Theoretical limits on the savings, rates of convergence, and algorithms achieving them are sought for worst-case as well as average-case performance criteria. Special consideration is given to "combinatorial" distortion measures that attain only two values: zero or infinity. These measures allow only certain types of errors and are important in applications where some mistakes cannot be tolerated. These measures are simpler to analyze, yet display many of the complexities of general measures.
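A minimal sketch of the block-versus-scalar tradeoff described above, assuming a toy correlated Gaussian source and plain Lloyd (k-means) codebook design; the source, rates, and parameters are illustrative assumptions, not the project's constructions. At the same rate of 2 bits per sample, the two-dimensional quantizer can exploit correlation that the scalar quantizer cannot.

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd(train, k, iters=50):
    """Plain Lloyd (k-means) codebook design on the training vectors."""
    codebook = train[rng.choice(len(train), size=k, replace=False)]
    for _ in range(iters):
        dists = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            cell = train[labels == j]
            if len(cell):
                codebook[j] = cell.mean(0)
    return codebook

def distortion(data, codebook):
    """Mean squared error per sample under nearest-codeword quantization."""
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.min(1).mean() / data.shape[1]

# Correlated source: each block is a pair (X, X + small noise).
n = 20000
x = rng.normal(size=n)
blocks = np.stack([x, x + 0.3 * rng.normal(size=n)], axis=1)
samples = blocks.reshape(-1, 1)

scalar_codebook = lloyd(samples, k=4)    # 2 bits per sample, one sample at a time
vector_codebook = lloyd(blocks, k=16)    # 4 bits per 2-sample block, same rate

print("scalar quantizer, MSE per sample:", distortion(samples, scalar_codebook))
print("vector quantizer, MSE per sample:", distortion(blocks, vector_codebook))
```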
2003 — 2007
Cruz, Rene; Fainman, Yeshaiahu (co-PI); Papen, George; Orlitsky, Alon; Ford, Joseph (co-PI)
NRT: Micro-Buffered Networks @ University of California-San Diego
The proposed project will investigate a broad class of packet-based network architectures, protocols, and services in which the core switching/routing fabric has limited or no buffering capability. These networks, called micro-buffered networks, have two distinguishing features. First, packet loss due to contention is treated as an erasure that can be corrected via coding techniques. Information is encoded into codewords, and each codeword is divided into fragments. The redundancy built into the codeword acts like a "virtual buffer" that mitigates contention and packet loss: even if several of the codeword fragments are erased as they pass through the network, there is still a high probability that the information within the codeword can be decoded correctly at the destination. Adaptive flow control can be implemented by adjusting the coding overhead (code rate) as well as the fragment generation rate. The second distinguishing feature is robustness with respect to hardware and routing failures. In particular, different fragments belonging to the same codeword can be sent over different routes within the network to increase resilience. Route diversity also provides unique security and authentication features.
The intellectual merit of the proposed project is the exploration of new architectural approaches that use little or no buffering in high-speed networks where buffers are becoming increasingly difficult to implement. Results of this project will impact research directions in optical systems technology, and increase the base of knowledge in communication systems theory. The project will provide unique training of both undergraduate and graduate students in a systems-oriented multi-disciplinary effort.
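A minimal sketch of the "virtual buffer" idea, assuming a toy XOR-based erasure code decoded by Gaussian elimination over GF(2); the generator matrix, fragment sizes, and loss pattern are illustrative assumptions, not the project's protocol. Three of six coded fragments are erased by contention, and the surviving three still determine the original data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, frag_bytes = 3, 6, 16                    # 3 data fragments, 6 coded fragments

data = rng.integers(0, 256, size=(k, frag_bytes), dtype=np.uint8)

# Systematic binary generator matrix: 3 data rows plus 3 XOR-parity rows.
G = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=np.uint8)

# Encode: coded[i] is the XOR of the data fragments selected by row i of G.
coded = np.zeros((n, frag_bytes), dtype=np.uint8)
for i in range(n):
    for j in range(k):
        if G[i, j]:
            coded[i] ^= data[j]

# Contention erases fragments 0, 2, and 4; only three fragments survive.
survivors = [1, 3, 5]
A = G[survivors].copy()
Y = coded[survivors].copy()

# Gaussian elimination over GF(2) on the augmented system [A | Y].
row = 0
for col in range(k):
    pivot = next((r for r in range(row, len(A)) if A[r, col]), None)
    assert pivot is not None, "surviving fragments do not determine the data"
    A[[row, pivot]] = A[[pivot, row]]
    Y[[row, pivot]] = Y[[pivot, row]]
    for r in range(len(A)):
        if r != row and A[r, col]:
            A[r] ^= A[row]
            Y[r] ^= Y[row]
    row += 1

assert np.array_equal(Y[:k], data)
print("recovered all", k, "data fragments from", len(survivors), "of", n, "coded fragments")
```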
2003 — 2007
Orlitsky, Alon
Universal Compression of Infinite Alphabets With Applications to Language Modeling @ University of California-San Diego
In many data-compression applications the underlying distribution is not known. In these applications one typically assumes that the distribution belongs to a large class of natural distributions and tries to devise compression algorithms that perform well for all distributions in the class. For distributions over finite alphabets much is known; for example, it has been shown that any stationary ergodic sequence can be compressed as well as when the distribution is known in advance. However, in many real applications, such as text and image compression, the alphabet is large compared to the string length, often even infinite. Unfortunately, it has been shown that for large alphabets universal compression cannot be achieved: as the size of the alphabet grows, the redundancy, namely the penalty for not knowing the distribution, increases to infinity.
Recently, we took a different approach to the compression of strings over large alphabets. The description of any string, over any alphabet, can be decomposed into two parts: description of the dictionary, namely the symbols appearing in the string, and of their pattern, namely the order in which they appear. The descriptions of the pattern and the dictionary can be viewed as two separate problems. The pattern is related to the high-level structure of the string whereas the dictionary relates to the composition of the symbols. In many applications such as language modeling for speech recognition, the pattern is more significant.
We have shown that patterns of strings drawn according to independent and identically distributed random variables can be compressed as if the distribution were known in advance. We now study extensions of this result that, if proven, will render it more powerful and practical. We are studying sequential compression algorithms that compress the sequence one symbol at a time, practical algorithms that can be performed using few operations per symbol, and extensions of these results to distributions with memory; such distributions model several practical applications. We are also trying to improve the upper and lower bounds on the best compression rate that can be achieved.
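A minimal sketch of the pattern/dictionary decomposition described above; the string is an arbitrary example, and the functions are illustrative, not the compression algorithms under study.

```python
def pattern(s):
    """Replace each symbol by the order in which it first appears."""
    index = {}
    out = []
    for symbol in s:
        if symbol not in index:
            index[symbol] = len(index) + 1
        out.append(index[symbol])
    return out

def dictionary(s):
    """The distinct symbols, listed in order of first appearance."""
    seen = []
    for symbol in s:
        if symbol not in seen:
            seen.append(symbol)
    return seen

s = "abracadabra"
print(pattern(s))      # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
print(dictionary(s))   # ['a', 'b', 'r', 'c', 'd']

# The pair (dictionary, pattern) determines the string.
assert "".join(dictionary(s)[i - 1] for i in pattern(s)) == s
```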
2005 — 2009
Orlitsky, Alon
Predicting the Unlikely: Theory, Algorithms, and Applications @ University of California-San Diego
Many scientific and engineering endeavors call for estimating probabilities and distributions based on an observed data sample. When the sample size is large relative to the number of possible outcomes, for example when a biased coin is tossed many times, estimation is simple. However, in many applications the number of possible outcomes is large compared to the sample size. For example, in language modeling - used in compression, speech recognition, and data mining - the number of words and contexts is large compared to the amount of text at hand. Estimation in this large-alphabet regime is much more complex, and has been studied for over two centuries. While some good estimators have been derived, for example those named after Laplace, Krichevsky-Trofimov, and Good-Turing, very few optimality properties have been established for them, and each is known to perform poorly under some conditions.
Adopting an information-theoretic viewpoint, the investigators undertake a systematic study of these issues. They concentrate on two broad problems concerning the estimation of: (1) the probability of each observed outcome and of the collection of outcomes not yet observed; (2) the underlying distribution, which does not associate probabilities with specific outcomes. For each problem they seek estimation algorithms that perform well in practice and have provable optimality properties, such as small Kullback-Leibler divergence or other distance measures between the underlying and estimated distributions. The problems they address are both theoretical, for example the data size required to estimate the underlying distribution to within a given confidence level, and computational, concerning the complexity and sequentiality of the derived algorithms.
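A minimal numerical sketch of estimating the probability of outcomes not yet observed, using the classical Good-Turing missing-mass estimate on a synthetic Zipf-like source; the source and sample size are illustrative assumptions, not results of the project.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 2000                                  # alphabet size
p = 1.0 / np.arange(1, k + 1)             # Zipf-like source
p /= p.sum()

n = 500                                   # sample much smaller than the alphabet
sample = rng.choice(k, size=n, p=p)
counts = np.bincount(sample, minlength=k)

n1 = int((counts == 1).sum())             # symbols seen exactly once
good_turing_missing = n1 / n              # Good-Turing estimate of the unseen mass
true_missing = float(p[counts == 0].sum())

print(f"Good-Turing estimate of unseen probability mass: {good_turing_missing:.3f}")
print(f"true unseen probability mass:                    {true_missing:.3f}")
```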
2007 — 2011
Orlitsky, Alon; Rosing, Tajana (co-PI)
Collaborative Research: Design and Analysis of Compressed Sensing DNA Microarrays @ University of California-San Diego
The diverse functions performed by a living cell during its life cycle are controlled and regulated through complicated gene- and protein-interaction networks. Any pattern of irregular behavior of genes in the network can lead to cell malfunction, cell death, or the emergence of diseases such as cancer. It is therefore of crucial importance to recognize erroneous gene-interaction patterns and compare them to those in healthy cells. For this type of study, one of the most frequently used bioengineering systems is the well-known DNA microarray. DNA microarrays consist of grids of spots containing unique genetic identifiers for each of the tested genes, capable of generating snapshots of gene activity through selective DNA sequence annealing. Microarrays have also found many other applications in molecular biology, most notably for detecting hostile microbial agents in food, water, and air. One of the main drawbacks of current microarray designs is that, for the purpose of whole-genome studies, they are severely underutilized; similarly, for biosensing applications, existing microarray systems cannot be used for simultaneous identification of a large number of microorganisms and their strains due to technological limitations.
The investigators study novel array architectures, termed compressed sensing DNA microarrays. The research involves finding DNA probes that serve as group identifiers for classes of microorganisms; designing sparse sensing matrices for DNA group identifiers; developing compressed sensing reconstruction algorithms capable of handling the saturation effects that arise at high agent concentrations; characterizing the fundamental trade-offs between distortion and sensor dimension for non-linear arrays; and analyzing the complexity of integrating compressed sensing microarrays into existing biosensor networks.
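A minimal sketch of compressed-sensing recovery of a sparse agent-concentration vector, assuming a random Gaussian sensing matrix and orthogonal matching pursuit in place of the designed probe matrices and reconstruction algorithms described above; all sizes and choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_targets, n_spots, sparsity = 60, 20, 3     # many candidate organisms, few spots

# A Gaussian sensing matrix stands in here for the designed probe matrix.
A = rng.normal(size=(n_spots, n_targets))
A /= np.linalg.norm(A, axis=0)               # unit-norm columns

x_true = np.zeros(n_targets)
x_true[rng.choice(n_targets, sparsity, replace=False)] = rng.uniform(1, 3, sparsity)
y = A @ x_true                               # idealized, noise-free readout

# Orthogonal matching pursuit: greedily add the column most correlated
# with the residual, then refit the selected coefficients by least squares.
support, residual = [], y.copy()
for _ in range(sparsity):
    scores = np.abs(A.T @ residual)
    scores[support] = -np.inf                # never pick the same column twice
    support.append(int(scores.argmax()))
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    residual = y - A[:, support] @ coef

x_hat = np.zeros(n_targets)
x_hat[support] = coef
print("estimated support:", sorted(support))
print("true support:     ", sorted(np.flatnonzero(x_true).tolist()))
print("max abs error:    ", float(np.abs(x_hat - x_true).max()))
```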
2011 — 2016
Orlitsky, Alon
CIF: Medium: Collaborative Research: Information Theory and Statistical Inference From Large-Alphabet Data @ University of California-San Diego
Statistical analysis is key to many challenging applications such as text classification, speech recognition, and DNA analysis. However, the amount of data available is often comparable to, or even smaller than, the size of the symbol set (alphabet) constituting the data. Unfortunately, not much is known about optimal inference in this so-called large-alphabet domain. Recently, several promising approaches have been developed by different scientific communities, including Bayesian nonparametrics in statistics and machine learning, universal compression in information theory, and the theory of graph limits in mathematics and computer science.
The investigators study the problem drawing from these multiple perspectives, but with a particular focus on developing the information theoretic approach. The research studies analytical properties of the "pattern maximum likelihood" estimator, which performs well in practice but is not understood theoretically, and also explores computational speedups. Moreover, it attempts to delineate which problem classes are better handled by Bayesian nonparametric techniques and which by the pattern approach, and explores links between these approaches. The investigators use the resulting theory for automatic document classification, allowing for more automation in storing, retrieving, and analyzing data. Furthermore, the investigators use the theory to study genetic variations, whose link with disease diagnosis is a crucial step in the systematic quantification of biology that is playing an increasingly important role in medical advancement. The research also brings new courses to the classroom, with a special outreach effort to involve women and under-represented minorities, including through the Native Hawaiian Science and Engineering Mentorship Program.
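A minimal sketch of the profile/pattern viewpoint, using a crude Monte Carlo stand-in for pattern maximum likelihood: among uniform distributions over k symbols, it scores how likely each k makes the observed profile (the multiset of symbol multiplicities). The sample, candidate models, and Monte Carlo scoring are illustrative assumptions, not the estimator's actual computation.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def profile(sample):
    """Sorted multiplicities, e.g. 'a b a c a' -> (1, 1, 3)."""
    return tuple(sorted(Counter(sample).values()))

observed = ["a", "b", "a", "c", "a", "d", "b", "a"]     # made-up sample
target, n = profile(observed), len(observed)

def profile_likelihood(k, trials=20000):
    """Monte Carlo estimate of P(profile = target) under a uniform k-symbol source."""
    hits = 0
    for _ in range(trials):
        if profile(rng.integers(0, k, size=n)) == target:
            hits += 1
    return hits / trials

for k in (4, 6, 10, 20, 50):
    print(f"k = {k:3d}: estimated profile probability = {profile_likelihood(k):.4f}")
```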
2011 — 2014
Orlitsky, Alon
CIF: Small: Collaborative Research: Algorithms and Information-Theoretic Limits For Data-Limited Inference @ University of California-San Diego
It is an irony of our time that, despite living in the 'information age', we are often data-limited. After decades of research, scientists still debate the causes and effects of climate change, and recent work has shown that a significant fraction of the most influential medical studies over the past 13 years has subsequently been found to be inaccurate, largely due to insufficient data. One reason for this apparent paradox is that modeling complex, real-world information sources requires rich probabilistic models that cannot be accurately learned even from very large data sets. On a deeper level, research inherently resides at the edge of the possible, and seeks to address questions that available data can only partially answer. It is therefore reasonable to expect that we will always be data-limited.
This research involves developing new algorithms and performance bounds for data-limited inference. Prior work of the PIs has shown that, by taking an information-theoretic approach, one can develop new algorithms that are tailored specifically to the data-limited regime and perform better than was previously known, and in some cases are provably optimal. This project advances the goal of developing a general theory for data-limited inference by considering a suite of problems spanning multiple application areas, such as classification; determining whether two data sets were generated by the same distribution or by different distributions; distribution estimation from event timings; entropy estimation; and communication over complex and unknown channels. Whereas these problems have all been studied before in isolation, prior work of the PIs has shown it is fruitful to view them as instances of the same underlying problem: data-limited inference.
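A minimal numerical sketch of data-limited entropy estimation, comparing the naive plug-in estimate with the classical Miller-Madow bias correction on a synthetic source with far more symbols than samples; the source and the particular estimators shown are illustrative, not the algorithms developed in the project.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 1000
p = rng.dirichlet(np.ones(k))              # a random 1000-symbol source
true_entropy = float(-(p * np.log(p)).sum())

n = 300                                    # far fewer samples than symbols
counts = np.bincount(rng.choice(k, size=n, p=p), minlength=k)
q = counts[counts > 0] / n                 # empirical probabilities of seen symbols

plug_in = float(-(q * np.log(q)).sum())
miller_madow = plug_in + (int((counts > 0).sum()) - 1) / (2 * n)

print(f"true entropy:          {true_entropy:.3f} nats")
print(f"plug-in estimate:      {plug_in:.3f} nats")
print(f"Miller-Madow estimate: {miller_madow:.3f} nats")
```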
2015 — 2017
Orlitsky, Alon; Fragouli, Christina (co-PI); Verdu, Sergio
Enhancing Education and Awareness of Shannon Theory @ Institute of Electrical & Electronics Engineers, Inc.
Consistent with the National Science Foundation's goal "to initiate and support ... programs to strengthen scientific and engineering research potential [and] science and engineering education programs at all levels," this project develops materials that support education and broad awareness of the importance of key engineering advances. In particular, it supports the creation of educational films and corresponding lesson materials for K-12 mathematics and science teachers that allow students to be exposed not just to the advances in information theory, but also to how an ordinary person played a pivotal role in fostering them. By adapting material previously restricted to graduate-level courses and technical conferences for a larger audience, the project should profitably inform teachers and researchers in other fields, thereby fueling the nation's STEM workforce and improving commercial technology.
This project focuses on the efforts and advances of Claude Shannon, who established the field of information theory by stating some of its most fundamental problems and solving them. The technologies that his work made possible form a major driving force of our economy. His information theory concepts provide the foundation for nearly every aspect of modern information technology and have been applied to many fields, including communication, language, genetics, computing, cryptography, psychology, perception, memory, artificial intelligence, quantum physics, and others.
2016 — 2020
Orlitsky, Alon
CIF: Medium: Collaborative Research: Learning in High Dimensions: From Theory to Data and Back @ University of California-San Diego
Statistical modeling is the cornerstone of analyzing modern data sets, and using observed data to learn the underlying statistical model is a crucial part of most data-analysis tasks. However, with the success of data utilization came a vast increase in its complexity, expressed in complex models, numerous parameters, and high-dimensional features. This research project studies problems in learning such high-dimensional models, both in theory and in practice, with actual datasets in cutting-edge applications.
Learning high-dimensional models efficiently, both in terms of computation and in terms of the use of the data, is an important challenge. The research characterizes the fundamental limits on the sample and computational complexity of several key distribution-learning problems, as well as the associated optimal learning algorithms that achieve these limits. The learning problems underpin important tasks such as clustering, multiple hypothesis testing, and information-measure estimation. The new algorithms and methodologies developed are evaluated and applied on real data from three specific applications: 1) denoising of high-throughput transcriptomic data; 2) analysis of omics data for personalized medicine; 3) ecological population studies. While these applications are useful in their own right, there are also many other potential applications in fields such as speech recognition, topic modeling, character recognition, and neuroscience.
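A minimal sketch of the sample-complexity questions described above, assuming a synthetic k-symbol distribution and measuring how the total variation error of the empirical estimate shrinks with the sample size; the model and parameters are illustrative, not project results.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 500
p = rng.dirichlet(np.ones(k))                       # an arbitrary k-symbol model

for n in (100, 1000, 10000, 100000):
    errs = []
    for _ in range(20):                             # average over repeated samples
        counts = np.bincount(rng.choice(k, size=n, p=p), minlength=k)
        q = counts / n
        errs.append(0.5 * np.abs(p - q).sum())      # total variation distance
    print(f"n = {n:6d}: mean TV error ~ {np.mean(errs):.3f}  "
          f"(sqrt(k/n) = {np.sqrt(k / n):.3f})")
```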
2016 — 2019
Orlitsky, Alon
CIF: Small: Information Theoretic Foundations of Data Science @ University of California-San Diego
The advent of modern data generation, collection, communication, retention, and application has brought about an important new discipline termed data science. It draws upon diverse techniques from statistics, machine learning, computer science, and information theory to analyze, understand, and, most importantly, utilize data. It therefore significantly impacts diverse fields ranging from medicine to marketing, manufacturing, finance, and security. Much of the initial work in this area has centered on simple models to facilitate analysis. This project focuses on bringing these results closer to real-world applications such as natural language processing. The results of this work are expected to lead to a better ability to predict future events, detect anomalies, and classify data.
The research in this project is focused on three technical thrusts: (i) extending data science concepts to practical regimes, (ii) utilizing structure, and (iii) understanding the effect of memory in data sources. With regard to (i), the objective is to develop optimal distribution estimators not just for the traditional Kullback-Leibler divergence, but also for other metrics, and to do so for arbitrary parameter values, not just restricted asymptotic regimes. With regard to (ii), the objective is to utilize structure to improve distribution estimates and to resolve a conundrum in which theorists and practitioners employ opposing distribution-estimation techniques. With regard to (iii), the general objective is to study the problem of estimating distributions with memory. Preliminary results have uncovered an interesting dichotomy between compression and learning when memory is present. Through these technical thrusts, the project will study complex and realistic data models and derive results with important theoretical implications as well as applications in a variety of fields.
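A minimal sketch of distribution estimation under Kullback-Leibler loss in the regime where the alphabet exceeds the sample size, comparing add-constant estimators (Laplace, beta = 1; Krichevsky-Trofimov, beta = 1/2); the source and parameters are illustrative assumptions, not the estimators developed in the project. The unsmoothed empirical distribution is unusable here: whenever a symbol goes unseen, its KL loss is infinite.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 200, 100                                    # more symbols than samples
p = rng.dirichlet(np.ones(k))
counts = np.bincount(rng.choice(k, size=n, p=p), minlength=k)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats (q assumed positive)."""
    return float((p * np.log(p / q)).sum())

for beta in (1.0, 0.5):
    q = (counts + beta) / (n + beta * k)           # add-constant estimate
    print(f"beta = {beta}: KL(p || estimate) = {kl(p, q):.3f} nats")

unseen = int((counts == 0).sum())
print(f"{unseen} of {k} symbols unseen -> the empirical estimate has infinite KL loss")
```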
2017 — 2018
Orlitsky, Alon
CIF: Student Travel Support For the 2017 IEEE International Symposium On Information Theory @ University of California-San Diego
The IEEE International Symposium on Information Theory (ISIT) is the premier international conference on information theory, the scientific discipline that studies the mathematical foundations of efficient, reliable, and secure communications, including telephony, television, wireless, optical, and data storage systems. By their nature, the fundamental limits of communication and storage are not technology-dependent, and do not become obsolete with improvements in hardware or software. In fact, the performance benchmark provided by information theory has successfully guided engineers in the design of ever more efficient coding systems that leverage the technological advances to closely approach the fundamental limits.
The symposium typically attracts around 1000 paper submissions with about 600 papers accepted and 850 participants, including about 350 students who benefit from the excellent learning and networking opportunity that ISIT provides. This award provides travel funds for students to attend the symposium.
Students form an integral part of the Information Theory community. Over the past few years the IEEE Information Theory Society has undertaken several activities to promote student participation. These include the establishment of Schools of Information Theory on five continents, the ISIT Jack Keil Wolf Best Student Paper Award, the Information Theory Society Student Committee, and the ISIT tutorial series. The ISIT Jack Keil Wolf Student Paper Award is given to up to 3 outstanding papers for which a student is the principal author and presenter. This author must be a registered student of an educational institution at the time of paper submission to be eligible. The criteria for the award include both content and presentation.
In addition, an outstanding benefit of the ISIT is the close personal interaction between senior and junior researchers in information theory, which encourages the liberal exchange of ideas and critical comments. The opportunities for cross-fertilization are immense and it is vital that those who are new to the field and bring fresh ideas be present to discuss their work with those who are more experienced.