2001 — 2007 |
Suciu, Dan |
Career: Efficient Management of Web Data @ University of Washington
The goal of this research project is to enable the management of XML data on the Web to be done efficiently. The project addresses four XML data management tasks: XML publishing from relational databases, transport of XML data over the Internet, storage of large amounts of XML data, and stream processing of XML data. In XML publishing, a declarative mapping is specified from a relational database to XML. The project develops techniques for efficiently translating XML queries on the view back into SQL queries on the relational database. For XML transport, the project creates new compression techniques that exploit patterns and datatypes in a given XML data instance. Today, it is difficult to store XML data in a relational database system, because XML differs radically from the relational model. This project creates new methods for splitting XML data optimally into relations to be managed by a relational database system. Finally, the project develops a light-weight XML query processing method, by providing a set of tools for simple XML transformations that can be combined into pipelines performing complex processing. Several software tools will result from this project and will be made available in the public domain, with an expected impact both on the research community and on industry. Parts of this research project will be integrated in the database course offered at the University of Washington.
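To make the XML publishing task concrete, the following minimal Python sketch (hypothetical table, element names, and mapping; not the project's software) illustrates how a declarative mapping from a relational table to an XML view can be used to translate a simple path query on the view back into a SQL query on the underlying database.

```python
# A minimal sketch (not the project's system) of XML publishing:
# a declarative mapping from a relational table to an XML view,
# and translation of a simple path query on the view back to SQL.
# Table and element names are hypothetical.

MAPPING = {
    "catalog/product": "products",            # element path -> table
    "catalog/product/name": "products.name",  # leaf path -> column
    "catalog/product/price": "products.price",
}

def path_query_to_sql(path, predicate=None):
    """Translate an XPath-like path with an optional equality
    predicate into a SQL query over the underlying table."""
    column = MAPPING[path]
    table = column.split(".")[0]
    sql = f"SELECT {column} FROM {table}"
    if predicate:
        pred_col = MAPPING[predicate[0]]
        sql += f" WHERE {pred_col} = {predicate[1]!r}"
    return sql

# e.g. /catalog/product[price=10]/name
print(path_query_to_sql("catalog/product/name",
                        predicate=("catalog/product/price", 10)))
# SELECT products.name FROM products WHERE products.price = 10
```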
|
1 |
2002 — 2005 |
Suciu, Dan Levy, Henry (co-PI) [⬀] Halevy, Alon [⬀] Gribble, Steven |
Itr: (Software & Hardware Systems) Piazza: a Platform For Wide-Scale Distributed Data Sharing and Integration @ University of Washington
This research seeks to create a new global-scale information facility on the Internet in which structured data, rather than raw text, is the key element supported. Currently, the information provided on the Internet consists of simple text-based documents (HTML), or other unstructured types such as images and streams. The new facility will allow data sources to be described by schemas and their content to be indexed and located. Sophisticated query-oriented database-style operations will be possible over simple or complex data sources, and new views of multiple data sources can be created and shared with other clients. A new class of data processing services will be possible to collect, manipulate, analyze, and store data.
The first goal of the project is to define the basic architecture of a client, allowing it to share its data with other peers, participate in distributed query processing, and cooperate with other clients. A specific data integration formalism will be designed that will allow clients to integrate their data with others' using both global-as-view and local-as-view data integration paradigms. Building on this architecture, a new query processing technique will be developed that can adapt to both data integration paradigms. The project will then develop intelligent data placement techniques for storing query results on different clients, thus enabling dramatic performance improvements for subsequent queries. The data thus placed on servers may become stale when the source data gets updated: the project will develop specific update propagation techniques for publishing and disseminating updates. Finally, the project is developing novel data indexing techniques that allow clients to search data items globally.
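As a rough illustration of the two integration paradigms mentioned above, the sketch below (hypothetical schemas and mappings, not part of Piazza) contrasts global-as-view mappings, which are answered by unfolding view definitions, with local-as-view mappings, which require rewriting queries using views.

```python
# A minimal illustration (not Piazza's implementation) of the two
# data integration paradigms, on hypothetical schemas.

# Global-as-view (GAV): each mediated relation is defined as a query
# over the sources, so a mediated query is answered by unfolding.
GAV = {
    "Book(title, author)": "SELECT title, author FROM source1_books",
}

# Local-as-view (LAV): each source is described as a view over the
# mediated schema; answering a query requires rewriting it using views.
LAV = {
    "source2_novels(title)": "SELECT title FROM Book WHERE genre = 'novel'",
}

def unfold_gav(mediated_query, gav=GAV):
    """Answer a query over the mediated schema by substituting each
    mediated relation with its GAV definition (view unfolding)."""
    for relation, definition in gav.items():
        name = relation.split("(")[0]
        mediated_query = mediated_query.replace(name, f"({definition}) AS {name}")
    return mediated_query

print(unfold_gav("SELECT author FROM Book WHERE title = 'Dune'"))
```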
|
1 |
2002 — 2005 |
Suciu, Dan |
Containment, Equivalence, and Related Problems For Xpath Expressions @ University of Washington
A large class of applications access XML data through XPath expressions and need to make routine decisions based on a simple test: whether one XPath expression is contained in another, meaning that the answer to the first is always a subset of the answer to the second. Examples of such applications include query optimization, query rewriting, semantic caching, and XML-based content routing. Despite its apparent simplicity, the containment problem turns out to be surprisingly difficult to analyze when XPath expressions include wild-cards, descendant axes, and predicates. Previous work has focused on only toy fragments of XPath for which the containment problem is in PTIME, but these simple results fail for more realistic fragments. This project studies the containment problem for a large fragment of XPath. During initial investigations for the project it was established that the containment problem for XPath expressions that contain wild-cards, descendant axes, and filters is co-NP hard, suggesting that a complete and efficient containment algorithm is impossible to find. In light of that, several algorithms will be designed in order to explore the tradeoff between efficiency and completeness. One goal of the project is to design a complete algorithm that always returns the correct answer, runs in exponential time in general, but runs efficiently on special instances of XPath expressions. Another goal is to design a heuristic algorithm that always runs efficiently, but that may return false negatives in certain cases. Both algorithms will be analyzed formally, in order to provide full insight into what performance or precision guarantees they offer. The most promising algorithm will be implemented and made available in the public domain.
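The heuristic direction described above can be illustrated with a small, sound-but-incomplete containment test: if a homomorphism maps one tree pattern into another, containment holds, but the absence of a homomorphism may be a false negative. The sketch below uses a toy pattern encoding (child edges and '*' wildcards only) and is not the project's algorithm.

```python
# A small, sound-but-incomplete containment test in the spirit of the
# heuristic algorithm described above: p is contained in q if there is
# a homomorphism from q's tree pattern into p's. Patterns here are a toy
# encoding (child edges and '*' wildcards only).

def matches(qlabel, plabel):
    return qlabel == "*" or qlabel == plabel

def hom(q, p):
    """True if pattern q maps homomorphically onto pattern p rooted here."""
    if not matches(q["label"], p["label"]):
        return False
    # every child of q must map to some child of p
    return all(any(hom(qc, pc) for pc in p["children"])
               for qc in q["children"])

def contained_in(p, q):
    """Sufficient test for p being contained in q; may return False even
    when containment holds (a 'false negative'), but never claims
    containment wrongly for this fragment."""
    return hom(q, p)

# /a/b[c] is contained in /a/*[c]
p = {"label": "a", "children": [{"label": "b", "children": [
        {"label": "c", "children": []}]}]}
q = {"label": "a", "children": [{"label": "*", "children": [
        {"label": "c", "children": []}]}]}
print(contained_in(p, q))   # True
```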
|
1 |
2004 — 2008 |
Elmagarmid, Ahmed (co-PI) [⬀] Clifton, Christopher [⬀] Schadow, Gunther (co-PI) [⬀] Suciu, Dan Doan, Anhai (co-PI) [⬀] |
Itr - (Ase+Nhs) - (Dmc+Int): Privacy-Preserving Data Integration and Sharing
Integrating and sharing data from multiple sources has been a long-standing challenge in the database community. This problem is crucial in numerous contexts, including data integration for enterprises and organizations, data sharing on the Internet, collaboration among government agencies, and the exchange of scientific data. Many applications of national importance, such as emergency preparedness and response, as well as research in many scientific domains, require integrating and sharing data among participants.
Data integration is seriously hampered by an inability to ensure privacy. Without a privacy framework, sources are reluctant to share their data. Problems include fear of disclosing confidential information as well as regulations protecting individual privacy. While there has been progress in computing aggregations of distributed data without disclosing that data (e.g., privacy-preserving distributed data mining), such work assumes that the data integration problems (schema matching, record linkage) are solved. As a consequence, the lack of a privacy-preserving data integration framework has become a key bottleneck to deploying data integration.
This project will develop the technology needed to create and manage federated databases while controlling the disclosure of private data. While the emphasis will be on general techniques for data integration that preserve privacy, the project will work in the context of diverse but particularly relevant problem domains, including scientific research and emergency preparedness. Involvement of domain experts from these fields in developing and testing the techniques will ensure impact on areas of national importance.
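One simple building block in this space is illustrated below: two parties blind their join keys with a keyed hash and intersect only the blinded values, a deliberately simplified flavor of privacy-preserving record linkage. The data, the shared key, and the protocol details are hypothetical, and real deployments need stronger protections.

```python
# A deliberately simple sketch (not the project's protocol) of one common
# building block for privacy-preserving record linkage: two parties hash
# their join keys with a shared secret and compare only the hashes, so
# non-matching values are never revealed in the clear. Real systems need
# stronger protections (e.g., against dictionary attacks).

import hmac, hashlib

SHARED_SECRET = b"agreed-upon-out-of-band"   # hypothetical shared key

def blind(value):
    return hmac.new(SHARED_SECRET, value.lower().encode(), hashlib.sha256).hexdigest()

hospital_patients = {"Alice Smith", "Bob Jones"}
registry_patients = {"bob jones", "Carol Diaz"}

hospital_blinded = {blind(name) for name in hospital_patients}
registry_blinded = {blind(name) for name in registry_patients}

# Both sides learn only which blinded values match, not the other
# party's non-matching records.
print(len(hospital_blinded & registry_blinded))   # 1 (Bob Jones)
```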
|
0.961 |
2005 — 2009 |
Suciu, Dan |
Using Cryptography to Control Access in Published Data @ University of Washington
Information exchange via XML documents is a rapidly growing technology. However, to date, complex constraints of trust and confidentiality often prohibit the dissemination of data. Data that could safely be disseminated to others remains hidden behind firewalls. This project aims at producing lightweight tools that allow publication and dissemination of data while at the same time controlling how the data is accessed. New data management techniques are developed that use cryptographic primitives in order to enforce access control policies in published XML documents. In cryptographically enforced access control, the data owner publishes a single data instance, which is partially encrypted and which enforces all access control policies. The project develops a declarative language for access policies, based on XQuery, and a method for applying these policies to an XML data instance to produce a single, multiply-encrypted XML view. This view can then be published by the data owner on the Internet, and everyone can freely download and disseminate it. The crucial aspect is that only users holding the right keys can access encrypted parts of the XML document. Different users holding different sets of keys will have access to different parts of the document. The project also develops an XQuery interpreter to enable authorized users, holding the right keys, to execute queries on the encrypted view. The interpreter decrypts data on the fly, decrypting only the data required to answer the query. A novel kind of data model, called a protection tree, is developed, which captures how various keys protect different parts of the XML data. The protection tree is central to the proposed approach: once the security policies are applied to the XML data instance, they produce a protection tree, and the data model that forms the input to the user queries is also modeled as a protection tree. The results of this research will be applicable to providing secure access to XML documents. The project Web site (http://www.cs.washington.edu/homes/pjallen/cryptography.html) will be used for free dissemination of results; specifically, a policy query evaluator (which produces the encrypted view), a decrypting XQuery interpreter (which is used to query the encrypted view), and a consistency checker. This project will also provide educational and research experience in the Cyber Trust area.
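The following sketch illustrates the flavor of key-based access control on a single published instance, using the Python cryptography package; the document parts, key assignments, and policy are hypothetical, and this is not the project's policy evaluator or interpreter.

```python
# A minimal sketch of key-based access control on a published document:
# different parts are encrypted with different keys, so a single
# published instance reveals each part only to holders of the right key.
# Requires the 'cryptography' package; element names are hypothetical.

from cryptography.fernet import Fernet, InvalidToken

key_public  = Fernet.generate_key()   # e.g. key given to all readers
key_medical = Fernet.generate_key()   # e.g. key given to physicians only

document = {
    "name":    (key_public,  b"<name>Alice</name>"),
    "history": (key_medical, b"<history>...</history>"),
}

# The owner publishes a single, partially encrypted view.
published = {part: Fernet(k).encrypt(data)
             for part, (k, data) in document.items()}

def read(published, my_keys):
    """Decrypt whatever parts the reader's keys unlock; skip the rest."""
    visible = {}
    for part, token in published.items():
        for k in my_keys:
            try:
                visible[part] = Fernet(k).decrypt(token)
                break
            except InvalidToken:
                continue
    return visible

print(read(published, [key_public]))                # only <name>
print(read(published, [key_public, key_medical]))   # both parts
```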
|
1 |
2005 — 2009 |
Halevy, Alon (co-PI) [⬀] Suciu, Dan |
Reconciling Semantic Heterogeneity by Leveraging Past Experience @ University of Washington
The goal of this research project is to develop methods for bridging semantic heterogeneity. Semantic heterogeneity arises in contexts where data needs to be shared among multiple data sources and applications, and these sources use different terminologies. For example, companies own a large number of databases and need to coordinate between them in order to leverage their value. Similarly, large-scale scientific projects and coordination among government agencies also require sharing data across multiple repositories. The approach consists of collecting a large number of schemas in a particular domain and trying to learn the patterns, and variations on patterns, that database designers use in the domain. By leveraging such patterns, it is possible to match between previously unseen database schemata in the domain. The techniques are validated by developing systems for matching between disparate schemata, and by applying the techniques to searching the growing number of web services available today on the World Wide Web. One of the systems being built by this research is a search engine for web services that attempts to get at the underlying meaning of web-service operations and will be available from the University of Washington (http://data.cs.washington.edu/schemaMatching/index.htm). The results of the project will provide a set of online services as well as public data sets that can be used by the research community. Possible direct applications of this research include biomedical informatics and deep-web search.
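A toy version of corpus-assisted matching is sketched below: attribute-name variants collected from previously seen schemas are used to match the columns of an unseen schema by string similarity. The corpus, names, and threshold are hypothetical; the project's learned models are far richer.

```python
# An illustrative sketch of corpus-assisted schema matching,
# hypothetical names throughout.

from difflib import SequenceMatcher

# "Corpus" of variants observed for each mediated concept.
CORPUS = {
    "author": {"author", "auth_name", "writer", "book_author"},
    "price":  {"price", "list_price", "cost_usd", "amount"},
}

def best_match(column, corpus=CORPUS, threshold=0.6):
    """Match an unseen column name to the concept with the most
    similar known variant; return None below the threshold."""
    def sim(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    concept, score = max(((c, max(sim(column, v) for v in variants))
                          for c, variants in corpus.items()),
                         key=lambda x: x[1])
    return concept if score >= threshold else None

print(best_match("AuthName"))    # 'author'
print(best_match("list_cost"))   # likely 'price'
print(best_match("isbn"))        # None (unseen concept)
```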
|
1 |
2005 — 2010 |
Suciu, Dan Halevy, Alon (co-PI) [⬀] Tarczy-Hornoch, Peter |
Ii: Information Integration in the Presence of Uncertainty @ University of Washington
Experience building and fielding data integration systems has shown that they are brittle in a very fundamental way: they cannot handle uncertainty about data or about how data is combined to provide answers. This limitation is especially pronounced in scientific applications, where data is inherently uncertain and the models of the domain are constantly evolving. From the users' perspective, the inability to model uncertainty can result in the loss of relevant answers, an explosion of irrelevant answers, and no justification of answers. The limitation is deeply rooted in the deterministic paradigm underpinning data management systems today, which is designed to support scalability to large data instances but is incapable of representing and reasoning about uncertainty.
A new approach to data integration, where uncertainties are handled explicitly, is proposed. Over the past few years, the BioMediator system, which integrates about a dozen public data sources on genes and proteins, has been available. The group has observed and documented the types of uncertainty that limit the power of any mediator-based integration system like BioMediator. These uncertainties occur at three levels: at the data instance level, at the schema level, and at the user query level. In the new approach, all uncertainties will be made explicit in the system and represented in a uniform way, using a probabilistic data model. The mediator system supports a query language based on SQL but with a modified semantics: the answers to each query are annotated with a probability score and with lineage information.
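The semantics sketched below (hypothetical data, not the U2 or BioMediator code) shows what such annotated answers look like: each answer carries a lineage over independent base-tuple events, and its probability score is derived from that lineage.

```python
# A small sketch of the query semantics described above: each answer
# carries a lineage, a DNF over independent base-tuple events, and its
# probability is derived from it. Data is hypothetical.

from itertools import product

# Independent base tuples with probabilities (tuple id -> probability).
prob = {"t1": 0.9, "t2": 0.5, "t3": 0.7}

# Lineage of one answer: it is produced by (t1 AND t2) OR t3.
lineage = [("t1", "t2"), ("t3",)]

def answer_probability(lineage, prob):
    """Exact probability by summing over all truth assignments to the
    base events (exponential, fine for a tiny example)."""
    ids = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(ids)):
        world = dict(zip(ids, bits))
        if any(all(world[t] for t in clause) for clause in lineage):
            w = 1.0
            for t, present in world.items():
                w *= prob[t] if present else 1 - prob[t]
            total += w
    return total

# P((t1 AND t2) OR t3) = 1 - (1 - 0.45) * (1 - 0.7) = 0.835
print(round(answer_probability(lineage, prob), 3))
```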
The new work will involve the design of a probabilistic data model, the development of probabilistic query processing and optimization techniques, and the design of user feedback methods. They will build a system, U2 (short for Uncertain Information Integration), that will model uncertainty at all levels of the system, including the query language, mediated schema, source mappings, and source data. U2 will explain its results to the user and will actively seek to resolve uncertainty when it arises, incorporating feedback from the user where possible. They will extend the BioMediator system and collaborate with its current users.
There are three areas of broader impact. First, issues of information integration will be integrated more tightly into the undergraduate and graduate database curriculum. Second, the research will fuel collaboration with biomedical computing research, and will extend the BioMediator system that is currently in use by practitioners in the field. Finally, tools and services will be made available for public use.
|
1 |
2005 — 2009 |
Suciu, Dan Halevy, Alon (co-PI) [⬀] |
Cri: Global-Scale Data Sharing Using Statistics and Probabilities @ University of Washington
Abstract
Proposal: CNS 0454425
PI: Dan Suciu
Institution: University of Washington
Program: NSF 04-588 CISE Computing Research Infrastructure
Title: CRI: Global-Scale Data Sharing Using Statistics and Probabilities
This project will address the problem of semantic heterogeneity that occurs in large-scale data integration by exploring the scalability of novel techniques to very large amounts of data. Two such techniques will be considered. One is corpus-based schema matching, where a large collection (corpus) of schemas is stored, analyzed, and preprocessed in order to enhance automatic schema matching. The second technique consists of probabilistic query answering, which efficiently computes complex SQL queries on probabilistic databases. To study the scalability of these techniques to large-scale data integration tasks, a significant fragment of the Web will be downloaded and stored locally, on a cluster of servers. Data instances and their schemas will be extracted automatically from these Web pages. The resulting corpus of schemas will be matched using a variety of techniques, and the matches interpreted probabilistically. The resulting data organization is called the semantic cache. Users will be able to formulate rich queries over the semantic cache, for example in a language like SQL. Each query will be evaluated on the global data and given a probabilistic interpretation. The answers will be returned to the user ranked according to their probabilities. This project has the potential, if successful, to impact a variety of applications where large-scale data integration is currently impossible to achieve, such as scientific data sharing, electronic commerce, and emergency management systems.
|
1 |
2006 — 2011 |
Suciu, Dan |
Ct-T: Collaborative Research: Preserving Utility While Ensuring Privacy For Linked Data @ University of Washington
Johannes Gehrke, Cornell University, Award 0627680, Panel P060970
CT: T Collaborative Research: Preserving Utility While Ensuring Privacy for Linked Data
Abstract
This research investigates how to publish data while limiting disclosure about entities in the data. An example is census data, an invaluable source of socioeconomic data. Simple approaches for limiting disclosure, such as removing identifying attributes like social security number and name, are not sufficient because combinations of other information in the data can help identify individuals, especially when the data can be linked to external databases. It is this linkage, and more generally the fact that data is often explicitly linked to other data, that is the focus of this project.
In linked data, data records are linked through relationships between records. Examples include data about students and the classes they took, where the links are the associations between a student and her classes; data about network packets and the routers that forwarded those packets, where the links are the associations of packets to routers; or data about people and their social networks, where the links are the social relationships between people. It is the explicit representation of these links in the data that violates some of the key assumptions of prior work. This research spans the whole spectrum from motivating applications of linked data, to novel privacy models and practical anonymization algorithms, to new techniques for attacking and analyzing anonymized data.
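A toy example of why explicit links break earlier assumptions: even with names removed, a node in a published social graph can sometimes be re-identified from its link structure alone. The data and attacker knowledge below are hypothetical.

```python
# A toy example of the re-identification risk that links create: even
# after names are removed from a social graph, a node can sometimes be
# re-identified purely from its degree (number of links), structural
# information that tabular anonymization did not account for.

from collections import Counter

# Published, "anonymized" friendship edges over pseudonyms.
edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u4"), ("u2", "u3")]

degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# External knowledge: the attacker knows Alice has exactly 3 friends.
candidates = [node for node, d in degree.items() if d == 3]
print(candidates)   # ['u1'] -- a unique match re-identifies Alice
```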
|
1 |
2006 — 2010 |
Shapiro, Linda [⬀] Brinkley, James (co-PI) [⬀] Suciu, Dan |
Multimedia Information Retrieval For Biological Research @ University of Washington
Multimedia Information Retrieval for Biological Research
PI: Linda G. Shapiro; Co-PIs: James F. Brinkley, III and Dan Suciu
University of Washington
The University of Washington is awarded a grant to develop a unified methodology for the organization and retrieval of biological data, in particular image data and related measurements from scientific experiments. The work will build on existing work in experiment management, approximate queries, and content-based image retrieval. A probabilistic query framework for multimedia data will be developed that provides users with a unified way to access multiple types of data. Queries will be able to handle both single data types and multiple related data types, such as registered CT and MRI scans or neuronal firing patterns and related fMRI data. The data will be organized in a way that is both easy for users to understand and efficient for query access. A prototype system will be built and evaluated on three different biological applications. The results of this work will be a new paradigm for multimedia information retrieval that satisfies the needs of biological research. The broader impacts of the research will be the use of these storage and retrieval techniques in groundbreaking biological research in a wide variety of areas where image and other multimedia data are critical. Biologists will be able to organize and access their data to perform major analyses and to gain new insights into their work. Students will be involved in the research. The PI mentors young women and encourages them to embark on careers in computer science.
|
1 |
2007 — 2011 |
Suciu, Dan |
Iii Cor: Query Evaluation and View Materialization in Probabilistic Data @ University of Washington
Query Evaluation and View Materialization in Probabilistic Data
PI: Dan Suciu (University of Washington)
Award: IIS-0713576
Databases and data management tools have a deterministic semantics: a data item either is in the data set or is not. But when data comes from multiple sources, or is extracted automatically, it often contains a variety of imprecisions that are difficult to model using the standard deterministic semantics: the same data item may have different representations in different sources, and matching algorithms are imprecise; schemas differ across sources, and schema matching tools are imprecise or incomplete or both; data at different sources may hold contradictory information; finally, some data, such as sensor data, is inherently probabilistic and hence imprecise. This project represents all sources of imprecision in a single uniform way, as data with a probabilistic semantics, and extends today's data management tools to efficiently manage data with probabilistic semantics.
Re-designing databases to handle probabilistic data is a daunting task. This project studies two problems that lie at the core of probabilistic data management: the complexity of the query evaluation problem on probabilistic databases, and the view materialization problem (deciding whether a view can be materialized and whether it can be used in other query plans). The results of this research will consist of a range of fundamental techniques to be used in a general-purpose probabilistic query processor.
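For intuition, the sketch below (hypothetical data, not the project's query processor) evaluates one simple query extensionally on a tuple-independent probabilistic database: join probabilities multiply, and the projection combines independent events as 1 - prod(1 - p).

```python
# A minimal sketch of extensional ("safe plan") evaluation on a
# tuple-independent probabilistic database, for the query
#   Q(x) :- R(x, y), S(y)
# Join probabilities multiply; the projection over y combines
# independent events as 1 - prod(1 - p). Data is hypothetical.

from collections import defaultdict

R = {("a", 1): 0.8, ("a", 2): 0.5, ("b", 1): 0.3}   # (x, y) -> prob
S = {1: 0.9, 2: 0.4}                                 # y -> prob

def evaluate(R, S):
    answers = defaultdict(lambda: 1.0)   # x -> prod(1 - p_join)
    for (x, y), pr in R.items():
        if y in S:
            answers[x] *= 1 - pr * S[y]  # independent-project step
    return {x: 1 - q for x, q in answers.items()}

print(evaluate(R, S))
# {'a': 1 - (1 - 0.72) * (1 - 0.2) = 0.776, 'b': 0.27}
```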
Intellectual Merits. The project makes new contributions that lie at the intersection of several disparate fields: logic, probability theory, knowledge representation, and traditional query processing and optimization. The project enhances the understanding of the query evaluation problem on probabilistic data, develops new algorithms for efficiently evaluating such queries and for materializing views, while leveraging existing database technology.
Broader Impact. Searching large information spaces (the Web; large collections of scientific databases; Homeland Security data) is one of the most challenging new frontiers in Computer Science. The innovation needed to support complex searches that scale to large and heterogeneous information spaces has to come from the data management research community. This project makes contributions to broaden our ability to search large information spaces. If successful, the project will be one of the pieces that help data management technology undergo a new paradigm shift, from supporting complex queries with deterministic semantics to supporting complex explorations with probabilistic semantics.
Project URL: http://www.cs.washington.edu/homes/suciu/project-probD
|
1 |
2009 — 2016 |
Koch, Christoph Halpern, Joseph (co-PI) [⬀] Lifka, David Gehrke, Johannes (co-PI) [⬀] Suciu, Dan |
Iii: Large: Causal Databases
The commercial success of data mining, and the great research interest that this area attracts, prove that there is a need for analyzing and understanding data that goes well beyond classical database queries. Users are often particularly interested in understanding the causal relationship between data items and the reasons for observations.
Current database systems cannot explicitly model the causal structure within data (although it is often implicit in the data), and thus offer no specific support for causal queries. In the absence of information about causal relationships, users have to rely on techniques for mining for statistically significant patterns in data. Causal relationships are often simply concluded from statistical dependencies. This can lead to inaccurate conclusions; correlation does not necessarily imply causation.
This project creates the foundations for a new breed of databases called causal databases. Causal databases can model causal information, and allow for queries regarding causality and explanations, which are beyond the scope of current databases. They can also take advantage of causal information that is implicit, but unexploited, in some current databases, such as those for large engineering projects. In the project, new database models and query languages for representing and transforming causal information are developed, with particular focus on large engineering databases and scientific databases. In addition, efficient and scalable techniques for processing causality and computing explanations in large causal databases are developed. This involves both work on integrating causality processing into traditional database query processing architectures and the development of special datastream techniques for scaling up to the most data-intensive applications.
Further information on the project can be found at the project web page: http://www.cs.cornell.edu/databases/causality/
|
0.957 |
2009 — 2013 |
Gatterbauer, Wolfgang Suciu, Dan Balazinska, Magdalena (co-PI) [⬀] |
Iii: Small: Beliefdb - Adding Belief Annotations to Databases @ University of Washington
In many scientific disciplines today, a community of users is working together to assemble, revise, and curate a shared data repository. As the community accumulates knowledge and the database content evolves over time, it may contain conflicting information and members can disagree on the information it should store. Relational database management systems (RDBMS) today can help these communities manage their shared data, but provide limited support for managing conflicting facts and conflicting opinions about the correctness of the stored data.
This project develops a Belief Database System (BeliefDB) that allows users to express belief annotations. These annotations can be positive (agreement) or negative (disagreement), and can be of higher order (belief annotations about other belief annotations). The approach allows users to have a structured discussion about the database content and annotations. A BeliefDB gives annotations a clearly defined semantics that lets a relational database understand and manage them efficiently.
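The sketch below is a toy rendering (hypothetical, not the BeliefDB implementation) of the annotation model just described: users agree or disagree with tuples, and annotations can themselves be annotated.

```python
# A toy illustration of belief annotations: users agree or disagree with
# a tuple, and may also annotate other users' annotations (higher-order
# beliefs). Data and structure are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    aid: int          # annotation id
    user: str
    target: object    # a base tuple id like "t7", or another annotation's aid
    positive: bool    # True = agreement, False = disagreement

annotations = [
    Annotation(1, "alice", "t7", True),    # Alice believes tuple t7
    Annotation(2, "bob",   "t7", False),   # Bob disputes t7
    Annotation(3, "carol", 2,     False),  # Carol disagrees with Bob's annotation
]

def beliefs_about(target, annotations):
    """All direct annotations on a tuple or on another annotation."""
    return [a for a in annotations if a.target == target]

print(beliefs_about("t7", annotations))   # Alice's and Bob's annotations
print(beliefs_about(2, annotations))      # Carol's higher-order annotation
```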
Intellectual merits: (i) Definition of a Belief Database Model: the project develops a formalism that extends a relational database with belief annotations on data and on previously inserted annotations. (ii) Design of a Belief Query Language: the project complements the data model with a new query language that extends SQL. (iii) Development of a canonical Belief Database Representation: the project develops approaches to store and manipulate belief databases on top of a conventional RDBMS.
Broader impact: Curated databases and shared data repositories are becoming widespread in the scientific communities. A BeliefDB provides a new data management system that addresses the need of these communities to manage conflicting data. If successful, the project will be one of the pieces that will help data management technology undergo a new paradigm shift, from managing data as content, to supporting a community of users in collaboratively creating partly conflicting database contents.
For further information on the project see the project web page: http://db.cs.washington.edu/beliefDB/
|
1 |
2011 — 2015 |
Suciu, Dan Howe, Bill |
Iii: Medium: Collaborative Research: Database-as-a-Service For Long Tail Science @ University of Washington
With tremendous amounts of data existing in scientific applications, database management becomes a critical issue, but database technology is not keeping pace. This problem is especially acute in the long tail of science: the large number of relatively small labs and individual researchers who collectively produce the majority of scientific results. These researchers lack the IT staff and specialized skills to deploy technology at scale, but have begun to routinely access hundreds of files and potentially terabytes of data to answer a scientific question. This project develops the architecture for a database-as-a-service platform for science. It explores techniques to automate the remaining barriers to use: ingesting data from native sources and automatically bootstrapping an initial set of queries and visualizations, in part by aggressively mining a shared corpus of data, queries, and user activity. It investigates methods to extract global knowledge and patterns while offering scientists access control over their data and some formal privacy guarantees. The Intellectual Merit of this proposal consists of automating non-trivial cognitive tasks associated with data work: information extraction from unstructured data sources, data cleaning, logical schema design, privacy control, visualization, and application building. As Broader Impacts, the project helps scientists reduce the proportion of time spent "handling data" rather than "doing science." All software resulting from this project is open source, and all findings are disseminated broadly through publications and workshops. Sustainable support for science users of the software is coordinated through the University of Washington eScience Institute. The research is incorporated into both undergraduate and graduate computer science courses, and the software is incorporated into domain science courses as well. The project's outreach activities include advising students through special programs geared toward under-represented groups, such as the CRA-W DREU. More information about this project can be found at http://escience.washington.edu/dbaas.
|
1 |
2011 — 2015 |
Suciu, Dan |
Iii: Small: Query Compilation On Probabilistic Databases @ University of Washington
The goal of probabilistic databases is to manage large databases where the data is uncertain. Applications include Web-scale information extraction, RFID systems, scientific data management, biomedical data integration, business intelligence, data cleaning, approximate schema mappings, and data deduplication. Despite the huge demand and the intense recent research on probabilistic databases, no robust probabilistic database systems exist to date. The reason is that the probabilistic inference problem is, in general, intractable. Fortunately, in databases there are two distinct inputs to the probabilistic inference problem: the query and the database instance. This has led recently to the discovery of safe queries, which are queries that can be evaluated efficiently on any input database, and to new probabilistic inference algorithms that exploit the structure of the query. However, unsafe queries remain a major challenge in probabilistic databases.
This project studies novel algorithms for evaluating unsafe queries on probabilistic databases, with guaranteed performance. It uses a novel approach, query compilation, which translates the query into one of four targets: OBDDs, FBDDs, d-DNNFs, and circuits using inclusion/exclusion nodes. The project pursues two thrusts: (1) It develops instance-dependent compilation techniques that significantly extend the reach of the instance-independent techniques used in safe queries. (2) It develops approximate query compilation techniques, which always run efficiently, even on intractable (query, instance) pairs, by sacrificing accuracy. These algorithms are conservative, in the sense that they return correct probabilities in all cases when the input (query, instance) pair is tractable.
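The idea behind compilation targets such as OBDDs can be illustrated on a tiny example: the probability of a Boolean lineage formula is computed by Shannon expansion along a fixed variable order, memoizing repeated sub-formulas. The sketch below uses hypothetical probabilities and is not the project's compiler.

```python
# A small sketch of the idea behind OBDD-style compilation: probability
# of a Boolean lineage formula via Shannon expansion on a fixed variable
# order, with memoization of repeated sub-formulas. Formulas are DNFs
# over independent variables; data is hypothetical.

from functools import lru_cache

prob = {"x1": 0.6, "x2": 0.5, "x3": 0.3}
order = ["x1", "x2", "x3"]

def condition(dnf, var, value):
    """Substitute var := value in a DNF given as a frozenset of clauses."""
    out = set()
    for clause in dnf:
        if var in clause:
            if value:                       # literal satisfied, drop it
                out.add(clause - {var})
            # if value is False the clause dies
        else:
            out.add(clause)
    return frozenset(out)

@lru_cache(maxsize=None)
def probability(dnf, depth=0):
    if frozenset() in dnf:                  # an empty clause is satisfied
        return 1.0
    if not dnf or depth == len(order):
        return 0.0
    v = order[depth]
    p = prob[v]
    return (p * probability(condition(dnf, v, True), depth + 1)
            + (1 - p) * probability(condition(dnf, v, False), depth + 1))

# lineage (x1 AND x2) OR x3
formula = frozenset({frozenset({"x1", "x2"}), frozenset({"x3"})})
print(round(probability(formula), 3))   # 0.51
```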
The Intellectual Merit of this project consists of new techniques for compiling queries into one of four compilation targets, OBDD, FBDD, d-DNNF, and inclusion/exclusion-based inference, using both exact inference (without performance guarantees) and approximate inference (with performance guarantees). It expands our understanding of probabilistic inference and leads to practical approaches for probabilistic database engines. As Broader Impact, the project benefits a large class of applications that require general-purpose management of uncertain data, ranging from large-scale information extraction systems, to scientific data management, to business intelligence. The project gradually incorporates topics from probabilistic data into a curriculum for graduate-level education; query compilation is already discussed in the PI's book on Probabilistic Databases (http://dx.doi.org/10.2200/S00362ED1V01Y201105DTM016), a graduate-level textbook.
For further information see the project web site at the URL: http://www.cs.washington.edu/homes/suciu/project-querycompilation.html
|
1 |
2011 — 2014 |
Suciu, Dan Howe, Bill Balazinska, Magdalena [⬀] |
Cic Rddc: Relational Data Markets in the Cloud @ University of Washington
Scientists are generating data at an unprecedented scale and rate. There is tremendous value in not only analyzing this data but also sharing it among scientists. Cloud-computing platforms are well-suited to support such sharing: They offer a single logical location for data, access to data management tools for analyzing it, and a pay-as-you-go charging mechanism. However, while cloud-computing systems offer simple pricing schemes for storage and compute resources, the economics of data sharing are poorly understood and only coarsely supported.
This research is developing models and infrastructures to establish relational data markets in the cloud. These markets enable scientists to upload their data and make it publicly available in the cloud, then recoup costs by charging others for using the data. The data markets also enable scientists to share their data with direct collaborators and see the cloud costs fairly distributed among team members.
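As a back-of-the-envelope illustration of such cost sharing (hypothetical prices and usage, not the project's pricing model), the sketch below splits the monthly cost of a shared dataset among collaborators in proportion to their query usage.

```python
# A back-of-the-envelope sketch of fairly splitting cloud costs for a
# shared dataset among collaborators, in proportion to how much each
# one queried it. Prices and usage are hypothetical.

STORAGE_COST = 20.00          # $/month for hosting the shared dataset
QUERY_COST_PER_GB = 0.05      # $ per GB scanned by a query

usage_gb = {"alice": 120, "bob": 40, "carol": 40}   # GB scanned this month

def monthly_bill(usage_gb):
    total_scanned = sum(usage_gb.values())
    bills = {}
    for user, gb in usage_gb.items():
        share_of_storage = STORAGE_COST * gb / total_scanned
        bills[user] = round(share_of_storage + gb * QUERY_COST_PER_GB, 2)
    return bills

print(monthly_bill(usage_gb))
# {'alice': 18.0, 'bob': 6.0, 'carol': 6.0}
```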
The project is building a prototype system on the Windows Azure cloud to implement these data markets.
This project is having impact on both cloud computing and science by introducing new pricing techniques for data sharing in the cloud. The software and technical papers resulting from this project are being disseminated through the project website (http://cloud-data-pricing.cs.washington.edu/).
|
1 |
2013 — 2017 |
Suciu, Dan Howe, Bill Balazinska, Magdalena (co-PI) [⬀] |
Bigdata: Mid-Scale: Dcm: a Formal Foundation For Big Data Management @ University of Washington
The ability to analyze massive-scale datasets has become an important tool both in industry and in the sciences, and many systems have recently emerged to support it. However, effective methods for deep data analytics are currently high-touch processes: they require a highly specialized expert who thoroughly understands the application domain and the pertinent disparate data sources, and who needs to repeatedly perform a series of data exploration, manipulation, and transformation steps to prepare the data for querying, machine learning, or data mining algorithms. This project explores the foundations of big data management with the ultimate goal of significantly improving productivity in big data analytics by accelerating the bottleneck step of data exploration. The project integrates two thrusts: a theoretical study, which leads to new fundamental results regarding the complexity of various new (ad hoc) data transformations in modern massive-scale systems, and a systems study, which leads to a multi-platform software middleware for expressing and optimizing ad hoc data analytics techniques. The middleware is designed to augment and integrate existing analytics solutions in order to facilitate and improve methods of interest to the community, and it is compatible with many existing platforms.
The results of this project will make it easier for domain experts to conduct complex data analysis on big data and on large computer clusters. All research results will be released in a middleware package layered on top of existing big-data systems. The middleware includes all the new algorithms, optimization techniques, fault-tolerance and skew mitigation mechanisms, and generalized aggregates developed during the project. In addition, the project develops and deploys a Web-based query-as-a-service interface to the new middleware. The project Web site (http://myriadb.cs.washington.edu) provides access to the software, additional results and information. Project results will be included in educational and outreach activities in big data analytics, including new curricula at the undergraduate, graduate, and professional levels.
|
1 |
2015 — 2019 |
Suciu, Dan Balazinska, Magdalena (co-PI) [⬀] |
Aitf: Full: Query Processing With Optimal Communication Cost @ University of Washington
Big Data analytics is changing traditional query processing in two ways. The first is a shift from single server or small-scale parallel relational databases to massively distributed architectures, where hundreds or thousands of servers are used during the computation of a single query. The second is an increased complexity in the queries being issued, from single- or star-joins, to complex graph-like structured queries. This project develops new algorithms for query processing over large distributed systems, which are optimized for the cost of communication, then implements and evaluates these algorithms in an open-source big data management system and service.
The project studies a new approach to query evaluation that computes the entire query at once, replacing the traditional approach based on a query plan. The theoretical part of this project builds on a new model, called the Massively Parallel Communication model (MPC), where the communication is the only cost. The system development is performed over the Myria big data management system and service.
The Intellectual Merit of the project consists in advancing the state of the art in both the theory and systems approaches to query evaluation in modern, massive-scale shared-nothing clusters. It develops new, fundamental algorithms for processing queries over massively distributed architectures, with a provably optimal communication cost. The project implements and deploys these algorithms in a system, validating and informing the theoretical model. In particular, the project makes the following contributions: it develops provably optimal, one-round algorithms for skewed data; it studies how and when multiple rounds can be used to further reduce the communication cost; it experiments with these novel algorithms on clusters with up to 1000 worker processes; and it develops a new theoretical model for the communication cost on large shared-nothing architectures with heterogeneous hardware.
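One concrete ingredient of such one-round algorithms is the "HyperCube" style shuffle, sketched below for the triangle query on a small hypothetical grid of servers; this is an illustration of the idea, not the Myria implementation.

```python
# A sketch of a one-round "HyperCube" shuffle in the spirit of the MPC
# setting described above (hypothetical grid sizes): p servers are
# arranged as a P1 x P2 x P3 grid, one dimension per variable of the
# triangle query Q(x,y,z) :- R(x,y), S(y,z), T(z,x). Each tuple is
# replicated only along the dimension its relation misses, and every
# output triangle is assembled at exactly one grid cell.

P1, P2, P3 = 2, 2, 2          # 8 servers, one per grid cell

def h(value, buckets):
    return hash(value) % buckets

def destinations(relation, tup):
    """Grid cells (i, j, k) that must receive this tuple."""
    if relation == "R":        # R(x, y): x and y fix two coordinates
        x, y = tup
        return [(h(x, P1), h(y, P2), k) for k in range(P3)]
    if relation == "S":        # S(y, z)
        y, z = tup
        return [(i, h(y, P2), h(z, P3)) for i in range(P1)]
    if relation == "T":        # T(z, x)
        z, x = tup
        return [(h(x, P1), j, h(z, P3)) for j in range(P2)]

print(destinations("R", ("a", "b")))   # two cells, one per z-coordinate
```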
The Broader Impact of the project is to contribute to a new architecture for massively parallel query processing, where the traditional multi-step, single-join query evaluation approaches are replaced with novel, single-step, multi-join algorithms. This change has the potential to lead to more efficient big data analytics engines, allowing data analysts to explore large datasets more efficiently. As an immediate application, the project will impact the domain scientists who already use the Myria big data management system and service. All algorithmic discoveries in this project will be implemented in the Myria system, and will significantly improve query performance, allowing domain scientists to conduct more complex analytics and explorations over their data.
|
1 |
2016 — 2019 |
Suciu, Dan |
Iii: Small: Scalable Probabilistic Inference For Large Knowledge Bases @ University of Washington
Large Knowledge Bases are constructed today automatically from large corpora of text, such as Web pages, journal articles, and news stories. The construction proceeds in two major stages. First, several database queries are computed on the corpora of text to extract candidate data items; the resulting data, called a factor graph, can be thought of as a very large, noisy, uncertain, redundant, and inconsistent database. Second, a complex probabilistic inference is performed on the factor graph to produce a large, probabilistic knowledge base. Both stages are computationally expensive, but only the first stage has benefited so far from advances in database query processing techniques. This project develops new database processing techniques for the probabilistic inference task. These new techniques have theoretical guarantees, either in the form of absolute guarantees on the runtime of the probabilistic inference, or in the form of a trade-off between the runtime and the precision of the probabilistic inference.
The main technique pursued by the project is called lifted probabilistic inference, and consists of algorithms that compute the probability of a SQL query inductively on the structure of the query, without having to first ground the query to compute the large factor graph. Lifted inference is very efficient, but possible only for some queries. The project has four thrusts. First, it combines sampling with lifted inference for efficient approximate probabilistic inference for any query; this algorithm can be pushed into the database engine, and can therefore benefit immediately from all optimizations available today in modern, parallel query processors. Second, the project studies the complexity of query evaluation on symmetric databases, a special case of high practical importance, since it scales easily to arbitrarily large domains. In the third thrust, the project extends lifted inference techniques to queries with negations by combining probabilistic inference with resolution; this is necessary because soft constraints in knowledge bases almost always have negations. Finally, the project develops a system prototype and benchmarks.
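A tiny example of the symmetric case mentioned above: when every tuple of a relation has the same probability over a domain of size n, simple queries have closed-form probabilities that can be computed without grounding. The numbers below are hypothetical.

```python
# A tiny illustration of lifted inference on a symmetric table: when
# every tuple of R has the same probability p over a domain of size n,
# the probability of the query  Q :- R(x)  (does R contain anything?)
# has a closed form, computed without grounding the n tuples.

def prob_exists(p, n):
    """P(exists x: R(x)) over a symmetric, tuple-independent table."""
    return 1 - (1 - p) ** n

def prob_join_exists(p_r, p_s, n):
    """P(exists x: R(x) AND exists y: S(y)) for two independent symmetric tables."""
    return prob_exists(p_r, n) * prob_exists(p_s, n)

# Scales to arbitrarily large domains at constant cost:
print(prob_exists(0.001, 10_000))              # about 0.99995
print(prob_join_exists(0.001, 0.002, 10_000))
```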
|
1 |
2017 — 2020 |
Suciu, Dan |
Iii: Medium: Collaborative Research: a Unified and Declarative Approach to Causal Analysis For Big Data @ University of Washington
Observational data is available today in multi-relational form, often extracted from various sources, and stored in multiple flat and interrelated tables. Standard statistical methods for conducting causal inference on observational data assume a very simple data model: a single table with independent units. This research has the potential to significantly impact application domains where differentiating causality from correlation is essential, e.g., education policy and cancer genomics. The HUME project develops techniques for efficient causal analysis using a declarative approach, over complex views, and over large datasets that are integrated from disparate data sources. HUME uses a SQL-like language and is integrated with a relational database system.
The project develops techniques for defining arbitrarily complex units, treatments, outcomes, and covariates, by combining joins, data mapping, and aggregates across multiple tables, and uses a causal network to choose a good set of covariates for causal inference. The first part of the project develops scalable techniques for sub-classification and matching for large data sets obtained by declaratively integrating multiple data sources. The second part of the project develops scalable methods for discovering causal relationships among the attributes in the views by constraint-based, search-based, and hybrid discovery processes. Finally, the third part of the project investigates interference among units arising from the complex views, by designing normal forms and automatically inferring underlying assumptions using techniques from database theory.
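For intuition, the sketch below shows sub-classification on a single categorical covariate with hypothetical data (not the HUME system): outcomes of treated and control units are compared within each stratum, and the per-stratum differences are combined weighted by stratum size.

```python
# A compact sketch of sub-classification: units are stratified on a
# covariate, the treated-vs-control outcome difference is taken within
# each stratum, and the differences are combined weighted by stratum
# size. Data is hypothetical.

from collections import defaultdict
from statistics import mean

# (covariate stratum, treated?, outcome)
units = [
    ("young", True, 8), ("young", False, 6), ("young", False, 5),
    ("old",   True, 4), ("old",   True, 5), ("old",   False, 2),
]

def subclassification_ate(units):
    strata = defaultdict(lambda: {"t": [], "c": []})
    for stratum, treated, y in units:
        strata[stratum]["t" if treated else "c"].append(y)
    n = len(units)
    ate = 0.0
    for group in strata.values():
        if group["t"] and group["c"]:
            weight = (len(group["t"]) + len(group["c"])) / n
            ate += weight * (mean(group["t"]) - mean(group["c"]))
    return ate

print(subclassification_ate(units))
# young: 8 - 5.5 = 2.5 (weight 0.5); old: 4.5 - 2 = 2.5 (weight 0.5) -> 2.5
```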
|
1 |
2021 — 2024 |
Suciu, Dan |
Nsf-Bsf: Iii: Small: Data Driven Schema @ University of Washington
Data needs to be organized systematically and rigorously. Data about consumer products goes into one table, and data about micro-organisms goes into a different table. This makes it easier for humans and computers to store the data, to retrieve and query it, and to update it. But today one often finds large amounts of noisy, inconsistent, incomplete data, which are impossible to organize rigorously. The sheer volume of this data makes it very valuable, yet it is of limited utility without proper organization. This project develops methods for organizing noisy, inconsistent, incomplete data, and develops techniques for storing, querying, and updating such data. Its findings will inform organizations on how to organize and use large amounts of noisy data.
This project develops a technique for approximate schema discovery for noisy data, for normalizing the data according to the discovered schema, and for improving query processing. The input consists of a single, large relation, which may be noisy, inconsistent, and incomplete, and the system automatically discovers a few candidate schemas that can represent the data with minimal loss and with high utility for downstream tasks. Each schema is associated with an information-theoretic score, which represents the amount of information that may be lost when the data is represented according to that schema. Then, the project researches new approaches for querying the data stored in an approximate schema, by either recording explicitly the number of "spurious tuples" generated by the schema, or by using probabilities to quantify the degree of confidence in the query's answer. The techniques explored in this project combine information theory with graph algorithms and with query optimization.
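A miniature version of the spurious-tuple idea is sketched below with hypothetical data: a candidate decomposition of a wide table is scored by how many tuples its natural join re-introduces that were not in the original relation.

```python
# A small sketch of one idea mentioned above: score a candidate
# decomposition of a wide, noisy table by counting the spurious tuples
# its natural join re-introduces; fewer spurious tuples means the schema
# loses less information about the original relation. Data is hypothetical.

from itertools import product

# Original noisy relation over attributes (person, city, zip).
R = {("ann", "seattle", "98101"),
     ("bob", "seattle", "98105"),
     ("eve", "portland", "97201")}

def decompose_and_score(R):
    left  = {(p, c) for p, c, z in R}        # candidate table (person, city)
    right = {(c, z) for p, c, z in R}        # candidate table (city, zip)
    joined = {(p, c, z) for (p, c), (c2, z) in product(left, right) if c == c2}
    spurious = joined - R
    return len(spurious), spurious

count, extra = decompose_and_score(R)
print(count, extra)
# 2 spurious tuples: ('ann', 'seattle', '98105') and ('bob', 'seattle', '98101')
```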
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
1 |