2007 — 2011
Balazinska, Magdalena
Iii-Cor: Exploiting History in Continuous Monitoring Systems @ University of Washington
The goal of this research project, called Moirae, is to investigate the benefits and challenges of integrating history into a near-real-time monitoring system and to build a new continuous query engine that supports this integration. To achieve this goal, the project takes the following steps: (1) develop new query models for integrated queries over live and historical data; (2) develop new algorithms that effectively match new events with similar past observations; (3) develop a new continuous query engine that effectively supports the new query model (the engine includes a partitioned stream data store, a scheduler for fair and incremental historical query execution, new operators for merging historical data with data streams, and mechanisms for user-feedback); and (4) evaluate the practicality and performance of the system on data traces from real application domains.
Project funds support the training of PhD students. However, the project includes a large system development component that serves to train both graduate and undergraduate students in research and systems building. This project will produce a new type of continuous processing engine that better supports monitoring applications in domains ranging from computer system monitoring to network monitoring to sensor-based environment monitoring. Such large-scale monitoring applications are critical for enterprises that operate at large scales: these enterprises need to carefully monitor their infrastructures to effectively handle and diagnose failures and deliver high-quality services to their customers. The software and technical papers resulting from this project will be disseminated through the project website (http://data.cs.washington.edu/moirae/moirae.shtml).
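The flavor of step (2) above, matching new events with similar past observations, can be sketched as a nearest-neighbor lookup over stored history. This is an illustrative toy with made-up event features, not Moirae's actual matching algorithm:

```python
# Illustrative only: match a new monitoring event against similar
# past observations stored in history (not Moirae's actual algorithm).

HISTORY = [  # past events: (cpu_load, latency_ms, diagnosis)
    (0.95, 900, "overload"),
    (0.20, 40,  "normal"),
    (0.90, 850, "overload"),
]

def most_similar(event, history):
    """Return the nearest past observation under a simple Euclidean distance."""
    return min(history, key=lambda h: (h[0] - event[0])**2 + (h[1] - event[1])**2)

new_event = (0.92, 870)          # a fresh observation from the live stream
print(most_similar(new_event, HISTORY)[2])  # "overload"
```

In the real system, the historical store is partitioned and queried incrementally rather than scanned in full, but the matching intuition is the same.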
2009 — 2013
Gatterbauer, Wolfgang; Suciu, Dan; Balazinska, Magdalena
Iii: Small: Beliefdb - Adding Belief Annotations to Databases @ University of Washington
In many scientific disciplines today, a community of users is working together to assemble, revise, and curate a shared data repository. As the community accumulates knowledge and the database content evolves over time, it may contain conflicting information and members can disagree on the information it should store. Relational database management systems (RDBMS) today can help these communities manage their shared data, but provide limited support for managing conflicting facts and conflicting opinions about the correctness of the stored data.
This project develops a Belief Database System (BeliefDB) that allows users to express belief annotations. These annotations can be positive (agreement) or negative (disagreement), and can be of higher order (belief annotations about other belief annotations). The approach allows users to have a structured discussion about the database content and annotations. A BeliefDB gives annotations a clearly defined semantics that lets a relational database understand and manage them efficiently.
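One way such annotations could be laid out on top of a conventional RDBMS is sketched below. The schema and data are hypothetical illustrations, not the actual BeliefDB design:

```python
import sqlite3

# Hypothetical sketch: belief annotations as a self-referencing table,
# so higher-order beliefs (beliefs about beliefs) are ordinary rows.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Base data: a curated fact table.
c.execute("CREATE TABLE gene (id INTEGER PRIMARY KEY, name TEXT, function TEXT)")
c.execute("INSERT INTO gene VALUES (1, 'abcA', 'transport')")

# Belief annotations: each row records a user's positive or negative belief
# about either a base tuple or a previous annotation.
c.execute("""CREATE TABLE belief (
    bid INTEGER PRIMARY KEY,
    user TEXT,
    positive INTEGER,          -- 1 = agreement, 0 = disagreement
    target_tuple INTEGER,      -- references gene.id, or NULL
    target_belief INTEGER      -- references belief.bid for higher-order beliefs
)""")
c.execute("INSERT INTO belief VALUES (1, 'alice', 0, 1, NULL)")  # Alice doubts tuple 1
c.execute("INSERT INTO belief VALUES (2, 'bob',   0, NULL, 1)")  # Bob disagrees with Alice

# Query: who disagrees with Alice's annotations?
rows = c.execute("""SELECT b2.user FROM belief b1
                    JOIN belief b2 ON b2.target_belief = b1.bid
                    WHERE b1.user = 'alice' AND b2.positive = 0""").fetchall()
print(rows)  # [('bob',)]
```

The point of the project's query language is precisely to hide this kind of manual self-join behind belief-aware SQL extensions.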
Intellectual merits: (i) Definition of a Belief Database Model: The project develops a formalism that extends a relational database with belief annotations on data and on previously inserted annotations.
(ii) Design of a Belief Query Language: The project complements the data model with a new query language that extends SQL. (iii) Development of a canonical Belief Database Representation: The project develops approaches to store and manipulate belief databases on top of a conventional RDBMS.
Broader impact: Curated databases and shared data repositories are becoming widespread in the scientific communities. A BeliefDB provides a new data management system that addresses the need of these communities to manage conflicting data. If successful, the project will be one of the pieces that will help data management technology undergo a new paradigm shift, from managing data as content, to supporting a community of users in collaboratively creating partly conflicting database contents.
For further information on the project see the project web page: http://db.cs.washington.edu/beliefDB/
2009 — 2014
Balazinska, Magdalena
Career: Interactive and Collaborative Data Management in the Cloud @ University of Washington
This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
The scientific data management landscape is changing. Improvements in instrumentation and simulation software are giving scientists access to data at an unprecedented scale. This data is increasingly being stored in data centers running thousands of commodity servers. This new environment creates significant data management challenges. In addition to efficient query processing, the magnitude of data and queries calls for new query management techniques such as runtime query control, intra-query fault tolerance, query composition support, and seamless query sharing.
This project is developing a series of techniques to support the above query management tasks. To achieve this goal, the project includes the design and implementation of a prototype massively parallel database management system that serves as the platform for the development of various query management schemes. The new algorithms are evaluated on both synthetic and real data from the scientific domain.
The expected results of this project include a variety of runtime query management algorithms including parallel query progress indicators, distributed intra-query fault-tolerance, and the ability to suspend and resume queries as needed. The expected results also include tools for searching previously executed queries, annotating them, and sharing them with others. Together, these tools hold the promise to significantly improve data analysis at massive scale, making it an interactive and collaborative process.
Through the above contributions, this project will have significant impact on the scientific community, currently limited by their ability to analyze data. The software and technical papers resulting from this project will be disseminated through the project website (http://nuage.cs.washington.edu/).
2010 — 2015
Patel, Shwetak (co-PI); Fogarty, James (co-PI); Balazinska, Magdalena; Demiris, George (co-PI)
Cdi - Type Ii: Transforming Community-Based Elder Care Through Heterogeneous Activity Sensing Analytics @ University of Washington
Health care systems face critical challenges as people live longer with multiple chronic conditions that increase their need for health care services. Addressing the needs of an aging population requires improved methods for community-based preventive care through ongoing monitoring, early detection of adverse events and patterns, and early intervention.
This project addresses this challenge by developing a transformative pipeline of flexible tools for at-home activity monitoring and analysis. The key goal is to enable at-home deployments of flexible sensor platforms and tools for agile analysis of this sensor data by users with varying skill sets and information needs. To achieve this goal, the project focuses on three specific challenges. First, this project is developing new platforms for deploying different combinations of at-home sensors according to varying elder needs. Second, this project is creating new tools for converting the raw sensor data into knowledge of elder activities and their impact on elder care. This requires rapid organization, analysis, and interpretation of heterogeneous and noisy data across multiple scales of analysis. Finally, the project is developing new techniques to support the varying and unpredictable information needs of all actors involved in elder care: the elders themselves, informal caregivers (e.g., family members), formal caregivers (e.g., community nurses), and health researchers (e.g., evaluating the effectiveness of different prescribed activity regimens).
The expected results of this project include (a) new platforms that integrate a variety of heterogeneous home sensors, (b) new streaming data models and query processing techniques addressing the noisy, low-level, and bursty nature of home activity sensing data, (c) new techniques for effective end-user interaction with this data, (d) a system that integrates all three components into a single data acquisition and analysis pipeline, and (e) deep knowledge of how this transformative pipeline can benefit community elder care.
Through the above contributions, this project will have broad impact in both computer science and elder care in the United States. The goal is to provide higher quality care at lower cost that keeps elders in their homes longer.
2011 — 2014
Suciu, Dan (co-PI); Howe, Bill; Balazinska, Magdalena
Cic Rddc: Relational Data Markets in the Cloud @ University of Washington
Scientists are generating data at an unprecedented scale and rate. There is tremendous value in not only analyzing this data but also sharing it among scientists. Cloud-computing platforms are well-suited to support such sharing: They offer a single logical location for data, access to data management tools for analyzing it, and a pay-as-you-go charging mechanism. However, while cloud-computing systems offer simple pricing schemes for storage and compute resources, the economics of data sharing are poorly understood and only coarsely supported.
This research is developing models and infrastructures to establish relational data markets in the cloud. These markets enable scientists to upload their data and make it publicly available in the cloud, then recoup costs by charging others for using the data. The data markets also enable scientists to share their data with direct collaborators and see the cloud costs fairly distributed among team members.
The project is building a prototype system on the Windows Azure cloud to implement these data markets.
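As a toy illustration of the cost-distribution idea, one simple hypothetical policy splits a shared cloud bill in proportion to each collaborator's usage. The project's actual pricing models are more sophisticated; this only shows the flavor:

```python
# Illustrative only: distribute a shared cloud bill among collaborators
# in proportion to usage (not the project's actual pricing mechanism).

def split_costs(monthly_bill, bytes_scanned_by_user):
    """Return each user's share of the bill, proportional to bytes scanned."""
    total = sum(bytes_scanned_by_user.values())
    return {u: monthly_bill * b / total for u, b in bytes_scanned_by_user.items()}

shares = split_costs(90.0, {'alice': 6_000, 'bob': 3_000, 'carol': 1_000})
print(shares['alice'])  # 54.0
```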
This project is having impact on both cloud computing and science by introducing new pricing techniques for data sharing in the cloud. The software and technical papers resulting from this project are being disseminated through the project website (http://cloud-data-pricing.cs.washington.edu/).
2011 — 2016
Balazinska, Magdalena
Iii: Large: Collaborative Research: Scidb - An Array Oriented Data Management System For Massive Scale Scientific Data @ University of Washington
This collaborative project brings together expertise of five research teams at Brown University (IIS-1111423), University of Washington (IIS-1110370), Massachusetts Institute of Technology (IIS-1111371), Portland State University (IIS-1110917) and University of Wisconsin-Madison (IIS-1111423). Scientific data management has traditionally been performed using the file system, at best using files structured according to a low-level data format. Higher-level data management infrastructure has been task-specific and not reusable in different domains, resulting in millions of dollars of duplicated implementation effort by scientists to manage their data. The goal of this project is the development of a scientific database (SciDB), a system designed and optimized for scientific applications. The aim of SciDB is to do for science what relational databases did for the business world, namely to provide a high performance, commercial-quality and scalable data management system appropriate for many science domains.
In contrast to existing database systems, SciDB is based on a multidimensional array data model and includes multiple features specific to science and critical for science: provenance, uncertainty, versions, time travel, science-specific operations, and in situ data processing. No existing system offers all these features in a single, highly scalable engine. SciDB thus significantly advances the state-of-the-art in data management in addition to supporting domain scientists in data-driven knowledge discovery. The intellectual merit of SciDB is in exploring novel, high performance solutions to nested array storage, parallel array query optimization and execution, array language design, and time travel.
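The array data model can be illustrated with a small NumPy sketch: cells are addressed by dimension indices, so slicing and window aggregates are natural operations. This is illustrative only; SciDB's own query language and chunked storage differ:

```python
import numpy as np

# Illustrative only: a dense 2-D array mimicking the array data model,
# where a cell is addressed by its dimension indices, not by key lookup.
sky = np.arange(16, dtype=float).reshape(4, 4)  # e.g., a 4x4 grid of pixel intensities

# "Slice": restrict to a subarray along the dimensions. In an array store this
# is cheap because position already encodes the dimension values.
patch = sky[1:3, 1:3]

# "Window aggregate": a 2x2 sliding mean, a typical array-database operation.
window_mean = np.array([[sky[i:i+2, j:j+2].mean() for j in range(3)]
                        for i in range(3)])
print(window_mean[0, 0])  # mean of sky[0:2, 0:2] = (0+1+4+5)/4 = 2.5
```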
The primary broader impact of SciDB is on the community of scientists who benefit from the tool. By keeping scientists "in the loop" in the design of the system from the outset, the project delivers software that is broadly usable to the community. The proposal also funds participation in a series of workshops that seek to engage even more of the science community. SciDB is an open-source effort, with an initial prototype (http://www.scidb.org/) already downloaded by hundreds of users. Finally, the PIs have a strong track record of delivering robust data management software that is widely used and involving students in the process, including students from under-represented groups. Further information can be found on the project web page (http://database.cs.brown.edu/projects/scidb).
2013 — 2017
Suciu, Dan; Howe, Bill; Balazinska, Magdalena
Bigdata: Mid-Scale: Dcm: a Formal Foundation For Big Data Management @ University of Washington
The ability to analyze massive-scale datasets has become an important tool both in industry and in the sciences, and many systems have recently emerged to support it. However, effective methods for deep data analytics are currently high-touch processes: they require a highly specialized expert who thoroughly understands the application domain and the pertinent, disparate data sources, and who must repeatedly perform a series of data exploration, manipulation, and transformation steps to prepare the data for querying, machine learning, or data mining algorithms. This project explores the foundations of big data management with the ultimate goal of significantly improving productivity in big data analytics by accelerating the bottleneck step of data exploration. The project integrates two thrusts: a theoretical study, which leads to new fundamental results regarding the complexity of various new (ad hoc) data transformations in modern massive-scale systems, and a systems study, which leads to a multi-platform software middleware for expressing and optimizing ad hoc data analytics techniques. The middleware is designed to augment and integrate existing analytics solutions in order to facilitate and improve methods of interest to the community while remaining compatible with many existing platforms.
The results of this project will make it easier for domain experts to conduct complex data analysis on big data and on large computer clusters. All research results will be released in a middleware package layered on top of existing big-data systems. The middleware includes all the new algorithms, optimization techniques, fault-tolerance and skew mitigation mechanisms, and generalized aggregates developed during the project. In addition, the project develops and deploys a Web-based query-as-a-service interface to the new middleware. The project Web site (http://myriadb.cs.washington.edu) provides access to the software, additional results and information. Project results will be included in educational and outreach activities in big data analytics, including new curricula at the undergraduate, graduate, and professional levels.
2013 — 2018
Beck, David (co-PI); Connolly, Andrew; Armbrust, E. Virginia; Guestrin, Carlos; Balazinska, Magdalena
Igert-Cif21: Big Data U: a Program For Integrated Multidisciplinary Education and Research For Big Data Science @ University of Washington
This Integrative Graduate Education and Research Traineeship (IGERT) award provides Ph.D. students at the University of Washington with multidisciplinary training in computer science, statistics, and domain sciences (oceanography, astronomy, chemistry, and genome sciences). Through this blended approach, trainees will learn how to manage, analyze, and visualize increasingly large amounts of data (known as "Big Data"), thereby being prepared to address the challenges of cyberinfrastructure in the 21st century.
Intellectual Merit: By developing a new Ph.D. program that involves partnerships with 11 leading companies and national labs in the field of Big Data, this program provides trainees with a collaborative approach to processing, scaling, and modeling massive and complex data sets for the scientific community. Trainees learn to create new statistical and machine learning techniques needed to manage large data sets. Additionally, this program builds an open-source system for scientists worldwide to access and analyze Big Data through a Cloud service.
Broader Impacts: This IGERT traineeship program aims to create a new Big Data curriculum that will be delivered online and through University of Washington outreach initiatives. The program also prepares a new generation of scientists with the interdisciplinary tools to approach problems that will arise in the field of cyberinfrastructure. Moreover, this program promotes the development of a diverse STEM workforce by recruiting and training underrepresented groups, women, and students with disabilities, particularly through a partnership with the AccessComputing Alliance and the University of Washington DO-IT program.
IGERT is an NSF-wide program intended to meet the challenges of educating U.S. Ph.D. scientists and engineers with the interdisciplinary background, deep knowledge in a chosen discipline, and the technical, professional, and personal skills needed for the career demands of the future. The program is intended to establish new models for graduate education and training in a fertile environment for collaborative research that transcends traditional disciplinary boundaries, and to engage students in understanding the processes by which research is translated to innovations for societal benefit.
2015 — 2019
Suciu, Dan; Balazinska, Magdalena
Aitf: Full: Query Processing With Optimal Communication Cost @ University of Washington
Big Data analytics is changing traditional query processing in two ways. The first is a shift from single server or small-scale parallel relational databases to massively distributed architectures, where hundreds or thousands of servers are used during the computation of a single query. The second is an increased complexity in the queries being issued, from single- or star-joins, to complex graph-like structured queries. This project develops new algorithms for query processing over large distributed systems, which are optimized for the cost of communication, then implements and evaluates these algorithms in an open-source big data management system and service.
The project studies a new approach to query evaluation that computes the entire query at once, replacing the traditional approach based on a query plan. The theoretical part of this project builds on a new model, called the Massively Parallel Communication model (MPC), where the communication is the only cost. The system development is performed over the Myria big data management system and service.
The Intellectual Merit of the project consists in advancing the state of the art in both the theory and systems approaches to query evaluation in modern, massive-scale shared-nothing clusters. It develops new, fundamental algorithms for processing queries over massively distributed architectures, with a provably optimal communication cost. The project implements and deploys these algorithms in a system, validating and informing the theoretical model. In particular, the project makes the following contributions: it develops provably optimal, one-round algorithms for skewed data; it studies how and when multiple rounds can be used to further reduce the communication cost; it experiments with these novel algorithms on clusters with up to 1000 worker processes; and it develops a new theoretical model for the communication cost on large shared-nothing architectures with heterogeneous hardware.
The Broader Impact of the project is to contribute to a new architecture for massively parallel query processing, where the traditional multi-step, single-join query evaluation approaches are replaced with novel, single-step, multi-join algorithms. This change has the potential to lead to more efficient big data analytics engines, allowing data analysts to explore large datasets more efficiently. As an immediate application, the project will impact the domain scientists who already use the Myria big data management system and service. All algorithmic discoveries in this project will be implemented in the Myria system, and will significantly improve query performance, allowing domain scientists to conduct more complex analytics and explorations over their data.
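A flavor of one-round, multi-join evaluation in the MPC setting is the HyperCube (also called "Shares") shuffle, sketched below for a triangle query. This is an illustrative toy, not the Myria implementation:

```python
# Sketch of the one-round "HyperCube" shuffle for a triangle query
# R(a,b) JOIN S(b,c) JOIN T(c,a) on p = p1*p2*p3 servers arranged as a grid.
# Each tuple is replicated to every server compatible with its known
# attributes, so a single communication round suffices. Illustrative only.

def hypercube_destinations(rel, tup, shape, h):
    """Return the grid coordinates of all servers that must receive `tup`."""
    p1, p2, p3 = shape
    if rel == 'R':   # knows (a, b): fix the a- and b-coordinates, replicate over c
        a, b = tup
        return [(h(a) % p1, h(b) % p2, z) for z in range(p3)]
    if rel == 'S':   # knows (b, c): replicate over the a-coordinate
        b, c = tup
        return [(x, h(b) % p2, h(c) % p3) for x in range(p1)]
    if rel == 'T':   # knows (c, a): replicate over the b-coordinate
        c, a = tup
        return [(h(a) % p1, y, h(c) % p3) for y in range(p2)]

# Every matching triple (a, b, c) meets on exactly one server:
shape, h = (2, 2, 2), hash
dests = set(hypercube_destinations('R', (1, 2), shape, h)) \
      & set(hypercube_destinations('S', (2, 3), shape, h)) \
      & set(hypercube_destinations('T', (3, 1), shape, h))
print(len(dests))  # 1: the single server where R(1,2), S(2,3), T(3,1) join
```

Choosing the grid dimensions (the "shares" p1, p2, p3) to minimize replication, and handling skewed values, is exactly where the project's optimal-communication analysis comes in.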
2015 — 2018
Balazinska, Magdalena
Iii: Small: Data Analysis in the Cloud With Guaranteed and Explainable Performance @ University of Washington
Increasingly, many users have access to large datasets that they need to analyze. Astronomers, oceanographers, and other domain scientists rely on data analysis for their science. Journalists may want to analyze data to use in their articles. Over the past several years, cloud service providers have been offering an increasingly large selection of data management services for data analytics (e.g., Amazon Elastic MapReduce or Google BigQuery). Cloud services provide seamless access to powerful data analysis tools, often directly through the browser. Many of these services, however, remain too close to the traditional mode of operating a database management system. They reveal too much information about their internal architecture and deployment: users are required to reason at the level of service instances, instance types, and gigabytes processed. As a result, users today must be data management experts to choose between these services and leverage them in a cost-effective manner. This project will develop new data management techniques that will enable cloud service providers to isolate users from the details of their service internals while offering the ability to trade off price and performance. The project will further develop tools to explain performance and help users re-write their queries to improve it.
More specifically, the project will develop new approaches to (1) predict not only the query runtime but whether a query is likely to execute slower than estimated due to failures, skew, cardinality estimation errors, or contention; (2) guarantee query runtimes by dynamically changing both the resources allocated to a query and its failure-handling and skew-handling mechanisms as needed; (3) post specific slowdown factors in case of heavy load and guarantee them through novel scheduling algorithms; and (4) explain query performance and suggest rewrites in a way that does not require users to understand query plans. The project will implement all of the algorithms in the open source Myria cloud data management system (and service) recently developed and in continuous operation at the University of Washington.
For further information see the project web site at: http://cloudperf.cs.washington.edu
2015 — 2017
Beck, David; Connolly, Andrew; Armbrust, E. Virginia; Balazinska, Magdalena
Workforce Development: Graduate Data Science Workshop & Community Building @ University of Washington
The University of Washington will host a national workshop where graduate students in data science disciplines will interact to explore data science grand challenges in a collaborative environment. The project will implement a novel idea and advance the understanding of how to develop data science communities by engaging graduate students, academia, and industry. It also addresses an important national need for researchers with cross-disciplinary training in data science and serves as a catalyst for other institutions to integrate data science skills more broadly throughout the curriculum.
The goal of the project is to help create a highly connected workforce that is adept at cross-disciplinary communication and idea synthesis, primed to solve a new generation of data science and big data challenges. To accomplish this goal, the investigators will conduct a 2.5-day workshop for approximately 100 attendees that is designed to (1) introduce graduate students to new data science concepts and applications, (2) enable students to interact with experts from industry, domain, and methodology fields, and (3) initiate the establishment of a professional community using team building activities. The investigators will conduct a pre-workshop promotional campaign and develop a web site where users can connect and share their ideas. Following the workshop, a set of white papers, team reports, and a workshop report will be available to the public.
2016 — 2017
Quinn, Thomas; Balazinska, Magdalena
Si2-Ssi: Collaborative Research: Paratreet: Parallel Software For Spatial Trees in Simulation and Analysis @ University of Washington
Many scientific and visualization methods involve organizing the data they are processing into a hierarchy (also known as a "tree"). These applications and methods include: astronomical simulations of particles moving under the influence of gravity, analysis of spatial data (that is, data that describes objects with respect to their relative position in space), photorealistic rendering of virtual environments, reconstruction of surfaces from laser scans, collision detection when simulating the movement of physical objects, and many others. Tree data structures, and the algorithms used to work on these structures, are heavily used in these applications because they help to make these applications run much faster on supercomputers. However, implementing tree-based algorithms can require a significant effort, particularly on modern highly parallel computers. This project will create ParaTreet, a software toolkit for parallel trees that will enable rapid development of such applications. Details of the parallel aspects will be hidden from the programmer, who will be able to quickly evaluate the relative merits of different trees and algorithms even when applied to large datasets and very computation-intensive applications. The combination of such an abstract and extensible framework with a portable adaptive runtime system will allow scientists to effectively use parallel hardware ranging from small clusters to petascale-class machines, for a wide variety of tree-based applications. This project will demonstrate the feasibility of such an approach as well as generate evidence of community adoption of this technology. If successful, this project will enable NSF-supported researchers to solve science problems faster as well as to tackle more complex problems, thus serving NSF's science mission.
This project builds upon an existing collaboration on Computational Astronomy and the resultant software base in the ChaNGa (Charm N-body GrAvity solver) code. ChaNGa is a software package that performs collisionless N-body simulations, and can perform cosmological simulations with periodic boundary conditions in co-moving coordinates or simulations of isolated stellar systems. This project will extend ChaNGa with a parallel tree toolkit called ParaTreet and associated applications, that will allow scientists to effectively utilize small clusters as well as very large supercomputers for parallel tree-based calculations. The key data structure in ParaTreet is an asynchronous software-based tree data cache, which maintains a writeback local copy of remote tree data. We plan to support a variety of spatial decomposition methods and the associated trees, including Oct-trees, KD-trees, inside-outside trees, ball trees, R-trees, and their combinations. Different trees are useful in different application circumstances, and the software will allow their relative merits to be evaluated with relative ease. The framework will support a variety of parallel work decomposition methods, including those based on space filling curves, and support dynamic rearrangement of parallel work at runtime. The algorithms supported will range from Barnes-Hut with various multipole expansions, data clustering, collision detection, surface reconstruction, ray intersection, etc. The software includes a collection of dynamic load balancing strategies in the Charm++ framework that can be tuned for specific problem structures. It also includes support for clusters of accelerators, such as GPGPUs. This project will demonstrate the feasibility of such an approach as well as generate evidence of community adoption of this technology.
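As a minimal illustration of the kind of spatial tree ParaTreet targets, here is a toy recursive KD-tree build over 2-D points. It is sequential and simplified; ParaTreet's parallel, cache-backed implementation differs substantially:

```python
# Illustrative only: a sequential KD-tree build over 2-D points,
# one of the spatial tree families listed above (not ParaTreet code).

def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % 2                      # alternate splitting dimension: x, y, x, ...
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'point': points[mid],             # splitting point stored at this node
        'left':  build_kdtree(points[:mid], depth + 1),
        'right': build_kdtree(points[mid + 1:], depth + 1),
    }

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree['point'])  # (7, 2): the median x-coordinate splits the root
```

The parallel challenge the toolkit addresses is that each recursive call touches data that may live on a remote node, which is what its software-based tree cache is for.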
2017 — 2021
Ceze, Luis; Balazinska, Magdalena
Shf: Medium: a Visual Cloud For Virtual Reality Applications @ University of Washington
The ability to collect and serve immersive video can revolutionize how people interact with the world by enabling powerful virtual reality (VR) video applications in education, tourism, tele-presence, and other domains. These new applications involve processing and serving 360-degree stereoscopic videos, which requires a dramatic improvement in technology to manage and process the massive-scale visual data necessary for truly immersive experiences. Systems that support VR video are also an excellent educational tool: students can experience scenes in a truly immersive way, which conveys content far more effectively. This project builds a visual cloud that provides seamless access to a new database management system with hardware acceleration and edge computation, enabling efficient, real-time management of massive-scale image data and the VR applications built on top of it. The project develops a new hardware and software stack for VR data processing with execution in public clouds. The stack includes a new storage manager that significantly increases data ingest and retrieval throughput for multidimensional array data compared with existing systems, motivated by the extreme needs of VR applications. The storage manager utilizes novel hardware technologies (non-volatile memory) and provides novel approximate and multi-resolution data storage capabilities. The project also develops a new runtime system for high-throughput, large-scale array processing, with a new API for expressing VR pipelines as a graph of user-defined functions, a library of specialized implementations of known VR algorithms for different types of hardware (CPU, GPU, FPGA, and 3D XPoint), and associated optimizers and schedulers. Finally, the project develops new techniques to enable real-time VR applications. These include an FPGA-based acceleration platform for real-time VR video processing, as well as algorithms and software components for prefetching and caching VR data close to viewers and processing that data in the viewing device. Project website: http://visualcloud.cs.washington.edu
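The runtime described above expresses VR pipelines as a graph of user-defined functions. As a rough illustration of that idea (this is not the project's actual API; the node class, the toy operators, and their names are all hypothetical), a pipeline graph with memoized evaluation might look like:

```python
# Hypothetical sketch: a VR processing pipeline expressed as a graph of
# user-defined functions (UDFs), loosely inspired by the stack described
# above. Names and structure are illustrative, not the project's real API.

class PipelineNode:
    """A pipeline stage: a user-defined function plus its upstream inputs."""
    def __init__(self, name, fn, inputs=()):
        self.name, self.fn, self.inputs = name, fn, list(inputs)

    def run(self, cache=None):
        """Evaluate this node, memoizing results so shared stages run once."""
        cache = {} if cache is None else cache
        if self.name not in cache:
            args = [n.run(cache) for n in self.inputs]
            cache[self.name] = self.fn(*args)
        return cache[self.name]

# Toy UDFs standing in for real VR operators (decode, stitch, reproject).
decode  = PipelineNode("decode",  lambda: [1, 2, 3, 4])          # frames
stitch  = PipelineNode("stitch",  lambda fs: sum(fs), [decode])  # combine
project = PipelineNode("project", lambda s: s * 2, [stitch])     # reproject

print(project.run())  # -> 20 (decode -> stitch -> project)
```

A scheduler or optimizer could then reorder, fuse, or place such nodes on different hardware, since each stage is an opaque function with explicit data dependencies.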
|
2017 — 2020 |
Connolly, Andrew Balazinska, Magdalena Juric, Mario (co-PI) [⬀] Cheung, Alvin Rokem, Ariel (co-PI) [⬀] |
Si2-Sse: An Ecosystem of Reusable Image Analytics Pipelines @ University of Washington
Astronomy has entered an era of massive data streams generated by telescopes and surveys that can scan tens of thousands of square degrees of the sky across many decades of the electromagnetic spectrum. The promise of these new experiments - characterizing the nature of dark energy and the composition of dark matter, discovering the most energetic events in the universe, tracking asteroids whose orbits may intersect with that of the Earth - will only be realized if we can address the challenge of how to process and analyze the tens of petabytes of images that these astronomical surveys will generate per year. With the increasing capacity for scientists to collect ever larger sets of data, often in the form of images, our potential for scientific discovery will soon be limited not by how we collect or store data, but rather by how we extract the knowledge that these data contain (e.g. how we account for noise inherent in the data, and recognize when we have detected fundamentally new classes of interesting events or physical phenomena). This project develops an open-source, scalable framework for the analysis of large imaging data sets. It is designed to operate as a cloud service, incorporate new or legacy image processing algorithms seamlessly, support and optimize complex analysis workflows, and scale analyses to thousands of processors without requiring individual users to develop custom solutions for specific computer platforms or architectures. This framework will be integrated with state-of-the-art image analysis algorithms developed for astronomical surveys to provide an image analytics platform that can be used by future telescopes and cameras and by the astronomical community as a whole. Beyond astronomy, the framework will be extended to enable scientists from the physical and life sciences who work with imaging data (e.g. neuroscience, oceanography, biology, seismology) to focus on developing scientific algorithms and analyses rather than the infrastructure required to process massive data sets.
Over the last decade, there have been many advancements in astronomical image analysis algorithms and techniques, driven by new surveys and experiments. The complexity of these techniques and of the systems that run them has, however, meant that the number of users who make use of these advancements is small, typically restricted to the experiments themselves or to a small group of expert users. As a result, the community as a whole does not benefit from the significant investment in image analytics for astronomy. In this project, the PIs address these issues by developing and deploying a scalable framework for the analysis of both small and large imaging datasets. This cloud-based system will be able to incorporate new and legacy image processing algorithms, support and optimize complex analysis workflows, scale applications to thousands of processors without users needing to develop custom code for specific platforms, and support efficient sharing of algorithms and analysis results among users. It will enable state-of-the-art image analysis algorithms (e.g. those developed for surveys such as the Large Synoptic Survey Telescope, LSST) to be used by the broad astronomical community and, in so doing, will leverage the tens of thousands of hours that have been invested in the development of these techniques. To accomplish this, the team will extract key data analysis functions from the LSST data analysis pipeline into a standalone library, independent of the LSST software stack and data access mechanisms. They will integrate this library with the Myria big data management system. Myria is an elastically scalable big data management system, developed at the University of Washington, that operates as a service in the Amazon cloud.
Compared with other big data systems, Myria is especially attractive because it integrates PostgreSQL database instances within its storage layer and thus provides access to PostgreSQL's rich libraries of spatial functions, which are frequently used in astronomical data analysis pipelines. At the same time, it has rich support for new and legacy Python code and for complex analytics. By integrating the library of LSST image analytics functions with Myria, new image analytics pipelines will become significantly easier to write. The skeleton of an analysis pipeline will be expressed in the MyriaL declarative query language (i.e. SQL extended with constructs such as iteration). The core data processing functions will map directly to Python functions, enabling the reuse of legacy code and the easy addition of new functions. The resulting code will be amenable to optimization and efficient execution using the Myria service. In this way, the PIs intend to reduce barriers to adoption: users will be able to express their analyses in Python without worrying about how data and computation will be distributed in a cluster. The image analysis framework developed as part of this proposal will be made publicly available as open-source software. The PIs will use neuroscience as a demonstration use case to show how their system, developed for astronomy, can be deployed across multiple domains.
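The division of labor described above - a declarative skeleton driving the pipeline while core per-record processing maps to plain Python user-defined functions - can be illustrated with a toy sketch. This is not MyriaL syntax, and the tiny "engine" and all operator names below are hypothetical:

```python
# Hypothetical sketch of a declarative pipeline skeleton whose core data
# processing steps map to registered Python UDFs, in the spirit of the
# MyriaL + Python design described above. Illustrative only.

UDFS = {}

def udf(fn):
    """Register a Python function so the declarative layer can call it."""
    UDFS[fn.__name__] = fn
    return fn

@udf
def background_subtract(pixel):   # stand-in for a real image operator
    return max(pixel - 10, 0)

@udf
def detect_source(pixel):         # stand-in for source detection
    return pixel > 50

def run_plan(plan, records):
    """Execute a list of (op, udf_name) steps over the input records."""
    for op, name in plan:
        fn = UDFS[name]
        if op == "APPLY":
            records = [fn(r) for r in records]
        elif op == "FILTER":
            records = [r for r in records if fn(r)]
    return records

# "Skeleton" of the analysis: subtract background, keep likely sources.
plan = [("APPLY", "background_subtract"), ("FILTER", "detect_source")]
print(run_plan(plan, [5, 30, 70, 120]))  # -> [60, 110]
```

The point of the design is that the skeleton (the plan) is data the engine can optimize and parallelize, while the UDFs stay ordinary Python, so legacy code plugs in unchanged.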
This project is supported by the Office of Advanced Cyberinfrastructure in the Directorate for Computer & Information Science and Engineering, the Astronomical Sciences Division and Office of Multidisciplinary Activities in the Directorate of Mathematical and Physical Sciences.
|
2019 — 2021 |
Balazinska, Magdalena Pfaendtner, W. James Beck, David (co-PI) [⬀] Rokem, Ariel (co-PI) [⬀] |
Hdr: I-Dirse-Fw: Accelerating the Engineering Design and Manufacturing Life-Cycle With Data Science @ University of Washington
The manufacturing life cycle begins with the discovery of new molecules and materials. This first step is often initiated through computer simulations that explore the space of possible molecules and materials, and identify promising candidates that can later be tested in laboratories. As simulations have grown in scale and complexity, this step has become a critical bottleneck. New data-driven approaches present the opportunity to increase the speed and accuracy of such predictions, with broad potential impact on the US Manufacturing sector. This Harnessing the Data Revolution Institutes for Data-Intensive Research in Science and Engineering (HDR-I-DIRSE) Frameworks award brings together Engineers and Data Scientists to conceptualize a new Engineering Data Science Institute where these tools can be applied for new discovery. The effort will develop new data science approaches to accelerate the engineering life cycle: design, characterization, manufacturing, and operation. This life cycle starts with the discovery of new molecules and materials, followed by advanced characterization with high throughput methods augmented by machine learning. Then, efficient manufacturing and operation of systems that use these materials can be designed and developed. By focusing on this holistic lifecycle, the researchers will build a broadly applicable foundation in Engineering Data Science methods. The new Institute will seek to create an Engineering Data Science environment that supports engineers and scientists (students, postdoctoral researchers, and faculty) through a synergistic set of collaboration and education activities.
This collaborative effort follows three thrusts. The first focuses on the reduction of the experimental design space with data science tools targeting the discovery of new molecules and polymers. The research develops a new, formal framework for pairing accurate predictive simulations with data-driven models to create a scalable and transferable workflow that can be deployed across multiple examples of molecular engineering applications. The second thrust addresses a manifold of cross-cutting needs at the intersection of image data analytics and characterization of materials and systems. It also builds community cyberinfrastructure through open-source software resources with support for execution in public clouds. The final thrust focuses on improving manufacturing, optimization, and control. It further enhances cyberinfrastructure resources through a suite of open-source software solutions to systematically develop digital twin models for complex engineering and manufacturing systems, and apply them for optimization and control. This project is part of the National Science Foundation's Harnessing the Data Revolution (HDR) Big Idea activity and is co-funded by the Office of Advanced Cyberinfrastructure.
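The first thrust pairs accurate but expensive predictive simulations with cheaper data-driven models to shrink the experimental design space. A minimal sketch of that pattern (the "simulation" here is a stand-in function, the nearest-neighbor surrogate is one simple choice among many, and all names are illustrative):

```python
# Minimal sketch of pairing an expensive simulation with a cheap data-driven
# surrogate to reduce a design space, as in the first thrust. The
# "simulation" is a stand-in function; everything here is illustrative.
import random

def expensive_simulation(x):
    """Stand-in for a costly molecular/materials simulation (lower is better)."""
    return (x - 3.0) ** 2 + 1.0

# 1) Spend simulation budget on a small seed set of candidates.
seed_xs = [0.0, 2.0, 4.0, 6.0]
seed_ys = [expensive_simulation(x) for x in seed_xs]

def surrogate(x):
    """Cheap nearest-neighbor surrogate trained on the seed simulations."""
    i = min(range(len(seed_xs)), key=lambda j: abs(seed_xs[j] - x))
    return seed_ys[i]

# 2) Screen a large candidate pool cheaply with the surrogate...
random.seed(0)
pool = [random.uniform(0.0, 6.0) for _ in range(1000)]
shortlist = sorted(pool, key=surrogate)[:10]

# 3) ...and run the expensive simulation only on the shortlist.
best = min(shortlist, key=expensive_simulation)
print(round(best, 2))  # a candidate near the true optimum x = 3.0
```

Real workflows would refit the surrogate as new simulations arrive (active learning), but the budget-saving structure is the same: cheap model for breadth, expensive simulation for depth.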
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
2019 — 2024 |
Hossain, Faisal (co-PI) [⬀] Balazinska, Magdalena Butman, David Holtgrieve, Gordon [⬀] Wood, Chelsea |
Nrt: Future Rivers: Training a Scientifically Innovative, Communication Savvy Stem Workforce For Sustaining Food-Energy-Water Services in Large and Transboundary River Ecosystems @ University of Washington
Large freshwater ecosystems are lifelines for a majority of the world's population, providing ecosystem goods and services critical to economies and livelihoods. Despite the important societal and economic benefits of these freshwater systems, the ability to predict impacts from ecosystem change and to evaluate tradeoffs is limited. A better understanding of conceptual and quantitative frameworks for evaluating the physical, biological, and social dynamics that sustain freshwater ecosystem services would allow for better management of these critical resources. This National Science Foundation Research Traineeship (NRT) award to the University of Washington will develop an innovative, culturally aware STEM workforce fluent in state-of-the-art approaches for sustaining food-energy-water services in large river ecosystems and prepared to effectively safeguard ecosystem services for a growing world population. The project is driven by an urgent need for interdisciplinary scientists who can address current and future environmental problems by employing the quantitative tools required to integrate, model, and visualize complex datasets and often conflicting outcomes. The Future Rivers NRT training will include coursework and group activities that emphasize quantitative and interdisciplinary literacy. Students will engage in research that spans transboundary rivers across the world, covering a range of human disturbance and regional economic development regimes. The project anticipates training sixty (60) MS and PhD students, including eighteen (18) funded trainees, from disciplines across natural resource science, engineering, social science, health science, and policy.
The overall goals of the Future Rivers NRT project are to develop a trained workforce in 21st century quantitative and data science approaches to sustain and safeguard food-energy-water services in large river ecosystems, while researching new ways to better predict impacts and safeguard these resources. The project will use numerical modeling and data science to catalyze new approaches for addressing the grand challenge of achieving sustainability in large Food, Energy, and Water Systems (FEWS). To integrate the research and learning, the Future Rivers project follows an active learning model that starts with the presentation of new information, followed by directed practice and exercises, and ends with the application of knowledge gained in a new context. Project training activities are centered around five primary educational objectives: 1) develop new technical and data science skills; 2) foster innovative interdisciplinary and international science integration; 3) improve trainee communication skills; 4) increase cultural awareness and inclusivity among faculty, trainees, and participants; and 5) create networks and opportunities for student career development. Specific project components focus on data science training and careers (courses, hackathon events, research summits, career fairs); communication and outreach skillsets (workshops, science communication film contests); equity and inclusivity training; and interdisciplinary river FEWS issues (courses, seminar series, and summer institutes that include some international locations).
The NSF Research Traineeship (NRT) Program is designed to encourage the development and implementation of bold, new potentially transformative models for STEM graduate education training. The program is dedicated to effective training of STEM graduate students in high priority interdisciplinary or convergent research areas through comprehensive traineeship models that are innovative, evidence-based, and aligned with changing workforce and research needs.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
|
2022 — 2026 |
Balazinska, Magdalena Krishna, Ranjay |
Iii: Medium: Vocal: Video Organization and Interactive Compositional Analytics @ University of Washington
Camera deployments are commonly used in many applications such as traffic monitoring, animal behavior tracking, autonomous driving, civil engineering, and more. Extracting value from these video streams is a key research and commercial challenge; a system that can organize large-scale video and provide an interface for users to easily interact with and query it is poised to be transformative in many commercial and academic domains. Yet the video data management systems required to develop modern video applications are still in their infancy. Existing systems have important limitations that restrict their practical use: they do not adapt easily to new domains; they have limited or no support for asking complex queries; and most systems process video streams from multiple cameras independently of one another, even when the cameras are part of a coordinated deployment. This project addresses these limitations by developing VOCAL: an open-source system for Video Organization and Interactive Compositional AnaLytics. VOCAL consists of a suite of domain-agnostic tools for end-to-end video analytics. It supports users in (1) interactively organizing video data, (2) expressing and executing complex queries, and (3) querying multi-view camera deployments. This project also provides research experiences for undergraduate and graduate students and produces materials to teach K-12 students about video management and analytics.

To meet the above goals, this project contributes new approaches in databases, computer vision, and AI, and brings together some of the independent efforts across these disciplines. In particular, VOCAL highlights the possibilities of using recent self-supervised computer vision methods to build algorithms that make data exploration feasible for large video datasets, thereby allowing the rapid development of domain-specific video event recognition models. VOCAL also utilizes scene graph representations to allow users to express complex queries as compositions of simpler ones. It then develops new approaches for the interactive specification and efficient execution of such queries. Finally, VOCAL contributes new approaches to seamlessly querying multi-view camera deployments.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
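The compositional-query idea in this abstract - complex video events expressed as compositions of simpler predicates over scene-graph-like representations - can be illustrated with a toy sketch. The data model and combinators below are hypothetical, not VOCAL's actual API:

```python
# Toy sketch of compositional video queries over a scene-graph-like
# representation, in the spirit of VOCAL. The data model and combinators
# are hypothetical, not VOCAL's actual API.

# Each frame is modeled as a set of (subject, relation, object) triples.
video = [
    {("car", "in", "lane"), ("person", "near", "crosswalk")},   # frame 0
    {("car", "in", "crosswalk"), ("person", "in", "crosswalk")},# frame 1
    {("car", "in", "lane")},                                    # frame 2
]

def has(subj, rel, obj):
    """Simple predicate: the given triple appears in the frame."""
    return lambda frame: (subj, rel, obj) in frame

def AND(p, q):
    """Compose two predicates: both must hold in the same frame."""
    return lambda frame: p(frame) and q(frame)

# A complex event composed from simpler ones:
# a car and a person occupy the crosswalk at the same time.
risky = AND(has("car", "in", "crosswalk"),
            has("person", "in", "crosswalk"))

matches = [i for i, frame in enumerate(video) if risky(frame)]
print(matches)  # -> [1]
```

Because queries are built from reusable predicates, a user can refine or recombine them interactively instead of training a new end-to-end model for each event of interest.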
|