2005 — 2011 |
Ives, Zachary |
N/A |
Career: Orchestra - Managing the Collaborative Sharing of Evolving Data @ University of Pennsylvania
This CAREER project develops, implements, and validates Orchestra, a general middleware layer for supporting collaborative data sharing: the effective exchange of data and updates between loose confederations of scientific collaborators who may disagree about their preferred schema, or even about which data items or updates are correct. The proposed methodology is "peer-to-peer" and "bottom-up" in style: it provides mechanisms by which one can rapidly join a data sharing confederation, by mapping a new data source to existing schemas within the confederation and then specifying what updates are to be trusted and accepted. The primary means of data exchange is through reconciling modifications made locally with those (trusted) updates made elsewhere; operations from elsewhere can always be overridden locally. Additionally, the research develops mechanisms for globally querying across all data sources and schemas. The work naturally extends techniques from data integration and peer data management.
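The update-exchange mechanism described above can be illustrated with a brief sketch. The following toy code is purely hypothetical (the Peer and Update names are invented for illustration and are not Orchestra's actual API): each participant accepts remote updates only from peers it trusts, and locally made modifications always override operations received from elsewhere.

    # Minimal sketch of trust-based update reconciliation (hypothetical; not Orchestra's code).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Update:
        peer: str      # peer that authored the update
        key: str       # data item being modified
        value: object  # new value

    class Peer:
        def __init__(self, name, trusted_peers):
            self.name = name
            self.trusted = set(trusted_peers)
            self.data = {}           # local database state
            self.local_keys = set()  # keys overridden locally

        def apply_local(self, key, value):
            # Local modifications always win and are never overridden.
            self.data[key] = value
            self.local_keys.add(key)

        def reconcile(self, remote_updates):
            # Accept only updates from trusted peers, and never clobber local edits.
            for u in remote_updates:
                if u.peer in self.trusted and u.key not in self.local_keys:
                    self.data[u.key] = u.value

    # Example: peer "lab_a" trusts "lab_b" but not "lab_c".
    p = Peer("lab_a", trusted_peers=["lab_b"])
    p.apply_local("gene:42", "curated annotation")
    p.reconcile([Update("lab_b", "gene:7", "v1"), Update("lab_c", "gene:42", "spam")])
    print(p.data)  # {'gene:42': 'curated annotation', 'gene:7': 'v1'}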
Broader Impact: This work enables greater data sharing in the scientific community, especially bioinformatics: it focuses on the use, derivation, and sharing of information in bioinformatics databases and warehouses, and it will be evaluated in such applications. The developed system will be disseminated on the Web and promoted through seminars delivered at the Penn Center for Bioinformatics and the regional Greater Philadelphia Bioinformatics Alliance, as well as other scientific forums. The research project will be integrated into an educational program (graduate and undergraduate) to teach data management in the broader context, focusing on data integration and exchange as well as traditional databases.
http://www.cis.upenn.edu/~zives/orchestra/
|
1 |
2005 — 2009 |
Ives, Zachary; Stoeckert, Christian (co-PI) [⬀]; White, Peter; Tannen, Val (co-PI) [⬀]; Davidson, Susan [⬀] |
N/A |
Ii: Data Cooperatives: Rapid and Incremental Data Sharing With Applications to Bioinformatics @ University of Pennsylvania
Generic tools and technologies for creating and maintaining data cooperatives (confederations whose purpose is distributed data sharing) will be developed to overcome the difficulties encountered in the sharing of information in the life sciences, specifically in bioinformatics.
The vision of large-scale data sharing has been a long-time goal of the bioinformatics field, much of it proceeding through data integration efforts. However, conventional approaches to data integration do not have the necessary flexibility and adaptability to make the existing and future plethora of data accessible and usable to typical biologists, while keeping it rapidly extensible to new concepts, domains, and types of queries, and thus fostering new research developments. The main reasons are that (1) different biologists work with different types of data and at differing levels of abstraction; (2) schemas in the bioinformatics world are typically large and complex; (3) queries and mappings may "break" without warning because of asynchronous updates; (4) it is logistically, economically and politically difficult to operate centralized data integration facilities. In response to these difficulties, data cooperatives emphasize: decentralization for both scalability and flexibility, incremental development of resources such as schemas, mappings, and queries, rapid discovery mechanisms for finding the resources relevant to a topic, and tolerance for intermittent participation of members and for approximate consistency of mappings. More specifically, the technical goals of the proposal include: (1) collaboratively developed yellow pages of biological topics; (2) schema templates, capturing the part of the structure of data pertaining to a specific interest and functioning also as visual templates from which a query form can be created; (3) incremental specification of mappings; (4) reasoning about uncertainty in mappings by measuring their degree of reliability with statistical tools and using it in query answering; (5) multi-path answering for queries with caching and replication in a large-scale data cooperative where the participation of individual members may not always be assured.
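A toy sketch of goal (4) follows; it is hypothetical (the reliability values and database names are invented) and is not the proposal's statistical machinery, but it shows the basic idea of ranking answers by the reliability of the mapping chain that produced them.

    # Toy sketch: rank query answers by the reliability of the mapping path that produced them
    # (hypothetical values and names; not the proposal's statistical machinery).

    mapping_reliability = {("plasmoDB", "geneDB"): 0.9, ("geneDB", "localDB"): 0.7}

    def path_reliability(path):
        # Reliability of a chain of mappings, assuming mapping errors are independent.
        score = 1.0
        for hop in zip(path, path[1:]):
            score *= mapping_reliability.get(hop, 0.0)
        return score

    answers = [("gene_X", ["plasmoDB", "geneDB", "localDB"]),
               ("gene_Y", ["plasmoDB", "geneDB"])]
    ranked = sorted(answers, key=lambda a: path_reliability(a[1]), reverse=True)
    print([(g, round(path_reliability(p), 2)) for g, p in ranked])
    # [('gene_Y', 0.9), ('gene_X', 0.63)]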
Data cooperatives will have broader impact through applications in a variety of scientific and industrial fields, but it is in the field of bioinformatics that they are likely to have an immediate and significant impact. Therefore, a specific data cooperative will serve as a biological testbed for evaluating the proposed technologies. This testbed is based on a small set of databases which are already collaborating and exchanging data related to Plasmodium falciparum. Broader impact will also be achieved through the proposed educational initiatives, specifically through a "computational orchestra" bioinformatics course which will expose students to data integration issues through project work, and a workshop for the Greater Philadelphia Bioinformatics Alliance (GPBA). Minority involvement will also be encouraged through a GPBA internship program.
|
1 |
2006 — 2007 |
Ives, Zachary; Davidson, Susan [⬀] |
N/A |
Seiii: Workshop On Information Integration @ University of Pennsylvania
The purpose of the workshop is to devote national attention to the need for identifying and comprehensively examining the major challenges in the area of information integration (II), and the long-term research, engineering, and development needed to advance the state of the art and state of the practice in the application of advanced IT to resolve these challenges in the near- and long-term. The type and modality of information available in digital form is vast - image, video, text, audio, sensor and other forms of streaming data, as well as structured data such as databases and XML/HTML documents. Although the data may be geographically distributed, collected and designed for specific uses and applications ("silos" of information), it is often logically inter-related, and many important questions can only be answered by accessing it collectively. However, despite considerable research and development over the last 20 years, truly ad hoc II, where disparate information systems are accessed efficiently in real time in response to unanticipated information needs and data is combined to form reliable answers to queries, remains an elusive goal. A systematic development of II technologies is needed to provide the necessary infrastructure leading to significant advances in the access to and analysis of widely distributed, heterogeneous, disparate information resources. Of particular importance to this workshop will be issues associated with the integration of science and engineering data and Federal, medical, and other records.
|
1 |
2007 — 2012 |
Ives, Zachary; Hanson, Clarence; Lee, Insup (co-PI) [⬀]; Guha, Sudipto (co-PI) [⬀] |
N/A |
Nets/Noss: Aspen: Abstraction-Based Sensor Programming Environment @ University of Pennsylvania
As sensing devices become increasingly sophisticated, more widely deployed in the real world, and more advanced in their capabilities, they are being used in richer and more diverse applications (such as providing visitor guides, monitoring patients in the home, or managing workflows). This project develops a new general-purpose programming model, group-based programming, for developing an emerging class of sensor applications that monitor heterogeneous data and abstract it into higher-level concepts like events, phenomena, and workflows. Group-based programming develops a unified, declarative framework for integrating heterogeneous data into abstract groups (sets of devices) and the data streams (views) they produce; composing groups and views; and, equally importantly, expressing communication, security, and privacy constraints. It synthesizes ideas and techniques from databases and data integration, real-time systems, and streaming algorithms, in order to provide a higher-level framework for application development. The project develops an efficient runtime support layer for group-based programming across a broad array of sensor devices of different types, with automatic optimization capabilities that exploit the properties of the underlying network and devices. Finally, it validates the suitability of this model across a variety of applications in hospital and home health care. The project will train two graduate students and a variety of summer students at the undergraduate and/or high school level, will develop a graduate/undergraduate course in sensor network applications, and will result in publicly available software.
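The group/view idea can be pictured with a small, purely illustrative sketch (the device catalog, group, and view functions below are invented for this example and are not the ASPEN programming interface): devices are selected into a group by a declarative predicate, and a view abstracts the group's raw readings into a higher-level value.

    # Illustrative sketch of group-based programming (hypothetical API; not ASPEN itself).

    devices = [
        {"id": "d1", "type": "temperature", "room": "icu"},
        {"id": "d2", "type": "temperature", "room": "ward"},
        {"id": "d3", "type": "heart_rate",  "room": "icu"},
    ]

    def group(predicate):
        # A "group" is a declarative selection over the device catalog.
        return [d for d in devices if predicate(d)]

    def view(member_ids, readings, aggregate):
        # A "view" abstracts a set of raw readings into one higher-level value.
        relevant = [v for dev, v in readings if dev in member_ids]
        return aggregate(relevant) if relevant else None

    icu_temps = {d["id"] for d in group(lambda d: d["type"] == "temperature" and d["room"] == "icu")}
    readings = [("d1", 37.9), ("d2", 22.0), ("d3", 88)]
    print(view(icu_temps, readings, max))  # 37.9, e.g. the basis of a "fever event" abstraction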
|
1 |
2007 — 2009 |
Ives, Zachary; Loo, Boon Thau [⬀]; Smith, Jonathan (co-PI) [⬀] |
N/A |
Find: Wireless Knowledge Infrastructure (Wiki) @ University of Pennsylvania
Mobility has created an increasing demand for information local to the user in a challenging and information-rich environment, demanding new capabilities from information services and network protocols. The Wireless Knowledge Infrastructure (WiKI) project develops an extensible general-purpose system layer based on new ideas for applying concepts from programming languages and database systems - the use of declarative languages and composable views of router, network and host state - to allow monitoring, event detection and triggering based on extant network conditions and policies. Declarative routing algorithms take into account application, session and network state information to set up adaptive routes among mobile devices and wired infrastructure nodes. Cross-layer and cross-domain integrated views of data streams expose and abstract data from different subsystems and layers, providing a step towards a "Knowledge Plane" for networks. WiKI takes an exploratory approach, namely building a small-scale software infrastructure using 802.11 to understand the wireless challenges of heavily populated urban areas in Philadelphia, and to develop prototype services based on a WiKI model. WiKI services are incrementally refined as the research progresses.
Broader Impact: The end goal is incorporation of WiKI platforms, software and services into the "Wireless Philadelphia" municipal WiFi effort, notable for its integral Digital Inclusion program which attempts to reach economically disadvantaged households in our city.
|
1 |
2007 — 2011 |
Ives, Zachary; Guha, Sudipto (co-PI) [⬀] |
N/A |
Iii: Distributed Stream Integration @ University of Pennsylvania
Declarative, database-style models for programming distributed applications are becoming widely adopted, in a variety of realms ranging from sensors to publish-subscribe to network state management. They free the developer to define high-level queries for the specific data of interest, without regard to details about data sources, communications protocols, or synchronization.
As this approach to programming gains momentum, there is increasing need to abstract low-level stream data source variations away under a uniform representation, i.e., a view; and to integrate, i.e., conjoin, different types of stream data from large numbers of sources. Such tasks involve much more distributed communication and coordination than in traditional distributed databases or even data stream management systems. It becomes essential to do in-network computation of the query, and to optimize the processing of each stream (or few streams) separately, in a way that considers the topology of the network.
This proposal develops the technologies to support integration of data streams, including languages for stream schema mappings, focusing on issues relating to combining distributed messages and maintaining timing information; techniques for rapidly establishing query computation paths through a network, for sets of data stream elements that need to be joined and aggregated together; offline and adaptive, network-aware query optimization techniques for distributed computation in the network. These techniques will scale across widely heterogeneous (sensor, wireless, and conventional) networks, and will be evaluated in environmental monitoring applications.
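A small sketch of the core operation targeted above follows. It is hypothetical (the function and field names are invented and this is not the project's system): two streams with a uniform schema are joined on a key, windowed by timestamp so timing information is preserved.

    # Small sketch of a timestamp-windowed join of two streams (hypothetical; not the project's code).
    from collections import defaultdict

    def windowed_join(stream_a, stream_b, key, window):
        """Join elements of two streams whose keys match and whose timestamps
        differ by at most `window` seconds. Each element is a dict with a 'ts' field."""
        by_key = defaultdict(list)
        for b in stream_b:
            by_key[b[key]].append(b)
        for a in stream_a:
            for b in by_key.get(a[key], []):
                if abs(a["ts"] - b["ts"]) <= window:
                    yield {**a, **{f"b_{k}": v for k, v in b.items()}}

    temps = [{"sensor": "s1", "ts": 10, "temp": 21.5}]
    humid = [{"sensor": "s1", "ts": 12, "hum": 0.44}]
    print(list(windowed_join(temps, humid, key="sensor", window=5)))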
The intellectual merit is the development of new techniques for performing queries across large, highly distributed networks of stream-producing sources; this increases understanding of the adaptive query processing space when access costs to data items are non-uniform and query processing requires distributed communication, as well as of the trade-offs between offline and adaptive optimization and across optimization granularities. The broader impact includes the development of distributed stream integration capabilities that can directly address a number of emerging and well-known challenges in the network and environmental monitoring domains. The educational component includes the training of two PhD students, and the teaching of stream data integration in graduate and advanced undergraduate courses.
Project URL: http://www.cis.upenn.edu/~zives/stream-integration/
|
1 |
2010 — 2015 |
Ives, Zachary; Yoo, Christopher; Haeberlen, Andreas (co-PI) [⬀]; Loo, Boon Thau (co-PI) [⬀]; Smith, Jonathan [⬀] |
N/A |
Fia: Collaborative Research: Nebula: a Future Internet That Supports Trustworthy Cloud Computing @ University of Pennsylvania
Cloud computing provides economic advantages from shared resources, but security is a major risk for remote operations and a major barrier to the approach, with challenges for both hosts and the network. NEBULA is a potential future Internet architecture providing trustworthy networking for the emerging cloud computing model of always-available network services. NEBULA addresses many network security issues, including data availability with a new core architecture (NCore) based on redundant connections to and between NEBULA core routers, accountability and trust with a new policy-driven data plane (NDP), and extensibility with a new control plane (NVENT) that supports network virtualization, enabling results from other future Internet architectures to be incorporated in NEBULA. NEBULA's data plane uses cryptographic tokens as demonstrable proofs that a path was both authorized and followed. The NEBULA control plane provides one or more authorized paths to NEBULA edge nodes; multiple paths provide reliability and load-balancing. The NEBULA core uses redundant high-speed paths between data centers and core routers, as well as fault-tolerant router software, for always-on core networking. The NEBULA architecture removes network (in)security as a prohibitive factor that would otherwise prevent the realization of many cloud computing applications, such as electronic health records and data from medical sensors. NEBULA will produce a working system that is deployable on core routers and is viable from both an economic and a regulatory perspective.
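The notion of a proof that an authorized path was followed can be pictured with a greatly simplified sketch. It is hypothetical and deliberately naive (shared keys, a single verifier, per-hop HMACs), and is not NEBULA's actual data-plane protocol: each router on the authorized path attests to the packet with a keyed MAC, and a verifier holding the same keys checks the chain of attestations.

    # Greatly simplified sketch of per-hop proofs that an authorized path was followed
    # (hypothetical; not NEBULA's data-plane protocol).
    import hmac, hashlib

    keys = {"R1": b"k1", "R2": b"k2"}        # each router shares a key with the verifier
    authorized_path = ["R1", "R2"]

    def forward(packet):
        proofs, tag = [], packet
        for router in authorized_path:
            tag = hmac.new(keys[router], tag, hashlib.sha256).digest()
            proofs.append(tag)
        return proofs

    def verify(packet, proofs):
        tag = packet
        for router, proof in zip(authorized_path, proofs):
            tag = hmac.new(keys[router], tag, hashlib.sha256).digest()
            if not hmac.compare_digest(tag, proof):
                return False
        return len(proofs) == len(authorized_path)

    proofs = forward(b"payload")
    print(verify(b"payload", proofs))  # True: every hop on the authorized path attested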
|
1 |
2010 — 2012 |
Ives, Zachary |
N/A |
Iii: Eager: Data Integration as a Dialogue With the User @ University of Pennsylvania
This work establishes a new approach to answering ad hoc ("discovery") queries that require integration and structuring: such queries help scientists learn possible relationships between topics, and help decision-makers or consumers explore options. The work develops a new system and underlying architecture based on an iterative process, where the system and user engage in a dialogue until the user has answers meeting his or her information need.
The resulting system takes sources on the Web, discovers semantic relationships among them, and allows users to pose discovery queries. It leverages existing extraction, matching, and recommendation algorithms as sources of evidence to generate hypotheses and corresponding queries, and adjusts these hypotheses based on user feedback over the query results. Innovations include scalable models for combining features and learning to reweight hypotheses; query and source recommendation techniques; and means of generalizing tuple-based feedback to support or refute hypotheses.
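The feedback loop can be illustrated with a toy sketch (the hypothesis names and the multiplicative update rule are invented for illustration and are not the project's actual learning algorithm): candidate hypotheses carry weights, and a weight is boosted or decayed when the user accepts or rejects a tuple that the corresponding hypothesis produced.

    # Toy sketch of adjusting hypothesis weights from user feedback
    # (hypothetical multiplicative update; not the project's learner).

    hypotheses = {"join_on_gene_id": 1.0, "join_on_gene_name": 1.0}

    def ranked():
        total = sum(hypotheses.values())
        return {h: w / total for h, w in hypotheses.items()}

    def feedback(hypothesis, accepted, eta=0.5):
        # Boost hypotheses whose answer tuples the user accepts; decay ones the user rejects.
        hypotheses[hypothesis] *= (1 + eta) if accepted else (1 - eta)

    feedback("join_on_gene_name", accepted=False)  # user rejects a tuple produced by this mapping
    feedback("join_on_gene_id", accepted=True)     # user accepts a tuple from this mapping
    print(ranked())  # {'join_on_gene_id': 0.75, 'join_on_gene_name': 0.25}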
The research impact is a new paradigm for data integration by end users, which scalably combines machine learning and database concepts. The broader impact includes better discovery tools for scientific users and other users who sorely need them; improved integration of existing Web data resources; and new educational material on how networks of data can be as important as networks of systems and people. The PI is incorporating the research concepts into courses in the University of Pennsylvania's new Market and Social Systems Engineering Program, focused on the interface between people, protocols, and systems on the Internet, especially through social and data networks, as well as markets. More information on the project can be found on the project website at http://www.cis.upenn.edu/~zives/dialogue/
|
1 |
2011 — 2017 |
Ives, Zachary; Haeberlen, Andreas [⬀]; Loo, Boon Thau (co-PI) [⬀] |
N/A |
Tc: Medium: Collaborative Research: Tracking Adversarial Behavior in Distributed Systems With Secure Networked Provenance @ University of Pennsylvania
Operators of networks and distributed systems often find themselves needing to answer a diagnostic or forensic question -- some part of the system is found to be in an unexpected state, and the operators must decide whether the state is legitimate or a symptom of a clandestine attack. In such cases, it would be useful to ask the system for an 'explanation' of the observed state. In the absence of attacks, emerging network provenance techniques can construct such explanations by constructing a chain of events that links the observed state to its root causes. However, an attacker can cause the nodes under his control to forge or suppress information and thus produce a plausible (but incorrect) explanation. As a result, the operators may fail to notice the attack.
This research develops secure network provenance techniques that can provide useful explanations even when the system is under attack by a powerful adversary. The project (i) substantially extends and generalizes the concept of network provenance by adding capabilities needed in a forensic setting; (ii) develops techniques for securely storing provenance without any trusted components; (iii) designs methods for efficiently querying secure provenance; (iv) introduces methods for protecting the confidentiality of provenance; and (v) evaluates these techniques in the context of concrete applications.
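One ingredient of tamper-evident provenance storage can be shown with a simplified sketch (hypothetical; this is an ordinary hash chain, not the project's secure network provenance protocol): each node appends derivation events to a chained log and discloses the head hash, so a node that later forges or suppresses an event produces a log that no longer matches its earlier commitment.

    # Simplified sketch of a tamper-evident, hash-chained provenance log
    # (illustrative only; not the secure network provenance protocol itself).
    import hashlib, json

    class ProvenanceLog:
        def __init__(self):
            self.entries = []
            self.head = "0" * 64  # hash of the latest entry, disclosed as a commitment

        def append(self, event):
            record = {"prev": self.head, "event": event}
            digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
            self.entries.append(record)
            self.head = digest
            return digest

        def verify(self):
            head = "0" * 64
            for rec in self.entries:
                if rec["prev"] != head:
                    return False
                head = hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()
            return head == self.head

    log = ProvenanceLog()
    log.append({"rule": "route(A,B)", "derived_from": ["link(A,B)"]})
    print(log.verify())           # True
    log.entries[0]["event"] = {}  # an adversary rewrites history...
    print(log.verify())           # False: the chain no longer matches the disclosed head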
The project's theme of provenance and forensics is integrated with Penn's new undergraduate program in Market and Social Systems Engineering. It will provide forensics support for a wide variety of distributed applications, including emerging cloud applications upon which critical infrastructure may soon be based.
|
1 |
2012 — 2015 |
Ives, Zachary; Tannen, Val [⬀] |
N/A |
Iii: Small: Provisioning For Autonomous Data Analysis and Scenario Exploration @ University of Pennsylvania
In business intelligence and data-driven science, users often wish to consider various "what-if scenarios": hypothetical updates and query refinements. Unfortunately current techniques do not support this type of exploration when query answers have high latency, when data sources charge fees, or in mobile, disconnected settings. This project considers how, given a particular set of query aspects and updates, one can precompute a special data representation that can be stored on a client machine and can be used to directly answer queries under a variety of updates and what-if scenarios, without direct access to the data sources. The project achieves this goal by developing "provisioned representations" that capture a form of parameterized data instances, from which complex queries (even with multiple levels of aggregation) can be answered. The work develops: (1) the provisioned representation (PR) formalism, (2) means of encoding and storing PRs, (3) a query system and query processing techniques for PRs, (4) a "wizard" for generating PRs from parameterized scenarios, and (5) interactive tools for exploring changes to data and queries over PRs. The work will improve decision support systems and help enable "information foresight" - the ability to provide, given a question, answers that include additional data relevant to a user's interests. The project supports Ph.D. students, and also develops a new course on networked information management for the University of Pennsylvania's innovative Market and Social Systems Engineering undergraduate degree program on networks and markets. Data and code will be disseminated through public collaborative portals (GitHub, Google Code) and the project Web site (https://dbappserv.cis.upenn.edu/home/?q=node/173).
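The flavor of answering what-if queries from a precomputed representation can be conveyed with a toy sketch. It is hypothetical (the parameterization by per-region price bumps is invented, and this is far simpler than the PR formalism): each tuple's contribution to an aggregate is stored as a function of scenario parameters, so the client re-evaluates the query locally under any scenario without touching the sources.

    # Toy sketch of a "provisioned" aggregate: each tuple's contribution is a function of
    # scenario parameters, evaluated client-side (illustrative only; not the PR formalism).

    provisioned = [
        ("east", lambda p: 1000 * (1 + p.get("east_price_bump", 0.0))),
        ("west", lambda p: 800  * (1 + p.get("west_price_bump", 0.0))),
    ]

    def total_revenue(scenario):
        # Re-evaluate the aggregate under a what-if scenario, with no source access.
        return sum(f(scenario) for _, f in provisioned)

    print(total_revenue({}))                         # baseline: 1800
    print(total_revenue({"east_price_bump": 0.10}))  # what-if: 1900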
|
1 |
2015 — 2016 |
Ives, Zachary; Kim, Junhyong (co-PI) [⬀] |
U01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Approximating and Reasoning About Data Provenance @ University of Pennsylvania
DESCRIPTION (provided by applicant): In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here - yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer. We propose to better incentivize the adoption of workflow and other provenance tracking tools: (1) Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all. (2) We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.
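One of the capabilities named above, explaining why two results differ, can be pictured with a toy sketch (the run records and tool versions are invented, and this is not the proposed provenance analytics): given provenance from two pipeline runs, report the earliest step whose configuration or inputs diverge as a candidate explanation.

    # Toy sketch: compare provenance from two pipeline runs and report the first step
    # whose configuration differs (hypothetical; not the project's provenance analytics).

    run_a = [("align", {"tool": "bwa", "version": "0.7.17"}),
             ("call_variants", {"tool": "gatk", "version": "4.1"})]
    run_b = [("align", {"tool": "bwa", "version": "0.7.15"}),
             ("call_variants", {"tool": "gatk", "version": "4.1"})]

    def first_divergence(prov_a, prov_b):
        for (step_a, cfg_a), (step_b, cfg_b) in zip(prov_a, prov_b):
            if step_a != step_b or cfg_a != cfg_b:
                return step_a, cfg_a, cfg_b
        return None

    print(first_divergence(run_a, run_b))
    # ('align', {'tool': 'bwa', 'version': '0.7.17'}, {'tool': 'bwa', 'version': '0.7.15'})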
|
1 |
2015 — 2018 |
Ives, Zachary; Tannen, Val (co-PI) [⬀]; Davidson, Susan (co-PI) [⬀]; Kannan, Sampath (co-PI) [⬀] |
N/A |
Cici: Data Provenance: Provenance-Based Trust Management For Collaborative Data Curation @ University of Pennsylvania
Data-driven science relies not only on statistics and machine learning, but also on human expertise. As data are being collected to tackle increasingly challenging scientific and medical problems, there is need to scale up the amount of expert human input (curation and, in certain cases, annotation) accordingly. This project addresses this need by developing collaborative data curation: instead of relying on a small number of experts, it enables annotations to be made by communities of users of varying expertise. Since the quality of annotations by different users will vary, novel quantitative techniques are developed to assess the trustworthiness of each user, based on their actions, and to distinguish trustworthy experts from unskilled and malicious users. Algorithms are developed to combine users' annotations based on their trustworthiness. Collaborative data curation will greatly increase the amount of human annotated data, which will, in turn, lead to better Big Data analysis and detection algorithms for the life sciences, medicine, and beyond.
The central problems of collaborative data curation lie in the high variability in the quality of users' annotations, and variability in the form the data takes when they annotate it. The proposal develops techniques to take annotations made by different users over different views of data (such as an EEG display with filters and transformations applied to the signal), to use provenance to reason about how the annotations relate to the original data, and to reason about the reliability and trustworthiness of each user's annotations over this data. To accomplish this, the research first defines data and provenance models that capture time- and space-varying data; novel reliability calculus algorithms for computing and dynamically updating the reliability and trustworthiness of individuals, based on their annotations and how these compare to annotations from recognized experts and the broader community; and a high-level language called PAL that enables the researchers to implement and compare multiple policies. The researchers will initially develop and validate the techniques on neuroscience and time series data, within a 900+ user public data sharing portal (with 1500+ EEG and other datasets for which annotations are required). The project team later expands the techniques to other data modalities, such as imaging and genomics.
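The basic loop of reliability-weighted curation can be shown with a small sketch. It is hypothetical (the reliability scores, labels, and update rule are invented; this is not the PAL language or the proposed reliability calculus): annotations are combined by a vote weighted by each user's current reliability, and a user's reliability is nudged toward their agreement with vetted labels.

    # Small sketch of reliability-weighted label aggregation with simple reliability updates
    # (hypothetical; not the project's PAL language or reliability calculus).

    reliability = {"expert1": 0.9, "novice1": 0.5, "novice2": 0.5}

    def aggregate(annotations):
        # annotations: {user: label}; weighted vote by current reliability.
        votes = {}
        for user, label in annotations.items():
            votes[label] = votes.get(label, 0.0) + reliability[user]
        return max(votes, key=votes.get)

    def update_reliability(user, agreed, lr=0.1):
        # Move reliability toward 1 when the user agrees with a vetted label, toward 0 otherwise.
        target = 1.0 if agreed else 0.0
        reliability[user] += lr * (target - reliability[user])

    label = aggregate({"expert1": "seizure", "novice1": "artifact", "novice2": "seizure"})
    update_reliability("novice1", agreed=(label == "artifact"))
    print(label, reliability["novice1"])  # 'seizure' 0.45: novice1's reliability decreases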
|
1 |
2016 — 2017 |
Ives, Zachary; Nenkova, Ani (co-PI) [⬀]; Wallace, Byron Casey |
UH2 Activity Code Description: To support the development of new research activities in categorical program areas. (Support generally is restricted in level of support and in time.) |
Crowdsourcing Mark-Up of the Medical Literature to Support Evidence-Based Medicine and Develop Automated Annotation Capabilities @ Northeastern University
DESCRIPTION (provided by applicant): Evidence-based medicine (EBM) promises to transform the way that physicians treat their patients, resulting in better quality and more consistent care informed directly by the totality of relevant evidence. However, clinicians do not have the time to keep up to date with the vast medical literature. Systematic reviews, which provide rigorous, comprehensive and transparent assessments of the evidence pertaining to specific clinical questions, promise to mitigate this problem by concisely summarizing all pertinent evidence. But producing such reviews has become increasingly burdensome (and hence expensive) due in part to the exponential expansion of the biomedical literature base, hampering our ability to provide evidence-based care. If we are to scale EBM to meet the demands imposed by the rapidly growing volume of published evidence, then we must modernize EBM tools and methods. More specifically, if we are to continue generating up-to-date evidence syntheses, then we must optimize the systematic review process. Toward this end, we propose developing new methods that combine crowdsourcing and machine learning to facilitate efficient annotation of the full-texts of articles describing clinical trials. These annotations will comprise mark-up of sections of text that discuss clinically relevant fields of importance in EBM, such as discussion of patient characteristics, interventions studied and potential sources of bias. Such annotations would make literature search and data extraction much easier for systematic reviewers, thus reducing their workload and freeing more time for them to conduct thoughtful evidence synthesis. This will be the first in-depth exploration of crowdsourcing for EBM. We will collect annotations from workers with varying levels of expertise and cost, ranging from medical students to workers recruited via Amazon Mechanical Turk. We will develop and evaluate novel methods of aggregating annotations from such heterogeneous sources. And we will use the acquired manual annotations to train machine learning models that automate this markup process. Models capable of automatically identifying clinically salient text snippets in full-text articles describing clinical trials would be broadly useful for biomedical literature retrieval tasks and would have impact beyond our immediate application of EBM.
|
0.942 |
2016 — 2021 |
Sim, Ida (co-PI) [⬀]; Srivastava, Mani; Ives, Zachary; Kumar, Santosh |
N/A |
Cif21 Dibbs: Ei: Mprov: Provenance-Based Data Analytics Cyberinfrastructure For High-Frequency Mobile Sensor Data
This project addresses a rapidly growing opportunity: the ability of the research community to use high-frequency mobile sensor data. Mobile sensors (embedded in phones, vehicles, wearables, and the environment) continuously capture data in great detail, and have the potential to address problems in a range of scientific and engineering domains. This effort focuses upon a specific case -- health data -- that builds upon several capabilities developed in National Institutes of Health (NIH) sponsored projects for assembling and analyzing health data collected through mobile sensors and apps. Improvements to the usefulness of extremely noisy, distributed data can serve many communities, and the components are extensible outside the human health domain.
Mobile sensors present a distinct set of data challenges: the data quantity and quality fluctuate, and uncertainty can be high. Establishing provenance on such noisy data is a challenge, and there are limitations on access to data from human subjects. This project addresses several of the distinctive challenges associated with mobile sensor data. Variability is addressed by providing detailed annotation with metadata (such as provenance and quality), and by providing facilities for context-specific reasoning about the metadata. The system captures provenance metadata along with data in a stream, and propagates this information alongside derived data from one stage to the next. This creates cyberinfrastructure that makes it possible to 'replay' mobile device data with different configurations, to comparatively benchmark two algorithms or to diagnose erroneous output. The project builds upon the capabilities and success of the NIH-funded Center of Excellence in Mobile Sensor Data to Knowledge (MD2K), which provides an open-source cyberinfrastructure enabling the collection, curation, analysis, visualization, and interpretation of high-frequency mobile sensor data. Conducting research with mobile sensor data collected by others continues to be challenging; this project develops a companion open-source provenance cyberinfrastructure, facilitating the sharing of the mobile sensor data itself. Results include metadata standards, interfaces, and runtime support for annotating data streams with the source (sensor, location, sampling rate, continuous or episodic), semantics of output (number, probability, class), provenance (features, rules for decision), and validation (specificity, sensitivity, benchmark used). The infrastructure accommodates a wide variety of data types and enables data discovery, analytics, visualization, integration, and validation by third party researchers. The project improves the ability of the wider scientific and engineering community to use mobile sensing systems and metadata, and it also has immediate, tangible societal benefits in health and wellness.
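The idea of propagating provenance metadata alongside derived stream data can be illustrated with a sketch. The data structures below are hypothetical (invented field names, not the mProv APIs): each derived element carries a record of the step and inputs that produced it, which is what later makes replay and auditing possible.

    # Illustrative sketch of propagating provenance alongside derived stream data
    # (hypothetical structures; not the mProv cyberinfrastructure itself).
    from dataclasses import dataclass, field

    @dataclass
    class Element:
        value: object
        prov: dict = field(default_factory=dict)  # source, step, and inputs that produced it

    def derive(step_name, inputs, fn):
        # Compute a derived value and attach metadata describing how it was produced.
        out = fn([e.value for e in inputs])
        return Element(out, prov={"step": step_name,
                                  "inputs": [e.prov.get("step", "raw") for e in inputs]})

    raw = [Element(72, prov={"step": "raw", "sensor": "ecg_1", "rate_hz": 250}),
           Element(75, prov={"step": "raw", "sensor": "ecg_1", "rate_hz": 250})]
    hr = derive("mean_heart_rate", raw, lambda vs: sum(vs) / len(vs))
    print(hr.value, hr.prov)  # 73.5 with a record of the step and its inputs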
This award by the Advanced Cyberinfrastructure Division is jointly supported by the NSF Directorate for Computer & Information Science & Engineering (Division of Computer and Network Systems, and Division of Information and Intelligent Systems).
|
0.964 |
2017 |
Ives, Zachary; Nenkova, Ani (co-PI) [⬀]; Wallace, Byron Casey |
UH2 Activity Code Description: To support the development of new research activities in categorical program areas. (Support generally is restricted in level of support and in time.) |
Project-001 @ Northeastern University
Evidence-based medicine (EBM) promises to transform the way that physicians treat their patients, resulting in better quality and more consistent care informed directly by the totality of relevant evidence. However, clinicians do not have the time to keep up to date with the vast medical literature. Systematic reviews, which provide rigorous, comprehensive and transparent assessments of the evidence pertaining to specific clinical questions, promise to mitigate this problem by concisely summarizing all pertinent evidence. But producing such reviews has become increasingly burdensome (and hence expensive) due in part to the exponential expansion of the biomedical literature base, hampering our ability to provide evidence-based care. If we are to scale EBM to meet the demands imposed by the rapidly growing volume of published evidence, then we must modernize EBM tools and methods. More specifically, if we are to continue generating up-to-date evidence syntheses, then we must optimize the systematic review process. Toward this end, we propose developing new methods that combine crowdsourcing and machine learning to facilitate efficient annotation of the full-texts of articles describing clinical trials. These annotations will comprise mark-up of sections of text that discuss clinically relevant fields of importance in EBM, such as discussion of patient characteristics, interventions studied and potential sources of bias. Such annotations would make literature search and data extraction much easier for systematic reviewers, thus reducing their workload and freeing more time for them to conduct thoughtful evidence synthesis. This will be the first in-depth exploration of crowdsourcing for EBM. We will collect annotations from workers with varying levels of expertise and cost, ranging from medical students to workers recruited via Amazon Mechanical Turk. We will develop and evaluate novel methods of aggregating annotations from such heterogeneous sources. And we will use the acquired manual annotations to train machine learning models that automate this markup process. Models capable of automatically identifying clinically salient text snippets in full-text articles describing clinical trials would be broadly useful for biomedical literature retrieval tasks and would have impact beyond our immediate application of EBM.
|
0.942 |
2017 |
Ives, Zachary; Kim, Junhyong (co-PI) [⬀] |
U01 Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies. |
Approximating and Reasoning About Data Provenance @ University of Pennsylvania
DESCRIPTION (provided by applicant): In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here - yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer. We propose to better incentivize the adoption of workflow and other provenance tracking tools: (1) Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all. (2) We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.
|
1 |
2019 |
Gee, James C; Ives, Zachary; Maidment, Andrew D.A. (co-PI) [⬀] |
T32 Activity Code Description: To enable institutions to make National Research Service Awards to individuals selected by them for predoctoral and postdoctoral research training in specified shortage areas. |
Training Program in Biomedical Imaging and Informational Sciences @ University of Pennsylvania
The training of quantitative basic scientists in clinically related imaging science is increasingly important. Excellent imaging sciences are well represented at Penn in multiple schools, but previously no formal integration of efforts in graduate training existed, nor was there a formal clinical component to the training until the creation of the Training Program in Biomedical Imaging and Informational Sciences. Established in 2006 under the auspices of the HHMI-NIBIB Interfaces Initiative, today the program represents a partnership led by the Departments of Radiology, Bioengineering, and Computer and Information Science, in collaboration with many other departments across multiple schools. Our premise is that the most successful research and technologies in quantitative imaging science are those that integrate clinical relevance, mathematical rigor, and engineering finesse. Accordingly, the program embraces strong clinical exposure alongside analytical imaging science. The objective is to provide interdisciplinary training by ensuring that students attain a level of integration that will allow them to become the next generation of leaders in hypothesis-driven, clinically focused biomedical imaging research. Program outcomes to date are strong across all impact measures, indicating successful progress toward training objectives: publications (219) and citations (5171); numerous research awards and distinctions; recruitment of 9 (23%) URM or disadvantaged trainees; and 14 graduates (78%) in faculty positions, post-doctoral training, or medical training and residencies. A formalized curriculum, the doctoral foundation, developed for the program provides 18 months of vertical integration of the core didactic elements of biomedicine and basic science education in biomedical imaging through three Foundational components, followed by elective Pathways. In the first, Foundations in Biomedical Science (2 courses), students participate in modules 1 and 2 of the medical student curriculum, which teaches the Core Principles of Medicine (including Gross Anatomy) and a 12-month sequence of organ systems medicine, Integrative Systems and Diseases. This is complemented by 2 courses in Foundations of Imaging Science: Molecular Imaging, and Fundamental Techniques of Imaging. The third foundational component is Professional Training: Responsible Conduct of Research, Scientific Rigor and Reproducibility, Teaching Practicum, Patient-Oriented Research Training, Research 'Survival' Skills, and Career Development Skills. The foundational curriculum is extended toward more specialized training by many elective courses offered through two Pathways, Imaging Methods and Applications and Imaging Data Science (the latter new in the next renewal period). Didactic training is complemented by obligatory Laboratory Rotations offered through the laboratories of participating faculty. To ensure that the thesis research is directed to translational medicine through the solution of discrete clinical problems, trainees are required to be co-advised by members of the clinical and basic science faculty.
|
1 |