1988 — 1990 |
Wohn, Kwangyoen; Lee, Insup; Davidson, Susan |
A Distributed Testbed For Real-Time Active Sensing @ University of Pennsylvania
Image processing equipment will be provided for researchers at the University of Pennsylvania in the Department of Electrical Engineering and Computer Science. This equipment is provided under the Instrumentation Grants for Research in Computer and Information Science and Engineering program. The equipment will be used for research in the areas of (1) design and implementation of a distributed real-time kernel for active sensing, (2) real-time integration, control and coordination of sensors, (3) language constructs for distributed real-time programming, and (4) real-time multiple image fusion.
1991 — 1994 |
Lee, Insup; Davidson, Susan |
A Formal Approach to Real Time System Specification and Analysis @ University of Pennsylvania
The objective of this project is to develop a formal framework for the specification and analysis of real-time systems, together with tools to support it. Since the correctness of a real-time system depends not only on how concurrent processes interact but also on the time at which these interactions occur, the formal framework must be capable of modeling delays due to process synchronization as well as contention for shared resources. While most current real-time models capture delays due to process synchronization, they abstract out resource-specific details by assuming idealized operating environments. On the other hand, scheduling and resource allocation algorithms used for real-time systems ignore the effect of process synchronization except for simple precedence relations between processes. To bridge the gap between these two disciplines, a specification language called Communicating Shared Resources (CSR) and a priority-based process algebra called the Calculus for Communicating Shared Resources (CCSR) are being developed. CSR supports the high-level description of real-time systems, whereas CCSR provides the resource-based computation model of CSR and a prioritized strong equivalence for terms based on strong bisimulation. Based on CCSR's formal framework, three tools to analyze process behavior will be developed: a simulator, a model checker and a mechanical theorem prover. The tools will be integrated in a single analyzer to be used at different stages of proving that a specification is correct.
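The resource-contention idea at the heart of CCSR can be pictured with a toy sketch (an invented Python illustration, not the calculus itself, whose semantics is defined over process terms): when several actions contend for the same resource in the same time unit, only the highest-priority one proceeds.

```python
# Toy sketch of priority-based arbitration on a shared resource.
# Illustrates the idea behind CCSR's prioritized semantics only;
# all names here are invented for this example.

def step(requests):
    """Given (process, resource, priority) requests for one time unit,
    grant each resource to its highest-priority requester."""
    granted = {}
    for proc, res, prio in requests:
        if res not in granted or prio > granted[res][1]:
            granted[res] = (proc, prio)
    return {res: proc for res, (proc, prio) in granted.items()}

# Two processes contend for the CPU; P2 has higher priority and wins.
print(step([("P1", "cpu", 1), ("P2", "cpu", 2)]))  # {'cpu': 'P2'}
```

A real prioritized semantics also has to account for synchronization delays; this sketch models only the arbitration step.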
1992 — 1997 |
Smith, Jonathan; Lee, Insup (co-PI); Davidson, Susan; Farber, David (co-PI); Winston, Ira |
CISE Educational Infrastructure: Telementoring: A Novel Approach to Teaching Undergraduate Computer Scientists @ University of Pennsylvania
This award is for the acquisition of teleconferencing and multimedia technology, and for curriculum changes to expand a curriculum in Telecommunications. The University of Pennsylvania will use a new instructional delivery system, the "video wall", to develop "Telementoring" as a long-distance learning technique. The video wall is already being used in research projects by members of the AURORA Gigabit Testbed, which is supported by NSF, DARPA, and a consortium of industrial research partners. Educational materials developed will be made available to other academic institutions through the Internet, and results of the educational experiments will be disseminated through publications and presentations at educational and professional meetings. The University of Pennsylvania plans to use this state-of-the-art instructional delivery system to provide multimedia and teleconferencing support for undergraduate courses in telecommunications. The video wall is an experimental video conferencing terminal with two large-screen projection televisions mounted side by side, creating the illusion of one large screen. Two cameras, co-located with the screens, are arranged to produce a single blended life-size image which is combined with high-quality directional sound.
1994 — 1996 |
Lee, Insup; Davidson, Susan; Smith, Jonathan (co-PI) |
Teleconferenced Workstations: Improving Experimentation in Undergraduate Education @ University of Pennsylvania
9451190 Lee A research platform developed at the University of Pennsylvania, "teleconferenced workstations", expands a cutting-edge curriculum in Telecommunications and Systems at Penn. "Teleconferenced workstations" are a unique mode of communication currently being used by members of the AURORA Gigabit Testbed, which connects Penn, MIT, IBM Research, and Bell Communications Research. The testbed service provision includes support for multimedia and teleconferencing. We will use and experiment with this technology as part of a new course entitled "Distributed and Real-time Systems", as well as in an existing course entitled "Telecommunications Networks". The courses not only discuss the principles behind the technology, but have a carefully constructed laboratory component with projects that apply the concepts covered in lectures to components of the prototyped system at Penn. This will not only augment our current curriculum in Telecommunications and Systems, but will serve as an excellent vehicle to evaluate the effectiveness of the teleconferenced workstation environment.
1994 — 1997 |
Davidson, Susan; Buneman, O. Peter; Overton, Chris |
Mediated Access to Biological Databases and Applications @ University of Pennsylvania
This award to an interdisciplinary group of scientists will develop both the theory and the tools for integrating heterogeneous databases and for querying them. The immediate application of this work will be to data generated by the genome community. However, the need for such tools is broader than that community. In particular, the investigators will provide two tools: the first is a view mechanism, which allows one to define a view (in yet another data model) encompassing parts of several related databases, and the second is a query language which can be used on this view. This work is being jointly supported by the Database Activity in the Biological Sciences and the Database and Expert Systems Program.
1999 — 2003 |
Davidson, Susan; Tannen, Val (co-PI); Buneman, O. Peter |
DLI Phase-2: Data Provenance @ University of Pennsylvania
Abstract
IIS-9817444 Buneman, Peter University of Pennsylvania $174,951
DLI Phase 2 - DATA PROVENANCE
This project will address issues associated with data provenance. Provenance is concerned with how information has arrived at the form in which it appears: who produced it, who has corrected it, how old it is, how it was originally produced, and so forth. Understanding provenance has occupied scientists, historians, textual critics and other scholars for centuries.
The provenance of data in databases is a newer and larger problem, because one is interested in data at all levels of granularity - from a single pixel in a digital image to a whole database. Just as scholars comment on documents by attaching annotations (marginalia) to text, part of the solution to recording provenance is the attachment of annotations to components of databases. Database researchers have recently considered loosely structured forms of data and have developed software systems for querying and storing such data. This work is closely related to new formats that have been developed for structured documents on the Web. It is expected that this technology will provide the substrate for recording and tracking provenance by advancing new data models, new query languages and new storage techniques.
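The marginalia analogy can be made concrete with a small invented sketch: a wrapper that pairs any component of a database, at any level of granularity, with a list of annotations. The class and field names here are hypothetical, not part of the project.

```python
# Hypothetical sketch: carrying annotations (provenance notes) with a
# value at any level of granularity, by pairing data with marginalia.

class Annotated:
    def __init__(self, value, notes=None):
        self.value = value          # the data itself (may nest Annotated)
        self.notes = notes or []    # provenance annotations

    def annotate(self, note):
        self.notes.append(note)
        return self

# Annotations attach to a single field or to the whole record alike.
db = Annotated({
    "gene": Annotated("BRCA1", ["curated by lab A, 1999"]),
    "locus": Annotated("17q21"),
})
db.annotate("imported from source X")
print(db.value["gene"].notes)   # ['curated by lab A, 1999']
```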
1999 — 2003 |
Overton, Chris; Davidson, Susan; Buneman, O. Peter |
Maintaining Curated View Databases @ University of Pennsylvania
This project will develop database tools that will enhance the ability of researchers in one emerging and interdisciplinary field -- forest canopy studies -- to collect, analyze, link, and archive their data.
The field of forest canopy studies is currently in its adolescence, and is growing rapidly in terms of interest from the scientific community, the general public, and policy makers. Coordinated study of the canopy is beginning at shared research sites, but canopy researchers see the lack of database tools that would allow them to gather their data in harmonized ways and to link their data for comparative studies as the major obstacle to answering emerging questions.
Previous NSF-sponsored collaborative research developed preliminary database tools and datasets, based on pilot studies of forest canopy structure/function relationships at the Wind River Canopy Crane Research Facility. These tools included data models, a metadata database, and a web-based project inventory for a multi-PI global climate change study.
This project develops these tools further by gathering archived and current data to test hypotheses on canopy structure/function relationships in a gradient of coniferous forests of differing structural diversity, ranging from very structurally complex (the late-successional primary forests) to very structurally simple (young monospecific plantation forests). These tools will be extended to incorporate forests of different structural constituents from two other forest types in Panama and Australia, both of which have a canopy crane facility and support collaborative canopy research. Researchers at the H.J. Andrews Experimental Forest, the Forest Science Database Center at Oregon State University, the Smithsonian Tropical Research Institute, and the Centre for Tropical Forest Studies will participate as part of a collaboration.
The project will disseminate findings through workshops and through the development of a "virtual center" of canopy research, building upon the existing International Canopy Network which is housed at Evergreen State. Research activities will be incorporated into opportunities for undergraduate and graduate students.
1999 — 2003 |
Davidson, Susan; Buneman, O. Peter |
A Deterministic Model For Semistructured and Structured Data @ University of Pennsylvania
With the development of the Internet, databases are being copied, transformed, and corrected on an unprecedented scale. With this comes the need to annotate data: for correcting it, commenting on it, or for recording its provenance. However, conventional databases have a regular structure and provide no "room" for annotations, which are typically irregular and required at all levels of granularity. Annotations require some form of semistructured representation. This investigation centers on a new model for structured and semistructured data which is intended to facilitate the process of data annotation. The usual models of semistructured data are based on an edge-labeled graph and are "nondeterministic" in that a given vertex may have several out-edges with the same label. The model proposed here is more restrictive in that these labels must all be different; it is less restrictive in that the labels may themselves carry data. In fact, a label may itself be a small piece of semistructured data. A number of advantages are claimed for this model. Like XML, the data representation is entirely syntactic. All operations are syntactic, and serialization of data is immediate. Object identifiers in particular are syntactic constructs: they have structure, and certain database transformations may be achieved by manipulation of this structure. The goal of the investigation is to develop the model and associated interfaces so that it can be used as a common representation for both semistructured and structured data. A second goal is to develop tools for data annotation. http://db.cis.upenn.edu/Research/
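One rough way to picture the deterministic model, going only on the description above: represent a vertex as a dictionary from labels to subtrees. Dictionary keys are necessarily distinct (the determinism requirement), and a key may itself be structured data, here a tuple standing in for a small piece of semistructured data. The example data is invented.

```python
# Deterministic edge-labeled data as nested dicts: the out-edges of a
# vertex are dict entries, so their labels are forcibly distinct, and a
# label may itself carry data (here a tuple acts as a structured label).

doc = {
    "person": {
        ("name", "first"): "Susan",
        ("name", "last"): "Davidson",
        "affiliation": {"university": "Penn"},
    }
}

def lookup(tree, path):
    """Follow a sequence of labels; determinism means at most one result."""
    for label in path:
        tree = tree[label]
    return tree

print(lookup(doc, ["person", ("name", "first")]))  # Susan
```

Because navigation is by distinct labels rather than by scanning equally-labeled edges, a path denotes at most one component, which is what makes attaching an annotation to a precise component straightforward.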
2002 — 2008 |
Palmer, Martha (co-PI); Liberman, Mark (co-PI); Joshi, Aravind; Davidson, Susan; Pereira, Fernando |
ITR: Mining the Bibliome -- Information Extraction From the Biomedical Literature @ University of Pennsylvania
EIA-0205448 Joshi, Aravind University of Pennsylvania
ITR: Mining the Bibliome -- Information Extraction from the Biomedical Literature
The major goal is the development of qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent research in high-accuracy parsing and shallow semantic analysis. The special focus will be on information relevant to drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline.
This project will also address several database research problems, including methods for modeling complex, incomplete and changing information using semistructured data, and also ways to connect the text analysis process to an information integration environment that can deal with the wide variety of extant bioinformatic data models, formats, languages and interfaces.
The engine of recent progress in language processing research has been linguistic data: text corpora, treebanks, lexicons, test corpora for information retrieval and information extraction, and so on. Much of this data has been created by Penn researchers and published by Penn's Linguistic Data Consortium. Hence, one of our major goals is to develop and publish new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structures (a "Treebank") and shallow semantic structures (a proposition bank, or "Propbank"); several large sets of biomedical abstracts and full-text articles annotated with entities and relations of interest to drug developers, such as enzyme inhibition by various compounds or genotype/phenotype connections ("Factbanks"); and broad-coverage lexicons and tools for the analysis of biomedical texts.
2003 — 2008 |
Davidson, Susan; Liberman, Mark; Santorini, Beatrice (co-PI); Bird, Steven (co-PI); Maxwell, Michael (co-PI) |
Querying Linguistic Databases @ University of Pennsylvania
With National Science Foundation support, Dr. Mark Liberman and Dr. Steven Bird will lead a team conducting three years of research on data models and query languages for linguistic databases. The project will develop relational and XML data models for linguistic databases combining annotated recordings, comparative wordlists, data tabulations, interlinear texts, syntactic trees, ontologies of descriptive terms, and links between all these types. High-level user interfaces will support query-by-example and online analytical processing, permitting linguists to select appropriate language data, integrate data from multiple sources, transform the structure of the data, add new annotations in collaboration with others, and convert it all to suitable formats for archiving and for use in research and teaching.
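As a purely illustrative sketch (the schema is invented here, not the project's actual design), time-aligned annotations over a recording might be held relationally as rows of (start, end, tier, label), over which a query selects the words overlapping a time interval:

```python
# Illustrative only: a flat relational encoding of time-aligned
# annotations over a recording, plus a simple interval-overlap query.
# Schema and data invented for this sketch.

annotations = [
    # (start_sec, end_sec, tier, label)
    (0.00, 0.41, "word", "describing"),
    (0.41, 0.55, "word", "and"),
    (0.55, 1.20, "word", "analyzing"),
    (0.00, 1.20, "phrase", "VP"),
]

def words_between(rows, t0, t1):
    """Select word-tier labels whose span overlaps the interval (t0, t1)."""
    return [label for s, e, tier, label in rows
            if tier == "word" and s < t1 and e > t0]

print(words_between(annotations, 0.0, 0.1))  # ['describing']
```

A real linguistic query language would add joins across tiers (e.g., words dominated by a phrase) and query-by-example front ends over this kind of representation.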
Describing and analyzing human languages depends on being able to manage large databases of annotated text and recorded speech. The size and complexity of these databases promises to bring unprecedented depth and breadth to empirical linguistic research. However, this promise will not be fulfilled until language scientists can readily access and manipulate the data. This project will apply recent research in databases to linguistics, develop a linguistic query language, and deploy it in a variety of open-source tools for creating, managing, analyzing, and displaying annotated linguistic databases. By making rich data re-usable, the research will open the way to a deeper and broader understanding of the world's languages.
2004 — 2007 |
Kurtzman, Gary; Blank, Kenneth; Tozeren, Aydin; Davidson, Susan |
Greater Philadelphia Bioinformatics Alliance
0332573 Tozeren
This award is to Drexel University to support the activity described below for 24 months. The proposal was submitted in response to the Partnerships for Innovation Program Solicitation (NSF-03521).
Partners: The partners include Drexel University (Lead Institution), Pennsylvania State University-Great Valley, Temple University, Thomas Jefferson University, University of Pennsylvania, University of the Sciences in Philadelphia, Children's Hospital of Philadelphia, Fox Chase Cancer Research Center, Wistar Institute, and BioAdvance-The Biotechnology Greenhouse Corporation (an alliance of the biopharmaceutical industry in the Philadelphia region).
The primary objective of this award is to "transform knowledge into innovation" in computational biotechnology in SE Pennsylvania. This is accomplished by developing training and education programs in bioinformatics; creating a virtual network of universities, industry, government agencies, and venture capitalists; promoting interdisciplinary teamwork; and supporting innovative business plans for commercially viable knowledge-based biotechnology ventures. The alliance activities have four main objectives that are interrelated and complementary: developing and maintaining a skilled workforce; creating a robust bioinformatics network; building a "computational orchestra" that will catalyze and capture innovation in bioinformatics and biomedicine; and helping create an infrastructure for commercialization of innovation. The activities also include development of multi-level, comprehensive and results-oriented educational and training programs to create and maintain a skilled bioinformatics workforce, from graduate level to continuing education.
Potential Economic Impact: The Greater Philadelphia region is home to approximately 80 percent of the pharmaceutical employment in the U.S. and is rich in medical institutions, medical colleges, and biotechnology startup businesses. The grant will transform the wealth of biology and computational science resources in the regional universities and research institutions into innovation to accelerate the growth of the life sciences industry in the region. The activities will create new companies and jobs, and provide the workforce for those jobs.
The intellectual merit of the activity lies in providing an integrated effort spanning fundamental research in the biological and computational sciences, creating a multilevel education and training program in bioinformatics, and supporting innovation in the region.
The broader impacts of the activity concentrate on creating a new education program that seamlessly integrates curricula from the vocational and high school level to the community college level to undergraduate and graduate degrees at one of several regional universities. Underrepresented groups will be involved in all of the activities of the grant.
2005 — 2009 |
Ives, Zachary (co-PI); Stoeckert, Christian (co-PI); White, Peter; Tannen, Val (co-PI); Davidson, Susan |
II: Data Cooperatives: Rapid and Incremental Data Sharing With Applications to Bioinformatics @ University of Pennsylvania
Generic tools and technologies for creating and maintaining data cooperatives (confederations whose purpose is distributed data sharing) will be developed to overcome the difficulties encountered in sharing information in the life sciences, specifically in bioinformatics.
The vision of large-scale data sharing has been a long-time goal of the bioinformatics field, much of it proceeding through data integration efforts. However, conventional approaches to data integration do not have the necessary flexibility and adaptability to make the existing and future plethora of data accessible and usable to typical biologists, while keeping it rapidly extensible to new concepts, domains, and types of queries, and thus fostering new research developments. The main reasons are that (1) different biologists work with different types of data and at differing levels of abstraction; (2) schemas in the bioinformatics world are typically large and complex; (3) queries and mappings may "break" without warning because of asynchronous updates; (4) it is logistically, economically and politically difficult to operate centralized data integration facilities. In response to these difficulties, data cooperatives emphasize: decentralization for both scalability and flexibility, incremental development of resources such as schemas, mappings, and queries, rapid discovery mechanisms for finding the resources relevant to a topic, and tolerance for intermittent participation of members and for approximate consistency of mappings. More specifically, the technical goals of the proposal include: (1) collaboratively developed yellow pages of biological topics; (2) schema templates, capturing the part of the structure of data pertaining to a specific interest and functioning also as visual templates from which a query form is created; (3) incremental specification of mappings; (4) reasoning about uncertainty in mappings by measuring with statistical tools their degree of reliability and using it in query answering; (5) multi-path answering for queries with caching and replication in a large-scale data cooperative where the participation of individual members may not always be assured.
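Item (4) might be pictured with a toy sketch: if each mapping carries a measured reliability, an answer reachable by several alternative mapping paths can be scored by its most reliable derivation. The scoring rule and the numbers below are invented for illustration, not the proposal's statistical treatment.

```python
# Hypothetical sketch: discounting answers by the reliability of the
# mappings used to derive them. Scores and composition rule invented.

mappings = {"m1": 0.9, "m2": 0.6}   # invented reliability scores

def score(answer_paths):
    """An answer derived via several alternative mapping paths keeps its
    best derivation; a path's score is the product of its mappings'
    reliabilities."""
    best = 0.0
    for path in answer_paths:
        s = 1.0
        for m in path:
            s *= mappings[m]
        best = max(best, s)
    return best

# Reachable through m1∘m2 or directly through m1; the direct path wins.
print(score([["m1", "m2"], ["m1"]]))  # 0.9
```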
Data cooperatives will have broader impact through applications in a variety of scientific and industrial fields, but it is in the field of bioinformatics that they are likely to have an immediate and significant impact. Therefore, a specific data cooperative will be developed as a biological testbed for evaluating the proposed technologies. This testbed is based on a small set of databases which are already collaborating and exchanging data related to Plasmodium falciparum. Broader impact will also be achieved through the proposed educational initiatives, specifically through a "computational orchestra" bioinformatics course which will expose students to data integration issues through project work, and a workshop for the Greater Philadelphia Bioinformatics Alliance (GPBA). Minority involvement will also be encouraged through a GPBA internship program.
2005 — 2009 |
Davidson, Susan |
Preserving Constraints in XML Data Exchange @ University of Pennsylvania
The goal of this research project is to study the interplay between constraints and mappings in data exchange between XML sources, or from XML to relational data sources. The approach consists of developing a language (or languages) sufficiently expressive to capture interesting classes of constraints, structure and mappings, together with techniques for reasoning about how constraints are translated through these mappings. Using these techniques, algorithms for reasoning about the correctness of mappings with respect to constraints will be developed. Since an XML view also represents a mapping between two different XML sources, one of which is virtual, the related question of how to map an update on an XML view to the underlying data source will also be considered.
The results of this work will provide the ability to detect whether or not semantic conflicts will arise before data exchange actually occurs, thus avoiding time-consuming and unanticipated errors as data loading is performed at the target site. The project is motivated by problems in bioinformatics applications involving gene expression data sharing between projects in the Penn Center for Bioinformatics. Since data exchange occurs in many different application domains, for example e-commerce, science, and government, the impact will be broadly applicable to all these areas.
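A toy example of the kind of conflict such analysis aims to detect, with invented data and an invented mapping: a key constraint that holds at the source can fail at the target once the mapping projects away part of the key.

```python
# Toy illustration of constraint preservation: a source key constraint
# is violated at the target when the mapping projects away a key field.
# Data and mapping invented for this sketch.

source = [
    {"lab": "A", "sample": 1, "value": 3.2},
    {"lab": "B", "sample": 1, "value": 4.1},
]

def satisfies_key(rows, key_fields):
    """True iff the given fields uniquely identify each row."""
    keys = [tuple(r[f] for f in key_fields) for r in rows]
    return len(keys) == len(set(keys))

# Source key (lab, sample) holds.
assert satisfies_key(source, ["lab", "sample"])

# The mapping drops 'lab'; the intended target key (sample) now fails,
# a conflict one would like to detect *before* exchanging the data.
target = [{"sample": r["sample"], "value": r["value"]} for r in source]
print(satisfies_key(target, ["sample"]))  # False
```

Static reasoning over the mapping, rather than checking instances after the fact as done here, is what lets such conflicts be caught before any data is loaded.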
The research performed will be included in an advanced database course taught by the PI, and undergraduate students will be involved in the research through a senior projects course. The results will be broadly disseminated via the project's Web site (http://www.cis.upenn.edu/~susan/NSF-IDM2005.htm).
2006 — 2011 |
Davidson, Susan; Tannen, Val; Kim, Junhyong (co-PI); Miller, Mark (co-PI) |
Collaborative Research: Core Database Technologies to Enable the Integration of AToL Information @ University of Pennsylvania
The AToL (Assembling the Tree of Life) is a large-scale collaborative research effort sponsored by the National Science Foundation to reconstruct the evolutionary origins of all living things. Currently 31 projects involving 150+ PIs are underway, generating novel data, including studies of bacteria, microbial eukaryotes, vertebrates, flowering plants and many more. The data being generated by these projects include, but are not limited to: (i) Specimens and their provenance, including collection information, voucher deposition, etc.; (ii) Phenotypic descriptions and their provenance; (iii) Genotypic descriptions and their provenance; (iv) Interpretation of the primary measurements, including homology; (v) Estimates of phylogenies and methods employed; and (vi) Post-tree analyses such as character evolution hypotheses. While the data collection, storage, and dissemination within each project are well coordinated, there is a critical need to develop the infrastructure to integrate all AToL data sources, allowing the individual efforts to become multipliers for global hypotheses. Furthermore, as the projects continue to expand and address diverse corners of the Tree of Life, efficient project management will be greatly aided by workflow and data management tools targeted towards the AToL problem domain. The project will develop new, compact, abstract data models for phylogenetics, leveraging use cases from a broad survey of empirical projects. The integration system will develop novel mappings between different phylogenetic data domains, and allow individual projects to join a network of integrated databases in an incremental manner. The data provenance system, which allows tracking of how each data object was created, will be unique to systematics data management.
The provenance system will not only allow tracking of what kinds of decisions were made in producing a particular tree or a particular column of a data matrix, but will also allow tracking of alternative data lineages such that, for example, different opinions on character homology might be tracked. The results of the research will be delivered in robust software tools that can be used by the entire evolutionary biology community. The study will develop a community-based formal model of data objects used in systematics, primarily through a continuing set of workshops. This activity will not only develop new data management tools, but will also have the effect of synthesizing disparate views of the phylogenetics research domains. The results of the system will be extensible to other domains of evolutionary biology, thereby contributing to the broader mission of evolutionary synthesis. The project will also provide training for the general systematics community in latest database technologies. Finally, by leveraging existing outreach efforts at the Penn Center for Bioinformatics, the project will link to other biological database efforts in genomics and biomedical sciences, disseminating phylogenetic information to the broad biomedical research community.
2006 — 2009 |
Davidson, Susan |
Collaborative Research: SEI+II ProtocolDB: Archiving and Querying Scientific Protocols, Data and Provenance @ University of Pennsylvania
This project is addressing a systemic problem in scientific research: although datasets collected through scientific protocols may be properly stored, the protocol itself is often only recorded on paper or stored electronically as the script developed to implement the protocol. Once the scientist who has implemented the protocol leaves the laboratory, this record may be lost. Collected datasets become meaningless without a description of the process used to produce them; furthermore, the experiment designed to produce the data is not reproducible.
This research is developing a database (ProtocolDB) to manage scientific protocols and the collected datasets obtained from their execution. The approach will allow scientists to query, compare and revise protocols, and express queries across protocols and data. The research is also addressing the issue of recording and querying the provenance (the why and where) of data. ProtocolDB will benefit scientists by providing a scientific portfolio for the laboratory which not only enables querying and reasoning about protocols, executions of protocols and collected datasets, but enables data sharing and collaborations between teams.
The intellectual merit of the research includes the design of a model for scientific workflows, and a query language to retrieve, transform, compare scientific workflows, integrate datasets, and reason about data provenance. This theoretical contribution will establish advances in the development of systems supporting the expression of scientific protocols. The ProtocolDB implementation will be evaluated by our scientific partners. The broader impact resulting from the project is the development of a general-purpose system for managing scientific protocols and their collected datasets. The established collaborations, involving academic, governmental, and private institutions, will contribute significantly to the breadth of its use.
2006 — 2007 |
Ives, Zachary (co-PI); Davidson, Susan |
SEIII: Workshop On Information Integration @ University of Pennsylvania
The purpose of the workshop is to devote national attention to the need for identifying and comprehensively examining the major challenges in the area of information integration (II) and the required long-term research, engineering and development that will be needed to advance the state of the art and state of the practice in the application of advanced IT to resolve these challenges in the near- and long-term. The type and modality of information available in digital form is vast - image, video, text, audio, sensor and other forms of streaming data, as well as structured data such as databases and XML/ HTML documents. Although the data may be geographically distributed, collected and designed for specific uses and applications ("silos" of information), it is often logically inter-related, and many important questions can only be answered by accessing it collectively. However, despite considerable research and development over the last 20 years, truly ad hoc II, where disparate information systems are accessed efficiently in real time in response to unanticipated information needs and data is combined to form reliable answers to queries, remains an elusive goal. A systematic development of II technologies is needed to provide the necessary infrastructure leading to significant advances in the access to and analysis of widely distributed, heterogeneous, disparate information resources. Of particular importance to this workshop will be issues associated with the integration of science and engineering data and Federal, medical, and other records.
|
1 |
2008 — 2014 |
Tannen, Val (co-PI) [⬀] Khanna, Sanjeev (co-PI) [⬀] Davidson, Susan |
N/A |
Iii-Cor-Medium: Providing Provenance Through Workflows and Database Transformations @ University of Pennsylvania
Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on data provenance. In the first, provenance in workflows, the emphasis is on extracting provenance from logs of events that record the application of different modules to various initial and derived datasets. In the second, provenance in databases, the emphasis is on the propagation of provenance through the operators that make up database views, or on the propagation of provenance through copy/cut-and-paste operations within and among databases. These two bodies of work employ different techniques, and at first glance their results appear quite different. However, in many scientific applications database manipulations co-exist with the execution of workflow modules, and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm.
By analyzing the work on data provenance in workflows and in databases, the PIs identify what they believe are the main difficulties in unifying and integrating these two different kinds of data provenance: (1) the lack of a data model that is rich enough to capture the interaction between the structure of the data and the structure of the workflow; and (2) the lack of a high-level specification framework in which database operators and workflow modules can be treated uniformly.
In this project, the PIs aim to overcome these difficulties and thus provide concepts and tools that allow a truly comprehensive approach to the provenance of scientific data. The project's approach relies on a data model that supports nested collections and on a functional-language approach to workflow specification. Based on this, the project aims to deliver a framework and tools for defining, managing and querying data provenance in complex scientific workflows that include database manipulations. The project is expected to impact bioinformatics (through interdisciplinary collaborations in the Penn Center for Bioinformatics and the Penn Genome Frontiers Institute) and phyloinformatics (through contributions to the NSF AToL program), as well as ongoing standardization work on provenance in workflows and in the business-process (e.g., BPEL) community.
The results of this project are disseminated as publications, through direct collaborations and through the project website: http://db.cis.upenn.edu/research/UNIPROVE.html.
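The database side of provenance described in this abstract is often formalized by tagging each tuple with the source tuples it was derived from and propagating those tags through query operators. The sketch below illustrates the idea for a single join; all relation names, attributes, and provenance tokens are invented for illustration and are not from the project itself.

```python
# Toy illustration of database-style provenance propagation: each tuple
# carries the set of source-tuple identifiers it was derived from, and a
# natural join unions the provenance of the tuples it combines.
# All names and tokens below are hypothetical.

def join(left, right, key):
    """Natural join on `key`; output provenance = union of input provenance."""
    out = []
    for lrow, lprov in left:
        for rrow, rprov in right:
            if lrow[key] == rrow[key]:
                merged = {**lrow, **rrow}
                out.append((merged, lprov | rprov))
    return out

# Hypothetical source relations: each tuple tagged with its own token.
genes = [({"gene": "g1", "chrom": "chr2"}, {"t1"}),
         ({"gene": "g2", "chrom": "chr7"}, {"t2"})]
annots = [({"gene": "g1", "function": "kinase"}, {"t3"})]

result = join(genes, annots, "gene")
for row, prov in result:
    print(row["gene"], sorted(prov))  # g1 derives from tokens t1 and t3
```

A full treatment would also propagate tags through projection, selection, and union, and would distinguish alternative derivations from joint ones; this sketch shows only the joint (join) case.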
|
1 |
2010 — 2014 |
Kafai, Yasmin (co-PI) [⬀] Sun, Joseph Powell, Rita (co-PI) [⬀] Griffin, Jean Davidson, Susan |
N/A |
Bpc-Dp: Penn Comp-Act: a College Service Learning Course to Promote and Increase Computational Thinking and Activities in Afterschool and Summer Programs @ University of Pennsylvania
The University of Pennsylvania proposes to develop and deploy a new CS service-learning course at the college level, integrated with the cascading mentoring of high school and middle school students. The course - called College Service Learning Course to Promote and Increase COMPutational Thinking and ACTion (Penn COMP-ACT) - will train college students to teach K-12 computational activities. Penn COMP-ACT undergraduates will learn about computational thinking, and then teach and mentor high school and middle school students in coordinated summer workshops and afterschool programs. The high school students will in turn work with the middle school students. This "learning-by-teaching" approach will improve all of the students' understanding of computational thinking and its purposes through exposure to a variety of hands-on software design activities and materials. The Penn COMP-ACT course leverages several prior successful efforts, including a pilot service-learning course in the CS program and the existing partnerships and programs within CS and Penn to recruit girls and minorities from the local community. It will be led by an interdisciplinary team of computer scientists, computer science educators, and K-12 educators.
|
1 |
2010 — 2014 |
Wang, Wen (co-PI) [⬀] Stolcke, Andreas Yuan, Jiahong [⬀] Liberman, Mark (co-PI) [⬀] Davidson, Susan |
N/A |
Ri: Medium: New Tools and Methods For Very-Large-Scale Phonetics Research @ University of Pennsylvania
The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are several orders of magnitude larger.
These new bodies of data are badly needed, to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers.
This project fills this gap by integrating, adapting and improving techniques developed in speech technology research, mainly forced alignment of digital audio with phonetic representations derived from orthographic transcripts. The research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours.
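Forced alignment, the core technique named above, assigns each audio frame to a phone in a known transcript-derived phone sequence. Real systems score frames with trained acoustic models; the dynamic-programming sketch below uses made-up scores purely to show the monotone segmentation step, and is not the project's implementation.

```python
# Toy dynamic-programming "forced alignment": given a fixed phone sequence
# and a per-frame score for each phone, find the monotone segmentation of
# frames into phones that maximizes total score. Scores are invented;
# real aligners derive them from acoustic models.

import math

def align(phones, scores):
    """scores[t][p] = score of frame t under phone index p.
    Returns the phone label assigned to each frame."""
    T, P = len(scores), len(phones)
    best = [[-math.inf] * P for _ in range(T)]
    back = [[0] * P for _ in range(T)]
    best[0][0] = scores[0][0]  # alignment must start in the first phone
    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]                              # remain in phone p
            advance = best[t - 1][p - 1] if p > 0 else -math.inf  # move to next phone
            if stay >= advance:
                best[t][p], back[t][p] = stay, p
            else:
                best[t][p], back[t][p] = advance, p - 1
            best[t][p] += scores[t][p]
    # Backtrace from the final frame, which must end in the last phone.
    path, p = [P - 1], P - 1
    for t in range(T - 1, 0, -1):
        p = back[t][p]
        path.append(p)
    path.reverse()
    return [phones[p] for p in path]

phones = ["k", "ae", "t"]
scores = [[5, 0, 0], [5, 0, 0], [0, 5, 0], [0, 5, 0], [0, 0, 5]]
print(align(phones, scores))  # → ['k', 'k', 'ae', 'ae', 't']
```

The monotonicity constraint (each frame stays in the current phone or advances to the next) is what distinguishes forced alignment from free recognition: the phone sequence is fixed by the transcript, and only the boundaries are inferred.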
|
1 |
2013 — 2017 |
Davidson, Susan Tannen, Val (co-PI) [⬀] Buneman, O. Peter |
N/A |
Iii: Medium: Collaborative Research: Citing Structured and Evolving Data @ University of Pennsylvania
Citation is an essential part of scientific publishing and, more generally, of scholarship. It is used to gauge the trust placed in published information and, for better or for worse, is an important factor in judging academic reputation. Now that so much scientific publishing involves data and takes place through a database rather than conventional journals, how is some part of a database to be cited? More generally, how should data stored in a repository that has complex internal structure and that is subject to change be cited?
The goal of this research is to develop a framework for data citation which takes into account the increasingly large number of possible citations; the need for citations to be both human and machine readable; and the need for citations to conform to various specifications and standards. A basic assumption is that citations must be generated, on the fly, from the database. The framework is validated by a prototype system in which citations conforming to pre-specified standards are automatically generated from the data, and tested on operational databases of pharmacological and Earth science data. The broader impact of this research is on scientists who publish their findings in organized data collections or databases; data centers that publish and preserve data; businesses and government agencies that provide on-line reference works; and on various organizations who formulate data citation principles. The research also tackles the issue of how to enrich linked data so that it can be properly cited.
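The assumption that citations are generated on the fly from the database can be sketched as follows: the citation is assembled from stored metadata plus a version identifier, so it remains accurate as the data evolves. The schema, field names, and citation format below are hypothetical, not the project's specification.

```python
# Illustrative sketch of generating a citation on the fly from a curated
# database. All metadata, field names, and the citation format are invented.

from datetime import date

def make_citation(db_meta, entry_id, version, accessed=None):
    """Assemble a human-readable citation from stored metadata,
    the dataset version, and the access date."""
    accessed = accessed or date.today().isoformat()
    entry = db_meta["entries"][entry_id]
    return (f"{entry['curators']}. \"{entry['title']}\". "
            f"In: {db_meta['name']}, version {version}. "
            f"{db_meta['publisher']}. Accessed {accessed}. "
            f"{db_meta['base_url']}/{entry_id}")

db_meta = {
    "name": "ExamplePharmaDB",
    "publisher": "Example Data Center",
    "base_url": "https://example.org/db",
    "entries": {"r42": {"title": "Receptor X ligands",
                        "curators": "A. Smith and B. Jones"}},
}
print(make_citation(db_meta, "r42", "2017.1", accessed="2017-03-01"))
```

Because the version and access date are part of the generated citation, a reader can in principle recover exactly the state of the data being cited even after the database changes; handling citations to arbitrary query results over evolving data is the harder problem the project addresses.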
|
1 |
2015 — 2018 |
Ives, Zachary [⬀] Tannen, Val (co-PI) [⬀] Davidson, Susan Kannan, Sampath (co-PI) [⬀] |
N/A |
Cici: Data Provenance: Provenance-Based Trust Management For Collaborative Data Curation @ University of Pennsylvania
Data-driven science relies not only on statistics and machine learning, but also on human expertise. As data are being collected to tackle increasingly challenging scientific and medical problems, there is a need to scale up the amount of expert human input (curation and, in certain cases, annotation) accordingly. This project addresses this need by developing collaborative data curation: instead of relying on a small number of experts, it enables annotations to be made by communities of users of varying expertise. Since the quality of annotations by different users will vary, novel quantitative techniques are developed to assess the trustworthiness of each user, based on their actions, and to distinguish trustworthy experts from unskilled and malicious users. Algorithms are developed to combine users' annotations based on their trustworthiness. Collaborative data curation will greatly increase the amount of human-annotated data, which will, in turn, lead to better Big Data analysis and detection algorithms for the life sciences, medicine, and beyond.
The central problems of collaborative data curation lie in the high variability in the quality of users' annotations, and variability in the form the data takes when they annotate it. The proposal develops techniques to take annotations made by different users over different views of data (such as an EEG display with filters and transformations applied to the signal), to use provenance to reason about how the annotations relate to the original data, and to reason about the reliability and trustworthiness of each user's annotations over this data. To accomplish this, the research first defines data and provenance models that capture time- and space-varying data; novel reliability calculus algorithms for computing and dynamically updating the reliability and trustworthiness of individuals, based on their annotations and how these compare to annotations from recognized experts and the broader community; and a high-level language called PAL that enables the researchers to implement and compare multiple policies. The researchers will initially develop and validate the techniques on neuroscience and time series data, within a 900+ user public data sharing portal (with 1500+ EEG and other datasets for which annotations are required). The project team will later expand the techniques to other data modalities, such as imaging and genomics.
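The idea of weighting users by agreement with recognized experts and then combining their annotations can be sketched in miniature. This is a deliberately simplified stand-in for the project's reliability calculus, not the PAL language or its actual policies; all users, items, and labels below are invented.

```python
# Toy "reliability calculus": each user's trust weight is their accuracy on
# expert-labeled items; annotations on unlabeled items are then combined by
# trust-weighted vote. All data below is invented for illustration.

def trust_weights(annotations, expert_labels):
    """Fraction of expert-labeled items each user annotated correctly.
    Users with no graded items get a neutral weight of 0.5."""
    weights = {}
    for user, labels in annotations.items():
        graded = [(item, lab) for item, lab in labels.items()
                  if item in expert_labels]
        correct = sum(lab == expert_labels[item] for item, lab in graded)
        weights[user] = correct / len(graded) if graded else 0.5
    return weights

def combine(annotations, weights, item):
    """Trust-weighted vote over the users who annotated `item`."""
    tally = {}
    for user, labels in annotations.items():
        if item in labels:
            tally[labels[item]] = tally.get(labels[item], 0.0) + weights[user]
    return max(tally, key=tally.get)

annotations = {
    "expert_alice": {"e1": "seizure", "e2": "normal", "x1": "seizure"},
    "novice_bob":   {"e1": "normal",  "e2": "normal", "x1": "normal"},
}
expert_labels = {"e1": "seizure", "e2": "normal"}  # gold labels for e1, e2

w = trust_weights(annotations, expert_labels)
print(combine(annotations, w, "x1"))  # alice (weight 1.0) outvotes bob (0.5)
```

A real policy would also update weights dynamically as new expert judgments arrive and account for annotations made over transformed views of the data, which is where the provenance reasoning described above comes in.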
|
1 |