1991 — 2002 |
Gleitman, Lila (co-PI); Joshi, Aravind; Liberman, Mark |
N/A |
Center For Research in Cognitive Science @ University of Pennsylvania
This proposal from the University of Pennsylvania requests funds to establish a Science and Technology Center for Research in Cognitive Science. The Director of the Center will be Professor Aravind K. Joshi. The Center for Research in Cognitive Science unites a diverse and richly interconnected group from many traditional disciplines (computer science, linguistics, mathematics, philosophy, and psychology). The goal of the research is to understand the processes and mechanisms by which human beings acquire knowledge about their environment, store and retrieve that knowledge, communicate it to others, and apply it to carry out actions and manipulate their environment. The research is organized into three separate but highly interrelated themes: perception and action, language learning, and language processing. Research in the area of perception and action spans the processes involved in the first stages of visual and auditory representation of spatial and spectral information, to higher-order representations of more complex attributes, to the storage and retrieval of such representations by the organism as they are used in goal-oriented actions. The study of language learning focuses on how children develop the abstract representations of language on the basis of their visual and auditory perceptions. The research in language processing combines investigation of formal systems with investigation of computational models, all in the context of empirical study of a wide range of natural languages. Significant features of the perception and action research are its increasing fidelity to actual neural computation and its sophisticated computational modeling and related potential for contributing to artificial intelligence technology.
The language learning research has significant potential for technological spin-off in machine learning and automatic acquisition of lexical and grammatical information for language systems, crucial to the development of grammars sufficient for the robust analysis of unconstrained text. The language processing research will have significant impact on the technological base for human-computer interaction, in particular the design of natural language interfaces for database and expert systems and knowledge-rich systems in general. This Center will stimulate enhanced activity in precollege education and in the development of human resources.
|
0.915 |
1991 — 1993 |
Liberman, Mark |
N/A |
Electronic Materials For Natural Language Research @ University of Pennsylvania
This software capitalization project funds the reformatting of the online text data that exists as part of the Association for Computational Linguistics Data Encoding Initiative into a common SGML-based format, and its release to the research community at low cost and with minimal restrictions. This is the first of several collections that are being reformatted. The project enables scaling up of natural language research so that more realistic problems can be studied. This is particularly relevant for applications in the recognition and analysis of text and speech. Existing generally available text databases are too small. It is expensive and time-consuming to obtain sufficient text and to make it usable for research, and it is wasteful for individual researchers to duplicate this effort. A common database will permit published results to be replicated or extended. There is joint funding for this project with other NSF offices and with DARPA.
|
0.915 |
1995 — 2001 |
Liberman, Mark |
N/A |
HLR: Improved Speech and Text Data Resources @ University of Pennsylvania
This is the initial funding of a 36-month cooperative agreement to fund the Linguistic Data Consortium (LDC). Under this funding, the LDC will create new databases and improve infrastructure to continue and improve its work of creating, publishing and distributing data for research on human language technology. Improved speech and text data resources will be created in three areas: multilingual speech, multilingual and parallel text, and human-computer interaction. The LDC will publish and distribute these resources using the same approach as for the many databases already in its catalogue. In addition, the LDC proposes to provide improved Internet-based access for its members to essentially its entire archive of data, except where license agreements stand in the way. This access would encompass both WWW-based search and retrieval of modestly sized pieces of the data, and also (on an experimental basis) wide-area remote file mounting for access to entire databases. Significant LDC cost-sharing is proposed for all of these activities, including the donation of time and facilities as well as the expenditure of cash for equipment, license fees and additional needed labor.
|
0.915 |
1998 — 2002 |
Metaxas, Dimitris; Liberman, Mark; Badler, Norman (co-PI) |
N/A |
Care: National Center For Sign Language & Gesture Resources @ University of Pennsylvania
The University of Pennsylvania and Boston University are collaborating on the establishment and maintenance of resources for research in sign language and gesture. The goal of this project is to make available several different types of experimental resources and analyzed data to facilitate linguistic and computational research on signed languages and the gestural components of spoken languages. Activities in the project include the following:
* A facility for collection of video-based language data will be established, equipped with synchronized digital cameras to capture multiple views of the subject.
* A substantial corpus of American Sign Language video data will be collected from native signers and made available in both compressed and uncompressed forms.
* Significant portions of the collected data will be linguistically annotated. The database of linguistic annotations will be made publicly available, along with the applications needed to access the database.
* Video data will be analyzed using computer-based algorithms, with analysis and software made publicly available.
Thus the project makes available sophisticated facilities for data collection, a standard protocol for such collection, and large amounts of language data. The combination of linguistic and computational expertise in this project will ensure scientific integrity of data collection, and will result in useful data for researchers in a variety of fields.
|
0.915 |
2000 — 2006 |
Liberman, Mark; Cieri, Christopher |
N/A |
Networking Data Centers @ University of Pennsylvania
This project will initiate collaboration between ELDA/ELRA and LDC that includes networking and cross-agreements between the two organizations for the production, acquisition, normalization, certification and distribution of novel language data resources for research, education and technology development. A reciprocity agreement to be negotiated between LDC and ELRA will take into account both European and American constraints on the distribution of data. The parties will implement the legal agreement in a concrete manner through the production of a large-scale broadcast news corpus encompassing data in over 45 languages, where the legal, technical, and distribution issues will be sorted out in accordance with the cross-agreement. The production process will be conducted in accordance with best practices in this area as defined by ELRA's and LDC's previous work, and in particular will take advantage of LDC's previous experience in the collection of single language broadcast news collections and Internet-based distribution of language resources. The joint undertaking will provide a concrete test case for transatlantic cooperation between the two organizations while simultaneously developing a unique resource to facilitate the other projects sponsored under this joint EC/NSF initiative.
|
0.915 |
2000 — 2003 |
Liberman, Mark; Bird, Steven |
N/A |
Multidimensional Exploration of Linguistic Databases @ University of Pennsylvania
This project aims to foster a new mode of fundamental research in linguistics, namely "web-based exploration of linguistic field data." The objectives are to develop tools for manipulating linguistic databases; to store and disseminate large datasets using the model; to exploit the tools and datasets in teaching and research; and, underlying all of the above, to explore new methods for representing and analyzing linguistic data. The consequence of this research will be increased accessibility, accountability, and stability of empirical linguistic research.
The project will provide wide-ranging support for empirical linguistic research, through the combination of traditional field methods with new technologies for exploring and visualizing complex databases. The interlinked, heterogeneous, and multimodal aspects of the data will be a key component, and the research will encompass data types including lexicons, interlinear texts, field notes, paradigms, grammar sketches, annotated recordings, annotated maps and photographs, folios, course notes, and problem sets, as well as links between all of these. A set of collaborators have granted access to their field data for the purposes of this project, and have agreed to road-test the new tools in their ongoing fieldwork. All of the primary data created by the project will be published on the web site of the Linguistic Data Consortium (LDC), for general public access, subject to the appropriate permissions having been granted. All tools and documentation produced by the project will be freely available to others.
|
0.915 |
2000 — 2003 |
Palmer, Martha; Liberman, Mark |
N/A |
MLIAM: ISLE-International Standards For Language Engineering @ University of Pennsylvania
This is the first year of funding of a two-year continuing award. The increased interest in multilingual information processing, which requires detailed mappings between languages, has highlighted the need for international standards and agreed-upon evaluation/validation procedures. We can no longer afford to (re)develop language resources for each new application; a shared broad-scale platform of basic components is an absolute necessity as a common infrastructure to ensure the interoperability of systems through compatible interfaces and components that can be readily integrated and reused (plug and play). This project provides a framework for achieving international consensus on essential standards that would enable the sharing of resources and components on a global scale. The PI's approach builds on an already existing methodology for achieving consensus that has been developed within the EAGLES standardization initiative in Europe (http://www.ilc.pi.cnr.it/EAGLES96/rep2.html). The PI will spearhead the formation of an equivalent American group, AIGLES, the American Interest Group on Language Engineering Standards (also French for eagles), that will join forces with the Europeans in the development of International Standardization for Language Engineering (ISLE). The underlying philosophy of the effort will be not to prove the truth or correctness of a particular theoretical approach, but rather to agree on a common format that can allow the merger of multiple sources of information. In order to move forward rapidly, coarse distinctions will be made initially and the results later refined (an approach that was anathema some years ago).
The work will focus on three distinct areas that the PI considers the most critical for continued progress in multilingual information processing: standardization of multilingual lexicons, with lexical semantics; standardization of paradigms for natural interaction in multimodal systems; and evaluation of machine translation systems and spoken language systems.
|
0.915 |
2002 — 2008 |
Dill, Ken; Lafferty, John (co-PI); Liberman, Mark; Joshi, Aravind; Pereira, Fernando |
N/A |
ITR: Language, Learning, and Modeling Biological Sequences @ University of Pennsylvania
EIA-0205456 Joshi, Aravind K. University of Pennsylvania
ITR: Language, Learning, and Modeling Biological Sequences
Recent significant advances in natural language processing, such as the integration of grammatical and probabilistic machine-learning techniques, have not been exploited for modeling biological sequences. These new techniques are highly relevant to the biological domain because they support the integration of sequence features at several scales, from dependencies between successive items through dependencies involving complex structures to overall sequence statistics. Hence, the major goals to be pursued are: (1) Development of new techniques for integrating grammatical and probabilistic information, in particular, integration and evaluation of grammatical, probabilistic, and approximate counting methods for fold prediction in secondary and tertiary structures of biomolecules. (2) Development and evaluation of probabilistic exponential models for gene finding, in particular genes for apicoplast-targeted proteins in eukaryotic human pathogens of the phylum 'Apicomplexa'.
This research is highly interdisciplinary, involving the disciplines of computer science, biology and linguistics. It will have a significant impact on the modeling of biological sequences. It will also provide a wonderful opportunity to train new researchers to carry out this interdisciplinary research, thus contributing to science and mathematical education and human resource development.
The proposed research arose out of many discussions that took place at a landmark workshop on 'Language Modeling of Biological Data' held at the University of Pennsylvania in February 2001.
|
0.915 |
2002 — 2008 |
Palmer, Martha (co-PI); Liberman, Mark; Joshi, Aravind; Davidson, Susan (co-PI); Pereira, Fernando |
N/A |
ITR: Mining the Bibliome -- Information Extraction From the Biomedical Literature @ University of Pennsylvania
EIA-0205448 Joshi, Aravind University of Pennsylvania
ITR: Mining the Bibliome -- Information Extraction from the Biomedical Literature
The major goal is the development of qualitatively better methods for automatically extracting information from the biomedical literature, relying on recent research in high-accuracy parsing and shallow semantic analysis. The special focus will be on information relevant to drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline.
This project will also address several database research problems, including methods for modeling complex, incomplete and changing information using semistructured data, and also ways to connect the text analysis process to an information integration environment that can deal with the wide variety of extant bioinformatic data models, formats, languages and interfaces.
The engine of recent progress in language processing research has been linguistic data: text corpora, treebanks, lexicons, test corpora for information retrieval and information extraction, and so on. Much of this data has been created by Penn researchers and published by Penn's Linguistic Data Consortium. Hence, one of our major goals is to develop and publish new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structures ('Treebank') and shallow semantic structures (proposition bank, or 'Propbank'); several large sets of biomedical abstracts and full-text articles annotated with entities and relations of interest to drug developers, such as enzyme inhibition by various compounds or genotype/phenotype connections ('Factbanks'); and broad-coverage lexicons and tools for the analysis of biomedical texts.
|
0.915 |
2002 — 2006 |
Liberman, Mark; Joshi, Aravind |
N/A |
CISE Research Resources: Discourse Penn Treebank and Multimodal FORM: Development of Two Richly Annotated Corpora @ University of Pennsylvania
EIA-0224417 Aravind K. Joshi Mark Liberman University of Pennsylvania
CISE RR: Discourse Penn Treebank and Multimodal FORM: Development of Two Richly Annotated Corpora
This project, providing critical resources for research on discourse modeling and conversational interaction, aims at developing new technologies and systems for information retrieval and human-computer interaction. Centering on the construction of annotated corpora, it will build two large-scale resources, one in the discourse domain and one in the dialog domain:
1. Discourse Penn Treebank (DPTB)
2. MultiFORM: Augmenting the FORM corpus with body movements, speech, and intonation.
The former project develops a large-scale and reliably annotated corpus that will encode coherence relations associated with discourse connectives, including their argument structure and anaphoric links, thus exposing a clearly defined level of discourse structure and supporting the extraction of a range of inferences associated with discourse connectives. This annotation will be "on top of" the Penn Treebank (PTB) annotations as well as the predicate-argument annotations of PTB (called the Proposition Bank or Prop Bank). The latter involves a corpus of gesture-annotated videos, FORM, that was designed to be extensible in order to eventually represent the entire multimodal experience of conversational interaction. This multimodal FORM, MultiFORM, will be created by adding body movement, speech and syntactic structure, and intonation. Large-scale annotated corpora have played a critical role in speech and natural language research by enabling large-scale integration of statistical knowledge (derived from the corpora) with linguistic knowledge (as represented in annotations), leading to scientific and technological advances. Representative examples include robust parsing and automatic extraction of relations and coreferences, and their applications to information extraction, question answering, summarization, and machine translation. PTB, a resource developed a decade ago, is an example of such a resource that has had an impact on natural language processing worldwide. PTB deals with corpora at the sentence level, which warrants new large-scale, reliably annotated corpora of discourse and dialog structure. Although intellectual and practical connections exist between studies of the structures of discourse and dialog, the initial requirements for resources to study these areas diverge while overlapping in conception. On the discourse side, we need corpora that deal with the kinds of structures found in composed text such as journalistic articles.
The dialog side needs to focus on interactions among people and on extemporized rather than pre-composed material.
|
0.915 |
2003 — 2010 |
Liberman, Mark |
N/A |
ITR-SCOTUS: a Resource For Collaborative Research in Speech Technology, Linguistics, Decision Processes and the Law @ University of Pennsylvania
This project will create a digital audio archive that will enable scientists in several fields to approach novel research issues in speech and language studies, issues in group decision-making, and issues at the leading edge of human communication scholarship. The Supreme Court of the United States (SCOTUS) has been recording its public proceedings since 3 October 1955. These recordings - now in the National Archives - span nearly five decades and consist principally of oral arguments in which justices and attorneys engage in various forms of persuasion and communication between bench and bar and, obliquely, among the justices themselves. The arguments have been transcribed professionally across the entire period, creating a matchless collection of audio materials coupled with highly accurate transcripts. The audio - along with other activities captured on audio such as the announcement of opinions - offers a unique opportunity for researchers across a wide spectrum of disciplines to engage in novel and transforming research projects that were once thought beyond the reach of investigators.
The chief result of this work will be a complete and continuing archive of more than six thousand hours of SCOTUS audio. The project will provide synchronized (i.e., time-coded) transcripts of the collection, identify and tag individual speakers, build new mark-up tools for these new domains, and share the corpus with researchers and faculty. This interaction among political scientists, legal scholars, linguists, and computer scientists will yield new knowledge in the modeling of multi-party discussions with complex goals, novel strategies in small group decision process analysis, and path-breaking approaches to extended collaborative commentary addressing the dynamics of human communication.
The SCOTUS archive will be maintained as a shared public resource to enhance study and understanding of the Supreme Court of the United States. It will be available to anyone with World Wide Web access. Based on past experience, principal audiences include: researchers across diverse domains, teachers and students, lawyers and litigants, and the visually- and hearing-impaired.
Today, more than a million unique users access selected SCOTUS materials each month. With a complete and updated SCOTUS archive and improved ability to query and search, the number of users should expand substantially.
By exploiting common interest and beneficial interactions among diverse research communities, this project will create a vast collection of digital objects. Working with partners experienced in data-sharing, the effort aims at revolutionizing the ability to collaborate with physically distributed teams of researchers and their students.
|
0.915 |
2003 — 2008 |
Davidson, Susan (co-PI); Liberman, Mark; Santorini, Beatrice (co-PI); Bird, Steven (co-PI); Maxwell, Michael (co-PI) |
N/A |
Querying Linguistic Databases @ University of Pennsylvania
With National Science Foundation support, Dr. Mark Liberman and Dr. Steven Bird will lead a team conducting three years of research on data models and query languages for linguistic databases. The project will develop relational and XML data models for linguistic databases combining annotated recordings, comparative wordlists, data tabulations, interlinear texts, syntactic trees, ontologies of descriptive terms, and links between all these types. High-level user interfaces will support query-by-example and online analytical processing, permitting linguists to select appropriate language data, integrate data from multiple sources, transform the structure of the data, add new annotations in collaboration with others, and convert it all to suitable formats for archiving and for use in research and teaching.
Describing and analyzing human languages depends on being able to manage large databases of annotated text and recorded speech. The size and complexity of these databases promise to bring unprecedented depth and breadth to empirical linguistic research. However, this promise will not be fulfilled until language scientists can readily access and manipulate the data. This project will apply recent research in databases to linguistics, develop a linguistic query language, and deploy it in a variety of open-source tools for creating, managing, analyzing, and displaying annotated linguistic databases. By making rich data reusable, the research will open the way to a deeper and broader understanding of the world's languages.
|
0.915 |
2005 — 2013 |
Cheney, Dorothy (co-PI); Gleitman, Lila (co-PI); Trueswell, John; Liberman, Mark; Pereira, Fernando |
N/A |
IGERT: the Dynamics of Communication in Context @ University of Pennsylvania
This Integrative Graduate Education and Research Training (IGERT) award supports a multidisciplinary graduate training program at the University of Pennsylvania designed to integrate the computational, cognitive and neuroscientific study of communication and communication systems, be they characterized as human-linguistic, animal or machine. The primary purpose is to create a new breed of communication scientists capable of integrating theoretical issues, methods, and formalisms that are currently distributed across graduate programs as diverse as anthropology, biology, computer science and engineering, linguistics, neuroscience, philosophy, and psychology. The intellectual merit consists of the two interrelated research themes that will unite and guide graduate training. The first theme emphasizes communication as a dynamical process, one that unfolds along multiple time scales varying from milliseconds (as in planning and understanding speech) to centuries (as in evolving dialects, languages, and systems of animal communication). The second theme emphasizes communication as a context-sensitive process, where contexts range from the physical setting and communicative history of a specific conversation, to the linguistic, social and technological assumptions of social groups. Trainees will be co-advised by a multidisciplinary faculty team and will commit to a five-year graduate training program, consisting of: (1) core disciplinary training in one of the current graduate programs above; (2) one-year cross-disciplinary training in a chosen second discipline, including completion of a publishable research project; (3) participation in a weekly interdisciplinary research meeting throughout the 5-year program; and (4) completion of an advanced course in the mathematical foundations of communication specifically designed for this program. Broader impacts of this program include applications in industry, technology, and clinical settings. 
IGERT is an NSF-wide program intended to meet the challenges of educating U.S. Ph.D. scientists and engineers with the interdisciplinary background, deep knowledge in a chosen discipline, and the technical, professional, and personal skills needed for the career demands of the future. The program is intended to catalyze a cultural change in graduate education by establishing innovative new models for graduate education and training in a fertile environment for collaborative research that transcends traditional disciplinary boundaries.
|
0.915 |
2007 — 2011 |
Liberman, Mark; Bird, Steven (co-PI) |
N/A |
Collaborative Research: OLAC: Accessing the World's Language Resources @ University of Pennsylvania
Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained and distributed in digital form. Although language resources have begun to proliferate on the web, they are often difficult or impossible to locate and reuse. In this collaborative research project, Drs. Mark Liberman and Steven Bird of the University of Pennsylvania and Dr. Gary Simons of the Graduate Institute of Applied Linguistics will address this problem through new research to enhance the digital infrastructure of the Open Language Archives Community (OLAC). OLAC provides a standard set of language resource descriptors and a portal that permits users to query dozens of language archives simultaneously using a single search. However, the current coverage of OLAC is only the tip of the iceberg. The aim of the project is to greatly improve access to language resources for linguists and the broader communities of interest, by achieving an order-of-magnitude increase in the coverage of the OLAC catalog and in the use of OLAC search services. The project will do so through two main areas of activity: developing guidelines and services that encourage language archives to follow best common practices that will facilitate language resource discovery through OLAC, and developing services to bridge from the resource catalogs of the library and web domains to the OLAC catalog.
The project should have a broad impact across the field of linguistics by developing an online service that gives linguists access to resources for the thousands of languages in the world. But the impact will extend well beyond the linguistics community. Access to these language resources will assist technologists who are endeavoring to make information technologies work with every language, not just a select few. It will also permit educators, students and members of society at large to access a wealth of materials that demonstrate the full range of linguistic diversity in the world. The actual speakers of all the languages of the world are yet another audience for these resources. In the case of endangered languages, access to language resources is a critical asset in the process of language revitalization. The project will also serve to advocate the widespread use of ISO 639-3, a newly adopted standard that provides codes for precisely identifying the 7,500 known human languages, past and present. This will encourage reform in current cataloging practice, which is based on an earlier ISO standard that recognizes fewer than 400 languages, and begin the process of helping the major storehouses of knowledge around the world to deal appropriately with linguistic diversity.
|
0.915 |
2010 — 2012 |
Yuan, Jiahong (co-PI); Liberman, Mark; Cieri, Christopher (co-PI) |
N/A |
EAGER: Mining a Year of Speech @ University of Pennsylvania
Technologies for storing and processing vast amounts of text are mature and well-defined. In contrast, technologies for browsing or mining content from large collections of non-textual material, especially audio and video, are less well developed. Large-scale data mining on text has helped transform the relevant disciplines; the disciplines dealing with spoken language will reap similar benefits from accessible, searchable, large corpora.
This project explores the difficult problem of providing rich, intelligent data mining capabilities for a substantial collection of spoken audio data in American and British English. It applies and extends state-of-the-art techniques to offer sophisticated, rapid and flexible access to a richly annotated corpus of a year of speech (about 9,000 hours, 100 million words, or 2 terabytes), derived from the Linguistic Data Consortium, the British National Corpus, and other existing resources. This is ten times more data than has previously been used by researchers in fields such as phonetics, linguistics, and psychology, and 100 to 1,000 times the amounts that are used in common practice.
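As a rough consistency check on the corpus figures quoted above, the following sketch compares them with the length of a calendar year and derives the speaking and storage rates they imply. The input numbers come from the abstract; the use of decimal terabytes and a non-leap year are assumptions made here for illustration only.

```python
# Back-of-the-envelope check of the corpus figures quoted in the abstract.
HOURS_PER_YEAR = 365 * 24          # hours in a non-leap year (assumption)

corpus_hours = 9_000               # "about 9,000 hours" (from the abstract)
corpus_words = 100_000_000         # "100 million words" (from the abstract)
corpus_bytes = 2 * 10**12          # "2 terabytes", assuming decimal terabytes

# Rates implied by the three headline numbers taken together.
words_per_minute = corpus_words / (corpus_hours * 60)
bytes_per_second = corpus_bytes / (corpus_hours * 3600)

print(f"hours in a year: {HOURS_PER_YEAR}")
print(f"implied speaking rate: {words_per_minute:.0f} words/minute")
print(f"implied storage rate: {bytes_per_second / 1000:.1f} kB/s")
```

Under these assumptions, 9,000 hours is slightly more than one calendar year of continuous audio (8,760 hours), and the three figures together imply about 185 words per minute and roughly 62 kB of audio per second.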
Speech-to-text alignment and search tools will open a new universe of data to researchers in many fields, from linguistics and phonetics to anthropology, speech communication, oral history, and media studies. Audio-video usage on the internet is large and growing at an extraordinary rate, offering increasingly large amounts of an increasingly large range of material. Reliable automatic annotation, indexing and search of this material will allow researchers to examine the distribution of both form and content across time, space, and social structure.
|
0.915 |
2010 — 2016 |
Liberman, Mark Bird, Steven (co-PI) [⬀] |
N/A |
Prosodic Systems in New Guinea: Integrating Computational and Typological Approaches to Linguistic Analysis @ University of Pennsylvania
The world's languages make heavy use of prosody--tone, stress, intonation, and length--to communicate meaning, and tone is the most complex of these elements. Although non-tone languages typically exploit pitch for intonational purposes, the more sophisticated use of pitch in tone languages means that speakers of such languages will have quite different mental representations of pitch from speakers of English and better-known European non-tone languages. This project will investigate the tone and reduced-tone languages of New Guinea, a linguistically under-investigated area of the world which is home to a sixth of the world's languages. The project will collect substantial new bodies of recorded and transcribed language data from several undescribed tone languages. It will then use computational and theoretical methods to analyze the geographical distribution of tonal properties and the interaction of tone and other prosodic features.
The project will incorporate technology into linguistic field work and develop an exemplary model of prosodic description. Language consultants will be trained in the model's use, leading to more accessible primary data and more accountable descriptions. The data will be made available in a form that can be readily used by scholars, language teachers, and communities of speakers and will support the development of writing systems and literacy programs for these languages.
|
0.915 |
2010 — 2014 |
Wang, Wen (co-PI) [⬀] Stolcke, Andreas Yuan, Jiahong [⬀] Liberman, Mark Davidson, Susan (co-PI) [⬀] |
N/A |
Ri: Medium: New Tools and Methods For Very-Large-Scale Phonetics Research @ University of Pennsylvania
The field of phonetics has experienced two revolutions in the last century: the advent of the sound spectrograph in the 1950s and the application of computers beginning in the 1970s. Today, advances in digital multimedia, networking and mass storage are promising a third revolution: a movement from the study of small, individual datasets to the analysis of published corpora that are several orders of magnitude larger.
These new bodies of data are badly needed, to enable the field of phonetics to develop and test hypotheses across languages and across the many types of individual, social and contextual variation. However, in contrast to speech technology research, speech science has so far taken relatively little advantage of this opportunity, because access to these resources for phonetics research requires tools and methods that are now incomplete, untested, and inaccessible to most researchers.
This project fills this gap by integrating, adapting and improving techniques developed in speech technology research, mainly forced alignment of digital audio with phonetic representations derived from orthographic transcripts. The research will help the field of phonetics to enter a new era: conducting research using very large speech corpora, in the range from hundreds of hours to hundreds of thousands of hours.
|
0.915 |
2011 — 2013 |
Yaeger-Dror, Malcah Reed, Alyson Liberman, Mark Cieri, Christopher [⬀] |
N/A |
Workshop On Sociolinguistic Archive Preparation @ University of Pennsylvania
For nearly five decades sociolinguists and dialectologists have studied the differences in pronunciation, word choice and grammar as correlated with the demographics and attitudes of the speakers and the situations in which they find themselves. This work has important implications for society, education, politics, technology development and forensics. Sociolinguists routinely produce recordings of natural speech, variously transcribed and quantitatively analyzed for dialect features plus careful descriptions of the speakers' characteristics and the interview situation. These data have important potential for linguists, scholars in language related fields and computer scientists developing human language technologies. Although many sociolinguists are eager to share their work, there have been impediments to such sharing. The proposed workshop will address two of the most important. First, within the United States, an Institutional Review Board (IRB) must approve any research involving human subjects. The vast majority of sociolinguistic research involves extremely low risk, and potentially high social benefit particularly for minority communities, but no common body of practice exists for permitting data to be shared. Nor is there a common body of practice with respect to the demographic, attitudinal and situational information collected, complicating sharing and comparison across studies.
The workshop will gather leading sociolinguists and dialectologists, and other field researchers with extensive experience, to develop common practice in preparing for institutional review and sharing of data. Expected outcomes are a sketch of an IRB protocol and demographic, attitudinal and situational questionnaires, each containing a core set that scholars should collect for every subject as well as a larger set whose relevance will depend upon the interview itself. The Linguistic Society of America and the Linguistic Data Consortium will publish the protocol and modules on their web sites and announce them via their newsletters and mailing lists, which reach more than 8000 scholars worldwide.
|
0.915 |
2012 — 2014 |
Liberman, Mark Bird, Steven (co-PI) [⬀] |
N/A |
Language Preservation 2.0: Crowdsourcing Oral Language Documentation Using Mobile Devices @ University of Pennsylvania
The purpose of this pilot project is to demonstrate the feasibility of a new approach to documenting endangered languages.
To allow wide-ranging investigation of a language even after it is no longer spoken, we need the equivalent of the million words of extant biblical Hebrew texts, or the five million words of extant classical Latin. But for endangered languages without a significant culture of literacy, diverse text collections on this scale seem out of reach.
Given typical speaking rates of about 10,000 word-equivalents per hour, a hundred hours of recorded speech -- conversations, narratives, or oral histories -- would give us the equivalent of a million words of text. With community involvement, hundreds of hours of such recordings are easily within reach.
However, transcribing such large audio collections is a daunting task, given the small number of literate native speakers and the time-consuming nature of such transcription, which can take 200 hours of work for every hour of audio. We propose to solve this problem by substituting re-speaking and verbal translation: one or more native speakers repeat each phrase of a recording, speaking slowly and carefully, and then translate it into a better-documented language.
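The scale argument above can be made concrete with a small calculation. The sketch below is illustrative only: it uses the two rates quoted in the abstract (about 10,000 word-equivalents per hour of speech, and up to 200 hours of transcription work per hour of audio), and the function names are hypothetical.

```python
# Rates quoted in the abstract; treated here as fixed constants.
WORDS_PER_AUDIO_HOUR = 10_000        # typical speaking rate
WORK_HOURS_PER_AUDIO_HOUR = 200      # upper-end transcription cost


def corpus_words(audio_hours: float) -> float:
    """Word-equivalents contained in a given number of hours of speech."""
    return audio_hours * WORDS_PER_AUDIO_HOUR


def transcription_hours(audio_hours: float) -> float:
    """Hours of work needed to transcribe the audio conventionally."""
    return audio_hours * WORK_HOURS_PER_AUDIO_HOUR


audio_hours = 100
print(f"{audio_hours} h of audio = {corpus_words(audio_hours):,.0f} words")
print(f"conventional transcription = {transcription_hours(audio_hours):,.0f} h of work")
```

At these rates, the hundred hours of recordings that yield a million words would need on the order of 20,000 hours of conventional transcription work, which is the bottleneck that re-speaking is meant to avoid.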
The utility of translated passages as a way to analyze otherwise-unknown languages has been demonstrated many times, starting with the Rosetta Stone. This aspect of our task is easier, since at least a grammatical sketch will in general be available.
Our goal in this project is to demonstrate the utility of re-speaking. We believe that linguists, starting out with relatively little knowledge of a language, can produce phonetic transcriptions that will be good enough to support subsequent analysis resulting in coherent texts, in a process analogous to (but easier than) the process that allowed previous generations of scholars to learn to read ancient Egyptian or Sumerian.
|
0.915 |
2016 — 2017 |
Liberman, Mark Cieri, Christopher [⬀] Callison-Burch, Chris (co-PI) [⬀] |
N/A |
Ci-P: Planning For Scalable Language Resource Creation Through Novel Incentives and Crowdsourcing @ University of Pennsylvania
Advances in human language technologies enable systems that, for example, obey natural language commands and respond in kind, translate among many language pairs and summarize multilingual news. However, the technology's potential remains largely untapped because the linguistic resources that fuel development still fall far short of need. This community infrastructure planning (CI-P) initiative begins the process of building infrastructure to continuously develop high quality language resources, by employing techniques proven to work in multiple scientific disciplines. Social media, crowd-sourcing, games with a purpose and citizen science show us that human resources are effectively limitless for some activities. By offering human contributors appropriate opportunities and incentives, this project enhances language resource development well beyond what direct funding alone can produce. By removing constraints on participation and designing activities to appeal to multiple communities, the project creates educational opportunities for the public, including students and under-represented groups. The increase in scale and diversity of data also benefits those working in language related research, education and technology development. The availability of an ever-growing body of resources for an expanding range of languages will permit developers to supply technologies to a greater proportion of the world.
This project is the first step in the creation of infrastructure capable of high volume, continuous collection of language data and judgments through: ubiquity, perseverance, comprehensive annotation, automated training and certification, appropriate incentives, task engineering and variants of crowdsourcing. Building upon Linguistic Data Consortium's WebAnn framework, virtual front end web servers provide multiple interfaces to incentivize and engineer linguistic data contributions from targeted groups: linguists, citizen scientists, game players and students. Collection and annotation activities are analyzed into component tasks according to the skills they require and are assigned as appropriate to different workforces using different workflows. The combination of customized interfaces and novel incentive strategies enables ongoing, scalable data collection and annotation resulting in diverse language resources available to the wider Computer and Information Science and Engineering research and education communities.
|
0.915 |
2017 — 2019 |
Cieri, Christopher [⬀] Callison-Burch, Chris (co-PI) [⬀] Liberman, Mark |
N/A |
Ci-New: Nieuw: Novel Incentives and Workflows in Linguistic Data Collection and Annotation @ University of Pennsylvania
Language touches every aspect of human life. People speak and write in order to manage relationships from the personal to the international, to gather and provide information, to negotiate, influence and inspire. Scientists use language to communicate their findings regardless of their field of study. Although researchers have been working for six decades to process language via computer, only in the past several years have their efforts produced technologies of sufficient maturity that they can affect the lives of the average citizen. Today, some of the most fortunate use computers to search the vast archives of the Internet, to translate material from languages they do not understand into languages they do, and to interact with smart devices by giving them natural language commands and queries and receiving responses in kind. Despite the growth and promise of human language technologies, they are in fact available for only a tiny portion of the world's approximately 7000 languages and, even then, for only a limited range of situations. This is the case because the approaches that have proven most successful in developing human language technologies require vast amounts of spoken or written language material that have been augmented by human judgment as to their interpretation, but such resources are lacking for most languages and for many types of situations, even for languages of international importance, including English. This Research Infrastructure project will address this shortage of language resources by supporting the language technology research community to employ novel incentives and alternate workflows to greatly expand the methods that have been used to date for collecting and annotating language data. The resulting resources will support research and development on an expanded range of language technologies, leading to the creation and deployment of applications for an increasingly broad range of languages and situations.
Even a brief observation of user behavior on social media, online games, citizen science and public good initiatives demonstrates that many people around the world are willing to devote collectively vast amounts of effort when given appropriate motivation and effective tools. This project will harness some of the immense people-power that drives such activities and focus it on problems of developing language resources that help computers learn to process language. Specifically, the project will create a software toolkit to be developed by the project team in response to the needs of language technology researchers to create online activities that yield language resources. The activities will include games, citizen science and tools for language professionals, clustered into a series of portals that appeal to different populations of users. The project will build and maintain the database and web servers, with redundancy, load balancing and failover, to run the principal instance of all of the activities, and an open-source release of the software will enable other researchers to build their own instances independently. Finally, the data resulting from this project will be shared with the least restrictive terms possible to further support language technology research and development activities worldwide.
|
0.915 |