1998 — 2000 |
Fellbaum, Christiane |
N/A |
WordNet as an Interlingual Lexical Resource
This work represents a two-pronged extension of the lexical database WordNet to create a resource suitable for automated word sense disambiguation, multilingual adoption, and crosslinguistic applications. Mapping WordNet's words and concepts into those of other languages requires disambiguation, identification, and discrimination of related entities. We first group WordNet's finely distinguished senses of highly polysemous nouns, verbs, and adjectives into underspecified "super"-senses to facilitate discrimination and disambiguation. Second, we add to the verbs in WordNet sentences illustrating their subcategorization patterns and selectional restrictions so as to better distinguish senses and facilitate matching the verb lexicons of different languages. The project is significant because it creates a resource that meets present demands for multilingual Natural Language Processing applications. Moreover, the augmentations carry considerable theoretical interest. Systematic underspecification of lexical entries might lead towards a psychologically realistic lexicon. A principled grouping of closely related senses will shed light on the nature of polysemy and specific ways in which flexible word meanings yield possible sense extensions in all areas of the lexicon. Subcategorization information and selectional restrictions will reveal the extent to which the semantic relatedness of verbs, expressed in WordNet's relational structure, is correlated with syntactic relatedness.
|
1 |
2001 — 2003 |
Fellbaum, Christiane |
N/A |
Collaborative Proposal: Using the Web as a Corpus for Empirical Linguistic Research
This project will develop tools that make it possible to retrieve naturally occurring sentences from the World Wide Web on the basis of lexical content and syntactic structure, providing linguists with an immediate, easily accessible source of raw linguistic data. The PIs will investigate specific linguistic hypotheses at the lexical semantics/syntax interface as an illustrative application of these tools. At a high level, the planned work constitutes an important step toward a new paradigm for linguistic research. Rather than relying entirely on introspective data generated by the linguist who is trying to (dis)prove a particular hypothesis, Web-enabled linguistics research will draw on the methodology and the tools developed by the PIs to supply naturally occurring data on which theories can rest. With regard to specific linguistic questions, the goal is to provide an explanation of the rules and constraints that govern three transitivity alternations (Middle, Unaccusative, Unspecified Object Deletion). The PIs expect data made available by their tools to shed light on the "grey" area between competence and performance, that is, the linguistic behavior that seems to fall outside of rule-governed behavior. Although naturally occurring data are not accorded great emphasis in generative syntax, the use of text corpora has a tradition in the greater linguistic enterprise. An explosive new phenomenon in the world of naturally occurring text, the World Wide Web is an essentially untapped resource that embodies the rich and dynamic nature of language, presenting a data resource of unparalleled size and diversity.
|
1 |
2004 — 2006 |
Charikar, Moses (co-PI); Schapire, Robert (co-PI); Osherson, Daniel (co-PI); Fellbaum, Christiane |
N/A |
Constructing an Enhanced Version of WordNet
WordNet is an important lexical resource for research in areas including NLP and AI. This project initiates the development of a radically enhanced version of WordNet, WordNet+. Constructing WordNet+ involves a novel combination of empirical methods: human annotation, corpus analysis, and machine learning. WordNet+ specifically addresses WordNet's limited ability to identify word senses, which stems from the sparsity of Boolean arcs among sets of synonymous words ("synsets"). First, quantified, oriented arcs are to be added among a core set of 5,000 synsets. These arcs reflect evocation, the extent to which the meaning of one synset brings to mind another. Following the selection of the core synsets, a random subset of 250,000 arcs is to be elicited from annotators. The annotators, trained and tested for inter- and intra-annotator reliability, record the strength of their mental associations using a specially designed and tested interface. The remaining arcs are to be extrapolated from the manually obtained arcs using machine learning algorithms.
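The quantified, oriented arcs described above can be pictured as a weighted directed graph. The following is a minimal sketch only: the class name, the synset identifiers, the 0-to-1 rating scale, and the choice to average several annotators' ratings are all assumptions made for illustration and are not taken from the project.

```python
from collections import defaultdict
from statistics import mean


class EvocationGraph:
    """Directed, weighted arcs between synsets (hypothetical structure)."""

    def __init__(self):
        # source synset -> target synset -> list of annotator ratings
        self._ratings = defaultdict(lambda: defaultdict(list))

    def add_rating(self, source, target, strength):
        """Record one annotator's rating; the 0..1 scale is an assumption."""
        if not 0.0 <= strength <= 1.0:
            raise ValueError("strength must lie between 0 and 1")
        self._ratings[source][target].append(strength)

    def evocation(self, source, target):
        """Mean rating for the arc source -> target, or None if unrated."""
        ratings = self._ratings.get(source, {}).get(target)
        return mean(ratings) if ratings else None

    def strongest(self, source, k=3):
        """The k targets most strongly evoked by `source`."""
        arcs = self._ratings.get(source, {})
        ranked = sorted(arcs, key=lambda t: mean(arcs[t]), reverse=True)
        return ranked[:k]


# Invented identifiers and ratings, for illustration only.
graph = EvocationGraph()
graph.add_rating("rain.n", "umbrella.n", 0.8)
graph.add_rating("rain.n", "umbrella.n", 0.6)
graph.add_rating("rain.n", "desert.n", 0.1)
```

Because the arcs are oriented, a rating for "rain.n" evoking "umbrella.n" says nothing about the reverse direction, which stays unrated until an annotator or a learned model supplies it.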
All results will be made available to the research community: the core concepts, the indirect co-occurrence matrices, and all available ratings. Given WordNet's past contributions to a number of diverse disciplines, the initial stages of the construction of this research tool should stimulate great interest and have a significant impact on related work.
|
1 |
2006 — 2008 |
Fellbaum, Christiane |
N/A |
Collaborative Research: CRI: An Open Linguistic Infrastructure for American English
This project, which supports research in computational linguistics, plans to enhance the American National Corpus (ANC) with an open linguistic infrastructure that will add multiple manual and automatic annotations to a portion of the ANC and will provide free access to these annotations in a common XML data format via a project website. The following activities are envisioned:
- Incorporation of automatic annotations derived from freely available existing tools, mapped into the ANC XML format,
- Syntactic and named entity annotations of a 10Mw gold standard corpus, with partial manual annotation,
- Hand-corrected automatic WordNet and FrameNet annotation for a portion of the gold standard corpus,
- Enhancement of automatic annotation performance via experimentation with machine learning techniques, and
- Development of a web interface for users to download the above annotations and to upload new annotations of the ANC.
This work, which describes methods for internal and external evaluation of the resources and tools developed, plans to create a richly and multiply annotated, diverse corpus of natural language, along with tools to access it. The full project would be the first large-scale execution of such an effort, developing a 100-million-word ANC and providing a 10-million-word subset annotated with syntax, named entities, and semantic categories in WordNet (WN) and FrameNet (FN). The annotated data will be balanced across different genres of text. One activity of the planning award is harmonizing all three resources, ANC, WN, and FN, and maximally exploiting their respective strengths. The other is the continued development of the ANC, which, with the addition of a wide range of linguistic annotations, will serve as a resource for language processing research and applications for the NLP community. The planning project undertakes the following activities:
- Creation and annotation of WN senses and FN frames,
- Planning meetings,
- Further experimentation with methods and software to enhance automatic annotation, and
- Outreach to the US computational linguistics community.
Broader Impact: Full completion of this work will further enhance the ANC by creating a comprehensive linguistic infrastructure for American English. The availability of a massive, richly annotated corpus of American English has impacts at many levels and across several areas, including computational linguistics and natural language processing, corpus linguistics, cross-linguistic studies, dialect studies, language acquisition, and materials development for both English language students and teacher training.
|
1 |
2007 — 2010 |
Fellbaum, Christiane |
N/A |
RI: Collaborative Proposal: Complementary Lexical Resources: Towards an Alignment of WordNet and FrameNet
Machine-readable lexical resources are essential to Natural Language Processing applications such as information extraction and machine translation. The largest lexicon is WordNet, with semantic information about more than 150,000 lexical units (LUs). A smaller, independently developed resource is FrameNet, which provides detailed information about the syntactic patterns for LUs. The project investigates the ways in which these complementary resources can be combined, using the semantic-syntactic information from FrameNet (FN) where available and falling back on less detailed entries from WordNet (WN) in other cases.
WN and FN exhibit fundamentally different design principles. WN groups (near) synonymous LUs into "synsets," which are interconnected via conceptual and lexical relations to form a semantic network. FN groups LUs according to the "semantic frame" they evoke, which is a type of event, relation or state along with the participants involved in it. Thus, while antonyms such as _praise_ and _blame_ may be in the same FN frame, they are in different, though interlinked, WN synsets. Moreover, FN frames cover semantically related nouns, verbs and adjectives, whereas WN synsets do not mix parts of speech. Crucially for NLP applications, the resources differ with respect to sense distinctions. Alignment will be investigated for the following differences: lexical coverage, sense distinctions, taxonomic and other semantic relations, and scalar frames for adjectives. Some 1,000 word senses are examined in detail so as to provide an idea of the distribution of each of these phenomena over the entire lexicon.
This theoretical work lays the foundation for constructing a unique, invaluable resource for the NLP community.
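The fallback strategy mentioned in this abstract (use the FrameNet entry where one exists, otherwise the less detailed WordNet entry) can be sketched as a two-stage lookup. This is an illustration under stated assumptions: the record layout, the frame name, the synset identifiers, and the syntactic pattern string are all invented here and are not copied from either resource.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LexicalEntry:
    lemma: str
    pos: str
    source: str                      # "FN" or "WN"
    grouping: str                    # frame name (FN) or synset id (WN)
    syntactic_patterns: Tuple[str, ...] = ()


# Toy data, invented for illustration. The antonyms share one
# (hypothetical) frame but sit in different synsets.
FRAMENET = {
    ("praise", "v"): LexicalEntry("praise", "v", "FN", "JUDGMENT_FRAME",
                                  ("NP V NP for NP",)),
    ("blame", "v"): LexicalEntry("blame", "v", "FN", "JUDGMENT_FRAME",
                                 ("NP V NP for NP",)),
}
WORDNET = {
    ("praise", "v"): LexicalEntry("praise", "v", "WN", "synset:praise-v-1"),
    ("blame", "v"): LexicalEntry("blame", "v", "WN", "synset:blame-v-1"),
    ("stroll", "v"): LexicalEntry("stroll", "v", "WN", "synset:stroll-v-1"),
}


def lookup(lemma: str, pos: str) -> Optional[LexicalEntry]:
    """Prefer the detailed FN entry; fall back on WN; None if neither has it."""
    key = (lemma, pos)
    return FRAMENET.get(key) or WORDNET.get(key)
```

With this toy data, `lookup("praise", "v")` returns the FrameNet record with its syntactic pattern, while `lookup("stroll", "v")` falls back on the WordNet record, which carries no pattern information.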
|
1 |
2009 — 2013 |
Fellbaum, Christiane |
N/A |
CI-ADDO-EN: A Second-Generation Architecture for WordNet
This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
The work constitutes a significant update of WordNet, a large electronic lexical database of English that is a cornerstone of research and applications in computational linguistics, Natural Language Processing, Knowledge Representation, and Semantic Web applications, and that serves as the basis for numerous computational linguistics tools. WordNet's database is redesigned and converted from a text-based to a relational (SQL) format to allow flexible and domain-specific extensions. Specific tasks include the definition of an SQL schema and the development of an ASCII-to-WordNetSQL table translation program, a searching interface, and additional format conversion utilities. Syntactic limitations on WordNet's lexicographer files and compiler are eliminated so as to allow long, variable-length word strings and special characters found in technical terminology. The table-based SQL format is designed to allow a virtually unlimited number of relations per word form or synonym set, both user-created and "original" to the Princeton WordNet. User extensions and modifications to WordNetSQL are distinguished from those made "domestically," ensuring that development of the core WordNet database remains independent of external updates, and that WordNet's large user community continues to have available a common, consistent database against which automatic systems can be evaluated. We develop conversion tools for WordNetSQL and other popular WordNet representations (RDF/OWL, Prolog, XML). Maintenance of the Princeton WordNet lexicon and user support continue. The major impact of the freely and publicly available WordNetSQL is to enable flexible extensions by a broad and diverse user group to specific and technical domains including biology and medicine.
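One way to picture a relational layout with these properties is sketched below using Python's built-in `sqlite3` module. This is not the WordNetSQL schema, which the abstract does not specify: the table names, column names, the `origin` column used to separate Princeton data from user extensions, and all rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema. Relations live in their own table, so a synset can
# carry any number of them, and every row records where it came from.
SCHEMA = """
CREATE TABLE synset (
    synset_id INTEGER PRIMARY KEY,
    pos       TEXT NOT NULL,
    gloss     TEXT,
    origin    TEXT NOT NULL DEFAULT 'princeton'
);
CREATE TABLE word_form (
    word_id   INTEGER PRIMARY KEY,
    lemma     TEXT NOT NULL,          -- TEXT places no length limit on strings
    synset_id INTEGER NOT NULL REFERENCES synset(synset_id),
    origin    TEXT NOT NULL DEFAULT 'princeton'
);
CREATE TABLE relation (
    source_id INTEGER NOT NULL REFERENCES synset(synset_id),
    target_id INTEGER NOT NULL REFERENCES synset(synset_id),
    rel_type  TEXT NOT NULL,
    origin    TEXT NOT NULL DEFAULT 'princeton'
);
"""


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO synset (synset_id, pos, gloss, origin) VALUES (?, ?, ?, ?)",
        [(1, "n", "invented gloss for a general term", "princeton"),
         (2, "n", "invented gloss for a narrower term", "princeton"),
         (3, "n", "invented gloss for a user-added domain term", "user")],
    )
    conn.executemany(
        "INSERT INTO relation (source_id, target_id, rel_type, origin) "
        "VALUES (?, ?, ?, ?)",
        [(2, 1, "hypernym", "princeton"),
         (3, 2, "hypernym", "user")],
    )
    conn.commit()
    return conn


def core_relations(conn):
    """Relations belonging to the core database only, ignoring user rows."""
    return conn.execute(
        "SELECT source_id, target_id, rel_type FROM relation "
        "WHERE origin = 'princeton' ORDER BY source_id, target_id"
    ).fetchall()
```

Filtering on the origin column is one simple way to keep a common core available for evaluation while user extensions accumulate in the same tables. A real design might instead use separate tables or databases for external data.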
|
1 |
2009 — 2012 |
Finkelstein, Adam (co-PI); Fellbaum, Christiane; Funkhouser, Thomas; Blei, David (co-PI) |
N/A |
Interactive Discovery and Semantic Labeling of Patterns in Spatial Data
Finding and labeling semantic patterns in large, spatial data sets is one of the most important problems facing computer scientists today. Massive spatial data sets are being acquired in almost every scientific discipline, such as medicine, geology, biology, astrophysics, and others. Finding meaningful patterns in those data is often the bottleneck to scientific discovery. The proposed research is to develop a transformative machine learning methodology, where the process of discovering semantic patterns in large spatial data sets is interactive and semi-autonomous. With the proposed tools and algorithms, the user is provided with an interactive system that shows the most likely segmentations and labelings given the information provided so far, but allows the user to provide additional information as he/she sees fit. The user might adjust a segmentation, provide a label, or specify an expected pattern. The system will adapt in real time to each of these inputs, thus adjusting its predictions throughout the data.
The broad impact of the proposed plan will be enhanced through an integrated educational and outreach plan. Besides the published research results, the field will benefit from free distribution of research and education resources, including web pages, bibliographies, software, and data sets, including augmentations to WordNet. Further broad impacts include focused workshops and courses on shape analysis, machine learning, and visualization at both the university and professional levels. Finally, diversity enhancement programs will promote the opportunities for disadvantaged groups in research.
|
1 |
2011 — 2012 |
Fellbaum, Christiane |
N/A |
A Workshop on Restructuring Adjectives in WordNet
The workshop gathers developers and users of lexical resources, corpus and computational linguists and researchers in natural language processing to discuss a targeted restructuring of the adjectives in the lexical database WordNet. Specific proposals are presented for replacing a subset of the current clustering of adjectives around antonyms with ordered scales reflecting the relative intensity of dimensional adjectives, such as "big", "huge" and "gigantic". These are accompanied by preliminary work demonstrating the feasibility of corpus-based construction of scales by means of lexical-semantic patterns and their potential benefits for NLP. Discussion topics include (1) the principal benefits of encoding scalar properties for applications including word sense disambiguation, textual entailment and language pedagogy; (2) suitable corpora for extracting data for scale construction; (3) limitations of the recently developed AdjScales method and alternative or complementary methods for extracting scalar properties; and (4) modeling of scalar adjectives in WordNet. Participants evaluate the proposed restructuring of adjectives for its feasibility, value and relevance to their own work and its potential for future research and applications. A report including the presentations, discussions and recommendations of the group will be prepared and freely disseminated via the WordNet website.
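As a rough illustration of how lexical-semantic patterns could order adjectives by intensity, the sketch below looks for phrasings in which the first adjective is plausibly the weaker one. It is not the AdjScales method: the three patterns, the scoring heuristic, and the example sentences are assumptions chosen for this illustration.

```python
import re
from collections import Counter

# Each pattern captures (weaker, stronger), e.g. "big, even huge".
# These three patterns are illustrative assumptions.
PATTERNS = [
    re.compile(r"\b(\w+),? even (\w+)\b"),
    re.compile(r"\b(\w+),? if not (\w+)\b"),
    re.compile(r"\b(\w+),? but not (\w+)\b"),
]


def extract_pairs(sentences, adjectives):
    """Count (weaker, stronger) pairs among the given candidate adjectives."""
    pairs = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for pattern in PATTERNS:
            for weaker, stronger in pattern.findall(text):
                if weaker in adjectives and stronger in adjectives \
                        and weaker != stronger:
                    pairs[(weaker, stronger)] += 1
    return pairs


def order_scale(pairs, adjectives):
    """Order adjectives from mildest to most intense.

    Heuristic: score = times seen as stronger minus times seen as weaker.
    Ties are broken alphabetically so the result is deterministic.
    """
    score = {adj: 0 for adj in adjectives}
    for (weaker, stronger), count in pairs.items():
        score[weaker] -= count
        score[stronger] += count
    return sorted(adjectives, key=lambda adj: (score[adj], adj))


# Invented example sentences.
corpus = [
    "The house was big, even huge.",
    "It was huge, if not gigantic.",
    "The dog is big but not gigantic.",
]
candidates = {"big", "huge", "gigantic"}
pairs = extract_pairs(corpus, candidates)
scale = order_scale(pairs, candidates)
```

On these three sentences the heuristic places "big" below "huge" and "huge" below "gigantic". Real corpus data is far noisier, which is one reason the choice of corpora and the limits of pattern-based extraction appear among the discussion topics.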
The directions for targeted future developments of the widely used WordNet database as spelled out and agreed upon by representatives from a broad expert community assure significant consequences for research and applications in language technology and pedagogy. For a post-doctoral fellow and a graduate student the workshop provides a unique opportunity to interact with experts in the field.
|
1 |
2012 — 2013 |
Fellbaum, Christiane |
N/A |
CI-P: Collaborative Research: LexLink: Aligning WordNet, FrameNet, PropBank and VerbNet
This Computing Research Infrastructure planning grant addresses two challenges for automatic systems performing deep semantic processing: identifying the context-appropriate sense of polysemous words and interpreting the meanings and interrelations of verbs and nouns in event-denoting phrases. Preliminary steps are taken for aligning and linking four existing widely used lexical resources (WordNet, FrameNet, PropBank and VerbNet) with different but complementary contents and coverage. Methods for completing current cross-resource links and full transitive closure are explored and tested. The resulting infrastructure (LexLink) is designed to make the resources fully interoperable, capitalizing on their particular strengths with respect to word sense disambiguation and Semantic Role labeling.
Four activities are carried out in the context of planning LexLink. First, a workshop is held where key representatives of the Natural Language Processing and computational semantics communities articulate needs and requirements for the planned resource and offer advice on algorithms, annotation techniques and evaluation. Second, a subsection of cross-resource links for word senses and Semantic Role labels (Agent, Instrument, etc.) resulting from the automatic transitive closure is evaluated, yielding estimates for the error rate and leading to fine-tuning of algorithms. Third, current best performing mapping algorithms for word senses and Semantic Role labels are evaluated against a human-annotated Gold Standard. Fourth, new Gold Standard data are created for additional training and testing and to refine existing algorithms. As a whole, the work provides a solid foundation for a resource with significant beneficial impact on a range of natural language applications, including machine translation, text summarization and sentiment analysis affecting areas such as health care, marketing, and education.
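The transitive closure of cross-resource links can be sketched as follows: if an entry in one resource is linked to an entry in a second, and that entry is linked to one in a third, a link between the first and third is inferred. This is a minimal sketch under stated assumptions. Links are treated as symmetric, and every identifier below is a placeholder invented for illustration rather than an actual entry in any of the four resources.

```python
from itertools import combinations


def infer_links(links):
    """Return cross-resource links implied by transitivity but not yet listed.

    `links` is an iterable of pairs of (resource, identifier) tuples.
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    known = {frozenset(pair) for pair in links}
    visited, inferred = set(), set()
    for start in neighbours:
        if start in visited:
            continue
        # Collect every entry reachable from `start` (one connected component).
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        visited |= component
        # Any two entries in the component from different resources are linked.
        for a, b in combinations(sorted(component), 2):
            if a[0] != b[0] and frozenset((a, b)) not in known:
                inferred.add((a, b))
    return inferred


# Placeholder identifiers, invented for illustration.
WN = ("WordNet", "wn-sense-A")
VN = ("VerbNet", "vn-class-B")
PB = ("PropBank", "pb-roleset-C")
FN = ("FrameNet", "fn-frame-D")

existing = [(WN, VN), (VN, PB), (FN, VN)]
new_links = infer_links(existing)
```

Because one wrong link contaminates every link inferred through it, closure of this kind amplifies errors, which is consistent with the abstract's plan to evaluate a subsection of the automatically inferred links and estimate their error rate.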
|
1 |