1998 — 2006 |
Lafferty, John (co-PI); Carbonell, Jaime (co-PI); Yang, Yiming; Nyberg, Eric |
N/A |
KDI: Universal Information Access: Translingual Retrieval, Summarization, Tracking, Detection and Validation @ Carnegie-Mellon University
This is a three-year standard award. The ultimate goal of the Universal Information Access project is the full democratization of information and knowledge access, by removing -- or greatly lowering -- educational, linguistic and socio-economic barriers to effective information access and use. Progress towards this goal requires us to address the following challenges: (1) Translingual information retrieval, in order to access documents across language barriers and across same-language jargon barriers, (2) Multi-level summarization, customized to the user's profile and information needs, (3) Automated hierarchical categorization, via high-dimensionality statistical learning methods, (4) Detection and tracking of new topics and events of interest to each user as they unfold, and (5) Information validation as a function of source reliability and inter-source consistency. These capabilities will be integrated seamlessly into an information navigator's workstation, using a common underlying object model and a user-centric interface for visualization and management of information. These methods will be evaluated both with respect to quantitative metrics and with respect to user feedback from realistic tasks. Universal information access requires more than search engines and web browsing. For instance, much useful information may exist in languages other than English, or may come from sources of unknown reliability. Moreover, rapid analysis of information requires customized summarization, anti-redundancy filters, and hierarchical organization. Advances in these areas are beneficial to all disciplines that must cope with large volumes of rapidly growing information, such as scientific research, crisis management, international business, and the improvement of our educational infrastructure.
The proposed research, in addition to its clear impact on democratizing information access, should provide significant advances in: Information Retrieval, Machine Learning, Digital Libraries, and user-centered Information Management.
|
0.915 |
2016 — 2019 |
Nyberg, Eric |
N/A |
III: Small: Matching and Ranking Via Proximity Graphs: Applications to Question Answering and Beyond @ Carnegie-Mellon University
This project will explore novel alternatives to classic term-based full-text search, which is one of the most widely used computer algorithms. Current full-text search approaches rely heavily on memorizing which words and phrases appear in which text documents. The proposed research, in contrast, will examine methods that deviate from this well-studied path by using more generic similarity search methods. In doing so, the proposed research will pursue the following two objectives: (1) mitigating limitations of the existing approaches, such as the mismatch between words that appear in queries and documents; and (2) developing approaches that permit an efficient division of labor between data scientists and designers of retrieval algorithms. The latter would allow data scientists to focus on developing effective similarity models without worrying too much about low-level performance issues, while designers of retrieval algorithms and software engineers would be able to focus on developing more efficient and/or scalable approaches with fewer concerns about the quality of results.
The proposed research will investigate at least two scenarios where term-based full-text search is replaced with a more generic high-accuracy k-nearest neighbor (k-NN) search. In the first scenario, it will develop a similarity function that goes beyond pure lexical matching and takes into account distributional similarity, similarity learned from a parallel (monolingual) corpus, and so on. In this scenario, the similarity function will be used as a black-box function coupled with a generic similarity search engine, implemented as a part of the Non-Metric Space Library (NMSLIB). Several search algorithms will be explored. One of the search approaches will rely on building a proximity graph (also known as a neighborhood graph), where nodes are objects and similar nodes are connected by edges. In the second scenario, the proposed research will build a pseudo inverted file over super terms. Super terms are (dense or sparse) vectorial representations of words appearing within a sliding window of small size. The super terms form a pseudo-vocabulary that can be indexed using a proximity graph (or any other efficient k-NN search method). At query time, the super terms will be extracted from the query and matched against the pseudo-vocabulary to obtain the k nearest super terms (as well as the documents where they occur). This approach will incorporate term proximity and term similarity (the latter will make the approach less affected by vocabulary mismatch). Because preliminary experiments demonstrated that proximity graphs are not sufficiently accurate and efficient for the task at hand, the proposed research will also attempt to develop better variants of proximity-graph methods. Should these improvements fall short, alternative search methods will also be explored. Experimental insights, algorithmic improvements, and new challenging datasets (resulting from the proposed work) will advance the state of the art in k-NN search, which is another widely used method.
This, in turn, will benefit a variety of other NLP tasks such as classification, dictionary-based entity detection, and first story detection, all of which rely heavily on k-NN search. Additional project information will be made available at the project website: http://www.lti.cs.cmu.edu/PGraph
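As a rough illustration of the proximity-graph idea described in this abstract, the following minimal Python sketch connects each object to its nearest neighbors and answers queries with a best-first traversal that treats the distance as a black-box function. It is not the project's or NMSLIB's implementation: the names `build_graph`, `search`, and `cosine_distance`, the parameters `m` and `ef`, and the brute-force graph construction are all assumptions made for illustration.

```python
import heapq
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Distance = Callable[[Vector, Vector], float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Stand-in black-box distance (1 - cosine similarity); any function works."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_graph(points: List[Vector], dist: Distance, m: int) -> Dict[int, Set[int]]:
    """Link every point to its m nearest neighbors (brute force, O(n^2)).

    Edges are stored in both directions so the graph is undirected.
    """
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        for j in heapq.nsmallest(m, others, key=lambda j: dist(p, points[j])):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def search(points: List[Vector], graph: Dict[int, Set[int]], dist: Distance,
           query: Vector, k: int, ef: int = 10, entry: int = 0) -> List[Tuple[float, int]]:
    """Best-first traversal from `entry`, keeping the `ef` closest nodes seen.

    Returns up to k (distance, index) pairs, closest first. The result is
    approximate: a small `ef` or a poorly connected graph can miss neighbors.
    """
    ef = max(ef, k)
    d0 = dist(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap of nodes still to expand
    best = [(-d0, entry)]        # max-heap (negated distances), size <= ef
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded node is worse than the worst kept result.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(query, points[nb])
            if len(best) < ef or d_nb < -best[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(best, (-d_nb, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-neg_d, i) for neg_d, i in best)[:k]
```

Because `dist` is only ever called, never inspected, a learned or non-metric similarity model could in principle be swapped in without touching the search code, which is the separation of concerns the abstract aims for. Larger `ef` values trade speed for accuracy.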
|
0.915 |