2001 — 2004 |
Dorr, Bonnie (co-PI); Weinberg, Amy (co-PI); Raschid, Louiqa; Doermann, David; Oard, Douglas (co-PI) |
N/A |
CISE Research Resources: Infrastructure to Develop a Large Scale Experiment Testbed of Multi-Modal Resources @ University of Maryland College Park
EIA-0130422 Louiqa Raschid University of Maryland College Park
CISE Research Resources: Infrastructure to Develop a Large Scale Experiment Testbed of Multi-modal Resources
Using the widely distributed collections of structured and unstructured information that the Internet provides, expressed in multiple languages or modalities, requires scalable, robust algorithms that discover replicated content, determine the delay or access latency of sources, and cope with the inherently dynamic nature of the Internet.
This project's objective is to establish a laboratory testbed providing a controlled environment that captures structural, content, and latency characteristics of the (publicly accessible) Web. This will stimulate collaboration between researchers whose interests range over natural language applications, language independent processing of scanned documents, analysis of video information sources, information retrieval, and wide area applications and resource discovery across heterogeneous servers.
The testbed will support the development and testing of: (1) tools for broad-scale, cross-linguistic analysis and discovery of relevant information across languages and modalities, (2) cost models and access cost catalogs for wide area environments, reflecting the temporal variability in access latency, (3) distributed content-based indexing and association of media clips for resource discovery, and (4) transcoding and scheduling of multimedia resources for delivery any time and anywhere to disparate clients, from mobile wireless to high speed optical links.
|
0.915 |
2007 — 2011 |
Samet, Hanan; Doermann, David |
N/A |
III: Spatiotextual Extraction of Document on the Web for Digital Government Applications @ University of Maryland College Park
Search technology today is dominated by search engines such as the one provided by Google, where for a given query string s, a set D of documents is retrieved with the aid of an algorithm that ranks each element of D on the basis of how many other documents link to it. This research will investigate the issues involved in the development of a search engine that supports geographic location retrieval, and its deployment in a setting involving digital government applications where it is also desirable to retrieve documents on the basis of spatial proximity.
Intellectual Merit: (1) Identifying geographic references in documents is a challenging issue and is necessary for advanced search applications. This is especially true in the case of users who want to browse large collections of documents that are not necessarily on the web and to explore and discover spatial relationships either in the same document or in a collection of documents. (2) Increasingly, users are looking for documents that contain spatially proximate content. Thus the traditional method of ranking by the link structure of the web is not appropriate. Determining the geographic focus of a document is a difficult task but is necessary in applications such as those dealing with documents on the hidden web, which is a set of documents, usually proprietary, that is for internal use of an organization and is often not available on the Internet. This means that there are few, if any, links to these documents, and thus popular internet search strategies are not applicable. (3) Treating spatial content of documents as a first-class citizen, in the sense that a geographic scope is reported for each document that is retrieved regardless of whether the query has a spatial component, is difficult given the need to resolve issues related to aliasing (realizing that "Los Angeles" and "LA" are the same) and ambiguity (different interpretations for "London"). (4) Developing query optimization and execution strategies for queries that involve both a textual and spatial component. (5) Developing effective techniques for measuring spatial similarity other than proximity, as well as techniques for measuring combinations of spatial and textual similarity. This includes the adaptation of the skyline operator.
Broad Impacts: The ability to retrieve documents on the basis of spatial proximity makes for a better search experience and will lead to more relevant results. The tools to be developed will also extend the reach of search engines from being restricted to documents on the internet to documents that reside on the hidden web. The deployment of these tools in government web sites via collaboration with the grant's digital government partners has the effect of empowering citizens to find out what their government is doing, thereby leading to a more informed citizenry.
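The skyline operator mentioned in item (5) can be illustrated with a minimal sketch. It assumes each retrieved document already carries a textual relevance score and a distance from the query location. The tuple layout, the sample values, and the names `dominates` and `skyline` are illustrative assumptions, not part of the project's software.

```python
from typing import List, Tuple

# Hypothetical record: (doc_id, text_score, distance_km).
# Higher text_score is better; lower distance_km is better.
Doc = Tuple[str, float, float]


def dominates(a: Doc, b: Doc) -> bool:
    """True if a is at least as good as b on both criteria and strictly better on one."""
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def skyline(docs: List[Doc]) -> List[Doc]:
    """Return the documents that no other document dominates (naive O(n^2) scan)."""
    return [d for d in docs if not any(dominates(o, d) for o in docs)]


# Illustrative data only.
docs = [
    ("a", 0.9, 50.0),
    ("b", 0.6, 5.0),   # dominated by d: same distance, lower text score
    ("c", 0.5, 60.0),  # dominated by a: lower text score and farther away
    ("d", 0.8, 5.0),
]
print([d[0] for d in skyline(docs)])  # ['a', 'd']
```

The skyline keeps every document that represents a distinct trade-off between textual and spatial similarity, so no fixed weighting between the two criteria has to be chosen in advance.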
|
0.915 |
2011 — 2017 |
Doermann, David; Oard, Douglas; Kirsch, David |
N/A |
III: Medium: Development and Evaluation of Search Technology for Discovery of Evidence in Civil Litigation @ University of Maryland College Park
The civil litigation system of the United States serves as the ultimate arbiter for commercial and personal disputes. Under this system, plaintiffs and defendants are entitled to request relevant evidence from each other. Although digital records seem easier to find than their older paper counterparts, rapid growth in the volume, diversity, and possible locations of these records has actually made it harder to find the proverbial needles within the digital haystacks. The resulting rapid increase in the cost of discovery and exchange of relevant evidence, if left unchecked, raises concerns about access to justice. Hence, there is an urgent need for demonstrably accurate and cost-effective technologies to support "e-discovery" of the relevant records.
Professor Douglas W. Oard and colleagues of the University of Maryland are developing techniques to automatically decide within minutes the responsiveness of more documents than one person could examine in a lifetime. These techniques use "semi-supervised learning" algorithms for "training" the software to replicate the kinds of decisions that people make on representative examples. Using Finite Population Annotation, a new framework for integrating learning with evaluation, novel methods are being developed to achieve and measure the highest possible effectiveness for any specified level of human effort. These learning methods draw on rich approaches to representing the content of both born-digital structured documents and scanned paper. Measures for rigorously assessing the effectiveness of the resulting automated review techniques are being developed both to support decisions by legal professionals and by the courts about which methods to use, and to help developers further improve their algorithms.
The legal system demands technology whose effectiveness has been demonstrated on collections that are representative of what is actually expected in a real case. For that reason, this project is creating real-world benchmarks in collaboration with the National Institute of Standards and Technology's Text Retrieval Conference (TREC). The project's results are expected to help shape professional practice through workshops for legal and technical stakeholders, and through university courses to prepare the next generation of attorneys and information professionals to employ these new capabilities. "E-discovery" technologies resulting from this effort are likely to be broadly applicable in domains beyond legal practice, including preparation of systematic reviews of scientific literature, scholarly access to digital archives, and government responses to public information requests from citizens. Additional information is available at http://ediscovery.umiacs.umd.edu.
|
0.915 |
2012 — 2014 |
Davis, Larry (co-PI); Doermann, David |
N/A |
EAGER: Large Scale Document Image Triage, Indexing and Retrieval @ University of Maryland College Park
Structural similarity search and retrieval in images that include both printed and handwritten text remains a challenging problem, especially with collections that are noisy and heterogeneous. Approaches currently in use generally convert documents before filtering. This work provides triage as a way to filter very large collections through structural similarity with known attributes, then applies new clustering with broader terms and hashing to extend the scale of collections considered. The work will provide new directions for document image retrieval, especially in conditions where there is a wide variation in structure and layout, and will be made scalable in cloud environments. Another approach to scaling, especially in the area of duplicate detection, will extend multi-level locality sensitive hashing and generalize it to other analysis, indexing, and retrieval issues. The project will involve graduate students, and results and software will be made available through Creative Commons licensing to provide for replication and extension of the results.
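As a rough illustration of how locality sensitive hashing supports duplicate detection, the sketch below uses plain single-level random-hyperplane hashing, not the multi-level variant the abstract refers to. It assumes each document image has already been reduced to a numeric feature vector. The function names, the feature values, and the parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `bits` random hyperplanes in `dim` dimensions (seeded for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec: Sequence[float], planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def bucket(features: Dict[str, Sequence[float]],
           planes: List[List[float]]) -> Dict[Tuple[int, ...], List[str]]:
    """Group document ids by signature; ids sharing a bucket are duplicate candidates."""
    table: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
    for doc_id, vec in features.items():
        table[signature(vec, planes)].append(doc_id)
    return dict(table)


# Illustrative feature vectors only; "scan1_copy" is "scan1" scaled by 2,
# so it points in the same direction and always receives the same signature.
planes = make_hyperplanes(dim=4, bits=8)
features = {
    "scan1": [1.0, 0.5, 0.0, 2.0],
    "scan1_copy": [2.0, 1.0, 0.0, 4.0],
    "other": [-1.0, 3.0, 2.0, -0.5],
}
table = bucket(features, planes)
```

Only documents that land in the same bucket need a full pairwise comparison, which is what lets this kind of scheme scale to large collections. Vectors pointing in similar directions are likely, but not guaranteed, to collide.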
|
0.915 |
2012 — 2014 |
Davis, Larry (co-PI); Doermann, David |
N/A |
EAGER: Video Analytics in Large Heterogeneous Repositories @ University of Maryland College Park
The planned research will take video analysis, indexing, and retrieval in a new and promising direction. The research is driven by the need for intelligence analysts to be able to express video queries more efficiently than traditional relevance feedback and to be able to provide more expressive queries that include "nouns" and "verbs" as they would with human language. While still constrained, the approach goes a long way toward bridging the gap between traditional relevance feedback based only on assumed relationships in the image, and full human language queries. The graduate students involved in the project will be required to publish in international conferences and journals and will likely use this research as a basis for their dissertations. Other impacts of this work include the mentoring of graduate students and the inclusion of junior personnel in the management of the project. Research will be disseminated through local, national, and international meetings and journals. The team will also install a server and public interface for demonstration on existing datasets. The system will be accessible through the web on limited datasets and on the full dataset by request.
|
0.915 |
2013 — 2015 |
Doermann, David; Davis, Larry |
N/A |
EAGER: Document Image Quality Estimation, Enhancement, Classification and Retrieval @ University of Maryland College Park
Traditional approaches to document retrieval focus on conversion to electronic text followed by indexing of the text content. Recently some work in the community has focused on indexing document image content directly. Such techniques break down when text content is limited or highly degraded. Work on document quality estimation will be extended beyond image quality to address structural quality, a factor that is important for determining whether traditional document processing operations will succeed. Then, the team will explore the effects of enhancement on classification and retrieval and extend existing work to adapt to changes in quality. The research is motivated by the need for analysts to deal with very large collections of image data. The traditional goal of converting all documents to an electronic form and using traditional text analysis methods fails when dealing with heterogeneous collections and very noisy (possibly multilingual) content. The approach will allow document image retrieval systems to scale to orders of magnitude beyond current capabilities, and permit users to move beyond content features and use structural similarity to explore large collections. Through classification, this will permit users to mine large collections for clusters of similar content without knowing a priori what the collection contains. The result will be adaptive techniques that can learn from small numbers of samples without knowledge of sources of degradation.
|
0.915 |
2013 — 2015 |
Doermann, David; Davis, Larry (co-PI) |
N/A |
EAGER: Scalable Video Retrieval @ University of Maryland College Park
Traditional video analysis research has been centered on detection and recognition tasks for objects and activities from known sources with a fairly narrow range of content. This effort would extend the predictable dual view hashing algorithm developed in previous work from images to videos. Many videos can be naturally associated with text: annotations by the producer, consumer comments (tags), language derived from speech tracks using speech-to-text methods, or the semantic words produced by applying vision models such as human detectors and local activity detectors. The team will combine appearance-based methods for video classification with language models derived from these text sources so that videos can be retrieved via a natural-language-like interface. This will involve investigating ways of fusing these different text sources in one vector space language model and then applying the dual view hashing methods to a database of videos. The team can then investigate retrieval performance using the text codes for a form of zero-shot category definition. The research is driven by the need for intelligence analysts to be able to express video queries more efficiently than traditional relevance feedback and to be able to provide more expressive queries that include nouns and verbs as they would with human language. While still constrained, the approach goes a long way toward bridging the gap between traditional relevance feedback based only on assumed relationships in the image, and full human language queries.
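Once videos and text queries share a binary code space, retrieval reduces to a nearest-neighbor search in Hamming distance. The sketch below shows only that final lookup step, under the assumption that the codes have already been produced by some hashing method. It does not implement predictable dual view hashing, and the 8-bit codes, clip names, and function names are made up for illustration.

```python
from typing import Dict, List, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def rank_by_code(query_code: int,
                 video_codes: Dict[str, int],
                 top_k: int = 3) -> List[Tuple[str, int]]:
    """Return the top_k videos closest to the query code, nearest first.

    Ties are broken by video id so that the ordering is deterministic.
    """
    scored = [(vid, hamming(query_code, code)) for vid, code in video_codes.items()]
    return sorted(scored, key=lambda item: (item[1], item[0]))[:top_k]


# Hypothetical 8-bit codes; a real system would use longer codes.
video_codes = {
    "clip_run": 0b10110010,
    "clip_swim": 0b01001101,
    "clip_jog": 0b10110110,
}
query = 0b10110011  # assumed code for a text query such as "person running"

print(rank_by_code(query, video_codes, top_k=2))
# [('clip_run', 1), ('clip_jog', 2)]
```

Because the query code comes from text alone, the same lookup can serve a category for which no example videos were labeled, which is the sense in which the abstract speaks of zero-shot category definition.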
|
0.915 |