2003 — 2011
McCallum, Andrew; Jensen, David
ITR: Unified Graphical Models of Information Extraction and Data Mining with Application to Social Network Analysis @ University of Massachusetts Amherst
This project aims to improve our ability to mine information previously locked in unstructured natural language text. It focuses on developing novel statistical models for information extraction and data mining that are so tightly integrated that the boundaries between them disappear, resulting in a powerful unified framework for extraction and mining. Current information extraction methods populate slots in a database by identifying relevant subsequences of text, but they are usually unaware of the emerging patterns and regularities in the database. Current data mining methods begin from a populated database, and they are often unaware of where the data came from, or of its inherent uncertainties. The result is that the accuracy of both suffers, and significant mining of complex text sources is beyond reach. This project uses probabilistic graphical models that make extraction and mining decisions in the same probabilistic currency, with a common inference procedure. Such models promise significant gains in accuracy and capability, as well as an opportunity for deeper understanding of the role of high-level, top-down patterns in natural language processing, and of the role of low-level, bottom-up language data in symbolic processing. The project grounds this work in two real-world application domains: scientific research and government information. The extraction and mining of large-scale databases in these domains will have broad impacts by providing useful, constantly updated Web resources, by enabling insights into government efficiency and the flow of scientific ideas, and by making databases, analyses, and source code publicly available. http://kdl.cs.umass.edu/projects/unified-graphical-models.html
2003 — 2004
Dietterich, Thomas; McCallum, Andrew
Student Participant Support for the International Conference on Machine Learning 2003 @ Oregon State University
This award provides funds to subsidize the travel and housing expenses of students selected to participate in the Twentieth International Conference on Machine Learning (ICML), which will be held on August 21-24, 2003, in Washington, DC.
At the conference, students will present results of their research in poster sessions as part of the normal conference schedule. Members of the Program Committee will be required to visit at least three posters to provide feedback to the students. This gives the students invaluable exposure to outside perspectives on their work at a critical time in their research, and also enables them to explore their career objectives. The conference contributes to the professional development of young scientists who will lead this growing field in the coming decades.
2004 — 2008
McCallum, Andrew
ITR: Collaborative Research: (ACS+NHS)-(DMC+SOC): Machine Learning for Sequences and Structured Data: Tools for Non-Experts @ University of Massachusetts Amherst
Sequential and graph-structured data arise naturally in a wide variety of scientific, engineering, and intelligence problems, such as handwriting and speech recognition, text mining, gene finding, and network analysis. While researchers have recently made significant progress on machine learning methods for processing structured data, these methods are much less accessible to scientists, engineers, and analysts than the better understood statistical learning techniques of classification and regression.
This project is researching methods to advance the state of the art in machine learning for structured data, building on recent work in conditional random fields and weighted transducers. The project is also developing a software toolkit to make the results of these advances accessible to researchers working in a wide range of disciplines and application domains. The toolkit will enable users to define, train, and apply models for structured data without requiring advanced expertise in machine learning. The functionality of the toolkit will include methods for specifying features relevant to an application, automatically selecting the most relevant features, adjusting parameters to optimize suitable training objectives, and combining models that pertain to different facets of an application.
The software, which will be freely distributed, will be tested with selected users in several application domains, and be carefully documented. The project will thus provide the scientific and engineering community with the first generally usable tool for learning from structured data, serving a role that is parallel to that of the more standard tools for classification and regression that are already widely used.
2006 — 2010
McCallum, Andrew
CRI: Collaborative Research: Improving Experimental Computer Science with a Searchable Web Portal for Data Sets @ University of Massachusetts Amherst
This collaborative project, developing and populating a Web-based Dataset Portal, provides a powerful front end for online searching, querying, and browsing of research datasets, coupled to an intelligent back-end system that dynamically provides cross-references among datasets, research papers, techniques, authors, grants, and journals/conferences. The datasets are linked to Rexa, a research-paper digital library at UMass. The work redesigns the UCI dataset archive with structured metadata that supports Web queries, creating a formalized repository of research datasets with uniform, queryable metadata. The system is built on the UCI Machine Learning and KDD Data Repositories. In research areas such as machine learning, data mining, applied statistics, language modeling, information retrieval, computer vision, and speech recognition, methodologies are often evaluated on publicly available datasets. Although these datasets often serve as a common touchstone for communication, identifying and locating specific data spread haphazardly across various Web sites is difficult. This work creates a community resource to address this problem.
Broader Impact: The project directly impacts empirical research, teaching, and collaborative research activities. Browsing data that suggest new models and applications should inspire researchers and students. Real-world datasets not only broaden research but also encourage teachers to incorporate them into the curriculum. Sharing data should foster more collaboration across multiple areas.
2008 — 2013
McCallum, Andrew
RI-Medium: Collaborative Research: Dynamically-Structured Conditional Random Fields for Complex, Natural Domains @ University of Massachusetts Amherst
Recent progress in bioinformatics, natural language understanding, computer vision, information retrieval and other areas has been significantly enabled by "conditional random fields" (CRFs), machine learning models of structured outputs such as sequences, trees, and grids. However, many of the fundamental problems in these application areas involve not just fixed structures, but structures that must be inferred. This structural ambiguity arises from interacting choices at different levels of representation (e.g., from character sequences to meaning, or from pixels to scene interpretation). The project will move conditional random fields beyond fixed graphical structures to structures that are constructed dynamically during inference. Such a capability will be key to building next-generation systems that solve, not just an individual piece of a problem, but complex multi-step problems, as found in natural language understanding and computer vision, in a unified way.
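The core CRF computation this line of work builds on can be sketched briefly. The toy below (with invented potentials, two labels, and a length-three sequence; not the project's dynamically-structured models) scores a label sequence and computes the partition function Z with the forward algorithm, then checks that the resulting distribution over all label sequences sums to one:

```python
import itertools
import math

labels = [0, 1]  # e.g., O vs. ENTITY
T = 3            # sequence length

# Toy log-potentials (made up for illustration): emit[t][y] and trans[y_prev][y].
emit = [[0.2, 1.0], [1.5, 0.1], [0.3, 0.9]]
trans = [[0.5, -0.2], [-0.4, 0.8]]

def seq_score(ys):
    """Unnormalized log-score of one label sequence."""
    s = emit[0][ys[0]]
    for t in range(1, T):
        s += trans[ys[t - 1]][ys[t]] + emit[t][ys[t]]
    return s

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Forward algorithm: alpha[y] = log-sum over all prefixes ending in label y.
alpha = [emit[0][y] for y in labels]
for t in range(1, T):
    alpha = [logsumexp([alpha[yp] + trans[yp][y] for yp in labels]) + emit[t][y]
             for y in labels]
log_Z = logsumexp(alpha)

# Sanity check: p(y|x) = exp(seq_score(y) - log_Z) sums to 1 over all 2^T sequences.
total = sum(math.exp(seq_score(ys) - log_Z)
            for ys in itertools.product(labels, repeat=T))
```

The forward recursion computes Z in time linear in sequence length, where brute-force enumeration is exponential; dynamically-structured CRFs must perform this kind of inference over graphs whose shape is itself uncertain.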
2009 — 2013
McCallum, Andrew; Learned-Miller, Erik
RI: Small: Coordinating Language Modeling, Computer Vision, and Machine Learning for Dramatic Advances in Optical Character Recognition @ University of Massachusetts Amherst
The goal of this research is to develop new methods for improving the performance of optical character recognition (OCR) systems. In particular, the PI investigates "iterative contextual modeling", an approach to OCR in which high-confidence recognitions of easier document portions are used to help develop document-specific models. These models can relate to appearance; for example, a sample of correctly recognized words can be used to build a model of the font in a particular document. The models can also be based on language and vocabulary information: after recognizing a portion of the words in a document, the general topic of the document may be detected, at which point the distribution over likely words in the document can be adjusted. The ability to modify character-appearance distributions and language statistics, tuning them specifically to the document at hand, is expected to produce significant increases in the quality of OCR results.
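A minimal sketch of the iterative idea (the words, scores, and mixing weight below are invented for illustration; the actual system is far richer): confident recognitions build a document-specific word distribution, which then rescores an ambiguous region.

```python
from collections import Counter

# Document-specific model built from high-confidence recognitions.
confident = ["neural", "network", "training", "network", "neural"]
doc_model = Counter(confident)

# An ambiguous region: OCR candidate words with raw image-based scores.
candidates = {"network": 0.48, "networh": 0.52}

def rescore(word, image_score, alpha=0.5):
    # Mix image evidence with the document-specific frequency (add-one smoothed).
    prior = (doc_model[word] + 1) / (sum(doc_model.values()) + len(candidates))
    return alpha * image_score + (1 - alpha) * prior

# The document model overturns the slightly better raw score of the garbled word.
best = max(candidates, key=lambda w: rescore(w, candidates[w]))
```

Here the document-specific prior outweighs the small image-score gap, so the already-seen word "network" wins over the garbled "networh"; in the full approach, appearance models would be refined in the same loop.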
2010 — 2016
McCallum, Andrew
CI-ADDO-EN: Flexible Machine Learning for Natural Language in the MALLET Toolkit @ University of Massachusetts Amherst
Natural language processing, information extraction, information integration and other text-processing solutions are central components of computer science, and key tools for addressing the ever-increasing problem of information overload. Information overload is not only a personal problem but is also critical for business productivity, national defense, and, increasingly, government decision-making and transparency.
State-of-the-art natural language processing is increasingly based on machine learning. However, the methodologies can be complex, and the software infrastructure necessary for such systems is generally difficult to develop from scratch. To address this need we have created MALLET (MAchine Learning for LanguagE Toolkit) and FACTORIE (Factor graphs, Imperative, Extensible), open-source software toolkits that run on the Java virtual machine. They provide many modern, state-of-the-art machine learning methods, specially tuned to scale to the idiosyncrasies of natural language data, while also applying well to many other discrete, non-language tasks.
The project will fill three critical gaps: (1) broadening these toolkits' applicability to new data and tasks (with better end-user interfaces for labeling, training, and diagnostics), (2) greatly enhancing their research-support capabilities (with infrastructure for flexibly specifying model structures), and (3) improving their understandability and support (with new documentation, examples, and online community support).
The project will have a direct positive impact on NLP and other machine learning research, on teaching, and on collaborative research activities. Well-designed toolkits not only help researchers avoid duplicated implementation effort: (a) they encourage sharing of algorithms and code, and thus cultivate increased collaboration and intellectual flow of ideas; (b) they foster clear communication of algorithmic details and scientific reproducibility; (c) they help "level the playing field" by providing state-of-the-art implementations of foundational building blocks and recent methods to top-tier and small institutions alike; and (d) they serve as a teaching tool by making it easy for students to experiment with the supplied research methodologies. Furthermore, by providing multiple ready-to-use systems, the project will give non-programmers access to modern, scalable implementations of text-processing tools, spreading knowledge and use of these techniques to the social sciences, humanities, and biomedical fields.
For further information see the project web site at the URL: http://www.cs.umass.edu/~mccallum/nsf-mallet
2010 — 2016
Wallach, Hanna; McCallum, Andrew
Collaborative Research: New Methods to Enhance Our Understanding of the Diversity of Science @ University of Massachusetts Amherst
This project focuses on the development and implementation of new quantitative methods to provide a deeper understanding of science policy interventions. By building analytic tools that capture the diversity of science, this project moves beyond existing methods that typically analyze the rate of scientific innovation. This move is an important next step in the "science of science policy" agenda:
Intellectual Merit: Although understanding of the effects of institutional changes on the rate of inventive activity has improved markedly in recent years, effective science policy interventions must also be grounded in an understanding of their impact on diversity as well as productivity, construed both in terms of idea diversity (the array of different ideas derived from novel scientific insights) and individual diversity (the variety of people and organizations in social space engaged in scientific progress). Moving forward with this crucial agenda requires a rich new set of tools. In developing such tools, this project extends prior work that focuses on "citation counting," combining novel approaches from the social and computer sciences to represent and analyze publication, patent, and grant data in idea and social space. Specifically, the tools integrate two powerful methods: (a) statistical topic modeling and (b) social network analysis.
Broader Impact: These methods can also be extended to examine diversity across national, social and topic boundaries, thus providing quantitative tools to characterize issues of key significance in debates over national competitiveness. While these science policy questions could be addressed in a wide variety of settings, this project focuses on the varied data associated with the human genome and human genetics.
2010 — 2011
McCallum, Andrew; Learned-Miller, Erik (co-PI)
The Fourth Northeast Student Colloquium on Artificial Intelligence @ University of Massachusetts Amherst
This award will help to subsidize the participation of graduate students in the fourth Northeast Student Colloquium on Artificial Intelligence (NESCAI), to be held April 16-18, 2010 at the University of Massachusetts in Amherst. The conference will include oral and poster presentations by students, invited talks by senior AI researchers, and student-run tutorials. It will be largely run by a program committee of doctoral students under the guidance of senior faculty; the committee will conduct a review process to select the projects chosen for oral and poster presentations. In addition to graduate students, the conference plans to encourage attendance by outstanding senior undergraduates through a special undergraduate track, in the hope of increasing undergraduate enthusiasm for research and thus the likelihood that they will go on to graduate work. The project integrates research and education and commits to broadening diversity.
2015 — 2018
McCallum, Andrew
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials @ University of Massachusetts Amherst
NON-TECHNICAL:
Development of new materials is the key to addressing many of the technical challenges our society faces, from energy storage to water treatment and purification. To offer just a few examples: in the oil industry, new materials are needed to withstand aggressive conditions, where failure comes with tremendous cost; electrified vehicle drive trains will be advanced by higher-performing battery electrodes; and carbon dioxide capture requires inexpensive new materials with the proper thermodynamic and kinetic behavior towards absorption and release. The rapid design of novel materials has been transformed by approaches in which properties for many tens of thousands of materials can be predicted or inferred by a computer. The pace of commercially realized advanced materials now seems to be limited by trial-and-error synthesis techniques. In other words, researchers have accelerated the process of knowing what to make such that the bottleneck is now how to make the structures. This research will learn from existing knowledge to develop insight on the synthesis of inorganic compounds. The analytical foundation of these activities stems from advances in machine learning that have allowed computers to excel in typically "human" tasks such as health care diagnoses and game show participation. This research will further accelerate the goals of efforts such as the Materials Genome Initiative for Global Competitiveness by enabling efficient synthesis of novel materials, thereby speeding up evaluation of newly suggested materials.
TECHNICAL:
Materials are a key bottleneck in many technological advances such as efficient catalysis, clean energy generation, and water filtration. Materials Genome Initiative-style efforts have produced several examples of computationally designed materials in the fields of energy storage, catalysis, thermoelectrics, and hydrogen storage, as well as large data resources that can be used to screen for potentially transformative compounds. These successes in accelerated materials design have moved the bottleneck in materials development towards the synthesis of novel compounds, and much of the momentum and efficiency gained in the design process becomes gated by trial-and-error synthesis techniques. This research will do for solid state advanced materials synthesis what modern computational methods are doing for materials properties: Build predictive tools for synthesis so that targeted compounds can be synthesized more rapidly. This work will combine knowledge regarding synthesis, first principles modeling, and data mining to suggest synthesis routes for novel compounds.
2015 — 2019
McCallum, Andrew
III: Medium: Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema" @ University of Massachusetts Amherst
Automated knowledge base (KB) construction from natural language is of fundamental importance to (a) scientists (for example, there has been long-standing interest in building KBs of genes and proteins), (b) social scientists (for example, building social networks from textual data), and (c) national defense (where network analysis of criminals and terrorists has proven useful). The core of a knowledge base is its objects ("entities", such as proteins, people, organizations and locations) and the connections between these objects ("relations", such as one protein increasing production of another, or a person working for an organization). This project aims to greatly increase the accuracy with which entity-relations can be extracted from text, as well as increase the fidelity with which many subtle distinctions among types of relations can be represented. The project's technical approach, which we call "universal schema", is a markedly novel departure from traditional methods, based on representing all of the input relation expressions as positions in a common multi-dimensional space, with nearby relations having similar meanings. Broader impacts will include collaboration with industry on applications of economic importance, collaboration with academic non-computer-scientists on a multidisciplinary application, creating and publicly releasing new data sets for benchmark evaluation by ourselves and others (enabling scientific progress through improved performance comparisons), and creating and publicly releasing an open-source implementation of our methods (enabling further scientific research, easy large-scale use, rapid commercialization, and third-party enhancements). Education impacts include creating and teaching a new course on knowledge base construction for the sciences, organizing a research workshop on embeddings, extraction and knowledge representation, and training multiple undergraduates and graduate students.
Most previous research in relation extraction falls into one of two categories. In the first, one must define a pre-fixed schema of relation types (such as lives-in, employed-by and a handful of others), which limits expressivity and hides language ambiguities. Training machine learning models here either relies on labeled training data (which is scarce and expensive) or uses lightly-supervised self-training procedures (which are often brittle and wander farther from the truth with additional iterations). In the second category, one extracts into an "open" schema based on the language strings themselves (lacking the ability to generalize among them), or attempts to gain generalization with unsupervised clustering of these strings (suffering from clusters that fail to capture reliable synonyms, or fail to find the desired semantics at all). This project proposes research in relation extraction with "universal schema", in which we learn a generalizing model of the union of all input schemas, including multiple available pre-structured KBs as well as all observed natural language surface forms. The approach thus embraces the diversity and ambiguity of original language surface forms (not trying to force relations into pre-defined boxes), yet also successfully generalizes by learning non-symmetric implicature among explicit and implicit relations, using new extensions to the probabilistic matrix factorization and vector embedding methods that were so successful in the Netflix Prize competition. Universal schema provide for a nearly limitless diversity of relation types (due to surface forms), and support convenient semi-supervised learning through integration with existing structured data (i.e., the relation types of existing databases). In preliminary experiments, the approach already surpassed the previous state-of-the-art relation extraction methods by a wide margin on a benchmark task.
New proposed research includes new training processes, new representations that include multiple-senses for the same surface form as well as embeddings with variances, new methods of incorporating constraints, joint inference between entity- and relation-types, new models of non-binary and higher-order relations, and scalability through parallel distribution. The project web site (http://www.iesl.cs.umass.edu/projects/natural-language-relation-extraction-and-implicature-through-universal-schema-using-embeddings) will include information on the project and provide access to data sets, source code and documentation, teaching and workshop materials, and publications. In addition, datasets will be disseminated via UCI Machine Learning Repository (or other similar archive location for machine learning data) to facilitate sharing with other researchers and ensure long-term availability, and GitHub will be used to facilitate release, sharing, and archiving of code.
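The factorization at the heart of universal schema can be sketched in miniature. The entity pairs, relations, and facts below are invented, and a plain logistic matrix factorization stands in for the project's richer training objectives (real universal schema uses ranking losses with sampled negatives): embeddings are learned so that dot products reconstruct the observed (entity-pair, relation) cells of the matrix.

```python
import math
import random

random.seed(0)
pairs = ["(Smith, IBM)", "(Lee, MIT)"]
rels = ["employed-by", "works-for", "professor-at"]  # KB relations + surface forms
observed = {("(Smith, IBM)", "employed-by"), ("(Smith, IBM)", "works-for"),
            ("(Lee, MIT)", "professor-at"), ("(Lee, MIT)", "works-for")}

d = 4  # embedding dimension
U = {p: [random.gauss(0, 0.1) for _ in range(d)] for p in pairs}
V = {r: [random.gauss(0, 0.1) for _ in range(d)] for r in rels}

def score(p, r):
    """Predicted probability that entity pair p holds relation r."""
    dot = sum(a * b for a, b in zip(U[p], V[r]))
    return 1.0 / (1.0 + math.exp(-dot))

# SGD on logistic loss, treating unobserved cells as negatives.
lr = 0.5
for _ in range(2000):
    for p in pairs:
        for r in rels:
            y = 1.0 if (p, r) in observed else 0.0
            g = score(p, r) - y
            for i in range(d):
                up, vr = U[p][i], V[r][i]
                U[p][i] -= lr * g * vr
                V[r][i] -= lr * g * up
```

After training, observed cells score near 1 and unobserved cells near 0; with many rows and columns, the shared low-dimensional space is what lets semantically similar relations (e.g. "employed-by" and "works-for") generalize to one another.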
2019 — 2023
Wing, Jeannette; Hendler, James (co-PI); Honavar, Vasant; McCallum, Andrew; Baston, Rene
BD Hubs: Northeast: The Northeast Big Data Innovation Hub
The BD Hubs foster regional networks of stakeholders and cooperate nationally on US priorities of importance to a region and to the nation. The activities of the BD Hubs contribute to a vibrant national data innovation ecosystem. The Northeast Big Data Innovation Hub serves as a uniquely neutral entity within this ecosystem, harnessing the data revolution by building strategic partnerships that advance innovative solutions to a broad range of societal, scientific, and industry challenges. This vision is empowered and strengthened through the Hub's collaboration with a diverse community of partners, including underserved populations, world-class institutions, and people of all backgrounds who rely on or are impacted by big data. Leveraging the distinctive characteristics and challenges of the northeastern United States, the Northeast Hub will design and facilitate multi-disciplinary, community-led activities and initiatives such as:
- Aggregating and helping to develop best practices for responsible data science;
- Creating frameworks for data fluency;
- Fostering better management of data security and privacy;
- Integrating health data from traditional and novel sources;
- Improving education through big data; and
- Reducing barriers for data sharing within and between different sectors.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
2021 — 2025
McCallum, Andrew
RI: Medium: Probabilistic Box Embeddings @ University of Massachusetts Amherst
Artificial intelligence (AI) and machine learning are revolutionizing the pace of progress in science, biomedicine, healthcare, business, economics, and national defense. A foundational technical choice in AI and machine learning is that of representation. Before a machine can reason over data, that data must be represented in a way that enables parameters to be learned and useful inferences to be made. The choice of representation has profound implications for a method's capabilities and safety. This project explores a new fundamental representation that is expected to provide better expressivity, interpretability, uncertainty characterization, and robustness, thereby laying groundwork for future representational foundations advantageous to AI safety and commonsense reasoning.
The fundamental representation for data and concepts in nearly all machine learning, including neural networks, is the vector: a point in d-dimensional space. Vectors conveniently support symmetric distance calculation, semantic neighborhoods, and geometric reasoning. For example, learned vectors representing "eagle," "bird," and "fly" may designate points that are close to each other, indicating that they are semantically closely related. However, there are intriguing reasons to consider representations based not on points but on regions: regions of varying breadth and overlap, able to capture (like Venn diagrams) that "bird" is a broader concept than "eagle," that "all eagles are birds," and that "some but not all birds fly." This project focuses on machine learning research in a new learnable representation called box embeddings: d-dimensional hyperrectangles, which are closed under intersection, can represent arbitrary directed acyclic graphs, define regions whose volume is easily calculated, and can precisely and compactly represent large joint probability distributions. The research will address foundational open research questions concerning (1) fundamentals such as expressivity, regularization, and alternative geometric spaces; (2) the relation to graphical models, having already shown that boxes have interestingly different strengths; and (3) deep learning with boxes.
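The geometry is simple enough to sketch directly. In the toy below the box coordinates are invented for illustration (learned models fit them from data, and gradient training in practice uses smoothed volumes): volumes and intersections of axis-aligned boxes yield containment probabilities such as P(bird | eagle).

```python
# A box is a pair (min_corner, max_corner) in d dimensions.

def volume(box):
    lo, hi = box
    v = 1.0
    for l, h in zip(lo, hi):
        if h <= l:
            return 0.0  # empty box
        v *= h - l
    return v

def intersect(a, b):
    (alo, ahi), (blo, bhi) = a, b
    lo = [max(x, y) for x, y in zip(alo, blo)]
    hi = [min(x, y) for x, y in zip(ahi, bhi)]
    return lo, hi

# Hypothetical 2-d concept boxes: "eagle" nested inside "bird",
# "fly" partially overlapping both.
bird = ([0.0, 0.0], [1.0, 1.0])
eagle = ([0.1, 0.1], [0.4, 0.4])
fly = ([0.3, 0.0], [1.2, 0.8])

# Conditional probability as a volume ratio: P(A | B) = vol(A ∩ B) / vol(B).
p_bird_given_eagle = volume(intersect(bird, eagle)) / volume(eagle)  # "all eagles are birds"
p_fly_given_bird = volume(intersect(fly, bird)) / volume(bird)       # "some but not all birds fly"
```

Because "eagle" is fully contained in "bird", the first ratio is exactly 1, while the partial overlap with "fly" gives a probability strictly between 0 and 1; this is the Venn-diagram behavior that point vectors cannot express directly.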
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.