Hongyu Zhao, Ph.D. - US grants

Affiliations:

2000

University of Minnesota, Twin Cities, Minneapolis, MN

Area:

Organic Chemistry


We are testing a new system for linking grants to scientists.

The funding information displayed below comes from the NIH Research Portfolio Online Reporting Tools and the NSF Award Database.
The grant data on this page is limited to grants awarded in the United States and is thus partial. It can nonetheless be used to understand how funding patterns influence mentorship networks and vice versa, which has deep implications for how research is done.
You can help! If you notice any inaccuracies, please sign in and mark grants as correct or incorrect matches.


High-probability grants

According to our matching algorithm, Hongyu Zhao is the likely recipient of the following grants.

Years	Recipients	Code	Title / Keywords	Matching score
1999 — 2013	Zhao, Hongyu	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Statistical Methods to Map Genes For Complex Traits @ Yale University DESCRIPTION (provided by applicant): Hundreds of genetic regions have been implicated in complex human traits in the past several years through the genome wide association study (GWAS) paradigm. Despite these successes, statistical analyses in most published work were based on single genetic markers. In addition, prior biological knowledge on genetic markers is rarely used. From both statistical and biological points of view, the rich information in the collected GWAS data has not been fully utilized to reveal disease etiologies. To address these critical needs, many research groups have been actively developing statistical and computational methods that can jointly analyze multiple markers, both within a region and across regions, and methods that can more effectively incorporate other sources of information on genetic markers, genes, and pathways in association analysis. The long- term goals of this application are to develop and implement novel statistical methods to identify genes affecting an individual's susceptibility to complex traits, to apply these methods to ongoing studies to enable more biological findings, and to disseminate these tools to the general research community. 
To achieve these broad goals, we propose to accomplish the following specific aims: (1) to develop statistical methods to identify markers that are informative about an individual's ancestry, and to take advantage of this information for more effective adjustment of sample heterogeneity in genetic association studies; (2) to develop statistical methods that can more efficiently perform multi-marker analysis, and to evaluate the statistical power of different marker search strategies; (3) to develop statistical methods that can systematically integrate different sources of information, especially biological pathways and networks, to increase our power to identify markers truly associated with complex diseases; (4) to develop statistical methods to use resequencing data to identify genetic associations between phenotypes and candidate regions. In addition, we will collaborate with leading human geneticists to apply and refine the statistical methods to a wide array of diseases, and to disseminate well-tested and validated programs to the scientific community. PUBLIC HEALTH RELEVANCE: It is well known that genetics plays a major role in many complex human diseases, e.g. cancer, hypertension, and mental disorders. However, very few genes had been firmly implicated in these disorders until a few years ago. With the introduction of high-density platforms where hundreds of thousands of genetic variants can be monitored simultaneously and the formations of large collaborative projects where thousands of patients are jointly analyzed, the field of human genetics has enjoyed a revolution recently. Hundreds of genomic regions have been found to affect the risks of dozens of diseases, and this list will likely keep increasing in the foreseeable future. These rich data have generated many statistical challenges, especially with the rapid developments of resequencing technologies. 
This project will develop novel and powerful statistical methods to enable human geneticists to make the most out of the valuable data collected. Through extensive collaborations, our methods will be applied to many ongoing studies to identify more genomic regions and biological pathways for complex diseases. We will also distribute the well-tested computer programs so that other researchers can utilize the statistical tools developed by us.	0.97
1999 — 2001	Zhao, Hongyu	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Statistical Methods For Nondisjunction Data @ Yale University Chromosome imbalance is the leading known cause of mental retardation, spontaneous abortion, and congenital heart defects in human. Furthermore, over 50 percent of all human pregnancy loss is attributable to chromosome imbalance in the fetus, making chromosome abnormalities the leading cause of reproductive failure. Maternal age is well recognized as a major risk factor for chromosome abnormalities. Alterations of recombinations are also found to be strongly associated with human nondisjunctions. However, standard statistical methods used in human nondisjunction studies are biased. In addition, the standard methods are inefficient in extracting genetic information from nondisjunction data, which are characterized by a limited amount of available materials, unknown stage of origin of nondisjunction error, uninformative matings, and missing parents. Built on our previous studies on the crossover process, ordered tetrads, and half-tetrads, we will develop efficient multilocus statistical methods for human nondisjunction data to include: (1) joint marker information; (2) crossover interference; (3) the uncertainty in the stage of origin of nondisjunction error; (4) parental age effects; (5) untyped and uninformative markers; (6) families with only one available parent; and (7) genotyping errors. These statistical methods will maximally utilize the information in human nondisjunction data to identify basic mechanisms responsible for chromosome abnormalities. 
Using the statistical methods developed in this project, we will collaborate with leading researchers in human nondisjunction studies to analyze, interpret, and report scientific findings for chromosomes X, 13, 15, 16, 18, 21 and 22. To make our methods available to the scientific community, efficient and well-documented computer software will be developed, tested, and distributed on the World Wide Web. The ultimate goals are to understand recombination and its alterations during nondisjunction, and to provide the knowledge needed to monitor and prevent chromosome abnormalities.	0.97
2002 — 2008	Dinesh-Kumar, Savithramma [⬀] Dinesh-Kumar, Savithramma [⬀] Zhao, Hongyu	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Functional Genomics of Host-Virus Interactions @ Yale University The relationship between eukaryotic viruses and their hosts are characteristic of most host-pathogen relationships that have co-evolved. The outcomes of virus-host interactions are genetically pre-determined. To produce disease, viruses must enter the host, multiply locally in host tissues, spread from the site of entry, and overcome or evade host immune responses. Plants have evolved various anti-viral defense strategies to clear viral infection. On the other hand, viruses have evolved counter-defense strategies. Therefore, it is important to understand molecular mechanisms of how viruses evade the host's antiviral defenses. A Tobacco Rattle Virus-based virus induced gene silencing (VIGS) system will be used to identify suppressors of viral resistance and susceptibility factors. In addition, a functional proteomics approach will be used to study virus-host interactions. The knowledge gained from these studies will help to combat infectious plant diseases. Protection of crops from disease can significantly improve agricultural production. Application of a plant's own defense mechanism can lead to more effective protection against plant pathogens. Control of pathogen-induced diseases using cellular genes that function in the disease resistance signaling pathway may provide tremendous agricultural benefits and serve the environment by offering an alternative to pesticide use to prevent disease. Tools and information developed in this project will be made available to the scientific community. These resources will assist in the efforts to improve economically important plants like tomato, potato and pepper. Deliverables Available now: pTRV1, pTRV2, pTRV2-GATEWAY, pTRV-NbPDS, pTRV-tomPDS VIGS system. 
Contact savithramma.dinesh-kumar@yale.edu Available by 9/03: 250-300 sequence verified pTRV2-tomato EST clones 1000-1500 sequence verified pTRV2-Nb-cDNA Agrobacterium TMV-TAP vector	0.97
2003 — 2009	Nelson, Timothy [⬀] Deng, Xing-Wang Zhao, Hongyu	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Analysis of Rice Cellular Expression Profiles @ Yale University The study of rice has tremendous potential impact, both because it is a useful monocot plant model and because it is a direct source of molecular tools for the analysis and manipulation of nearly every cereal grass crop. The high degree of synteny (parallel organization) among cereal genomes makes it possible to relate rice chromosomal regions, regulatory patterns, and genes to those of maize, wheat, barley, sorghum, rye, and other key crops. Laser capture microdissection (LCM) makes it possible to isolate RNA and other molecules from individual visually selected cell types. This project combines LCM with microarray analysis of gene expression to prepare a cellular atlas of expression profiles. This will consist of a public database that provides information about the expression of 15,000 rice genes initially, and subsequently on all 60,000 genes of rice in every cell type. This will serve as a resource for studies of physiology, growth and development of rice and other cereals, with an unprecedented resolution. The database will accommodate future cell-specific data collected under special conditions, such as stress or pathogen attack. The project will develop LCM techniques to gain access to every cell type of rice, and will rely on rice whole-genome microarrays being developed in another NSF project. Specific deliverables of the project will be the further development and testing of LCM in rice and the production of an expression profiling atlas of 125 rice cell types. These will be released to the publicly available website at a schedule of 10 in year 1, 15 in year 2, 50 in year 3 and an additional 50 by year 4. 
Outreach and training activities will include participation in Family Science Days at the Peabody Museum on the Yale campus and contributions to museum displays on plant genomics, targeted to a K-12 student audience.	0.97
2003 — 2006	Snyder, Michael (co-PI) [⬀] Schultz, Martin (co-PI) [⬀] Gerstein, Mark (co-PI) [⬀] Zhao, Hongyu	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Statistical and Computational Approaches For Integrated Genomics and Proteomics Analysis and Their Applications to Modeling G1/S Transition During Yeast Cell Cycle @ Yale University Advances in technologies are changing the field of biology to move beyond genomes to transcriptomes, proteomes and metabolomes. It has become clear that the combination of predictive modeling with systematic experimental verification will be required to gain a deeper insight into living organisms, therapeutic targeting and bioengineering. Although the importance of integrating various types of biological data to address scientific questions is well recognized and appreciated, the potential information carried in different types of data may not be fully realized without a sound and comprehensive statistical framework to integrate these data. In addition, close collaborations among statisticians, biologists, bioinformaticians, and computer scientists are essential to ensure that these statistical methods provide a reasonable description of the biological processes studied and the validity of these methods should be rigorously tested through biological experiments. In this project, a team of researchers with expertise in statistics, genomics and proteomics, bioinformatics, and computer science will develop an integrated approach to reconstructing biological pathways. Statistical and computational methods will be developed to better identify transcription factor targets, to integrate yeast two-hybrid data, protein complex data, protein localization data, and gene expression data to infer protein interaction networks, and to further integrate DNA- protein binding data to reconstruct transcriptional regulatory networks. 
This project focuses on the G1/S transition during the yeast cell cycle to statistically model and experimentally validate inferred regulatory networks. In addition, parallel computing methods will be developed to overcome the computing bottleneck in the analysis of large-scale networks. The resources generated from this project, both computer programs and network information will be made available to the scientific community. It is anticipated that this project will lead to a statistical framework that can be utilized to dissect biological pathways and also will lead to an approach to integrating expertise from diverse disciplines to address important scientific problems in the post-genome era. With recent progresses in biotechnologies, it has become reality to collect tens of thousands of gene expression and protein expression levels in humans and other organisms. In addition, scientists now are able to monitor interactions among proteins and interactions between proteins and DNA sequences, to investigate the location that each gene is expressed, and to study the overall effects on the whole organism of individual genes through large collections of mutation strains. The availability of such data has led to a revolution in biological and biomedical sciences. Although there is a great potential and an enormous amount of information in these data, the major challenge is how to best integrate, analyze, and interpret these data to understand biological pathways. In this project, statistical and computational methods will be developed to integrate various types of data in an effort to reconstruct biological pathways with a focus on the understanding of gene regulations in cell cycle. The statistical models to be developed will be validated with biological experiments. Computer programs will be developed and distributed to the scientific community after extensive testing to allow biologists and medical researchers to use these tools to study other biological pathways. 
This project will also develop high-performance computing approaches to implementing the developed methods and will involve training activities in the general area of computational biology and bioinformatics. This grant is made under the Joint DMS/NIGMS Initiative to Support Research Grants in the Area of Mathematical Biology. This is a joint competition sponsored by the Division of Mathematical Sciences (DMS) at the National Science Foundation and the National Institute of General Medical Sciences (NIGMS) at the National Institutes of Health.	0.97
2004 — 2008	Zhao, Hongyu	P01Activity Code Description: For the support of a broadly based, multidisciplinary, often long-term research program which has a specific major objective or a basic theme. A program project generally involves the organized efforts of relatively large groups, members of which are conducting research projects designed to elucidate the various aspects or components of this objective. Each research project is usually under the leadership of an established investigator. The grant can provide support for certain basic resources used by these groups in the program, including clinical components, the sharing of which facilitates the total research effort. A program project is directed toward a range of problems having a central research focus, in contrast to the usually narrower thrust of the traditional research project. Each project supported through this mechanism should contribute or be directly related to the common theme of the total research effort. These scientifically meritorious projects should demonstrate an essential element of unity and interdependence, i.e., a system of research activities and projects directed toward a well-defined research program goal.	Theoretical Studies of Linkage Disequilibrium @ Yale University The long-term objective of this project is to develop statistical and computational methods for the analysis of haplotypes in population genetics. With the availability of large numbers of genetic markers in the human genome and the advances in genotyping technology, it is becoming feasible in population genetic studies to genotype thousands of markers in a large number of individuals from multiple populations. The analysis of such data poses challenging statistical and computational issues and both theoretical and empirical studies are needed to develop and evaluate statistical methods that can best extract the most relevant information for statistical inference of parameters of interest. 
The specific aims of this project are: (1) Develop statistical and computational methods to infer haplotype frequencies from the observed unphased marker data in multiple populations; (2) Develop general guidelines for marker selection to identify disease susceptibility variants through haplotypes; (3) Use haplotypes consisting of single nucleotide polymorphisms as well as microsatellites from multiple populations for inference on population parameters as well as local recombination rates; (4) Investigate the power of statistical methods to identify chromosomal regions that have been subject to natural selection; (5) Implement and validate the developed methodologies in computer programs that will be distributed to the scientific community; and (6) Collaborate with other investigators to apply the methods and knowledge gained from this project to analyze data from other projects. Our methods will exploit two unique features in the data to be collected: the large number of populations around the world and the exhaustive cataloguing of haplotypes in extended chromosomal regions. The development of these novel statistical methods and user-friendly computer programs will provide useful tools for population genetic studies, and the analysis of data collected from other projects will lead to a better understanding of relationships among various populations and the different forces leading to linkage disequilibrium patterns in the human genome.	0.97
2006	Zhao, Hongyu	R13 Activity Code Description: To support recipient sponsored and directed international, national or regional meetings, conferences and workshops.	International Symposium On Genome-Wide Association Studies @ Yale University DESCRIPTION (provided by applicant): Recent years have seen great progress in the identifications and characterizations of millions of single nucleotide polymorphisms (SNPs) in the human genome, highlighted by the recent publication from the International HapMap Project. The knowledge on the SNPs coupled with the developments of various platforms for high-throughput, high quality genotyping have made genome-wide association studies within reach of human geneticists. However, significant challenges remain in the design, conduct, analysis, and interpretation of genome-wide association studies. Statistical genetics and epidemiological designs will play key roles to ensure the success of these studies. Therefore, a focused meeting on relevant methodological issues can stimulate novel ideas and exchange different views/approaches that can lead to more powerful designs and efficient analyses of genome-wide association studies. To address this need, we request funds to support an International Symposium on Genome-Wide Association Studies to facilitate discussions among leading scientists using genome-wide association strategies to identify genes underlying common diseases. 
The specific aims of this conference are to review the up-to-date knowledge of human population genetics from the International HapMap Project and other related efforts, to discuss design strategies for a genome-wide association study, to explore novel statistical and computational methods designed for genome-wide studies, to illustrate design and analysis issues through case studies, to introduce cutting-edge technologies that may prove indispensable for future studies, and to disseminate the knowledge and lessons learned from this conference to the general scientific community. A group of outstanding speakers have already committed to this meeting and they will play a leadership role at this conference. In addition, poster presentations selected from conference participants will provide ample opportunities for further exchanges and discussions among all conference attendees. The meeting materials will be made available to the public via several channels, such as published meeting report and conference web site, to further increase the impact of this conference.	0.97
2007 — 2011	Weissman, Sherman (co-PI) [⬀] Zhao, Hongyu	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Collaborative Research: a General Framework For High Throughput Biological Learning: Theory Development and Applications @ Yale University This application presents a comprehensive research plan for the investigation of a general framework and various new methods to handle complex large-scale data sets generated from biological (medical) as well as other scientific studies. Two goals are articulated in this proposal: theory development and application in biology and medicine. The former is focused on the study of a general yet core, model-free framework to effectively address major issues arising from high dimensional data. In the latter, the investigators seek to apply methods developed from the theory part to resolve machine learning type problems that arise in biology and medicine. In particular, this team intends to study the problems related to biological and medical prediction in response to treatments, clinical diagnosis of diseases (such as cancers), discovery of protein-protein interactions and biological network constructions related to disease etiology and motif identification. To achieve these two goals, the investigators will study theoretical and practical properties under a general setting and evaluate a series of novel statistical/computation procedures/software which will then be tested by a broad range of real and simulated data, some from current on-going studies. The emergence of high dimensional data in most scientific fields poses new challenges for statisticians. Methods successful in dealing with low dimensional data are no longer effective for high dimensional data. 
One of the greatest difficulties in analyzing these data is to identify the informative variables/features and their associated clusters, and decipher the characteristics of the interaction between these variables and clusters. To meet current and future needs for digging hidden knowledge out of high dimensional data comprehensively and systematically, the scientific fields must develop new methods. The current project is a direct response to this need. Based on theoretical evidence (as preliminary results) already obtained in extracting low dimensional information, this team plans to apply and to develop various effective procedures to address practically important problems in the domains of biology and medicine. The investigators will study a novel screening process applicable across fields to demonstrate how high quality classifiers of low dimensionality can be identified while joint information among the influential variables is fully utilized. For further interpretation for biological validation/confirmation this team will study how to construct biological networks based on low dimensional classifiers and how to identify significant association patterns among them. A feedback mechanism will be established between the methodology development and biological validation teams, where statistical/computational results will be regularly discussed and biologically validated. It is anticipated that the key ideas and methods developed here will find numerous applications in disciplines other than biology/medicine. The proposed research is likely to advance substantial knowledge and significantly benefit current and future efforts in molecular biology/statistics/computational biology/disease prediction/drug discovery. The project would also provide valuable research experiences and training to undergraduates.	0.97
2010 — 2011	Zhao, Hongyu	U01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Lost-of-Function Variants in the 1000 Genomes Data Set and Implications to Gwas @ Yale University DESCRIPTION (provided by applicant): The 1000 Genomes Project is an international research consortium whose aim is to produce a detailed map of human genetic variation to support disease studies with major sequencing effort. This project involves sequencing the genomes of at least a thousand people from around the world to facilitate the discovery and understanding of genetic variants such as single nucleotide polymorphisms and structural variants. The data generated from this project will help in the discovery of regions in the genome containing genetic variations associated with risk of human diseases as previously attempted by efforts such as the HapMap Project. However, there are significant challenges in the analysis, annotation, and applications of these data to guide the identifications of variants associated with diseases and various traits. In this application, we will focus on loss-of-function variants because they represent a major class of genetic variations that are potentially involved in complex traits, and we believe a comprehensive characterization of these variants and making the knowledge gained available to the general research community will facilitate the identifications of genes involved in complex traits. 
To accomplish this objective, we will develop a bioinformatics pipeline to identify loss-of-function variants from the 1000 genome data, associate them with other types of information accumulated in the literature and public databases, such as gene ontology, protein interactions, expression profiles, investigate the best approaches to attain the genotypes of these variants in population samples, and develop statistical methods to incorporate the annotation results to increase the statistical power to identify loss of function variants affecting complex traits. We will disseminate the methods and results to the public both through a stand-alone application focusing on loss of function variants as well as through collaboration with the UCSC Genome Browser team to add tracks on their browser to different types of information on these loss of function variants. We believe that this proposed project will generate very valuable resources to the scientific community that can significantly enhance our understanding of loss of function variants in human populations and use such knowledge to more effectively improve human health. RELEVANCE: The research proposal is developed to generate very valuable resources to the scientific community that can significantly enhance our understanding of loss of function variants in human populations and use such knowledge to more effectively improve human health.	0.97
2011 — 2015	Zhao, Hongyu	N/A Activity Code Description: No activity code was retrieved: click on the grant title for more information	Collaborative Research: Semiparametric Conditional Graphical Models With Applications to Gene Network Analysis @ Yale University The research proposed in this project is motivated by the following problem. In many genetic studies, in addition to gene expression data, other types of data are collected from the same individuals. The problem is how to make use of this additional information when constructing gene networks. The investigators formulate this problem by a conditional Gaussian graphical model (CGGM), in which the external variables are incorporated as predictors. They propose an estimation procedure for this model by combining reproducing kernel Hilbert space with the lasso type regularization. The former is used to construct a model-free estimate of the conditional covariance matrix, and the latter is used to derive a sparse estimator of the conditional precision matrix, whose zero entry pattern corresponds to a graph that describes the gene network. They propose to study the asymptotic properties, to introduce methods to determine the tuning constants, and to develop standardized and openly accessible computer programs for this model. Furthermore, the investigators propose to extend the CGGM in two directions. First, they propose to relax the Gaussian assumption by applying a copula transformation to the residuals and then using pseudo likelihood to estimate conditional correlations. These are then subject to the lasso-type regularization to yield a sparse estimator of the precision matrix. The second direction is the development of a sufficient graphical model, which is a mechanism to simultaneously reduce the dimension of the predictor and estimate the graphical structure of the response. 
High-throughput technologies that enable researchers to collect and monitor information at the genome level have revolutionized the field of biology in the past fifteen years. These technologies offer an unprecedented amount and diversity of data that reveal different aspects of the biological processes. At the same time, they also present many statistical and computational challenges that cannot be addressed by traditional statistical methods. In current genomics research it has become increasingly clear that statistical analysis based on individual genes may incur loss of information on the biological process under study. For example, a widely known study on identifying genetic patterns of diabetic patients showed that no single gene could stand out statistically as responsible for the patterns, and yet clear signals emerged when genes were analyzed in groups. Motivated by this observation, greater attention has been paid to networks of genes. The investigators propose a class of new statistical methods, called conditional graphical models, for constructing gene networks that can take into account a set of covariates. They also plan to develop theoretical properties and computer programs for the proposed methods. Although their inquiry began with gene networks, the investigators envision conditional graphical models to have broad applications beyond genomics, such as in predicting asset returns and in studying social networks, which are becoming all the more prevalent in this age of the Internet.	0.97
2012 — 2016	Zhao, Hongyu	P01Activity Code Description: For the support of a broadly based, multidisciplinary, often long-term research program which has a specific major objective or a basic theme. A program project generally involves the organized efforts of relatively large groups, members of which are conducting research projects designed to elucidate the various aspects or components of this objective. Each research project is usually under the leadership of an established investigator. The grant can provide support for certain basic resources used by these groups in the program, including clinical components, the sharing of which facilitates the total research effort. A program project is directed toward a range of problems having a central research focus, in contrast to the usually narrower thrust of the traditional research project. Each project supported through this mechanism should contribute or be directly related to the common theme of the total research effort. These scientifically meritorious projects should demonstrate an essential element of unity and interdependence, i.e., a system of research activities and projects directed toward a well-defined research program goal.	Analytical Core @ Yale University The Analytical Core led by Dr. Hongyu Zhao will support the statistics, bioinformatics, implementation, and data analysis needs of all the Projects of this Program Project. In coordination with the Administrative Core led by Dr. Yung-Chi Cheng, this Core will function to conduct statistical analysis of the clinical trial data from Project 1, assist in the analysis of potential biomarkers and their associations with patient responses from Project 2, and implement and apply the novel statistical methods to be developed in Project 3. This Core will also provide a training environment for program project personnel on software tools on data analysis and interpretation.	0.97
2017 — 2020	Paul, Debashis (co-PI) [⬀] Jiang, Jiming [⬀] Zhao, Hongyu	N/AActivity Code Description: No activity code was retrieved: click on the grant title for more information	Misspecified Mixed Model Analysis: Theory and Application @ University of California-Davis This project, a collaboration between statisticians and a statistical geneticist, focuses on the development of statistical theory and methods for the analysis of data from genome-wide association studies (GWAS). Over the past decade, while GWAS have been very successful in detecting genetic variants that affect complex human traits/diseases, these discoveries have only accounted for a small portion of the genetic factors. Recently, significant progress has been made using statistical analysis based on a class of statistical models called mixed effects models. However, there is a gap in understanding why the method works, because, in a way, the statistical model used in the analysis is misspecified. This project aims to fill the gap by developing new theory and methods, and evaluating the methods through applications to real data. The project will promote teaching, training and learning, broaden the participation of students from under-represented groups, and build research networks between institutions. The research will be of great interest to many other areas of science, and the results will be widely disseminated in subject matter domain journals. In the past decade, more than 24,000 single-nucleotide polymorphisms (SNPs) have been reported to be associated with at least one trait/disease at the genome-wide significance level. However, these significantly associated SNPs only account for a small portion of the genetic factors underlying complex human traits/diseases, referred to as "missing heritability" in the genetics community. Recently, significant progress has been made in using the restricted maximum likelihood (REML) approach based on linear mixed models (LMM). 
While the REML approach appears to provide the right answer to many problems of practical interest, researchers have been puzzled by the fact that the LMM, under which the REML estimators are derived, is misspecified. In a recently published article, the investigators proved that the REML estimators of some important genetic quantities, such as heritability and the variance of the environmental error, are consistent despite the model misspecification. While this pioneering work led to a new field called misspecified mixed model analysis (MMMA), many theoretical and practical challenges remain unsolved. This project seeks to address the following problems: (1) extension of MMMA to correlated SNPs, (2) development of the asymptotic distribution of the REML estimator under misspecified LMM, (3) resampling methods for MMMA, (4) estimation of the number of nonzero random effects, and (5) extensions to multiple random effect factors and discrete traits. The research will also include software development to implement the methods.	0.943
2020 — 2021	Wu, Baolin (co-PI) [⬀] Zhao, Hongyu	R01Activity Code Description: To support a discrete, specified, circumscribed project to be performed by the named investigator(s) in an area representing his or her specific interest and competencies.	Novel Statistical Methods and Tools to Integrate Multiple Endophenotypes and Functional Annotation Data to Study the Roles of Rare Variants in Complex Human Diseases Using Sequencing Data @ Yale University Project Summary: In the past fifteen years, great efforts have been made to understand the genetic architecture of complex human diseases through genome-wide association studies. Although many genome-wide significant variants have been identified, the heritability or variance explained by these variants remains very small. This suggests substantial missing heritability that may yet be explained by common genetic variants with smaller effect sizes and/or rare and low-frequency variants, and it calls for the development and application of novel statistical methods to whole genome/exome sequencing data collected from deeply phenotyped cohorts. In this project, we will develop methods that leverage multiple correlated endophenotypes and further integrate functional annotation data to identify novel rare variants for complex traits. We will develop a set of new computational and analytical tools that are practically useful and broadly applicable to general sequencing studies, and the applications of our methods will likely identify novel rare variant associations and shed new light on the genetics of cardiometabolic diseases. In Aim 1, we propose to develop novel statistical methods to integrate multiple endophenotypes to study the impact of rare variants on complex human diseases. Our methods will fill the gap between the current practice of association studies and the practical needs of integrating endophenotypes for improved understanding and diagnosis of clinical outcomes. 
In Aim 2, we will extend the methods to meta-analyses across studies. In Aim 3, we will develop a novel kernel machine learning approach to integrating various functional information to annotate the whole genome region, and further integrate them to develop a dynamic whole-genome scan test to detect rare variant associations with multiple endophenotypes. We will leverage the NHLBI TOPMed whole genome sequencing (WGS) data and the UK Biobank whole exome sequencing (WES) data, and integrate the functional annotation data to identify and dissect the role of rare variants in cardiometabolic traits (Aim 4). Our proposed work is cost-effective as it leverages the existing WGS/WES samples and functional annotation data while providing methods and tools that are broadly applicable to other studies, and builds on a strong team of scientists with a proven track record in statistical genetics, large-scale genetic studies, and cardiometabolic traits. We expect our methods will lead to the discovery of many more rare and low-frequency variants for these traits. These results will offer new insights to help design more effective treatment and prevention strategies. All our proposed methods will be disseminated to the public through well-tested and publicly available software (Aim 5).	0.97