A distribution theory for predictive-R2 in high-dimensional models and its implications for genome-wide association studies

Nilanjan Chatterjee
National Cancer Institute

Modern genome-wide association studies have led to the discovery of thousands of susceptibility loci across a variety of quantitative and qualitative traits. Although the loci discovered so far have limited ability to predict any individual trait, heritability calculations indicate that the power of predictive models could be increased substantially by building polygenic models on larger data sets. In this talk, I will first describe a novel theoretical framework that allows evaluation of the distribution of R2 and other related measures of predictive performance of high-dimensional statistical models based on the sample size of the training dataset, the threshold for variable selection, the number of underlying predictive variables, and the distribution of their effect sizes. I then use this framework, together with empirical estimates of heritability and effect-size distributions for susceptibility SNPs across eight complex traits, to estimate the sample sizes required and the optimal thresholds for SNP selection for building future polygenic models whose predictive power may approach that of an "ideal" limiting model that could be built with an infinite sample size. The general framework can also be useful for planning the development of prediction models in other contexts, such as future studies of rare variants.
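
To make the dependence on training sample size and selection threshold concrete, the following is a minimal Monte Carlo sketch, not the analytic framework described in the talk. It rests on simplifying assumptions that are not taken from the abstract: independent standardized SNPs, a phenotype with unit variance, normally distributed causal effects, and marginal effect estimates with sampling variance approximately 1/N. The function name `simulate_predictive_r2` and all of its parameters and default values are hypothetical and chosen only for illustration.

```python
import numpy as np
from statistics import NormalDist


def simulate_predictive_r2(n_train, n_snps=2000, n_causal=100,
                           h2=0.5, alpha=0.05, seed=0):
    """Predictive R2 of a thresholded polygenic score for one simulated study.

    Illustrative assumptions: independent standardized SNPs, unit phenotype
    variance, and marginal estimates beta_hat ~ N(beta, 1 / n_train).
    """
    rng = np.random.default_rng(seed)

    # True effects: only the first n_causal SNPs are predictive.
    beta = np.zeros(n_snps)
    beta[:n_causal] = rng.normal(0.0, 1.0, n_causal)
    # Rescale so the true effects explain exactly h2 of phenotypic variance.
    beta *= np.sqrt(h2 / np.sum(beta ** 2))

    # Noisy marginal estimates from a training set of size n_train.
    se = 1.0 / np.sqrt(n_train)
    beta_hat = beta + rng.normal(0.0, se, n_snps)

    # Keep SNPs whose two-sided test passes the significance threshold alpha.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    weights = np.where(np.abs(beta_hat / se) > z_crit, beta_hat, 0.0)

    score_var = np.sum(weights ** 2)
    if score_var == 0.0:
        return 0.0  # no SNP selected, so the score predicts nothing

    # Under the assumptions above, cov(y, score) = weights . beta and
    # var(score) = weights . weights, so the squared correlation is:
    return float(np.dot(weights, beta) ** 2 / score_var)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(n, round(simulate_predictive_r2(n), 3))
```

Under these assumptions the returned value cannot exceed `h2`, which plays the role of the "ideal" limiting model built with an infinite sample size. Varying `alpha` for a fixed `n_train` gives a rough sense of how the selection threshold trades off included signal against included noise.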


Back to Workshop IV: Coancestry, Association, and Population Genomics