Logistic regression is extensively used to fit binary data, predict future outcomes, and assess the statistical significance of feature variables. The likelihood ratio test (LRT) is frequently used for statistical inference in this model, and p-values for the test statistic are computed using a distributional approximation to the LRT. In classical statistics, the well-known Wilks’ theorem asserts that, for a fixed number of feature variables, twice the log-likelihood ratio is asymptotically chi-square distributed in the limit of large sample sizes (under the null hypothesis and suitable regularity conditions). This approximation, or very similar ones, is routinely invoked in standard statistical software packages.
Our findings reveal, however, that in the high-dimensional regime where the number of samples is comparable to the number of features, Wilks’ theorem fails to hold; in fact, the chi-square approximation produces p-values that are far too small, which turns out to be especially problematic for multiple testing. This talk shows that, for a class of logistic models, the log-likelihood ratio converges in distribution to a rescaled chi-square. The rescaling factor depends on the limiting ratio of the number of features to the number of samples, and can easily be found by solving a nonlinear system of two equations in two unknowns. Our theoretical findings are complemented by numerical studies that demonstrate the accuracy of our results even for finite sample sizes.
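The over-rejection phenomenon is easy to reproduce by simulation. The sketch below (an illustration, not the authors' code; the choices of n, p, and the number of repetitions are arbitrary assumptions) fits unpenalized logistic regressions by Newton's method under the global null and checks how often the classical chi-square(1) cutoff rejects a true null hypothesis about a single coefficient. With p/n = 0.2, the empirical size lands well above the nominal 5%.

```python
# Illustrative simulation (assumed parameter choices, not from the talk):
# under the global null beta = 0, the LRT for one coefficient calibrated
# against chi-square(1) over-rejects when p is comparable to n.
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic_loglik(X, y, n_iter=50, tol=1e-8):
    """Unpenalized logistic MLE via Newton's method; returns the maximized
    log-likelihood. Uses a tanh-based sigmoid for numerical stability."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 0.5 * (1.0 + np.tanh(0.5 * eta))   # stable logistic function
        grad = X.T @ (y - mu)
        if np.linalg.norm(grad) < tol:
            break
        W = mu * (1.0 - mu)
        H = (X.T * W) @ X                        # Fisher information at beta
        beta += np.linalg.solve(H, grad)
    eta = X @ beta
    # log-likelihood: sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ]
    return y @ eta - np.logaddexp(0.0, eta).sum()

n, p = 400, 80            # kappa = p / n = 0.2; MLE still exists w.h.p.
n_reps = 500
chi2_1_q95 = 3.8415       # 0.95 quantile of chi-square with 1 d.o.f.

rejections = 0
for _ in range(n_reps):
    X = rng.standard_normal((n, p))
    y = rng.integers(0, 2, size=n).astype(float)  # global null: beta = 0
    # LRT for H0: last coefficient = 0 (true here): full vs. reduced fit
    llr = 2.0 * (fit_logistic_loglik(X, y) - fit_logistic_loglik(X[:, :-1], y))
    rejections += llr > chi2_1_q95

rate = rejections / n_reps
print(f"empirical size at nominal level 0.05: {rate:.3f}")
```

In the classical fixed-p regime the printed rate would hover near 0.05; here it is visibly inflated, which is exactly the miscalibration the rescaled chi-square corrects.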
This is joint work with Emmanuel Candès and Yuxin Chen.