Linear correlation in the presence of upper limits in astronomy

One reasonably common problem in astronomy is to assess the degree of correlation between two variables when one—or both—of them is censored. Censoring is the term in statistics that we use when some the value is missing, is a lower limit, or an upper limit. In the case of a lower limit we refer to it a right-censoring and in the case of a lower limit—as is common in astronomical surveys when an astronomical source is not detected—we refer to it as left-censoring. The figure below shows one such dataset with many datapoints subject to left-censoring.

Relationship between Be and Li abundances based on data from Santos et al. (2012). Dataset provided in Feigelson & Babu (2012).

This is not a new problem in statistics and goes back to at least one century ago, when statisticians were dealing with mortality rates—that’s why such studies are called survival analysis in statistics.

Recently, in one of the studies that my research group is doing, we had the interesting challenge of quantifying whether there was any significant correlation between two luminosities for a sample of sources, where one of them was dominated by upper limits. I found the classical paper Isobe et al. (1986). Isobe et al. tell you that if you want to quantify the correlation coefficient and probability of no correlation between the two variables in the presence of upper limits, you should be looking at the Cox regression. They tell you to do this using the ASURV fortran package. Problem is, good luck trying to compile ASURV in modern architectures! Let’s not even get started on how the package is provided. The 80s were a distant, wild time in astronomy, full of floppy disks, fortran, and codes that fail to compile nowadays.

Eventually, I found how to compute the correlation coefficient in a different—and hopefully superior—way by reading the wonderful book Modern Statistical methods for astronomy with R applications by Feigelson & Babu, more specifically section 10.8.3, “Bivariate and multivariate problems with censoring” (p304). It was the only place where I could find a code recipe for computing the correlation coefficient and estimate the associated probability for x-y data with left-censoring. On the other hand, there are plenty of examples around dealing with right-censorship in survival studies.

I took the code from Feigelson & Babu’s book and pasted it in this jupyter notebook which does the step-by-step R calculations to perform the linear regression for bivariate data in the presence of upper limits. The notebook also quantifies the null hypothesis probability. A few notes:

  • it does not take into account uncertainties in the observations
  • this computes the Akritas–Thiel–Sen (ATS) Kendall τ correlation coefficient (look for the 0.4974359 value)
  • also computes the associated null hypothesis probability (look for the 2.41662e-09 value

I hope this is useful for other astronomers handling upper limits and quantifying correlations.

By the way, if you need to perform a linear regression taking into account upper limits and uncertainties in both variables, check out Kelly (2007) and linmix. If upper limits are not an issue in your bivariate dataset, then check out the BCES package that I wrote.

Author: Rodrigo Nemmen

Professor of Astrophysics, Universidade de Sao Paulo

Leave a comment