
Suppose we have 2 random variables $X$ and $Y$ with (marginal) CDFs $F$ and $G$. Given any $\rho\in[-1,1]$, is there a general approach to construct a joint distribution of $X$ and $Y$ such that their marginals are $F$ and $G$ and their correlation is $\rho$?

My interest is in simulation. For example, if $X\sim \chi^2(n)$ and $Y\sim\chi^2(m)$ then $F\equiv\frac{X/n}{Y/m}$ has the $F$ distribution $F(n,m)$ only if $X$ and $Y$ are independent. I would like to simulate $F$ to see how it behaves when $X$ and $Y$ are correlated. But I can't think of how to simulate such correlated $X$ and $Y$ besides starting from a joint distribution.

yurnero
    Here is a closely related question: http://math.stackexchange.com/questions/268298/sampling-from-a-2d-normal-with-a-given-covariance-matrix – Michael Hardy Dec 21 '16 at 03:21

2 Answers


It is a relatively simple task to generate samples of correlated random variables with given marginal distributions. The difficulty lies in controlling the exact degree of correlation, if that is desired, unless the marginal distributions are normal.

The Cholesky approach mentioned works well for constructing random variables with a multivariate normal distribution and a specified correlation matrix, given a set of independent random variables with normal marginal distributions. For example, suppose independent random variables $Z_1$ and $Z_2$ both have standard normal marginal distributions, i.e. $Z_1, Z_2 \sim N(0,1)$; then take

$$X = Z_1, \,\,\, Y = \rho Z_1 + \sqrt{1 - \rho^2}Z_2.$$

Such a transformation preserves the standard normal marginal distributions, i.e. $X, Y \sim N(0,1)$, and imposes the desired correlation

$$E(XY) = \rho E(Z_1^2) + \sqrt{1- \rho^2}E(Z_1Z_2) = \rho.$$
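This construction is a one-liner in code. Here is a minimal Python/NumPy sketch (the sample size and the choice $\rho = 0.7$ are arbitrary) that draws correlated standard normal pairs and checks the sample correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7                # target correlation (arbitrary choice)
n = 10**6

# independent standard normals Z1, Z2
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

# X = Z1,  Y = rho*Z1 + sqrt(1 - rho^2)*Z2
x = z1
y = rho * z1 + np.sqrt(1 - rho**2) * z2

# sample correlation should be close to rho
print(np.corrcoef(x, y)[0, 1])
```

Both `x` and `y` remain (sample) standard normal, and the printed correlation will be close to 0.7 up to Monte Carlo error.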

An approximate approach for non-normal marginal distributions, $F$ and $G$, would be to first draw independent samples from a standard normal distribution, $Z_1, Z_2 \sim N(0,1)$. Next impose a correlation $\rho$ using the transformation

$$V_1 = Z_1, \,\,\, V_2 = \rho Z_1 + \sqrt{1-\rho^2}Z_2.$$

Note that $V_1$ and $V_2$ have a joint normal distribution. If $\Phi$ is the standard normal cumulative distribution function, then $\Phi(V_1)$ and $\Phi(V_2)$ have uniform $U(0,1)$ distributions, since, for example,

$$P(\Phi(V_1) \leqslant v) = P(V_1 \leqslant \Phi^{-1}(v)) = \Phi[\Phi^{-1}(v)] = v. $$

Finally, perform the following transformation using the inverse marginal distribution functions $F^{-1}$ and $G^{-1}$ and the standard normal cumulative distribution function $\Phi$:

$$X = F^{-1}[\Phi(V_1)], \,\,\, Y = G^{-1}[\Phi(V_2)].$$

Now $X$ and $Y$ have the desired marginal distributions since, for example,

$$P(X \leqslant x) = P(F^{-1}[\Phi(V_1)] \leqslant x) = P(\Phi(V_1) \leqslant F(x)) = F(x).$$

In general, because the transformation is non-linear, $\operatorname{corr}(X,Y) \neq \rho$, but it may not be far off, and you can iterate on the choice of $\rho$ in the first step until the sample correlation is close to the desired value.
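As a concrete sketch of this recipe in Python/SciPy (the marginals here are both taken to be $\mathrm{Exp}(1)$ purely for illustration, since `scipy.stats` provides their inverse CDFs; the latent $\rho = 0.6$ is an arbitrary choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rho = 0.6                 # correlation imposed on the latent normals
n = 10**6

# Step 1: correlated standard normals V1, V2
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
v1 = z1
v2 = rho * z1 + np.sqrt(1 - rho**2) * z2

# Step 2: Phi(V) ~ U(0,1), then push through inverse marginal CDFs
u1 = stats.norm.cdf(v1)
u2 = stats.norm.cdf(v2)
x = stats.expon.ppf(u1)   # X = F^{-1}[Phi(V1)], here F = Exp(1)
y = stats.expon.ppf(u2)   # Y = G^{-1}[Phi(V2)], here G = Exp(1)

# marginals are Exp(1); corr(X, Y) is close to, but slightly below, rho
print(x.mean(), y.mean(), np.corrcoef(x, y)[0, 1])
```

As the answer notes, the realized correlation is attenuated relative to $\rho$ (it cannot exceed the latent $\rho$), so in practice one would adjust $\rho$ upward and rerun until the sample correlation hits the target.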

A more comprehensive treatment of imposing a dependence structure on random variables with given marginals can be found in the theory of copulas.

RRL
  • Could you please explain how $\rho E(Z_1^2) + \sqrt{1- \rho^2}E(Z_1,Z_2) = \rho$ was derived? Or was it from the problem statement? – 24n8 Apr 12 '20 at 18:26
  • 1
    First, I corrected the typo $E(Z_1,Z_2)$ which should be $E(Z_1Z_2)$. The product $XY$ clearly is $XY = \rho Z_1^2 + \sqrt{1-\rho^2}Z_1Z_2$. Taking the expected value we get $E(XY) = \rho E(Z_1^2) + \sqrt{1-\rho^2}E(Z_1Z_2)$. Do you understand what $Z_1,Z_2 \sim N(0,1)$ means? That tells you these are standard normal random variables so $E(Z_1) = E(Z_2) = 0$ and $E(Z_1^2) = E(Z_2^2) = 1$. – RRL Apr 12 '20 at 18:59
  • 1
    Finally $E(Z_1Z_2) = 0$ because they are chosen to be independent random variables. Hence they are uncorrelated and with zero mean values it follows that the expected value of their product is zero. – RRL Apr 12 '20 at 19:01
  • Ah yes, I just got it. Thanks. I had forgotten that $Z_1$ and $Z_2$ were independent, by definition. – 24n8 Apr 12 '20 at 19:02
  • Also, could you explain where the original $X$ and $Y$ expressions came from? I have seen this in numerous places where standard normal random variables were asked to be generated. Is there something fundamental about these expressions, or is it sort of trial and error until you can confirm that the supposed expressions for $X$ and $Y$ give the correlation coefficient we want? – 24n8 Apr 12 '20 at 19:05
  • Ah I think I got it, but I couldn't do it by inspection (which I presume is what you did?). I formulated the covariance matrix $$\Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},$$ then formed a Cholesky factorization $\Sigma = R^TR$, and found $$ R^T = \begin{bmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{bmatrix}.$$ – 24n8 Apr 12 '20 at 19:59
  • That is exactly right. It is a general factorization for a positive semi definite covariance matrix. – RRL Apr 12 '20 at 21:06

For your specific question with chi-squared and F random variables, you can try this:

Let $U \sim Chisq(r),\, V \sim Chisq(s),$ and $W \sim Chisq(t),$ where $n = r + s$ and $m = s+t.$

Then $X = U+V \sim Chisq(n)$ and $Y = V+W \sim Chisq(m)$, and $X$ and $Y$ are correlated through the shared term $V$.
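The resulting correlation is pinned down by the shared component $V$: since $U, V, W$ are independent and $\operatorname{Var}[Chisq(k)] = 2k$,

$$\operatorname{Cov}(X,Y) = \operatorname{Cov}(U+V,\, V+W) = \operatorname{Var}(V) = 2s,$$

$$\operatorname{corr}(X,Y) = \frac{2s}{\sqrt{2n \cdot 2m}} = \frac{s}{\sqrt{nm}}.$$

With $r = 3,\, s = 5,\, t = 7$ as in the simulation below, this gives $5/\sqrt{96} \approx 0.510$, matching the sample correlation.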

This is a situation that might actually happen in practice if someone gets confused about effects in an intricate ANOVA design.

Example in R statistical software:

m = 10^6
u = rchisq(m, 3);  v = rchisq(m, 5);  w = rchisq(m, 7)
x = u + v;  mean(x);  var(x)
## 7.997882                        # consistent with mean 8 ...
## 15.98177                        #   and variance 16 of Chisq(df = 8)
y = v + w;  mean(y);  var(y)
## 11.99655                        # consistent with mean 12 ...
## 23.99023                        #   and variance 24 of Chisq(12)
cor(x, y)
## 0.5106575                       # X and Y correlated, not indep.

fxy = (x/8)/(y/12)                 # fake F
quantile(fxy, .95)
##      95%
## 2.023275                        # wrong 95th percentile for true F(8, 12) 

f = rf(m, 8, 12)                   # true F
quantile(f, .95);  qf(.95, 8, 12)
##      95% 
## 2.846839                        # consistent with 95th percentile of F(8, 12) 
## 2.848565                        # exact 95th percentile of F(8, 12)


References: More generally, look at this post or the one suggested by @MichaelHardy. If some of the normal random variables are correlated, then chi-squared random variables obtained by summing their squares will also be correlated.

BruceET