This post discusses the correlation coefficient $\rho$ of two random variables $X$ and $Y$.

Suppose that the joint behavior of the random variables $X$ and $Y$ is known and is described by the joint density function $f(x,y)$ where $(x,y)$ belongs to some appropriate region in the xy-plane. We are interested in knowing how one variable varies with respect to the other. For example, do $X$ and $Y$ move together? In other words, if $X$ increases, does $Y$ tend to increase? If $X$ increases, does $Y$ tend to decrease? If so, how strong is the dependence between $X$ and $Y$? If the two random variables do not move together, is there a measure that describes the lack of association? The correlation coefficient $\rho$ is a measure of the linear relationship between the random variables $X$ and $Y$.

We also discuss the concept of the regression function (or regression curve) as well as the concept of the least squares regression line. The correlation coefficient plays a central role in both concepts.

Practice problems to reinforce the concepts discussed here are found in this companion blog.

**Covariance**

To define the correlation coefficient $\rho$, we first define the covariance $\text{Cov}(X,Y)$ of $X$ and $Y$.

(1)…….. $\displaystyle \text{Cov}(X,Y)=E[ \ (X-\mu_X) \ (Y-\mu_Y) \ ]$

where $\mu_X=E[X]$ and $\mu_Y=E[Y]$. Thus the covariance measure is the average value of the product of the two deviations $X-\mu_X$ and $Y-\mu_Y$. A little algebra shows that the following is an equivalent formulation.

(2)…….. $\displaystyle \text{Cov}(X,Y)=E[XY]-\mu_X \ \mu_Y$

Based on (1), if more probability is assigned to the points $(x,y)$ where the two deviations $x-\mu_X$ and $y-\mu_Y$ have the same sign (both positive or both negative), then the covariance measure is positive. If more probability is assigned to the points where the two deviations have opposite signs (one is positive and the other is negative), then the covariance measure is negative. If positive and negative values of $(x-\mu_X) \ (y-\mu_Y)$ cancel each other out, then $\text{Cov}(X,Y)=0$.

Furthermore, if the joint probabilities cluster in a positive direction (e.g. cluster around a straight line of positive slope), then the covariance measure is positive, in which case higher values of $X$ associate with higher values of $Y$. If the joint probabilities cluster in a negative direction (e.g. cluster around a straight line of negative slope), then the covariance measure is negative, in which case higher values of $X$ associate with lower values of $Y$. Thus the covariance is a measure of the dependence between two random variables. It gives the direction of the dependence (or association). It also indicates the strength of the association: the larger the measure in absolute value, the stronger the association.

The covariance of $X$ and $Y$ reflects the units of both random variables. As a result, it may be difficult to determine at a glance whether a covariance measure is large or small. This problem is eliminated by standardizing the covariance measure.
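To see this unit dependence concretely, here is a quick numerical sketch (an illustration added here, not part of the original discussion) using the sample covariance and sample correlation from NumPy. Rescaling $X$, say from meters to centimeters, scales the covariance by the same factor, while the standardized measure defined in the next section is unaffected.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100_000)
y = x + rng.normal(size=100_000)

# Covariance carries the units of X and Y; rescaling X by 100
# (meters to centimeters) scales the covariance by 100 ...
print(np.cov(x, y)[0, 1], np.cov(100 * x, y)[0, 1])

# ... but the correlation coefficient is unchanged by the rescaling.
print(np.corrcoef(x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])
```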

**Correlation**

The correlation coefficient $\rho_{X,Y}$ is defined as follows:

(3)…….. $\displaystyle \rho_{X,Y}=\frac{\text{Cov}(X,Y)}{\sigma_X \ \sigma_Y}$

where $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$. If the underlying random variables are understood, we drop the subscripts $X$ and $Y$ and denote the correlation coefficient by $\rho$. Note that $\rho$ is the covariance of the two standardized variables $X^*=(X-\mu_X)/\sigma_X$ and $Y^*=(Y-\mu_Y)/\sigma_Y$. Thus it is a dimensionless measure of the dependence between two random variables, allowing for easy comparison across joint distributions. The following are equivalent expressions of $\rho$.

(4)…….. $\displaystyle \rho=\frac{E[XY]-\mu_X \ \mu_Y}{\sigma_X \ \sigma_Y}$

(5)…….. $\displaystyle \rho=\frac{E[XY]-\mu_X \ \mu_Y}{\sqrt{E[X^2]-\mu_X^2} \ \sqrt{E[Y^2]-\mu_Y^2}}$

The formulations (4) and (5) are more calculation friendly than (3). To calculate the correlation coefficient, first calculate the covariance, preferably using (2). Then divide the covariance by the two standard deviations. The covariance and correlation coefficient are applicable to both continuous and discrete joint distributions of $X$ and $Y$. The examples given here are continuous joint distributions. For discrete examples, simply replace integrals with summations.
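For readers who like to check such calculations numerically, the following is a minimal Python sketch of this recipe for a continuous joint density. It is not part of the original discussion; the helper name `joint_moments` is made up here, and `scipy.integrate.dblquad` is simply one convenient way to carry out the double integrals in (2) and (4).

```python
import numpy as np
from scipy import integrate

def joint_moments(f, x_lo, x_hi, y_lo, y_hi):
    """Means, variances, covariance and correlation of a joint density
    f(x, y). The y-limits may be constants or functions of x, which
    handles triangular supports. Note dblquad integrates func(y, x)."""
    E = lambda g: integrate.dblquad(
        lambda y, x: g(x, y) * f(x, y), x_lo, x_hi, y_lo, y_hi)[0]
    mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
    var_x = E(lambda x, y: x * x) - mu_x ** 2
    var_y = E(lambda x, y: y * y) - mu_y ** 2
    cov = E(lambda x, y: x * y) - mu_x * mu_y      # formula (2)
    rho = cov / np.sqrt(var_x * var_y)             # formula (4)
    return mu_x, mu_y, var_x, var_y, cov, rho
```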

*Example 1*

Suppose that the joint density function of $X$ and $Y$ is given by $f(x,y)=\frac{1}{8}$ where $x>0$, $y>0$ and $x+y<4$. Calculate the covariance $\text{Cov}(X,Y)$ and the correlation coefficient $\rho$.

Using (2) and (4), we need to have $E[X]$, $E[Y]$, $E[X^2]$, and $E[Y^2]$. We have the option of calculating these quantities using the joint density $f(x,y)$. Another option is to use the marginal density $f_X(x)$ to calculate $E[X]$ and $E[X^2]$ and the marginal density $f_Y(y)$ to calculate $E[Y]$ and $E[Y^2]$. We take this approach. Of course, the calculation of the quantity $E[XY]$ must utilize the joint density $f(x,y)$. The following gives the marginal density functions.

$\displaystyle f_X(x)=\int_0^{4-x} \frac{1}{8} \ dy=\frac{4-x}{8}, \ \ \ \ 0<x<4$

$\displaystyle f_Y(y)=\int_0^{4-y} \frac{1}{8} \ dx=\frac{4-y}{8}, \ \ \ \ 0<y<4$

Use the marginal density functions to calculate the means and variances of $X$ and $Y$.

$\displaystyle E[X]=\int_0^4 x \ \frac{4-x}{8} \ dx=\frac{4}{3}$

$\displaystyle E[X^2]=\int_0^4 x^2 \ \frac{4-x}{8} \ dx=\frac{8}{3}$

$\displaystyle Var[X]=\frac{8}{3}-\biggl(\frac{4}{3}\biggr)^2=\frac{8}{9}$

$\displaystyle E[Y]=\int_0^4 y \ \frac{4-y}{8} \ dy=\frac{4}{3}$

$\displaystyle E[Y^2]=\int_0^4 y^2 \ \frac{4-y}{8} \ dy=\frac{8}{3}$

$\displaystyle Var[Y]=\frac{8}{3}-\biggl(\frac{4}{3}\biggr)^2=\frac{8}{9}$

Use the joint density $f(x,y)$ to calculate $E[XY]$.

$\displaystyle E[XY]=\int_0^4 \int_0^{4-x} x \ y \ \frac{1}{8} \ dy \ dx=\int_0^4 \frac{x \ (4-x)^2}{16} \ dx=\frac{4}{3}$

The covariance and the correlation coefficient are computed as follows:

$\displaystyle \text{Cov}(X,Y)=E[XY]-\mu_X \ \mu_Y=\frac{4}{3}-\frac{4}{3} \cdot \frac{4}{3}=-\frac{4}{9}$

$\displaystyle \rho=\frac{-4/9}{\sqrt{8/9} \ \sqrt{8/9}}=\frac{-4/9}{8/9}=-\frac{1}{2}$
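As a quick numerical check (a sketch added here, not part of the original post), the `joint_moments` helper outlined after formula (5) reproduces these values under the uniform reading of this example's density.

```python
# Assumes the reading of Example 1 used above: f(x, y) = 1/8 on the
# triangle x > 0, y > 0, x + y < 4.
f = lambda x, y: 1 / 8
mu_x, mu_y, var_x, var_y, cov, rho = joint_moments(
    f, 0, 4, lambda x: 0, lambda x: 4 - x)
print(round(cov, 4), round(rho, 4))   # -0.4444 (= -4/9) and -0.5
```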

The following diagram shows the support of the joint distribution in this example – the triangular area below the green line.

**Figure 1**

Based on the diagram, the negative $\rho$ is not surprising. It is clear that the larger $x$ is, the smaller $y$ must be, and the smaller $x$ is, the larger $y$ can be.

**Rho is a Measure of Linear Dependence**

The correlation coefficient $\rho$ is a standardized measure and is a measure of the linear relationship between the two random variables. The following theorem makes this clear.

*Theorem 1*

For any two random variables $X$ and $Y$, the following statements are true.

- $-1 \le \rho \le 1$.

- $\rho=1$ or $\rho=-1$ if and only if $Y=a+bX$ for some constants $a$ and $b \ne 0$, except possibly on a set with zero probability.

*Proof of Theorem 1*

Let $X^*=\frac{X-\mu_X}{\sigma_X}$ and $Y^*=\frac{Y-\mu_Y}{\sigma_Y}$ be the standardized variables. Consider the variance of $X^* \pm Y^*$.

$\displaystyle Var[X^* \pm Y^*]=Var[X^*]+Var[Y^*] \pm 2 \ \text{Cov}(X^*,Y^*)=1+1 \pm 2 \rho=2 \ (1 \pm \rho)$

If $\rho>1$, then $Var[X^*-Y^*]=2 \ (1-\rho)$ would be negative. If $\rho<-1$, then $Var[X^*+Y^*]=2 \ (1+\rho)$ would be negative. It follows that $-1 \le \rho \le 1$. For the second claim, note that if $\rho=1$, then $Var[X^*-Y^*]=0$ and if $\rho=-1$, then $Var[X^*+Y^*]=0$. A random variable with zero variance is a constant except possibly on a set with zero probability. The fact that $X^*-Y^*$ or $X^*+Y^*$ is constant means that $Y$ is a linear function of $X$.

For the other direction, suppose that $Y=a+bX$ for some constants $a$ and $b$ with $b \ne 0$. The covariance and correlation coefficient are calculated as follows.

$\displaystyle \text{Cov}(X,Y)=\text{Cov}(X,a+bX)=b \ \text{Cov}(X,X)=b \ \sigma_X^2$

$\displaystyle \sigma_Y^2=Var[a+bX]=b^2 \ \sigma_X^2$

$\displaystyle \sigma_Y=|b| \ \sigma_X$

$\displaystyle \rho=\frac{b \ \sigma_X^2}{\sigma_X \ |b| \ \sigma_X}=\frac{b}{|b|}$

Thus $\rho=1$ if $b>0$ and $\rho=-1$ if $b<0$.

The calculation of $\rho$ works as long as the linear relation $Y=a+bX$ holds everywhere except possibly on a set of zero probability. This completes the proof of Theorem 1.
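Theorem 1 is also easy to see empirically. The following sketch (an illustration added here, not part of the proof) simulates draws of $X$ and checks the sample correlation of $X$ with linear and nonlinear functions of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)

# A perfect linear relation forces rho = 1 or rho = -1 ...
print(np.corrcoef(x, 3 + 2 * x)[0, 1])   # approximately  1.0
print(np.corrcoef(x, 5 - 4 * x)[0, 1])   # approximately -1.0

# ... while a strong but nonlinear relation gives |rho| < 1.
print(np.corrcoef(x, x ** 3)[0, 1])      # strictly less than 1
```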

Based on Theorem 1, the joint distribution of $X$ and $Y$ lies on a straight line if the correlation coefficient is 1 or -1. If the correlation coefficient is close to 1 or -1, the distribution of $X$ and $Y$ clusters around a straight line. Thus the correlation coefficient is a measure of the linear dependence of $Y$ on $X$. It indicates both the direction and the strength of the linear dependence.

**Regression Curves**

In regression analysis, one focus is on estimating the relationship between a dependent variable (or response variable) and one or more independent variables (explanatory variables). The covariance and correlation coefficient do not distinguish between the dependent variable and the independent variable. The covariance $\text{Cov}(X,Y)$ is the same as $\text{Cov}(Y,X)$. Likewise the correlation coefficient $\rho_{X,Y}$ is the same as $\rho_{Y,X}$. Suppose that the random variable $Y$ is regarded as the response variable and $X$ is regarded as the explanatory variable. A common problem in regression analysis is to estimate the conditional expectation of the dependent variable $Y$ given the independent variable $X$. Thus given a realization $x$ of $X$, we would like to estimate $E[Y \mid X=x]$. The conditional expected value $E[Y \mid X=x]$ is called the regression curve (or regression function) of $Y$ on $X$.

The present discussion is from the viewpoint that the joint distribution of $X$ and $Y$ is known. Of course, the joint distribution is rarely known in advance. So the discussion here is more like background information for subsequent discussion. Note that the regression curve $E[Y \mid X=x]$ is the mean of the conditional distribution of $Y$ given $X=x$. Given the joint density $f(x,y)$, derive the conditional density function $f(y \mid x)=f(x,y)/f_X(x)$. Then compute the mean associated with $f(y \mid x)$. Two examples illustrate this process.

*Example 2*

Let $f(x,y)=\frac{3}{8} \ y$ where $0<x<y<2$. Determine the regression curve $E[Y \mid X=x]$.

To find the regression curve $E[Y \mid X=x]$, we need to determine the distribution of the possible values of $Y$ given $X=x$. First, we need to determine the marginal distribution of $X$. The following is the marginal density function $f_X(x)$.

$\displaystyle f_X(x)=\int_x^2 \frac{3}{8} \ y \ dy=\frac{3}{16} \ (4-x^2), \ \ \ \ 0<x<2$

The conditional density and its conditional mean are then derived as follows:

$\displaystyle f(y \mid x)=\frac{f(x,y)}{f_X(x)}=\frac{2y}{4-x^2}, \ \ \ \ x<y<2$

$\displaystyle E[Y \mid X=x]=\int_x^2 y \ \frac{2y}{4-x^2} \ dy=\frac{2 \ (8-x^3)}{3 \ (4-x^2)}=\frac{2 \ (x^2+2x+4)}{3 \ (x+2)}$

Here's a graph of the regression function $E[Y \mid X=x]$.

**Figure 2**

The orange curve is the regression function $\displaystyle E[Y \mid X=x]=\frac{2 \ (x^2+2x+4)}{3 \ (x+2)}$ where $0<x<2$. The green line is the line $y=x$. The support of the joint distribution of $X$ and $Y$ is the area above the green line.

The regression curve gives the mean response $E[Y \mid X=x]$ for every fixed value $x$ of the explanatory variable $X$. For example, for $x=0.5$, the average response for $Y$ is $E[Y \mid X=0.5]=1.4$. The possible values for $Y$ given $X=0.5$ are randomly distributed around the mean 1.4.
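Under the density used here for Example 2, the mean response at $x=0.5$ can be confirmed by integrating the conditional density numerically. A small sketch (the helper name `cond_mean` is made up for this illustration):

```python
from scipy.integrate import quad

# Conditional density for the assumed Example 2 density f(x,y) = (3/8)y
# on 0 < x < y < 2:  f(y|x) = 2y / (4 - x^2) for x < y < 2.
def cond_mean(x):
    return quad(lambda y: y * 2 * y / (4 - x ** 2), x, 2)[0]

print(cond_mean(0.5))   # 1.4
```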

*Example 3*

Consider the joint distribution in Example 1. The conditional density for $Y$ given $X=x$ is

$\displaystyle f(y \mid x)=\frac{f(x,y)}{f_X(x)}=\frac{1/8}{(4-x)/8}=\frac{1}{4-x}, \ \ \ \ 0<y<4-x$

The mean of the conditional distribution is:

$\displaystyle E[Y \mid X=x]=\int_0^{4-x} y \ \frac{1}{4-x} \ dy=\frac{4-x}{2}=2-\frac{x}{2}$

The regression curve in this example is actually a regression line. The following diagram shows the regression line.

**Figure 3**

The triangular area below the green line is the support of the joint distribution. The orange line is the regression line $E[Y \mid X=x]=2-\frac{x}{2}$. The orange line gives the average value of the dependent variable $Y$ for a given value $x$ of the independent variable $X$. For example, when $x=2$, the mean of $Y$ is 1. The values that are observed at random for $Y$ are distributed according to the density $f(y \mid x)$.
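A simulation makes this concrete. The following sketch (added here, under the uniform reading of Example 1's density) samples points uniformly from the triangle and averages the $y$-values observed near $x=2$.

```python
import numpy as np

# Simulation under the assumed Example 1 density: points uniform on
# the triangle x > 0, y > 0, x + y < 4 (rejection from the square).
rng = np.random.default_rng(1)
pts = rng.uniform(0, 4, size=(2_000_000, 2))
x, y = pts[pts.sum(axis=1) < 4].T

near_2 = np.abs(x - 2) < 0.05   # observations with x close to 2
print(y[near_2].mean())         # approximately 1, the value of the
                                # regression line at x = 2
```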

**When Regression Curve is a Straight Line**

When the regression function $E[Y \mid X=x]$ is a straight line, the slope of the line is determined by the correlation coefficient $\rho$.

*Theorem 2*

Suppose that $X$ and $Y$ are random variables such that $E[Y \mid X=x]=a+bx$ for some constants $a$ and $b$, for all values $x$ taken on by $X$. In other words, the regression curve of $Y$ on $X$ is a straight line. Then $E[Y \mid X=x]$ is of the following form.

$\displaystyle E[Y \mid X=x]=\mu_Y+\rho \ \frac{\sigma_Y}{\sigma_X} \ (x-\mu_X)$

……..or

$\displaystyle E[Y \mid X=x]=\alpha+\beta \ x$

where $\displaystyle \beta=\rho \ \frac{\sigma_Y}{\sigma_X}$ and $\displaystyle \alpha=\mu_Y-\rho \ \frac{\sigma_Y}{\sigma_X} \ \mu_X$.

*Proof of Theorem 2*

We give a proof in the continuous case. For the discrete case, replace integrals with summations. First, establish the following fact.

$\displaystyle \int_{-\infty}^{\infty} E[Y \mid X=x] \ f_X(x) \ dx=\mu_Y$

The following derivation establishes this fact.

$\displaystyle \int_{-\infty}^{\infty} E[Y \mid X=x] \ f_X(x) \ dx=\int_{-\infty}^{\infty} \biggl( \int_{-\infty}^{\infty} y \ f(y \mid x) \ dy \biggr) \ f_X(x) \ dx=\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y \ f(x,y) \ dy \ dx$

$\displaystyle \overset{*}{=} \int_{-\infty}^{\infty} y \ \biggl( \int_{-\infty}^{\infty} f(x,y) \ dx \biggr) \ dy=\int_{-\infty}^{\infty} y \ f_Y(y) \ dy=\mu_Y$

The step with * is the result of changing the order of integration. The fact just established is not surprising. It says that weighting $E[Y \mid X=x]$, the conditional mean of $Y$ for $X=x$, by the density $f_X(x)$ produces the unconditional mean $\mu_Y$. Now we perform the same derivation, but replacing $E[Y \mid X=x]$ with $a+bx$.

$\displaystyle \mu_Y=\int_{-\infty}^{\infty} (a+bx) \ f_X(x) \ dx=a+b \ \mu_X$

Immediately, $a+b \ \mu_X=\mu_Y$, so that $a=\mu_Y-b \ \mu_X$. Now $E[Y \mid X=x]=a+bx$ becomes $\displaystyle \int_{-\infty}^{\infty} y \ f(y \mid x) \ dy=a+bx$ where $f(y \mid x)=f(x,y)/f_X(x)$. Multiply both sides by $x \ f_X(x)$.

$\displaystyle x \ f_X(x) \int_{-\infty}^{\infty} y \ f(y \mid x) \ dy=(a+bx) \ x \ f_X(x)$

The following further evaluates the left-hand side of the above equation.

$\displaystyle x \ f_X(x) \int_{-\infty}^{\infty} y \ f(y \mid x) \ dy=x \ f_X(x) \int_{-\infty}^{\infty} y \ \frac{f(x,y)}{f_X(x)} \ dy=\int_{-\infty}^{\infty} x \ y \ f(x,y) \ dy$

Setting the last expression equal to the right-hand side $(a+bx) \ x \ f_X(x)$ gives the following.

$\displaystyle \int_{-\infty}^{\infty} x \ y \ f(x,y) \ dy=a \ x \ f_X(x)+b \ x^2 \ f_X(x)$

Integrate both sides with respect to $x$ and the left-hand side becomes $E[XY]$.

$\displaystyle E[XY]=a \ \mu_X+b \ E[X^2]$

Substituting $a=\mu_Y-b \ \mu_X$ from above, the following gives the desired $b$.

$\displaystyle E[XY]=(\mu_Y-b \ \mu_X) \ \mu_X+b \ E[X^2]=\mu_X \ \mu_Y+b \ (E[X^2]-\mu_X^2)$

$\displaystyle b=\frac{E[XY]-\mu_X \ \mu_Y}{\sigma_X^2}=\frac{\text{Cov}(X,Y)}{\sigma_X^2}=\rho \ \frac{\sigma_Y}{\sigma_X}$

With $b$ determined, $\displaystyle a=\mu_Y-\rho \ \frac{\sigma_Y}{\sigma_X} \ \mu_X$, which is the form claimed in the theorem.

This completes the proof of Theorem 2.

Theorem 2 highlights the important role played by the correlation coefficient in linear regression. When the regression function is linear, its slope $b$ and y-intercept $a$ are obtained from the five quantities $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$ and $\rho$. The following verifies that this is indeed the case with Example 3.

$\displaystyle b=\rho \ \frac{\sigma_Y}{\sigma_X}=-\frac{1}{2} \cdot \frac{\sqrt{8/9}}{\sqrt{8/9}}=-\frac{1}{2}$

$\displaystyle a=\mu_Y-b \ \mu_X=\frac{4}{3}+\frac{1}{2} \cdot \frac{4}{3}=2$

Thus Theorem 2 gives the line $E[Y \mid X=x]=2-\frac{x}{2}$, matching the direct calculation in Example 3.
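The same verification takes only a few lines of code, using the five inputs of Theorem 2 as computed in Example 1 (under the reading of that example used above).

```python
# Inputs from Example 1: mu_X = mu_Y = 4/3, Var = 8/9 for both, rho = -1/2.
mu_x = mu_y = 4 / 3
sigma_x = sigma_y = (8 / 9) ** 0.5
rho = -1 / 2

b = rho * sigma_y / sigma_x   # slope: -1/2
a = mu_y - b * mu_x           # y-intercept: 2
print(a, b)                   # the line y = 2 - x/2
```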

**Least Squares Line**

To gain more insight about the correlation coefficient $\rho$, consider fitting a straight line to the joint probability distribution of $X$ and $Y$. Usually line fitting is done on observed data so that the fitted line can represent the overall behavior of the response variable as the explanatory variable changes. In the present discussion we do not have observed data. Instead, the joint probability distribution of $X$ and $Y$ is a given. Thus there is no need to estimate the mean of $Y$ given $X$. The goal here is to demonstrate the important roles played by the correlation coefficient $\rho$ in the topic of linear regression. Thus the ideas discussed here are background information useful for a subsequent discussion of linear regression.

We wish to fit a line through the support of a joint distribution so that the line approximates the relationship between $X$ and $Y$. For example, in Figure 1 and Figure 3, the support is the triangular area below the green line. In Figure 3, the regression curve is a straight line (the orange line) and we can take that as a fitted line in the triangular area. How good is that fitted line? In Figure 2, the regression curve is not a straight line. So we would like to fit a line in the triangular area above the green line.

There are many lines that can be drawn in the triangular area in Figure 2. The form of the fitted line is $y=a+bx$. When a point $(x,y)$ in the support is not on the fitted line, the deviation from the line is $y-(a+bx)$. A good goal is for the total amount of deviation to be as small as possible. The criterion we use is the least squares criterion, meaning that the total of the squares of the deviations is as small as possible. More specifically, we wish to find $a$ and $b$ such that the following expectation is minimized.

$\displaystyle L(a,b)=E\bigl[ \ \bigl(Y-(a+bX)\bigr)^2 \ \bigr]$

To find the $a$ and $b$ such that $L(a,b)$ is the least, we find $\partial L/\partial a$ and $\partial L/\partial b$, the partial derivatives of $L(a,b)$. Then solve the two equations $\partial L/\partial a=0$ and $\partial L/\partial b=0$ simultaneously for $a$ and $b$; these two equations reduce to $a+b \ \mu_X=\mu_Y$ and $a \ \mu_X+b \ E[X^2]=E[XY]$. Interestingly, the answers look similar to the line indicated in Theorem 2. The results are stated in the following theorem.

*Theorem 3*

Suppose that the random variables $X$ and $Y$ have means $\mu_X$ and $\mu_Y$, variances $\sigma_X^2$ and $\sigma_Y^2$, and correlation coefficient $\rho$. The line $y=a+bx$ such that $L(a,b)=E[ \ (Y-(a+bX))^2 \ ]$ is at a minimum has the following y-intercept $a$ and slope $b$.

$\displaystyle b=\rho \ \frac{\sigma_Y}{\sigma_X}$

$\displaystyle a=\mu_Y-\rho \ \frac{\sigma_Y}{\sigma_X} \ \mu_X$

The line indicated in Theorem 3 is called the least squares regression line. This straight line is identical to the line indicated in Theorem 2. For the probability distribution in Example 1 and Example 3, the regression curve and the least squares regression line are the same. For the distribution in Example 2, the regression curve is not a straight line. The following gives the least squares regression line for Example 2.

For Example 2, the moments are $\mu_X=\frac{3}{4}$, $\mu_Y=\frac{3}{2}$, $\sigma_X^2=\frac{19}{80}$, $\sigma_Y^2=\frac{3}{20}$ and $\text{Cov}(X,Y)=\frac{3}{40}$, so that the slope is $b=\frac{3/40}{19/80}=\frac{6}{19}$ and the y-intercept is $a=\frac{3}{2}-\frac{6}{19} \cdot \frac{3}{4}=\frac{24}{19}$.

$\displaystyle y=\frac{24}{19}+\frac{6}{19} \ x$
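A Monte Carlo check of this line (a sketch added here, under the density assumed for Example 2): sample $(X,Y)$ from the joint distribution and fit a least squares line to the sample. The fitted coefficients should approach the population values $24/19 \approx 1.263$ and $6/19 \approx 0.316$.

```python
import numpy as np

# Sample from the assumed Example 2 density f(x,y) = (3/8)y, 0 < x < y < 2:
# the marginal of Y is (3/8)y^2, so Y = 2 U^(1/3) by inverse transform,
# and given Y = y, X is uniform on (0, y).
rng = np.random.default_rng(2)
u = rng.uniform(size=1_000_000)
y = 2 * u ** (1 / 3)
x = rng.uniform(0, y)

b, a = np.polyfit(x, y, deg=1)   # sample least squares slope, intercept
print(a, b)                      # approximately 24/19 and 6/19
```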

The following diagram shows both the regression curve and the least squares line for Example 2.

**Figure 4**

The blue straight line is the least squares line. The orange curve is the regression curve. Though they are different, there is general agreement between the two in a certain range. In Example 2, the least squares line is a good approximation of the regression curve in an interval containing $x=0.5$.

**Remarks**

The correlation coefficient $\rho$ is a measure of the linear relationship between two random variables $X$ and $Y$. A regression function (regression curve) is $E[Y \mid X=x]$, the expected value of the dependent variable $Y$ for a given value $x$ of the independent variable $X$. The regression curve may or may not be a linear function of $x$. When it is, it has the form indicated in Theorem 2.

$\displaystyle E[Y \mid X=x]=\mu_Y+\rho \ \frac{\sigma_Y}{\sigma_X} \ (x-\mu_X)$

Interestingly, the least squares regression line for a given joint distribution also has the same form as the above line. The slope of the least squares regression line is a function of the correlation coefficient: the slope is $\rho$ multiplied by the ratio $\sigma_Y/\sigma_X$. When the two random variables are positively correlated, the slope of the least squares regression line is positive. When they are negatively correlated, the slope is negative. The least squares regression line discussed here can be called the population least squares regression line since it is calculated from the distribution that describes the population. The joint distribution for the population is rarely known in advance. In practice the least squares line is calculated from observed data. This will be a topic in subsequent posts.

Practice problems to reinforce the concepts discussed here are found in this companion blog.


2018 – Dan Ma
