What is correlation? Correlation analysis and the use of software in correlation analysis

Pearson's correlation test is a parametric statistical method that allows you to determine the presence or absence of a linear relationship between two quantitative indicators, as well as to evaluate its closeness and statistical significance. In other words, the Pearson correlation test determines whether there is a linear relationship between changes in the values of two variables. In statistical calculations and inferences, the correlation coefficient is usually denoted r_xy or R_xy.

1. History of the development of the correlation criterion

The Pearson correlation test was developed by a team of British scientists led by Karl Pearson (1857-1936) in the 1890s to simplify the analysis of the covariance of two random variables. In addition to Karl Pearson, Francis Edgeworth and Raphael Weldon also worked on the test.

2. What is Pearson's correlation test used for?

The Pearson correlation criterion allows you to determine the closeness (or strength) of the correlation between two indicators measured on a quantitative scale. With additional calculations, you can also determine how statistically significant the identified relationship is.

For example, using the Pearson correlation criterion, one can answer the question of whether there is a relationship between body temperature and the leukocyte count in the blood in acute respiratory infections, between the height and weight of a patient, or between the fluoride content of drinking water and the incidence of caries in the population.

3. Conditions and restrictions on the use of Pearson's correlation test

  1. The compared indicators should be measured on a quantitative scale (for example, heart rate, body temperature, leukocyte count per 1 ml of blood, systolic blood pressure).
  2. The Pearson correlation criterion makes it possible to determine only the presence and strength of a linear relationship between quantities. Other characteristics of the relationship, including its direction (direct or inverse), the nature of the changes (rectilinear or curvilinear), and the dependence of one variable on another, are determined using regression analysis.
  3. The number of compared variables must be exactly two. For analyzing the relationship of three or more parameters, factor analysis should be used.
  4. Pearson's correlation criterion is parametric, so a condition for its application is that the compared variables be normally distributed. If correlation analysis is needed for indicators whose distribution differs from normal, including those measured on an ordinal scale, Spearman's rank correlation coefficient should be used.
  5. It is necessary to clearly distinguish between the concepts of dependence and correlation. Dependence between quantities implies a correlation between them, but not vice versa.

For example, the height of a child depends on his age: the older the child, the taller he is. If we take two children of different ages, then with a high probability the older child will be taller than the younger one. This phenomenon is called dependence, implying a causal relationship between the indicators. Of course, there is also a correlation between them, meaning that changes in one indicator are accompanied by changes in the other.

In another situation, consider the relationship between a child's height and heart rate (HR). Both of these quantities depend directly on age, so in most cases taller (and therefore older) children will have lower heart rates. That is, a correlation will be observed, and it may be quite close. However, if we take children of the same age but of different heights, their heart rates will most likely differ only insignificantly, from which we can conclude that HR is independent of height.

The above example shows how important it is to distinguish between the fundamental statistical concepts of association and dependence in order to draw correct conclusions.

4. How to calculate the Pearson correlation coefficient?

Pearson's correlation coefficient is calculated using the following formula:

r_xy = Σ(d_x · d_y) / √(Σd_x² · Σd_y²),

where d_x and d_y are the deviations of the values of X and Y from their respective arithmetic means.

5. How to interpret the value of the Pearson correlation coefficient?

The values of the Pearson correlation coefficient are interpreted based on its absolute value. Possible values of the correlation coefficient vary from 0 to ±1. The greater the absolute value of r_xy, the closer the relationship between the two quantities. r_xy = 0 indicates a complete absence of a relationship; r_xy = ±1 indicates an absolute (functional) relationship. If the calculated value of the Pearson correlation criterion turns out to be greater than 1 or less than -1, an error was made in the calculations.

To assess the closeness, or strength, of the correlation, generally accepted criteria are used: absolute values of r_xy < 0.3 indicate a weak relationship, values of r_xy from 0.3 to 0.7 indicate a relationship of moderate closeness, and values of r_xy > 0.7 indicate a strong relationship.

A more accurate estimate of the strength of the correlation can be obtained using the Chaddock table:

The statistical significance of the correlation coefficient r_xy is assessed using the t-test, calculated by the following formula:

t_r = r_xy · √(n - 2) / √(1 - r_xy²)

The obtained value t_r is compared with the critical value at a chosen significance level and n - 2 degrees of freedom. If t_r exceeds t_crit, a conclusion is made that the identified correlation is statistically significant.
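As a minimal sketch of both calculations, the coefficient and its t statistic can be computed directly from the definitions above. The paired measurements here are made up for illustration; they are not the article's data.

```python
import math

# Hypothetical paired measurements (illustrative only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx = sum(x) / n                      # arithmetic mean of X
my = sum(y) / n                      # arithmetic mean of Y
dx = [v - mx for v in x]             # deviations from the mean of X
dy = [v - my for v in y]             # deviations from the mean of Y

# Pearson correlation coefficient: r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2))
r = sum(a * b for a, b in zip(dx, dy)) / math.sqrt(
    sum(a * a for a in dx) * sum(b * b for b in dy))

# t statistic with n - 2 degrees of freedom for the significance test
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(round(r, 3), round(t, 1))
```

The value t would then be compared with the critical Student value for n - 2 = 3 degrees of freedom at the chosen significance level.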

6. An example of calculating the Pearson correlation coefficient

The aim of the study was to identify the correlation between two quantitative indicators, the level of testosterone in the blood (X) and the percentage of muscle mass in the body (Y), and to determine its closeness and statistical significance. The initial data for a sample of 5 subjects (n = 5) are summarized in the table.

With a correlation, the same value of one attribute may correspond to different values of the other. For example, there is a correlation between height and weight, and between the incidence of malignant neoplasms and age.

There are 2 methods for calculating the correlation coefficient: the method of squares (Pearson) and the method of ranks (Spearman).

The more accurate of the two is the method of squares (Pearson), in which the correlation coefficient is determined by the formula

r_xy = Σ(d_x · d_y) / √(Σd_x² · Σd_y²), where

r_xy is the correlation coefficient between the statistical series X and Y;

d_x is the deviation of each number of the statistical series X from its arithmetic mean;

d_y is the deviation of each number of the statistical series Y from its arithmetic mean.

Depending on the strength and direction of the relationship, the correlation coefficient can range from 0 to 1 (-1). A correlation coefficient of 0 indicates a complete absence of a relationship. The closer the correlation coefficient is to 1 or (-1), the closer the direct or, respectively, inverse relationship it measures. With a correlation coefficient equal to 1 or (-1), the relationship is complete, i.e. functional.

Scheme for estimating the strength of the correlation by the correlation coefficient

Strength of connection | Direct connection (+) | Inverse connection (-)
No connection | 0 | 0
Small (weak) connection | from 0 to +0.29 | from 0 to -0.29
Average (moderate) connection | from +0.3 to +0.69 | from -0.3 to -0.69
Large (strong) connection | from +0.7 to +0.99 | from -0.7 to -0.99
Complete (functional) connection | +1.0 | -1.0

To calculate the correlation coefficient using the method of squares, a table of 7 columns is compiled. Let's analyze the calculation process using an example:

DETERMINE THE STRENGTH AND NATURE OF THE RELATIONSHIP BETWEEN THE IODINE CONTENT OF DRINKING WATER (V_x) AND THE INCIDENCE OF GOITER (V_y)

The calculation table has the columns V_x, V_y, d_x = V_x - M_x, d_y = V_y - M_y, d_x·d_y, d_x² and d_y²; its column totals are:

Σ d_x·d_y = -1345.0
Σ d_x² = 13996.0
Σ d_y² = 313.47

  1. Determine the average iodine content of the water (in mg/l): M_x = ΣV_x / n = 138 mg/l.
  2. Determine the average incidence of goiter, in %: M_y = ΣV_y / n = 3.8%.
  3. Determine the deviation of each V_x from M_x, i.e. d_x: 201 - 138 = 63; 178 - 138 = 40; etc.
  4. Similarly, determine the deviation of each V_y from M_y, i.e. d_y: 0.2 - 3.8 = -3.6; 0.6 - 3.8 = -3.2; etc.
  5. Determine the products of the deviations d_x·d_y and sum them to obtain Σ d_x·d_y.
  6. Square each d_x and sum the results to obtain Σ d_x².
  7. Similarly, square each d_y and sum the results to obtain Σ d_y².
  8. Finally, substitute all the sums obtained into the formula:

r_xy = Σ(d_x·d_y) / √(Σd_x² · Σd_y²)

To assess the reliability of the correlation coefficient, its mean error is determined by the formula:

m_r = (1 - r_xy²) / √n

(If the number of observations is less than 30, the denominator is √(n - 1).)

In our example

The value of the correlation coefficient is considered reliable if it is at least 3 times higher than its mean error.

In our example

Thus, the correlation coefficient is not reliable, which makes it necessary to increase the number of observations.
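This reliability check can be sketched as follows. The formula and the threshold of 3 are those given in the text; the r and n values below are hypothetical.

```python
import math

def r_mean_error(r, n):
    """Mean error of the correlation coefficient, as given in the text:
    m_r = (1 - r^2) / sqrt(n), with sqrt(n - 1) in the denominator for n < 30."""
    return (1 - r * r) / math.sqrt(n - 1 if n < 30 else n)

r, n = 0.5, 10               # hypothetical coefficient and number of observations
m = r_mean_error(r, n)
reliable = abs(r) >= 3 * m   # rule of thumb: r at least 3 times its mean error

print(round(m, 3), reliable)
```

Here m = 0.25 and 0.5 < 3 · 0.25, so this hypothetical coefficient would be judged unreliable, just as in the worked example.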

The correlation coefficient can also be determined in a somewhat less accurate but much simpler way, the method of ranks (Spearman).

Spearman method: ρ = 1 - 6Σd² / (n(n² - 1))

  1. Make two rows of the paired compared attributes, designating the first and second rows x and y, respectively. Present the first row of the attribute in descending or ascending order, and place the numerical values of the second row opposite the values of the first row to which they correspond.
  2. Replace the value of the attribute in each of the compared rows with a serial number (rank). The ranks, or numbers, indicate the places of the indicators (values) of the first and second rows. The ranks of the second attribute must be assigned in the same order that was adopted when ranking the first attribute. If a series contains equal values of the attribute, their ranks are determined as the average of the sum of their ordinal numbers.
  3. Determine the difference in ranks between x and y: d = x - y.
  4. Square the resulting rank differences (d²).
  5. Obtain the sum of the squares of the differences (Σd²) and substitute the values obtained into the formula:

ρ = 1 - 6Σd² / (n(n² - 1))
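The steps above can be sketched in a few lines. The ranking helper averages the ordinal numbers of tied values, as described; the sample data are hypothetical (experience group numbers vs. injuries per 100 workers).

```python
def ranks(values):
    """Ascending ranks; tied values share the average of their ordinal numbers."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1           # first ordinal number of v
        last = first + ordered.count(v) - 1    # last ordinal number of v
        rank_of[v] = (first + last) / 2        # average rank for ties
    return [rank_of[v] for v in values]

def spearman(x, y):
    """Spearman's rho: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical data: experience group number vs. injuries per 100 workers.
x = [1, 2, 3, 4, 5]
y = [24, 16, 12, 12, 6]
print(spearman(x, y))
```

The two tied injury counts (12 and 12) receive the average rank (2 + 3) / 2 = 2.5, exactly as in the worked example further on.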

Example: using the rank method, establish the direction and strength of the relationship between length of service in years and the frequency of injuries, given the following data:

Rationale for the choice of method: only the rank correlation method can be chosen to solve this problem, since the first row of the attribute, "work experience in years," has open-ended options (experience up to 1 year and 7 or more years), which rules out the more accurate method of squares for establishing the relationship between the compared attributes.

Solution. The sequence of calculations is described in the text; the results are presented in Table 2.

Table 2

Work experience in years (x) | Number of injuries (y) | Rank of x | Rank of y | Rank difference d = x - y | Squared rank difference d²

Each of the rows of paired attributes is denoted by "x" and "y" (columns 1-2).

The value of each attribute is replaced by a rank (serial) number. The order of assigning ranks in the x series is as follows: the minimum value of the attribute (experience up to 1 year) is assigned the serial number 1, and the subsequent variants of the series receive, in increasing order, the serial numbers (ranks) 2, 3, 4 and 5 (see column 3). A similar order is observed when assigning ranks to the second attribute y (column 4). When there are several variants of the same size (for example, in the standard task these are 12 and 12 injuries per 100 workers for the groups with 3-4 and 5-6 years of experience), the rank is taken as the average of the sum of their serial numbers. These values of the number of injuries (12 injuries) would occupy 2nd and 3rd place in the ranking, so their average number is (2 + 3) / 2 = 2.5, and both values are assigned the same rank, 2.5 (column 4).

Determine the difference in ranks d = (x - y) (column 5).

Square the rank differences (d²) and obtain the sum of the squares of the rank differences Σd² (column 6).

Calculate the rank correlation coefficient using the formula:

ρ = 1 - 6Σd² / (n(n² - 1)),

where n is the number of matched pairs of variants in rows x and y.

The most important goal of statistics is the study of objectively existing relationships between phenomena. In the statistical study of these relationships, it is necessary to identify cause-and-effect relationships between indicators, i.e. how the change in some indicators depends on the change in others.

There are two categories of dependencies (functional and correlational) and two groups of attributes (factor attributes and resultant attributes). In contrast to a functional relationship, where there is complete correspondence between the factor and resultant attributes, in a correlation relationship there is no such complete correspondence.

Correlation is a relationship in which the impact of individual factors appears only as a trend (on average) under mass observation of actual data. Examples of correlation dependence are the dependence between the size of a bank's assets and its profit, or between the growth of labor productivity and the length of service of employees.

The simplest version of correlation dependence is pair correlation, i.e. the dependence between two attributes (a resultant and a factor attribute, or two factor attributes). Mathematically, this dependence can be expressed as the dependence of the resultant indicator y on the factor indicator x. Relationships can be direct and inverse. In the first case, as the attribute x increases, the attribute y also increases; with an inverse relationship, as x increases, y decreases.

The most important task is to determine the form of the relationship and then calculate the parameters of the equation; in other words, to find the equation of the relationship (the regression equation).

There may be various forms of relationship:

rectilinear

curvilinear, in the form of a second-order parabola (or a parabola of higher orders)

a hyperbola

an exponential function, etc.

The parameters of all these equations of relationship are usually determined from systems of normal equations that satisfy the requirement of the least squares method (LSM).

If the relationship is expressed by a second-order parabola (y = a0 + a1·x + a2·x²), then the system of normal equations for finding the parameters a0, a1, a2 can be represented as:

a0·n + a1·Σx + a2·Σx² = Σy
a0·Σx + a1·Σx² + a2·Σx³ = Σxy
a0·Σx² + a1·Σx³ + a2·Σx⁴ = Σx²y

Another major task, measuring the closeness of the dependence, can be solved for any form of relationship by calculating the empirical correlation ratio:

η = √(δ² / σ²),

where δ² is the variance of the series of aligned (fitted) values of the resultant indicator, and σ² is the variance of the series of actual values of y.

To determine the degree of closeness of a paired linear dependence, the linear correlation coefficient r is used, which can be calculated using, for example, the following two formulas:

r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)² · Σ(y - ȳ)²)

r = (mean(x·y) - x̄·ȳ) / (σ_x · σ_y)

The linear correlation coefficient can take values ranging from -1 to +1, or from 0 to 1 in absolute value. The closer it is to 1 in absolute value, the closer the relationship. The sign indicates the direction of the relationship: "+" for a direct dependence, "-" for an inverse dependence.

In statistical practice, there are cases where the factor and resultant attributes cannot be expressed numerically. To measure the closeness of the dependence in such cases, other indicators must be used: the so-called nonparametric methods.

The most widespread are the rank correlation coefficients, which are based on numbering the values of the statistical series. When rank correlation coefficients are used, it is not the values of the indicators x and y that are correlated, but only the numbers of the places they occupy in each series of values. The number of each individual unit is its rank.

Correlation coefficients based on the use of the ranked method were proposed by K. Spearman and M. Kendall.

The Spearman rank correlation coefficient (ρ) is based on the difference between the ranks of the values of the resultant and factor attributes and can be calculated by the formula

ρ = 1 - 6Σd² / (n(n² - 1)),

where d = N_x - N_y, i.e. the difference of the ranks of each pair of x and y values, and n is the number of observations.

Kendall's rank correlation coefficient (τ) can be determined by the formula

τ = 2S / (n(n - 1)),

where S = P + Q, with P the count of concordant pairs and Q the count of discordant pairs taken with a negative sign.
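A sketch of Kendall's coefficient by direct pair counting. Here S accumulates +1 for each concordant pair and -1 for each discordant pair (the text's S = P + Q with Q taken negatively); ties are simply skipped in this simplified version, and the data are hypothetical.

```python
def kendall_tau(x, y):
    """tau = 2S / (n(n - 1)), with S summed over all pairs (i, j), i < j."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                s += 1        # concordant pair
            elif prod < 0:
                s -= 1        # discordant pair
    return 2 * s / (n * (n - 1))

print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))
```

Of the six pairs here, five are concordant and one is discordant, so S = 4 and τ = 8/12 ≈ 0.67.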

Nonparametric methods also include the association coefficient K_as and the contingency coefficient K_con, which are used when, for example, it is necessary to investigate the closeness of the relationship between qualitative attributes, each of which is presented as an alternative (two-valued) attribute.

To determine these coefficients, a calculation table (a "four-field" table) is created, in which the statistical predicate is schematically presented in the following form:

Attribute | B | not B
A | a | b
not A | c | d

Here a, b, c, d are the frequencies of the mutual combination (co-occurrence) of the two alternative attributes; n is the total sum of the frequencies.

The association coefficient is calculated by the formula

K_as = (a·d - b·c) / (a·d + b·c)

It must be borne in mind that, for the same data, the contingency coefficient (which varies from -1 to +1) is always less than the association coefficient.
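Both coefficients for a "four-field" table can be sketched as follows. The formula used for K_con, (ad - bc) / √((a+b)(c+d)(a+c)(b+d)), is the standard phi-type contingency coefficient for a 2x2 table; the frequencies are hypothetical.

```python
import math

def association(a, b, c, d):
    """Association coefficient: K_as = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)

def contingency(a, b, c, d):
    """Contingency coefficient: K_con = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    return (a * d - b * c) / math.sqrt(
        (a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical 2x2 frequencies of two alternative attributes.
a, b, c, d = 30, 10, 10, 30
print(association(a, b, c, d), contingency(a, b, c, d))
```

For these frequencies K_as = 0.8 and K_con = 0.5, illustrating the remark above that the contingency coefficient is always the smaller of the two.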

If it is necessary to assess the closeness of the relationship between attributive features that can take any number of value options, Pearson's mutual contingency coefficient (K_P) is applied.

To study this kind of relationship, the primary statistical information is arranged in the form of a cross-classification table:

Here m_ij are the frequencies of the mutual combination of the two attributive features; P is the number of pairs of observations.

Pearson's mutual contingency coefficient is determined by the formula

K_P = √(φ² / (1 + φ²)),

where φ² is the mean square contingency index:

φ² = Σ(m_ij² / (m_i · m_j)) - 1

The mutual contingency coefficient varies from 0 to 1.

Finally, mention should be made of the Fechner coefficient, which characterizes the elementary degree of closeness of the relationship and is advisable to use for establishing the fact that a relationship exists when the amount of initial information is small. This coefficient is determined by the formula

K_f = (n_a - n_b) / (n_a + n_b),

where n_a is the number of coincidences of the signs of the deviations of individual values from their arithmetic means, and n_b is the number of mismatches.

The Fechner coefficient can vary within -1.0 ≤ K_f ≤ +1.0.
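Since the Fechner coefficient compares only the signs of deviations from the means, it is easy to sketch; the sample data below are hypothetical.

```python
def fechner(x, y):
    """K_f = (n_a - n_b) / (n_a + n_b): n_a counts pairs whose deviations from
    the means agree in sign, n_b counts pairs whose deviations disagree."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    na = sum(1 for a, b in zip(x, y) if (a - mx) * (b - my) > 0)
    nb = sum(1 for a, b in zip(x, y) if (a - mx) * (b - my) < 0)
    return (na - nb) / (na + nb)

print(fechner([1, 2, 3, 4], [2, 3, 5, 9]))  # every deviation pair agrees in sign
```

When all sign pairs agree the coefficient is +1.0; when all disagree it is -1.0.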

Correlation coefficient formula

In the course of economic activity, a whole class of problems gradually took shape around identifying various statistical patterns.

It was necessary to evaluate the degree to which some processes are determined by others, and to establish the closeness of the interdependence between different processes and variables.
Correlation is the relationship of variables to each other.

To assess the tightness of the dependence, a correlation coefficient was introduced.

The physical meaning of the correlation coefficient

The correlation coefficient has a clear physical meaning if the statistical parameters of the independent variables follow a normal distribution (graphically represented by a Gaussian curve) and the relationship is linear.

The correlation coefficient shows how strongly one process is determined by another, i.e. how consistently the dependent process changes when the first one changes. If it does not change at all, there is no dependence; if it changes every time, the dependence is complete.

The correlation coefficient can take values in the range [-1; 1].

The zero value of the coefficient means that there is no relationship between the considered variables.
The extreme values of the range mean complete dependence between the variables.

If the value of the coefficient is positive, then the dependence is direct.

With a negative coefficient, the dependence is inverse. That is, in the first case the function changes proportionally with the argument; in the second case, inversely.
Values of the correlation coefficient in the interior of the range, i.e. between 0 and 1 or between -1 and 0, indicate an incomplete functional relationship.
The closer the value of the coefficient is to the extremes of the range, the stronger the relationship between the variables or random variables; the closer it is to 0, the weaker the interdependence.
Usually the correlation coefficient takes intermediate values.

The correlation coefficient is a dimensionless quantity.

The correlation coefficient is used in statistics, in correlation analysis, to test statistical hypotheses.

Having put forward a statistical hypothesis about the dependence of one random variable on another, one calculates the correlation coefficient. From it, one can judge whether there is a relationship between the quantities and how close it is.

The point is that a relationship cannot always be seen directly. Often the values are not related to each other directly but depend on many factors. It may turn out, however, that random variables are interdependent through a set of mediated connections. This does not necessarily mean a direct connection: if the intermediary disappears, the dependence may also disappear.

The purpose of correlation analysis is to obtain an estimate of the strength of the connection between random variables (attributes) that characterize some real process.
Problems of correlation analysis:
a) Measuring the degree of connection (closeness, strength, severity, intensity) between two or more phenomena.
b) Selecting the factors that have the most significant impact on the resultant attribute, based on measuring the degree of connection between phenomena. Factors significant in this respect are used further in regression analysis.
c) Detecting unknown causal relationships.

The forms in which interrelations manifest themselves are very diverse. Their most common types are the functional (complete) and the correlational (incomplete) connection.
A correlation manifests itself on average, over mass observations, when a given value of one variable corresponds not to a single value but to a range of probable values of the other. A connection is called functional if each value of the factor attribute corresponds to a well-defined, non-random value of the resultant attribute.
The correlation field serves as a visual representation of the correlation table. It is a graph where the X values are plotted on the abscissa axis, the Y values along the ordinate axis, and the combinations of X and Y are shown by dots. The presence of a connection can be judged from the arrangement of the dots.
Indicators of closeness make it possible to characterize the dependence of the variation of the resultant attribute on the variation of the factor attribute.
A better indicator of the degree of closeness of a correlation is the linear correlation coefficient. In calculating this indicator, not only the signs of the deviations of individual values from the mean are taken into account, but also the magnitudes of these deviations.

The key issues of this topic are the regression equation between the resultant attribute and the explanatory variable, the least squares method for estimating the parameters of the regression model, the analysis of the quality of the resulting regression equation, and the construction of confidence intervals for predicting the values of the resultant attribute from the regression equation.

Example 2


System of normal equations.
a n + b∑x = ∑y
a∑x + b∑x 2 = ∑y x
For our data, the system of equations has the form
30a + 5763 b = 21460
5763 a + 1200261 b = 3800360
Expressing a from the first equation and substituting it into the second, we get b = -3.46, a = 1379.33
Regression equation:
y = -3.46 x + 1379.33
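The same system of normal equations can be solved numerically from the sums given in the example (n = 30, Σx = 5763, Σx² = 1200261, Σy = 21460, Σxy = 3800360), reproducing the coefficients above:

```python
# Normal equations for y = a + b*x, using the sums from the example:
#   a*n  + b*sum(x)   = sum(y)
#   a*sum(x) + b*sum(x^2) = sum(x*y)
n, sx, sx2 = 30.0, 5763.0, 1200261.0
sy, sxy = 21460.0, 3800360.0

# Cramer's rule for the 2x2 system
det = n * sx2 - sx * sx
a = (sy * sx2 - sx * sxy) / det
b = (n * sxy - sx * sy) / det

print(round(a, 2), round(b, 2))  # a = 1379.33, b = -3.46
```

This confirms the regression equation y = -3.46x + 1379.33 obtained by hand.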

2. Calculation of the parameters of the regression equation.
Sample means:

x̄ = Σx / n = 5763 / 30 = 192.1

ȳ = Σy / n = 21460 / 30 = 715.33
Sample variances:


standard deviation


1.1. Correlation coefficient
Covariance: cov(x, y) = mean(x·y) - x̄·ȳ

We calculate the indicator of the closeness of the connection. Such an indicator is the sample linear correlation coefficient, which is calculated by the formula:

r_xy = cov(x, y) / (σ_x · σ_y)

The linear correlation coefficient takes values ​​from –1 to +1.
Relationships between features can be weak or strong (close). Their criteria are evaluated on the Chaddock scale:
0.1 < r_xy < 0.3: weak;
0.3 < r_xy < 0.5: moderate;
0.5 < r_xy < 0.7: noticeable;
0.7 < r_xy < 0.9: high;
0.9 < r_xy < 1: very high.
In our example, the relationship between feature Y and factor X is high and inverse.
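A small helper that maps |r_xy| to the Chaddock labels; the "negligible" label for |r| < 0.1 is my own placeholder, since the scale itself starts at 0.1.

```python
def chaddock(r):
    """Verbal strength of the relationship on the Chaddock scale."""
    a = abs(r)
    if a < 0.1:
        return "negligible"   # below the scale's first threshold (assumption)
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    if a < 0.7:
        return "noticeable"
    if a < 0.9:
        return "high"
    return "very high"

print(chaddock(-0.74))  # the example's coefficient falls in the "high" band
```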
In addition, the coefficient of linear pair correlation can be determined in terms of the regression coefficient b:

1.2. Regression equation (estimation of the regression equation).

The linear regression equation is y = -3.46 x + 1379.33

The coefficient b = -3.46 shows the average change in the resultant indicator (in units of y) per unit increase or decrease in the factor x. In this example, as x increases by 1 unit, y decreases by 3.46 on average.
The coefficient a = 1379.33 formally shows the predicted level of y at x = 0, but this is meaningful only if x = 0 is close to the sample values of x.
If x = 0 is far from the sample values of x, a literal interpretation can lead to incorrect results, and even if the regression line describes the observed sample values accurately, there is no guarantee that this will remain true when extrapolating to the left or to the right.
By substituting the corresponding values ​​of x into the regression equation, it is possible to determine the aligned (predicted) values ​​of the effective indicator y(x) for each observation.
The sign of the regression coefficient b determines the direction of the relationship between y and x (b > 0: direct relationship; otherwise inverse). In our example, the relationship is inverse.
1.3. Elasticity coefficient.
It is undesirable to use regression coefficients (b in this example) to assess the influence of factors on the resultant attribute directly when the units of measurement of the resultant indicator y and the factor attribute x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated.
The average coefficient of elasticity E shows by how many percent the result y will change, on average, from its mean value when the factor x changes by 1% from its mean value.
The coefficient of elasticity is found by the formula:

E = b · x̄ / ȳ


The elasticity coefficient is less than 1 in absolute value. Therefore, if X changes by 1%, Y will change by less than 1%. In other words, the influence of X on Y is not significant.
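Using the sample means derivable from the example's sums (x̄ = 5763/30, ȳ = 21460/30) and the estimated b = -3.46, the average elasticity E = b · x̄ / ȳ works out as:

```python
# Average elasticity coefficient: E = b * x_mean / y_mean
b = -3.46
x_mean = 5763 / 30       # 192.1
y_mean = 21460 / 30      # about 715.33
E = b * x_mean / y_mean

print(round(E, 2))       # -0.93
```

|E| ≈ 0.93 < 1, consistent with the conclusion that a 1% change in X moves Y by less than 1%.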
The beta coefficient shows by what fraction of its standard deviation the resultant attribute will change, on average, when the factor attribute changes by one of its standard deviations, with the remaining independent variables fixed at a constant level:

β = b · S_x / S_y

That is, an increase in x by one standard deviation S_x leads, on average, to a decrease in Y by 0.74 of the standard deviation S_y.
1.4. Approximation error.
Let us evaluate the quality of the regression equation using the average approximation error, the average relative deviation of the calculated values from the actual ones:

A = (1/n) · Σ |(y - y(x)) / y| · 100%
Since the error is less than 15%, this equation can be used as a regression model.
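The average approximation error can be sketched as a small function; the actual and fitted values below are hypothetical.

```python
def mean_approximation_error(y, y_hat):
    """Average relative deviation of fitted from actual values, in percent:
    A = (1/n) * sum(|(y_i - yhat_i) / y_i|) * 100."""
    return sum(abs((a - f) / a) for a, f in zip(y, y_hat)) / len(y) * 100

# Hypothetical actual vs. fitted values.
err = mean_approximation_error([100, 200, 400], [110, 190, 420])
print(round(err, 2))   # 6.67, i.e. well under the 15% threshold
```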
Analysis of variance.
The task of analysis of variance is to decompose the variance of the dependent variable:
∑(y i - y cp) 2 = ∑(y(x) - y cp) 2 + ∑(y - y(x)) 2
where
∑(y i - y cp) 2 - total sum of squared deviations;
∑(y(x) - y cp) 2 - sum of squared deviations due to regression (“explained” or “factorial”);
∑(y - y(x)) 2 - residual sum of squared deviations.
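For a least-squares fit this decomposition holds exactly; a quick check with small hypothetical data:

```python
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 10.0]

n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for y = a0 + b*x
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx
y_hat = [a0 + b * v for v in x]

ss_total = sum((v - my) ** 2 for v in y)                   # total sum of squares
ss_expl = sum((f - my) ** 2 for f in y_hat)                # explained by regression
ss_resid = sum((v - f) ** 2 for v, f in zip(y, y_hat))     # residual

print(round(ss_total, 2), round(ss_expl + ss_resid, 2))   # the two sums agree
```

The share ss_expl / ss_total is exactly the coefficient of determination discussed below in section 1.6.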
For a linear relationship, the theoretical correlation ratio is equal to the correlation coefficient r_xy.
For any form of dependence, the closeness of the connection is determined using the multiple correlation coefficient:

This coefficient is universal: it reflects both the closeness of the connection and the accuracy of the model, and it can be used for any form of connection between the variables. For a one-factor correlation model, the multiple correlation coefficient equals the pair correlation coefficient r_xy.
1.6. Coefficient of determination.
The square of the (multiple) correlation coefficient is called the coefficient of determination; it shows the proportion of the variation of the resultant attribute explained by the variation of the factor attribute.
Most often, when interpreting the coefficient of determination, it is expressed as a percentage.
R² = (-0.74)² = 0.5413
That is, 54.13% of the variation in y is explained by the variation in x. In other words, the accuracy of fit of the regression equation is average. The remaining 45.87% of the variation in Y is due to factors not taken into account in the model.
