Requirements of a regression analysis model. Fundamentals of linear regression. Correlation for multiple regression

Regression analysis is one of the most popular methods of statistical research. It can be used to determine the degree of influence of independent variables on a dependent variable. Microsoft Excel includes tools designed to carry out this type of analysis. Let's take a look at what they are and how to use them.

However, in order to use the function that allows you to conduct regression analysis, you first need to activate the Analysis ToolPak add-in. Only then will the tools necessary for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, a new button, "Data Analysis", appears in the "Analysis" group on the ribbon.

Types of regression analysis

There are several types of regressions:

  • parabolic;
  • power;
  • logarithmic;
  • exponential (base e);
  • exponential (arbitrary base, y = a·b^x);
  • hyperbolic;
  • linear regression.

We will discuss how to perform the last type, linear regression, in Excel in more detail below.

Linear Regression in Excel

Below, as an example, is a table showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Using regression analysis, let's find out exactly how the weather, in the form of air temperature, can affect the attendance of a retail establishment.

The general linear regression equation looks like this: Y = a0 + a1·x1 + … + ak·xk. In this formula, Y is the variable whose behavior we are trying to study; in our case, it is the number of buyers. The x values are the various factors that affect this variable. The parameters a are the regression coefficients: they determine the weight of each particular factor. The index k stands for the total number of these factors.


Analyzing the results

The results of the regression analysis are displayed as a table in the location specified in the settings.

One of the main indicators is R-square. It indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%, which is an acceptable level of quality. A value below 0.5 indicates a poor fit.

Another important indicator is located in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It shows what value Y (in our case, the number of buyers) will have when all other factors are zero. In this table, this value is 58.04.

The value at the intersection of the "Variable X1" row and the "Coefficients" column shows the level of dependence of Y on X; in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly strong indicator of influence.

As you can see, it is quite easy to produce a regression analysis table using Microsoft Excel. But only a trained person can work with the output data and understand its meaning.
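
To make the mechanics concrete, here is a minimal Python sketch of what Excel's "Regression" tool computes for this kind of problem. The temperature and customer numbers below are illustrative stand-ins, since the article's source table is not reproduced; only the structure of the output (intercept, slope, R-square) mirrors the example.

```python
# A minimal sketch of the quantities Excel's "Regression" tool reports,
# using hypothetical temperature/customer data (illustrative only).
import numpy as np

temp = np.array([14, 16, 19, 22, 25, 27, 30, 31, 33, 35], dtype=float)   # X
buyers = np.array([80, 84, 90, 94, 97, 100, 104, 105, 108, 110], dtype=float)  # Y

slope, intercept = np.polyfit(temp, buyers, 1)      # a1 and a0
predicted = intercept + slope * temp
ss_res = np.sum((buyers - predicted) ** 2)          # residual sum of squares
ss_tot = np.sum((buyers - buyers.mean()) ** 2)      # total sum of squares
r_squared = 1 - ss_res / ss_tot                     # "R-square" in the output

print(f"Y-intercept: {intercept:.2f}")   # counterpart of the 58.04 cell
print(f"Variable X1: {slope:.2f}")       # counterpart of the 1.31 cell
print(f"R-square:    {r_squared:.3f}")   # counterpart of the 0.705 cell
```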

RESULTS

Table 8.3a. Regression statistics

Multiple R           0.998364
R-square             0.99673
Adjusted R-square    0.996321
Standard error       0.42405
Observations         10

Let's first look at the upper part of the calculations presented in Table 8.3a, the regression statistics.

The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

In most cases, the R-squared value is between these values, called extremes, i.e. between zero and one.

If the value of the R-square is close to one, this means that the constructed model explains almost all the variability of the corresponding variables. Conversely, an R-squared value close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the coefficient of multiple correlation R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R equals the square root of the coefficient of determination; this value takes values in the range from zero to one.

In a simple linear regression analysis, the multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).
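
As a quick check of this claim, here is a short Python sketch. The x and y values are reconstructed from Tables 8.3b and 8.3c (each y equals the table's predicted value plus its residual), so the printed Multiple R should come out near 0.998364; treat the reconstruction as an inference from the tables, not as the author's original data.

```python
# Sketch: in simple linear regression, Multiple R equals the Pearson
# correlation coefficient, and R-square is its square.
import numpy as np

# Data reconstructed from Tables 8.3b/8.3c (an inference, not source data).
x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
y = np.array([9, 7, 12, 15, 17, 19, 21, 23.4, 25.6, 27.8])

pearson_r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(pearson_r)           # ~0.998364 (Multiple R)
print(np.sqrt(r_squared))  # the same value, as the text states
```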

Table 8.3b. Regression coefficients*

                Coefficients     Standard error    t-statistic
Y-intercept     2.694545455      0.33176878        8.121757129
Variable X 1    2.305454545      0.04668634        49.38177965

* A truncated version of the calculations is given.

Now consider the middle part of the calculations, presented in Table 8.3b. It gives the regression coefficient b (2.305454545) and the offset along the y-axis, i.e., the constant a (2.694545455).

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545·x + 2.694545455

The direction of the relationship between the variables is determined by the sign (negative or positive) of the regression coefficient (the coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case, the sign of the regression coefficient is positive, so the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residual output. For these results to appear in the report, the "Residuals" checkbox must be activated when launching the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation    Predicted Y     Residuals       Standard residuals
1              9.610909091     -0.610909091    -1.528044662
2              7.305454545     -0.305454545    -0.764022331
3              11.91636364      0.083636364     0.209196591
4              14.22181818      0.778181818     1.946437843
5              16.52727273      0.472727273     1.182415512
6              18.83272727      0.167272727     0.418393181
7              21.13818182     -0.138181818    -0.34562915
8              23.44363636     -0.043636364    -0.109146047
9              25.74909091     -0.149090909    -0.372915662
10             28.05454545     -0.254545455    -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The residual with the greatest absolute value, 0.778181818, occurs at observation 4.
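
The residual output can be reproduced outside Excel. The sketch below uses the same x and y values reconstructed from the table (predicted = 2.305454545·x + 2.694545455, y = predicted + residual); the scaling of the standard residuals by sqrt(SSE/(n − 1)) matches the table's numbers and appears to be what Excel uses, though that detail is inferred from the figures, not documented here.

```python
# Sketch reproducing Table 8.3c from data reconstructed from the table.
import numpy as np

x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
y = np.array([9, 7, 12, 15, 17, 19, 21, 23.4, 25.6, 27.8])

b, a = np.polyfit(x, y, 1)        # slope b, intercept a
predicted = a + b * x
residuals = y - predicted

# Excel's "Standard residuals" appear to divide by sqrt(SSE / (n - 1)):
n = len(y)
s = np.sqrt(np.sum(residuals ** 2) / (n - 1))
standard_residuals = residuals / s

for i, (p, r, sr) in enumerate(zip(predicted, residuals, standard_residuals), 1):
    print(f"{i:2d}  {p:12.6f}  {r:12.6f}  {sr:12.6f}")
```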

As a result of studying the material of chapter 4, the student should:

know

  • basic concepts of regression analysis;
  • methods of estimation and properties of estimates of the method of least squares;
  • basic rules for significance testing and interval estimation of the equation and regression coefficients;

be able to

  • find estimates of the parameters of two-dimensional and multiple regression equations from sample data, and analyze their properties;
  • check the significance of the equation and regression coefficients;
  • find interval estimates of significant parameters;

master

  • the skills of statistical estimation of the parameters of two-dimensional and multiple regression equations, and the skills of checking the adequacy of regression models;
  • skills in obtaining a regression equation with all significant coefficients using analytical software.

Basic concepts

After correlation analysis has revealed the presence of statistically significant relationships between variables and assessed their tightness, one usually proceeds to a mathematical description of the dependencies using the methods of regression analysis. For this purpose, a class of functions is selected that links the effective indicator y and the arguments x1, x2, ..., xk; estimates of the parameters of the constraint equation are calculated, and the accuracy of the resulting equation is analyzed.

The function f(x1, x2, ..., xk), describing the dependence of the conditional mean value of the effective feature y on the given values of the arguments, is called the regression equation.

The term "regression" (from lat. regression- retreat, return to something) was introduced by the English psychologist and anthropologist F. Galton and is associated with one of his first examples, in which Galton, processing statistical data related to the question of the heredity of growth, found that if the height of the fathers deviates from the average height all fathers on X inches, then the height of their sons deviates from the average height of all sons by less than x inches The identified trend was called regression to the mean.

The term "regression" is widely used in the statistical literature, although in many cases it does not accurately characterize the statistical dependence.

For an accurate description of the regression equation, it is necessary to know the conditional distribution law of the effective indicator y. In statistical practice it is usually impossible to obtain such information, so one is limited to finding suitable approximations for the function f(x1, x2, ..., xk), based on a preliminary substantive analysis of the phenomenon or on the original statistical data.

Within the framework of particular model assumptions about the type of distribution of the vector of indicators (y, x1, ..., xk), the general form of the regression equation can be obtained. For example, suppose the studied set of indicators obeys the (k + 1)-dimensional normal distribution law with the vector of mathematical expectations (m_y, m_x1, ..., m_xk) and with a covariance matrix composed of the variance of y, the vector of covariances between y and the arguments (S_yx), and the covariance matrix of the arguments themselves (S_xx). Then the regression equation (the conditional mathematical expectation) has the form

E(y | x1, ..., xk) = m_y + S_yx · S_xx^(-1) · (x − m_x).

Thus, if the multivariate random variable (y, x1, ..., xk) obeys the (k + 1)-dimensional normal distribution law, then the regression equation of the effective indicator y on the explanatory variables is linear in x.
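
A small numerical sketch of this result: given an assumed mean vector and covariance matrix for a jointly normal (y, x1, x2), the regression coefficients follow directly from beta = S_xx^(-1) · S_xy. All numbers below are illustrative.

```python
# Sketch: regression coefficients of a jointly normal vector follow from
# its covariance matrix (illustrative numbers, not from the text).
import numpy as np

mu_y, mu_x = 10.0, np.array([2.0, 5.0])      # mathematical expectations
sigma_xx = np.array([[4.0, 1.0],
                     [1.0, 9.0]])            # covariance matrix of x
sigma_xy = np.array([3.0, 2.0])              # covariances of x with y

beta = np.linalg.solve(sigma_xx, sigma_xy)   # slope coefficients
beta0 = mu_y - beta @ mu_x                   # intercept

print(beta0, beta)  # E(y | x) = beta0 + beta @ x, linear in x
```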

However, in statistical practice one usually has to limit oneself to finding suitable approximations for the unknown true regression function f(x), since the researcher does not have exact knowledge of the conditional probability distribution law of the analyzed performance indicator y for the given values of the arguments x.

Consider the relationship between the true, model, and estimated regression functions. Let the performance indicator y be related to the argument x by the relation

y = f(x) + e,

where e is a random variable with a normal distribution law, zero mean, and finite variance. The true regression function in this case is f(x) = E(y | x).

Suppose that we do not know the exact form of the true regression equation, but we have nine observations on a two-dimensional random variable related by the relations shown in Fig. 4.1.

Fig. 4.1. The relative position of the true f(x) and the theoretical ŷ(x) regression models

The location of the points in Fig. 4.1 allows us to confine ourselves to the class of linear dependencies of the form ŷ = b0 + b1·x.

Using the least squares method, we find an estimate for the regression equation.

For comparison, Fig. 4.1 shows the graphs of the true regression function and the theoretical approximating regression function. The estimate of the regression equation converges in probability to the latter, ŷ(x), as the sample size grows without bound (n → ∞).

Since we mistakenly chose a linear regression function instead of the true regression function (which, unfortunately, is quite common in the practice of statistical research), our statistical conclusions and estimates will not have the consistency property: no matter how much we increase the number of observations, our sample estimate will not converge to the true regression function.

If we had chosen the class of regression functions correctly, then the inaccuracy of the description using ŷ(x) would be explained only by the limited sample size and, therefore, could be made arbitrarily small as n → ∞.

To best restore the conditional value of the effective indicator and the unknown regression function from the initial statistical data, the following loss functions (adequacy criteria) are most often used.

1. The least squares method, according to which the sum of squared deviations of the observed values of the effective indicator y_i from the model values f(x_i, b) is minimized, where b = (b0, b1, ..., bk) is the vector of coefficients of the regression equation and x_i is the value of the vector of arguments in the i-th observation:

min over b: Σ (y_i − f(x_i, b))², the sum running over i = 1, ..., n.

Solving this problem yields an estimate of the vector b. The resulting regression is called mean-square.

2. The method of least modules, according to which the sum of absolute deviations of the observed values of the effective indicator from the model values is minimized:

min over b: Σ |y_i − f(x_i, b)|.

The resulting regression is called mean-absolute (median).

3. The minimax method, which reduces to minimizing the maximum absolute deviation of the observed value of the effective indicator y_i from the model value:

min over b: max over i of |y_i − f(x_i, b)|.

The resulting regression is called minimax.
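
The three criteria can be compared side by side on the same data. The following sketch fits a straight line under each loss with a general-purpose optimizer (illustrative data; a derivative-free method is used because the absolute-value and minimax losses are not smooth).

```python
# Sketch comparing the three fitting criteria on the same data:
# least squares, least modules, and minimax (illustrative only).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, x.size)

def fit(loss):
    # b[0] = intercept, b[1] = slope; Nelder-Mead handles non-smooth losses
    return minimize(lambda b: loss(y - (b[0] + b[1] * x)), x0=[0.0, 1.0],
                    method="Nelder-Mead").x

ls  = fit(lambda e: np.sum(e ** 2))       # mean-square regression
lad = fit(lambda e: np.sum(np.abs(e)))    # mean-absolute (median) regression
mmx = fit(lambda e: np.max(np.abs(e)))    # minimax regression

print("least squares :", ls)
print("least modules :", lad)
print("minimax       :", mmx)
```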

In practical applications, one often encounters problems in which a random variable y is studied that depends on a set of variables x1, ..., xk and unknown parameters. We will consider (y, x1, ..., xk) as a (k + 1)-dimensional general population, from which a random sample of size n is drawn, where (y_i, x_i1, ..., x_ik) is the result of the i-th observation, i = 1, ..., n. It is required to estimate the unknown parameters from the results of the observations. The task described above belongs to the tasks of regression analysis.

Regression analysis is the method of statistical analysis of the dependence of a random variable y on variables x_j, considered in regression analysis as non-random quantities, regardless of the true distribution law of y.

During their studies, students very often encounter a variety of equations. One of them, the regression equation, is considered in this article. This type of equation is used specifically to describe the relationship between mathematical parameters, and it is applied in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a quantity that describes the dependence of the average value of one data set on the values of another quantity. The regression equation shows, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor feature). In fact, the regression is expressed as y = f(x).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by the equal status of the variables: in this case, it is not known for certain which variable depends on the other.

If there is no such equality between the variables, and the conditions of the problem state which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. In order to build a linear regression equation, it is necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. The logarithmically linear equation expresses the relationship using the logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xm) + E. In this situation, y is the dependent variable and the x's are explanatory variables. The variable E is stochastic and captures the influence of factors left out of the equation. The non-linear regression equation is somewhat contradictory: on the one hand, it is not linear in the indicators included in it; on the other hand, it is linear in the parameters being estimated.

Inverse and Pairwise Regressions

An inverse regression is a kind of function that needs to be converted to a linear form. In the most traditional application programs, it has the form y = 1/(c + m·x + E). The pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.
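
Several of the forms listed above reduce to ordinary linear fitting after a change of variables. The sketch below illustrates this for the hyperbolic and logarithmically linear cases on synthetic data; the transformations, not the numbers, are the point.

```python
# Sketch: fitting non-linear forms by linearizing transforms
# (illustrative data; E denotes the random error in the text's notation).
import numpy as np

x = np.linspace(1, 10, 40)
rng = np.random.default_rng(1)

# Hyperbolic y = c + m/x: regress y on 1/x.
y_hyp = 3.0 + 5.0 / x + rng.normal(0, 0.1, x.size)
m, c = np.polyfit(1.0 / x, y_hyp, 1)
print("hyperbolic:", c, m)

# Log-linear ln y = ln c + m*ln x: regress ln y on ln x.
y_pow = 2.0 * x ** 1.5 * np.exp(rng.normal(0, 0.05, x.size))
m, ln_c = np.polyfit(np.log(x), np.log(y_pow), 1)
print("log-linear:", np.exp(ln_c), m)
```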

The concept of correlation

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive value a direct one. If the coefficient equals 0, there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.

Methods

Parametric correlation methods can estimate the tightness of a relationship. They are used on the basis of distribution estimates to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, the function of the regression equation, and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying a relationship. To do this, all existing data must be represented graphically: in a rectangular two-dimensional coordinate system, all known data points are plotted. This is how the correlation field is formed. The values of the describing factor are marked along the abscissa, while the values of the dependent factor are marked along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of a relationship. If it is between 30% and 70%, this indicates a relationship of medium closeness. A 100% indicator is evidence of a functional relationship.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation indicator. It speaks to the tightness of the relationship between the presented set of indicators and the trait under study. It can also describe the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation index, it is necessary to calculate the coefficient of determination and take its square root.

Least squares method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of squared deviations of the observed values of the indicator from the values predicted by the function.

A paired linear regression equation can be estimated using this method. This type of equation is used when a paired linear relationship is detected between the indicators.

Equation Options

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c has no economic meaning on its own; the only thing it influences the function through is its sign. A minus indicates a slowed change in the result compared to the factor, a plus an accelerated one.

Each parameter of the regression equation can be expressed in terms of the data. For example, the factor c has the form c = y̅ − m·x̅, where y̅ and x̅ are the mean values of y and x.
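
A minimal sketch of these parameter formulas: m follows from the covariance of x and y divided by the variance of x, and c from the relation c = y̅ − m·x̅ quoted above (illustrative data).

```python
# Sketch of the parameter formulas for paired linear regression y = c + m*x.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()   # the formula c = y̅ - m*x̅ from the text

print(m, c)  # matches np.polyfit(x, y, 1)
```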

Grouped data

There are task conditions in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depends on x; thus, grouped information helps to find the regression equation and is used for relationship analysis. However, this method has its drawbacks: averages are often subject to external fluctuations, and these fluctuations do not reflect the patterns of the relationship so much as mask them with "noise". Averages show the patterns of a relationship much worse than a linear regression equation does. However, they can serve as a basis for finding an equation. By multiplying the size of a particular group by the corresponding average, you can get the sum of y within the group; adding up all such sums gives the overall total of y. It is a little more difficult to compute the sum indicator xy. If the intervals are small, we can conditionally take the indicator x to be the same for all units within a group; multiplying it by the group sum of y gives the sum of the products of x and y within the group. All such sums are then added together to obtain the total sum xy.
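
The bookkeeping described in this paragraph is easy to mechanize. The sketch below, with made-up group sizes and means, computes the group sums of y, the overall total of y, and the approximate total of the products x·y under the stated assumption that x is constant within each group.

```python
# Sketch of the grouped-data bookkeeping described above (illustrative).
import numpy as np

x_group = np.array([1.0, 2.0, 3.0])     # representative x for each group
n_group = np.array([10, 15, 5])         # group sizes
y_mean = np.array([4.0, 6.5, 9.0])      # average y within each group

sum_y_per_group = n_group * y_mean      # size * average = group total of y
total_y = sum_y_per_group.sum()         # the final indicator y
total_xy = np.sum(x_group * sum_y_per_group)  # sum of products x*y

print(total_y, total_xy)
```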

Multiple regression equations: assessing the significance of a relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, ..., xm) + E. Most often, such an equation is used to solve problems of supply and demand for goods and of interest income on repurchased shares, and to study the causes and shape of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations; at the level of microeconomics, such an equation is used somewhat less often.

The main task of multiple regression is to build a model of data containing a large amount of information in order to determine what effect each of the factors has, individually and in their totality, on the indicator to be modeled. The regression equation can take a variety of forms. Two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of the following relationship: y = a0 + a1·x1 + a2·x2 + … + am·xm. In this case, a1, a2, ..., am are considered the coefficients of "pure" regression. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, on the condition that the other indicators remain stable.

Nonlinear equations have, for example, the form of a power function y = a·x1^b1 · x2^b2 · … · xm^bm. In this case, the indicators b1, b2, ..., bm are called elasticity coefficients; they show how the result will change (by what percentage) with an increase (decrease) in the corresponding indicator x by 1%, with the other factors held constant.
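
Both forms can be estimated with ordinary least squares, the power form after taking logarithms. The sketch below does this on synthetic data; the variable names and true coefficients are illustrative assumptions.

```python
# Sketch: estimating the two multiple-regression forms above with NumPy
# (illustrative data; coefficients a_i and elasticities b_i as in the text).
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.uniform(1, 10, n), rng.uniform(1, 10, n)

# Linear form y = a0 + a1*x1 + a2*x2 + E
y_lin = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x1, x2])
a, *_ = np.linalg.lstsq(X, y_lin, rcond=None)
print("linear coefficients:", a)

# Power form y = a * x1^b1 * x2^b2: linear in logs, so reuse lstsq.
y_pow = 3.0 * x1 ** 0.7 * x2 ** 0.2 * np.exp(rng.normal(0, 0.05, n))
L = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
b, *_ = np.linalg.lstsq(L, np.log(y_pow), rcond=None)
print("a =", np.exp(b[0]), "elasticities:", b[1:])
```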

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between the economic factors and the modeled indicator. The factors to be included must meet the following criteria:

  • They must be measurable. If a factor describes the quality of an object, it must in any case be given a quantitative form.
  • There should be no intercorrelation between factors, nor a functional relationship between them. Otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and blurred estimates.
  • With a very high correlation between factors, there is no way to isolate the individual influence of each factor on the final result, so the coefficients become uninterpretable.

Construction Methods

There are a huge number of methods and ways to explain how you can choose the factors for the equation. However, all these methods are based on the selection of coefficients using the correlation index. Among them are:

  • Exclusion method.
  • Inclusion method.
  • Stepwise regression analysis.

The first method involves sifting factors out of the aggregate set. The second involves introducing additional factors one by one. The third is the elimination of factors that were previously included in the equation. Each of these methods has a right to exist; they have their pros and cons, but they all solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.
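
As a rough illustration of the inclusion (forward-selection) idea, the sketch below adds, one factor at a time, whichever candidate most improves R-squared, stopping when the gain becomes negligible. The data, the 0.01 threshold, and the helper function are all illustrative choices, not a standard prescribed procedure.

```python
# A much simplified sketch of the inclusion (forward-selection) method.
import numpy as np

def r_squared(X, y):
    # R-squared of an OLS fit with an intercept column prepended.
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(3)
n = 100
factors = rng.normal(size=(n, 4))                  # 4 candidate factors
y = 2 * factors[:, 0] - factors[:, 2] + rng.normal(0, 0.5, n)

selected, best = [], 0.0
candidates = list(range(factors.shape[1]))
while candidates:
    gains = [(r_squared(factors[:, selected + [j]], y), j) for j in candidates]
    score, j = max(gains)
    if score - best < 0.01:                        # stop: negligible gain
        break
    selected.append(j); candidates.remove(j); best = score

print("selected factors:", selected, "R^2 =", round(best, 3))
```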

Methods of multivariate analysis

Such methods of factor determination are based on considering individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. There is also factor analysis, which emerged from the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

Modern political science proceeds from the position that all phenomena and processes in society are interrelated. It is impossible to understand events and processes, or to predict and manage the phenomena of political life, without studying the connections and dependencies that exist in the political sphere of society. One of the most common tasks of policy research is to study the relationship between observable variables. A whole class of statistical methods, united by the common name "regression analysis" (or, as it is also called, "correlation-regression analysis"), helps solve this problem. However, while correlation analysis makes it possible to assess the strength of the relationship between two variables, regression analysis makes it possible to determine the type of this relationship and to predict the value of one variable from the value of another.

First, let's remember what correlation is. A correlation relationship is the most important special case of a statistical relationship, which consists in the fact that different values of one variable correspond to different average values of another. With a change in the value of the attribute x, the average value of the attribute y naturally changes, while in each individual case the attribute y (with different probabilities) can take on many different values.

The appearance of the term "correlation" in statistics (and political science draws on the achievements of statistics for solving its problems, making statistics a discipline related to political science) is associated with the name of the English biologist and statistician Francis Galton, who in the 19th century proposed the theoretical foundations of correlation-regression analysis. The term "correlation" was known in science before that. In particular, in paleontology back in the 18th century it was applied by the French scientist Georges Cuvier. He introduced the so-called law of correlation, with the help of which it was possible to reconstruct the appearance of animals from remains found during excavations.

There is a well-known story associated with the name of this scientist and his law of correlation. On the days of a university holiday, students who decided to play a trick on the famous professor pulled a goat skin with horns and hooves over one of them. He climbed into the window of Cuvier's bedroom and shouted: "I'll eat you!" The professor woke up, looked at the silhouette, and replied: "If you have horns and hooves, then you are a herbivore and cannot eat me. And for ignorance of the law of correlation you will get a failing grade." He turned over and fell asleep. A joke is a joke, but in this example we see a special case of the use of multiple correlation-regression analysis: the professor, based on the values of two observed traits (the presence of horns and hooves), derived, by virtue of the law of correlation, the average value of a third trait (the class to which the animal belongs: herbivores). In this case, we are not talking about a specific value of this variable (i.e., on a nominal scale the animal could take different values: it could be a goat, a ram, or a bull...).

Now let's move on to the term "regression". Strictly speaking, it is not connected with the meaning of the statistical problems that are solved with the help of this method. An explanation of the term can only be given from the history of the development of methods for studying relationships between features. One of the first examples of such studies was the work of the statisticians F. Galton and K. Pearson, who tried to find a pattern between the heights of fathers and their children according to two observable features (where X is the father's height and Y the child's height). In their study they confirmed the initial hypothesis that, on average, tall fathers raise tall children; the same principle applies to short fathers and their children. However, if the scientists had stopped there, their work would never have been mentioned in textbooks on statistics. Within the already confirmed hypothesis, the researchers found another pattern. They proved that very tall fathers produce children who are tall on average but do not differ much in height from children whose fathers, although above average, do not deviate much from the average height. The same is true for fathers of very small stature (deviating far from the average of the short group): their children, on average, did not differ in height from peers whose fathers were simply short. They called the function describing this regularity a regression function. After this study, all equations describing similar functions and constructed in a similar way began to be called regression equations.

Regression analysis is one of the methods of multivariate statistical data analysis, combining a set of statistical techniques designed to study or model relationships between one dependent and several (or one) independent variables. The dependent variable, according to the tradition accepted in statistics, is called the response and is denoted Y. The independent variables are called predictors and are denoted x. In the course of the analysis, some variables will turn out to be weakly related to the response and will eventually be excluded from the analysis. The remaining variables associated with the dependent one may also be called factors.

Regression analysis makes it possible to predict the values of one or more variables depending on another variable (for example, the propensity for unconventional political behavior depending on the level of education) or on several variables. It is calculated on a computer. Compiling a regression equation that measures the degree of dependence of the controlled feature on the factor features requires the involvement of professional mathematician-programmers. Regression analysis can provide an invaluable service in building predictive models of the development of a political situation, in assessing the causes of social tension, and in conducting theoretical experiments. It is actively used to study the impact on citizens' electoral behavior of a number of socio-demographic parameters: gender, age, profession, place of residence, nationality, and the level and nature of income.

In regression analysis, the concepts of independent and dependent variables are used. An independent variable is a variable that explains or causes a change in another variable; a dependent variable is a variable whose value is explained by the influence of the first. For example, in the presidential elections of 2004, the determining factors, i.e., the independent variables, were indicators such as the stabilization of the financial situation of the country's population, the level of popularity of the candidates, and the incumbency factor. The percentage of votes cast for the candidates can be considered the dependent variable. Similarly, in the pair of variables "age of the voter" and "level of electoral activity", the first is independent and the second dependent.

Regression analysis allows you to solve the following problems:

  • 1) establish the very fact of the presence or absence of a statistically significant relationship between y and x;
  • 2) build the best (in the statistical sense) estimates of the regression function;
  • 3) for given values of x, build a prediction for the unknown y;
  • 4) evaluate the specific weight of the influence of each factor x on y and, accordingly, exclude insignificant features from the model;
  • 5) by identifying causal relationships between the variables, partially manage the values of y by adjusting the values of the explanatory variables x.

Regression analysis is associated with the need to select mutually independent variables that affect the value of the indicator under study, to determine the form of the regression equation, and to evaluate its parameters using statistical methods for processing primary sociological data. This type of analysis is based on the idea of the form, direction, and closeness (density) of a relationship. Paired and multiple regression are distinguished, depending on the number of features studied. In practice, regression analysis is usually performed together with correlation analysis. The regression equation describes a numerical relationship between quantities, expressed as a tendency for one variable to increase or decrease as another increases or decreases. Linear and non-linear regression are also distinguished. When describing political processes, both variants of regression are found equally often.

A scatterplot of the distribution of the interdependence of interest in political articles (Y) and respondents' education (X) shows a linear regression (Fig. 30).

Fig. 30.

A scatterplot of the distribution of the level of electoral activity (Y) and the age of the respondent (X) (a conditional example) shows a non-linear regression (Fig. 31).


Fig. 31.

To describe the relationship between two features (X and Y) in a paired regression model, the linear equation

y = a·x + b + e

is used, where e is the random error of the equation under variation of the features, i.e., the deviation of the equation from "linearity".

To evaluate the coefficients a and b, the least squares method is used, which requires that the sum of the squared deviations of each point on the scatterplot from the regression line be minimal. The coefficients a and b can be calculated from the system of normal equations:

a·Σx² + b·Σx = Σxy
a·Σx + b·n = Σy

The least squares method gives estimates of the coefficients a and b such that the line passes through the point with coordinates x̄ and ȳ, i.e., the relation ȳ = a·x̄ + b holds. The graphical representation of the regression equation is called the theoretical regression line. With a linear dependence, the regression coefficient represents on the graph the tangent of the slope of the theoretical regression line to the x-axis. The sign of the coefficient shows the direction of the relationship: if it is greater than zero, the relationship is direct; if it is less, the relationship is inverse.
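
A compact sketch of this estimation: the normal equations above are assembled and solved directly, and the final line checks that the fitted line passes through the point (x̄, ȳ), as stated. The data are illustrative.

```python
# Sketch of the least-squares system for y = a*x + b written as the
# normal equations and solved directly (illustrative data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.9, 5.1, 7.2, 8.8, 11.1, 13.0])

# Normal equations:
#   a*sum(x^2) + b*sum(x) = sum(x*y)
#   a*sum(x)   + b*n      = sum(y)
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      len(x)  ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)

print(a, b)                        # slope a and intercept b
print(a * x.mean() + b, y.mean())  # the line passes through (x̄, ȳ)
```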

The following example from the study "Political Petersburg-2006" (Table 56) shows a linear relationship between citizens' perceptions of the degree of satisfaction with their lives in the present and their expectations of changes in the quality of life in the future. The relationship is direct and linear (the standardized regression coefficient is 0.233, the significance level is 0.000). In this case, the regression coefficient is not high, but it exceeds the lower limit of statistical significance (the lower limit of the square of a statistically significant Pearson coefficient).

Table 56. The impact of the quality of life of citizens in the present on expectations (St. Petersburg, 2006)

* Dependent variable: "How do you think your life will change in the next 2-3 years?"

In political life, the value of the variable under study most often depends on several features simultaneously. For example, the level and nature of political activity are simultaneously influenced by the political regime of the state, political traditions, the peculiarities of the political behavior of people in a given area and the respondent's social microgroup, his age, education, income level, political orientation, and so on. In this case, one needs to use the multiple regression equation, which has the following form:

y = a + b1·x1 + b2·x2 + … + bk·xk + e,

where the coefficients b_j are partial regression coefficients. Each shows the contribution of its independent variable to determining the values of the dependent (outcome) variable. If a partial regression coefficient is close to 0, we can conclude that there is no direct relationship between that independent variable and the dependent one.

The calculation of such a model can be performed on a computer using matrix algebra. Multiple regression makes it possible to reflect the multifactorial nature of social relationships and to clarify the degree of impact of each factor individually, and of all of them together, on the resulting trait.

The coefficient denoted b is called the linear regression coefficient and shows the strength of the relationship between the variation of the factor attribute X and the variation of the effective feature Y. This coefficient measures the strength of the relationship in the absolute units of measurement of the features. However, the closeness of the correlation of features can also be expressed in terms of the standard deviation of the resulting feature (such a coefficient is called the correlation coefficient). Unlike the regression coefficient b, the correlation coefficient does not depend on the accepted units of measurement of the features and is therefore comparable across any features. The relationship is usually considered strong if r > 0.7, of medium tightness at 0.5 < r < 0.7, and weak at r < 0.5.

As is known, the closest relationship is a functional one, when each individual value of Y can be uniquely assigned a value of x. Thus, the closer the correlation coefficient is to 1, the closer the relationship is to a functional one. The significance level for regression analysis should not exceed 0.001.

The correlation coefficient was long considered the main indicator of the closeness of the relationship between features. Later, however, the coefficient of determination became such an indicator. The meaning of this coefficient is as follows: it reflects the share of the total variance of the resulting feature Y that is explained by the variance of the feature x. It is found by simply squaring the correlation coefficient (which varies from 0 to 1) and, for a linear relationship, reflects the share, from 0 (0%) to 1 (100%), of the values of the feature Y determined by the values of the attribute x. It is written as R², and in the output tables of regression analysis in the SPSS package it appears without the square symbol.

Let us denote the main problems of constructing the multiple regression equation.

  • 1. The choice of factors included in the regression equation. At this stage, the researcher first compiles a general list of the main causes that, according to theory, determine the phenomenon under study. Then he must select the features for the regression equation. The main selection rule is that the factors included in the analysis should correlate as little as possible with each other; only then is it possible to attribute a quantitative measure of influence to a particular factor-attribute.
  • 2. The choice of the form of the multiple regression equation (in practice, the linear or linear-logarithmic form is used more often). To use multiple regression, the researcher must first build a hypothetical model of the influence of several independent variables on the resulting one. For the obtained results to be reliable, the model must exactly match the real process: the relationship between the variables must be linear; not a single significant independent variable may be ignored, just as no variable unrelated to the process under study may be included in the analysis; and all measurements of the variables must be extremely accurate.

The above description implies a number of conditions for the application of this method, without which one cannot proceed to the procedure of multiple regression analysis (MRA). Only compliance with all of the following points allows one to carry out regression analysis correctly.