Simple regression analysis. Regression analysis is a statistical method for studying the dependence of a random variable on one or more other variables.

During their studies, students very often encounter a variety of equations. One of them, the regression equation, is considered in this article. This type of equation is used specifically to describe the relationship between mathematical parameters. Such equalities are widely used in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a quantity that describes the dependence of the average value of one data set on the values of another quantity. The regression equation expresses, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor feature).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by the equality of the conditional variables: in this case it is not known for certain which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to build a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m * x + E. The hyperbolic equation has the form of a regular hyperbola: y = c + m / x + E. The logarithmically linear equation expresses the relationship using the logarithmic function: ln y = ln c + m * ln x + ln E.
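As an illustration of these three forms, here is a minimal sketch (Python with numpy is our choice, not the article's; all data are simulated for demonstration). Each equation is fitted by ordinary least squares after reducing it to a straight line: the hyperbolic form is linear in 1/x, and the log-linear form is linear in the logarithms.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)

# Simulated data with noise for each functional form
y_lin = 2.0 + 0.5 * x + rng.normal(0, 0.2, x.size)             # y = c + m*x + E
y_hyp = 2.0 + 3.0 / x + rng.normal(0, 0.2, x.size)             # y = c + m/x + E
y_log = 2.0 * x ** 0.7 * np.exp(rng.normal(0, 0.05, x.size))   # ln y = ln c + m*ln x

m1, c1 = np.polyfit(x, y_lin, 1)                     # linear: regress y on x
m2, c2 = np.polyfit(1.0 / x, y_hyp, 1)               # hyperbolic: regress y on 1/x
m3, lnc3 = np.polyfit(np.log(x), np.log(y_log), 1)   # log-linear: regress ln y on ln x

print(f"linear:     c={c1:.2f}, m={m1:.2f}")
print(f"hyperbolic: c={c2:.2f}, m={m2:.2f}")
print(f"log-linear: c={np.exp(lnc3):.2f}, m={m3:.2f}")
```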

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y = f(x1, x2, ..., xc) + E. In this situation, y is the dependent variable and the x's are explanatory variables. The variable E is stochastic; it absorbs the influence of factors not included in the equation. The non-linear regression equation is somewhat contradictory: with respect to the indicators taken into account it is not linear, while with respect to the estimated parameters it is linear.

Inverse and Pairwise Regressions

An inverse regression is a kind of function that must be converted to a linear form. In most traditional application programs it has the form y = 1 / (c + m * x + E). The pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

The concept of correlation

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value lies in the interval [-1; +1]. A negative value indicates an inverse relationship, a positive one a direct relationship. If the coefficient equals 0, there is no relationship. The closer the value is to 1, the stronger the relationship between the parameters; the closer to 0, the weaker.
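A minimal sketch of computing the correlation coefficient on toy data (Python with numpy is our assumption; the article names no tools):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation coefficient: covariance normalized by both standard deviations
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")   # close to +1: a strong direct relationship
```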

Methods

Parametric methods of correlation can estimate the tightness of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence, the form of the regression function, and to evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying a relationship: all available data are plotted in a rectangular two-dimensional coordinate system, with the value of the explanatory factor marked along the abscissa and the values of the dependent factor along the ordinate. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can talk about the almost complete absence of a connection. If it is between 30% and 70%, then this indicates the presence of links of medium closeness. A 100% indicator is evidence of a functional connection.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation coefficient. It indicates the tightness of the relationship between the presented set of indicators and the trait under study, and it can also describe the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

To obtain the multiple correlation index, the coefficient of determination is calculated first; the index is then taken as its square root.

Least squares method

This method is a way of estimating the regression coefficients. Its essence lies in minimizing the sum of squared deviations of the observed values of the dependent variable from the values given by the regression function.

A paired linear regression equation can be estimated by this method. This type of equation is used when a paired linear relationship is detected between the indicators.

Equation parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the parameter c has no economic meaning; only the sign in front of it matters. A minus indicates a slower change in the result compared to the factor, a plus an accelerated change.

Each parameter that enters the regression equation can itself be expressed through an equation. For example, the parameter c has the form c = y - m * x.

Grouped data

There are tasks in which all information is grouped according to the attribute x, while for each group the corresponding average value of the dependent indicator is given. In this case, the average values characterize how the indicator depends on x, and the grouped information thus helps to find the regression equation. However, this method has its drawbacks: unfortunately, group averages are often subject to external fluctuations. These fluctuations do not reflect the pattern of the relationship; they merely mask its "noise". Averages show the pattern of the relationship much worse than a linear regression equation, yet they can serve as a basis for finding one. By multiplying the size of a particular group by the corresponding average, you obtain the sum of y within the group; adding up all these sums gives the overall total of y. It is a little harder to compute the sum of the products xy. If the intervals are small, the indicator x can conditionally be taken the same for all units within a group; multiplying it by the group sum of y gives the group's sum of the products of x and y, and adding up all the group sums gives the overall total of xy.
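A small numeric sketch of this procedure (Python/numpy assumed; the group midpoints, sizes and group means are invented for illustration): the group sums of y and of the products xy are accumulated exactly as described above, and the regression parameters are then obtained from these totals.

```python
import numpy as np

# Hypothetical grouped data: for each interval of x we know the group size
# and the group mean of y; x is taken at the interval midpoint.
x_mid = np.array([5.0, 15.0, 25.0, 35.0])   # interval midpoints
n_grp = np.array([12, 30, 25, 8])           # group sizes
y_avg = np.array([3.1, 4.8, 6.4, 7.9])      # group means of y

n = n_grp.sum()
sum_y = (n_grp * y_avg).sum()            # sum of y within each group, then overall
sum_x = (n_grp * x_mid).sum()
sum_xy = (n_grp * x_mid * y_avg).sum()   # x treated as constant within a group
sum_x2 = (n_grp * x_mid ** 2).sum()

# OLS estimates of y = c + m*x from the grouped sums
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n
print(f"y = {c:.3f} + {m:.3f} * x")
```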

Multiple regression: assessing the significance of the relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, ..., xm) + E. Most often, such an equation is used to solve problems of supply and demand for a product, of interest income on repurchased shares, or to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, while at the level of microeconomics such an equation is used somewhat less often.

The main task of multiple regression is to build a model of data containing a large amount of information in order to determine what influence each of the factors, individually and in their totality, has on the indicator being modeled. The regression equation can take on a variety of forms. Two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of the relationship y = a0 + a1*x1 + a2*x2 + ... + am*xm. Here a1, a2, ..., am are the coefficients of "pure" regression. They characterize the average change in the parameter y for a change (decrease or increase) of the corresponding parameter x by one unit, provided the values of the other indicators stay constant.

Nonlinear equations include, for example, the power function y = a * x1^b1 * x2^b2 * ... * xm^bm. In this case, the indicators b1, b2, ..., bm are called elasticity coefficients: they show by how many percent the result will change when the corresponding indicator x increases (decreases) by 1%, with the other factors held constant.
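To make the elasticity interpretation concrete, here is a hedged sketch (Python/numpy assumed, all numbers invented): a power model with known exponents is simulated, log-transformed to become linear, and refitted; the recovered b-coefficients are the elasticities.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
# Power model y = a * x1^b1 * x2^b2 with multiplicative noise
y = 3.0 * x1 ** 0.6 * x2 ** -0.3 * np.exp(rng.normal(0, 0.05, n))

# Taking logs makes the model linear: ln y = ln a + b1*ln x1 + b2*ln x2
X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
ln_a, b1, b2 = coef
print(f"a={np.exp(ln_a):.2f}, b1={b1:.2f} (elasticity of x1), b2={b2:.2f}")
```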

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between the economic factors and the indicator being modeled. The factors to be included must meet the following criteria:

  • They must be measurable. To use a factor describing the quality of an object, it should in any case be given a quantitative form.
  • There should be no intercorrelation of factors, nor a functional relationship between them; otherwise the system of normal equations becomes ill-conditioned, which entails unreliable and fuzzy estimates.
  • With a high correlation between factors, there is no way to isolate their individual influence on the final result, so the coefficients become uninterpretable.

Construction Methods

There are a great many methods and ways of choosing the factors for the equation. However, all of them are based on selecting coefficients using the correlation index. Among them are:

  • Exclusion method.
  • Inclusion method.
  • Stepwise regression analysis.

The first method involves sifting out coefficients from the aggregate set. The second involves introducing additional factors one by one. The third is the elimination of factors that were previously included in the equation. Each of these methods has the right to exist: they have their pros and cons, but they all can solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by the individual methods are quite close.

Methods of multivariate analysis

Such methods for determining factors are based on the consideration of individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, however, it appeared as a result of the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn), y = (y1, y2, ..., yn).

Let's place the points on a 2D scatterplot and say that we have a linear relationship if the data are approximated by a straight line.

If we assume that y depends on x, and that changes in y are caused by changes in x, we can define a regression line (the regression of y on x) that best describes the straight-line relationship between these two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that while tall fathers tend to have tall sons, the average height of sons is smaller than that of their tall fathers. The average height of sons "regressed" and "moved back" to the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

Regression line

The mathematical equation that defines a simple (pairwise) linear regression line is Y = a + bx, where:

x is called the independent variable or predictor.

Y is the dependent or response variable. This is the value we expect for y (on average) if we know the value of x, i.e., the predicted value of y.

  • a is the free term (intercept) of the estimated line, i.e. the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it is the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit).

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method of determining the coefficients a and b is the least squares method (OLS).

The fit is evaluated by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y - predicted y; Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals depicted (vertical dotted lines) for each point.
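A minimal computation of a and b by the closed-form least squares expressions, together with the residuals discussed above (Python/numpy assumed; the data are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.5])

# Closed-form OLS estimates for Y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)     # observed y minus predicted y
print(f"a={a:.3f}, b={b:.3f}")
print("sum of squared residuals:", np.sum(residuals ** 2))
```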

Linear Regression Assumptions

So, for each observed value, the residual equals the difference between it and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions behind linear regression:

  • The residuals are normally distributed with zero mean;
  • The residuals have the same variance for all values of x (constant variance);
  • The relationship between x and y is linear.

If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, using a logarithmic transformation).

Abnormal values (outliers) and points of influence

An "influential" observation, if omitted, changes one or more model parameter estimates (ie slope or intercept).

An outlier (an observation that conflicts with most of the values ​​in the data set) can be an "influential" observation and can be well detected visually when looking at a 2D scatterplot or a plot of residuals.

Both for outliers and for "influential" observations (points), models are fitted both with and without them, and attention is paid to the change in the estimates (regression coefficients).

When doing an analysis, do not automatically discard outliers or influence points, as simply ignoring them can affect the results. Always study the causes of these outliers and analyze them.

Linear regression hypothesis

When constructing a linear regression, the null hypothesis is tested that the population slope of the regression line, β, equals zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic T = b / SE(b), which follows a t distribution with n - 2 degrees of freedom, where the standard error of the slope is

SE(b) = s / √( Σ (x_i - x̄)² ),

and s² = Σ (y_i - ŷ_i)² / (n - 2) is the estimate of the variance of the residuals.

Usually, if the significance level reached is p < 0.05, the null hypothesis is rejected.


The 95% confidence interval for the slope is b ± t* × SE(b), where t* is the percentage point of the t distribution with n - 2 degrees of freedom that gives the two-sided probability 0.05.

This is the interval that contains the population slope with a probability of 95%.

For large samples we can approximate t* by 1.96 (that is, the test statistic will tend to a normal distribution).
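A sketch of the whole testing procedure on toy data, assuming Python with scipy for the t distribution: the slope, its standard error, the T statistic, the p-value and the 95% confidence interval are computed exactly as in the formulas above.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.8, 4.3, 4.9, 6.2, 6.4, 7.9, 8.6])
n = x.size

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

resid = y - (a + b * x)
s2 = np.sum(resid ** 2) / (n - 2)                  # residual variance estimate
se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))   # standard error of the slope

t_stat = b / se_b                                  # test statistic for H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

t_crit = stats.t.ppf(0.975, df=n - 2)              # approaches 1.96 for large n
ci = (b - t_crit * se_b, b + t_crit * se_b)        # 95% CI for the slope
print(f"b={b:.3f}, t={t_stat:.2f}, p={p_value:.4f}, 95% CI={ci}")
```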

Evaluation of the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between x and y, we expect y to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If so, then most of the variation will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression this is r², the square of the correlation coefficient). It allows a subjective assessment of the quality of the regression equation.

The difference 100% - R² is the percentage of variance that cannot be explained by the regression.

With no formal test to evaluate R², we are forced to rely on subjective judgment to determine the quality of the fit of the regression line.

Applying a Regression Line to a Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a particular value of x by substituting that value into the equation of the regression line.

So, when we predict y for a given x, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to construct confidence limits along the entire line. This is the band or area that contains the true line, for example, with 95% confidence.

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 cases with predictor values P of 7, 4 and 9, and the design includes the first-order effect of P, then the design matrix X will be

1  7
1  4
1  9

a regression equation using P for X1 looks like

Y = b0 + b1 P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

1  49
1  16
1  81

and the equation takes the form

Y = b0 + b1 P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs or to other designs containing only continuous predictors (since there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables; no conversion is performed. In addition, when describing regression designs, one can omit the design matrix X and work only with the regression equation.
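The two design matrices from this section can be reproduced in a few lines; the sketch below (Python/numpy, our assumption) builds the first-order matrix for P = 7, 4, 9 and the quadratic variant in which the X1 column is squared.

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])   # predictor values from the example

# First-order design: a column of ones (intercept) plus P itself
X1 = np.column_stack([np.ones_like(P), P])

# Second-order (quadratic) design: the X1 column holds P squared
X2 = np.column_stack([np.ones_like(P), P ** 2])

print(X1)
# [[1. 7.]
#  [1. 4.]
#  [1. 9.]]
print(X2)
# [[ 1. 49.]
#  [ 1. 16.]
#  [ 1. 81.]]
```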

Example: Simple Regression Analysis

This example uses the data provided in the table:

Fig. 3. Table of initial data.

The data is based on a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are represented as observation names. Information regarding each variable is presented below:

Fig. 4. Variable specification table.

Research objective

For this example, we will analyze the predictors of the poverty rate, i.e. of the percentage of families that are below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

One can put forward the hypothesis that population change and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, hence there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as the predictor variable.

View Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note also the standardized coefficient, which for simple regression designs is the Pearson correlation coefficient; it equals -.65, meaning that for every standard deviation decrease in population there is a .65 standard deviation increase in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, we will build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand columns) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers need attention if an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In that case it is worth repeating the analysis both with and without the outliers to make sure they do not seriously affect the correlation between the variables.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., with 95% probability the regression line lies between the two dashed curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression design. Interpretations of the non-standardized and standardized regression coefficients were also presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable was demonstrated.

When a correlation exists between a factor attribute and a resultant attribute, doctors often need to determine by what amount the value of one attribute will change when the other changes by a unit of measurement, either generally accepted or established by the researcher.

For example, how will the body weight of schoolchildren of the 1st grade (girls or boys) change if their height increases by 1 cm. For this purpose, the regression analysis method is used.

Most often, the regression analysis method is used to develop normative scales and standards for physical development.

  1. Definition of regression. Regression is a function that allows, based on the average value of one attribute, to determine the average value of another attribute that is correlated with the first one.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, you can calculate the number of colds on average at certain values ​​of the average monthly air temperature in the autumn-winter period.

  2. Definition of the regression coefficient. The regression coefficient is the absolute value by which the value of one attribute changes on average when another attribute associated with it changes by the established unit of measurement.
  3. Regression coefficient formula: R_y/x = r_xy × (σ_y / σ_x),
    where R_y/x is the regression coefficient;
    r_xy is the correlation coefficient between features x and y;
    σ_y and σ_x are the standard deviations of features x and y.

    In our example, r_xy = -0.96 (the correlation coefficient between the average monthly air temperature and the number of colds);
    σ_x = 4.6 (standard deviation of air temperature in the autumn-winter period);
    σ_y = 8.65 (standard deviation of the number of infectious colds).
    Thus, R_y/x = -0.96 × (8.65 / 4.6) = -1.8, i.e. when the average monthly air temperature (x) decreases by 1 degree, the average number of infectious colds (y) in the autumn-winter period will increase by 1.8 cases.

  4. Regression equation: y = M_y + R_y/x × (x - M_x),
    where y is the average value of the attribute to be determined when the average value of the other attribute (x) changes;
    x is the known average value of the other feature;
    R_y/x is the regression coefficient;
    M_x and M_y are the known average values of features x and y.

    For example, the average number of infectious colds (y) can be determined without special measurements for any average monthly air temperature (x). So, if x = -9°, R_y/x = -1.8, M_x = -7° and M_y = 20 diseases, then y = 20 + (-1.8) × (-9 - (-7)) = 20 + 3.6 = 23.6 diseases.
    This equation is applied in the case of a straight-line relationship between the two features (x and y).

  5. Purpose of the regression equation. The regression equation is used to plot the regression line, which in turn allows one, without special measurements, to determine any average value (y) of one attribute when the value (x) of the other attribute changes. Based on these data, a graph - the regression line - is built, which can be used to determine the average number of colds at any value of the average monthly temperature within the range between the calculated values of the number of colds.
  6. Regression sigma (formula): σ_Ry/x = σ_y × √(1 - r_xy²),
    where σ_Ry/x is the sigma (standard deviation) of the regression;
    σ_y is the standard deviation of feature y;
    r_xy is the correlation coefficient between features x and y.

    So, if σ_y (the standard deviation of the number of colds) = 8.65, and r_xy (the correlation coefficient between the number of colds (y) and the average monthly air temperature in the autumn-winter period (x)) = -0.96, then σ_Ry/x = 8.65 × √(1 - 0.96²) = 8.65 × 0.28 = 2.42.

  7. Purpose of the regression sigma. It characterizes the measure of variability of the resulting feature (y) around the regression line.

    For example, it characterizes the variability of the number of colds at a given value of the average monthly air temperature in the autumn-winter period. So, the average number of colds at air temperature x1 = -6° can range from 15.78 to 20.62 diseases.
    At x2 = -9°, the average number of colds can range from 21.18 to 26.02 diseases, etc.

    The regression sigma is used in the construction of a regression scale, which reflects the deviation of the values ​​of the effective attribute from its average value plotted on the regression line.

  8. Data required to calculate and plot the regression scale:
    • the regression coefficient, R_y/x;
    • the regression equation, y = M_y + R_y/x × (x - M_x);
    • the regression sigma, σ_Ry/x.
  9. The sequence of calculations and the graphic representation of the regression scale.
    • Determine the regression coefficient by the formula (see paragraph 3). For example, determine by how much body weight changes on average (at a certain age, depending on gender) when average height changes by 1 cm.
    • Using the formula of the regression equation (see paragraph 4), determine what the average body weight (y1, y2, y3 ...)* will be for given values of height (x1, x2, x3 ...).
      ________________
      * The value of y should be calculated for at least three known values of x.

      The average values of body weight and height (M_x and M_y) for the given age and sex are assumed to be known.

    • Calculate the regression sigma, knowing the corresponding values of σ_y and r_xy and substituting them into the formula (see paragraph 6).
    • Based on the known values x1, x2, x3, their corresponding average values y1, y2, y3, and the smallest (y - σ_Ry/x) and largest (y + σ_Ry/x) values of y, construct the regression scale.

      For the graphic representation, the values x1, x2, x3 are first marked along the abscissa and the corresponding averages y1, y2, y3 along the ordinate, i.e. a regression line is built, for example, of the dependence of body weight (y) on height (x).

      Then, at the corresponding points y1, y2, y3, the numerical values of the regression sigma are marked, i.e. the smallest and largest values of y1, y2, y3 are found on the graph.

  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development. Using the standard scale, an individual assessment of a child's development can be given. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight lies within one regression sigma of the average calculated body weight y for that height x, i.e. (y ± 1 σ_Ry/x).

    Physical development is considered disharmonious in terms of body weight if the child's body weight for a certain height lies within the second regression sigma: (y ± 2 σ_Ry/x).

    Physical development will be sharply disharmonious, due to either excess or insufficient body weight, if the body weight for a certain height lies within the third regression sigma: (y ± 3 σ_Ry/x).

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm, and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9, standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, build a regression scale, present the results of its solution graphically;
  • draw the appropriate conclusions.

The condition of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem                        |  Results of the solution (regression scale: expected body weight, kg)
Feature           M        σ         r_xy  R_y/x |  X       Y          σ_Ry/x     y - σ_Ry/x   y + σ_Ry/x
Height (x)        109 cm   ±4.4 cm   +0.9  0.16  |  100 cm  17.56 kg   ±0.35 kg   17.21 kg     17.91 kg
Body weight (y)   19 kg    ±0.8 kg               |  110 cm  19.16 kg              18.81 kg     19.51 kg
                                                 |  120 cm  20.76 kg              20.41 kg     21.11 kg

Solution.

Conclusion. Thus, the regression scale, within the calculated values of body weight, makes it possible to determine it for any other value of height, or to assess the individual development of a child. To do this, a perpendicular is raised from the corresponding height value to the regression line.
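For readers who prefer to verify the table, here is a short sketch (Python/numpy assumed) that reproduces its figures from the problem conditions; note that the table rounds R_y/x to 0.16, which the code mirrors so the second decimals match.

```python
import numpy as np

# Conditions from Table 1
M_x, M_y = 109.0, 19.0   # mean height (cm), mean body weight (kg)
s_x, s_y = 4.4, 0.8      # standard deviations
r_xy = 0.9               # correlation between height and weight

R_yx = round(r_xy * s_y / s_x, 2)          # regression coefficient; rounds to 0.16
sigma_R = s_y * np.sqrt(1.0 - r_xy ** 2)   # regression sigma, about 0.35 kg

for x in (100.0, 110.0, 120.0):
    y = M_y + R_yx * (x - M_x)             # regression equation (paragraph 4)
    print(f"height {x:.0f} cm: expected weight {y:.2f} kg, "
          f"scale {y - sigma_R:.2f} .. {y + sigma_R:.2f} kg")
```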


Regression analysis examines the dependence of a certain quantity on another quantity or several other quantities. Regression analysis is mainly used in medium-term forecasting, as well as in long-term forecasting. Medium- and long-term periods make it possible to establish changes in the business environment and take into account the impact of these changes on the indicator under study.

To carry out regression analysis, it is necessary:

    availability of annual data on the studied indicators,

    availability of one-time forecasts, i.e. forecasts that do not improve with new data.

Regression analysis is usually carried out for objects that have a complex, multifactorial nature, such as the volume of investments, profits, sales volumes, etc.

In the normative forecasting method, the ways and terms of achieving possible states of the phenomenon, taken as the goal, are determined. This involves predicting the achievement of desired states of the phenomenon on the basis of predetermined norms, ideals, incentives and goals. Such a forecast answers the question: by what ways can the desired be achieved? The normative method is more often used for programmatic or targeted forecasts. Both a quantitative expression of the standard and a certain scale of the possibilities of the evaluation function are used.

In the case of using a quantitative expression, for example, physiological and rational norms for the consumption of certain food and non-food products developed by specialists for various groups of the population, it is possible to determine the level of consumption of these goods for the years preceding the achievement of the specified norm. Such calculations are called interpolation. Interpolation is a way of calculating indicators that are missing in the time series of a phenomenon, based on an established relationship. Taking the actual value of the indicator and the value of its standards as the extreme members of the dynamic series, it is possible to determine the magnitude of the values ​​within this series. Therefore, interpolation is considered a normative method. The previously given formula (4), used in extrapolation, can be used in interpolation, where y n will no longer characterize the actual data, but the standard of the indicator.

If a scale (field, spectrum) of the possibilities of the evaluation function, i.e., the preference distribution function, is used in the normative method, approximately the following gradation is indicated: undesirable - less desirable - more desirable - most desirable - optimal (standard).

The normative forecasting method helps to develop recommendations for increasing the level of objectivity, and hence the effectiveness of decisions.

Modeling is perhaps the most difficult forecasting method. Mathematical modeling means the description of an economic phenomenon through mathematical formulas, equations and inequalities. The mathematical apparatus should accurately reflect the forecast background, although it is quite difficult to reflect fully the entire depth and complexity of the predicted object. The term "model" is derived from the Latin word modulus, meaning "measure". For this reason, it would be more correct to consider modeling not as a forecasting method, but as a method for studying a similar phenomenon on a model.

In a broad sense, models are called substitutes for the object of study, which are in such a similarity with it that allows you to get new knowledge about the object. The model should be considered as a mathematical description of the object. In this case, the model is defined as a phenomenon (object, installation) that is in some correspondence with the object under study and can replace it in the research process, presenting information about the object.

With a narrower understanding of the model, it is considered as an object of forecasting, its study allows obtaining information about the possible states of the object in the future and ways to achieve these states. In this case, the purpose of the predictive model is to obtain information not about the object in general, but only about its future states. Then, when building a model, it may be impossible to directly check its correspondence to the object, since the model represents only its future state, and the object itself may currently be absent or have a different existence.

Models can be material and ideal.

Ideal models are used in economics. The most developed ideal model for the quantitative description of a socio-economic (economic) phenomenon is the mathematical model, which uses numbers, formulas, equations, algorithms or graphical representations. With the help of economic models one determines:

    the relationship between various economic indicators;

    various kinds of restrictions imposed on indicators;

    criteria to optimize the process.

A meaningful description of an object can be represented in the form of its formalized scheme, which indicates which parameters and initial information must be collected in order to calculate the desired values. A mathematical model, unlike a formalized scheme, contains specific numerical data characterizing an object. The development of a mathematical model largely depends on the forecaster's idea of ​​the essence of the process being modeled. Based on his ideas, he puts forward a working hypothesis, with the help of which an analytical record of the model is created in the form of formulas, equations and inequalities. As a result of solving the system of equations, specific parameters of the function are obtained, which describe the change in the desired variables over time.

The order and sequence of work as an element of the organization of forecasting is determined depending on the forecasting method used. Usually this work is carried out in several stages.

Stage 1 - predictive retrospection, i.e., the establishment of the object of forecasting and the forecast background. The work at the first stage is performed in the following sequence:

    formation of a description of an object in the past, which includes a pre-forecast analysis of the object, an assessment of its parameters, their significance and mutual relationships,

    identification and evaluation of sources of information, the procedure and organization of work with them, the collection and placement of retrospective information;

    setting research objectives.

Performing the tasks of predictive retrospection, forecasters study the history of the development of the object and the forecast background in order to obtain their systematic description.

Stage 2 - predictive diagnosis, during which a systematic description of the object of forecasting and the forecast background is studied in order to identify trends in their development and select models and methods of forecasting. The work is performed in the following sequence:

    development of a forecast object model, including a formalized description of the object, checking the degree of adequacy of the model to the object;

    selection of forecasting methods (main and auxiliary), development of an algorithm and work programs.

Stage 3 - prospection, i.e., the process of extensive development of the forecast, including: 1) calculation of predicted parameters for a given lead period; 2) synthesis of individual components of the forecast.

4th stage - assessment of the forecast, including its verification, i.e., determining the degree of reliability, accuracy and validity.

In the course of prospecting and evaluation, forecasting tasks and its evaluation are solved on the basis of the previous stages.

The indicated phasing is approximate and depends on the main forecasting method.

The results of the forecast are drawn up in the form of a certificate, report or other material and are presented to the customer.

In forecasting, the deviation of the forecast from the actual state of the object is called the forecast error; it is calculated by formula (9.3).
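Formula (9.3) can be illustrated with a common convention: the absolute error as the actual value minus the forecast, and the relative error as its share of the actual value. The sketch below (Python/numpy assumed, invented numbers) is one such reading, not necessarily the author's exact formula.

```python
import numpy as np

actual = np.array([102.0, 110.0, 121.0])     # observed values of the indicator (invented)
forecast = np.array([100.0, 112.0, 118.0])   # earlier forecasts for the same periods (invented)

abs_err = actual - forecast                  # absolute forecast error
rel_err = 100.0 * np.abs(abs_err) / actual   # relative error, % of the actual value
print("absolute errors:", abs_err)
print("relative errors, %:", rel_err.round(2))
print(f"mean absolute percentage error: {rel_err.mean():.2f}%")
```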

Sources of errors in forecasting

The main sources can be:

1. Simple transfer (extrapolation) of data from the past to the future (for example, the company does not have other forecast options, except for a 10% increase in sales).

2. The inability to accurately determine the probability of an event and its impact on the object under study.

3. Unforeseen difficulties (disruptive events) affecting the implementation of the plan, for example, the sudden dismissal of the head of the sales department.

In general, the accuracy of forecasting increases with the accumulation of experience in forecasting and the development of its methods.

Regression analysis underlies the creation of most econometric models, including cost estimation models. To build valuation models, this method can be used if the number of analogues (comparable objects) n and the number of cost factors (elements of comparison) k relate to each other as n > (5-10) × k, i.e. there should be 5-10 times more analogues than cost factors. The same requirement on the ratio of the amount of data to the number of factors applies to other tasks: establishing a relationship between the cost and the consumer parameters of an object; justifying the procedure for calculating corrective indices; clarifying price trends; establishing a relationship between wear and changes in influencing factors; obtaining dependencies for calculating cost standards, etc. Meeting this requirement is necessary in order to reduce the probability of working with a data sample that does not satisfy the requirement of normality of the random variables.

The regression relationship reflects only the average tendency of the resulting variable (for example, cost) with respect to changes in one or more factor variables (for example, location, number of rooms, area, floor, etc.). This is the difference between a regression relationship and a functional one, in which the value of the resulting variable is strictly determined for a given value of the factor variables.

The presence of a regression relationship f between the resulting variable y and the factor variables x1, ..., xk (factors) indicates that this relationship is determined not only by the influence of the selected factor variables, but also by the influence of variables some of which are generally unknown and others of which cannot be assessed and taken into account:

y = f(x1, ..., xk) + ε.

The influence of the unaccounted-for variables is denoted by the second term of this equation, ε, which is called the approximation error.

There are the following types of regression dependencies:

  • paired regression - the relationship between two variables (resultant and factor);
  • multiple regression - the dependence of one resulting variable on two or more factor variables included in the study.

The main task of regression analysis is to quantify the closeness of the relationship between variables (in paired regression) and multiple variables (in multiple regression). The tightness of the relationship is quantified by the correlation coefficient.

The use of regression analysis allows you to establish the pattern of influence of the main factors (hedonic characteristics) on the indicator under study, both in their totality and each of them individually. With the help of regression analysis, as a method of mathematical statistics, it is possible, firstly, to find and describe the form of the analytical dependence of the resulting (desired) variable on the factorial ones and, secondly, to estimate the tightness of this dependence.

By solving the first problem, a mathematical regression model is obtained, with the help of which the desired indicator is then calculated for given factor values. The solution of the second problem makes it possible to establish the reliability of the calculated result.

Thus, regression analysis can be defined as a set of formal (mathematical) procedures designed to measure the tightness, the direction and the analytical form of the relationship between the resulting and factor variables, i.e. the output of such an analysis should be a structurally and quantitatively defined statistical model of the form

y = f(x1, ..., xk) + ε,

where y is the average value of the resulting variable (the desired indicator, for example, cost, rent or capitalization rate) over its n observations; xi is the value of the i-th factor variable (the i-th cost factor); k is the number of factor variables.

The function f(x1, ..., xk), which describes the dependence of the resulting variable on the factor variables, is called the regression equation (function). The term "regression" (from the Latin regressio - retreat, return to something) is associated with the specifics of one of the particular problems solved at the stage of the method's formation, and at present does not reflect the entire essence of the method, but continues to be used.

Regression analysis generally includes the following steps:

  • formation of a sample of homogeneous objects and collection of initial information about these objects;
  • selection of the main factors influencing the resulting variable;
  • checking the sample for normality using the χ² or the binomial criterion;
  • acceptance of a hypothesis about the form of the relationship;
  • mathematical processing of the data;
  • obtaining a regression model;
  • assessment of its statistical indicators;
  • verification calculations using the regression model;
  • analysis of the results.

The specified sequence of operations takes place in the study of both a pair relationship between a factor variable and one resulting variable, and a multiple relationship between the resulting variable and several factor variables.

The use of regression analysis imposes certain requirements on the initial information:

  • the statistical sample of objects should be homogeneous in functional and constructive-technological terms;
  • it should be quite numerous;
  • the cost indicator under study - the resulting variable (price, cost, costs) - must be reduced to the same calculation conditions for all objects in the sample;
  • the factor variables must be measured accurately enough;
  • the factor variables must be independent or minimally dependent.

The requirements for homogeneity and completeness of the sample are in conflict: the more strictly the selection of objects is carried out according to their homogeneity, the smaller the sample is received, and, conversely, to enlarge the sample, it is necessary to include objects that are not very similar to each other.

After the data are collected for a group of homogeneous objects, they are analyzed to establish the form of the relationship between the resulting and factor variables in the form of a theoretical regression line. The process of finding a theoretical regression line consists in a reasonable choice of an approximating curve and calculation of the coefficients of its equation. The regression line is a smooth curve (in a particular case, a straight line) that describes with the help of a mathematical function the general trend of the dependence under study and smoothes irregular, random outliers from the influence of side factors.

To represent paired regression dependencies in valuation problems, the following functions are most often used: linear y = a0 + a1·x + ε; power y = a0·x^a1 + ε; exponential y = a0·a1^x + ε; linear-exponential y = a0 + a1·a2^x + ε. Here ε is the approximation error caused by the action of unaccounted-for random factors.

In these functions, y is the resulting variable; x is the factor variable (factor); a0, a1, a2 are the parameters of the regression model, the regression coefficients.

The linear exponential model belongs to the class of so-called hybrid models, which combine additive and multiplicative components built from the factor values xi (i = 1, ..., l) and the coefficients bi (i = 0, ..., l) of the regression equation.

In such an equation, the components A, B and Z correspond to the cost of individual components of the asset being valued (for example, the cost of the land plot and the cost of improvements), while the parameter Q is common: it adjusts the value of all components of the asset for a common influence factor, such as location.

The factors that appear in the exponents of the corresponding coefficients are binary variables (0 or 1). The factors at the base of a power are discrete or continuous variables.

The factors whose coefficients enter through multiplication are also continuous or discrete variables.

The specification is carried out, as a rule, using an empirical approach and includes two stages:

  • plotting the points of the regression field on a graph;
  • graphical (visual) analysis of the type of a possible approximating curve.

The type of regression curve is not always immediately selectable. To determine it, the points of the regression field are first plotted on the graph according to the initial data. Then a line is visually drawn along the position of the points, trying to find out the qualitative pattern of the relationship: uniform growth or uniform decrease, growth (decrease) with an increase (decrease) in the rate of dynamics, a smooth approach to a certain level.

This empirical approach is complemented by logical analysis, starting from already known ideas about the economic and physical nature of the studied factors and their mutual influence.

For example, it is known that the dependences of resulting variables - economic indicators (prices, rent) - on a number of factor variables - price-forming factors (distance from the center of the settlement, area, etc.) - are non-linear and can be described quite strictly by a power, exponential or quadratic function. But over small ranges of the factors, acceptable results can also be obtained with a linear function.

If it is still impossible to immediately make a confident choice of any one function, then two or three functions are selected, their parameters are calculated, and then, using the appropriate criteria for the tightness of the connection, the function is finally selected.

In regression theory, the process of finding the shape of the curve is called specification of the model, and the calculation of its coefficients, calibration of the model.

If it turns out that the resulting variable y depends on several factor variables (factors) x1, x2, ..., xk, then a multiple regression model is built. Usually three forms of multiple relationship are used: linear y = a0 + a1·x1 + a2·x2 + ... + ak·xk; exponential y = a0·a1^x1·a2^x2·...·ak^xk; power y = a0·x1^a1·x2^a2·...·xk^ak; or combinations thereof.

The power and exponential functions are more universal, since they approximate non-linear relationships, which make up the majority of the dependences studied in valuation. In addition, they can be used both in the method of statistical modeling for mass valuation and in the method of direct comparison in individual valuation when establishing correction factors.

At the calibration stage, the parameters of the regression model are calculated by the least squares method, the essence of which is that the sum of the squared deviations of the calculated values of the resulting variable ŷi, i.e. the values calculated from the selected relationship equation, from the actual values yi should be minimal:

Q = Σ (yi - ŷi)² → min.

The values ŷi and yi are known, so Q is a function only of the coefficients of the equation. To find the minimum of Q, the partial derivatives of Q with respect to the coefficients of the equation are taken and equated to zero.

As a result, we obtain a system of normal equations, the number of which is equal to the number of determined coefficients of the desired regression equation.

Suppose we need to find the coefficients of the linear equation y = a0 + a1·x. The sum of squared deviations is

Q = Σ (yi - a0 - a1·xi)², the sum running over i = 1, ..., n.

Differentiating the function Q with respect to the unknown coefficients a0 and a1 and equating the partial derivatives to zero, after transformations we get the normal equations:

a0·n + a1·Σxi = Σyi,
a0·Σxi + a1·Σxi² = Σxi·yi,

where n is the number of original actual values yi (the number of analogues).
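A direct check of these normal equations on toy data (Python/numpy assumed): the two-by-two system is assembled from the sums and solved for a0 and a1.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([5.1, 8.9, 13.2, 16.8, 21.1])
n = x.size

# Normal equations for y = a0 + a1*x:
#   a0*n      + a1*sum(x)   = sum(y)
#   a0*sum(x) + a1*sum(x^2) = sum(x*y)
A = np.array([[n, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([y.sum(), (x * y).sum()])

a0, a1 = np.linalg.solve(A, b)
print(f"y = {a0:.3f} + {a1:.3f} * x")
```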

The above procedure for calculating the coefficients of the regression equation is also applicable to nonlinear dependencies, provided those dependencies can be linearized, i.e. brought to a linear form by a change of variables. Power and exponential functions acquire a linear form after taking logarithms and making the corresponding change of variables. For example, a power function after taking logarithms takes the form ln y = ln a0 + a1·ln x. After the change of variables Y = ln y, A0 = ln a0, X = ln x we obtain the linear function

Y = A0 + a1·X, whose coefficients are found as described above.

The least squares method is also used to calculate the coefficients of a multiple regression model. Thus, the system of normal equations for a linear function of two variables x1 and x2, after a series of transformations, looks like this:

a0·n + a1·Σx1 + a2·Σx2 = Σy,
a0·Σx1 + a1·Σx1² + a2·Σx1·x2 = Σx1·y,
a0·Σx2 + a1·Σx1·x2 + a2·Σx2² = Σx2·y.

Usually this system of equations is solved using linear algebra methods. A multiple exponential function is brought to a linear form by taking logarithms and changing variables in the same way as a paired exponential function.

When using hybrid models, multiple regression coefficients are found using numerical procedures of the method of successive approximations.

To make a final choice among several regression equations, each equation must be tested for tightness of relationship, which is measured by the correlation coefficient, the variance and the coefficient of variation. The Student and Fisher criteria can also be used for the evaluation. The curve that reveals the greater tightness of relationship is preferable, other things being equal.

If a problem of the class in which the dependence of a cost indicator on cost factors must be established is being solved, then the desire to take into account as many influencing factors as possible, and thereby build a more accurate multiple regression model, is understandable. However, two objective limitations hinder the expansion of the number of factors. First, building a multiple regression model requires a much larger sample of objects than building a paired model. It is generally accepted that the number of objects in the sample should exceed the number k of factors by at least 5-10 times. It follows that in order to build a model with three influencing factors, it is necessary to collect a sample of approximately 20 objects with different sets of factor values. Second, the factors selected for the model must, in their influence on the value indicator, be sufficiently independent of each other. This is not easy to ensure, since the sample usually combines objects belonging to the same family, in which many factors change in step from object to object.

The quality of regression models is usually checked using the following statistics.

Standard deviation of the regression equation error (estimation error):

S_e = √( Σ (yi - ŷi)² / (n - k - 1) ),

where n is the sample size (the number of analogues);

k is the number of factors (cost factors);

Σ (yi - ŷi)² is the error unexplained by the regression equation (Fig. 3.2);

yi is the actual value of the resulting variable (for example, cost); ŷi is the calculated value of the resulting variable.

This indicator is also called the standard error of estimation (RMS error). In the figure, the dots indicate specific sample values, the ȳ line marks the mean values of the sample, and the inclined dash-dotted line is the regression line.


Fig. 3.2.

The standard deviation of the estimation error measures the deviation of the actual values of y from the corresponding calculated values ŷi obtained from the regression model. If the sample on which the model was built obeys the normal distribution law, then it can be argued that 68% of the real values of y lie within ŷ ± S_e of the regression line, and 95% within ŷ ± 2S_e. This indicator is convenient because the units of S_e match the units of y. In this regard, it can be used to indicate the accuracy of the result obtained in the valuation process. For example, in a certificate of value one can state that the market value obtained with the regression model, V, lies with 95% probability in the range from (V - 2S_e) to (V + 2S_e).

Coefficient of variation of the resulting variable:

var = (S_e / ȳ) · 100%,

where ȳ is the mean value of the resulting variable (Fig. 3.2).

In regression analysis, the coefficient of variation var is the standard error of estimation expressed as a percentage of the mean value of the resulting variable. The coefficient of variation can serve as a criterion of the predictive quality of the resulting regression model: the smaller the value of var, the higher the predictive quality of the model. The coefficient of variation is preferable to the indicator S_e because it is a relative indicator. In practical use it can be recommended not to rely on a model whose coefficient of variation exceeds 33%, since in that case it cannot be said that the sample follows the normal distribution law.
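Under the same illustrative conventions, the coefficient of variation can be sketched as:

```python
import numpy as np

def coefficient_of_variation(y, y_hat, k):
    """var = S_e / mean(y) * 100%; models with var > 33% are best rejected."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (len(y) - k - 1))
    return s_e / np.mean(y) * 100.0
```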

Coefficient of determination (squared multiple correlation coefficient):

R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²

This indicator is used to analyze the overall quality of the resulting regression model. It indicates what percentage of the variation of the resulting variable is due to the influence of all factor variables included in the model. The coefficient of determination always lies in the range from zero to one. The closer the value of the coefficient of determination is to unity, the better the model describes the original data series. The coefficient of determination can be represented in another way:

R² = Σ(ŷᵢ - ȳ)² / Σ(yᵢ - ȳ)²,

where Σ(ŷᵢ - ȳ)² is the error explained by the regression model, and Σ(yᵢ - ŷᵢ)² is the error unexplained by the regression model. From an economic point of view, this criterion makes it possible to judge what percentage of the price variation is explained by the regression equation.

An exact acceptance limit for the indicator R² cannot be specified for all cases. Both the sample size and the meaningful interpretation of the equation must be taken into account. As a rule, when studying data on objects of the same type obtained at approximately the same time, the value of R² does not exceed the level of 0.6-0.7. If all prediction errors are zero, i.e., when the relationship between the resulting and factor variables is functional, then R² = 1.
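A sketch of the first form of this coefficient (illustrative names, NumPy assumed):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE / SST: the share of the variation of y explained by the model."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    sse = np.sum((y - y_hat) ** 2)       # error unexplained by the regression
    sst = np.sum((y - np.mean(y)) ** 2)  # total variation of y
    return 1.0 - sse / sst
```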

Adjusted coefficient of determination:

R²_adj = 1 - (1 - R²) · (n - 1) / (n - k - 1)

The need to introduce an adjusted coefficient of determination is explained by the fact that as the number of factors k grows, the usual coefficient of determination almost always increases, while the number of degrees of freedom (n - k - 1) decreases. The adjustment always reduces the value of R², since (n - 1) > (n - k - 1). As a result, the value of R²_adj may even become negative. This means that the value of R² was close to zero before the adjustment, and that the proportion of the variance of the variable y explained by the regression equation is very small.

Of two variants of regression models that differ in the value of the adjusted coefficient of determination but have equally good other quality criteria, the variant with the larger value of the adjusted coefficient of determination is preferable. The coefficient of determination is not adjusted if (n - k) / k > 20.
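A one-line sketch of the adjustment (same illustrative conventions):

```python
def adjusted_r_squared(r2, n, k):
    """R^2_adj = 1 - (1 - R^2) * (n - 1) / (n - k - 1); may become negative
    when the unadjusted R^2 is close to zero."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
```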

Fisher criterion:

F = [ Σ(ŷᵢ - ȳ)² / k ] / [ Σ(yᵢ - ŷᵢ)² / (n - k - 1) ]

This criterion is used to assess the significance of the coefficient of determination. The residual sum of squares is a measure of the error of prediction by the regression of the known cost values yᵢ. Comparing it with the regression sum of squares shows how many times better the regression dependence predicts the result than the mean ȳ does. There is a table of critical values F_cr of the Fisher coefficient depending on the number of degrees of freedom of the numerator v₁ = k, of the denominator v₂ = n - k - 1, and the significance level α. If the calculated value of the Fisher criterion F is greater than the table value, then the hypothesis of the insignificance of the coefficient of determination, i.e., of a discrepancy between the relationships embedded in the regression equation and the really existing ones, is rejected with probability p = 1 - α.
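Assuming SciPy is available, the check can be sketched as follows; F is computed here from R², which is an equivalent form of the ratio of the regression and residual sums of squares:

```python
from scipy.stats import f as f_distribution

def fisher_criterion(r2, n, k, alpha=0.05):
    """Return (F_calc, F_crit, significant?) for the coefficient of determination."""
    f_calc = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    f_crit = f_distribution.ppf(1.0 - alpha, k, n - k - 1)
    return f_calc, f_crit, f_calc > f_crit
```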

Average approximation error (average percentage deviation) is calculated as the average relative difference, expressed as a percentage, between the actual and calculated values of the resulting variable:

δ = (1/n) Σ |yᵢ - ŷᵢ| / yᵢ · 100%

The smaller the value of this indicator, the better the predictive quality of the model. When the value of this indicator is no higher than 7%, the model is said to be highly accurate. If δ > 15%, the accuracy of the model is said to be unsatisfactory.
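A sketch of this indicator (illustrative names; the actual values yᵢ are assumed to be nonzero):

```python
import numpy as np

def average_approximation_error(y, y_hat):
    """delta = mean(|y_i - y_hat_i| / y_i) * 100%."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(np.abs(y - y_hat) / y) * 100.0

# <= 7% is usually read as high accuracy, > 15% as unsatisfactory.
```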

Standard error of the regression coefficient:

S_aj = S_e · √( ((XᵀX)⁻¹)ⱼⱼ ),

where ((XᵀX)⁻¹)ⱼⱼ is the j-th diagonal element of the matrix (XᵀX)⁻¹;

k is the number of factors;

X is the matrix of factor variable values;

Xᵀ is the transposed matrix of factor variable values;

(XᵀX)⁻¹ is the matrix inverse to the matrix XᵀX.

The smaller this error for each regression coefficient, the more reliable the estimate of the corresponding regression coefficient.
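A sketch of this calculation, assuming the design matrix X already includes a column of ones for the free term:

```python
import numpy as np

def coefficient_standard_errors(X, y, y_hat):
    """S_aj = S_e * sqrt(((X^T X)^-1)_jj) for every coefficient at once."""
    X, y, y_hat = np.asarray(X), np.asarray(y), np.asarray(y_hat)
    n, p = X.shape                    # p = k + 1 columns, including the intercept
    s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - p))   # n - p == n - k - 1
    xtx_inv = np.linalg.inv(X.T @ X)
    return s_e * np.sqrt(np.diag(xtx_inv))
```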

Student's test (t-statistic):

tⱼ = aⱼ / S_aj

This criterion makes it possible to measure the degree of reliability (significance) of the relationship conveyed by a given regression coefficient. If the calculated value tⱼ is greater than the table value t_cr, where v = n - k - 1 is the number of degrees of freedom, then the hypothesis that this coefficient is statistically insignificant is rejected with a probability of (100 - α)%. There are special tables of the t-distribution that make it possible to determine the critical value of the criterion for a given significance level α and number of degrees of freedom v. The most commonly used value of α is 5%.
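A sketch of the two-sided test, assuming SciPy (the helper name is illustrative):

```python
from scipy.stats import t as t_distribution

def t_significance(a_j, s_aj, n, k, alpha=0.05):
    """Reject 'the coefficient is insignificant' when |t| exceeds t_cr."""
    t_calc = abs(a_j / s_aj)
    t_crit = t_distribution.ppf(1.0 - alpha / 2.0, n - k - 1)  # two-sided
    return t_calc, t_crit, t_calc > t_crit
```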

Multicollinearity, i.e., the effect of mutual relationships between the factor variables, leads to the need to be content with a limited number of them. If this is not taken into account, one can end up with an illogical regression model. To avoid the negative effect of multicollinearity, pair correlation coefficients r_xjxl between the selected variables xⱼ and xₗ are calculated before building a multiple regression model:

r_xjxl = ( mean(xⱼ·xₗ) - mean(xⱼ)·mean(xₗ) ) / (S_xj · S_xl),

where mean(xⱼ·xₗ) is the mean value of the product of the two factor variables;

mean(xⱼ)·mean(xₗ) is the product of the mean values of the two factor variables;

S_xj, S_xl are the estimates of the standard deviations of the factor variables xⱼ and xₗ.

Two variables are considered regressively related (i.e., collinear) if their pair correlation coefficient is strictly greater than 0.8 in absolute value. In this case, one of these variables should be excluded from consideration.
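A sketch of such a screening with NumPy; the factor matrix is hypothetical, with one row per object and one column per factor:

```python
import numpy as np

# Hypothetical factor matrix: one row per object, one column per factor.
factors = np.array([[1.0, 2.0, 0.5],
                    [2.0, 4.1, 0.7],
                    [3.0, 5.9, 0.2],
                    [4.0, 8.2, 0.9],
                    [5.0, 9.8, 0.4]])

r = np.corrcoef(factors, rowvar=False)   # pairwise correlation matrix
collinear = (np.abs(r) > 0.8) & ~np.eye(r.shape[0], dtype=bool)
print(np.argwhere(collinear))  # pairs of factor indices; drop one of each pair
```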

In order to expand the possibilities of economic analysis of the resulting regression models, average coefficients of elasticity are used, determined by the formula:

Eⱼ = aⱼ · x̄ⱼ / ȳ,

where x̄ⱼ is the mean value of the corresponding factor variable;

ȳ is the mean value of the resulting variable; aⱼ is the regression coefficient for the corresponding factor variable.

The elasticity coefficient shows by how many percent, on average, the value of the resulting variable changes when the factor variable changes by 1%, i.e., how the resulting variable reacts to a change in the factor variable; for example, how the price of a square meter of apartment area reacts to the distance from the city center.
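A sketch of this formula (illustrative names):

```python
import numpy as np

def elasticity(a_j, x_j, y):
    """E_j = a_j * mean(x_j) / mean(y): average % change of the result
    per 1% change of the j-th factor."""
    return a_j * np.mean(x_j) / np.mean(y)
```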

Useful from the point of view of analyzing the significance of a particular regression coefficient is the estimate of the partial coefficient of determination, which relates the variation explained by the j-th factor to the estimate of the variance of the resulting variable. This coefficient shows what percentage of the variation of the resulting variable is explained by the variation of the j-th factor variable included in the regression equation.

  • Hedonic characteristics are understood as the characteristics of an object that reflect its useful (valuable) properties from the point of view of buyers and sellers.