Wednesday, April 12, 2017

36. INTRODUCTION TO REGRESSIONS

Regressions are parametric models that predict a quantitative outcome (dependent variable) from one or more quantitative predictor variables (independent variable). The model to be applied depends on the kind of relationship that the variables exhibit. 

Regressions take the form of equations in which “y” is the response variables that represent the outcome and “x” is the input variable, that is to say the explanatory variable. Before undertaking the analysis, it is important that several conditions are met:
  • -          Y values must have a normal distribution: this can be analyzed with a standardized residual plot, in which most of the values should be close to 0 (in samples larger than 50, this is less important), or a probability residual plot, in which there should be an approximate straight line (Figure 31);
  • -          Y values must have a similar variance around each x value: we can use a best-fit line in a scatter plot (Figure 32);
  • -          Residuals must be independent; specifically, in the residual plot (Figure 33), the points must be equally distributed around the 0 line and not show any pattern (randomly distributed).


Normal probability plot

Figure 31: Normal Probability Plot

Best-Fit Line Scatter Plot

Figure 32: Best-Fit Line Scatter Plot

If the conditions are not met, we can either transform the variables[1] or perform a non-parametric analysis (see 47. INTRODUCTION TO NON-PARAMETRIC MODELS).

In addition, regressions are sensitive to outliers, so it is important to deal with them properly. We can detect outliers using a standardized residual plot, in which data that fall outside +3 and -3 (standard deviations) are usually considered to be outliers. In this case we should first check whether it was a mistake in collecting the data (for example a 200-year-old person is a mistake) and eliminate the outlier from the data set or replace it (see below how to deal with missing data). If it is not a mistake, a common practice is to carry out the regression with and without the outliers and present both results or to transform the data. For example, we may apply a log transformation or a rank transformation. In any case we should be aware of the implications of these transformations.

Standardized Residuals Plot with an Outlier

Figure 33: Standardized Residuals Plot with an Outlier

Another problem with regressions is that records with missing data are excluded from the analysis. First of all we should understand the meaning of a missing piece of information: does it mean 0 or does it mean that the interviewee preferred not to respond? In the second case, if it is important to include this information, we can substitute the missing data with a value:
  • -          With central tendency measures, if we think that the responses have a normal distribution, meaning that there is no specific reason for not responding to this question, we can use the mean or median of the existing data;
  • -          Predict the missing values using other variables; for example, if we have some missing data for the variable “income,” maybe we can use age and profession to for the prediction.

Check the linear regression template (see 38. LINEAR REGRESSION), which provides an example of how to generate the standardized residuals plot.





[1] To improve normality and variance conditions, we can try applying a log transformation to the response (dependent) variable. We can also use other types of transformations, but we must remember that the interpretation of the results is more complex when we transform our variables.

No comments:

Post a Comment