
Analysis of Two Numerical Variables with Stata

5. Influential observations

One type of outlier can cause problems in a regression: the influential observation. Influential observations are outliers that also have a strong effect on the regression results. To identify them, we can compare the regression results obtained with and without them.
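One common way to quantify this comparison is DFBETA, which measures, for each observation, how much a coefficient changes when that observation is left out, scaled by the coefficient's standard error. The following is a minimal sketch, assuming the variables FINAL, MIDTERM and OBS used in the rest of this section. The variable name dfb_mid and the 2/sqrt(n) cutoff are illustrative choices (the cutoff is a widely used rule of thumb, not part of the analysis in this section).

```stata
* Fit the regression on all cases
regress FINAL MIDTERM

* DFBETA for the MIDTERM slope: standardized change in the slope
* when each observation is omitted (dfb_mid is a hypothetical name)
predict dfb_mid, dfbeta(MIDTERM)

* List cases whose DFBETA exceeds the rule-of-thumb cutoff 2/sqrt(n)
list OBS MIDTERM FINAL dfb_mid if abs(dfb_mid) > 2/sqrt(e(N)) & !missing(dfb_mid)
```

Cases flagged this way are only candidates. Whether they really are influential is still judged by refitting the regression without them.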

We can identify outliers in the scatterplot as cases with a very high or very low value of the explanatory variable, the dependent variable, or both (high or low meaning far from the mean of that variable). In the following picture we have marked the three cases that lie far from the mean of one or both variables:

The two cases in the upper part of the scatterplot have high values on both the midterm and the final exam, while the case at the bottom of the scatterplot has a very low value on the final but a midterm result close to the mean.
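The visual check can be complemented with a numerical one. The sketch below standardizes each variable and lists the cases that lie far from the mean. The names z_mid and z_fin and the cutoff of 2 standard deviations are assumptions made for illustration, and a different cutoff may suit other data sets better.

```stata
* Standardized values (mean 0, standard deviation 1) of each variable;
* z_mid and z_fin are hypothetical variable names
egen z_mid = std(MIDTERM)
egen z_fin = std(FINAL)

* List cases more than 2 standard deviations from the mean of either
* variable (the cutoff of 2 is an illustrative assumption)
list OBS MIDTERM FINAL z_mid z_fin ///
    if (abs(z_mid) > 2 | abs(z_fin) > 2) & !missing(z_mid, z_fin)
```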

To eliminate these cases we need to identify their case numbers. Since our data set contains a variable called "OBS", which holds the observation number, we can ask Stata to show the value of OBS next to each point in the scatterplot. This lets us read off the observation numbers of the problematic cases. We can do this by issuing the following command:

scatter FINAL MIDTERM, mlabel(OBS) title("Grades")

This is the graph that we obtain:

Outliers that lie far from the mean of the explanatory variable can be influential cases, so cases 2 and 18 are the problematic ones. We can eliminate them by issuing the command:

drop if OBS == 2 | OBS == 18

This command tells Stata to drop the cases for which the value of OBS is equal to 2 or equal to 18. Notice the double equals sign "==" used for a logical comparison and the vertical bar "|" used as the OR operator.
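Because drop removes the cases from the data in memory, it can help to inspect the matching cases first. The following sketch shows one way to do that, together with an equivalent way of writing the same condition with the inlist() function. It is an alternative form of the step above, not an additional step.

```stata
* Check which cases match the condition before deleting anything
count if OBS == 2 | OBS == 18
list OBS MIDTERM FINAL if OBS == 2 | OBS == 18

* Equivalent to: drop if OBS == 2 | OBS == 18
drop if inlist(OBS, 2, 18)
```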

Now we can compute the regression again without these cases:

reg FINAL MIDTERM

The results are the following:

      Source |       SS       df       MS              Number of obs =      19
-------------+------------------------------           F(  1,    17) =    0.02
       Model |  2.15381915     1  2.15381915           Prob > F      =  0.8903
    Residual |  1867.00408    17  109.823769           R-squared     =  0.0012
-------------+------------------------------           Adj R-squared = -0.0576
       Total |  1869.15789    18  103.842105           Root MSE      =   10.48

------------------------------------------------------------------------------
       FINAL |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     MIDTERM |   .0870693   .6217396     0.14   0.890    -1.224687    1.398825
       _cons |   48.65543   32.61553     1.49   0.154    -20.15733    117.4682
------------------------------------------------------------------------------

As can be seen, the slope has dropped sharply, from 1.126989 to 0.0870693, and is no longer statistically significant (p = 0.890). Therefore, we can consider these cases influential.
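The same comparison can be made without deleting any data, by restricting the second regression with an if qualifier. The sketch below assumes it is run on the full data set, before any cases have been dropped. The stored-estimate names full and trimmed are hypothetical labels.

```stata
* Regression with all cases
regress FINAL MIDTERM
estimates store full

* Regression excluding cases 2 and 18, leaving the data intact
regress FINAL MIDTERM if !inlist(OBS, 2, 18)
estimates store trimmed

* Coefficients and standard errors of both fits side by side
estimates table full trimmed, b(%9.4f) se(%9.4f) stats(N r2)
```

Keeping all cases in memory makes it easy to try other exclusions, such as the case with the low final exam result, without reloading the data.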
