Analysis of Two Numerical Variables with Stata
5. Influential observations
Some outliers in a regression can cause problems: the influential
observations. These are outliers whose presence noticeably changes the
results of the regression. To identify them, we compare the regression
results obtained when they are included with those obtained when they
are excluded.
We can spot outliers in the scatterplot as cases with a very high or
very low value of the explanatory variable, the dependent variable, or
both (high or low in the sense of lying far from the variable's mean).
In the following picture we have marked the three cases that lie far
from the mean of one or both of the variables:
The two cases in the upper part of the scatterplot have high values in
both the midterm and the final exam, while the case at the bottom of
the scatterplot has a very low value in the final, although its
midterm result was close to the mean.
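Distance from the mean can also be checked numerically rather than by eye. The following is only a sketch: the variable names z_midterm and z_final are made up for illustration, and the cutoff of 2 standard deviations is an arbitrary assumption, not a rule taken from this tutorial.

```stata
* Standardize each variable: (value - mean) / standard deviation
summarize MIDTERM
generate z_midterm = (MIDTERM - r(mean)) / r(sd)

summarize FINAL
generate z_final = (FINAL - r(mean)) / r(sd)

* List the cases lying more than 2 standard deviations from either mean
* (the cutoff 2 is an assumption chosen for illustration)
list OBS MIDTERM FINAL z_midterm z_final if abs(z_midterm) > 2 | abs(z_final) > 2
```

Cases flagged this way are only candidates; whether they are influential still has to be checked by refitting the regression without them.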
To eliminate them we need to identify their case numbers. Since our
data set has a variable called "OBS" that holds the observation
number, we can ask Stata to show the value of OBS next to each point
in the scatterplot, which lets us read off the observation numbers of
the problematic cases. We can do this by issuing the following
command:
scatter FINAL MIDTERM, mlabel(OBS) title("Grades")
This is the graph that we obtain:
Outliers that lie far from the mean of the explanatory variable can be
influential cases, so cases 2 and 18 are the problematic ones. We can
eliminate them by issuing the command:
drop if OBS == 2 | OBS == 18
This command tells Stata to drop a case if the value of OBS is equal
to 2 or equal to 18. Notice the double "==" used for a logical
comparison and the "|" used as the OR operator.
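Because drop removes the cases from the data in memory, it can be worth checking first which cases the condition selects. A minimal sketch, run before the drop command and using the same condition:

```stata
* Show the cases that the condition selects, without deleting anything
list OBS MIDTERM FINAL if OBS == 2 | OBS == 18

* Count them; two cases are expected if OBS numbers are unique
count if OBS == 2 | OBS == 18

* The same condition written with inlist() instead of == and |
list OBS MIDTERM FINAL if inlist(OBS, 2, 18)
```

If the list shows the intended cases, the same condition can be passed to drop.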
Now we can compute the regression again without these cases:
reg FINAL MIDTERM
The results are the following:
      Source |       SS       df       MS              Number of obs =      19
-------------+------------------------------           F(  1,    17) =    0.02
       Model |  2.15381915     1  2.15381915           Prob > F      =  0.8903
    Residual |  1867.00408    17  109.823769           R-squared     =  0.0012
-------------+------------------------------           Adj R-squared = -0.0576
       Total |  1869.15789    18  103.842105           Root MSE      =   10.48
------------------------------------------------------------------------------
       FINAL |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     MIDTERM |   .0870693   .6217396     0.14   0.890    -1.224687    1.398825
       _cons |   48.65543   32.61553     1.49   0.154    -20.15733    117.4682
------------------------------------------------------------------------------
As can be seen, the slope has dropped sharply, from 1.126989 in the
regression with all cases to 0.0870693, and it is no longer
statistically distinguishable from zero (p = 0.890). Therefore, we can
consider these cases influential.
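The same comparison can be made without deleting anything, by restricting the regression with an if qualifier. This is a sketch that assumes the full data set is in memory (that is, cases 2 and 18 have not been dropped, or the data have been reloaded); the scalar names b_all and b_excl are made up for illustration.

```stata
* Regression with all cases; store the slope of MIDTERM
regress FINAL MIDTERM
scalar b_all = _b[MIDTERM]

* Regression excluding cases 2 and 18, leaving the data set intact
regress FINAL MIDTERM if !inlist(OBS, 2, 18)
scalar b_excl = _b[MIDTERM]

* Compare the two slopes
display "Slope with all cases:          " b_all
display "Slope without cases 2 and 18:  " b_excl
display "Change in slope:               " b_all - b_excl
```

With the slopes reported in this section, the change is 1.126989 - 0.0870693 = 1.0399197.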