We’ll use the data from A. Mroz (1987), “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions,” Econometrica 55, 765-799.
We’ll use the data on married working women to estimate the return to education using a simple OLS model.
quietly cd "/Users/Sam/Desktop/Econ 645/Data/Wooldridge"
use "mroz.dta", clear
reg lwage educ
display (exp(_b[educ])-1)*100
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(1, 426) = 56.93
Model | 26.3264193 1 26.3264193 Prob > F = 0.0000
Residual | 197.001022 426 .462443713 R-squared = 0.1179
-------------+---------------------------------- Adj R-squared = 0.1158
Total | 223.327441 427 .523015084 Root MSE = .68003
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1086487 .0143998 7.55 0.000 .0803451 .1369523
_cons | -.1851968 .1852259 -1.00 0.318 -.5492673 .1788736
------------------------------------------------------------------------------
11.477061
Our estimate implies that the return to education is \((e^{0.109}-1)*100\%=11.5\%\)
Use father's education as an instrument for the observation's level of education if the woman is in the labor force.
reg educ fatheduc if inlf==1
predict edu_hat
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(1, 426) = 88.84
Model | 384.841983 1 384.841983 Prob > F = 0.0000
Residual | 1845.35428 426 4.33181756 R-squared = 0.1726
-------------+---------------------------------- Adj R-squared = 0.1706
Total | 2230.19626 427 5.22294206 Root MSE = 2.0813
------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
fatheduc | .2694416 .0285863 9.43 0.000 .2132538 .3256295
_cons | 10.23705 .2759363 37.10 0.000 9.694685 10.77942
------------------------------------------------------------------------------
(option xb assumed; fitted values)
Get F-Statistic
test fatheduc
( 1) fatheduc = 0
F( 1, 426) = 88.84
Prob > F = 0.0000
Our first-stage F-statistic of 88.84 exceeds the rule-of-thumb threshold of 15, so father's education looks like a good candidate instrument on relevance grounds.
Father’s Education as an instrument
reg lwage edu_hat
display (exp(_b[edu_hat])-1)*100
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(1, 426) = 2.59
Model | 1.34752449 1 1.34752449 Prob > F = 0.1086
Residual | 221.979916 426 .521079616 R-squared = 0.0060
-------------+---------------------------------- Adj R-squared = 0.0037
Total | 223.327441 427 .523015084 Root MSE = .72186
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
edu_hat | .0591735 .0367969 1.61 0.109 -.0131525 .1314995
_cons | .4411034 .4671121 0.94 0.346 -.4770279 1.359235
------------------------------------------------------------------------------
6.095928
One additional year of education increases wages by \((e^{0.059}-1)*100\% = 6.1\%\)
The instrument passed the relevance check (F-stat > 15), but what about instrument exogeneity? Can you think of ways father's education could be related to a child's income other than through the child's education? Social connections? Father's income?
Let’s estimate the returns to education for men. We’ll use data from M. Blackburn and D. Neumark (1992), “Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials,” Quarterly Journal of Economics 107, 1421-1436.
Use number of siblings as an instrument to predict an observation’s level of education. We’ll keep it simple and have no other covariates.
use "wage2.dta", clear
First, our biased OLS estimate
reg lwage educ
display (exp(_b[educ])-1)*100
Source | SS df MS Number of obs = 935
-------------+---------------------------------- F(1, 933) = 100.70
Model | 16.1377042 1 16.1377042 Prob > F = 0.0000
Residual | 149.518579 933 .160255712 R-squared = 0.0974
-------------+---------------------------------- Adj R-squared = 0.0964
Total | 165.656283 934 .177362188 Root MSE = .40032
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .0598392 .0059631 10.03 0.000 .0481366 .0715418
_cons | 5.973063 .0813737 73.40 0.000 5.813366 6.132759
------------------------------------------------------------------------------
6.1665821
We have an estimated return to education of \((e^{0.0598392}-1)*100\% = 6.2\%\)
Second, use number of siblings as the instrument in the first stage.
reg educ sibs
predict edu_hat
Source | SS df MS Number of obs = 935
-------------+---------------------------------- F(1, 933) = 56.67
Model | 258.055048 1 258.055048 Prob > F = 0.0000
Residual | 4248.7642 933 4.55387374 R-squared = 0.0573
-------------+---------------------------------- Adj R-squared = 0.0562
Total | 4506.81925 934 4.82528828 Root MSE = 2.134
------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sibs | -.2279164 .0302768 -7.53 0.000 -.287335 -.1684979
_cons | 14.13879 .1131382 124.97 0.000 13.91676 14.36083
------------------------------------------------------------------------------
(option xb assumed; fitted values)
Third, get the F-statistic.
test sibs
( 1) sibs = 0
F( 1, 933) = 56.67
Prob > F = 0.0000
Fourth, run the second stage using the predicted education.
reg lwage edu_hat
display (exp(_b[edu_hat])-1)*100
Source | SS df MS Number of obs = 935
-------------+---------------------------------- F(1, 933) = 22.31
Model | 3.86818074 1 3.86818074 Prob > F = 0.0000
Residual | 161.788103 933 .173406326 R-squared = 0.0234
-------------+---------------------------------- Adj R-squared = 0.0223
Total | 165.656283 934 .177362188 Root MSE = .41642
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
edu_hat | .1224326 .0259225 4.72 0.000 .0715595 .1733057
_cons | 5.130026 .3494009 14.68 0.000 4.444323 5.815729
------------------------------------------------------------------------------
13.024298
One additional year of education increases wages by \((e^{0.1224}-1)*100\%=13.0\%\)
Another interesting result here is that the OLS estimate is biased downward, which is not what we would expect. Possible reasons: 1. The number of siblings could be correlated with ability: more siblings may mean less parental attention, which could result in lower ability. 2. The OLS estimator is biased downward due to measurement error in educ, though educ is unlikely to satisfy the classical errors-in-variables (CEV) assumptions.
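For reference, under the CEV assumptions the bias has a known form (a standard textbook result, stated here for completeness): \(\text{plim}\ \hat{\beta}_{1} = \beta_{1}\frac{\sigma^{2}_{educ^{*}}}{\sigma^{2}_{educ^{*}}+\sigma^{2}_{e}} < \beta_{1}\), where \(educ^{*}\) is true education and \(e\) is the measurement error, so OLS is attenuated toward zero.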
It is important to see an example of a poor instrument. We’ll use data from J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79, 586-593.
use "bwght.dta", clear
Our biased OLS regression looks at the natural log of birth weight and cigarette packs smoked per day by the mother. We would expect smoking to be correlated with unobserved health and parental decisions, so the estimate is likely biased due to unobserved confounders.
reg lbwght packs
display (exp(_b[packs])-1)*100
Source | SS df MS Number of obs = 1,388
-------------+---------------------------------- F(1, 1386) = 27.98
Model | .997781141 1 .997781141 Prob > F = 0.0000
Residual | 49.4225525 1,386 .035658407 R-squared = 0.0198
-------------+---------------------------------- Adj R-squared = 0.0191
Total | 50.4203336 1,387 .036352079 Root MSE = .18883
------------------------------------------------------------------------------
lbwght | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
packs | -.0898131 .0169786 -5.29 0.000 -.1231197 -.0565065
_cons | 4.769404 .0053694 888.26 0.000 4.758871 4.779937
------------------------------------------------------------------------------
-8.5895151
An additional pack smoked per day decreases birth weight by \((1-e^{-0.08981})*100\% \approx 8.6\%\)
We’ll use cigarette prices as an instrument for cigarette packs smoked per day. We assume that cigarette prices and the error term u are uncorrelated (instrument exogeneity). Note that some states fund health care with cigarette tax revenue.
We expect cigarette price and the quantity of packs smoked to be negatively correlated.
reg packs cigprice
Source | SS df MS Number of obs = 1,388
-------------+---------------------------------- F(1, 1386) = 0.13
Model | .011648626 1 .011648626 Prob > F = 0.7179
Residual | 123.684481 1,386 .089238442 R-squared = 0.0001
-------------+---------------------------------- Adj R-squared = -0.0006
Total | 123.696129 1,387 .089182501 Root MSE = .29873
------------------------------------------------------------------------------
packs | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
cigprice | .0002829 .000783 0.36 0.718 -.0012531 .0018188
_cons | .0674257 .1025384 0.66 0.511 -.1337215 .2685728
------------------------------------------------------------------------------
Our instrument is not associated with quantity consumed.
We will test our instrument even though it appears weak. We'll use the predict command to get the predicted packs.
predict packs_hat
Use the test command:
test cigprice
( 1) cigprice = 0
F( 1, 1386) = 0.13
Prob > F = 0.7179
Our instrument has an F-statistic of 0.13 (p = 0.7179). This is far below our rule of thumb of F-stat > 15, which indicates that we have a weak instrument: the instrument relevance assumption fails.
We'll still use the instrument to estimate the effect of packs of cigarettes smoked on the natural log of birth weight.
reg lbwght packs_hat
Source | SS df MS Number of obs = 1,388
-------------+---------------------------------- F(1, 1386) = 2.87
Model | .104047659 1 .104047659 Prob > F = 0.0907
Residual | 50.316286 1,386 .036303237 R-squared = 0.0021
-------------+---------------------------------- Adj R-squared = 0.0013
Total | 50.4203336 1,387 .036352079 Root MSE = .19053
------------------------------------------------------------------------------
lbwght | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
packs_hat | 2.988676 1.765368 1.69 0.091 -.4744067 6.451758
_cons | 4.448136 .1843027 24.13 0.000 4.086594 4.809679
------------------------------------------------------------------------------
Cigarette price is a poor instrument for packs smoked, and the estimate using predicted packs even has the wrong sign: it implies that more smoking increases birth weight.
We can and should always test Instrument Relevance. If we have a poor instrument, we should go back to the drawing board.
Another issue here is that price is a poor instrument because price and quantity are simultaneously determined. We would need a separate set of instruments for price to estimate our first stage.
We’ll use Mroz data again on working women.
use "mroz.dta", clear
We’ll use both parents’ education as instruments to identify the effect of education on wages for working women. We overidentify the endogenous variable with two instruments: father’s education and mother’s education.
reg lwage educ c.exper##c.exper
display (exp(_b[educ])-1)*100
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(3, 424) = 26.29
Model | 35.0222967 3 11.6740989 Prob > F = 0.0000
Residual | 188.305144 424 .444115906 R-squared = 0.1568
-------------+---------------------------------- Adj R-squared = 0.1509
Total | 223.327441 427 .523015084 Root MSE = .66642
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .1074896 .0141465 7.60 0.000 .0796837 .1352956
exper | .0415665 .0131752 3.15 0.002 .0156697 .0674633
|
c.exper#c.exper | -.0008112 .0003932 -2.06 0.040 -.0015841 -.0000382
|
_cons | -.5220406 .1986321 -2.63 0.009 -.9124667 -.1316144
---------------------------------------------------------------------------------
11.347933
Our result is \((e^{0.1074896}-1)*100\% \approx 11.3\%\)
We’ll use two instruments for one endogenous variable. Use parents’ education as instruments for women in the labor force.
reg educ c.exper##c.exper fatheduc motheduc if inlf==1
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(4, 423) = 28.36
Model | 471.620998 4 117.90525 Prob > F = 0.0000
Residual | 1758.57526 423 4.15738833 R-squared = 0.2115
-------------+---------------------------------- Adj R-squared = 0.2040
Total | 2230.19626 427 5.22294206 Root MSE = 2.039
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .0452254 .0402507 1.12 0.262 -.0338909 .1243417
|
c.exper#c.exper | -.0010091 .0012033 -0.84 0.402 -.0033744 .0013562
|
fatheduc | .1895484 .0337565 5.62 0.000 .1231971 .2558997
motheduc | .157597 .0358941 4.39 0.000 .087044 .2281501
_cons | 9.10264 .4265614 21.34 0.000 8.264196 9.941084
---------------------------------------------------------------------------------
test fatheduc motheduc
( 1) fatheduc = 0
( 2) motheduc = 0
F( 2, 423) = 55.40
Prob > F = 0.0000
Our instruments appear relevant, since \(F = 55.40 > 15\)
predict edu_hat
(option xb assumed; fitted values)
Using both parents’ education as instruments
reg lwage edu_hat c.exper##c.exper
display (exp(_b[edu_hat])-1)*100
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(3, 424) = 7.40
Model | 11.117828 3 3.70594265 Prob > F = 0.0001
Residual | 212.209613 424 .50049437 R-squared = 0.0498
-------------+---------------------------------- Adj R-squared = 0.0431
Total | 223.327441 427 .523015084 Root MSE = .70746
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
edu_hat | .0613966 .0329624 1.86 0.063 -.0033933 .1261866
exper | .0441704 .0140844 3.14 0.002 .0164865 .0718543
|
c.exper#c.exper | -.000899 .0004212 -2.13 0.033 -.0017268 -.0000711
|
_cons | .0481003 .4197565 0.11 0.909 -.7769624 .873163
---------------------------------------------------------------------------------
6.3320574
One additional year of education increases wages by \((e^{0.061}-1)*100\%=6.3\%\)
We went through each step by hand, but note that the manual two-step approach yields incorrect standard errors, because it ignores that edu_hat is itself estimated. The IV estimator with one command, ivregress 2sls, handles this correctly.
ivregress 2sls depvar (endogenous_var = instruments) x1 x2 ... xk
ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
display (exp(_b[educ])-1)*100
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 24.65
Prob > chi2 = 0.0000
R-squared = 0.1357
Root MSE = .67155
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0312895 1.96 0.050 .0000704 .1227228
exper | .0441704 .0133696 3.30 0.001 .0179665 .0703742
|
c.exper#c.exper | -.000899 .0003998 -2.25 0.025 -.0016826 -.0001154
|
_cons | .0481003 .398453 0.12 0.904 -.7328532 .8290538
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper fatheduc motheduc
6.3320574
We can also display the first-stage regression by adding the first option.
ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper, first
First-stage regressions
-----------------------
Number of obs = 428
F( 4, 423) = 28.36
Prob > F = 0.0000
R-squared = 0.2115
Adj R-squared = 0.2040
Root MSE = 2.0390
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .0452254 .0402507 1.12 0.262 -.0338909 .1243417
|
c.exper#c.exper | -.0010091 .0012033 -0.84 0.402 -.0033744 .0013562
|
fatheduc | .1895484 .0337565 5.62 0.000 .1231971 .2558997
motheduc | .157597 .0358941 4.39 0.000 .087044 .2281501
_cons | 9.10264 .4265614 21.34 0.000 8.264196 9.941084
---------------------------------------------------------------------------------
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 24.65
Prob > chi2 = 0.0000
R-squared = 0.1357
Root MSE = .67155
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0312895 1.96 0.050 .0000704 .1227228
exper | .0441704 .0133696 3.30 0.001 .0179665 .0703742
|
c.exper#c.exper | -.000899 .0003998 -2.25 0.025 -.0016826 -.0001154
|
_cons | .0481003 .398453 0.12 0.904 -.7328532 .8290538
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper fatheduc motheduc
6.3320574
Go to my website and download the Mroz data. Use father’s and mother’s education as instruments to estimate the return to education with the ivregress 2sls command. https://github.com/rowesamuel/Econ645/blob/main/Data/Wooldridge/mroz.dta
Estimate OLS using \(lwage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + \beta_{3}exper^{2}_{i} + u_{i}\)
Estimate 2SLS with the ivregress 2sls command.
OLS
reg lwage educ c.exper##c.exper
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(3, 424) = 26.29
Model | 35.0222967 3 11.6740989 Prob > F = 0.0000
Residual | 188.305144 424 .444115906 R-squared = 0.1568
-------------+---------------------------------- Adj R-squared = 0.1509
Total | 223.327441 427 .523015084 Root MSE = .66642
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .1074896 .0141465 7.60 0.000 .0796837 .1352956
exper | .0415665 .0131752 3.15 0.002 .0156697 .0674633
|
c.exper#c.exper | -.0008112 .0003932 -2.06 0.040 -.0015841 -.0000382
|
_cons | -.5220406 .1986321 -2.63 0.009 -.9124667 -.1316144
---------------------------------------------------------------------------------
2SLS
ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 24.65
Prob > chi2 = 0.0000
R-squared = 0.1357
Root MSE = .67155
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0312895 1.96 0.050 .0000704 .1227228
exper | .0441704 .0133696 3.30 0.001 .0179665 .0703742
|
c.exper#c.exper | -.000899 .0003998 -2.25 0.025 -.0016826 -.0001154
|
_cons | .0481003 .398453 0.12 0.904 -.7328532 .8290538
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper fatheduc motheduc
We’ll keep using Mroz’s data on working women.
use "mroz.dta", clear
If we want to see whether our explanatory variable of interest is potentially correlated with the error term, we can conduct an endogeneity test. We have two ways to test endogeneity in the OLS estimator.
Estimate the reduced form for the potentially endogenous variable of interest by regressing education on all exogenous variables. This includes all those in the structural model and the additional IVs.
reg educ c.exper##c.exper fatheduc motheduc if inlf==1
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(4, 423) = 28.36
Model | 471.620998 4 117.90525 Prob > F = 0.0000
Residual | 1758.57526 423 4.15738833 R-squared = 0.2115
-------------+---------------------------------- Adj R-squared = 0.2040
Total | 2230.19626 427 5.22294206 Root MSE = 2.039
---------------------------------------------------------------------------------
educ | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
exper | .0452254 .0402507 1.12 0.262 -.0338909 .1243417
|
c.exper#c.exper | -.0010091 .0012033 -0.84 0.402 -.0033744 .0013562
|
fatheduc | .1895484 .0337565 5.62 0.000 .1231971 .2558997
motheduc | .157597 .0358941 4.39 0.000 .087044 .2281501
_cons | 9.10264 .4265614 21.34 0.000 8.264196 9.941084
---------------------------------------------------------------------------------
Next, obtain the residuals \(\hat{v}_{2}\).
predict r, residual
Then, add \(\hat{v}_{2}\) to the structural equation.
reg lwage educ c.exper##c.exper r
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(4, 423) = 20.50
Model | 36.2573098 4 9.06432744 Prob > F = 0.0000
Residual | 187.070131 423 .442246173 R-squared = 0.1624
-------------+---------------------------------- Adj R-squared = 0.1544
Total | 223.327441 427 .523015084 Root MSE = .66502
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0309849 1.98 0.048 .000493 .1223003
exper | .0441704 .0132394 3.34 0.001 .0181471 .0701937
|
c.exper#c.exper | -.000899 .0003959 -2.27 0.024 -.0016772 -.0001208
|
r | .0581666 .0348073 1.67 0.095 -.0102502 .1265834
_cons | .0481003 .3945753 0.12 0.903 -.7274721 .8236727
---------------------------------------------------------------------------------
There is weak evidence of endogeneity, since \(p < .1\) but \(p > .05\). You should report both IV and OLS. We can compare our results in a mini sensitivity analysis.
eststo m1: quietly reg lwage educ c.exper##c.exper
eststo m2: quietly ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
esttab m1 m2, mtitle(OLS IV)
(1) (2)
OLS IV
--------------------------------------------
educ 0.107*** 0.0614*
(7.60) (1.96)
exper 0.0416** 0.0442***
(3.15) (3.30)
c.exper#c.~r -0.000811* -0.000899*
(-2.06) (-2.25)
_cons -0.522** 0.0481
(-2.63) (0.12)
--------------------------------------------
N 428 428
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
We can also use the postestimation command estat endogenous.
ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
estat endogenous
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 24.65
Prob > chi2 = 0.0000
R-squared = 0.1357
Root MSE = .67155
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0312895 1.96 0.050 .0000704 .1227228
exper | .0441704 .0133696 3.30 0.001 .0179665 .0703742
|
c.exper#c.exper | -.000899 .0003998 -2.25 0.025 -.0016826 -.0001154
|
_cons | .0481003 .398453 0.12 0.904 -.7328532 .8290538
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper fatheduc motheduc
Tests of endogeneity
Ho: variables are exogenous
Durbin (score) chi2(1) = 2.80707 (p = 0.0938)
Wu-Hausman F(1,423) = 2.79259 (p = 0.0954)
We have weak evidence that education is endogenous, since \(p < .1\) but \(p > .05\). However, we know from theory that ability confounds both wages and education. We should report both the OLS and IV models.
When we have one IV for one endogenous explanatory variable, the equation is just identified. When we have two instruments and one endogenous explanatory variable, it is overidentified.
When we have multiple IVs, we can test whether some of our instruments are correlated with the structural error term.
We can estimate two 2SLS models (one for each IV) and then compare them; the estimates should differ only by sampling error. If the two coefficients on our fitted explanatory variable of interest are significantly different, we conclude that at least one instrument (and maybe both) is correlated with the structural error term.
Adding more instrumental variables (overidentification) can increase the efficiency of the 2SLS estimator. However, it raises the risk of violating the instrument exogeneity assumption.
When we use motheduc and fatheduc as IVs for education, we have a single overidentifying restriction: two IVs and one endogenous explanatory variable.
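As a sketch of this comparison (the stored-estimate names are arbitrary, and the controls follow the Mroz specification used above), we can fit one just-identified model per instrument and compare the coefficients on educ:
*Just-identified 2SLS with father's education only
ivregress 2sls lwage (educ=fatheduc) c.exper##c.exper
estimates store iv_father
*Just-identified 2SLS with mother's education only
ivregress 2sls lwage (educ=motheduc) c.exper##c.exper
estimates store iv_mother
*Compare the two educ coefficients side by side
esttab iv_father iv_mother, mtitle(FatherIV MotherIV)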
use "mroz.dta", clear
ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
predict r, resid
reg r mothedu fathedu c.exper##c.exper
ereturn list
local N=`e(N)'
display "`N'"
local rsq=`e(r2)'
display "`rsq'"
local nR=`N'*`rsq'
display "`nR'"
scalars:
e(N) = 428
e(df_m) = 4
e(df_r) = 423
e(F) = .0934962445404771
e(r2) = .000883344256925
e(rmse) = .6752103466059415
e(mss) = .1705031219578643
e(rss) = 192.8495121452517
e(r2_a) = -.0085645673813073
e(ll) = -436.7021015142834
e(ll_0) = -436.891220726253
e(rank) = 5
macros:
e(cmdline) : "regress r mothedu fathedu c.exper##c.exper"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "r"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 5
e(V) : 5 x 5
functions:
e(sample)
428
.000883344256925
.3780713419639
Under the null hypothesis that all IVs are uncorrelated with \(u_{1}\), \(N*R^{2} \sim \chi^{2}_{q}\), where q is the number of instruments from outside the model minus the total number of endogenous explanatory variables. If \(N*R^{2}\) exceeds the 5% critical value of \(\chi^{2}_{q}\), then we reject the null hypothesis and conclude that at least some of the IVs are not exogenous.
Here we have q = 2 - 1 = 1 degree of freedom for the chi-squared test, and we fail to reject the null hypothesis, since \(N*R^{2} = 0.37807\) and the 5% critical value of \(\chi^{2}_{1}\) is 3.841.
We can also use the postestimation command estat overid to get our \(N*R^{2}\) test.
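Rather than looking the critical value up in a table, we can compute it, and the implied p-value of our statistic, directly in Stata:
*5% critical value of the chi-squared distribution with 1 df
display invchi2tail(1, .05)
*p-value of our N*R^2 statistic of 0.3780713
display chi2tail(1, .3780713)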
ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
estat overid
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 24.65
Prob > chi2 = 0.0000
R-squared = 0.1357
Root MSE = .67155
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0613966 .0312895 1.96 0.050 .0000704 .1227228
exper | .0441704 .0133696 3.30 0.001 .0179665 .0703742
|
c.exper#c.exper | -.000899 .0003998 -2.25 0.025 -.0016826 -.0001154
|
_cons | .0481003 .398453 0.12 0.904 -.7328532 .8290538
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper motheduc fatheduc
Tests of overidentifying restrictions:
Sargan (score) chi2(1) = .378071 (p = 0.5386)
Basmann chi2(1) = .373985 (p = 0.5408)
Let’s add husband’s education, so we have two overidentifying restrictions.
ivregress 2sls lwage (educ=motheduc fatheduc huseduc) c.exper##c.exper
estat overid
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(3) = 34.90
Prob > chi2 = 0.0000
R-squared = 0.1495
Root MSE = .66616
---------------------------------------------------------------------------------
lwage | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
educ | .0803918 .021672 3.71 0.000 .0379155 .1228681
exper | .0430973 .0132027 3.26 0.001 .0172204 .0689742
|
c.exper#c.exper | -.0008628 .0003943 -2.19 0.029 -.0016357 -.0000899
|
_cons | -.1868572 .2840591 -0.66 0.511 -.7436029 .3698885
---------------------------------------------------------------------------------
Instrumented: educ
Instruments: exper c.exper#c.exper motheduc fatheduc huseduc
Tests of overidentifying restrictions:
Sargan (score) chi2(2) = 1.11504 (p = 0.5726)
Basmann chi2(2) = 1.10228 (p = 0.5763)
Notice that we still fail to reject the null hypothesis, so we might consider keeping husband’s education as an IV. Also notice that the coefficient and standard error on education have changed.
It is a good idea to report both in a sensitivity analysis.
est clear
eststo m1: quietly ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
eststo m2: quietly ivregress 2sls lwage (educ=motheduc fatheduc huseduc) c.exper##c.exper
esttab m1 m2, mtitle(2IVs 3IVs)
(1) (2)
2IVs 3IVs
--------------------------------------------
educ 0.0614* 0.0804***
(1.96) (3.71)
exper 0.0442*** 0.0431**
(3.30) (3.26)
c.exper#c.~r -0.000899* -0.000863*
(-2.25) (-2.19)
_cons 0.0481 -0.187
(0.12) (-0.66)
--------------------------------------------
N 428 428
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
We won’t spend too much time looking into Chapter 2 and 3, since they should be a review from ECON 644. Chapter 4 will be a review, but we’ll look into some CPS data.
Recommended resources: UCLA OARC Stata resources, https://stats.oarc.ucla.edu/stata/modules/. I have used this source numerous times. Even when you become proficient with Stata, UCLA’s OARC Stata resources remain very helpful; if you just type “UCLA OARC Stata” and your topic into a search engine, you will usually find a relevant module.
Statalist is usually another good resource when trying to figure out what went wrong. If you ask a question on the listserv, you will likely be scolded by Nick Cox, or you will be told to “read the manual”. However, if you just read what others have asked, you will often find useful information.
If you would like to recreate Mitchell’s data, then follow these steps.
Set your working directory and use net install, e.g.:
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
*Use:
net from https://www.stata-press.com/data/dmus2
net get dmus1
net get dmus2
*All commands:
net install dmus1
Note: when setting your working directory, I recommend using forward slashes "/" instead of backslashes "\". Stata will accept both, but the backslash "\" often runs into string and macro issues.
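A minimal illustration of the backslash problem (the path here is hypothetical): a backslash immediately before a backtick suppresses local-macro expansion, while forward slashes behave as expected.
local year 2020
*The \` sequence escapes the backtick, so the macro is NOT expanded
display "C:\data\`year'"
*Forward slashes expand the macro as expected
display "C:/data/`year'"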
We’ll use the use command to get our data into Stata and we’ll add the clear option to clear our memory to prevent an error.
use "dentists.dta", clear
From Stata Press website
use "https://www.stata-press.com/data/dmus2/dentists.dta", clear
From NBER website for CPS data
use "https://data.nber.org/morg/annual/morg20.dta", clear
Subsetting by columns is a great way to work with large datasets when you do not need the entire dataset. For example, the January CPS Basic has 393 variables and over 125,000 observations. While this isn't very large, subsetting is very beneficial if you work with the ACS PUMS data, which is usually around 16GB. We can subset in the following ways, each illustrated below: 1. By variable names 2. By rows 3. By variable names and rows
Reading a subset of the Stata data when you know the variable names (e.g.: when you have a data dictionary)
use pemlr gestfips pwsswgt using "jan25pub.dta"
list in 1/5
| gestfips pemlr pwsswgt |
|-----------------------------|
1. | 1 1 39739102 |
2. | 1 1 39925108 |
3. | 1 . . |
4. | 1 6 28808444 |
5. | 1 . . |
+-----------------------------+
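We can also subset by rows alone, keeping every variable. A small sketch using the same file (gestfips code of 24 is Maryland):
use if gestfips == 24 using "jan25pub.dta", clear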
Subsetting by rows (observations) and variable names
*gestfips code of 24 is Maryland
use pemlr gestfips pwsswgt using "jan25pub.dta" if gestfips == 24, clear
| gestfips pemlr pwsswgt |
|-----------------------------|
1. | 24 5 48660345 |
2. | 24 1 67339169 |
3. | 24 1 50801953 |
4. | 24 1 40983471 |
5. | 24 1 62453493 |
+-----------------------------+
Reading the data dictionary beforehand will help, especially with larger files such as the PUMS ACS or CPS
I’m not going to focus on these, but they are available. There are datasets that are included with Stata; load them with the sysuse command.
sysuse auto, clear
You can also use Stata datasets that are available online but didn’t ship with Stata, via the webuse command.
webuse fullauto, clear
Many times you will need to grab data from a website, and these data may be in an Excel format such as “.xls” or “.xlsx”. Our main command is import excel. We need to specify the firstrow option if we want the headers as variable names. Be careful if the headers in the first row are numeric, such as years: this will throw an error, and we’ll need to rename them in Excel before importing these files. For example, 2015 becomes y2015.
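As a hedged alternative to editing the workbook (the file name and column letters here are hypothetical), you can import without firstrow, rename the year columns yourself, and drop the old header row:
import excel using "myfile.xlsx", clear
*Without firstrow, Stata names the columns A, B, C, ...
rename (B C) (y2015 y2016)
*Drop the row that held the old headers; destring may also be needed
drop in 1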
import excel using "tfp_major_sector_long.xlsx", firstrow clear
list in 1/10
1. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Total factor productivity |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 78.664 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
2. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Real value-added output |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 41.875 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
3. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Combined inputs |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 53.233 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
4. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Capital input |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 35.158 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
5. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Labor input |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 65.818 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
6. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Hours worked |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 75.78700000000001 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
7. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Labor composition |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 86.846 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
8. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Labor productivity |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 55.253 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
9. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Capital productivity |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 119.104 |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
10. | NAICS | Sector | Basis | Measure |
| XG | Private nonfarm business sector | All workers | Capital intensity |
|-----------------------------------------------------------------------------------|
| Units | Year | Value |
| Index (2017=100) | 1987 | 46.391 |
+-----------------------------------------------------------------------------------+
We need to pay attention to the sheet names of the Excel files we are importing. If you don’t, you may import the wrong sheet.
import excel using "dentists2.xls", firstrow clear
list
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
Note: importing files from other statistical packages is a pain. If you are using another statistical software package like R, SAS, Python, or heaven-forbid SPSS, then export the data from those packages as CSV files. CSV files are fairly universal. There may be some labeling issues, but if labels are not a concern, then use the CSV file format. Luckily, SAS data files are becoming less common. Even the Census Bureau has seen the light and provides its data in a CSV format.
For me, these commands are unavailable in Stata 14, but if you have Stata 16 or higher you can use them. import sas is our main command to import SAS files from version 7 or higher.
import sas "dentists.sas7bdat", clear
We can import SAS XPORT Version 5
import sasxport5 "dentists.xpt", clear
We can import SAS XPORT Version 8
import sasxport8 "dentists.xpt", clear
We can import SPSS files
import spss "dentlab.sav", clear
We can import dBase files
import dbase "dentlab.dbf", clear
There are three key types of raw data files we will cover: 1. Delimited (such as csv) 2. Space-separated 3. Fixed width
Importing csv files will likely be more common than the other file types covered so far, with the exception of Stata and Excel files. Note: there are other types of delimited files, but comma-delimited is the most common. Other delimiters include tab, space, and “|”.
If you read the manual, you’ll notice that comma and tab are the default delimiters. Let’s try it using the “,” delimiter.
import delimited using "jan25pub.csv", clear
Note: comma-delimited data may arrive in either .txt or .csv files.
import delimited using "dentists1.txt", clear
list
(4 vars, 5 obs)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
There may be tab-delimited files, which use the tab character as the delimiter.
import delimited using "dentists2.txt", clear
list
(4 vars, 5 obs)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
The following code produces the same results as above, since tab is a default delimiter.
import delimited using "dentists2.txt", clear delimiters("\t")
list
(4 vars, 5 obs)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
Let’s use a file that uses “:” as a delimiter
import delimited using "dentists4.txt", clear delimiters(":")
list
(4 vars, 5 obs)
+-------------------------------------+
| v1 v2 v3 v4 |
|-------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+-------------------------------------+
We can subset columns and rows with import delimited, which can be helpful for large files. The first row supplies the variable names, so rowrange(1:3) will only include two rows of data.
import delimited using "dentists2.txt", clear rowrange(1:3)
list
We can import a csv straight from a website and convert it to a Stata dta file
import delimited using "https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/jun23pub.csv", clear
pwd
save "jun23pub.dta", replace
Sometimes data come in space-delimited files (which is a bad idea when strings are involved). dentists5.txt is an example of such a file. We can use the infile command to import these data, but we need to specify that the name variable is a string at least 17 characters wide.
infile str17 name years full rec using "dentists5.txt", clear
list
(5 observations read)
+----------------------------------------+
| name years full rec |
|----------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------+
Note: infile does not read variable names from the first row. We can use import delimited and specify that the delimiter is a blank space to get around this issue.
import delimited using "dentists6.txt", delimiters(" ") clear
list
(4 vars, 5 obs)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
Unfortunately, fixed-column files are more common than I would prefer. They are a pain, since you need to specify each column's position. I have found this with files with a “.dat” file extension.
We need a data dictionary to know where each variable starts. Once we know this, we can use the infix command to read in fixed-column files. Let’s look at our fixed-column-width file.
type "dentists7.txt"
Y. Don Uflossmore 7.2501
Olive Tu'Drill 10.2511
Isaac O'Yerbreath32.7511
Ruth Canaale 22.0011
Mike Avity 8.5000
Our name variable occupies columns 1-17, years occupies columns 18-22, and our fulltime and recom binaries each take a single column, at positions 23 and 24.
infix str name 1-17 years 18-22 fulltime 23 recom 24 using "dentists7.txt", clear
list
(5 observations read)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
This is a real pain, but luckily Stata has dictionary files to help when one is available. They have a .dct file extension. We already have a fixed-column dictionary, dentists1.dct, and we can use it to open the fixed-column-width file dentists7.txt.
Let’s look at our dictionary file.
type "dentists1.dct"
infix dictionary {
str name 1-17 years 18-22 fulltime 23 recom 24
}
Let’s look at our fixed-column width file.
type "dentists7.txt"
Y. Don Uflossmore 7.2501
Olive Tu'Drill 10.2511
Isaac O'Yerbreath32.7511
Ruth Canaale 22.0011
Mike Avity 8.5000
Let’s use the two together to import the data file.
infix using "dentists1.dct", using(dentists7.txt) clear
list
infix dictionary {
str name 1-17 years 18-22 fulltime 23 recom 24
}
(5 observations read)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
We can combine the infile command with a dictionary file.
type "dentists3.dct"
infile dictionary using dentists7.txt {
str17 name %17s "Name of dentist"
years %5f "Years in practice"
fulltime %1f "Full time?"
recom %1f "Recommend Quaddent?"
}
Notice that the first line of the dictionary file is infile dictionary using dentists7.txt and then the column widths.
infile using "dentists3.dct", clear
list
infile dictionary using dentists7.txt {
str17 name %17s "Name of dentist"
years %5f "Years in practice"
fulltime %1f "Full time?"
recom %1f "Recommend Quaddent?"
}
(5 observations read)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
4. | Ruth Canaale 22 1 1 |
5. | Mike Avity 8.5 0 0 |
+----------------------------------------------+
Dictionary .dct files are very helpful when they are available with fixed-column width files.
We can use the “in” qualifier to subset the number of rows to import with either infix or infile
infix str name 1-17 years 18-22 fulltime 23 recom 24 using "dentists7.txt" in 1/3, clear
list
(3 observations read)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
+----------------------------------------------+
infix using "dentists1.dct" in 1/3, using(dentists7.txt) clear
list
infix dictionary {
str name 1-17 years 18-22 fulltime 23 recom 24
}
(3 observations read)
+----------------------------------------------+
| name years fulltime recom |
|----------------------------------------------|
1. | Y. Don Uflossmore 7.25 0 1 |
2. | Olive Tu'Drill 10.25 1 1 |
3. | Isaac O'Yerbreath 32.75 1 1 |
+----------------------------------------------+
Tip: Don’t forget to clear your data, either on the line before or by adding it as the option , clear
If you do not clear your data with the use command, then you will get the message “No; dataset in memory has changed since last saved”. Use clear.
If you do not clear your data with the import, infix, or infile commands, then you will get the message: “You must start with an empty dataset”
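Both patterns, as a quick sketch:
*Either add clear as an option:
use "dentists.dta", clear
*Or clear memory first, then import:
clear
import delimited using "dentists1.txt"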
My opinion: don’t do this. If you need to enter data manually, I don’t recommend typing it directly into Stata. I recommend using a spreadsheet and then importing the spreadsheet. If for some reason you don’t have access to Excel or a spreadsheet application like OSX Numbers or Google Sheets, then read pages 43-50.
It can be helpful to convert a csv, Excel, or other type of file to a Stata dta file. It is also good to consider older Stata file formats, which help with sharing data. (One of my beefs with Stata is that newer data files are not compatible with older versions of Stata.)
Let’s import a comma-delimited file
import delimited using "dentists1.txt", clear
We can use the save and saveold commands to convert this comma-delimited file into a Stata dta file. Don’t forget replace, or else you will run into an error when rerunning the code.
save will store the file in the format of whatever version of Stata you run (Stata 14 for me).
save mydentists, replace
Use saveold and specify version 12.
saveold mydentists_v12, version(12) replace
We can subset as well with the keep command
keep if recom==1
save mydentists, replace
saveold mydentists_v12, version(12) replace
There is a compress command, but hard drives are large enough today that file size should rarely be a problem.
#### 3.3 Exporting Excel Files
Let’s say we want to collaborate with someone who doesn’t have Stata or who wants to make graphs in Excel. We can export either “.xls” or “.xlsx” files from Stata with the export excel command.
Let’s use our main file again
use dentlab, clear
We can use the export excel command. Don’t forget the replace option, and set the first row as variable names with firstrow(variables).
export excel using "dentlab.xlsx", replace firstrow(variables)
Note: the value labels are exported instead of the underlying binary 0/1 values for fulltime and recom. We can use the nolabel option to work around this.
export excel using "dentlab.xlsx", replace firstrow(variables) nolabel
But what if we want a “labeled” and an “unlabeled” sheet in the same Excel file? Using the replace option will completely replace the entire Excel file. We don’t want that! So let’s add another sheet called “Unlabeled”. We use the replace suboption inside the sheet() option, so we will not replace the entire Excel file.
export excel using "dentlab.xlsx", sheet("Unlabeled", replace) firstrow(variables) nolabel
My recommendation is to use csv files instead of SAS or SPSS files. If you need to do so, please review pages 59-62.
When collaborating with others, especially if they are using different software like SAS, R, or Python, I recommend csv files. They are universally readable, which makes collaboration easier. You may lose labels (or values). However, if not everyone is using Stata, then export delimited (or export excel) is a good idea.
The default export delimiter is comma-delimited
use dentlab
export delimited using "dentists_comma.csv", replace
This will export the labels instead of the values. If you want to export the values instead of the labels, use the nolabel option.
export delimited using "dentists_comma.csv", nolabel replace
Note: you also have the option of putting quotes around strings, which is a good idea if your strings contain “,”. Otherwise, the csv reader will treat the “,” as a delimiter and your dataset will be incorrectly parsed. The option is simply quote.
export delimited using "dentists_comma.csv", nolabel quote replace
Sometimes we may need to format our data for exporting.
replace years = 8.93 if name == "Mike Avity"
export delimited using "dentists_comma.csv", nolabel quote replace
type "dentists_comma.csv"
(1 real change made)
file dentists_comma.csv saved
name,years,fulltime,recom
"Y. Don Uflossmore",7.25,0,1
"Olive Tu'Drill",10.25,1,1
"Isaac O'Yerbreath",32.75,1,1
"Ruth Canaale",22,1,1
"Mike Avity",8.9300003,0,0
Note that the years value for Mike Avity that we just replaced was exported as 8.9300003 (a float precision artifact). To fix this, we can use the format command to set the total width to 5 and the decimal places to 2 with %5.2f.
format years %5.2f
export delimited using "dentists_comma.csv", nolabel quote replace
type "dentists_comma.csv"
file dentists_comma.csv saved
name,years,fulltime,recom
"Y. Don Uflossmore",7.25,0,1
"Olive Tu'Drill",10.25,1,1
"Isaac O'Yerbreath",32.75,1,1
"Ruth Canaale",22,1,1
"Mike Avity",8.93,0,0
For more options see:
help export delimited
Don’t do this. Use comma-delimited files. If you are really interested in this, please review pages 65-66.
I liked this addition to the exporting-data section. A lot of the time, senior leadership may need a report, and with export excel (or putexcel) we can generate automated reports. The key here is automated: the pipeline from raw data to report is done completely through a .do file.
A lot of the time it is a pain to get your data analysis into your write-ups, and export excel and putexcel are great tools to make it easy. We’ll cover estout to get regression output into Word or LaTeX a bit later. I have used putexcel in the past, but we’ll cover two ways to fill out Excel reports.
We are going to create a Full-time Part-time Report in Excel
use "dentlab.dta", clear
keep name years fulltime
export excel using "dentrpt1.xlsx", cell(A4) replace
Notice this just exports our data into an Excel file without a shell, but we do place our data starting at cell A4, and we do not export variable names.
We’ll use the shell report called “dentrpt1-skeleton.xlsx”, and we’ll fill it out using Stata. This is just an example, but it is easily applicable in many office settings. Filling out the report this way is preferable: if there is a mistake, it is easy to fix without generating a new report from scratch.
First we’ll use the copy command to copy the Report Shell over our newly created file called “dentrpt1.xlsx”
copy "dentrpt1-skeleton.xlsx" "dentrpt1.xlsx", replace
Now we will use the modify option to modify our report shell to fill out the report with data.
export excel "dentrpt1.xlsx", sheet("dentists", modify) cell(A4)
There are two sheets in the report shell: “dentists” and “Sheet2”. We
modified the sheet called “dentists” and pasted our data in this sheet
starting at cell A4.
We can preserve the cell formats, too. We have our report shell in “dentrpt2-skeleton.xlsx”. We will do something similar to the first report, but we will add the keepcellfmt option.
copy "dentrpt2-skeleton.xlsx" "dentrpt2.xlsx", replace
export excel "dentrpt2.xlsx", sheet("dentists", modify) cell(A4) //keepcellfmt
Note: this keepcellfmt is unavailable in Stata 14, but you should have it in your newer version.
Let’s include averages for years and for full-time vs. part-time. We make a copy.
copy "dentrpt3-skeleton.xlsx" "dentrpt3.xlsx", replace
export excel "dentrpt3.xlsx", sheet("dentists", modify) cell(A6)
You have two options for the averages: 1) add an average formula in Excel, or 2) use putexcel. It is easy to add formulas in Excel, but what if the number of observations changes over time? Then you would need to modify the Excel sheet. Using putexcel, you can update the averages without messing with formulas in Excel.
Note: it is poor practice to run the summarize command and manually enter the results into your report. Use local macros and do it dynamically!
We will set our dentists report to be modified by Stata, then run the summarize command, take the stored average, and put it into Excel.
putexcel set "dentrpt3.xlsx", sheet("dentists") modify
After setting the workbook to be modified, we can write expressions, returned results, formulas, graphs, scalars, matrices, etc.
Let’s look at what the summarize command stores after being run.
summarize years
return list
There are several scalars that are stored, such as the number of observations r(N), the mean r(mean), the standard deviation r(sd), etc. If we use the detail option, we'll see more scalars. We can retrieve those scalars with local macros, such as `r(mean)'.
summarize years
putexcel B2 = `r(mean)'
summarize fulltime
putexcel B3 = `r(mean)'
putexcel save //Not in older version
More on putexcel: https://blog.stata.com/2017/01/10/creating-excel-tables-with-putexcel-part-1-introduction-and-formatting/
Good note from Mitchell at the beginning: Garbage In; Garbage Out.
Before doing analysis, you really need to check your data to look for data quality issues.
The duplicates command can be helpful, but it will only find observations that are exactly the same unless you specify which variables with duplicates list varlist. We’ll cover this a bit later.
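A brief preview (idcode is the ID variable in the wws data used below):
*Report duplicates across all variables
duplicates report
*List observations that share the same idcode
duplicates list idcode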
#### 4.3 Checking individual variables
We start off by looking for problems in individual variables. There are three useful commands for checking individual variables: 1. describe 2. tabulate 3. summarize
use wws, clear
The describe command is useful to see an overview of your data, with variable labels, storage types, and value labels (if any).
describe
Contains data from wws.dta
obs: 2,246 Working Women Survey
vars: 30 3 Jan 2010 00:42
size: 163,958 (_dta has notes)
------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------
idcode int %8.0g Unique ID
age byte %8.0g age in current year
race byte %8.0g race
married byte %8.0g married
collgrad byte %16.0g college graduate
south byte %8.0g lives in south
industry byte %23.0g industry
occupation byte %22.0g occupation
union byte %8.0g union worker
wage float %9.0g hourly wage
hours byte %8.0g usual hours worked
nevermarried byte %8.0g Woman never been married
yrschool byte %8.0g Years of school completed
metro byte %9.0g Does woman live in metro area?
ccity byte %8.0g Does woman live in city center?
currexp float %9.0g Years worked at current job
prevexp float %9.0g Years worked at previous job
everworked float %9.0g Has woman ever worked?
uniondues float %9.0g Union Dues paid last week
marriedyrs float %9.0g Years married (rounded to nearest year)
unempins float %9.0g Under/Unemployment insur. received last week
numkids float %9.0g Number of children
kidage1 float %9.0g Age of first child
kidage2 float %9.0g Age of second child
kidage3 float %9.0g Age of third child
grade byte %8.0g current grade completed
grade4 byte %9.0g 4 level Current Grade Completed
wage2 float %9.0g Wages, rounded to 2 digits
fwt float %9.0g Frequency weight
networth float %9.0g Net worth
------------------------------------------------------------------------------------------------
Sorted by:
You will notice that we lack value labels and a data dictionary explaining how each value is coded.
The tabulate or tab command is very useful for inspecting categorical variables. Let’s look at collgrad, which is a binary for college graduate status. Let’s include the missing option to make sure no observations are missing data.
tabulate collgrad, missing
college |
graduate | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,713 76.27 76.27
1 | 533 23.73 100.00
------------+-----------------------------------
Total | 2,246 100.00
Let’s look at race.
tabulate race, missing
race | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,636 72.84 72.84
2 | 583 25.96 98.80
3 | 26 1.16 99.96
4 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,246 100.00
Race should only be coded between 1 and 3, so we have one miscoded observation. Let’s find the observation’s idcode.
list idcode race if race==4
+---------------+
| idcode race |
|---------------|
2013. | 543 4 |
+---------------+
The summarize or sum command is very useful for inspecting continuous variables. Let’s look at the amount of unemployment insurance received last week, unempins, whose values should range between 0 and 300 dollars.
summarize unempins
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
unempins | 2,246 30.50401 73.16682 0 299
Let’s look at wages.
summarize wage
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
wage | 2,246 288.2885 9595.692 0 380000
The data range between 0 and 380,000 dollars in hourly wage, with a mean of 288.29 dollars and a standard deviation of 9,595 dollars. The max seems a bit high. Let’s add the detail option to get more information.
summarize wage, detail
hourly wage
-------------------------------------------------------------
Percentiles Smallest
1% 1.892108 0
5% 2.801002 1.004952
10% 3.220612 1.032247 Obs 2,246
25% 4.259257 1.151368 Sum of Wgt. 2,246
50% 6.276297 Mean 288.2885
Largest Std. Dev. 9595.692
75% 9.661837 40.19808
90% 12.77777 40.74659 Variance 9.21e+07
95% 16.73912 250000 Skewness 35.45839
99% 38.70926 380000 Kurtosis 1297.042
Our mean appears to be pulled rightward by some outliers. Our median hourly wage is 6.28 dollars per hour and our 99th-percentile hourly wage is 38.7 dollars per hour, so a mean of 288.29 dollars per hour is highly skewed. The kurtosis confirms that the outliers at the max are heavily skewing the results: a normally distributed variable has a kurtosis of 3, and ours is 1,297. Let’s look at the outliers.
list idcode wage if wage > 1000
+-----------------+
| idcode wage |
|-----------------|
893. | 3145 380000 |
1241. | 2341 250000 |
+-----------------+
Let’s look at ages, which should range from 21 to 50 years old. We can use both the summarize and tabulate commands. tabulate can be very useful with continuous variables if the range is not too large.
summarize age
tabulate age
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 2,246 36.25111 5.437983 21 83
age in |
current |
year | Freq. Percent Cum.
------------+-----------------------------------
21 | 2 0.09 0.09
22 | 8 0.36 0.45
23 | 16 0.71 1.16
24 | 35 1.56 2.72
25 | 44 1.96 4.67
26 | 50 2.23 6.90
27 | 50 2.23 9.13
28 | 63 2.80 11.93
29 | 49 2.18 14.11
30 | 55 2.45 16.56
31 | 57 2.54 19.10
32 | 60 2.67 21.77
33 | 56 2.49 24.27
34 | 95 4.23 28.50
35 | 213 9.48 37.98
36 | 224 9.97 47.95
37 | 171 7.61 55.57
38 | 175 7.79 63.36
39 | 167 7.44 70.79
40 | 139 6.19 76.98
41 | 148 6.59 83.57
42 | 111 4.94 88.51
43 | 110 4.90 93.41
44 | 98 4.36 97.77
45 | 45 2.00 99.78
46 | 1 0.04 99.82
47 | 1 0.04 99.87
48 | 1 0.04 99.91
54 | 1 0.04 99.96
83 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,246 100.00
Let’s look at the outliers.
list idcode age if age > 50
+--------------+
| idcode age |
|--------------|
2205. | 80 54 |
2219. | 51 83 |
+--------------+
#### 4.4 Checking categorical by categorical variables
The two-way tabulation through the tabulate command is a very useful way to look for data quality problems or to double-check created binaries. Let’s look at the variables metro, a binary for whether the observation lives in a metropolitan area, and ccity, a binary for whether the observation lives in the city center.
use wws, clear
tabulate metro ccity, missing
Does woman |
live in | Does woman live in
metro | city center?
area? | 0 1 | Total
-----------+----------------------+----------
0 | 665 0 | 665
1 | 926 655 | 1,581
-----------+----------------------+----------
Total | 1,591 655 | 2,246
An alternative is to count the number of observations that shouldn’t be there. So we’ll count the number of observations not in a metro area but in a city center. I personally don’t use it, but it has its uses.
count if metro==0 & ccity==1
Let’s look at married and nevermarried. There should be no individuals that appear in both married and nevermarried.
tabulate married nevermarried, missing
| Woman never been
| married
married | 0 1 | Total
-----------+----------------------+----------
0 | 570 234 | 804
1 | 1,440 2 | 1,442
-----------+----------------------+----------
Total | 2,010 236 | 2,246
There may be observations that have been married but are not currently married. However, there should be no observations that are both married and never married, yet there are 2.
count if married == 1 & nevermarried == 1
list idcode married nevermarried if married == 1 & nevermarried == 1
2
+-----------------------------+
| idcode married neverm~d |
|-----------------------------|
1523. | 1758 1 1 |
2231. | 22 1 1 |
+-----------------------------+
Let’s look at college graduates and years of school completed. We can use the tabulate or table commands. I personally prefer tabulate, since we can look for missing values.
tabulate yrschool collgrad, missing
Years of |
school | college graduate
completed | 0 1 | Total
-----------+----------------------+----------
8 | 69 1 | 70
9 | 55 0 | 55
10 | 84 0 | 84
11 | 123 0 | 123
12 | 943 0 | 943
13 | 174 2 | 176
14 | 180 7 | 187
15 | 81 11 | 92
16 | 0 252 | 252
17 | 0 106 | 106
18 | 0 154 | 154
. | 4 0 | 4
-----------+----------------------+----------
Total | 1,713 533 | 2,246
Or we can use the table command, and then list the idcodes of college graduates with fewer than 16 years of school.
table yrschool collgrad
list idcode if yrschool < 16 & collgrad == 1
Years of | college
school | graduate
completed | 0 1
----------+-----------
8 | 69 1
9 | 55
10 | 84
11 | 123
12 | 943
13 | 174 2
14 | 180 7
15 | 81 11
16 | 252
17 | 106
18 | 154
----------------------
+--------+
| idcode |
|--------|
195. | 4721 |
369. | 4334 |
464. | 4131 |
553. | 3929 |
690. | 3589 |
|--------|
1092. | 2681 |
1098. | 2674 |
1114. | 2640 |
1124. | 2613 |
1221. | 2384 |
|--------|
1493. | 1810 |
1829. | 993 |
1843. | 972 |
1972. | 689 |
2114. | 312 |
|--------|
2174. | 172 |
2198. | 107 |
2215. | 63 |
2228. | 25 |
2229. | 24 |
|--------|
2230. | 23 |
+--------+
It is also helpful to look at continuous variables by different categories. Let’s look at the binary variable union, which indicates whether the observation is a union member, together with uniondues.
summarize uniondues, detail
Union Dues paid last week
-------------------------------------------------------------
Percentiles Smallest
1% 0 0
5% 0 0
10% 0 0 Obs 2,242
25% 0 0 Sum of Wgt. 2,242
50% 0 Mean 5.603479
Largest Std. Dev. 9.029045
75% 10 29
90% 22 29 Variance 81.52365
95% 26 29 Skewness 1.35268
99% 29 29 Kurtosis 3.339635
Let’s use the bysort prefix, which sorts our data by the specified variable and then runs summarize separately for each group.
bysort union: summarize uniondues
-> union = 0
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
uniondues | 1,413 .094126 1.502237 0 27
------------------------------------------------------------------------------------------------
-> union = 1
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
uniondues | 461 14.65944 8.707759 0 29
------------------------------------------------------------------------------------------------
-> union = .
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
uniondues | 368 15.41304 8.815582 0 29
The mean for non-union members is less than 1 dollar (1,413 observations). The mean for union members is about 15 dollars (461 observations). The mean for those with missing union status is about 15 dollars (368 observations).
tabulate uniondues if union==0, missing
Union Dues |
paid last |
week | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,407 99.29 99.29
10 | 1 0.07 99.36
17 | 1 0.07 99.44
26 | 2 0.14 99.58
27 | 2 0.14 99.72
. | 4 0.28 100.00
------------+-----------------------------------
Total | 1,417 100.00
We see 6 observations coded as non-union that paid union dues, which we may want to recode to 0. However, it is possible that someone is not a union member but still has to pay an agency fee.
We’ll assume there is no agency fee, so we can use the recode command to create a new variable for whether a person pays union dues.
recode uniondues (0=0) (1/max=1), generate(paysdues)
(784 differences between uniondues and paysdues)
Now let’s compare union members to those who pay union dues.
tabulate union paysdues, missing
Let’s list the non-union members paying union dues. Note: we use the abb(#) option so that variable names are abbreviated to no fewer than 20 characters.
list idcode union uniondues if union==0 & (uniondues > 0) & !missing(uniondues), abb(20)
+----------------------------+
| idcode union uniondues |
|----------------------------|
561. | 3905 0 10 |
582. | 3848 0 26 |
736. | 3464 0 17 |
1158. | 2541 0 27 |
1668. | 1411 0 27 |
|----------------------------|
2100. | 345 0 26 |
+----------------------------+
Let’s look at another great command for observing categorical and continuous variables together: tabstat.
But first, we could use tab with the sum() option.
tab married, sum(marriedyrs)
| Summary of Years married (rounded
| to nearest year)
married | Mean Std. Dev. Freq.
------------+------------------------------------
0 | 0 0 804
1 | 5.5409154 3.5521383 1,442
------------+------------------------------------
Total | 3.5574354 3.8933494 2,246
Or, we can use tabstat which gives us more options.
tabstat marriedyrs, by(married) statistics(n mean sd min max p25 p50 p75) missing
Summary for variables: marriedyrs
by categories of: married (married)
married | N mean sd min max p25 p50 p75
---------+--------------------------------------------------------------------------------
0 | 804 0 0 0 0 0 0 0
1 | 1442 5.540915 3.552138 0 11 2 6 9
---------+--------------------------------------------------------------------------------
Total | 2246 3.557435 3.893349 0 11 0 2 7
------------------------------------------------------------------------------------------
No observation that reported not being currently married also reported positive years of marriage.
Let’s look at current years of experience and everworked binary.
tabstat currexp, by(everworked) statistics(n mean sd min max) missing
Summary for variables: currexp
by categories of: everworked (Has woman ever worked?)
everworked | N mean sd min max
-----------+--------------------------------------------------
0 | 60 0 0 0 0
1 | 2171 5.328881 5.042181 0 26
-----------+--------------------------------------------------
Total | 2231 5.185567 5.048073 0 26
--------------------------------------------------------------
Let’s generate total years of experience, which is currexp plus prevexp.
generate totexp=currexp+prevexp
tabstat totexp, by(everworked) statistics(n mean sd min max) missing
(15 missing values generated)
Summary for variables: totexp
by categories of: everworked (Has woman ever worked?)
everworked | N mean sd min max
-----------+--------------------------------------------------
0 | 60 0 0 0 0
1 | 2171 11.57761 4.552392 1 29
-----------+--------------------------------------------------
Total | 2231 11.26625 4.865816 0 29
--------------------------------------------------------------
Everyone who has ever worked has at least one year of total experience.
Summary of ways to look at categorical by continuous:
1. bysort x_cat: summarize x_con
2. tab x_cat, sum(x_con)
3. tabstat x_con, by(x_cat) statistics(…)
We will now compare continuous variables to other continuous variables. Let’s look at unemployment insurance received last week for observations whose usual hours worked are greater than 30 and not missing.
summarize unempins if hours > 30 & !missing(hours)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
unempins | 1,800 1.333333 16.04617 0 287
The mean is around 1.3 dollars, and the range runs from 0 to 287, within the expected 0 to 300.
count if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
19
Let’s look at the hours count.
tabulate hours if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
usual hours |
worked | Freq. Percent Cum.
------------+-----------------------------------
35 | 1 5.26 5.26
38 | 2 10.53 15.79
40 | 14 73.68 89.47
65 | 1 5.26 94.74
70 | 1 5.26 100.00
------------+-----------------------------------
Total | 19 100.00
Let’s look at age and years married to find any unusual observations.
generate agewhenmarried = age - marriedyrs
summarize agewhenmarried
tabulate agewhenmarried
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
agewhenmar~d | 2,246 32.69368 6.709341 13 73
agewhenmarr |
ied | Freq. Percent Cum.
------------+-----------------------------------
13 | 1 0.04 0.04
14 | 4 0.18 0.22
15 | 11 0.49 0.71
16 | 8 0.36 1.07
17 | 18 0.80 1.87
18 | 19 0.85 2.72
19 | 14 0.62 3.34
20 | 20 0.89 4.23
21 | 36 1.60 5.83
22 | 42 1.87 7.70
23 | 52 2.32 10.02
24 | 65 2.89 12.91
25 | 67 2.98 15.89
26 | 81 3.61 19.50
27 | 73 3.25 22.75
28 | 93 4.14 26.89
29 | 90 4.01 30.90
30 | 118 5.25 36.15
31 | 91 4.05 40.20
32 | 124 5.52 45.73
33 | 109 4.85 50.58
34 | 109 4.85 55.43
35 | 149 6.63 62.07
36 | 138 6.14 68.21
37 | 119 5.30 73.51
38 | 124 5.52 79.03
39 | 119 5.30 84.33
40 | 89 3.96 88.29
41 | 80 3.56 91.85
42 | 62 2.76 94.61
43 | 50 2.23 96.84
44 | 43 1.91 98.75
45 | 23 1.02 99.78
46 | 2 0.09 99.87
47 | 1 0.04 99.91
48 | 1 0.04 99.96
73 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,246 100.00
Let’s look for anyone married under 18. Some states allow under-18 marriages, but it is still suspicious.
tabulate agewhenmarried if agewhenmarried < 18
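As an aside, a sketch of an alternative pattern for automated checking: the assert command halts a do-file whenever its condition fails for any observation. Here it would in fact fail, since we just saw values as low as 13:
assert agewhenmarried >= 18 // halts the do-file with an error if any under-18 value exists
capture noisily assert agewhenmarried >= 18 // reports the contradictions but lets the do-file continue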
Let’s use the same strategy, subtracting years of experience from age to find the age at which each woman started working.
generate agewhenstartwork = age - (prevexp + currexp)
tabulate agewhenstartwork if agewhenstartwork < 16
(15 missing values generated)
agewhenstar |
twork | Freq. Percent Cum.
------------+-----------------------------------
8 | 1 1.49 1.49
9 | 1 1.49 2.99
12 | 1 1.49 4.48
14 | 20 29.85 34.33
15 | 44 65.67 100.00
------------+-----------------------------------
Total | 67 100.00
Some states allow work at the age of 15, but anything below looks suspicious.
We can also check the ages of the children. We should expect that the third child is never older than the second child.
table kidage2 kidage3 if numkids==3
Age of |
second | Age of third child
child | 0 1 2 3 4 5 6 7
----------+-----------------------------------------------
0 | 12
1 | 10 9
2 | 11 8 10
3 | 10 12 6 8
4 | 10 12 10 7 5
5 | 12 11 9 3 6 8
6 | 9 8 10 6 5 6 6
7 | 7 6 7 9 4 14 12 6
8 | 5 11 7 6 14 6 11
9 | 8 13 10 7 12 9
10 | 15 3 10 6 12
11 | 9 8 3 13
12 | 16 9 6
13 | 11 5
14 | 8
----------------------------------------------------------
Check the counts.
count if (kidage3 > kidage2) & (numkids==3) & !missing(kidage3)
count if (kidage2 > kidage1) & (numkids>=2) & !missing(kidage2)
0
0
Now let’s check the age of the observation when the first child was born.
generate agewhenfirstkid = age - kidage1
tabulate agewhenfirstkid if agewhenfirstkid < 18
There are some suspicious observations in this tabulation. Let’s look at a scatter plot; scatter plots can be helpful when looking at continuous by continuous variables as well.
twoway scatter hours wage if hours > 0 & wage >0 & wage < 1000
graph export "hour_wage.png", width(500) replace
(file hour_wage.png written in PNG format)
Let’s go ahead and fix some of the errors that we have found. A word of caution: you may need to talk to the data stewards about the best ways to correct the data. You don’t want to attempt to fix measurement error and introduce additional problems. Institutional knowledge of the data is very helpful before correcting errors.
Let’s fix the observation where race was coded as 4 when there are only 3 categories.
list idcode race if race==4
replace race=1 if idcode == 543
tabulate race
+---------------+
| idcode race |
|---------------|
2013. | 543 4 |
+---------------+
(1 real change made)
race | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,637 72.89 72.89
2 | 583 25.96 98.84
3 | 26 1.16 100.00
------------+-----------------------------------
Total | 2,246 100.00
Let’s add a note for future users, which is good practice for replication.
note race: race changed to 1 (from 4) for idcode 543
Let’s fix the college graduate flag for the observation with only 8 years of education.
list idcode collgrad if yrschool < 12 & collgrad == 1
replace collgrad = 0 if idcode == 107
tab yrschool collgrad
note collgrad: collgrad changed to 0 (from 1) for idcode 107
+-------------------+
| idcode collgrad |
|-------------------|
2198. | 107 1 |
+-------------------+
(1 real change made)
Years of |
school | college graduate
completed | 0 1 | Total
-----------+----------------------+----------
8 | 70 0 | 70
9 | 55 0 | 55
10 | 84 0 | 84
11 | 123 0 | 123
12 | 943 0 | 943
13 | 174 2 | 176
14 | 180 7 | 187
15 | 81 11 | 92
16 | 0 252 | 252
17 | 0 106 | 106
18 | 0 154 | 154
-----------+----------------------+----------
Total | 1,710 532 | 2,242
Let’s fix the ages where the digits were switched.
list idcode age if age > 50
replace age = 38 if idcode == 51
replace age = 45 if idcode == 80
tab age
note age: the value of 83 was corrected to 38 for idcode 51
note age: the value of 54 was corrected to 45 for idcode 80
+--------------+
| idcode age |
|--------------|
2205. | 80 54 |
2219. | 51 83 |
+--------------+
(1 real change made)
(1 real change made)
age in |
current |
year | Freq. Percent Cum.
------------+-----------------------------------
21 | 2 0.09 0.09
22 | 8 0.36 0.45
23 | 16 0.71 1.16
24 | 35 1.56 2.72
25 | 44 1.96 4.67
26 | 50 2.23 6.90
27 | 50 2.23 9.13
28 | 63 2.80 11.93
29 | 49 2.18 14.11
30 | 55 2.45 16.56
31 | 57 2.54 19.10
32 | 60 2.67 21.77
33 | 56 2.49 24.27
34 | 95 4.23 28.50
35 | 213 9.48 37.98
36 | 224 9.97 47.95
37 | 171 7.61 55.57
38 | 176 7.84 63.40
39 | 167 7.44 70.84
40 | 139 6.19 77.03
41 | 148 6.59 83.62
42 | 111 4.94 88.56
43 | 110 4.90 93.46
44 | 98 4.36 97.82
45 | 46 2.05 99.87
46 | 1 0.04 99.91
47 | 1 0.04 99.96
48 | 1 0.04 100.00
------------+-----------------------------------
Total | 2,246 100.00
Let’s look at our notes.
note
_dta:
1. This is a hypothetical dataset and should not be used for analysis purposes
age:
1. the value of 83 was corrected to 38 for idcode 51
2. the value of 54 was corrected to 45 for idcode 80
race:
1. race changed to 1 (from 4) for idcode 543
collgrad:
1. collgrad changed to 0 (from 1) for idcode 107
We need to be careful with duplicate observations. When working with panel data we want multiple observations of the same cross-sectional unit, but those observations should be for different time periods, never the same time period.
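For panel data, the duplicates check would include the time variable; a sketch with hypothetical panel variables id and year:
duplicates report id year // each id-year pair should appear exactly once
duplicates list id year, sepby(id)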
use "dentists_dups.dta", clear
We can start with the duplicates list command to find rows that are completely the same. This command shows all the duplicates. Note: for duplicate records to be found, all values need to be the same.
duplicates list
Duplicates in terms of all variables
+-----------------------------------------------------------+
| group: obs: name years fulltime recom |
|-----------------------------------------------------------|
| 1 4 Mike Avity 8.5 0 0 |
| 1 6 Mike Avity 8.5 0 0 |
| 1 8 Mike Avity 8.5 0 0 |
| 2 1 Olive Tu'Drill 10.25 1 1 |
| 2 11 Olive Tu'Drill 10.25 1 1 |
|-----------------------------------------------------------|
| 3 2 Ruth Canaale 22 1 1 |
| 3 3 Ruth Canaale 22 1 1 |
+-----------------------------------------------------------+
We can get a more condensed version with duplicates examples, which shows one example record for each group of duplicates.
duplicates examples
Duplicates in terms of all variables
+-------------------------------------------------------------------+
| group: # e.g. obs name years fulltime recom |
|-------------------------------------------------------------------|
| 1 3 4 Mike Avity 8.5 0 0 |
| 2 2 1 Olive Tu'Drill 10.25 1 1 |
| 3 2 2 Ruth Canaale 22 1 1 |
+-------------------------------------------------------------------+
We can get a concise summary of duplicate records with duplicates report.
duplicates report
Duplicates in terms of all variables
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 4 0
2 | 4 2
3 | 3 2
--------------------------------------
The duplicates tag command returns the number of duplicates for each group of duplicates. (Note: the separator(#) option puts a line in the table after every # rows.)
duplicates tag, generate(dup)
list, separator(2)
Duplicates in terms of all variables
+----------------------------------------------------+
| name years fulltime recom dup |
|----------------------------------------------------|
1. | Olive Tu'Drill 10.25 1 1 1 |
2. | Ruth Canaale 22 1 1 1 |
|----------------------------------------------------|
3. | Ruth Canaale 22 1 1 1 |
4. | Mike Avity 8.5 0 0 2 |
|----------------------------------------------------|
5. | Mary Smith 3 1 1 0 |
6. | Mike Avity 8.5 0 0 2 |
|----------------------------------------------------|
7. | Y. Don Uflossmore 7.25 0 1 0 |
8. | Mike Avity 8.5 0 0 2 |
|----------------------------------------------------|
9. | Mary Smith 27 0 0 0 |
10. | Isaac O'Yerbreath 32.75 1 1 0 |
|----------------------------------------------------|
11. | Olive Tu'Drill 10.25 1 1 1 |
+----------------------------------------------------+
Let’s sort our data for better legibility, and then we will add lines in our output table whenever there is a new group.
sort name years
list, sepby(name years)
+----------------------------------------------------+
| name years fulltime recom dup |
|----------------------------------------------------|
1. | Isaac O'Yerbreath 32.75 1 1 0 |
|----------------------------------------------------|
2. | Mary Smith 3 1 1 0 |
|----------------------------------------------------|
3. | Mary Smith 27 0 0 0 |
|----------------------------------------------------|
4. | Mike Avity 8.5 0 0 2 |
5. | Mike Avity 8.5 0 0 2 |
6. | Mike Avity 8.5 0 0 2 |
|----------------------------------------------------|
7. | Olive Tu'Drill 10.25 1 1 1 |
8. | Olive Tu'Drill 10.25 1 1 1 |
|----------------------------------------------------|
9. | Ruth Canaale 22 1 1 1 |
10. | Ruth Canaale 22 1 1 1 |
|----------------------------------------------------|
11. | Y. Don Uflossmore 7.25 0 1 0 |
+----------------------------------------------------+
Notice: there are two observations for Mary Smith, but one has 3 years of experience and the other has 27 years, so they do not appear as duplicates.
If there were too many variables to list, we could use the data browser.
browse if dup > 0
We will want to drop our duplicates while keeping one observation of each. The duplicates drop command drops the surplus duplicate records but keeps one observation from each group.
duplicates drop
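If you ever need to drop duplicates based on a subset of variables rather than complete records, Stata requires the force option as a safeguard, since the surplus observations may differ on the unlisted variables. A sketch:
duplicates drop name, force // keeps one observation per name; use with caution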
Let’s use a new dataset that is a bit more practical.
use "wws.dta", replace
Use duplicates list with the variable of interest.
duplicates list idcode
Duplicates in terms of idcode
(0 observations are duplicates)
There are no duplicate idcodes, so we should expect that duplicates list will not return any duplicates either, since every single value needs to be the same for there to be a duplicate record.
duplicates list
Duplicates in terms of all variables
(0 observations are duplicates)
Let’s use a dataset that does have duplicates.
use "wws_dups.dta", clear
Let’s check for duplicate idcodes.
duplicates list idcode, sepby(idcode)
Duplicates in terms of idcode
+------------------------+
| group: obs: idcode |
|------------------------|
| 1 1088 2831 |
| 1 2248 2831 |
|------------------------|
| 2 1244 3905 |
| 2 1245 3905 |
|------------------------|
| 3 277 4214 |
| 3 2247 4214 |
+------------------------+
I prefer duplicates list var1 over isid, since isid throws an error that will break the do-file.
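That said, if you do want the do-file to halt on unexpected duplicates, isid does exactly that, and you can trap its error with capture instead of letting it break the run. A sketch:
capture isid idcode
if _rc != 0 display as error "idcode does not uniquely identify observations"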
Let’s generate our dup variable for identifying the number of duplicates per group of duplicates.
duplicates tag idcode, generate(iddup)
Duplicates in terms of idcode
Let’s generate a different duplicate variable to find complete duplicate records. Notice that we do not specify a variable after tag.
duplicates tag, generate(alldup)
Duplicates in terms of all variables
Let’s list the data.
list idcode age race yrschool occupation wage if iddup==1 & alldup==0
+------------------------------------------------------+
| idcode age race yrschool occupa~n wage |
|------------------------------------------------------|
1244. | 3905 36 1 14 11 4.339774 |
1245. | 3905 41 1 10 5 7.004828 |
+------------------------------------------------------+
It seems that idcode 3905 is shared by two different people, one of whom likely has an incorrect idcode. Let’s fix that.
replace idcode=5160 if idcode==3905 & age==41
Let’s see if it is resolved, and it is.
list idcode age race yrschool occupation wage if iddup==1 & alldup==0
+------------------------------------------------------+
| idcode age race yrschool occupa~n wage |
|------------------------------------------------------|
1244. | 3905 36 1 14 11 4.339774 |
1245. | 5160 41 1 10 5 7.004828 |
+------------------------------------------------------+
Let’s look at observations that are complete duplicates.
duplicates report
list idcode age race yrschool occupation wage if iddup==1 & alldup==1
Duplicates in terms of all variables
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 2244 0
2 | 4 2
--------------------------------------
+------------------------------------------------------+
| idcode age race yrschool occupa~n wage |
|------------------------------------------------------|
277. | 4214 35 1 17 13 11.27214 |
1088. | 2831 37 2 8 8 2.697261 |
2247. | 4214 35 1 17 13 11.27214 |
2248. | 2831 37 2 8 8 2.697261 |
+------------------------------------------------------+
We need to keep one observation from each group of duplicates.
duplicates drop
duplicates report
Duplicates in terms of all variables
(2 observations deleted)
Duplicates in terms of all variables
--------------------------------------
copies | observations surplus
----------+---------------------------
1 | 2246 0
--------------------------------------
Let’s work with the Census CPS. To pull the CPS:
1. You can download the file and import it, or
2. You can use Stata to pull the file directly.
import delimited using "https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/jun23pub.csv", clear
Let’s save a copy of the csv file as a Stata dataset.
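A sketch, assuming we keep the Census filename:
save "jun23pub.dta", replace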
Let’s grab the data dictionary. Hint: search “Census CPS 2023 data dictionary”.
Let’s check our data.
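A minimal first pass, using the commands from this chapter:
describe, short
summarize
duplicates report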