Wooldridge: Instrumental Variables and 2SLS

Estimating Returns to Education for Married Women

We’ll use the data from A. Mroz (1987), “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions,” Econometrica 55, 765-799.

We’ll use the data on married working women to estimate the return to education using a simple OLS model.

Biased OLS Estimate

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Wooldridge"
use "mroz.dta", clear
reg lwage educ
display (exp(_b[educ])-1)*100
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =     56.93
       Model |  26.3264193         1  26.3264193   Prob > F        =    0.0000
    Residual |  197.001022       426  .462443713   R-squared       =    0.1179
-------------+----------------------------------   Adj R-squared   =    0.1158
       Total |  223.327441       427  .523015084   Root MSE        =    .68003

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1086487   .0143998     7.55   0.000     .0803451    .1369523
       _cons |  -.1851968   .1852259    -1.00   0.318    -.5492673    .1788736
------------------------------------------------------------------------------

11.477061

Our estimate implies that the return to education is \((e^{0.109}-1)*100\%=11.5\%\).
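Since the outcome is in logs, the coefficient converts to a percentage effect. A quick check of this conversion in Python (standard library only; the coefficient is copied from the Stata output above):

```python
import math

# Semi-elasticity conversion for a log-level model: a one-unit change in
# educ changes wage by (exp(b) - 1) * 100 percent.
b_educ = 0.1086487  # OLS coefficient on educ from the Stata output
pct_return = (math.exp(b_educ) - 1) * 100
print(pct_return)  # about 11.48, matching Stata's 11.477061 up to rounding
```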

Use an Instrument

Use father’s education as an instrument for the observation’s level of education, restricting the sample to women in the labor force.

reg educ fathedu if inlf==1
predict edu_hat
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =     88.84
       Model |  384.841983         1  384.841983   Prob > F        =    0.0000
    Residual |  1845.35428       426  4.33181756   R-squared       =    0.1726
-------------+----------------------------------   Adj R-squared   =    0.1706
       Total |  2230.19626       427  5.22294206   Root MSE        =    2.0813

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    fatheduc |   .2694416   .0285863     9.43   0.000     .2132538    .3256295
       _cons |   10.23705   .2759363    37.10   0.000     9.694685    10.77942
------------------------------------------------------------------------------

(option xb assumed; fitted values)

Test the Instrument

Get F-Statistic

test fathedu
 ( 1)  fatheduc = 0

       F(  1,   426) =   88.84
            Prob > F =    0.0000

Our F-test gives F = 88.84, well above our rule-of-thumb threshold of 15, so father’s education seems like a good candidate for an instrument.

Estimate Using an IV Estimator

Father’s Education as an instrument

reg lwage edu_hat
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(1, 426)       =      2.59
       Model |  1.34752449         1  1.34752449   Prob > F        =    0.1086
    Residual |  221.979916       426  .521079616   R-squared       =    0.0060
-------------+----------------------------------   Adj R-squared   =    0.0037
       Total |  223.327441       427  .523015084   Root MSE        =    .72186

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     edu_hat |   .0591735   .0367969     1.61   0.109    -.0131525    .1314995
       _cons |   .4411034   .4671121     0.94   0.346    -.4770279    1.359235
------------------------------------------------------------------------------

6.095928

One additional year of education increases wages by \((e^{0.059}-1)*100\% = 6.1\%\)

Question?

The instrument passed the relevance test (F-stat > 15), but what about instrument exogeneity? Can you think of ways father’s education might be related to a child’s income other than through education? Social connections? Father’s income?

Estimating Returns to Education for Men

Let’s estimate the returns to education for men. We’ll use data from M. Blackburn and D. Neumark (1992), “Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials,” Quarterly Journal of Economics 107, 1421-1436.

Use number of siblings as an instrument to predict an observation’s level of education. We’ll keep it simple and have no other covariates.

use "wage2.dta", clear
  1. Estimate the potentially biased OLS
  2. Estimate your first stage
  3. Test your first stage
  4. Run your 2SLS
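The four steps above can be sketched in Python with simulated data (a toy illustration, not the wage2 data; numpy is assumed available, and the variable names sibs/educ/lwage simply mirror the example):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulate an endogenous regressor: unobserved ability raises both educ and
# lwage, while sibs shifts educ but not lwage directly (instrument exogeneity).
ability = rng.normal(size=n)
sibs = rng.poisson(2.5, size=n).astype(float)
educ = 13 - 0.25 * sibs + ability + rng.normal(size=n)
lwage = 5.0 + 0.10 * educ + 0.5 * ability + rng.normal(scale=0.4, size=n)

def ols(y, X):
    """OLS coefficients, with an intercept column added."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# 1. Potentially biased OLS: absorbs ability, distorting the return.
b_ols = ols(lwage, educ)[1]

# 2. First stage: regress educ on the instrument, save fitted values.
g = ols(educ, sibs)
educ_hat = g[0] + g[1] * sibs

# 3. First-stage F-statistic on the instrument (t^2 in the one-IV case).
X1 = np.column_stack([np.ones(n), sibs])
resid = educ - X1 @ g
s2 = resid @ resid / (n - 2)
var_b1 = s2 * np.linalg.inv(X1.T @ X1)[1, 1]
F = g[1] ** 2 / var_b1

# 4. 2SLS second stage: regress lwage on fitted education.
b_2sls = ols(lwage, educ_hat)[1]

print(F, b_ols, b_2sls)  # 2SLS lands nearer the true 0.10 than OLS does
```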

First, our biased OLS estimate

reg lwage educ
      Source |       SS           df       MS      Number of obs   =       935
-------------+----------------------------------   F(1, 933)       =    100.70
       Model |  16.1377042         1  16.1377042   Prob > F        =    0.0000
    Residual |  149.518579       933  .160255712   R-squared       =    0.0974
-------------+----------------------------------   Adj R-squared   =    0.0964
       Total |  165.656283       934  .177362188   Root MSE        =    .40032

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0598392   .0059631    10.03   0.000     .0481366    .0715418
       _cons |   5.973063   .0813737    73.40   0.000     5.813366    6.132759
------------------------------------------------------------------------------

6.1665821

We have an estimated return to education of \((e^{0.0598392}-1)*100\% = 6.2\%\).

Second, use number of siblings as the instrument in the first stage.

reg educ sibs
predict edu_hat
      Source |       SS           df       MS      Number of obs   =       935
-------------+----------------------------------   F(1, 933)       =     56.67
       Model |  258.055048         1  258.055048   Prob > F        =    0.0000
    Residual |   4248.7642       933  4.55387374   R-squared       =    0.0573
-------------+----------------------------------   Adj R-squared   =    0.0562
       Total |  4506.81925       934  4.82528828   Root MSE        =     2.134

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        sibs |  -.2279164   .0302768    -7.53   0.000     -.287335   -.1684979
       _cons |   14.13879   .1131382   124.97   0.000     13.91676    14.36083
------------------------------------------------------------------------------

(option xb assumed; fitted values)

Third, get the F-statistic.

test sibs
 ( 1)  sibs = 0

       F(  1,   933) =   56.67
            Prob > F =    0.0000

Fourth, run the second stage using education predicted from number of siblings.

reg lwage edu_hat
      Source |       SS           df       MS      Number of obs   =       935
-------------+----------------------------------   F(1, 933)       =     22.31
       Model |  3.86818074         1  3.86818074   Prob > F        =    0.0000
    Residual |  161.788103       933  .173406326   R-squared       =    0.0234
-------------+----------------------------------   Adj R-squared   =    0.0223
       Total |  165.656283       934  .177362188   Root MSE        =    .41642

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     edu_hat |   .1224326   .0259225     4.72   0.000     .0715595    .1733057
       _cons |   5.130026   .3494009    14.68   0.000     4.444323    5.815729
------------------------------------------------------------------------------

13.024298

One additional year of education increases wages by \((e^{0.1224}-1)*100\%=13.0\%\)

Another interesting result here is that the OLS estimate is biased downward, which is not what we would expect. Possible reasons: (1) number of siblings could be correlated with ability (more siblings means less parental attention, which could result in lower ability); (2) the OLS estimator is biased downward due to measurement error in educ, though educ is unlikely to satisfy the classical errors-in-variables (CEV) assumption.

Smoking on Birthweight

It is important to see an example of a poor instrument. We’ll use data from J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79, 586-593.

use "bwght.dta", clear

Our biased OLS regresses the natural log of birth weight on cigarette packs smoked per day by the mother. We would expect that smoking is correlated with unobserved health and parental decisions, so the estimate is likely biased due to unobserved confounders.

reg lbwght packs
      Source |       SS           df       MS      Number of obs   =     1,388
-------------+----------------------------------   F(1, 1386)      =     27.98
       Model |  .997781141         1  .997781141   Prob > F        =    0.0000
    Residual |  49.4225525     1,386  .035658407   R-squared       =    0.0198
-------------+----------------------------------   Adj R-squared   =    0.0191
       Total |  50.4203336     1,387  .036352079   Root MSE        =    .18883

------------------------------------------------------------------------------
      lbwght |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       packs |  -.0898131   .0169786    -5.29   0.000    -.1231197   -.0565065
       _cons |   4.769404   .0053694   888.26   0.000     4.758871    4.779937
------------------------------------------------------------------------------

-8.5895151

An additional pack smoked per day decreases birthweight by \((e^{-0.08981}-1)*100\% \approx 8.6\%\)

We’ll use cigarette prices as an instrument for cigarette packs smoked per day. We assume that cigarette prices and the error term u are uncorrelated (instrument exogeneity). Note that some states fund health care with cigarette tax revenue.

Cigarette price and the quantity of packs smoked should be negatively correlated.

reg packs cigprice
      Source |       SS           df       MS      Number of obs   =     1,388
-------------+----------------------------------   F(1, 1386)      =      0.13
       Model |  .011648626         1  .011648626   Prob > F        =    0.7179
    Residual |  123.684481     1,386  .089238442   R-squared       =    0.0001
-------------+----------------------------------   Adj R-squared   =   -0.0006
       Total |  123.696129     1,387  .089182501   Root MSE        =    .29873

------------------------------------------------------------------------------
       packs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    cigprice |   .0002829    .000783     0.36   0.718    -.0012531    .0018188
       _cons |   .0674257   .1025384     0.66   0.511    -.1337215    .2685728
------------------------------------------------------------------------------

Our instrument is not associated with quantity consumed.

Testing Instrument Relevance

We will test our instrument even though it appeared weak. We’ll use the predict command to get the estimated packs.

predict packs_hat

Use the test command to get the F-statistic.

test cigprice
 ( 1)  cigprice = 0

       F(  1,  1386) =    0.13
            Prob > F =    0.7179

Our instrument has an F-statistic of 0.13 (p = 0.7179). This is far below our rule of thumb of F-stat > 15, indicating that we have a weak instrument: the instrument relevance assumption fails.
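A weak first stage like this one can be mimicked with simulated data (a hypothetical illustration, not the bwght data; numpy assumed): when the instrument barely moves the regressor, the first-stage F-statistic collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1388  # same sample size as the bwght example

# A cigprice-style instrument with almost no effect on the regressor.
z = rng.normal(size=n)
x = 0.01 * z + rng.normal(size=n)  # true first-stage slope near zero

# First-stage OLS slope and its F-statistic (t^2 with one instrument).
Z = np.column_stack([np.ones(n), z])
b = np.linalg.lstsq(Z, x, rcond=None)[0]
resid = x - Z @ b
s2 = resid @ resid / (n - 2)
var_b1 = s2 * np.linalg.inv(Z.T @ Z)[1, 1]
F = b[1] ** 2 / var_b1
print(F)  # typically far below the rule-of-thumb threshold of 15
```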

We’ll still use the instrument to estimate the effect of packs of cigarettes smoked on the natural log of birth weight.

reg lbwght packs_hat
      Source |       SS           df       MS      Number of obs   =     1,388
-------------+----------------------------------   F(1, 1386)      =      2.87
       Model |  .104047659         1  .104047659   Prob > F        =    0.0907
    Residual |   50.316286     1,386  .036303237   R-squared       =    0.0021
-------------+----------------------------------   Adj R-squared   =    0.0013
       Total |  50.4203336     1,387  .036352079   Root MSE        =    .19053

------------------------------------------------------------------------------
      lbwght |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   packs_hat |   2.988676   1.765368     1.69   0.091    -.4744067    6.451758
       _cons |   4.448136   .1843027    24.13   0.000     4.086594    4.809679
------------------------------------------------------------------------------

Cigarette price is a poor instrument for packs smoked, and our second-stage estimate using predicted packs even has the wrong sign.

We can and should always test Instrument Relevance. If we have a poor instrument, we should go back to the drawing board.

Another issue here is that price is a poor instrument because price and quantity are simultaneously determined. We would need a second set of instruments that shift price to estimate our first stage.

Estimating Returns to Education for Married Women Part 2

We’ll use Mroz data again on working women.

use "mroz.dta", clear

We’ll use both parents’ education as instruments to identify the effect of education on wages for working women. We overidentify the endogenous variable with two instruments: father’s education and mother’s education.

Potentially Biased OLS

reg lwage educ c.exper##c.exper
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(3, 424)       =     26.29
       Model |  35.0222967         3  11.6740989   Prob > F        =    0.0000
    Residual |  188.305144       424  .444115906   R-squared       =    0.1568
-------------+----------------------------------   Adj R-squared   =    0.1509
       Total |  223.327441       427  .523015084   Root MSE        =    .66642

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .1074896   .0141465     7.60   0.000     .0796837    .1352956
          exper |   .0415665   .0131752     3.15   0.002     .0156697    .0674633
                |
c.exper#c.exper |  -.0008112   .0003932    -2.06   0.040    -.0015841   -.0000382
                |
          _cons |  -.5220406   .1986321    -2.63   0.009    -.9124667   -.1316144
---------------------------------------------------------------------------------

11.347933

Our result is \((e^{0.1074896}-1)*100\% \approx 11.3\%\)

Estimate the First Stage

We’ll use two instruments for one endogenous variable. Use parents’ education as instruments, restricting the sample to women in the labor force.

reg educ c.exper##c.exper fathedu mothedu if inlf==1
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(4, 423)       =     28.36
       Model |  471.620998         4   117.90525   Prob > F        =    0.0000
    Residual |  1758.57526       423  4.15738833   R-squared       =    0.2115
-------------+----------------------------------   Adj R-squared   =    0.2040
       Total |  2230.19626       427  5.22294206   Root MSE        =     2.039

---------------------------------------------------------------------------------
           educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
          exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
                |
c.exper#c.exper |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
                |
       fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
          _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084
---------------------------------------------------------------------------------

Test for Instrument Relevance with the F-Statistic

test fathedu mothedu
 ( 1)  fatheduc = 0
 ( 2)  motheduc = 0

       F(  2,   423) =   55.40
            Prob > F =    0.0000

Our instruments appear relevant, since \(F>15\)

Predict Education

predict edu_hat
(option xb assumed; fitted values)

Estimate Using an IV Estimator

Using both parents’ education as instruments

reg lwage edu_hat c.exper##c.exper
display (exp(_b[edu_hat])-1)*100
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(3, 424)       =      7.40
       Model |   11.117828         3  3.70594265   Prob > F        =    0.0001
    Residual |  212.209613       424   .50049437   R-squared       =    0.0498
-------------+----------------------------------   Adj R-squared   =    0.0431
       Total |  223.327441       427  .523015084   Root MSE        =    .70746

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
        edu_hat |   .0613966   .0329624     1.86   0.063    -.0033933    .1261866
          exper |   .0441704   .0140844     3.14   0.002     .0164865    .0718543
                |
c.exper#c.exper |   -.000899   .0004212    -2.13   0.033    -.0017268   -.0000711
                |
          _cons |   .0481003   .4197565     0.11   0.909    -.7769624     .873163
---------------------------------------------------------------------------------

6.3320574

One additional year of education increases wages by \((e^{0.061}-1)*100\%=6.3\%\). Note that the standard errors from this manual second stage are not quite right, because they treat edu_hat as data rather than as an estimate; the ivregress command below reports corrected standard errors.

Use the ivregress 2sls command

We went through each step, but we can use the IV estimator with one command: ivregress 2sls.

ivregress 2sls <dependent var> (<endogenous var> = <IVs>) x1 x2 ... xk

ivregress 2sls lwage (educ=fathedu mothedu) c.exper##c.exper
display (exp(_b[educ])-1)*100
Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      24.65
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1357
                                                  Root MSE        =     .67155

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
          exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
                |
c.exper#c.exper |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
                |
          _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper fatheduc motheduc

6.3320574

We can also display the first-stage results by adding the first option.

ivregress 2sls lwage (educ=fathedu mothedu) c.exper##c.exper, first
First-stage regressions
-----------------------

                                                Number of obs     =        428
                                                F(   4,    423)   =      28.36
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2115
                                                Adj R-squared     =     0.2040
                                                Root MSE          =     2.0390

---------------------------------------------------------------------------------
           educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
          exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
                |
c.exper#c.exper |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
                |
       fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
          _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084
---------------------------------------------------------------------------------


Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      24.65
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1357
                                                  Root MSE        =     .67155

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
          exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
                |
c.exper#c.exper |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
                |
          _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper fatheduc motheduc

6.3320574

Exercise

Go to my website and download the mroz data. Use father’s and mother’s education as instruments to estimate the return to education with the ivregress 2sls command. https://github.com/rowesamuel/Econ645/blob/main/Data/Wooldridge/mroz.dta

  1. Estimate OLS using \(lwage_{i} = \beta_{0} + \beta_{1}educ_{i} + \beta_{2}exper_{i} + \beta_{3}exper_{i}^{2} + u_{i}\)

  2. Estimate the 2SLS with the ivregress 2sls command

Answers

OLS

reg lwage educ c.exper##c.exper
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(3, 424)       =     26.29
       Model |  35.0222967         3  11.6740989   Prob > F        =    0.0000
    Residual |  188.305144       424  .444115906   R-squared       =    0.1568
-------------+----------------------------------   Adj R-squared   =    0.1509
       Total |  223.327441       427  .523015084   Root MSE        =    .66642

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .1074896   .0141465     7.60   0.000     .0796837    .1352956
          exper |   .0415665   .0131752     3.15   0.002     .0156697    .0674633
                |
c.exper#c.exper |  -.0008112   .0003932    -2.06   0.040    -.0015841   -.0000382
                |
          _cons |  -.5220406   .1986321    -2.63   0.009    -.9124667   -.1316144
---------------------------------------------------------------------------------

2SLS

ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      24.65
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1357
                                                  Root MSE        =     .67155

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
          exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
                |
c.exper#c.exper |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
                |
          _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper fatheduc motheduc

Testing for Endogeneity - Returns to Education for Working Women

We’ll keep using Mroz’s data on working women.

use "mroz.dta", clear

If we want to see whether our explanatory variable of interest is correlated with the error term, we can conduct an endogeneity test. We have two ways to test for endogeneity.

Manual Test

Estimate the reduced form for the potentially endogenous variable of interest by regressing education on all exogenous variables. This includes all exogenous variables in the structural model plus the additional IVs.

reg educ c.exper##c.exper fathedu mothedu if inlf==1
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(4, 423)       =     28.36
       Model |  471.620998         4   117.90525   Prob > F        =    0.0000
    Residual |  1758.57526       423  4.15738833   R-squared       =    0.2115
-------------+----------------------------------   Adj R-squared   =    0.2040
       Total |  2230.19626       427  5.22294206   Root MSE        =     2.039

---------------------------------------------------------------------------------
           educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
          exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
                |
c.exper#c.exper |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
                |
       fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
          _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084
---------------------------------------------------------------------------------

Next, obtain the residuals \(\hat{v}_{2}\).

predict r, residual

Then, add \(\hat{v_{2}}\) to the structural equation.

reg lwage educ c.exper##c.exper r
      Source |       SS           df       MS      Number of obs   =       428
-------------+----------------------------------   F(4, 423)       =     20.50
       Model |  36.2573098         4  9.06432744   Prob > F        =    0.0000
    Residual |  187.070131       423  .442246173   R-squared       =    0.1624
-------------+----------------------------------   Adj R-squared   =    0.1544
       Total |  223.327441       427  .523015084   Root MSE        =    .66502

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0309849     1.98   0.048      .000493    .1223003
          exper |   .0441704   .0132394     3.34   0.001     .0181471    .0701937
                |
c.exper#c.exper |   -.000899   .0003959    -2.27   0.024    -.0016772   -.0001208
                |
              r |   .0581666   .0348073     1.67   0.095    -.0102502    .1265834
          _cons |   .0481003   .3945753     0.12   0.903    -.7274721    .8236727
---------------------------------------------------------------------------------

There is possible evidence of endogeneity, since \(p < .1\) but \(p > .05\). You should report both IV and OLS; we can compare the results in a mini sensitivity analysis.

eststo m1: quietly reg lwage educ c.exper##c.exper
eststo m2: quietly ivregress 2sls lwage (educ=fatheduc motheduc) c.exper##c.exper
esttab m1 m2, mtitle(OLS IV)
                      (1)             (2)   
                      OLS              IV   
--------------------------------------------
educ                0.107***       0.0614*  
                   (7.60)          (1.96)   

exper              0.0416**        0.0442***
                   (3.15)          (3.30)   

c.exper#c.~r    -0.000811*      -0.000899*  
                  (-2.06)         (-2.25)   

_cons              -0.522**        0.0481   
                  (-2.63)          (0.12)   
--------------------------------------------
N                     428             428   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Using the estat postestimation command

We can also use a postestimation command estat endogenous.

ivregress 2sls lwage (educ=fathedu mothedu) c.exper##c.exper
estat endogenous
Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      24.65
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1357
                                                  Root MSE        =     .67155

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
          exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
                |
c.exper#c.exper |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
                |
          _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper fatheduc motheduc


  Tests of endogeneity
  Ho: variables are exogenous

  Durbin (score) chi2(1)          =  2.80707  (p = 0.0938)
  Wu-Hausman F(1,423)             =  2.79259  (p = 0.0954)

We have weak evidence that education is endogenous, since \(p < .1\) but \(p > .05\). However, we know from theory that ability confounds both wages and education. We should report both OLS and IV models.
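The control-function logic behind these tests can be sketched in Python with simulated data (a toy example, not the Mroz data; numpy assumed): when the regressor is endogenous, the first-stage residual enters the structural equation with a nonzero coefficient, and its t-statistic is the test.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000

# Endogenous educ: unobserved ability enters both equations; z is a clean IV.
ability = rng.normal(size=n)
z = rng.normal(size=n)
educ = 12 + 0.8 * z + ability + rng.normal(size=n)
lwage = 5 + 0.1 * educ + 0.5 * ability + rng.normal(scale=0.4, size=n)

def ols_fit(y, X):
    """OLS with intercept; returns coefficients, standard errors, residuals."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se, resid

# Reduced form: regress educ on all exogenous variables, keep residual v2_hat.
_, _, v2_hat = ols_fit(educ, z)

# Structural equation augmented with v2_hat; test H0: its coefficient is 0.
b, se, _ = ols_fit(lwage, np.column_stack([educ, v2_hat]))
t_v2 = b[2] / se[2]
print(b[1], t_v2)  # large |t| on v2_hat -> reject exogeneity of educ
```

Note that the coefficient on educ in this augmented regression equals the 2SLS estimate, which is how the manual test above reproduces the ivregress point estimate.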

Testing Overidentifying Restrictions - Returns to Education for Working Women

When we have one IV for one endogenous explanatory variable, we have a just identified equation. When we have two instruments and one endogenous explanatory variable we have overidentification.

When we have multiple IVs, we can test whether some of our instruments are correlated with the structural error term.

We can estimate two 2SLS models (one for each IV) and then compare them; they should differ only by sampling error. If the two coefficients on our explanatory variable of interest differ significantly, we conclude that at least one instrument (and maybe both) is correlated with the structural error term.

Adding more instrumental variables (overidentification) can increase the efficiency of the 2SLS estimator. However, we run a greater risk of violating the instrument exogeneity assumption.

When we use motheduc and fatheduc as IVs for education, we have a single overidentifying restriction: two IVs and one endogenous explanatory variable.

Testing the Overidentifying Restrictions Manually

  1. Get our working women data.
use "mroz.dta", clear
  2. Estimate a 2SLS model with mother’s education and father’s education as two IVs.
ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
  3. Get our residuals.
predict r, resid
  4. Regress the residuals on all exogenous variables.
reg r mothedu fathedu c.exper##c.exper
  5. Obtain the R-squared and N.
ereturn list
local N=`e(N)'
display "`N'"
local rsq=`e(r2)'
display "`rsq'"
local nR=`N'*`rsq'
display "`nR'"
scalars:
                  e(N) =  428
               e(df_m) =  4
               e(df_r) =  423
                  e(F) =  .0934962445404771
                 e(r2) =  .000883344256925
               e(rmse) =  .6752103466059415
                e(mss) =  .1705031219578643
                e(rss) =  192.8495121452517
               e(r2_a) =  -.0085645673813073
                 e(ll) =  -436.7021015142834
               e(ll_0) =  -436.891220726253
               e(rank) =  5

macros:
            e(cmdline) : "regress r mothedu fathedu c.exper##c.exper"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "r"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 5
                  e(V) :  5 x 5

functions:
             e(sample)   


428


.000883344256925


.3780713419639

Under the null hypothesis that all IVs are uncorrelated with \(u_{1}\), \(N R^{2} \sim \chi^{2}_{q}\), where q is the number of instruments from outside the model minus the number of endogenous explanatory variables. If \(N R^{2}\) exceeds the 5% critical value of \(\chi^{2}_{q}\), then we reject the null hypothesis and conclude that at least some of the IVs are not exogenous.

Here we have q = 2 - 1 = 1 df for the chi-squared test, and we fail to reject the null hypothesis since \(N R^{2} = 0.37807\) is below the 5% critical value of \(\chi^{2}_{1}\), which is 3.841.
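To make the arithmetic concrete, here is a small Python sketch (illustrative, not part of the Stata workflow) that reproduces the \(N R^{2}\) statistic and its p-value, using the fact that a \(\chi^{2}_{1}\) variable is a squared standard normal:

```python
# Illustrative check of the manual overidentification test (q = 1).
# For 1 df, P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
import math

N, r2 = 428, 0.000883344256925       # from the auxiliary regression above
nr2 = N * r2                         # the N*R^2 test statistic
p_value = math.erfc(math.sqrt(nr2 / 2))

print(round(nr2, 5))     # ~0.37807, well below the 5% critical value 3.841
print(round(p_value, 4))
```

The p-value matches the Sargan (score) statistic reported by estat overid below to rounding.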

Testing the Overidentifying Restrictions with Post-Estimation Estat

We can also use the postestimation command estat overid to get our \(N R^{2}\) statistic.

ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
estat overid
Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      24.65
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1357
                                                  Root MSE        =     .67155

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
          exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
                |
c.exper#c.exper |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
                |
          _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper motheduc fatheduc


  Tests of overidentifying restrictions:

  Sargan (score) chi2(1) =  .378071  (p = 0.5386)
  Basmann chi2(1)        =  .373985  (p = 0.5408)

Adding Another Potential Instrument - Husband’s Education

Let’s add husband’s education, so we have 2 overidentification restrictions.

ivregress 2sls lwage (educ=motheduc fatheduc huseduc) c.exper##c.exper
estat overid
Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(3)    =      34.90
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.1495
                                                  Root MSE        =     .66616

---------------------------------------------------------------------------------
          lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
           educ |   .0803918    .021672     3.71   0.000     .0379155    .1228681
          exper |   .0430973   .0132027     3.26   0.001     .0172204    .0689742
                |
c.exper#c.exper |  -.0008628   .0003943    -2.19   0.029    -.0016357   -.0000899
                |
          _cons |  -.1868572   .2840591    -0.66   0.511    -.7436029    .3698885
---------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper c.exper#c.exper motheduc fatheduc huseduc


  Tests of overidentifying restrictions:

  Sargan (score) chi2(2) =  1.11504  (p = 0.5726)
  Basmann chi2(2)        =  1.10228  (p = 0.5763)

Notice that we still fail to reject the null hypothesis, so we might consider keeping husband’s education as an IV. Also notice that the coefficient and standard error on education have changed as well.

It is a good idea to report both specifications as a sensitivity analysis.

est clear
eststo m1: quietly ivregress 2sls lwage (educ=motheduc fatheduc) c.exper##c.exper
eststo m2: quietly ivregress 2sls lwage (educ=motheduc fatheduc huseduc) c.exper##c.exper
esttab m1 m2, mtitle(2IVs 3IVs)
                      (1)             (2)   
                     2IVs            3IVs   
--------------------------------------------
educ               0.0614*         0.0804***
                   (1.96)          (3.71)   

exper              0.0442***       0.0431** 
                   (3.30)          (3.26)   

c.exper#c.~r    -0.000899*      -0.000863*  
                  (-2.25)         (-2.19)   

_cons              0.0481          -0.187   
                   (0.12)         (-0.66)   
--------------------------------------------
N                     428             428   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Mitchell - Data Management in Stata

We won’t spend too much time looking into Chapter 2 and 3, since they should be a review from ECON 644. Chapter 4 will be a review, but we’ll look into some CPS data.

Recommended resources: UCLA OARC Stata resources: https://stats.oarc.ucla.edu/stata/modules/ I have used this source on numerous occasions. Even after you become proficient with Stata, UCLA’s OARC Stata resources remain very helpful. If you just type “UCLA OARC Stata” into a search engine, it will usually pop up.

Statalist is usually another good resource when trying to figure out what went wrong. If you ask a question on the listserv, you will likely be scolded by Nick Cox or told to “read the manual”. However, if you just read what others have asked, you will often find useful information.

Getting our data

If you would like to recreate Mitchell’s data, then follow these steps:

Set your working directory and use net install. E.g:

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"

*Use:
net from https://www.stata-press.com/data/dmus2
net get dmus1
net get dmus2

*All commands:
net install dmus1

Chapter 2: Reading and Importing Data

Note: when setting your working directory, I recommend using “/” instead of “\”. Stata will accept both, but the backslash “\” often runs into string issues.

2.2 Reading Stata data

We’ll use the use command to get our data into Stata and we’ll add the clear option to clear our memory to prevent an error.

use "dentists.dta", clear

From Stata Press website

use "https://www.stata-press.com/data/dmus2/dentists.dta", clear

From NBER website for CPS data

use "https://data.nber.org/morg/annual/morg20.dta", clear
Reading Stata data with subsets

Subsetting by columns is a great way to work with a large dataset when you do not need all of it. For example, the January CPS Basic has 393 variables and over 125,000 observations. While this isn’t very large, subsetting is very beneficial when you work with the ACS PUMS data, which is usually around 16GB. We can subset in three ways:

  1. By variable names
  2. By rows
  3. By variable names and rows

Reading a subset of the Stata data when you know the variable names (e.g.: when you have a data dictionary)

use pemlr gestfips pwsswgt using "jan25pub.dta"
list in 1/5
     +-----------------------------+
     | gestfips   pemlr    pwsswgt |
     |-----------------------------|
  1. |        1       1   39739102 |
  2. |        1       1   39925108 |
  3. |        1       .          . |
  4. |        1       6   28808444 |
  5. |        1       .          . |
     +-----------------------------+

Subsetting by rows (observations) and variable names

*gestfips code of 24 is Maryland
use pemlr gestfips pwsswgt using "jan25pub.dta" if gestfips == 24, clear
list in 1/5
     +-----------------------------+
     | gestfips   pemlr    pwsswgt |
     |-----------------------------|
  1. |       24       5   48660345 |
  2. |       24       1   67339169 |
  3. |       24       1   50801953 |
  4. |       24       1   40983471 |
  5. |       24       1   62453493 |
     +-----------------------------+

Reading the data dictionary beforehand will help, especially with larger files such as the PUMS ACS or CPS

System data

I’m not going to focus on these, but they are available. There are datasets that are included in Stata; they use the sysuse command.

sysuse auto, clear

You can also use Stata files that are available online but didn’t ship with Stata; these use the webuse command.

webuse fullauto, clear

2.3 Importing Excel Files

Many times you will need to grab data from a website, and these data may be in an Excel format such as “.xls” or “.xlsx”. Our main command is import excel. We need to specify the firstrow option if we want the headers as variable names. Be careful if the headers in the first row are numeric, such as years: this will throw an error, and we’ll need to rename them in Excel before importing. For example, 2015 becomes y2015.

import excel using "tfp_major_sector_long.xlsx", firstrow clear
list in 1/10 
  1. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers | Total factor productivity |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  78.664        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  2. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |   Real value-added output |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  41.875        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  3. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |           Combined inputs |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  53.233        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  4. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |             Capital input |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  35.158        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  5. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |               Labor input |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  65.818        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  6. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |              Hours worked |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |       75.78700000000001        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  7. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |         Labor composition |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  86.846        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  8. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |        Labor productivity |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  55.253        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
  9. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |      Capital productivity |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                 119.104        |
     +-----------------------------------------------------------------------------------+

     +-----------------------------------------------------------------------------------+
 10. | NAICS |                          Sector |       Basis |                   Measure |
     |    XG | Private nonfarm business sector | All workers |         Capital intensity |
     |-----------------------------------------------------------------------------------|
     |                  Units       |       Year        |                   Value        |
     |       Index (2017=100)       |       1987        |                  46.391        |
     +-----------------------------------------------------------------------------------+
Pay attention to the Excel Sheet Name

We need to pay attention to the sheet names of the Excel files we are importing; otherwise, we may import the wrong sheet.

import excel using "dentists2.xls", firstrow clear
list
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

2.4-2.6 Importing SAS, SPSS, and dBase files

Note: importing files from other statistical packages is a pain. If you are using another statistical software package like R, SAS, Python, or heaven-forbid SPSS, then export the data from those packages as CSV files. CSV files are fairly universal. There may be some labeling issues, but if labels are not a concern, then use the CSV file format. Luckily, SAS data files are becoming less common. Even the Census Bureau has seen the light and provides its data in a CSV format.

For me, these commands are unavailable in Stata 14, but if you have Stata 16 or higher you can use them. import sas is our main command to import SAS files from version 7 or higher.

import sas "dentists.sas7bdat", clear

We can import SAS XPORT Version 5

import sasxport5 "dentists.xpt", clear

We can import SAS XPORT Version 8

import sasxport8 "dentists.xpt", clear

We can import SPSS files

import spss "dentlab.sav", clear

We can import dBase files

import dbase "dentlab.dbf", clear

2.7 Importing Raw Data files

There are three key types of raw data files we will cover:

  1. Delimited (such as CSV)
  2. Space-separated
  3. Fixed width

Importing CSV files will likely be more common than importing the other types of files covered so far, with the exception of Stata and Excel files. Note: there are other types of delimited files, but comma-delimited is the most common. Other delimiters include tab, space, and “|”.

CSV file

If you read the manual, you’ll notice that “,” and tab are the default delimiters. Let’s try it using the “,” delimiter.

import delimited using "jan25pub.csv", clear

Note: comma-delimited data may come in “.txt” files as well as “.csv” files.

import delimited using "dentists1.txt", clear
list
(4 vars, 5 obs)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

There may also be tab-delimited files, which use the tab (“\t”) delimiter.

import delimited using "dentists2.txt", clear
list
(4 vars, 5 obs)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

Using the following code will produce the same results as above, since tab is a default delimiter.

import delimited using "dentists2.txt", clear delimiters("\t")
list
(4 vars, 5 obs)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

Let’s use a file that uses “:” as a delimiter

import delimited using "dentists4.txt", clear delimiters(":") 
list
(4 vars, 5 obs)

     +-------------------------------------+
     |                v1      v2   v3   v4 |
     |-------------------------------------|
  1. | Y. Don Uflossmore    7.25    0    1 |
  2. |    Olive Tu'Drill   10.25    1    1 |
  3. | Isaac O'Yerbreath   32.75    1    1 |
  4. |      Ruth Canaale      22    1    1 |
  5. |        Mike Avity     8.5    0    0 |
     +-------------------------------------+

We can subset columns and rows with import delimited, which can be helpful for large files. The first row supplies the variable names, so when we use rowrange(1:3) this will only include two rows of data.

    import delimited using "dentists2.txt", clear rowrange(1:3)
    list
Import delimited from the web

We can import a csv straight from a website and convert it to a Stata dta file

import delimited using "https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/jun23pub.csv", clear
pwd
save "jun23pub.dta", replace
Import space-separated files

Sometimes files are space-delimited (which is a bad idea when strings are involved). dentists5.txt is an example of such a file. We can use the infile command to import these data, but we need to specify that the name variable is a string at least 17 characters wide.

infile str17 name years full rec using "dentists5.txt", clear
list
(5 observations read)

     +----------------------------------------+
     |              name   years   full   rec |
     |----------------------------------------|
  1. | Y. Don Uflossmore    7.25      0     1 |
  2. |    Olive Tu'Drill   10.25      1     1 |
  3. | Isaac O'Yerbreath   32.75      1     1 |
  4. |      Ruth Canaale      22      1     1 |
  5. |        Mike Avity     8.5      0     0 |
     +----------------------------------------+

Note: infile does not read variable names from the first row. We can use import delimited and specify that the delimiter is a blank space to get around this issue.

import delimited using "dentists6.txt", delimiters(" ") clear
list
(4 vars, 5 obs)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+
Importing fixed-column files

Unfortunately, fixed-column files are more common than I would prefer. They are a pain, since you need to specify each column’s width. I have often found this with files that have a “.dat” file extension.

We need a data dictionary to know where each variable starts. Once we know this, we can use the infix command to read in fixed-column files. Let’s look at our fixed-column-width file.

type "dentists7.txt"
Y. Don Uflossmore 7.2501
Olive Tu'Drill   10.2511
Isaac O'Yerbreath32.7511
Ruth Canaale     22.0011
Mike Avity        8.5000

Our name variable occupies columns 1 through 17, years occupies columns 18 through 22, and our fulltime and recommend binaries each occupy a single column, 23 and 24.

infix str name 1-17 years 18-22 fulltime 23 recom 24 using "dentists7.txt", clear
list
(5 observations read)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+
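The logic behind infix can be sketched in Python (illustrative only, not Stata): fixed-column parsing is just slicing each line at the known positions from the column specification above.

```python
# Illustrative sketch: parse one fixed-width record by slicing at the
# column positions used in the infix command (1-17, 18-22, 23, 24).
line = "Isaac O'Yerbreath32.7511"

name     = line[0:17].strip()   # columns 1-17: dentist's name
years    = float(line[17:22])   # columns 18-22: years in practice
fulltime = int(line[22])        # column 23: full-time flag
recom    = int(line[23])        # column 24: recommend flag

print(name, years, fulltime, recom)
```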

This is a real pain, but luckily Stata has dictionary files to help when one is available. They have a .dct file extension. We already have a fixed-column dictionary in dentists1.dct, and we can use it to open the fixed-column-width file dentists7.txt.

Let’s look at our dictionary file.

type "dentists1.dct"
infix dictionary {
  str name 1-17 years 18-22 fulltime 23 recom 24
}

Let’s look at our fixed-column width file.

type "dentists7.txt"
Y. Don Uflossmore 7.2501
Olive Tu'Drill   10.2511
Isaac O'Yerbreath32.7511
Ruth Canaale     22.0011
Mike Avity        8.5000

Let’s use the two together to import the data file.

infix using "dentists1.dct", using(dentists7.txt) clear
list
infix dictionary {
  str name 1-17 years 18-22 fulltime 23 recom 24
}
(5 observations read)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

We can combine infile command with a dictionary file.

type "dentists3.dct"
infile dictionary using dentists7.txt {
  str17 name     %17s  "Name of dentist"
        years    %5f   "Years in practice"
        fulltime %1f   "Full time?"
        recom    %1f   "Recommend Quaddent?"
}

Notice that the first line of the dictionary file is infile dictionary using dentists7.txt and then the column widths.

infile using "dentists3.dct", clear
list
infile dictionary using dentists7.txt {
  str17 name     %17s  "Name of dentist"
        years    %5f   "Years in practice"
        fulltime %1f   "Full time?"
        recom    %1f   "Recommend Quaddent?"
}

(5 observations read)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
  4. |      Ruth Canaale      22          1       1 |
  5. |        Mike Avity     8.5          0       0 |
     +----------------------------------------------+

Dictionary .dct files are very helpful when they are available with fixed-column width files.

Subsetting fixed-column width files

We can use the in qualifier to subset the number of rows to import with either infix or infile.

infix str name 1-17 years 18-22 fulltime 23 recom 24 using "dentists7.txt" in 1/3, clear
list
(3 observations read)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
     +----------------------------------------------+
infix using "dentists1.dct" in 1/3, using(dentists7.txt) clear
list
infix dictionary {
  str name 1-17 years 18-22 fulltime 23 recom 24
}
(3 observations read)

     +----------------------------------------------+
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Y. Don Uflossmore    7.25          0       1 |
  2. |    Olive Tu'Drill   10.25          1       1 |
  3. | Isaac O'Yerbreath   32.75          1       1 |
     +----------------------------------------------+

2.8 Common errors when reading and importing files

Tip: don’t forget to clear your data. Either run clear on the line before, or add it as the option , clear.

Error 1

If you do not clear your data before the use command, then you will get the message “no; dataset in memory has changed since last saved”. Use clear.

Error 2

If you do not clear your data with the import, infix, or infile commands, then you will get the message: “You must start with an empty dataset”

2.9 Entering data directly into Stata Data Editor

My opinion: don’t do this. If you need to enter data manually, I don’t recommend entering it directly into Stata. I recommend using a spreadsheet and then importing the spreadsheet. If for some reason you don’t have access to Excel or a spreadsheet application like OSX Numbers or Google Sheets, then read pages 43-50.

Chapter 3: Saving and Exporting Data

3.2 Saving Stata files

It can be helpful to convert a CSV, Excel, or other type of file to a Stata dta file. It is also good to consider saving in older Stata formats, since older versions help with sharing data. (One of my beefs with Stata is that newer data files are not compatible with older versions of Stata.)

Let’s import a comma-delimited file

import delimited using "dentists1.txt", clear

We can use the save and saveold commands to convert this comma-delimited file into a Stata dta file. Don’t forget to replace or else you will run into an error when trying to replicate the code.

save will save it in the format of whatever version of Stata you are running (Stata 14, for me).

    save mydentists, replace

Saveold and specify version 12

    saveold mydentists_v12, version(12) replace

We can subset as well with the keep command

    keep if recom==1
    save mydentists, replace
    saveold mydentists_v12, version(12) replace

There is a compress command, but hard drives are large enough today that file size should rarely be a problem.

3.3 Exporting Excel Files

Let’s say we want to collaborate with someone that doesn’t have Stata or wants to make graphs in Excel. We can export to excel either “.xls” or “.xlsx” from Stata with the export excel command.

Let’s use our main file again

    use dentlab, clear

We can use the export excel command. Don’t forget to use replace option and set the firstrow as variable names.

export excel using "dentlab.xlsx", replace firstrow(variables)

Note: value labels are exported instead of the binary 0/1 values for parttime and recommend. We can use the nolabel option to work around this.

export excel using "dentlab.xlsx", replace firstrow(variables) nolabel

But what if we want a “labeled” and an “unlabeled” sheet in the same Excel file? Using the replace option will completely replace the entire Excel file. We don’t want that! So, let’s add another sheet called “Unlabeled”. We use the replace suboption inside the sheet() option, so we do not replace the entire Excel file.

export excel using "dentlab.xlsx", sheet("Unlabeled", replace) firstrow(variables) nolabel

3.4, 3.5, and 3.6 Exporting SAS, SPSS, and dBase files

My recommendation is to use csv files instead of SAS or SPSS files. If you need to do so, please review pages 59-62.

3.7 Exporting comma and tab delimited files

Exporting CSV Files

When collaborating with others, especially if they are using different software like SAS, R, or Python, I recommend using CSV files. They are universally readable, which makes collaboration easier. You may lose labels (or values). However, if not everyone is using Stata, then exporting CSV (or Excel) is a good idea.

The default export delimiter is comma-delimited

use dentlab
export delimited using "dentists_comma.csv", replace

This will export the labels instead of the values. If you want to export the values instead of the labels, use the nolabel option.

export delimited using "dentists_comma.csv", nolabel replace    

Note: You also have the option of putting quotes around strings, which is a good idea if your strings have “,” in them. Otherwise, the csv will treat the “,” as a delimiter and your dataset will be incorrectly exported. The option is simply quote.

    export delimited using "dentists_comma.csv", nolabel quote replace

Sometimes we may need to format our data for exporting.

replace years = 8.93 if name == "Mike Avity"
export delimited using "dentists_comma.csv", nolabel quote replace
type "dentists_comma.csv"
(1 real change made)

file dentists_comma.csv saved

name,years,fulltime,recom
"Y. Don Uflossmore",7.25,0,1
"Olive Tu'Drill",10.25,1,1
"Isaac O'Yerbreath",32.75,1,1
"Ruth Canaale",22,1,1
"Mike Avity",8.9300003,0,0

Note that the years value for Mike Avity that we just replaced was exported as 8.9300003. To fix this, we can use the format command to set the total display width to 5 and the decimal places to 2 with %5.2f.

format years %5.2f
export delimited using "dentists_comma.csv", nolabel quote replace
type "dentists_comma.csv"
file dentists_comma.csv saved

name,years,fulltime,recom
"Y. Don Uflossmore",7.25,0,1
"Olive Tu'Drill",10.25,1,1
"Isaac O'Yerbreath",32.75,1,1
"Ruth Canaale",22,1,1
"Mike Avity",8.93,0,0

For more options see:

    help export delimited

#### 3.8 Exporting space-separated files

Don’t do this. Use comma-delimited files. If you are really interested in this, please review pages 65-66.

#### 3.9 Creating Reports

I liked this addition to the exporting data section. A lot of the time, senior leadership may need a report, and with export excel (or putexcel) we can generate automated reports. The key here is automated: the pipeline from raw data to report is done completely through a .do file.

It is often a pain to get data analysis into your write-ups, and export excel and putexcel are great tools to make it easy. We’ll cover estout for getting regression output into Word or LaTeX a bit later. I have used putexcel in the past, but we’ll cover two ways to fill out Excel reports.

We are going to create a Full-time Part-time Report in Excel

use "dentlab.dta", clear
keep name years fulltime
export excel using "dentrpt1.xlsx", cell(A4) replace

Notice this just exports our data into an Excel file without a shell, but we place our data starting at cell A4 and we do not export variable names.

We’ll use the shell report called “dentrpt1-skeleton.xlsx”, and we’ll fill it out using Stata. This is just an example, but it is easily applicable in many office settings. It is preferable to fill out a report this way: if there is a mistake, it is easy to fix without generating a new report from scratch.

First we’ll use the copy command to copy the Report Shell over our newly created file called “dentrpt1.xlsx”

copy "dentrpt1-skeleton.xlsx" "dentrpt1.xlsx", replace

Now we will use the modify option to modify our report shell to fill out the report with data.

export excel "dentrpt1.xlsx", sheet("dentists", modify) cell(A4)

There are two sheets in the report shell: “dentists” and “Sheet2”. We modified the sheet called “dentists” and pasted our data into it starting at cell A4.

Dental Report

We can keep our cell formats, too. We have our report shell in “dentrpt2-skeleton.xlsx”. We will do this similarly to the first report, but we will add the keepcellfmt option.

copy "dentrpt2-skeleton.xlsx" "dentrpt2.xlsx", replace
export excel "dentrpt2.xlsx", sheet("dentists", modify) cell(A4) //keepcellfmt

Note: the keepcellfmt option is unavailable in Stata 14, but you should have it in your newer version.

Let’s include averages for years and for full-time vs. part-time. We make a copy of the shell first.

copy "dentrpt3-skeleton.xlsx" "dentrpt3.xlsx", replace
export excel "dentrpt3.xlsx", sheet("dentists", modify) cell(A6) 

You have two options for the averages: 1) add an average formula in Excel, or 2) use putexcel. It is easy to add formulas in Excel, but what if the number of observations changes over time? Then you would need to modify the Excel sheet. Using putexcel, you can update the averages without messing with formulas in Excel.

Note: It is poor practice to run the summarize command and manually enter the results into your report. Use local macros and do it dynamically!

We will set our dentists report to be modified by Stata, and then run the summarize command, take the average stored after the command, and put it into Excel.

putexcel set "dentrpt3.xlsx", sheet("dentists") modify

After setting the sheet to be modified, we can write expressions, returned results, formulas, graphs, scalars, matrices, etc.
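
As a sketch of a few of those object types (the cell addresses here are my own choices, and this assumes the putexcel set command above has been run):

    putexcel A1 = "Dentist Report"        // a string expression
    summarize years
    putexcel B2 = `r(mean)'               // a returned result (scalar)
    correlate years fulltime
    putexcel D2 = matrix(r(C), names)     // a matrix, with row/column names
    putexcel save                         // needed in newer putexcel versions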

Let’s look at what the summarize command stores after being run.

summarize years
return list
quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
quietly use dentlab
summarize years
return list

There are several scalars that are stored, such as the number of observations r(N), the mean r(mean), the standard deviation r(sd), etc. If we use the detail option, we’ll see more scalars. We can retrieve those scalars with local macros, such as `r(mean)'.

summarize years
putexcel B2 = `r(mean)'
summarize fulltime
putexcel B3 = `r(mean)'
putexcel save //Not in older version

Dental Report with scalars

More on putexcel: https://blog.stata.com/2017/01/10/creating-excel-tables-with-putexcel-part-1-introduction-and-formatting/

Chapter 4: Cleaning and Checking Data

Good note from Mitchell at the beginning: Garbage In; Garbage Out.

Before doing analysis, you really need to check your data to look for data quality issues.

4.2 Double data entry

The duplicates command can be helpful, but it will only find observations that are exact duplicates across all variables unless you specify particular variables with duplicates list varlist. We’ll cover this a bit later.
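
A minimal sketch of the main duplicates subcommands (shown here on the wws data used below; the dup variable name is my own):

    use wws, clear
    duplicates report             // tally of surplus copies across all variables
    duplicates report idcode      // duplicates with respect to idcode only
    duplicates list idcode        // list the observations sharing an idcode
    duplicates tag idcode, generate(dup)   // flag them for review before any drop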

#### 4.3 Checking individual variables

We start off by looking for problems in individual variables. There are three useful commands for checking individual variables:

1. describe
2. tabulate
3. summarize

use wws, clear

The describe command is a useful way to see an overview of your data with variable labels, storage types, and value labels (if any)

describe
quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
quietly use wws, clear
describe
Contains data from wws.dta
  obs:         2,246                          Working Women Survey
 vars:            30                          3 Jan 2010 00:42
 size:       163,958                          (_dta has notes)
------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
idcode          int     %8.0g                 Unique ID
age             byte    %8.0g                 age in current year
race            byte    %8.0g                 race
married         byte    %8.0g                 married
collgrad        byte    %16.0g                college graduate
south           byte    %8.0g                 lives in south
industry        byte    %23.0g                industry
occupation      byte    %22.0g                occupation
union           byte    %8.0g                 union worker
wage            float   %9.0g                 hourly wage
hours           byte    %8.0g                 usual hours worked
nevermarried    byte    %8.0g                 Woman never been married
yrschool        byte    %8.0g                 Years of school completed
metro           byte    %9.0g                 Does woman live in metro area?
ccity           byte    %8.0g                 Does woman live in city center?
currexp         float   %9.0g                 Years worked at current job
prevexp         float   %9.0g                 Years worked at previous job
everworked      float   %9.0g                 Has woman ever worked?
uniondues       float   %9.0g                 Union Dues paid last week
marriedyrs      float   %9.0g                 Years married (rounded to nearest year)
unempins        float   %9.0g                 Under/Unemployment insur. received last week
numkids         float   %9.0g                 Number of children
kidage1         float   %9.0g                 Age of first child
kidage2         float   %9.0g                 Age of second child
kidage3         float   %9.0g                 Age of third child
grade           byte    %8.0g                 current grade completed
grade4          byte    %9.0g                 4 level Current Grade Completed
wage2           float   %9.0g                 Wages, rounded to 2 digits
fwt             float   %9.0g                 Frequency weight
networth        float   %9.0g                 Net worth
------------------------------------------------------------------------------------------------
Sorted by: 

You will notice that we lack value labels and a data dictionary telling us what each value is coded as.

One way tabulation

The tabulation or tab command is very useful to inspect categorical variables. Let’s look at collgrad, which is a binary for college graduate status. Let’s include the missing option to make sure no observations are missing data.

tabulate collgrad, missing
    college |
   graduate |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,713       76.27       76.27
          1 |        533       23.73      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look at race

tabulate race, missing
       race |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,636       72.84       72.84
          2 |        583       25.96       98.80
          3 |         26        1.16       99.96
          4 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Race should only be coded between 1 and 3, so we have one miscoded observation. Let’s find the observation’s idcode.

list idcode race if race==4
      | idcode   race |
      |---------------|
2013. |    543      4 |
      +---------------+
Summarize

The summarize or sum command is very useful for inspecting continuous variables. Let’s look at the amount of unemployment insurance in unempins. The values should range between 0 and 300 dollars of unemployment insurance received last week.

summarize unempins
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    unempins |      2,246    30.50401    73.16682          0        299

Let’s look at wages.

    summarize wage
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      2,246    288.2885    9595.692          0     380000

The data range between 0 and 380,000 dollars in hourly wage last week with a mean of 288.29 dollars and a standard deviation of 9,595 dollars. The max seems a bit high. Let’s add the detail option to get more information.

    summarize wage, detail
                         hourly wage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.892108              0
 5%     2.801002       1.004952
10%     3.220612       1.032247       Obs               2,246
25%     4.259257       1.151368       Sum of Wgt.       2,246

50%     6.276297                      Mean           288.2885
                        Largest       Std. Dev.      9595.692
75%     9.661837       40.19808
90%     12.77777       40.74659       Variance       9.21e+07
95%     16.73912         250000       Skewness       35.45839
99%     38.70926         380000       Kurtosis       1297.042

Our mean appears to be skewed rightward by some outliers. Our median hourly wage is 6.28 dollars per hour and our 99th percentile hourly wage is 38.7 dollars per hour, so a mean of 288.29 dollars/hour is highly skewed. Our kurtosis shows that the max outliers are heavily skewing the results: a normally distributed variable has a kurtosis of 3, and ours is 1297. Let’s look at the outliers.

list idcode wage if wage > 1000
      | idcode     wage |
      |-----------------|
 893. |   3145   380000 |
1241. |   2341   250000 |
      +-----------------+

Let’s look at ages, which should range from 21 to 50 years old. We can use both the summarize and tabulate commands. Tabulate can be very useful with continuous variables if the range is not too large.

summarize age
tabulate age
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,246    36.25111    5.437983         21         83

     age in |
    current |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         21 |          2        0.09        0.09
         22 |          8        0.36        0.45
         23 |         16        0.71        1.16
         24 |         35        1.56        2.72
         25 |         44        1.96        4.67
         26 |         50        2.23        6.90
         27 |         50        2.23        9.13
         28 |         63        2.80       11.93
         29 |         49        2.18       14.11
         30 |         55        2.45       16.56
         31 |         57        2.54       19.10
         32 |         60        2.67       21.77
         33 |         56        2.49       24.27
         34 |         95        4.23       28.50
         35 |        213        9.48       37.98
         36 |        224        9.97       47.95
         37 |        171        7.61       55.57
         38 |        175        7.79       63.36
         39 |        167        7.44       70.79
         40 |        139        6.19       76.98
         41 |        148        6.59       83.57
         42 |        111        4.94       88.51
         43 |        110        4.90       93.41
         44 |         98        4.36       97.77
         45 |         45        2.00       99.78
         46 |          1        0.04       99.82
         47 |          1        0.04       99.87
         48 |          1        0.04       99.91
         54 |          1        0.04       99.96
         83 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look at the outliers.

list idcode age if age > 50
      | idcode   age |
      |--------------|
2205. |     80    54 |
2219. |     51    83 |
      +--------------+
#### 4.4 Checking categorical by categorical variables
Two-way tabulation

The two-way tabulation through the tabulate command is a very useful way to look for data quality problems or to double-check binaries we created. Let’s look at the variables metro, a binary for whether the observation lives in a metropolitan area, and ccity, a binary for whether the observation lives in the city center.

use wws, clear
tabulate metro ccity, missing
Does woman |
   live in |  Does woman live in
     metro |     city center?
     area? |         0          1 |     Total
-----------+----------------------+----------
         0 |       665          0 |       665 
         1 |       926        655 |     1,581 
-----------+----------------------+----------
     Total |     1,591        655 |     2,246 

An alternative is to count the number of observations that shouldn’t be there. So we’ll count the number of observations not in a metro area but in a city center. I personally don’t use it, but it has its uses.

    count if metro==0 & ccity==1

Let’s look at married and nevermarried. There should be no individuals that appear as both married and never married.

tabulate married nevermarried, missing
           |   Woman never been
           |        married
   married |         0          1 |     Total
-----------+----------------------+----------
         0 |       570        234 |       804 
         1 |     1,440          2 |     1,442 
-----------+----------------------+----------
     Total |     2,010        236 |     2,246 

There may be observations that have been married but are not currently married. There should be no observations that are both married and never married, yet there are 2.

count if married == 1 & nevermarried == 1
list idcode married nevermarried if married == 1 & nevermarried == 1
  2

      +-----------------------------+
      | idcode   married   neverm~d |
      |-----------------------------|
1523. |   1758         1          1 |
2231. |     22         1          1 |
      +-----------------------------+

Let’s look at college graduates by years of school completed. We can use the tabulate or table commands. I personally prefer tabulate, since we can look for missing values.

tabulate yrschool collgrad, missing
  Years of |
    school |   college graduate
 completed |         0          1 |     Total
-----------+----------------------+----------
         8 |        69          1 |        70 
         9 |        55          0 |        55 
        10 |        84          0 |        84 
        11 |       123          0 |       123 
        12 |       943          0 |       943 
        13 |       174          2 |       176 
        14 |       180          7 |       187 
        15 |        81         11 |        92 
        16 |         0        252 |       252 
        17 |         0        106 |       106 
        18 |         0        154 |       154 
         . |         4          0 |         4 
-----------+----------------------+----------
     Total |     1,713        533 |     2,246 

Or we can use the table command

table yrschool collgrad
list idcode if yrschool < 16 & collgrad == 1
Years of  |  college  
school    |  graduate 
completed |    0     1
----------+-----------
        8 |   69     1
        9 |   55      
       10 |   84      
       11 |  123      
       12 |  943      
       13 |  174     2
       14 |  180     7
       15 |   81    11
       16 |        252
       17 |        106
       18 |        154
----------------------

      +--------+
      | idcode |
      |--------|
 195. |   4721 |
 369. |   4334 |
 464. |   4131 |
 553. |   3929 |
 690. |   3589 |
      |--------|
1092. |   2681 |
1098. |   2674 |
1114. |   2640 |
1124. |   2613 |
1221. |   2384 |
      |--------|
1493. |   1810 |
1829. |    993 |
1843. |    972 |
1972. |    689 |
2114. |    312 |
      |--------|
2174. |    172 |
2198. |    107 |
2215. |     63 |
2228. |     25 |
2229. |     24 |
      |--------|
2230. |     23 |
      +--------+

#### 4.5 Checking categorical by continuous variables

It is also helpful to look at continuous variables by different categories. Let’s look at the binary variable union, which indicates whether the observation is a union member, together with uniondues.

summarize uniondues, detail
                  Union Dues paid last week
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               2,242
25%            0              0       Sum of Wgt.       2,242

50%            0                      Mean           5.603479
                        Largest       Std. Dev.      9.029045
75%           10             29
90%           22             29       Variance       81.52365
95%           26             29       Skewness        1.35268
99%           29             29       Kurtosis       3.339635

Let’s use the bysort prefix, which sorts the data by the specified variable and then runs the command separately for each group.

bysort union: summarize uniondues
-> union = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |      1,413     .094126    1.502237          0         27

------------------------------------------------------------------------------------------------
-> union = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |        461    14.65944    8.707759          0         29

------------------------------------------------------------------------------------------------
-> union = .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |        368    15.41304    8.815582          0         29

The mean for not in a union is less than 1 dollar with 1,413 observations. The mean for union is about 15 dollars with 461 observations. The mean for missing is about 15 dollars with 368 observations.

tabulate uniondues if union==0, missing
 Union Dues |
  paid last |
       week |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,407       99.29       99.29
         10 |          1        0.07       99.36
         17 |          1        0.07       99.44
         26 |          2        0.14       99.58
         27 |          2        0.14       99.72
          . |          4        0.28      100.00
------------+-----------------------------------
      Total |      1,417      100.00

We see six non-union observations that paid union dues; we could recode them to 0. However, it is possible that someone may not be a union member but still has to pay an agency fee.

We’ll assume no agency fee, so we can use the recode command to create a new variable indicating whether a person pays union dues.

recode uniondues (0=0) (1/max=1), generate(paysdues)
(784 differences between uniondues and paysdues)

Now let’s compare union members to those who pay union dues.

tabulate union paysdues, missing
variable paysdues not found
r(111);

Let’s list the non-union members paying union dues. Note: we use the abb(#) option to abbreviate variable names to 20 characters.

list idcode union uniondues if union==0 & (uniondues > 0) & !missing(uniondues), abb(20)
      | idcode   union   uniondues |
      |----------------------------|
 561. |   3905       0          10 |
 582. |   3848       0          26 |
 736. |   3464       0          17 |
1158. |   2541       0          27 |
1668. |   1411       0          27 |
      |----------------------------|
2100. |    345       0          26 |
      +----------------------------+

Let’s look at another great command for observing categorical and continuous variables together: tabstat

We could use tab with a sum option.

tab married, sum(marriedyrs)
            |  Summary of Years married (rounded
            |          to nearest year)
    married |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |           0           0         804
          1 |   5.5409154   3.5521383       1,442
------------+------------------------------------
      Total |   3.5574354   3.8933494       2,246

Or, we can use tabstat which gives us more options.

tabstat marriedyrs, by(married) statistics(n mean sd min max p25 p50 p75) missing
Summary for variables: marriedyrs
     by categories of: married (married)

 married |         N      mean        sd       min       max       p25       p50       p75
---------+--------------------------------------------------------------------------------
       0 |       804         0         0         0         0         0         0         0
       1 |      1442  5.540915  3.552138         0        11         2         6         9
---------+--------------------------------------------------------------------------------
   Total |      2246  3.557435  3.893349         0        11         0         2         7
------------------------------------------------------------------------------------------

No observation that said they were never married reported years of marriage.

Let’s look at current years of experience and everworked binary.

tabstat currexp, by(everworked) statistics(n mean sd min max) missing
Summary for variables: currexp
     by categories of: everworked (Has woman ever worked?)

everworked |         N      mean        sd       min       max
-----------+--------------------------------------------------
         0 |        60         0         0         0         0
         1 |      2171  5.328881  5.042181         0        26
-----------+--------------------------------------------------
     Total |      2231  5.185567  5.048073         0        26
--------------------------------------------------------------

Let’s compute total years of experience, which is currexp plus prevexp.

generate totexp=currexp+prevexp
tabstat totexp, by(everworked) statistics(n mean sd min max) missing
(15 missing values generated)


Summary for variables: totexp
     by categories of: everworked (Has woman ever worked?)

everworked |         N      mean        sd       min       max
-----------+--------------------------------------------------
         0 |        60         0         0         0         0
         1 |      2171  11.57761  4.552392         1        29
-----------+--------------------------------------------------
     Total |      2231  11.26625  4.865816         0        29
--------------------------------------------------------------

Everyone who has ever worked has at least one year of total experience.

Summary of ways to look at categorical by continuous:

1. bysort x_cat: summarize x_con
2. tab x_cat, sum(x_con)
3. tabstat x_con, by(x_cat) statistics(…)

#### 4.6 Checking continuous by continuous variables

Now we will compare continuous variables against each other. Let’s look at unemployment insurance received last week for observations whose hours are greater than 30 and not missing.

summarize unempins if hours > 30 & !missing(hours)
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    unempins |      1,800    1.333333    16.04617          0        287

The mean is around 1.3 and our range is between our expected 0 and 287.

count if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
  19

Let’s look at the hours count.

tabulate hours if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
usual hours |
     worked |      Freq.     Percent        Cum.
------------+-----------------------------------
         35 |          1        5.26        5.26
         38 |          2       10.53       15.79
         40 |         14       73.68       89.47
         65 |          1        5.26       94.74
         70 |          1        5.26      100.00
------------+-----------------------------------
      Total |         19      100.00

Let’s look at age and years married to find any unusual observations.

generate agewhenmarried = age - marriedyrs
summarize agewhenmarried 
tabulate agewhenmarried
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
agewhenmar~d |      2,246    32.69368    6.709341         13         73


agewhenmarr |
        ied |      Freq.     Percent        Cum.
------------+-----------------------------------
         13 |          1        0.04        0.04
         14 |          4        0.18        0.22
         15 |         11        0.49        0.71
         16 |          8        0.36        1.07
         17 |         18        0.80        1.87
         18 |         19        0.85        2.72
         19 |         14        0.62        3.34
         20 |         20        0.89        4.23
         21 |         36        1.60        5.83
         22 |         42        1.87        7.70
         23 |         52        2.32       10.02
         24 |         65        2.89       12.91
         25 |         67        2.98       15.89
         26 |         81        3.61       19.50
         27 |         73        3.25       22.75
         28 |         93        4.14       26.89
         29 |         90        4.01       30.90
         30 |        118        5.25       36.15
         31 |         91        4.05       40.20
         32 |        124        5.52       45.73
         33 |        109        4.85       50.58
         34 |        109        4.85       55.43
         35 |        149        6.63       62.07
         36 |        138        6.14       68.21
         37 |        119        5.30       73.51
         38 |        124        5.52       79.03
         39 |        119        5.30       84.33
         40 |         89        3.96       88.29
         41 |         80        3.56       91.85
         42 |         62        2.76       94.61
         43 |         50        2.23       96.84
         44 |         43        1.91       98.75
         45 |         23        1.02       99.78
         46 |          2        0.09       99.87
         47 |          1        0.04       99.91
         48 |          1        0.04       99.96
         73 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look for anyone married under 18. Some states allow marriage under 18, but it is still suspicious.

tabulate agewhenmarried if agewhenmarried < 18
variable agewhenmarried not found
r(111);

Let’s use the same strategy, deducting years of experience from age to find the age at which each woman started working.

generate agewhenstartwork = age - (prevexp + currexp)
tabulate agewhenstartwork if agewhenstartwork < 16
(15 missing values generated)


agewhenstar |
      twork |      Freq.     Percent        Cum.
------------+-----------------------------------
          8 |          1        1.49        1.49
          9 |          1        1.49        2.99
         12 |          1        1.49        4.48
         14 |         20       29.85       34.33
         15 |         44       65.67      100.00
------------+-----------------------------------
      Total |         67      100.00

Some states allow work at the age of 15, but anything below looks suspicious.

We can also look at the ages of the children. We should expect that the third child is never older than the second child.

table kidage2 kidage3 if numkids==3
Age of    |
second    |               Age of third child              
child     |    0     1     2     3     4     5     6     7
----------+-----------------------------------------------
        0 |   12                                          
        1 |   10     9                                    
        2 |   11     8    10                              
        3 |   10    12     6     8                        
        4 |   10    12    10     7     5                  
        5 |   12    11     9     3     6     8            
        6 |    9     8    10     6     5     6     6      
        7 |    7     6     7     9     4    14    12     6
        8 |          5    11     7     6    14     6    11
        9 |                8    13    10     7    12     9
       10 |                     15     3    10     6    12
       11 |                            9     8     3    13
       12 |                                 16     9     6
       13 |                                       11     5
       14 |                                              8
----------------------------------------------------------

Check the count.

count if (kidage3 > kidage2) & (numkids==3) & !missing(kidage3)
count if (kidage2 > kidage1) & (numkids>=2) & !missing(kidage2)
  0

  0

Now let’s check the age of the observation when the first child was born

generate agewhenfirstkid = age - kidage1
tabulate agewhenfirstkid if agewhenfirstkid < 18
age not found
r(111);

r(111);

There are some suspicious observations in this tabulation. Let's look at a scatter plot. Scatter plots can also be helpful when examining one continuous variable against another.

twoway scatter hours wage if hours > 0 & wage >0 & wage < 1000
graph export "hour_wage.png", width(500) replace
(file hour_wage.png written in PNG format)
[Figure: Hours and Wages scatter plot]

4.7 Correcting errors in data

Let's go ahead and fix some of the errors we have found. A word of caution: you may need to talk to the data stewards about the best way to correct the data. You don't want to attempt to fix measurement error and introduce additional problems. Institutional knowledge of the data is very helpful before correcting errors.

Let's fix the case where race was equal to 4 when there are only 3 categories.

list idcode race if race==4
replace race=1 if idcode == 543
tabulate race
      | idcode   race |
      |---------------|
2013. |    543      4 |
      +---------------+

(1 real change made)

       race |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,637       72.89       72.89
          2 |        583       25.96       98.84
          3 |         26        1.16      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s add a note for future users, which is good practice for replication.

note race: race changed to 1 (from 4) for idcode 543

Let's fix the college graduate flag when there were only 8 years of education.

list idcode collgrad if yrschool < 12 & collgrad == 1
replace collgrad = 0 if idcode == 107
tab yrschool collgrad
note collgrad: collgrad changed to 0 (from 1) for idcode 107
      | idcode   collgrad |
      |-------------------|
2198. |    107          1 |
      +-------------------+

(1 real change made)


  Years of |
    school |   college graduate
 completed |         0          1 |     Total
-----------+----------------------+----------
         8 |        70          0 |        70 
         9 |        55          0 |        55 
        10 |        84          0 |        84 
        11 |       123          0 |       123 
        12 |       943          0 |       943 
        13 |       174          2 |       176 
        14 |       180          7 |       187 
        15 |        81         11 |        92 
        16 |         0        252 |       252 
        17 |         0        106 |       106 
        18 |         0        154 |       154 
-----------+----------------------+----------
     Total |     1,710        532 |     2,242 

Let's fix ages where the digits were switched.

list idcode age if age > 50
replace age = 38 if idcode == 51
replace age = 45 if idcode == 80
tab age
note age: the value of 83 was corrected to 38 for idcode 51
note age: the value of 54 was corrected to 45 for idcode 80
      | idcode   age |
      |--------------|
2205. |     80    54 |
2219. |     51    83 |
      +--------------+

(1 real change made)

(1 real change made)

     age in |
    current |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         21 |          2        0.09        0.09
         22 |          8        0.36        0.45
         23 |         16        0.71        1.16
         24 |         35        1.56        2.72
         25 |         44        1.96        4.67
         26 |         50        2.23        6.90
         27 |         50        2.23        9.13
         28 |         63        2.80       11.93
         29 |         49        2.18       14.11
         30 |         55        2.45       16.56
         31 |         57        2.54       19.10
         32 |         60        2.67       21.77
         33 |         56        2.49       24.27
         34 |         95        4.23       28.50
         35 |        213        9.48       37.98
         36 |        224        9.97       47.95
         37 |        171        7.61       55.57
         38 |        176        7.84       63.40
         39 |        167        7.44       70.84
         40 |        139        6.19       77.03
         41 |        148        6.59       83.62
         42 |        111        4.94       88.56
         43 |        110        4.90       93.46
         44 |         98        4.36       97.82
         45 |         46        2.05       99.87
         46 |          1        0.04       99.91
         47 |          1        0.04       99.96
         48 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look at our notes.

note
_dta:
  1.  This is a hypothetical dataset and should not be used for analysis purposes

age:
  1.  the value of 83 was corrected to 38 for idcode 51
  2.  the value of 54 was corrected to 45 for idcode 80

race:
  1.  race changed to 1 (from 4) for idcode 543

collgrad:
  1.  collgrad changed to 0 (from 1) for idcode 107

4.8 Identifying duplicates

We need to be careful with duplicate observations. When working with panel data we will want multiple observations of the same cross-sectional unit, but those observations should be for different time periods; the same unit should never appear twice in the same period.
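As a sketch, assuming hypothetical panel identifiers id and year (not variables in the dataset below), the uniqueness of unit-time pairs can be checked like this:

```stata
* Hypothetical panel-data check: each id should appear at most once per year
duplicates report id year
duplicates list id year
```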

use "dentists_dups.dta", clear

We can start with the duplicates list command to find rows that are completely the same. This command will show all the duplicates. Note: for duplicate records to be found, every value must be the same.

duplicates list
Duplicates in terms of all variables

  +-----------------------------------------------------------+
  | group:   obs:             name   years   fulltime   recom |
  |-----------------------------------------------------------|
  |      1      4       Mike Avity     8.5          0       0 |
  |      1      6       Mike Avity     8.5          0       0 |
  |      1      8       Mike Avity     8.5          0       0 |
  |      2      1   Olive Tu'Drill   10.25          1       1 |
  |      2     11   Olive Tu'Drill   10.25          1       1 |
  |-----------------------------------------------------------|
  |      3      2     Ruth Canaale      22          1       1 |
  |      3      3     Ruth Canaale      22          1       1 |
  +-----------------------------------------------------------+

The duplicates examples command gives a more condensed view, showing one example record for each group of duplicates.

duplicates examples 
Duplicates in terms of all variables

  +-------------------------------------------------------------------+
  | group:   #   e.g. obs             name   years   fulltime   recom |
  |-------------------------------------------------------------------|
  |      1   3          4       Mike Avity     8.5          0       0 |
  |      2   2          1   Olive Tu'Drill   10.25          1       1 |
  |      3   2          2     Ruth Canaale      22          1       1 |
  +-------------------------------------------------------------------+

We can get a concise summary of duplicate records with duplicates report.

duplicates report
Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            4             0
        2 |            4             2
        3 |            3             2
--------------------------------------

The duplicates tag command generates a variable containing, for each observation, the number of other observations that duplicate it. (The separator(#) option of list draws a line in the table after every # rows.)

duplicates tag, generate(dup)
list, separator(2)
Duplicates in terms of all variables

     +----------------------------------------------------+
     |              name   years   fulltime   recom   dup |
     |----------------------------------------------------|
  1. |    Olive Tu'Drill   10.25          1       1     1 |
  2. |      Ruth Canaale      22          1       1     1 |
     |----------------------------------------------------|
  3. |      Ruth Canaale      22          1       1     1 |
  4. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  5. |        Mary Smith       3          1       1     0 |
  6. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  7. | Y. Don Uflossmore    7.25          0       1     0 |
  8. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  9. |        Mary Smith      27          0       0     0 |
 10. | Isaac O'Yerbreath   32.75          1       1     0 |
     |----------------------------------------------------|
 11. |    Olive Tu'Drill   10.25          1       1     1 |
     +----------------------------------------------------+

Let’s sort our data for better legibility, and then we will add lines in our output table whenever there is a new group.

    sort name years
    list, sepby(name years)
     |              name   years   fulltime   recom   dup |
     |----------------------------------------------------|
  1. | Isaac O'Yerbreath   32.75          1       1     0 |
     |----------------------------------------------------|
  2. |        Mary Smith       3          1       1     0 |
     |----------------------------------------------------|
  3. |        Mary Smith      27          0       0     0 |
     |----------------------------------------------------|
  4. |        Mike Avity     8.5          0       0     2 |
  5. |        Mike Avity     8.5          0       0     2 |
  6. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  7. |    Olive Tu'Drill   10.25          1       1     1 |
  8. |    Olive Tu'Drill   10.25          1       1     1 |
     |----------------------------------------------------|
  9. |      Ruth Canaale      22          1       1     1 |
 10. |      Ruth Canaale      22          1       1     1 |
     |----------------------------------------------------|
 11. | Y. Don Uflossmore    7.25          0       1     0 |
     +----------------------------------------------------+

Notice: There are two observations for Mary Smith, but one has 3 years of experience and the other one has 27 years, so it does not appear as a duplicate.
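The same logic applies when checking duplicates on a subset of variables: restricted to name alone, both Mary Smith records would be flagged. A minimal sketch:

```stata
* Checking duplicates on name only -- this would flag both Mary Smith
* records even though their years of experience differ
duplicates list name
```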

If there were too many variables to list, we could use the data browser.

browse if dup > 0
request ignored because of batch mode

We will want to drop our duplicates while keeping one observation from each group. The duplicates drop command drops surplus duplicate records but keeps one observation of each duplicate group.

duplicates drop

Let’s use a new dataset that is a bit more practical.

use "wws.dta", replace

Use duplicates list with the variable of interest.

duplicates list idcode
Duplicates in terms of idcode

(0 observations are duplicates)

There are no duplicate idcodes, so we should expect that duplicates list will not return any duplicates either, since every single value needs to be the same for a record to be a duplicate.

duplicates list 
Duplicates in terms of all variables

(0 observations are duplicates)

Let’s use a dataset that does have duplicates.

use "wws_dups.dta", clear

Let’s check for duplicate idcodes.

duplicates list idcode, sepby(idcode)
Duplicates in terms of idcode

  +------------------------+
  | group:   obs:   idcode |
  |------------------------|
  |      1   1088     2831 |
  |      1   2248     2831 |
  |------------------------|
  |      2   1244     3905 |
  |      2   1245     3905 |
  |------------------------|
  |      3    277     4214 |
  |      3   2247     4214 |
  +------------------------+

I prefer duplicates list var1 to isid, since isid throws an error that will halt the do-file.
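If you do want the do-file to continue after a failed isid check, one option (a sketch, not the only approach) is to wrap isid in capture and inspect the return code:

```stata
* capture suppresses the error so the do-file keeps running;
* _rc is nonzero when idcode is not a unique identifier
capture isid idcode
if _rc != 0 {
    display as error "idcode does not uniquely identify the observations"
}
```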

Let's generate a dup variable identifying, for each observation, the number of other observations sharing its idcode.

duplicates tag idcode, generate(iddup)
Duplicates in terms of idcode

Let’s generate a different duplicate variable to find complete duplicate records. Notice that we do not specify a variable after tag.

duplicates tag, generate(alldup)
Duplicates in terms of all variables

Let’s list the data.

list idcode age race yrschool occupation wage if iddup==1 & alldup==0
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
1244. |   3905    36      1         14         11   4.339774 |
1245. |   3905    41      1         10          5   7.004828 |
      +------------------------------------------------------+

It seems that the second record with idcode 3905 is a different person with a possibly incorrect idcode. Let's fix that.

replace idcode=5160 if idcode==3905 & age==41

Let’s see if it is resolved, and it is.

list idcode age race yrschool occupation wage if iddup==1 & alldup==0
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
1244. |   3905    36      1         14         11   4.339774 |
1245. |   5160    41      1         10          5   7.004828 |
      +------------------------------------------------------+

Let's look at observations that are complete duplicates.

duplicates report
list idcode age race yrschool occupation wage if iddup==1 & alldup==1
Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |         2244             0
        2 |            4             2
--------------------------------------

      +------------------------------------------------------+
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
 277. |   4214    35      1         17         13   11.27214 |
1088. |   2831    37      2          8          8   2.697261 |
2247. |   4214    35      1         17         13   11.27214 |
2248. |   2831    37      2          8          8   2.697261 |
      +------------------------------------------------------+

We need to keep one observation from each group of duplicates.

duplicates drop
duplicates report
Duplicates in terms of all variables

(2 observations deleted)


Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |         2246             0
--------------------------------------

Practice

Let's work with the Census CPS. Pull the CPS data: (1) download the file and import it, or (2) use Stata to pull the file directly.

import delimited using "https://www2.census.gov/programs-surveys/cps/datasets/2023/basic/jun23pub.csv", clear

Let's save a copy of the csv file as a Stata dataset.
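A minimal sketch (the output filename is an assumption):

```stata
* Save the imported csv as a Stata dataset for faster reloading
save "jun23pub.dta", replace
```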

Let’s grab the data dictionary. Hint search: “Census CPS 2023 data dictionary”

Let’s check our data.

  1. What variable contains our labor force status?
  2. What variable contains our weekly earnings?
  3. Does everyone employed have weekly earnings? Hint: Try summarize
  4. What flag accompanies our weekly earnings and provides information about data availability?
  5. Try tabstat with the flag and weekly earnings. What do you find?
  6. Try tabstat with statistics(n mean sd p25 p50 p75)
  7. What are our identifier variable(s)? How many did you find?
  8. Are there any duplicate records?
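For question 8, a possible starting point is to check for duplicates on the household and person identifiers. The variable names below are assumptions based on the CPS basic monthly files; confirm them against the data dictionary:

```stata
* hrhhid, hrhhid2, and pulineno are assumed CPS identifiers --
* verify against the data dictionary before relying on this check
duplicates report hrhhid hrhhid2 pulineno
duplicates list hrhhid hrhhid2 pulineno
```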