Econ 645

Week 4: Fixed Effects (Within) and Random Effects

Samuel Rowe
Adapted from Wooldridge

. clear 

. set more off

*Set Working Directory

. cd "/Users/Sam/Desktop/Econ 645/Data/Wooldridge"
/Users/Sam/Desktop/Econ 645/Data/Wooldridge

Wooldridge

Union Premium

Lesson: fixed effects let us control for unobserved, time-invariant confounders, such as ability and a preference for unionization.

. use "wagepan.dta", clear

Set the Panel

. xtset nr year
       panel variable:  nr (strongly balanced)
        time variable:  year, 1980 to 1987
                delta:  1 unit
Pooled OLS
. eststo OLS: reg lwage c.edu exper expersq i.black i.south i.married i.union i.d8*

      Source │       SS           df       MS      Number of obs   =     4,360
─────────────┼──────────────────────────────────   F(14, 4345)     =     73.43
       Model │  236.577196        14  16.8983712   Prob > F        =    0.0000
    Residual │  999.952446     4,345  .230138653   R-squared       =    0.1913
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1887
       Total │  1236.52964     4,359  .283672779   Root MSE        =    .47973

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │   .0895815   .0051762    17.31   0.000     .0794334    .0997295
       exper │   .0665418   .0136722     4.87   0.000     .0397372    .0933463
     expersq │  -.0023832   .0008183    -2.91   0.004    -.0039875   -.0007789
     1.black │  -.1275978   .0236068    -5.41   0.000    -.1738791   -.0813165
     1.south │  -.0529354   .0155534    -3.40   0.001    -.0834281   -.0224428
   1.married │   .1122106   .0157149     7.14   0.000     .0814014    .1430198
     1.union │   .1802554   .0171339    10.52   0.000     .1466642    .2138466
       1.d81 │    .058693   .0303147     1.94   0.053    -.0007393    .1181253
       1.d82 │   .0638086   .0331714     1.92   0.054    -.0012242    .1288414
       1.d83 │   .0633355   .0366129     1.73   0.084    -.0084444    .1351153
       1.d84 │   .0915252   .0400387     2.29   0.022     .0130289    .1700216
       1.d85 │   .1105424   .0432973     2.55   0.011     .0256575    .1954272
       1.d86 │   .1437822   .0463663     3.10   0.002     .0528806    .2346837
       1.d87 │   .1756699   .0493736     3.56   0.000     .0788725    .2724673
       _cons │   .1333638   .0774193     1.72   0.085    -.0184175     .285145
─────────────┴────────────────────────────────────────────────────────────────
If we use FE or FD, we cannot estimate the effects of race or education, since they are constant over time. Experience does change, but only one-for-one with the calendar year, so its effect is not separately identified from the full set of year dummies (note that Stata drops 1.d87 instead).
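To see why these variables drop out, here is a minimal sketch of the within transformation, in Python rather than Stata and with a made-up two-worker panel: demeaning each worker's data wipes out anything that is constant within a worker.

```python
import numpy as np

def within_transform(x, ids):
    """Within (demeaning) transformation: subtract each unit's time mean."""
    out = np.empty_like(x, dtype=float)
    for i in np.unique(ids):
        m = ids == i
        out[m] = x[m] - x[m].mean()
    return out

# Toy panel: two workers observed for three years each (illustrative numbers)
ids   = np.array([1, 1, 1, 2, 2, 2])
educ  = np.array([12, 12, 12, 16, 16, 16])          # time-invariant
lwage = np.array([1.9, 2.0, 2.2, 2.4, 2.3, 2.5])

print(within_transform(educ, ids))   # all zeros: educ cannot be estimated
print(within_transform(lwage, ids))  # deviations from each worker's mean
```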
FE Within
. eststo Within: xtreg lwage c.edu exper expersq i.black i.south i.married i.union i.d8*, fe
note: educ omitted because of collinearity
note: 1.black omitted because of collinearity
note: 1.d87 omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1815                                         min =          8
     between = 0.0009                                         avg =        8.0
     overall = 0.0497                                         max =          8

                                                F(11,3804)        =      76.71
corr(u_i, Xb)  = -0.1739                        Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │          0  (omitted)
       exper │   .1313318   .0098277    13.36   0.000     .1120638    .1505999
     expersq │  -.0051475   .0007043    -7.31   0.000    -.0065284   -.0037666
     1.black │          0  (omitted)
     1.south │   .1018493   .0479605     2.12   0.034     .0078186      .19588
   1.married │   .0462057   .0183034     2.52   0.012     .0103203     .082091
     1.union │   .0809394   .0193065     4.19   0.000     .0430874    .1187914
       1.d81 │   .0190821   .0203532     0.94   0.349    -.0208222    .0589864
       1.d82 │  -.0117214   .0202191    -0.58   0.562    -.0513627    .0279199
       1.d83 │  -.0425071   .0203126    -2.09   0.036    -.0823318   -.0026825
       1.d84 │  -.0378955   .0203069    -1.87   0.062    -.0777089    .0019179
       1.d85 │  -.0427598   .0202377    -2.11   0.035    -.0824377   -.0030819
       1.d86 │  -.0276303   .0203773    -1.36   0.175    -.0675818    .0123211
       1.d87 │          0  (omitted)
       _cons │   .9952994    .033587    29.63   0.000      .929449     1.06115
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .40924292
     sigma_e │  .35082825
         rho │  .57640294   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(544, 3804) = 9.54                   Prob > F = 0.0000
Compare
. esttab OLS Within, mtitles se scalars(F r2) drop(0.d* 0.union 0.married)

────────────────────────────────────────────
                      (1)             (2)   
                      OLS          Within   
────────────────────────────────────────────
educ               0.0896***            0   
                (0.00518)             (.)   

exper              0.0665***        0.131***
                 (0.0137)       (0.00983)   

expersq          -0.00238**      -0.00515***
               (0.000818)      (0.000704)   

0.black                 0               0   
                      (.)             (.)   

1.black            -0.128***            0   
                 (0.0236)             (.)   

0.south                 0               0   
                      (.)             (.)   

1.south           -0.0529***        0.102*  
                 (0.0156)        (0.0480)   

1.married           0.112***       0.0462*  
                 (0.0157)        (0.0183)   

1.union             0.180***       0.0809***
                 (0.0171)        (0.0193)   

1.d81              0.0587          0.0191   
                 (0.0303)        (0.0204)   

1.d82              0.0638         -0.0117   
                 (0.0332)        (0.0202)   

1.d83              0.0633         -0.0425*  
                 (0.0366)        (0.0203)   

1.d84              0.0915*        -0.0379   
                 (0.0400)        (0.0203)   

1.d85               0.111*        -0.0428*  
                 (0.0433)        (0.0202)   

1.d86               0.144**       -0.0276   
                 (0.0464)        (0.0204)   

1.d87               0.176***            0   
                 (0.0494)             (.)   

_cons               0.133           0.995***
                 (0.0774)        (0.0336)   
────────────────────────────────────────────
N                    4360            4360   
F                   73.43           76.71   
r2                  0.191           0.182   
────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

After controlling for time-invariant individual heterogeneity with fixed effects, the pooled OLS union premium appears to be biased upward.

Pooled OLS Wage Premium Estimate

. display (exp(.180)-1)*100
19.721736

FE Within Wage Premium Estimate

. display (exp(.0809)-1)*100
8.4262465
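The conversion behind these display commands is 100·(exp(b) − 1), which turns a log-wage coefficient b into an exact percent premium. A minimal Python equivalent, using the union coefficients from the tables above (0.180 for pooled OLS, 0.0809 for FE):

```python
import math

def premium_pct(b):
    """Exact percent wage premium implied by a log-wage coefficient b."""
    return (math.exp(b) - 1) * 100

print(premium_pct(0.180))   # pooled OLS union coefficient: about 19.7
print(premium_pct(0.0809))  # FE (within) union coefficient: about 8.4
```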
Plot the Coefficients of Interest

Credit: John Kane: Making Regression Coefficients Plots in Stata

. quietly reg lwage c.edu exper expersq i.black i.south i.married i.union i.d8*

. estimates store pooled

. quietly xtreg lwage c.edu i.black i.south i.married i.union i.d8*, fe

. estimates store fe

. coefplot (pooled, label("{bf:Pooled OLS}") mcolor(midblue) mlcolor(cyan) ///
>   ciopts(lcolor(magenta midblue))) /// options for first group
>   (fe, label("{bf: Within}") mcolor(green) mlcolor(lime) ///
>   ciopts(lcolor(lime green))), /// options for second group
>   title(Union and Marriage Wage Premiums) ///
>   drop(_cons 1.d* 0.black 0.south) ///
>   xline(0, lcolor(red) lwidth(medium)) scheme(jet_white) ///
>   xtitle("{bf: Effect of Union}") ///
>   graphregion(margin(small)) ///
>   coeflabels(educ="Education" exper="Experience" expersq="Experience Squared" ///
>   1.black="Black" 1.south="South" 1.married="Married" ///
>   1.union="Union") ///
>   msize(large) mcolor(%85) mlwidth(medium) msymbol(circle) /// marker options
>   levels(95 90) ciopts(lwidth(medthick thick) recast(rspike rcap)) ///ci options for all groups
>   legend(ring(1) col(1) pos(3) size(medsmall))
(note: scheme jet_white not found, using s2color)

.   graph export "/Users/Sam/Desktop/Econ 645/Stata/week4_union_wage_premium.png", replace
(file /Users/Sam/Desktop/Econ 645/Stata/week4_union_wage_premium.png written in PNG format)

Have returns to education changed over time?

Lesson: we can interact time dummies with time-invariant variables, such as education, to test whether returns to education have changed over time.

With fixed effects or first differencing, we cannot estimate the level effects of variables that do not vary over time, such as sex, race, or education (assuming education is static). But if we interact education with time dummies, we can assess whether the returns to education have increased over time.

We can test whether the returns to education are constant over time.
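Why do the education-by-year interactions survive fixed effects when educ itself does not? A small Python sketch with toy data makes the point: demeaning within each worker zeros out educ, but not a year dummy times educ, because the interaction varies over time within a worker.

```python
import numpy as np

def demean(x, ids):
    """Within transformation: subtract each unit's own time mean."""
    out = np.empty_like(x, dtype=float)
    for i in np.unique(ids):
        m = ids == i
        out[m] = x[m] - x[m].mean()
    return out

ids  = np.array([1, 1, 2, 2])              # two workers, two years (toy data)
d87  = np.array([0, 1, 0, 1])              # year dummy for the second year
educ = np.array([12.0, 12.0, 16.0, 16.0])  # time-invariant schooling

print(demean(educ, ids))        # all zeros: the level of educ drops out
print(demean(d87 * educ, ids))  # nonzero: the interaction is identified
```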

. use "wagepan.dta", clear

Vella and Verbeek (1998) estimate whether the returns to education have changed over time. Some of our variables, such as union status and marital status, do vary over time. Experience also grows, but at a constant rate. A few variables do not change (or we would not expect them to), such as race and education (for older workers).

We use the natural log of wages, which has nice properties: it is more normally distributed, and coefficients can be read as (semi-)elasticities. Adding time-period dummies also takes care of inflation.

Set up the Panel

. xtset nr year
       panel variable:  nr (strongly balanced)
        time variable:  year, 1980 to 1987
                delta:  1 unit
Pooled OLS
. eststo OLS: reg lwage c.edu##i.d8* exper expersq i.married i.union

      Source │       SS           df       MS      Number of obs   =     4,360
─────────────┼──────────────────────────────────   F(19, 4340)     =     50.92
       Model │  225.412805        19  11.8638318   Prob > F        =    0.0000
    Residual │  1011.11684     4,340   .23297623   R-squared       =    0.1823
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1787
       Total │  1236.52964     4,359  .283672779   Root MSE        =    .48268

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │    .081673   .0125675     6.50   0.000     .0570343    .1063117
       1.d81 │  -.0356958    .199359    -0.18   0.858    -.4265413    .3551497
       1.d82 │  -.0315288   .1998095    -0.16   0.875    -.4232575       .3602
       1.d83 │  -.0342801   .2007839    -0.17   0.864    -.4279192    .3593589
       1.d84 │   .0242933   .2025167     0.12   0.905    -.3727429    .4213294
       1.d85 │   .0058838   .2052301     0.03   0.977    -.3964719    .4082395
       1.d86 │   .0251586   .2092184     0.12   0.904    -.3850164    .4353336
       1.d87 │   .0372565   .2148364     0.17   0.862    -.3839326    .4584456
             │
  d81#c.educ │
          1  │   .0084448   .0167792     0.50   0.615     -.024451    .0413407
             │
  d82#c.educ │
          1  │   .0088899   .0168742     0.53   0.598    -.0241921     .041972
             │
  d83#c.educ │
          1  │   .0093544   .0170326     0.55   0.583    -.0240381     .042747
             │
  d84#c.educ │
          1  │   .0070671   .0172551     0.41   0.682    -.0267617    .0408958
             │
  d85#c.educ │
          1  │   .0104027   .0175306     0.59   0.553    -.0239662    .0447716
             │
  d86#c.educ │
          1  │   .0116562   .0178614     0.65   0.514    -.0233613    .0466737
             │
  d87#c.educ │
          1  │   .0134166   .0182525     0.74   0.462    -.0223676    .0492008
             │
       exper │   .0568876   .0154436     3.68   0.000     .0266102     .087165
     expersq │   -.001919   .0009455    -2.03   0.042    -.0037726   -.0000654
   1.married │   .1229473   .0155752     7.89   0.000     .0924119    .1534827
     1.union │   .1720565   .0171378    10.04   0.000     .1384575    .2056554
       _cons │   .2175863   .1641736     1.33   0.185    -.1042777    .5394503
─────────────┴────────────────────────────────────────────────────────────────
FE (Within)

If we use FE or FD, we cannot estimate the level effects of race, education, or experience, but we can include their interactions with the year dummies.

. eststo Within: xtreg lwage c.edu##i.d8* i.married i.union, fe
note: educ omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1708                                         min =          8
     between = 0.1900                                         avg =        8.0
     overall = 0.1325                                         max =          8

                                                F(16,3799)        =      48.91
corr(u_i, Xb)  = 0.0991                         Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │          0  (omitted)
       1.d81 │  -.0224158   .1458885    -0.15   0.878    -.3084431    .2636114
       1.d82 │  -.0057611   .1458558    -0.04   0.968    -.2917243    .2802021
       1.d83 │   .0104297   .1458579     0.07   0.943    -.2755377    .2963971
       1.d84 │   .0843743   .1458518     0.58   0.563    -.2015811    .3703297
       1.d85 │   .0497253   .1458602     0.34   0.733    -.2362465    .3356971
       1.d86 │   .0656064   .1458917     0.45   0.653    -.2204273    .3516401
       1.d87 │   .0904448   .1458505     0.62   0.535     -.195508    .3763977
             │
  d81#c.educ │
          1  │   .0115854   .0122625     0.94   0.345    -.0124562    .0356271
             │
  d82#c.educ │
          1  │   .0147905   .0122635     1.21   0.228    -.0092533    .0388342
             │
  d83#c.educ │
          1  │   .0171182   .0122633     1.40   0.163    -.0069251    .0411615
             │
  d84#c.educ │
          1  │   .0165839   .0122657     1.35   0.176     -.007464    .0406319
             │
  d85#c.educ │
          1  │   .0237085   .0122738     1.93   0.053    -.0003554    .0477725
             │
  d86#c.educ │
          1  │   .0274123    .012274     2.23   0.026     .0033481    .0514765
             │
  d87#c.educ │
          1  │   .0304332   .0122723     2.48   0.013     .0063722    .0544942
             │
   1.married │   .0548205   .0184126     2.98   0.003      .018721      .09092
     1.union │   .0829785   .0194461     4.27   0.000     .0448527    .1211042
       _cons │   1.362459   .0162385    83.90   0.000     1.330622    1.394296
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .37264193
     sigma_e │  .35335713
         rho │  .52654439   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(544, 3799) = 8.09                   Prob > F = 0.0000
These are changes in the returns to education relative to the base year of 1980, and only the 1986×education and 1987×education interactions appear to be statistically significant.
Compare
. esttab OLS Within, mtitles se scalars(F r2) drop(0.d* 0.union 0.married)

────────────────────────────────────────────
                      (1)             (2)   
                      OLS          Within   
────────────────────────────────────────────
educ               0.0817***            0   
                 (0.0126)             (.)   

1.d81             -0.0357         -0.0224   
                  (0.199)         (0.146)   

1.d82             -0.0315        -0.00576   
                  (0.200)         (0.146)   

1.d83             -0.0343          0.0104   
                  (0.201)         (0.146)   

1.d84              0.0243          0.0844   
                  (0.203)         (0.146)   

1.d85             0.00588          0.0497   
                  (0.205)         (0.146)   

1.d86              0.0252          0.0656   
                  (0.209)         (0.146)   

1.d87              0.0373          0.0904   
                  (0.215)         (0.146)   

1.d81#c.educ      0.00844          0.0116   
                 (0.0168)        (0.0123)   

1.d82#c.educ      0.00889          0.0148   
                 (0.0169)        (0.0123)   

1.d83#c.educ      0.00935          0.0171   
                 (0.0170)        (0.0123)   

1.d84#c.educ      0.00707          0.0166   
                 (0.0173)        (0.0123)   

1.d85#c.educ       0.0104          0.0237   
                 (0.0175)        (0.0123)   

1.d86#c.educ       0.0117          0.0274*  
                 (0.0179)        (0.0123)   

1.d87#c.educ       0.0134          0.0304*  
                 (0.0183)        (0.0123)   

exper              0.0569***                
                 (0.0154)                   

expersq          -0.00192*                  
               (0.000945)                   

1.married           0.123***       0.0548** 
                 (0.0156)        (0.0184)   

1.union             0.172***       0.0830***
                 (0.0171)        (0.0194)   

_cons               0.218           1.362***
                  (0.164)        (0.0162)   
────────────────────────────────────────────
N                    4360            4360   
F                   50.92           48.91   
r2                  0.182           0.171   
────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Returns to education have increased by about 3.1% between 1980 and 1987.

. display (exp(0.0304)-1)*100
3.0866798
Plot the Coefficients
. quietly reg lwage c.edu##i.d8* exper expersq i.married i.union

. estimates store pooled

. quietly xtreg lwage c.edu##i.d8* i.married i.union, fe

. estimates store fe

. coefplot ///
>   (pooled, label("{bf:Pooled OLS}") mcolor(midblue) mlcolor(cyan) ///
>     ciopts(lcolor(magenta midblue))) /// options for first group
>   (fe, label("{bf: Within}") mcolor(green) mlcolor(lime) ///
>     ciopts(lcolor(lime green))), /// options for second group
>   title("Change in Return to Education") ///
>   keep(educ 1.d81#c.educ 1.d82#c.educ 1.d83#c.educ 1.d84#c.educ ///
>     1.d85#c.educ 1.d86#c.educ 1.d87#c.educ) ///
>   xline(0, lcolor(red) lwidth(medium)) scheme(jet_white) ///
>   xtitle("{bf: Coefficients}") ///
>   graphregion(margin(small)) ///
>   coeflabels(educ="Education" 1.d81#c.educ="1981 Returns" ///
>     1.d82#c.educ="1982 Returns" 1.d83#c.educ="1983 Returns" ///
>     1.d84#c.educ="1984 Returns" 1.d85#c.educ="1985 Returns" ///
>     1.d86#c.educ="1986 Returns" 1.d87#c.educ="1987 Returns") ///
>   msize(large) mcolor(%85) mlwidth(medium) msymbol(circle) /// marker options
>   levels(95 90) ciopts(lwidth(medthick thick) recast(rspike rcap)) ///ci options for all groups
>   legend(ring(1) col(1) pos(3) size(medsmall))
(note: scheme jet_white not found, using s2color)

. graph export "/Users/Sam/Desktop/Econ 645/Stata/week4_edu_returns.png", replace
(file /Users/Sam/Desktop/Econ 645/Stata/week4_edu_returns.png written in PNG format)

Test for Serial Correlation

. xtreg lwage c.edu##i.d8* i.married i.union, fe
note: educ omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1708                                         min =          8
     between = 0.1900                                         avg =        8.0
     overall = 0.1325                                         max =          8

                                                F(16,3799)        =      48.91
corr(u_i, Xb)  = 0.0991                         Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │          0  (omitted)
       1.d81 │  -.0224158   .1458885    -0.15   0.878    -.3084431    .2636114
       1.d82 │  -.0057611   .1458558    -0.04   0.968    -.2917243    .2802021
       1.d83 │   .0104297   .1458579     0.07   0.943    -.2755377    .2963971
       1.d84 │   .0843743   .1458518     0.58   0.563    -.2015811    .3703297
       1.d85 │   .0497253   .1458602     0.34   0.733    -.2362465    .3356971
       1.d86 │   .0656064   .1458917     0.45   0.653    -.2204273    .3516401
       1.d87 │   .0904448   .1458505     0.62   0.535     -.195508    .3763977
             │
  d81#c.educ │
          1  │   .0115854   .0122625     0.94   0.345    -.0124562    .0356271
             │
  d82#c.educ │
          1  │   .0147905   .0122635     1.21   0.228    -.0092533    .0388342
             │
  d83#c.educ │
          1  │   .0171182   .0122633     1.40   0.163    -.0069251    .0411615
             │
  d84#c.educ │
          1  │   .0165839   .0122657     1.35   0.176     -.007464    .0406319
             │
  d85#c.educ │
          1  │   .0237085   .0122738     1.93   0.053    -.0003554    .0477725
             │
  d86#c.educ │
          1  │   .0274123    .012274     2.23   0.026     .0033481    .0514765
             │
  d87#c.educ │
          1  │   .0304332   .0122723     2.48   0.013     .0063722    .0544942
             │
   1.married │   .0548205   .0184126     2.98   0.003      .018721      .09092
     1.union │   .0829785   .0194461     4.27   0.000     .0448527    .1211042
       _cons │   1.362459   .0162385    83.90   0.000     1.330622    1.394296
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .37264193
     sigma_e │  .35335713
         rho │  .52654439   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(544, 3799) = 8.09                   Prob > F = 0.0000

Predict the residuals, u

. predict u, resid

Our null hypothesis is that there is no serial correlation, i.e., that the coefficient on the lagged residual is zero. We regress u on its one-period lag, an AR(1) model without a constant.

. reg u l.u, noconst

      Source │       SS           df       MS      Number of obs   =     3,815
─────────────┼──────────────────────────────────   F(1, 3814)      =   2181.06
       Model │  332.737937         1  332.737937   Prob > F        =    0.0000
    Residual │  581.856111     3,814  .152557973   R-squared       =    0.3638
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3636
       Total │  914.594048     3,815  .239736317   Root MSE        =    .39059

─────────────┬────────────────────────────────────────────────────────────────
           u │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
           u │
         L1. │   .5857228   .0125418    46.70   0.000     .5611336     .610312
─────────────┴────────────────────────────────────────────────────────────────

We can see that we have positive serial correlation: the coefficient on the lagged residual is positive and statistically significant. We will need to cluster our standard errors at the individual level (in Stata, the vce(cluster nr) option) to account for it.
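The test can also be replicated outside Stata: pair each residual with its own one-period lag within the same worker and compute the no-constant OLS slope. A minimal Python sketch on simulated residuals (the 50 workers, the years, and the true AR(1) coefficient of 0.6 are all made up for illustration):

```python
import numpy as np

def ar1_coef(u, ids, years):
    """OLS slope of u_it on u_{i,t-1} with no constant, lagging within unit."""
    order = np.lexsort((years, ids))          # sort by unit, then year
    u, ids, years = u[order], ids[order], years[order]
    # valid lag pairs: same unit, consecutive years
    pair = (ids[1:] == ids[:-1]) & (years[1:] == years[:-1] + 1)
    y, x = u[1:][pair], u[:-1][pair]
    return float(x @ y) / float(x @ x)

# Simulated residuals for 50 workers over 1980-1987 with true AR(1) rho = 0.6
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(50), 8)
years = np.tile(np.arange(1980, 1988), 50)
u = np.empty(400)
for i in range(400):
    shock = rng.standard_normal()
    u[i] = shock if years[i] == 1980 else 0.6 * u[i - 1] + shock

print(ar1_coef(u, ids, years))  # positive: evidence of serial correlation
```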

Wage Gap

If we use FE or FD, we cannot estimate the level effects of race, education, or experience, but we can include their interactions with the year dummies.

Let’s test this for a wage gap between Black workers and non-Black workers. With the Fixed Effects (Within) estimator we cannot estimate the level of the wage gap, so we use Pooled OLS for that; FE can still pick up changes in the gap over time through the year-dummy interactions.
Pooled OLS
. est clear

. eststo OLS: reg lwage c.edu##i.d8* i.black##i.d8* exper expersq i.married i.union

      Source │       SS           df       MS      Number of obs   =     4,360
─────────────┼──────────────────────────────────   F(27, 4332)     =     37.78
       Model │  235.665339        27  8.72834589   Prob > F        =    0.0000
    Residual │   1000.8643     4,332  .231039774   R-squared       =    0.1906
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1855
       Total │  1236.52964     4,359  .283672779   Root MSE        =    .48067

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │   .0830098   .0125217     6.63   0.000     .0584609    .1075588
       1.d81 │  -.0412547   .1993287    -0.21   0.836    -.4320409    .3495315
       1.d82 │  -.0152276   .1997638    -0.08   0.939    -.4068669    .3764117
       1.d83 │  -.0268885   .2007155    -0.13   0.893    -.4203936    .3666166
       1.d84 │   .0257253    .202407     0.13   0.899     -.371096    .4225465
       1.d85 │   .0252159   .2050566     0.12   0.902       -.3768    .4272319
       1.d86 │   .0228682   .2089636     0.11   0.913    -.3868074    .4325438
       1.d87 │   .0552934   .2144675     0.26   0.797    -.3651726    .4757594
             │
  d81#c.educ │
          1  │   .0083812   .0167209     0.50   0.616    -.0244003    .0411627
             │
  d82#c.educ │
          1  │   .0080581   .0168149     0.48   0.632    -.0249077     .041024
             │
  d83#c.educ │
          1  │   .0087321    .016972     0.51   0.607    -.0245417    .0420059
             │
  d84#c.educ │
          1  │   .0065985   .0171926     0.38   0.701    -.0271079    .0403049
             │
  d85#c.educ │
          1  │   .0091917   .0174659     0.53   0.599    -.0250503    .0434338
             │
  d86#c.educ │
          1  │   .0109991   .0177943     0.62   0.537    -.0238868    .0458851
             │
  d87#c.educ │
          1  │   .0119645   .0181818     0.66   0.511    -.0236811    .0476102
             │
     1.black │  -.0821357   .0645627    -1.27   0.203    -.2087116    .0444401
             │
   black#d81 │
        1 1  │   .0350059   .0911407     0.38   0.701    -.1436765    .2136883
             │
   black#d82 │
        1 1  │  -.0989736   .0911617    -1.09   0.278    -.2776971    .0797499
             │
   black#d83 │
        1 1  │  -.0594933   .0911759    -0.65   0.514    -.2382447    .1192581
             │
   black#d84 │
        1 1  │  -.0446436   .0911938    -0.49   0.624    -.2234302     .134143
             │
   black#d85 │
        1 1  │  -.1412043   .0912107    -1.55   0.122     -.320024    .0376154
             │
   black#d86 │
        1 1  │  -.0280978    .091229    -0.31   0.758    -.2069534    .1507578
             │
   black#d87 │
        1 1  │  -.1448229   .0912847    -1.59   0.113    -.3237875    .0341418
             │
       exper │   .0616496   .0154296     4.00   0.000     .0313997    .0918995
     expersq │  -.0020374   .0009435    -2.16   0.031    -.0038872   -.0001876
   1.married │   .1069328   .0157148     6.80   0.000     .0761238    .1377419
     1.union │   .1843211   .0171798    10.73   0.000       .15064    .2180023
       _cons │   .1982792   .1638308     1.21   0.226    -.1229129    .5194713
─────────────┴────────────────────────────────────────────────────────────────
FE (Within)
. eststo Within: xtreg lwage c.edu##i.d8* i.black##i.d8* i.married i.union, fe
note: educ omitted because of collinearity
note: 1.black omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1738                                         min =          8
     between = 0.1934                                         avg =        8.0
     overall = 0.1389                                         max =          8

                                                F(23,3792)        =      34.69
corr(u_i, Xb)  = 0.1049                         Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │          0  (omitted)
       1.d81 │  -.0276557   .1463492    -0.19   0.850    -.3145865    .2592752
       1.d82 │   .0162187   .1463125     0.11   0.912    -.2706402    .3030775
       1.d83 │   .0242781   .1463218     0.17   0.868     -.262599    .3111552
       1.d84 │   .0961124   .1463152     0.66   0.511    -.1907517    .3829765
       1.d85 │   .0806915   .1463169     0.55   0.581    -.2061759    .3675589
       1.d86 │    .075028   .1463527     0.51   0.608    -.2119097    .3619656
       1.d87 │   .1199805   .1463163     0.82   0.412    -.1668858    .4068467
             │
  d81#c.educ │
          1  │   .0117652   .0122599     0.96   0.337    -.0122714    .0358018
             │
  d82#c.educ │
          1  │   .0140603   .0122605     1.15   0.252    -.0099776    .0380982
             │
  d83#c.educ │
          1  │   .0166878   .0122603     1.36   0.174    -.0073497    .0407252
             │
  d84#c.educ │
          1  │   .0162498   .0122624     1.33   0.185    -.0077918    .0402914
             │
  d85#c.educ │
          1  │   .0227215   .0122702     1.85   0.064    -.0013353    .0467783
             │
  d86#c.educ │
          1  │    .027172   .0122704     2.21   0.027     .0031148    .0512291
             │
  d87#c.educ │
          1  │   .0295045   .0122688     2.40   0.016     .0054504    .0535587
             │
     1.black │          0  (omitted)
             │
   black#d81 │
        1 1  │   .0298494   .0669496     0.45   0.656    -.1014113    .1611101
             │
   black#d82 │
        1 1  │  -.1111954   .0669827    -1.66   0.097     -.242521    .0201302
             │
   black#d83 │
        1 1  │  -.0687691   .0669927    -1.03   0.305    -.2001144    .0625761
             │
   black#d84 │
        1 1  │  -.0589417   .0670094    -0.88   0.379    -.1903196    .0724362
             │
   black#d85 │
        1 1  │  -.1573439   .0670161    -2.35   0.019     -.288735   -.0259527
             │
   black#d86 │
        1 1  │  -.0458512   .0670197    -0.68   0.494    -.1772493    .0855469
             │
   black#d87 │
        1 1  │  -.1494466    .067063    -2.23   0.026    -.2809297   -.0179636
             │
   1.married │   .0516758   .0184369     2.80   0.005     .0155286    .0878229
     1.union │   .0845706    .019456     4.35   0.000     .0464254    .1227157
       _cons │   1.362641   .0162264    83.98   0.000     1.330828    1.394455
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .37090778
     sigma_e │  .35303669
         rho │   .5246707   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(544, 3792) = 8.01                   Prob > F = 0.0000
Compare
. esttab OLS Within, mtitles se scalars(F r2) drop(0.d* 0.b* 1.black#0.d* 0.union 0.married)

────────────────────────────────────────────
                      (1)             (2)   
                      OLS          Within   
────────────────────────────────────────────
educ               0.0830***            0   
                 (0.0125)             (.)   

1.d81             -0.0413         -0.0277   
                  (0.199)         (0.146)   

1.d82             -0.0152          0.0162   
                  (0.200)         (0.146)   

1.d83             -0.0269          0.0243   
                  (0.201)         (0.146)   

1.d84              0.0257          0.0961   
                  (0.202)         (0.146)   

1.d85              0.0252          0.0807   
                  (0.205)         (0.146)   

1.d86              0.0229          0.0750   
                  (0.209)         (0.146)   

1.d87              0.0553           0.120   
                  (0.214)         (0.146)   

1.d81#c.educ      0.00838          0.0118   
                 (0.0167)        (0.0123)   

1.d82#c.educ      0.00806          0.0141   
                 (0.0168)        (0.0123)   

1.d83#c.educ      0.00873          0.0167   
                 (0.0170)        (0.0123)   

1.d84#c.educ      0.00660          0.0162   
                 (0.0172)        (0.0123)   

1.d85#c.educ      0.00919          0.0227   
                 (0.0175)        (0.0123)   

1.d86#c.educ       0.0110          0.0272*  
                 (0.0178)        (0.0123)   

1.d87#c.educ       0.0120          0.0295*  
                 (0.0182)        (0.0123)   

1.black           -0.0821               0   
                 (0.0646)             (.)   

1.black#1~81       0.0350          0.0298   
                 (0.0911)        (0.0669)   

1.black#1~82      -0.0990          -0.111   
                 (0.0912)        (0.0670)   

1.black#1~83      -0.0595         -0.0688   
                 (0.0912)        (0.0670)   

1.black#1~84      -0.0446         -0.0589   
                 (0.0912)        (0.0670)   

1.black#1~85       -0.141          -0.157*  
                 (0.0912)        (0.0670)   

1.black#1~86      -0.0281         -0.0459   
                 (0.0912)        (0.0670)   

1.black#1~87       -0.145          -0.149*  
                 (0.0913)        (0.0671)   

exper              0.0616***                
                 (0.0154)                   

expersq          -0.00204*                  
               (0.000944)                   

1.married           0.107***       0.0517** 
                 (0.0157)        (0.0184)   

1.union             0.184***       0.0846***
                 (0.0172)        (0.0195)   

_cons               0.198           1.363***
                  (0.164)        (0.0162)   
────────────────────────────────────────────
N                    4360            4360   
F                   37.78           34.69   
r2                  0.191           0.174   
────────────────────────────────────────────
Standard errors in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Returns to Marriage for Men

Lesson: We can test to see if random effects is an appropriate assumption.

We’ll use three methods to estimate the returns to marriage for men: Pooled OLS, Fixed Effects (Within), and Random Effects. Note that the Fixed Effects (Within) estimator cannot estimate coefficients on time-invariant variables such as Black and Latino/Latina.

. use "wagepan.dta", clear

We can use the wagepan data again to estimate the returns to marriage for men. We will compare the Pooled OLS, FE (Within), and Random Effects estimates.

Set Panel

. xtset nr year
       panel variable:  nr (strongly balanced)
        time variable:  year, 1980 to 1987
                delta:  1 unit
Pooled OLS
. reg lwage educ i.black i.hisp exper expersq married union i.d8*

      Source │       SS           df       MS      Number of obs   =     4,360
─────────────┼──────────────────────────────────   F(14, 4345)     =     72.46
       Model │  234.048277        14  16.7177341   Prob > F        =    0.0000
    Residual │  1002.48136     4,345  .230720682   R-squared       =    0.1893
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1867
       Total │  1236.52964     4,359  .283672779   Root MSE        =    .48033

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │   .0913498   .0052374    17.44   0.000     .0810819    .1016177
     1.black │  -.1392342   .0235796    -5.90   0.000    -.1854622   -.0930062
      1.hisp │   .0160195   .0207971     0.77   0.441    -.0247535    .0567925
       exper │   .0672345   .0136948     4.91   0.000     .0403856    .0940834
     expersq │  -.0024117     .00082    -2.94   0.003    -.0040192   -.0008042
     married │   .1082529   .0156894     6.90   0.000     .0774937    .1390122
       union │   .1824613   .0171568    10.63   0.000     .1488253    .2160973
       1.d81 │     .05832   .0303536     1.92   0.055    -.0011886    .1178286
       1.d82 │   .0627744   .0332141     1.89   0.059    -.0023421    .1278909
       1.d83 │   .0620117   .0366601     1.69   0.091    -.0098608    .1338843
       1.d84 │   .0904672   .0400907     2.26   0.024      .011869    .1690654
       1.d85 │   .1092463   .0433525     2.52   0.012     .0242533    .1942393
       1.d86 │   .1419596    .046423     3.06   0.002     .0509469    .2329723
       1.d87 │   .1738334    .049433     3.52   0.000     .0769194    .2707474
       _cons │   .0920558   .0782701     1.18   0.240    -.0613935    .2455051
─────────────┴────────────────────────────────────────────────────────────────

. eststo m1: quietly reg lwage educ i.black i.hisp exper expersq married union i.d8*

The Pooled OLS estimates are likely biased upward due to self-selection into marriage, and the errors will exhibit positive serial correlation within person, so we really should cluster our standard errors by the group id.
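The within logic can be seen in a small simulation. Below is a minimal Python sketch with simulated data (not the wagepan sample): the regressor is correlated with an unobserved time-invariant effect a_i, so Pooled OLS is biased, while demeaning within each unit recovers the true slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 500, 8                                # units and periods (wagepan has 545 x 8)
a = rng.normal(size=n)                       # unobserved time-invariant effect a_i
x = rng.normal(size=(n, T)) + a[:, None]     # regressor correlated with a_i
y = 0.5 * x + a[:, None] + rng.normal(size=(n, T))   # true slope = 0.5

# Pooled OLS slope: biased upward because cov(x, a_i) > 0
xf, yf = x.ravel(), y.ravel()
b_pooled = np.cov(xf, yf)[0, 1] / np.var(xf, ddof=1)

# Within (FE) slope: demean within each unit, then OLS on the deviations
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_within = (xd * yd).sum() / (xd ** 2).sum()

print(b_pooled, b_within)  # pooled lands near 1.0 (biased); within near the true 0.5
```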

FE (Within)
. xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, fe
note: educ omitted because of collinearity
note: 1.black omitted because of collinearity
note: 1.hisp omitted because of collinearity
note: 1.d87 omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1806                                         min =          8
     between = 0.0005                                         avg =        8.0
     overall = 0.0635                                         max =          8

                                                F(10,3805)        =      83.85
corr(u_i, Xb)  = -0.1212                        Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │          0  (omitted)
     1.black │          0  (omitted)
      1.hisp │          0  (omitted)
       exper │   .1321464   .0098247    13.45   0.000     .1128842    .1514087
     expersq │  -.0051855   .0007044    -7.36   0.000    -.0065666   -.0038044
     married │   .0466804   .0183104     2.55   0.011     .0107811    .0825796
       union │   .0800019   .0193103     4.14   0.000     .0421423    .1178614
       1.d81 │   .0190448   .0203626     0.94   0.350    -.0208779    .0589674
       1.d82 │   -.011322   .0202275    -0.56   0.576    -.0509798    .0283359
       1.d83 │  -.0419955   .0203205    -2.07   0.039    -.0818357   -.0021553
       1.d84 │  -.0384709   .0203144    -1.89   0.058    -.0782991    .0013573
       1.d85 │  -.0432498   .0202458    -2.14   0.033    -.0829434   -.0035562
       1.d86 │  -.0273819   .0203863    -1.34   0.179    -.0673511    .0125872
       1.d87 │          0  (omitted)
       _cons │    1.02764   .0299499    34.31   0.000     .9689201    1.086359
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │   .4009279
     sigma_e │  .35099001
         rho │  .56612236   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(544, 3805) = 9.64                   Prob > F = 0.0000

We use estimates store to save our FE (Within) estimates for comparison.

. estimates store femodel

. eststo m2: quietly xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, fe
Random Effects

We can use the theta option to report the lambda-hat GLS transformation: https://www.stata.com/manuals/xtxtreg.pdf

. xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, re theta

Random-effects GLS regression                   Number of obs     =      4,360
Group variable: nr                              Number of groups  =        545

R-sq:                                           Obs per group:
     within  = 0.1799                                         min =          8
     between = 0.1860                                         avg =        8.0
     overall = 0.1830                                         max =          8

                                                Wald chi2(14)     =     957.77
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
theta          = .64291089

─────────────┬────────────────────────────────────────────────────────────────
       lwage │      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        educ │   .0918763   .0106597     8.62   0.000     .0709836    .1127689
     1.black │  -.1393767   .0477228    -2.92   0.003    -.2329117   -.0458417
      1.hisp │   .0217317   .0426063     0.51   0.610    -.0617751    .1052385
       exper │   .1057545   .0153668     6.88   0.000     .0756361    .1358729
     expersq │  -.0047239   .0006895    -6.85   0.000    -.0060753   -.0033726
     married │    .063986   .0167742     3.81   0.000     .0311091    .0968629
       union │   .1061344   .0178539     5.94   0.000     .0711415    .1411273
       1.d81 │    .040462   .0246946     1.64   0.101    -.0079385    .0888626
       1.d82 │   .0309212   .0323416     0.96   0.339    -.0324672    .0943096
       1.d83 │   .0202806    .041582     0.49   0.626    -.0612186    .1017798
       1.d84 │   .0431187   .0513163     0.84   0.401    -.0574595    .1436969
       1.d85 │   .0578155   .0612323     0.94   0.345    -.0621977    .1778286
       1.d86 │   .0919476   .0712293     1.29   0.197    -.0476592    .2315544
       1.d87 │   .1349289   .0813135     1.66   0.097    -.0244427    .2943005
       _cons │   .0235864   .1506683     0.16   0.876     -.271718    .3188907
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .32460315
     sigma_e │  .35099001
         rho │  .46100216   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────

. eststo m3: quietly xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, re

Our lambda-hat (theta) is 0.643; since it is closer to 1 than to 0, the Random Effects estimates are closer to the FE (Within) estimates than to the Pooled OLS estimates.
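The reported theta follows from the variance components in the output: for a balanced panel, theta = 1 - sqrt(sigma_e^2 / (sigma_e^2 + T*sigma_u^2)). A quick check in Python using the sigma_u and sigma_e reported above and T = 8:

```python
import math

sigma_u = 0.32460315   # from the RE output above
sigma_e = 0.35099001
T = 8                  # eight years per person (balanced panel)

theta = 1 - math.sqrt(sigma_e**2 / (sigma_e**2 + T * sigma_u**2))
print(round(theta, 5))  # 0.64291, matching the reported theta
```

A theta of 0 would reproduce Pooled OLS; a theta of 1 would reproduce the within estimator.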

Hausman Test
. hausman femodel ., sigmamore

Note: the rank of the differenced variance matrix (5) does not equal the number of coefficients being tested
        (10); be sure this is what you expect, or there may be problems computing the test.  Examine the output
        of your estimators for anything unexpected and possibly consider scaling your variables so that the
        coefficients are on a similar scale.

                 ──── Coefficients ────
             │      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             │    femodel         m3         Difference          S.E.
─────────────┼────────────────────────────────────────────────────────────────
       exper │    .1321464     .1057545        .0263919               .
     expersq │   -.0051855    -.0047239       -.0004616        .0001533
     married │    .0466804      .063986       -.0173057        .0074632
       union │    .0800019     .1061344       -.0261326        .0074922
       1.d81 │    .0190448      .040462       -.0214172               .
       1.d82 │    -.011322     .0309212       -.0422431               .
       1.d83 │   -.0419955     .0202806       -.0622762               .
       1.d84 │   -.0384709     .0431187       -.0815896               .
       1.d85 │   -.0432498     .0578155       -.1010653               .
       1.d86 │   -.0273819     .0919476       -.1193295               .
─────────────┴────────────────────────────────────────────────────────────────
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       26.22
                Prob>chi2 =      0.0001
                (V_b-V_B is not positive definite)

We reject the null hypothesis, which indicates that a_i is correlated with the explanatory variables, so we reject the Random Effects model.
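The test statistic is the quadratic form chi2 = (b-B)'[(V_b-V_B)^(-1)](b-B) shown in the output. A minimal sketch of that formula with illustrative toy numbers (not the wagepan estimates, since the full covariance matrices are not printed in the log):

```python
import numpy as np

def hausman(b, B, Vb, VB):
    """Hausman statistic: (b-B)' (Vb-VB)^-1 (b-B)."""
    d = b - B
    return float(d @ np.linalg.inv(Vb - VB) @ d)

# Toy two-coefficient example (illustrative values only)
b  = np.array([0.10, 0.05])                 # consistent (FE) estimates
B  = np.array([0.12, 0.06])                 # efficient-under-H0 (RE) estimates
Vb = np.array([[4e-4, 0.0], [0.0, 1e-4]])   # FE covariance
VB = np.array([[3e-4, 0.0], [0.0, 5e-5]])   # RE covariance

stat = hausman(b, B, Vb, VB)
print(round(stat, 6))  # 6.0; compare with chi2 critical values (df = 2 here)
```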

. esttab m1 m2 m3, drop(0.black 0.hisp 0.d8*) mtitles("Pooled OLS" "Within Model" "RE Model")

────────────────────────────────────────────────────────────
                      (1)             (2)             (3)   
               Pooled OLS    Within Model        RE Model   
────────────────────────────────────────────────────────────
educ               0.0913***            0          0.0919***
                  (17.44)             (.)          (8.62)   

1.black            -0.139***            0          -0.139** 
                  (-5.90)             (.)         (-2.92)   

1.hisp             0.0160               0          0.0217   
                   (0.77)             (.)          (0.51)   

exper              0.0672***        0.132***        0.106***
                   (4.91)         (13.45)          (6.88)   

expersq          -0.00241**      -0.00519***     -0.00472***
                  (-2.94)         (-7.36)         (-6.85)   

married             0.108***       0.0467*         0.0640***
                   (6.90)          (2.55)          (3.81)   

union               0.182***       0.0800***        0.106***
                  (10.63)          (4.14)          (5.94)   

1.d81              0.0583          0.0190          0.0405   
                   (1.92)          (0.94)          (1.64)   

1.d82              0.0628         -0.0113          0.0309   
                   (1.89)         (-0.56)          (0.96)   

1.d83              0.0620         -0.0420*         0.0203   
                   (1.69)         (-2.07)          (0.49)   

1.d84              0.0905*        -0.0385          0.0431   
                   (2.26)         (-1.89)          (0.84)   

1.d85               0.109*        -0.0432*         0.0578   
                   (2.52)         (-2.14)          (0.94)   

1.d86               0.142**       -0.0274          0.0919   
                   (3.06)         (-1.34)          (1.29)   

1.d87               0.174***            0           0.135   
                   (3.52)             (.)          (1.66)   

_cons              0.0921           1.028***       0.0236   
                   (1.18)         (34.31)          (0.16)   
────────────────────────────────────────────────────────────
N                    4360            4360            4360   
────────────────────────────────────────────────────────────
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
. display (exp(.108)-1)*100
11.404775

. display (exp(.0467)-1)*100
4.780762

. display (exp(.064)-1)*100
6.6092399

We can see that the marriage premium falls from 11.4% in Pooled OLS to 4.8% in Fixed Effects. If we hadn’t rejected our RE model, it would have been 6.6%.

The difference between the 11.4% Pooled OLS premium and the 4.8% Within premium may come from self-selection into marriage (married men would have earned more even if they weren’t married) or from employers paying married men more if marriage is a signal of stability. But we cannot distinguish these two hypotheses with this research design.

Plot the Coefficients
. quietly reg lwage educ i.black i.hisp exper expersq married union i.d8*

. estimates store pooled

. quietly xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, fe

. estimates store fe

. quietly xtreg lwage educ i.black i.hisp exper expersq married union i.d8*, re theta

. estimates store re

. coefplot ///
> (pooled, label("{bf:Pooled OLS}") mcolor(midblue) mlcolor(cyan) ///
>   ciopts(lcolor(magenta midblue))) /// options for first group
> (fe, label("{bf: Within}") mcolor(green) mlcolor(lime) ///
>   ciopts(lcolor(lime green))) /// options for second group
> (re, label("{bf: Random Effects}") mcolor(yellow) mlcolor(gold) ///
>   ciopts(lcolor(gold yellow))), /// options for third group
>   title("Returns to Marriage for Men") ///
>   keep(married) ///
>   xline(0, lcolor(red) lpattern(dash) lwidth(medium)) scheme(jet_white) ///
>   xtitle("{bf: Coefficients}") ///
>   graphregion(margin(small)) ///
>   coeflabels(married="Married") ///
>   msize(large) mcolor(%85) mlwidth(medium) msymbol(circle) /// marker options
>   levels(95 90) ciopts(lwidth(medthick thick) recast(rspike rcap)) ///ci options for all groups
>   legend(ring(1) col(1) pos(3) size(medsmall))
(note: scheme jet_white not found, using s2color)

. graph export "/Users/Sam/Desktop/Econ 645/Stata/week4_married_returns.png", replace
(file /Users/Sam/Desktop/Econ 645/Stata/week4_married_returns.png written in PNG format)
Plot the Coefficients of Interest

Exercises

Exercise 1:

. use "rental.dta", clear

The data are on rental prices and other variables in college towns for 1980 and 1990. Do more students affect rental prices? The general model with unobserved fixed effects is ln(rent) = b0 + d0*y90 + b1*ln(pop) + b2*ln(avginc) + b3*pctstu + a_i + u_it, where pop is city population, avginc is average income, pctstu is the student percentage of the population, and rent is the nominal rental price.

Estimate a Pooled OLS. What does the estimate on y90 tell you?

Are there concerns with the standard errors in the Pooled OLS?

Use a First difference model. Does the coefficient on b3 change?

Use a FE Within model. Are the results the same as the FD model?

Set the Panel
. xtset city year, delta(10)
       panel variable:  city (strongly balanced)
        time variable:  year, 80 to 90
                delta:  10 units
Pooled OLS
. reg lrent i.y90 lpop lavginc pctstu, robust

Linear regression                               Number of obs     =        128
                                                F(4, 123)         =     223.26
                                                Prob > F          =     0.0000
                                                R-squared         =     0.8613
                                                Root MSE          =     .12592

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
       lrent │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       1.y90 │   .2622267   .0579584     4.52   0.000     .1475017    .3769517
        lpop │   .0406863   .0223732     1.82   0.071       -.0036    .0849726
     lavginc │   .5714461   .0989016     5.78   0.000     .3756765    .7672157
      pctstu │   .0050436   .0011488     4.39   0.000     .0027696    .0073176
       _cons │  -.5688069   .8506229    -0.67   0.505    -2.252563    1.114949
─────────────┴────────────────────────────────────────────────────────────────
FD
. reg d.lrent i.y90 d.lpop d.lavginc d.pctstu
note: 1.y90 omitted because of collinearity

      Source │       SS           df       MS      Number of obs   =        64
─────────────┼──────────────────────────────────   F(3, 60)        =      9.51
       Model │  .231738668         3  .077246223   Prob > F        =    0.0000
    Residual │  .487362198        60  .008122703   R-squared       =    0.3223
─────────────┼──────────────────────────────────   Adj R-squared   =    0.2884
       Total │  .719100867        63  .011414299   Root MSE        =    .09013

─────────────┬────────────────────────────────────────────────────────────────
     D.lrent │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       1.y90 │          0  (omitted)
             │
        lpop │
         D1. │   .0722456   .0883426     0.82   0.417     -.104466    .2489571
             │
     lavginc │
         D1. │   .3099605   .0664771     4.66   0.000     .1769865    .4429346
             │
      pctstu │
         D1. │   .0112033   .0041319     2.71   0.009     .0029382    .0194684
             │
       _cons │   .3855214   .0368245    10.47   0.000     .3118615    .4591813
─────────────┴────────────────────────────────────────────────────────────────
FE
. xtreg lrent i.y90 lpop lavginc pctstu, fe

Fixed-effects (within) regression               Number of obs     =        128
Group variable: city                            Number of groups  =         64

R-sq:                                           Obs per group:
     within  = 0.9765                                         min =          2
     between = 0.2173                                         avg =        2.0
     overall = 0.7597                                         max =          2

                                                F(4,60)           =     624.15
corr(u_i, Xb)  = -0.1297                        Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lrent │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       1.y90 │   .3855214   .0368245    10.47   0.000     .3118615    .4591813
        lpop │   .0722456   .0883426     0.82   0.417     -.104466    .2489571
     lavginc │   .3099605   .0664771     4.66   0.000     .1769865    .4429346
      pctstu │   .0112033   .0041319     2.71   0.009     .0029382    .0194684
       _cons │   1.409384   1.167238     1.21   0.232    -.9254394    3.744208
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .15905877
     sigma_e │  .06372873
         rho │   .8616755   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(63, 60) = 6.67                      Prob > F = 0.0000
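Note that the FD and FE (Within) slope estimates above are identical: with T = 2, demeaning and first differencing are the same transformation up to scale, and the FE coefficient on the year dummy equals the FD intercept. A minimal Python sketch with simulated two-period data (assumed values, not the rental data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                                  # units, two periods each
a = rng.normal(size=n)                   # unit fixed effects
x1 = rng.normal(size=n) + a              # regressor correlated with a_i
x2 = x1 + rng.normal(size=n)
b, d0 = 0.5, 0.3                         # true slope and period-2 effect
y1 = 1 + b * x1 + a + rng.normal(size=n)
y2 = 1 + d0 + b * x2 + a + rng.normal(size=n)

# FE (within): demean each unit's two observations; include the demeaned period dummy
ybar, xbar = (y1 + y2) / 2, (x1 + x2) / 2
Y = np.concatenate([y1 - ybar, y2 - ybar])
X = np.column_stack([np.concatenate([x1 - xbar, x2 - xbar]),
                     np.concatenate([-0.5 * np.ones(n), 0.5 * np.ones(n)])])
fe = np.linalg.lstsq(X, Y, rcond=None)[0]        # [slope, period effect]

# FD: regress the first difference on a constant and the differenced regressor
fd = np.linalg.lstsq(np.column_stack([np.ones(n), x2 - x1]), y2 - y1,
                     rcond=None)[0]              # [period effect, slope]

print(fe[0], fd[1])   # identical slopes
print(fe[1], fd[0])   # identical period effects
```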

Exercise 2:

. use "airfare.dta", clear

We will assess the effect of airline route concentration on airfares. Our model: ln(fare) = b0 + b1*concen + b2*ln(dist) + b3*(ln(dist))^2 + a_i + u_it

Estimate the Pooled OLS with time binaries

. reg lfare concen ldist ldistsq i.y99 i.y00

      Source │       SS           df       MS      Number of obs   =     4,596
─────────────┼──────────────────────────────────   F(5, 4590)      =    627.18
       Model │  355.197587         5  71.0395174   Prob > F        =    0.0000
    Residual │  519.896787     4,590  .113267274   R-squared       =    0.4059
─────────────┼──────────────────────────────────   Adj R-squared   =    0.4052
       Total │  875.094374     4,595  .190444913   Root MSE        =    .33655

─────────────┬────────────────────────────────────────────────────────────────
       lfare │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      concen │   .3609856   .0300677    12.01   0.000     .3020384    .4199327
       ldist │  -.9018838   .1282905    -7.03   0.000    -1.153395   -.6503727
     ldistsq │   .1030514   .0097268    10.59   0.000     .0839822    .1221207
       1.y99 │   .0272979   .0121656     2.24   0.025     .0034475    .0511483
       1.y00 │   .0893211    .012169     7.34   0.000     .0654641    .1131782
       _cons │   6.219743   .4206248    14.79   0.000     5.395116     7.04437
─────────────┴────────────────────────────────────────────────────────────────

What does a change in concen of .10 imply for airfare?

. sum concen

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      concen │      4,596    .6101149     .196435      .1605          1
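
To put the concen coefficient in context: the summary shows that .10 is about half a standard deviation of concen. With lfare as the outcome, the implied fare change can be computed directly; a quick check in Python using the Pooled OLS coefficient above:

```python
import math

b_concen = 0.3609856     # Pooled OLS coefficient on concen above
dc = 0.10                # an increase in route concentration of .10

approx = 100 * b_concen * dc                 # log-point approximation
exact = 100 * (math.exp(b_concen * dc) - 1)  # exact percent change
print(round(approx, 2), round(exact, 2))     # 3.61 3.68 (percent)
```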

What does the quadratic in distance mean? With a negative coefficient on ldist and a positive coefficient on ldistsq, fares fall with distance up to a turning point and then rise. Use the quadratic formula to find where the partial effect of distance on airfare is zero: ln(dist) = .9019/(2*.1031) ≈ 4.38, so dist ≈ 80.

Set Panel

. xtset id year
       panel variable:  id (strongly balanced)
        time variable:  year, 1997 to 2000
                delta:  1 unit

Estimate a RE model

. xtreg lfare concen ldist ldistsq i.y99 i.y00, re theta

Random-effects GLS regression                   Number of obs     =      4,596
Group variable: id                              Number of groups  =      1,149

R-sq:                                           Obs per group:
     within  = 0.1282                                         min =          4
     between = 0.4179                                         avg =        4.0
     overall = 0.4030                                         max =          4

                                                Wald chi2(5)      =    1331.49
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
theta          = .83489895

─────────────┬────────────────────────────────────────────────────────────────
       lfare │      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      concen │    .216343   .0265686     8.14   0.000     .1642694    .2684166
       ldist │  -.8544998   .2464622    -3.47   0.001    -1.337557   -.3714428
     ldistsq │   .0977308   .0186343     5.24   0.000     .0612083    .1342532
       1.y99 │   .0255419   .0038794     6.58   0.000     .0179385    .0331454
       1.y00 │   .0870883   .0038876    22.40   0.000     .0794686    .0947079
       _cons │   6.232589   .8098919     7.70   0.000      4.64523    7.819948
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .31930578
     sigma_e │   .1069025
         rho │  .89920879   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────

Estimate a FE model

. xtreg lfare concen ldist ldistsq i.y99 i.y00, fe
note: ldist omitted because of collinearity
note: ldistsq omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =      4,596
Group variable: id                              Number of groups  =      1,149

R-sq:                                           Obs per group:
     within  = 0.1286                                         min =          4
     between = 0.0576                                         avg =        4.0
     overall = 0.0102                                         max =          4

                                                F(3,3444)         =     169.47
corr(u_i, Xb)  = -0.2143                        Prob > F          =     0.0000

─────────────┬────────────────────────────────────────────────────────────────
       lfare │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      concen │    .177761   .0294665     6.03   0.000     .1199873    .2355346
       ldist │          0  (omitted)
     ldistsq │          0  (omitted)
       1.y99 │   .0250736   .0038791     6.46   0.000      .017468    .0326791
       1.y00 │   .0864927   .0038892    22.24   0.000     .0788673    .0941181
       _cons │   4.959254   .0183174   270.74   0.000      4.92334    4.995168
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │  .43441394
     sigma_e │   .1069025
         rho │  .94290028   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
F test that all u_i=0: F(1148, 3444) = 60.09                 Prob > F = 0.0000

Mitchell

Chapter 6: Creating variables

Generate and replace will be the workhorses of creating and modifying variables. We’ll deal with generating binary variables. We’ll look into missing data and how to handle it (note: please do not label your missing data; it is just a pain to deal with later). Egen is also a very practical and useful command (I wish other software packages had this feature; it is so easy to implement). We’ll deal with strings and converting between numerics and strings. We’ll also work with rename.

Creating and changing variables

One of the most basic tasks is to create, replace, and modify variables. Stata provides some nice and easy commands to work with variable generation and modification relative to other statistical software packages.

The generate command and replace command will be important, but Stata has a very helpful command called egen. It can be used to create new variables from old variables, especially when working with groups of observations. Egen becomes even more helpful when it is combined with bysort: we can sort our data by group and find the max, min, sum, etc., of a group of observations, which comes in handy when we are working with panel data.
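For readers coming from other tools, a rough pandas analogue of egen combined with bysort (hypothetical toy data; groupby(...).transform plays the role of bysort: egen):

```python
import pandas as pd

# Hypothetical mini-panel: person id and a yearly wage
df = pd.DataFrame({"nr":   [1, 1, 2, 2, 2],
                   "wage": [10.0, 12.0, 8.0, 9.0, 7.0]})

# Analogue of: bysort nr: egen maxwage = max(wage)
df["maxwage"] = df.groupby("nr")["wage"].transform("max")

print(df["maxwage"].tolist())  # [12.0, 12.0, 9.0, 9.0, 9.0]
```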

Furthermore, using indexes can make our generation commands more useful for creating lags or rebasing our data.

6.2 Creating and changing variables

Set our working directory

. cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
/Users/Sam/Desktop/Econ 645/Data/Mitchell

Let’s get our data

. use wws, clear
(Working Women Survey)

Generate command

The generate command is one that you have plenty of experience with already, but there are some helpful tips when working with generate. What we’ll do first is create a new variable for weekly wages, based on hourly wages and hours at work by women.

Let’s look at hourly wages for women

. summarize wage

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        wage │      2,246    288.2885    9595.692          0     380000

Look for outliers

. list id wage if wage > 1000

      ┌─────────────────┐
      │ idcode     wage │
      ├─────────────────┤
 893. │   3145   380000 │
1241. │   2341   250000 │
      └─────────────────┘

A histogram is helpful too

. histogram wage if wage < 1000
(bin=33, start=0, width=1.2347451)

. graph export "/Users/Sam/Desktop/Econ 645/Stata/week4_wages_hist.png", replace
(file /Users/Sam/Desktop/Econ 645/Stata/week4_wages_hist.png written in PNG format)
Histogram of Wages

Let’s look at hours

. summarize hours

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
       hours │      2,242    37.21811    10.50914          1         80

Let’s look at histogram of hours

. histogram hours
(bin=33, start=1, width=2.3939394)

Let’s create our weekly wages

. gen weekwage = wage*hours
(4 missing values generated)

We’ll summarize our new variable

. summarize weekwage, detail

                          weekwage
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%     18.01074              0
 5%     61.99676        4.64573
10%     85.02411       6.038647       Obs               2,242
25%     146.3211        8.05153       Sum of Wgt.       2,242

50%      241.401                      Mean           11539.95
                        Largest       Std. Dev.      384170.1
75%     385.5072       1607.924
90%     526.4493           1920       Variance       1.48e+11
95%      681.159       1.00e+07       Skewness       35.42672
99%     1607.923       1.52e+07       Kurtosis       1294.728

We need to find our outliers

. list id weekwage if weekwage > 80000

      ┌───────────────────┐
      │ idcode   weekwage │
      ├───────────────────┤
  19. │   5111          . │
 372. │   4324          . │
 804. │   3318          . │
 893. │   3145   15200000 │
1139. │   2583          . │
      ├───────────────────┤
1241. │   2341   10000000 │
      └───────────────────┘

Notice! Our missing values were captured in our outlier check! Why? Because Stata treats missing values as very large values. This is important when creating binary or categorical variables through qualifiers: make sure you top code your qualifiers. For example, if we are creating a binary for part-time vs. full-time, we need to top code so we don’t count missing hours as full-time workers, e.g. hours > 35 & hours < 999999, or hours > 35 & !missing(hours); otherwise you may categorize missing hours as full-time.
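As a sketch of the safe version (the variable name fulltime is hypothetical):

. gen fulltime = (hours > 35) if !missing(hours)

The if qualifier keeps the observations with missing hours missing in fulltime instead of miscoding them as full-time.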

Let’s plot our weekly wages

. histogram weekwage if weekwage < 80000 
(bin=33, start=0, width=58.181818)

Another helpful trick is the before and after options. It might be helpful to have the newly created variable next to a similar variable, so let’s drop our variable and generate it again with the after option.

. drop weekwage
. gen weekwage = wage*hours, after(wage)
(4 missing values generated)

Replace

We use our replace command to modify a variable that has already been created. Similar to generate, you probably already have experience with replace, but we’ll go over some useful tips.

Replace using categorical variable(s) in the qualifier

Let’s look at married and nevermarried variables

. tabulate married nevermarried

           │   Woman never been
           │        married
   married │         0          1 │     Total
───────────┼──────────────────────┼──────────
         0 │       570        234 │       804 
         1 │     1,440          2 │     1,442 
───────────┼──────────────────────┼──────────
     Total │     2,010        236 │     2,246 

We’ll create a categorical variable called maritalstatus.

. gen maritalstatus = .
(2,246 missing values generated)

I recommend initializing a new categorical variable as missing instead of zero, because this prevents missing values from being categorized as 0.

We generally use the qualifiers ==, >, <, >=, <=, and != when working with replace. Let’s replace the marital status variable using married and nevermarried as qualifiers.

. replace maritalstatus = 1 if married==0 & nevermarried==1
(234 real changes made)

. replace maritalstatus = 2 if married==1 & nevermarried==0
(1,440 real changes made)

. replace maritalstatus = 3 if married==0 & nevermarried==0
(570 real changes made)

It’s always a good idea to double-check the new variable against the variables it was created from, to make sure your categories are what they are supposed to be.

. tabulate maritalstatus

maritalstat │
         us │      Freq.     Percent        Cum.
────────────┼───────────────────────────────────
          1 │        234       10.43       10.43
          2 │      1,440       64.17       74.60
          3 │        570       25.40      100.00
────────────┼───────────────────────────────────
      Total │      2,244      100.00

. table maritalstatus married nevermarried

──────────┬─────────────────────────────
          │ Woman never been married and
          │           married           
maritalst │ ───── 0 ────    ───── 1 ────
atus      │     0      1        0      1
──────────┼─────────────────────────────
        1 │                   234       
        2 │        1,440                
        3 │   570                       
──────────┴─────────────────────────────

Let’s label our values as we did last week. Labeling is a good practice for your future self or for others replicating your work.

. label define maritalstatus1 1 "Single" 2 "Married" 3 "Divorced or Widowed"

. label values maritalstatus maritalstatus1

. label variable maritalstatus "Marital Status"

. tabulate maritalstatus

     Marital Status │      Freq.     Percent        Cum.
────────────────────┼───────────────────────────────────
             Single │        234       10.43       10.43
            Married │      1,440       64.17       74.60
Divorced or Widowed │        570       25.40      100.00
────────────────────┼───────────────────────────────────
              Total │      2,244      100.00

Replace using continuous variables in the qualifier

Let’s create a new variable called over40hours, which will be 1 if hours are over 40, and 0 if they are 40 or under.

. generate over40hours = .
(2,246 missing values generated)

. replace over40hours = 0 if hours <= 40
(1,852 real changes made)

We need to make sure we don’t include missing values, so we need an additional qualifier besides hours > 40, such as !missing(hours).

. replace over40hours = 1 if hours > 40 & hours < 99999
(390 real changes made)
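An equivalent sketch using !missing() instead of a numeric top code:

. replace over40hours = 1 if hours > 40 & !missing(hours)

Either way, the observations with missing hours stay missing in over40hours rather than being counted as over 40.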

6.3 Numeric expressions and functions

Our standard numeric expressions are addition (+), subtraction (-), multiplication (*), division (/), exponentiation (^), and parentheses () for changing the order of operations.

We also have some very useful numeric functions built into Stata.
int() removes any decimals.
round(x,y) rounds a number to the desired decimal place, where x is your numeric and y is the nearest value you wish to round to. Please note that this is different from Excel!
For example, if y = 1, round(-5.2,1) = -5 and round(4.5,1) = 5.
If y = .1, round(10.16,.1) = 10.2 and round(34.1345,.1) = 34.1.
If y = 10, round(313.34,10) = 310 and round(4.52,10) = 0.
Note: if y is missing, round(x) rounds to the nearest integer, e.g. round(10.16) = 10.
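You can check a couple of the rounding examples above directly:

. display round(-5.2,1)
-5

. display round(4.52,10)
0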

ln() is our natural log function, which we use a lot for transforming variables for elasticity estimates.
log10() is our logarithm base-10 function.
sqrt() is our square root function.

. display int(10.65)
10

. display round(10.65,1)
11

Please note int(10.65) returns 10, while round(10.65,1) returns 11

. display ln(10.65)
2.3655599

. display log10(10.65)
1.0273496

. display sqrt(10.65)
3.2634338

Random Numbers and setting seeds

There are several random number generating functions in Stata.
runiform() generates draws from a uniform distribution between 0 and 1.
rnormal(m,sd) generates draws from a normal distribution; without arguments it assumes mean = 0 and sd = 1.
rchi2(df) generates draws from a chi-squared distribution, where df is the degrees of freedom and must be specified.

Let’s look at some examples

Setting seeds is important if you want someone to replicate your results since a set seed will generate the same random number each time it is run.

. set seed 12345

Random uniform distribution

. gen runiformtest = runiform()

. summarize runiformtest

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
runiformtest │      2,246    .4924636    .2875376   .0007583   .9985869

Random normal distribution

. gen rnormaltest = rnormal()

. gen rnormaltest2 = rnormal(100,15)

. summarize rnormaltest rnormaltest2

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
 rnormaltest │      2,246     .018436    .9857847  -3.389103   3.372525
rnormaltest2 │      2,246    99.70718    14.77732   47.81653   154.2297

Random chi-squared distribution

. gen rchi2test = rchi2(5)

. summarize rchi2test

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
   rchi2test │      2,246    5.063639    3.322998   .1102785   29.24291

6.4 String Expressions and functions

Working with strings can be a pain, but there are some very helpful functions that will get your task completed.

. help string functions

Let’s get some new data to work with strings and their functions

. use "authors.dta", clear

We’ll format the names with left-justification 17 characters long

. format name %-17s
. list name 

     ┌──────────────────────┐
     │ name                 │
     ├──────────────────────┤
  1. │  Ruth Canaale        │
  2. │ Y. Don Uflossmore    │
  3. │ thích nhất hạnh      │
  4. │ J. Carreño Quiñones  │
  5. │ Ô Knausgård          │
     ├──────────────────────┤
  6. │ Don b Iteme          │
  7. │ isaac O'yerbreath    │
  8. │ Mike avity           │
  9. │      ÉMILE ZOLA      │
 10. │ i William Crown      │
     ├──────────────────────┤
 11. │ Ott W. Onthurt       │
 12. │ Olive Tu'Drill       │
 13. │ björk guðmundsdóttir │
     └──────────────────────┘

Notice the white space(s) in front of some names, and that some names are not in the proper format (lowercase initial letter instead of uppercase).

To fix these, we have three string functions to work with.
ustrtitle() is the Unicode function to convert strings to title case, meaning the first letter of each word is upper case and all others are lower case.
ustrlower() is the Unicode function to convert strings to lower case.
ustrupper() is the Unicode function to convert strings to upper case.

Please note that there are string functions that do something similar but in ASCII: strproper(), strlower(), strupper(). Our names are not in ASCII; not all string characters come in ASCII, but they do come in Unicode.

. generate name2 = ustrtitle(name)

. generate lowname = ustrlower(name)

. generate upname = ustrupper(name)

Format our strings to 23 length and left-justified

. format name2 lowname upname %-23s

. list name2 lowname upname

     ┌────────────────────────────────────────────────────────────────────┐
     │ name2                  lowname                upname               │
     ├────────────────────────────────────────────────────────────────────┤
  1. │  Ruth Canaale           ruth canaale           RUTH CANAALE        │
  2. │ Y. Don Uflossmore      y. don uflossmore      Y. DON UFLOSSMORE    │
  3. │ Thích Nhất Hạnh        thích nhất hạnh        THÍCH NHẤT HẠNH      │
  4. │ J. Carreño Quiñones    j. carreño quiñones    J. CARREÑO QUIÑONES  │
  5. │ Ô Knausgård            ô knausgård            Ô KNAUSGÅRD          │
     ├────────────────────────────────────────────────────────────────────┤
  6. │ Don B Iteme            don b iteme            DON B ITEME          │
  7. │ Isaac O'yerbreath      isaac o'yerbreath      ISAAC O'YERBREATH    │
  8. │ Mike Avity             mike avity             MIKE AVITY           │
  9. │      Émile Zola             émile zola             ÉMILE ZOLA      │
 10. │ I William Crown        i william crown        I WILLIAM CROWN      │
     ├────────────────────────────────────────────────────────────────────┤
 11. │ Ott W. Onthurt         ott w. onthurt         OTT W. ONTHURT       │
 12. │ Olive Tu'drill         olive tu'drill         OLIVE TU'DRILL       │
 13. │ Björk Guðmundsdóttir   björk guðmundsdóttir   BJÖRK GUÐMUNDSDÓTTIR │
     └────────────────────────────────────────────────────────────────────┘

We still need to get rid of the leading white spaces in front of the names. ustrltrim() will get rid of leading blanks.

. generate name3 = ustrltrim(name2)

. format name2 name3 %-17s

. list name name2 name3

     ┌────────────────────────────────────────────────────────────────────┐
     │ name                   name2                  name3                │
     ├────────────────────────────────────────────────────────────────────┤
  1. │  Ruth Canaale           Ruth Canaale          Ruth Canaale         │
  2. │ Y. Don Uflossmore      Y. Don Uflossmore      Y. Don Uflossmore    │
  3. │ thích nhất hạnh        Thích Nhất Hạnh        Thích Nhất Hạnh      │
  4. │ J. Carreño Quiñones    J. Carreño Quiñones    J. Carreño Quiñones  │
  5. │ Ô Knausgård            Ô Knausgård            Ô Knausgård          │
     ├────────────────────────────────────────────────────────────────────┤
  6. │ Don b Iteme            Don B Iteme            Don B Iteme          │
  7. │ isaac O'yerbreath      Isaac O'yerbreath      Isaac O'yerbreath    │
  8. │ Mike avity             Mike Avity             Mike Avity           │
  9. │      ÉMILE ZOLA             Émile Zola        Émile Zola           │
 10. │ i William Crown        I William Crown        I William Crown      │
     ├────────────────────────────────────────────────────────────────────┤
 11. │ Ott W. Onthurt         Ott W. Onthurt         Ott W. Onthurt       │
 12. │ Olive Tu'Drill         Olive Tu'drill         Olive Tu'drill       │
 13. │ björk guðmundsdóttir   Björk Guðmundsdóttir   Björk Guðmundsdóttir │
     └────────────────────────────────────────────────────────────────────┘

Let’s work out which names start with an initial. In small datasets this is easy enough by eye, but for large datasets we’ll need some functions.

substr functions

One of the more practical string functions is the substr family.
usubstr(s,n1,n2) is our Unicode substring.
substr(s,n1,n2) is our ASCII substring.
Here s is the string, n1 is the starting position, and n2 is the length.

. display substr("abcdef",2,3)
bcd

Let’s find names that start with an initial using substr. We’ll start at the 2nd position and take 1 character. Let’s use ASCII and compare it to Unicode.

. gen secondchar = substr(name3,2,1)

. gen firstinit = (secondchar == " " | secondchar == ".") if !missing(secondchar)

. list name3 secondchar firstinit

     ┌────────────────────────────────────────────┐
     │ name3                  second~r   firsti~t │
     ├────────────────────────────────────────────┤
  1. │ Ruth Canaale                  u          0 │
  2. │ Y. Don Uflossmore             .          1 │
  3. │ Thích Nhất Hạnh               h          0 │
  4. │ J. Carreño Quiñones           .          1 │
  5. │ Ô Knausgård                   �          0 │
     ├────────────────────────────────────────────┤
  6. │ Don B Iteme                   o          0 │
  7. │ Isaac O'yerbreath             s          0 │
  8. │ Mike Avity                    i          0 │
  9. │ Émile Zola                    �          0 │
 10. │ I William Crown                          1 │
     ├────────────────────────────────────────────┤
 11. │ Ott W. Onthurt                t          0 │
 12. │ Olive Tu'drill                l          0 │
 13. │ Björk Guðmundsdóttir          j          0 │
     └────────────────────────────────────────────┘
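Note the replacement characters (�) for Ô Knausgård and Émile Zola: ASCII substr() grabbed the second byte of a multi-byte Unicode character. A sketch of the Unicode version (the variable name usecondchar is hypothetical):

. gen usecondchar = usubstr(name3,2,1)

For Ô Knausgård this returns the space after Ô, so that name would correctly be flagged as starting with an initial; for Émile Zola it returns m.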

We might want to break up a string as well. This is helpful for names, addresses, etc. We can use the ustrwordcount() function.
ustrwordcount(s) counts the number of words in s using the word-boundary rules of Unicode strings.

. generate namecount = ustrwordcount(name3)

. list name3 namecount

     ┌─────────────────────────────────┐
     │ name3                  nameco~t │
     ├─────────────────────────────────┤
  1. │ Ruth Canaale                  2 │
  2. │ Y. Don Uflossmore             4 │
  3. │ Thích Nhất Hạnh               3 │
  4. │ J. Carreño Quiñones           4 │
  5. │ Ô Knausgård                   2 │
     ├─────────────────────────────────┤
  6. │ Don B Iteme                   3 │
  7. │ Isaac O'yerbreath             2 │
  8. │ Mike Avity                    2 │
  9. │ Émile Zola                    2 │
 10. │ I William Crown               3 │
     ├─────────────────────────────────┤
 11. │ Ott W. Onthurt                4 │
 12. │ Olive Tu'drill                2 │
 13. │ Björk Guðmundsdóttir          2 │
     └─────────────────────────────────┘

Notice that three authors have a word count of 4 when it should be 3.

To extract individual words, we can use ustrword().
ustrword(s,n) returns the nth word of the string.
Note: a positive n returns the nth word from the left, while -n returns the nth word from the right.

. generate uname1=ustrword(name3,1)

. generate uname2=ustrword(name3,2)

. generate uname3=ustrword(name3,3)
(7 missing values generated)

. generate uname4=ustrword(name3,4)
(10 missing values generated)
. list name3 uname1 uname2 uname3 uname4 namecount

     ┌──────────────────────────────────────────────────────────────────────────────────┐
     │ name3                  uname1           uname2    uname3       uname4   nameco~t │
     ├──────────────────────────────────────────────────────────────────────────────────┤
  1. │ Ruth Canaale             Ruth          Canaale                                 2 │
  2. │ Y. Don Uflossmore           Y                .       Don   Uflossmore          4 │
  3. │ Thích Nhất Hạnh         Thích             Nhất      Hạnh                       3 │
  4. │ J. Carreño Quiñones         J                .   Carreño     Quiñones          4 │
  5. │ Ô Knausgård                 Ô        Knausgård                                 2 │
     ├──────────────────────────────────────────────────────────────────────────────────┤
  6. │ Don B Iteme               Don                B     Iteme                       3 │
  7. │ Isaac O'yerbreath       Isaac      O'yerbreath                                 2 │
  8. │ Mike Avity               Mike            Avity                                 2 │
  9. │ Émile Zola              Émile             Zola                                 2 │
 10. │ I William Crown             I          William     Crown                       3 │
     ├──────────────────────────────────────────────────────────────────────────────────┤
 11. │ Ott W. Onthurt            Ott                W         .      Onthurt          4 │
 12. │ Olive Tu'drill          Olive         Tu'drill                                 2 │
 13. │ Björk Guðmundsdóttir    Björk   Guðmundsdóttir                                 2 │
     └──────────────────────────────────────────────────────────────────────────────────┘

It seems that ustrwordcount() counts “.” as a separate word,

so let’s use another very helpful string function called subinstr, which comes in Unicode usubinstr() and ASCII subinstr().
usubinstr(s1,s2,s3,n) replaces the first n occurrences of s2 in s1 with s3; subinstr(s1,s2,s3,n) is the same thing but for ASCII. Here s1 is our string, s2 is the string we want to replace, and s3 is the string we want instead.

If n is missing, then all occurrences of s2 will be replaced.

. generate name4 = usubinstr(name3,".","",.)

. replace namecount = ustrwordcount(name4)
(3 real changes made)

. list name4 namecount

     ┌─────────────────────────────────┐
     │                name4   nameco~t │
     ├─────────────────────────────────┤
  1. │         Ruth Canaale          2 │
  2. │     Y Don Uflossmore          3 │
  3. │      Thích Nhất Hạnh          3 │
  4. │   J Carreño Quiñones          3 │
  5. │          Ô Knausgård          2 │
     ├─────────────────────────────────┤
  6. │          Don B Iteme          3 │
  7. │    Isaac O'yerbreath          2 │
  8. │           Mike Avity          2 │
  9. │           Émile Zola          2 │
 10. │      I William Crown          3 │
     ├─────────────────────────────────┤
 11. │        Ott W Onthurt          3 │
 12. │       Olive Tu'drill          2 │
 13. │ Björk Guðmundsdóttir          2 │
     └─────────────────────────────────┘

Let’s split the names

. gen fname = ustrword(name4,1)

. gen mname = ustrword(name4,2) if namecount==3
(7 missing values generated)

. gen lname = ustrword(name4,namecount)

. format fname mname lname %-15s

. list name4 fname mname lname

     ┌─────────────────────────────────────────────────────────┐
     │                name4   fname   mname     lname          │
     ├─────────────────────────────────────────────────────────┤
  1. │         Ruth Canaale   Ruth              Canaale        │
  2. │     Y Don Uflossmore   Y       Don       Uflossmore     │
  3. │      Thích Nhất Hạnh   Thích   Nhất      Hạnh           │
  4. │   J Carreño Quiñones   J       Carreño   Quiñones       │
  5. │          Ô Knausgård   Ô                 Knausgård      │
     ├─────────────────────────────────────────────────────────┤
  6. │          Don B Iteme   Don     B         Iteme          │
  7. │    Isaac O'yerbreath   Isaac             O'yerbreath    │
  8. │           Mike Avity   Mike              Avity          │
  9. │           Émile Zola   Émile             Zola           │
 10. │      I William Crown   I       William   Crown          │
     ├─────────────────────────────────────────────────────────┤
 11. │        Ott W Onthurt   Ott     W         Onthurt        │
 12. │       Olive Tu'drill   Olive             Tu'drill       │
 13. │ Björk Guðmundsdóttir   Björk             Guðmundsdóttir │
     └─────────────────────────────────────────────────────────┘

Some names have the middle initial first, so let’s rearrange. A middle initial has a string length of one, so we need our string-length functions.
ustrlen(s) returns the length of the Unicode string.
strlen(s) is the same but for ASCII.

Let’s add the period back to the middle initial if the string length is 1

. replace fname=fname + "." if ustrlen(fname)==1
(4 real changes made)

. replace mname=mname + "." if ustrlen(mname)==1
(2 real changes made)

. list fname mname

     ┌─────────────────┐
     │ fname   mname   │
     ├─────────────────┤
  1. │ Ruth            │
  2. │ Y.      Don     │
  3. │ Thích   Nhất    │
  4. │ J.      Carreño │
  5. │ Ô.              │
     ├─────────────────┤
  6. │ Don     B.      │
  7. │ Isaac           │
  8. │ Mike            │
  9. │ Émile           │
 10. │ I.      William │
     ├─────────────────┤
 11. │ Ott     W.      │
 12. │ Olive           │
 13. │ Björk           │
     └─────────────────┘

Let’s make a new variable that correctly arranges the parts of the name

. gen mlen = ustrlen(mname)

. gen flen = ustrlen(fname)

. gen fullname = fname + " " + lname if namecount == 2
(6 missing values generated)

. replace fullname = fname + " " + mname + " " + lname if namecount==3 & mlen > 2
(4 real changes made)

. replace fullname = fname + " " + mname + " " + lname if namecount==3 & mlen==2
(2 real changes made)

. replace fullname = mname + " " + fname + " " + lname if namecount==3 & flen==2
(3 real changes made)

. list fullname fname mname lname

     ┌─────────────────────────────────────────────────────────┐
     │             fullname   fname   mname     lname          │
     ├─────────────────────────────────────────────────────────┤
  1. │         Ruth Canaale   Ruth              Canaale        │
  2. │    Don Y. Uflossmore   Y.      Don       Uflossmore     │
  3. │      Thích Nhất Hạnh   Thích   Nhất      Hạnh           │
  4. │  Carreño J. Quiñones   J.      Carreño   Quiñones       │
  5. │         Ô. Knausgård   Ô.                Knausgård      │
     ├─────────────────────────────────────────────────────────┤
  6. │         Don B. Iteme   Don     B.        Iteme          │
  7. │    Isaac O'yerbreath   Isaac             O'yerbreath    │
  8. │           Mike Avity   Mike              Avity          │
  9. │           Émile Zola   Émile             Zola           │
 10. │     William I. Crown   I.      William   Crown          │
     ├─────────────────────────────────────────────────────────┤
 11. │       Ott W. Onthurt   Ott     W.        Onthurt        │
 12. │       Olive Tu'drill   Olive             Tu'drill       │
 13. │ Björk Guðmundsdóttir   Björk             Guðmundsdóttir │
     └─────────────────────────────────────────────────────────┘

If you are brave enough, learning regular expressions can pay off in the long run. Stata has string functions that use regular expressions, such as regexm() for matching particular patterns. Regular expressions are a pain to work with, but they are very powerful.
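A minimal sketch (the variable name startswithinit is hypothetical): flag names whose first word is a single capital letter followed by a space.

. gen startswithinit = regexm(name4, "^[A-Z] ")

Note that regexm() works on ASCII; for Unicode strings like these names, ustrregexm() is the safer choice.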

6.5 Recoding

We can use Stata commands to create new variables from old variables

Let’s get some data on working women

. use "wws2lab.dta", clear
(Working Women Survey w/fixes)

. codebook occupation, tabulate(25)

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
occupation                                                                                             occupation
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (byte)
                 label:  occlbl

                 range:  [1,13]                       units:  1
         unique values:  13                       missing .:  9/2,246

            tabulation:  Freq.   Numeric  Label
                           319         1  Professional/technical
                           264         2  Managers/admin
                           725         3  Sales
                           101         4  Clerical/unskilled
                            53         5  Craftsmen
                           246         6  Operatives
                            28         7  Transport
                           286         8  Laborers
                             1         9  Farmers
                             9        10  Farm laborers
                            16        11  Service
                             2        12  Household workers
                           187        13  Other
                             9         .  

Let’s say we want aggregated groups of occupation. We could use the categorical-variable strategy above, creating one with gen and replace with qualifiers. But we can also use the recode command for faster groupings.

. recode occupation (1/3=1) (5/8=2) (4 9/13=3), generate(occ3)
(1918 differences between occupation and occ3)

. tab occupation occ3

                      │       RECODE of occupation
                      │           (occupation)
           occupation │         1          2          3 │     Total
──────────────────────┼─────────────────────────────────┼──────────
Professional/technica │       319          0          0 │       319 
       Managers/admin │       264          0          0 │       264 
                Sales │       725          0          0 │       725 
   Clerical/unskilled │         0          0        101 │       101 
            Craftsmen │         0         53          0 │        53 
           Operatives │         0        246          0 │       246 
            Transport │         0         28          0 │        28 
             Laborers │         0        286          0 │       286 
              Farmers │         0          0          1 │         1 
        Farm laborers │         0          0          9 │         9 
              Service │         0          0         16 │        16 
    Household workers │         0          0          2 │         2 
                Other │         0          0        187 │       187 
──────────────────────┼─────────────────────────────────┼──────────
                Total │     1,308        613        316 │     2,237 


. table occupation occ3

───────────────────────┬─────────────────
                       │    RECODE of    
                       │    occupation   
                       │   (occupation)  
            occupation │    1     2     3
───────────────────────┼─────────────────
Professional/technical │  319            
        Managers/admin │  264            
                 Sales │  725            
    Clerical/unskilled │              101
             Craftsmen │         53      
            Operatives │        246      
             Transport │         28      
              Laborers │        286      
               Farmers │                1
         Farm laborers │                9
               Service │               16
     Household workers │                2
                 Other │              187
───────────────────────┴─────────────────

We condensed 3 or 4 lines of code down to 1.

We can also attach value labels within the recode command, further consolidating the code.

. drop occ3

. recode occupation (1/3= 1 "White Collar") (5/8=2 "Blue Collar") (4 9/12=3 "Other"), generate(occ3)
(1731 differences between occupation and occ3)

. tab occupation occ3, missing

                      │           RECODE of occupation (occupation)
           occupation │ White Col  Blue Coll      Other         13          . │     Total
──────────────────────┼───────────────────────────────────────────────────────┼──────────
Professional/technica │       319          0          0          0          0 │       319 
       Managers/admin │       264          0          0          0          0 │       264 
                Sales │       725          0          0          0          0 │       725 
   Clerical/unskilled │         0          0        101          0          0 │       101 
            Craftsmen │         0         53          0          0          0 │        53 
           Operatives │         0        246          0          0          0 │       246 
            Transport │         0         28          0          0          0 │        28 
             Laborers │         0        286          0          0          0 │       286 
              Farmers │         0          0          1          0          0 │         1 
        Farm laborers │         0          0          9          0          0 │         9 
              Service │         0          0         16          0          0 │        16 
    Household workers │         0          0          2          0          0 │         2 
                Other │         0          0          0        187          0 │       187 
                    . │         0          0          0          0          9 │         9 
──────────────────────┼───────────────────────────────────────────────────────┼──────────
                Total │     1,308        613        129        187          9 │     2,246 


. table occupation occ3, missing

───────────────────────┬───────────────────────────────────────────────────────
                       │           RECODE of occupation (occupation)           
            occupation │ White Collar   Blue Collar         Other            13
───────────────────────┼───────────────────────────────────────────────────────
Professional/technical │          319             .             .             .
        Managers/admin │          264             .             .             .
                 Sales │          725             .             .             .
    Clerical/unskilled │            .             .           101             .
             Craftsmen │            .            53             .             .
            Operatives │            .           246             .             .
             Transport │            .            28             .             .
              Laborers │            .           286             .             .
               Farmers │            .             .             1             .
         Farm laborers │            .             .             9             .
               Service │            .             .            16             .
     Household workers │            .             .             2             .
                 Other │            .             .             .           187
───────────────────────┴───────────────────────────────────────────────────────

In Stata, #n1/#n2 means all values between #n1 and #n2, while #n1 #n2 means only the values #n1 and #n2. For example, 1/10 means 1 2 3 4 5 6 7 8 9 10, while 1 10 means only 1 and 10. Also notice that the second recode used 4 9/12 instead of 4 9/13, so occupation 13 (Other) was left unrecoded and shows up as its own column above.

Occupation is a categorical variable, but we can also use recode with continuous variables such as wages, hours, etc. There are recode(), irecode(), and autocode(), but I recommend using gen and replace with qualifiers instead. You don't want overlapping groups, and with qualifiers you can ensure they don't overlap.
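The gen/replace pattern amounts to assigning each observation to exactly one non-overlapping bin, with missings left missing. A minimal Python sketch of that logic (the breakpoints 0/4/6/9/12 and the function name are my own, chosen to mirror the egen cut() example below):

```python
# Mutually exclusive binning with explicit qualifiers, mirroring the
# Stata gen/replace-with-qualifiers pattern (Python illustration).
def wage_group(wage):
    if wage is None:
        return None          # missing stays missing
    if 0 <= wage < 4:
        return 0
    if 4 <= wage < 6:
        return 4
    if 6 <= wage < 9:
        return 6
    if 9 <= wage < 12:
        return 9
    return None              # outside all bins -> missing, never overlapping

groups = [wage_group(w) for w in [3.1, 4.9, 7.3, 10.4, 15.0, None]]
```

Because each condition is a half-open interval, no wage can land in two groups, which is exactly what the qualifiers guarantee.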

egen is a powerful command with many useful functions. One is egen cut().

. summarize wage, detail

                         hourly wage
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%     1.892108              0
 5%     2.801002       1.004952
10%     3.220612       1.032247       Obs               2,244
25%     4.259257       1.151368       Sum of Wgt.       2,244

50%      6.27227                      Mean           7.796781
                        Largest       Std. Dev.       5.82459
75%     9.657809       40.19808
90%     12.77777       40.19808       Variance       33.92584
95%     16.72241       40.19808       Skewness       3.078195
99%     38.70926       40.74659       Kurtosis       15.58899

You can specify breakpoints with at() or equal-sized groups with group(). Make sure your first breakpoint is at the minimum so you don't cut off data.

. egen wage3 = cut(wage), at(0,4,6,9,12)
(278 missing values generated)

. egen wage4 = cut(wage), group(3)
(2 missing values generated)

. tabstat wage, by(wage3)

Summary for variables: wage
     by categories of: wage3 

   wage3 │      mean
─────────┼──────────
       0 │  3.085733
       4 │  4.918866
       6 │  7.327027
       9 │  10.40792
─────────┼──────────
   Total │  6.188629
─────────┴──────────

. tabstat wage, by(wage4)

Summary for variables: wage
     by categories of: wage4 

   wage4 │      mean
─────────┼──────────
       0 │  3.633853
       1 │  6.374262
       2 │  13.38223
─────────┼──────────
   Total │  7.796781
─────────┴──────────

I still recommend using gen, replace, and qualifiers for generating categorical variables from continuous variables.

6.6 Coding missing values

If you are interested, you can review different ways to code missing values. I generally just use “.”, but when survey data comes back, there may be different reasons for missing. N/A vs Did not respond may have different meanings and you may want to account for that.

We will take a brief look at mvdecode, which recodes all missing values coded as -1 and -2 to ".". It can be a bit faster than using replace for every variable of interest.
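The logic of mvdecode is just a sentinel-to-missing sweep over several variables at once. A rough Python equivalent (the function and variable names are mine):

```python
# mvdecode-style recode: turn sentinel codes (-1, -2) into missing (None)
# across several variables at once (Python illustration).
def mvdecode(rows, varnames, mv=(-1, -2)):
    for row in rows:
        for v in varnames:
            if row[v] in mv:
                row[v] = None    # sentinel becomes missing
    return rows

data = [{"id": 3, "pl2": -1, "bp4": 68},
        {"id": 1, "pl2": 115, "bp4": -2}]
mvdecode(data, ["pl2", "bp4"])
```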

. infile id age pl1-pl5 bp1-bp5 using "cardio2miss.txt", clear
(5 observations read)

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   age   pl1   pl2   pl3   pl4   pl5   bp1   bp2   bp3   bp4   bp5 │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1    40    54   115    87    86    93   129    81   105    -2    -2 │
  2. │  2    30    92   123    88   136   125   107    87   111    58   120 │
  3. │  3    16   105    -1    97   122   128   101    57   109    68   112 │
  4. │  4    23    52   105    79   115    71   121   106   129    39   137 │
  5. │  5    18    70   116    -1   128    52   112    68   125    59   111 │
     └──────────────────────────────────────────────────────────────────────┘

Recode -1 and -2 as .

. mvdecode bp* pl*, mv(-1 -2)
         bp4: 1 missing value generated
         bp5: 1 missing value generated
         pl2: 1 missing value generated
         pl3: 1 missing value generated

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   age   pl1   pl2   pl3   pl4   pl5   bp1   bp2   bp3   bp4   bp5 │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1    40    54   115    87    86    93   129    81   105     .     . │
  2. │  2    30    92   123    88   136   125   107    87   111    58   120 │
  3. │  3    16   105     .    97   122   128   101    57   109    68   112 │
  4. │  4    23    52   105    79   115    71   121   106   129    39   137 │
  5. │  5    18    70   116     .   128    52   112    68   125    59   111 │
     └──────────────────────────────────────────────────────────────────────┘

6.7 Dummy variables

Creating dummy variables is a common and vital step in econometric analysis. Making mutually exclusive groups is essential for preventing perfect multicollinearity (the dummy variable trap).

Stata has an easy way to incorporate binary/categorical/dummy variables: factor variables, which start with "i." in our regressions. We saw d. and l. last week for differences and lags. If we don't use i. for a categorical variable, then we need to generate multiple binary (0/1) variables from the categorical groups ourselves. If we use neither factor variables nor mutually exclusive binary variables, Stata will treat the categorical variable as continuous, which has no meaningful interpretation.

. use "wws2lab.dta", clear
(Working Women Survey w/fixes)

Let’s look at our highest grade completed categorical

. codebook grade4

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
grade4                                                                            4 level Current Grade Completed
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (byte)
                 label:  grade4

                 range:  [1,4]                        units:  1
         unique values:  4                        missing .:  4/2,246

            tabulation:  Freq.   Numeric  Label
                           332         1  Not HS
                           941         2  HS Grad
                           456         3  Some Coll
                           513         4  Coll Grad
                             4         .  

Let’s use it in a regression with grade4 as a factor variable

. regress wage i.grade4

      Source │       SS           df       MS      Number of obs   =     2,240
─────────────┼──────────────────────────────────   F(3, 2236)      =     85.35
       Model │  7811.98756         3  2603.99585   Prob > F        =    0.0000
    Residual │  68221.1897     2,236  30.5103711   R-squared       =    0.1027
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1015
       Total │  76033.1772     2,239  33.9585428   Root MSE        =    5.5236

─────────────┬────────────────────────────────────────────────────────────────
        wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      grade4 │
    HS Grad  │   1.490229   .3526422     4.23   0.000      .798689     2.18177
  Some Coll  │   3.769248   .3985065     9.46   0.000     2.987767    4.550729
  Coll Grad  │   5.319548   .3892162    13.67   0.000     4.556285     6.08281
             │
       _cons │   5.194571    .303148    17.14   0.000      4.60009    5.789052
─────────────┴────────────────────────────────────────────────────────────────

Stata knows to create 4 groups (Not HS, HS Grad, Some Coll, and Coll Grad), and by default, Stata excludes the first category to prevent perfect multicollinearity. So there will always be k-1 groups in the regression, and the kth (base) group is absorbed into the intercept.
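What i.grade4 does under the hood is create k-1 indicator columns, dropping the base category. A Python sketch of that construction (the function name is mine):

```python
# One-hot encoding with the base category dropped (k-1 dummies),
# mirroring what i.grade4 does inside regress (Python illustration).
def make_dummies(values, categories, base):
    keep = [c for c in categories if c != base]       # k-1 dummy columns
    return [{c: int(v == c) for c in keep} for v in values]

cats = ["Not HS", "HS Grad", "Some Coll", "Coll Grad"]
dummies = make_dummies(["HS Grad", "Not HS", "Coll Grad"], cats, base="Not HS")
```

An observation in the base group gets all zeros, so its mean lives entirely in the intercept.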

A nice feature with factor variables is that we can change the base group, which is 1 by default. ib2.var means to use the variable as a factor variable and use the 2nd category as the base group.

. regress wage ib2.grade4

      Source │       SS           df       MS      Number of obs   =     2,240
─────────────┼──────────────────────────────────   F(3, 2236)      =     85.35
       Model │  7811.98756         3  2603.99585   Prob > F        =    0.0000
    Residual │  68221.1897     2,236  30.5103711   R-squared       =    0.1027
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1015
       Total │  76033.1772     2,239  33.9585428   Root MSE        =    5.5236

─────────────┬────────────────────────────────────────────────────────────────
        wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      grade4 │
     Not HS  │  -1.490229   .3526422    -4.23   0.000     -2.18177    -.798689
  Some Coll  │   2.279019   .3152246     7.23   0.000     1.660855    2.897182
  Coll Grad  │   3.829318   .3033948    12.62   0.000     3.234353    4.424283
             │
       _cons │     6.6848   .1801606    37.10   0.000     6.331501      7.0381
─────────────┴────────────────────────────────────────────────────────────────

We can use list to see how Stata treats i.grade4 vs grade4

. list wage grade4 i.grade4 in 1/10, nolabel

     ┌───────────────────────────────────────────────────────┐
     │                         1b.       2.       3.       4.│
     │     wage   grade4   grade4   grade4   grade4   grade4 │
     ├───────────────────────────────────────────────────────┤
  1. │  7.15781        1        0        0        0        0 │
  2. │ 2.447664        2        0        1        0        0 │
  3. │ 3.824476        3        0        0        1        0 │
  4. │ 14.32367        4        0        0        0        1 │
  5. │ 5.517124        2        0        1        0        0 │
     ├───────────────────────────────────────────────────────┤
  6. │ 5.032206        3        0        0        1        0 │
  7. │ 4.251207        2        0        1        0        0 │
  8. │ 2.801002        2        0        1        0        0 │
  9. │ 15.04025        4        0        0        0        1 │
 10. │ 13.28503        4        0        0        0        1 │
     └───────────────────────────────────────────────────────┘

Interactions are an important part of Stata. The interaction operator # can interact categorical-categorical, categorical-continuous, or continuous-continuous variables.

Categorical-Categorical Interaction - Changes Intercepts

Let’s interact married and highest grade completed

. reg wage i.grade4##i.married

      Source │       SS           df       MS      Number of obs   =     2,240
─────────────┼──────────────────────────────────   F(7, 2232)      =     38.08
       Model │  8110.85014         7  1158.69288   Prob > F        =    0.0000
    Residual │  67922.3271     2,232  30.4311501   R-squared       =    0.1067
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1039
       Total │  76033.1772     2,239  33.9585428   Root MSE        =    5.5164

───────────────────┬────────────────────────────────────────────────────────────────
              wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
───────────────────┼────────────────────────────────────────────────────────────────
            grade4 │
          HS Grad  │   1.321819   .5759051     2.30   0.022     .1924532    2.451184
        Some Coll  │   4.439404   .6472103     6.86   0.000     3.170207    5.708601
        Coll Grad  │   5.905856   .6363548     9.28   0.000     4.657947    7.153766
                   │
           married │
          married  │  -.1704624   .6220255    -0.27   0.784    -1.390271    1.049347
                   │
    grade4#married │
  HS Grad#married  │    .267619   .7281759     0.37   0.713    -1.160354    1.695592
Some Coll#married  │  -1.056459   .8208028    -1.29   0.198    -2.666076    .5531576
Coll Grad#married  │  -.9017134   .8039499    -1.12   0.262    -2.478281    .6748543
                   │
             _cons │   5.299313   .4875893    10.87   0.000     4.343137    6.255489
───────────────────┴────────────────────────────────────────────────────────────────

This is the same as

. reg wage i.grade4 i.married i.grade4#i.married

      Source │       SS           df       MS      Number of obs   =     2,240
─────────────┼──────────────────────────────────   F(7, 2232)      =     38.08
       Model │  8110.85014         7  1158.69288   Prob > F        =    0.0000
    Residual │  67922.3271     2,232  30.4311501   R-squared       =    0.1067
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1039
       Total │  76033.1772     2,239  33.9585428   Root MSE        =    5.5164

───────────────────┬────────────────────────────────────────────────────────────────
              wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
───────────────────┼────────────────────────────────────────────────────────────────
            grade4 │
          HS Grad  │   1.321819   .5759051     2.30   0.022     .1924532    2.451184
        Some Coll  │   4.439404   .6472103     6.86   0.000     3.170207    5.708601
        Coll Grad  │   5.905856   .6363548     9.28   0.000     4.657947    7.153766
                   │
           married │
          married  │  -.1704624   .6220255    -0.27   0.784    -1.390271    1.049347
                   │
    grade4#married │
  HS Grad#married  │    .267619   .7281759     0.37   0.713    -1.160354    1.695592
Some Coll#married  │  -1.056459   .8208028    -1.29   0.198    -2.666076    .5531576
Coll Grad#married  │  -.9017134   .8039499    -1.12   0.262    -2.478281    .6748543
                   │
             _cons │   5.299313   .4875893    10.87   0.000     4.343137    6.255489
───────────────────┴────────────────────────────────────────────────────────────────

A single # creates only the interaction terms, while ## between two categoricals also includes the main effects (all groups except the base groups). Not HS and unmarried is the base group (intercept). The grade4 coefficients are the returns to education for the unmarried; the grade4#married coefficients are the additional returns to education for the married relative to the unmarried. Remember: this changes the intercepts for all groups, and the base group's intercept is _cons.
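To read the interaction output: the fitted mean for any cell is the sum of the relevant coefficients. For example, using the coefficients reported above, the fitted mean wage for married HS grads:

```python
# Fitted cell means from the grade4##married output above:
# intercept + HS Grad main effect + married main effect + HS Grad#married.
cons, hs, married, hs_x_married = 5.299313, 1.321819, -0.1704624, 0.267619

fitted_married_hsgrad = cons + hs + married + hs_x_married
fitted_unmarried_nhs = cons   # base group: Not HS and unmarried
```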

Categorical-continuous - changes intercepts and slopes

Interacting categorical and continuous variables allows for different slopes and different intercepts. We define a continuous variable as “c.var” and a categorical as “i.var”

. reg wage i.grade4##c.age

      Source │       SS           df       MS      Number of obs   =     2,240
─────────────┼──────────────────────────────────   F(7, 2232)      =     36.60
       Model │  7829.76084         7  1118.53726   Prob > F        =    0.0000
    Residual │  68203.4164     2,232  30.5570862   R-squared       =    0.1030
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1002
       Total │  76033.1772     2,239  33.9585428   Root MSE        =    5.5278

─────────────┬────────────────────────────────────────────────────────────────
        wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
      grade4 │
    HS Grad  │   1.296013   2.460262     0.53   0.598    -3.528628    6.120654
  Some Coll  │   4.463486   2.764418     1.61   0.107    -.9576133    9.884585
  Coll Grad  │   6.507141   2.727232     2.39   0.017     1.158965    11.85532
             │
         age │    .014414   .0574379     0.25   0.802    -.0982233    .1270514
             │
grade4#c.age │
    HS Grad  │    .005623   .0665171     0.08   0.933    -.1248188    .1360649
  Some Coll  │  -.0189718   .0750005    -0.25   0.800    -.1660498    .1281063
  Coll Grad  │   -.032585   .0738738    -0.44   0.659    -.1774535    .1122835
             │
       _cons │   4.664681   2.133219     2.19   0.029     .4813798    8.847983
─────────────┴────────────────────────────────────────────────────────────────

We can see different intercepts for education; don't forget that _cons is the intercept for the base group (No High School). The coefficient on age is the slope for the base group. The grade4#c.age coefficients are the differences in slope for the three comparison groups relative to the base group, so each group's slope is the base slope plus its interaction coefficient.
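The group-specific slopes can be recovered by adding each interaction coefficient to the base-group slope. Using the coefficients from the output above:

```python
# Slope on age for each education group = base-group slope + that group's
# grade4#c.age interaction coefficient (values from the output above).
base_slope = 0.014414                     # age: slope for Not HS
interaction = {"HS Grad": 0.005623,
               "Some Coll": -0.0189718,
               "Coll Grad": -0.032585}

slopes = {g: base_slope + b for g, b in interaction.items()}
slopes["Not HS"] = base_slope             # base group keeps the base slope
```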

Continuous-Continuous Interaction - adding polynomials of order n

p = 2

. reg wage c.age##c.age

      Source │       SS           df       MS      Number of obs   =     2,244
─────────────┼──────────────────────────────────   F(2, 2241)      =      0.53
       Model │  35.8383924         2  17.9191962   Prob > F        =    0.5899
    Residual │  76059.8271     2,241  33.9401281   R-squared       =    0.0005
─────────────┼──────────────────────────────────   Adj R-squared   =   -0.0004
       Total │  76095.6655     2,243  33.9258429   Root MSE        =    5.8258

─────────────┬────────────────────────────────────────────────────────────────
        wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │   .2546764   .2532736     1.01   0.315    -.2419991    .7513518
             │
 c.age#c.age │  -.0037169   .0036417    -1.02   0.308    -.0108584    .0034247
             │
       _cons │    3.55454   4.337015     0.82   0.413    -4.950447    12.05953
─────────────┴────────────────────────────────────────────────────────────────

p = 3

. reg wage c.age##c.age##c.age

      Source │       SS           df       MS      Number of obs   =     2,244
─────────────┼──────────────────────────────────   F(3, 2240)      =      0.50
       Model │  50.4733111         3   16.824437   Prob > F        =    0.6854
    Residual │  76045.1922     2,240  33.9487465   R-squared       =    0.0007
─────────────┼──────────────────────────────────   Adj R-squared   =   -0.0007
       Total │  76095.6655     2,243  33.9258429   Root MSE        =    5.8266

──────────────────┬────────────────────────────────────────────────────────────────
             wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
──────────────────┼────────────────────────────────────────────────────────────────
              age │  -1.199385    2.22906    -0.54   0.591    -5.570624    3.171855
                  │
      c.age#c.age │   .0392406    .065528     0.60   0.549    -.0892614    .1677426
                  │
c.age#c.age#c.age │  -.0004146   .0006315    -0.66   0.512     -.001653    .0008238
                  │
            _cons │   19.58876   24.80329     0.79   0.430    -29.05107    68.22859
──────────────────┴────────────────────────────────────────────────────────────────

p = 4

. reg wage c.age##c.age##c.age##c.age

      Source │       SS           df       MS      Number of obs   =     2,244
─────────────┼──────────────────────────────────   F(4, 2239)      =      0.50
       Model │  67.8804251         4  16.9701063   Prob > F        =    0.7359
    Residual │  76027.7851     2,239  33.9561345   R-squared       =    0.0009
─────────────┼──────────────────────────────────   Adj R-squared   =   -0.0009
       Total │  76095.6655     2,243  33.9258429   Root MSE        =    5.8272

────────────────────────┬────────────────────────────────────────────────────────────────
                   wage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
────────────────────────┼────────────────────────────────────────────────────────────────
                    age │   9.148757   14.62392     0.63   0.532    -19.52911    37.82662
                        │
            c.age#c.age │   -.428763   .6569265    -0.65   0.514    -1.717012    .8594857
                        │
      c.age#c.age#c.age │   .0088427   .0129448     0.68   0.495    -.0165425    .0342278
                        │
c.age#c.age#c.age#c.age │  -.0000676   .0000945    -0.72   0.474    -.0002529    .0001176
                        │
                  _cons │  -64.76751   120.4015    -0.54   0.591    -300.8777    171.3427
────────────────────────┴────────────────────────────────────────────────────────────────
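With a quadratic in age, the implied turning point is -b1/(2*b2). Using the p = 2 coefficients above (neither of which is statistically significant here, so this is purely illustrative arithmetic):

```python
# Turning point (vertex) of the fitted quadratic in age:
# wage = b0 + b1*age + b2*age^2  ->  peak at age = -b1 / (2*b2)
b1, b2 = 0.2546764, -0.0037169   # coefficients from the p = 2 output
turning_point = -b1 / (2 * b2)   # roughly age 34
```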

6.8 Date Variables

Dates can be a pain to work with, but Mitchell and UCLA's OARC Stata resources (https://stats.oarc.ucla.edu/stata/modules/using-dates-in-stata/) can be very helpful.

Dates and times can be very helpful when working with time series data or panel data.

Sometimes dates are split across columns, and we need to combine them and set the date format.

. type momkid1.csv
momid,momm,momd,momy,kidbday
1,11,28,1972,1/5/1998
2,4,3,1973,4/11/2002
3,6,13,1968,5/15/1996
4,1,5,1960,1/4/2004

. import delimited using "momkid1.csv", clear
(5 vars, 4 obs)

. list

     ┌────────────────────────────────────────┐
     │ momid   momm   momd   momy     kidbday │
     ├────────────────────────────────────────┤
  1. │     1     11     28   1972    1/5/1998 │
  2. │     2      4      3   1973   4/11/2002 │
  3. │     3      6     13   1968   5/15/1996 │
  4. │     4      1      5   1960    1/4/2004 │
     └────────────────────────────────────────┘

Create dates from numeric variables: mdy(). We can create the mother's birthday with the mdy() function.

. generate mombdate=mdy(momm, momd, momy), after(momy)

Create Dates from string variables: date()

. generate kidbdate=date(kidbday,"MDY")
. list 

     ┌──────────────────────────────────────────────────────────────┐
     │ momid   momm   momd   momy   mombdate     kidbday   kidbdate │
     ├──────────────────────────────────────────────────────────────┤
  1. │     1     11     28   1972       4715    1/5/1998      13884 │
  2. │     2      4      3   1973       4841   4/11/2002      15441 │
  3. │     3      6     13   1968       3086   5/15/1996      13284 │
  4. │     4      1      5   1960          4    1/4/2004      16074 │
     └──────────────────────────────────────────────────────────────┘

Notice that the dates are not yet in a readable format. Stata stores them as the number of days since Jan 1, 1960 (day 0). You can see the mom born on Jan 5, 1960 has the value 4, since Jan 5, 1960 is 4 days after Jan 1, 1960.
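The %td storage convention is easy to reproduce: a date is just the day count from Jan 1, 1960. A quick Python check of the values shown above (the helper name is mine):

```python
# Stata stores %td dates as integer days since 01jan1960 (day 0).
# Python's datetime.date reproduces the same offsets.
from datetime import date

EPOCH = date(1960, 1, 1)

def stata_td(y, m, d):
    return (date(y, m, d) - EPOCH).days

mom4 = stata_td(1960, 1, 5)     # the mom born Jan 5, 1960 -> 4
kid1 = stata_td(1998, 1, 5)     # kidbdate for 1/5/1998 in the list above
```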

We saw this a couple of weeks ago: the %td format is what we need. There are a lot of variations on %td, but underneath, Stata still stores Jan 5, 1960 as 4, Jan 1, 1960 as 0, etc.

. format mombdate kidbdate %td

. list

     ┌────────────────────────────────────────────────────────────────┐
     │ momid   momm   momd   momy    mombdate     kidbday    kidbdate │
     ├────────────────────────────────────────────────────────────────┤
  1. │     1     11     28   1972   28nov1972    1/5/1998   05jan1998 │
  2. │     2      4      3   1973   03apr1973   4/11/2002   11apr2002 │
  3. │     3      6     13   1968   13jun1968   5/15/1996   15may1996 │
  4. │     4      1      5   1960   05jan1960    1/4/2004   04jan2004 │
     └────────────────────────────────────────────────────────────────┘

For an MM/DD/YYYY format, add NN/DD/ccYY after %td.

. format mombdate kidbdate %tdNN/DD/ccYY

. list

     ┌──────────────────────────────────────────────────────────────────┐
     │ momid   momm   momd   momy     mombdate     kidbday     kidbdate │
     ├──────────────────────────────────────────────────────────────────┤
  1. │     1     11     28   1972   11/28/1972    1/5/1998   01/05/1998 │
  2. │     2      4      3   1973   04/03/1973   4/11/2002   04/11/2002 │
  3. │     3      6     13   1968   06/13/1968   5/15/1996   05/15/1996 │
  4. │     4      1      5   1960   01/05/1960    1/4/2004   01/04/2004 │
     └──────────────────────────────────────────────────────────────────┘

nn is for 1-12 months; NN is for 01-12 months. dd is for 1-31 days; DD is for 01-31 days. YY is the 2-digit year (regardless of the first two digits); cc is the first 2 digits of the year; ccYY is the 4-digit year. Dayname returns Sunday-Saturday; Mon returns the month name. We can use ",", "-", "/", and "_", where "_" is a blank space.

. format mombdate kidbdate %tdDayname_Mon_DD,_ccYY

. list kidbday kidbdate

     ┌────────────────────────────────────┐
     │   kidbday                 kidbdate │
     ├────────────────────────────────────┤
  1. │  1/5/1998      Monday Jan 05, 1998 │
  2. │ 4/11/2002    Thursday Apr 11, 2002 │
  3. │ 5/15/1996   Wednesday May 15, 1996 │
  4. │  1/4/2004      Sunday Jan 04, 2004 │
     └────────────────────────────────────┘

Remember that we can difference between dates as mentioned above, since Stata keeps dates as numerical days from Jan 1, 1960.

. generate momagediff=kidbdate-mombdate

. list mombdate kidbdate momagediff

     ┌───────────────────────────────────────────────────────────┐
     │              mombdate                 kidbdate   momage~f │
     ├───────────────────────────────────────────────────────────┤
  1. │  Tuesday Nov 28, 1972      Monday Jan 05, 1998       9169 │
  2. │  Tuesday Apr 03, 1973    Thursday Apr 11, 2002      10600 │
  3. │ Thursday Jun 13, 1968   Wednesday May 15, 1996      10198 │
  4. │  Tuesday Jan 05, 1960      Sunday Jan 04, 2004      16070 │
     └───────────────────────────────────────────────────────────┘

We can find years by dividing by 365.25

. generate momagediffyr = (kidbdate-mombdate)/365.25

. list mombdate kidbdate momagediffyr

     ┌────────────────────────────────────────────────────────────┐
     │              mombdate                 kidbdate   momaged~r │
     ├────────────────────────────────────────────────────────────┤
  1. │  Tuesday Nov 28, 1972      Monday Jan 05, 1998   25.103354 │
  2. │  Tuesday Apr 03, 1973    Thursday Apr 11, 2002   29.021218 │
  3. │ Thursday Jun 13, 1968   Wednesday May 15, 1996   27.920602 │
  4. │  Tuesday Jan 05, 1960      Sunday Jan 04, 2004   43.997262 │
     └────────────────────────────────────────────────────────────┘
. display (mdy(4,5,2005)-mdy(5,6,2000))/365.25
4.9144422
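The same arithmetic in Python confirms the display result above:

```python
# Reproduce: display (mdy(4,5,2005)-mdy(5,6,2000))/365.25
from datetime import date

years = (date(2005, 4, 5) - date(2000, 5, 6)).days / 365.25
```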

We can always use a qualifier to find people born before, after, not on,
or on a certain date

. list momid mombdate if (mombdate >= mdy(1,20,1970)) & !missing(mombdate)

     ┌──────────────────────────────┐
     │ momid               mombdate │
     ├──────────────────────────────┤
  1. │     1   Tuesday Nov 28, 1972 │
  2. │     2   Tuesday Apr 03, 1973 │
     └──────────────────────────────┘

Let’s say we want just the day, month, or year from a date variable. Then we have a few functions:
day(date) returns a numeric day
month(date) returns a numeric month
year(date) returns a numeric year
week(date) returns a numeric week (1-52)
quarter(date) returns a numeric quarter (1-4)
dow(date) returns the day of the week as a numeric (0=Sunday, ..., 6=Saturday)
doy(date) returns a numeric day of the year (1-365 or 1-366)
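These extraction functions have direct Python analogues, shown here for reference (the quarter and dow conversions are my own arithmetic, matching Stata's conventions):

```python
# Stata's day()/month()/year()/quarter()/doy()/dow() via datetime.date.
from datetime import date

d = date(1998, 1, 5)                     # Monday Jan 05, 1998
parts = {
    "day": d.day,                        # day(date)
    "month": d.month,                    # month(date)
    "year": d.year,                      # year(date)
    "quarter": (d.month - 1) // 3 + 1,   # quarter(date): 1-4
    "doy": d.timetuple().tm_yday,        # doy(date)
    "dow": (d.weekday() + 1) % 7,        # dow(date): 0=Sunday,...,6=Saturday
}
```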

This can be helpful when comparing time in a panel data set where we don't need the day, just month and year; I use this quite a bit with CPS data. If we have the month and year, we can use ym(Y,M) and use that in our xtset; it counts the number of months since Jan 1960.

. gen monyear = ym(momy,momm)

We use the %tm format for months and year

. format monyear %tmMon_ccYY

. list monyear

     ┌──────────┐
     │  monyear │
     ├──────────┤
  1. │ Nov 1972 │
  2. │ Apr 1973 │
  3. │ Jun 1968 │
  4. │ Jan 1960 │
     └──────────┘
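The ym() convention is simple arithmetic: months elapsed since January 1960, which is month 0. A Python sketch:

```python
# Stata's ym(Y,M): months since January 1960 (which is 0).
def ym(year, month):
    return (year - 1960) * 12 + (month - 1)

nov1972 = ym(1972, 11)   # Nov 1972, first row in the list above
jan1960 = ym(1960, 1)    # the ym epoch
```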

Mitchell covers using a cutoff when there are only 2-digit years (YY) on pages 177-179; you can review this if you like.

Don’t forget to look for help on Stata

. help dates and times

Or, on UCLA’s OARC Stata resources

6.9 Date and Time Variables

Section 6.8 covered dates; you may also run across time formats, and I'll refer you to pages 179-186 for future reference. If for some reason you had birth date and time, you can use the mdyhms() function and the %tc format.

. gen testdt = mdyhms(momm,momd,momy,runiform(1,12),runiform(0,59), runiform(0,59))

. format testdt %tc

. list mombdate testdt

     ┌────────────────────────────────────────────┐
     │              mombdate               testdt │
     ├────────────────────────────────────────────┤
  1. │  Tuesday Nov 28, 1972   28nov1972 08:33:12 │
  2. │  Tuesday Apr 03, 1973   03apr1973 05:26:19 │
  3. │ Thursday Jun 13, 1968   13jun1968 07:08:28 │
  4. │  Tuesday Jan 05, 1960   05jan1960 01:35:20 │
     └────────────────────────────────────────────┘

6.10 Computations across variables

The egen command is something I miss in other statistical packages since it is so useful. We can do computations across columns (variables) or across observations with egen.

For computations across columns/variables, we can look at row means or row min, row max, row missing, row nonmissing

Let’s get our data

. use "cardio2miss.dta", clear

Let’s find the mean across pl1-pl5 with row mean instead of gen avgpl=(pl1+…+pl5)/5

. egen avgpl=rowmean(pl1-pl5)

. list id pl1-pl5 avgpl

     ┌──────────────────────────────────────────┐
     │ id   pl1   pl2   pl3   pl4   pl5   avgpl │
     ├──────────────────────────────────────────┤
  1. │  1    54   115    87    86    93      87 │
  2. │  2    92   123    88   136   125   112.8 │
  3. │  3   105    .a    97   122   128     113 │
  4. │  4    52   105    79   115    71    84.4 │
  5. │  5    70   116    .a   128    52    91.5 │
     └──────────────────────────────────────────┘

rowmean() will ignore the missing values
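A rowmean()-style average skips the missing values rather than propagating them. In Python terms (using None for missing, as in id 3's row above):

```python
# rowmean()-style average that skips missing values (None), as egen does.
def rowmean(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

avg_row3 = rowmean([105, None, 97, 122, 128])   # pl1-pl5 for id 3
```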

. egen avgbp=rowmean(bp1-bp5)

. list id bp1-bp5 avgbp

     ┌──────────────────────────────────────────┐
     │ id   bp1   bp2   bp3   bp4   bp5   avgbp │
     ├──────────────────────────────────────────┤
  1. │  1   129    81   105    .b    .b     105 │
  2. │  2   107    87   111    58   120    96.6 │
  3. │  3   101    57   109    68   112    89.4 │
  4. │  4   121   106   129    39   137   106.4 │
  5. │  5   112    68   125    59   111      95 │
     └──────────────────────────────────────────┘

rowmin() returns the row minimum; rowmax() returns the row maximum.

. egen maxbp=rowmax(bp1-bp5)

. egen minbp=rowmin(bp1-bp5)

. list id bp1-bp5 avgbp maxbp minbp

     ┌──────────────────────────────────────────────────────────┐
     │ id   bp1   bp2   bp3   bp4   bp5   avgbp   maxbp   minbp │
     ├──────────────────────────────────────────────────────────┤
  1. │  1   129    81   105    .b    .b     105     129      81 │
  2. │  2   107    87   111    58   120    96.6     120      58 │
  3. │  3   101    57   109    68   112    89.4     112      57 │
  4. │  4   121   106   129    39   137   106.4     137      39 │
  5. │  5   112    68   125    59   111      95     125      59 │
     └──────────────────────────────────────────────────────────┘

We can find the number of missing or nonmissing values with rowmiss() and rownonmiss().

. egen missbp = rowmiss(bp?)

Note: the ? operator differs from the * wildcard operator. The ? operator is a wildcard for exactly 1 character, so bp? will pick up bp1, bp2, bp3, bp4, and bp5, but will exclude bpavg, bpmin, and bpmax. Examples:
my*var variables starting with my & ending with var, with any number of other characters between
my~var one variable starting with my & ending with var, with any number of other characters between
my?var variables starting with my & ending with var, with one other character between
myvar just one variable
myvar thisvar thatvar three variables
myvar* variables starting with myvar
*var variables ending with var
myvar1-myvar6 myvar1, myvar2, …, myvar6 (probably)
this-that variables this through that, inclusive

6.11 Computation across observations

egen and bysort are a powerful combination for working across groups of observations. This can be very helpful when working with panel data or time series by groups. We sort by id and then time, and then perform our egen commands. The main egen functions we use are total (sum), max, min, mean, and count.

. use "gasctrysmall.dta", clear

Find the average for groups. egen without a bysort will return the mean for the entire column, which may or may not be helpful. With groups, we want to sort by group first and then perform the egen mean.

. egen avggas = mean(gas)

. bysort ctry: egen avggas_ctry = mean(gas)

. list ctry year gas avggas avggas_ctry, sepby(ctry)

     ┌────────────────────────────────────────┐
     │ ctry   year   gas   avggas   avggas_~y │
     ├────────────────────────────────────────┤
  1. │    1   1974   .78    .7675        .805 │
  2. │    1   1975   .83    .7675        .805 │
     ├────────────────────────────────────────┤
  3. │    2   1971   .69    .7675   .78333333 │
  4. │    2   1971   .77    .7675   .78333333 │
  5. │    2   1973   .89    .7675   .78333333 │
     ├────────────────────────────────────────┤
  6. │    3   1974   .42    .7675         .42 │
     ├────────────────────────────────────────┤
  7. │    4   1974   .82    .7675         .88 │
  8. │    4   1975   .94    .7675         .88 │
     └────────────────────────────────────────┘

Let’s get the max and min for each country

. bysort ctry: egen mingas_ctry = min(gas)

. bysort ctry: egen maxgas_ctry = max(gas)

We can count our observations

. bysort ctry: egen count_ctry=count(gas)

. list, sepby(ctry)

     ┌────────────────────────────────────────────────────────────────────────────────┐
     │ ctry   year   gas   infl   avggas   avggas_~y   mingas~y   maxgas~y   count_~y │
     ├────────────────────────────────────────────────────────────────────────────────┤
  1. │    1   1974   .78   1.32    .7675        .805        .78        .83          2 │
  2. │    1   1975   .83    1.4    .7675        .805        .78        .83          2 │
     ├────────────────────────────────────────────────────────────────────────────────┤
  3. │    2   1971   .69   1.15    .7675   .78333333        .69        .89          3 │
  4. │    2   1971   .77   1.15    .7675   .78333333        .69        .89          3 │
  5. │    2   1973   .89   1.29    .7675   .78333333        .69        .89          3 │
     ├────────────────────────────────────────────────────────────────────────────────┤
  6. │    3   1974   .42   1.14    .7675         .42        .42        .42          1 │
     ├────────────────────────────────────────────────────────────────────────────────┤
  7. │    4   1974   .82   1.12    .7675         .88        .82        .94          2 │
  8. │    4   1975   .94   1.18    .7675         .88        .82        .94          2 │
     └────────────────────────────────────────────────────────────────────────────────┘

We also have egen iqr(), egen median(), and egen pctile() (used with its p(#) option), among others. See more egen functions:

. help egen
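As a hedged sketch, the distributional functions combine with bysort in the same way (the new variable names here are purely illustrative):

```stata
* median gas price within each country
bysort ctry: egen medgas_ctry = median(gas)

* interquartile range of gas within each country
bysort ctry: egen iqrgas_ctry = iqr(gas)

* 75th percentile of gas within each country
bysort ctry: egen p75gas_ctry = pctile(gas), p(75)
```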

Subscripting (Head start to Mitchell 8.4)

bysort can be combined with subscripts to find the first, last, or nth observation within a group

. sort ctry year

Find first year

. bysort ctry: gen firstyr = year[1]

Find last year

. bysort ctry: gen lastyr = year[_N]

Take the difference between adjacent periods (the same as gas - L.gas, i.e., D.gas, with xtset data)

. bysort ctry: gen diffgas = gas-gas[_n-1]
(4 missing values generated)

Create an index of gas relative to the base year (first year = 100)

. bysort ctry: gen gasindex = gas/gas[1]*100
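One caution on the subscripting above: it relies on the sort ctry year issued earlier, and a plain bysort ctry: does not itself guarantee the order of observations within each group. A safer idiom (sketched here with illustrative variable names) puts the within-group sort key in parentheses, which sorts by it without grouping on it:

```stata
* (year) sorts within ctry but does not define the by-groups
bysort ctry (year): gen firstyr2 = year[1]
bysort ctry (year): gen diffgas2 = gas - gas[_n-1]
```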

6.12 More Egen

This section has more interesting egen commands that you can review if you would like. The workhorse egen commands are above.

6.13 Converting Strings to numerics

These next two sections have a lot of practical implications that you will encounter when working with survey and especially administrative data. First we will cover converting numerical characters stored in strings into numeric data we can analyze.

. use "cardio1str.dta", clear

Let’s summarize our data, but it comes back blank

. sum wt bp1 bp2 bp3

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
          wt │          0
         bp1 │          0
         bp2 │          0
         bp3 │          0

All of our numerical data are in string formats

. describe

Contains data from cardio1str.dta
  obs:             5                          
 vars:            11                          22 Dec 2009 19:51
 size:           205                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              str1    %3s                   Identification variable
wt              str5    %5s                   Weight of person
age             str2    %2s                   Age of person
bp1             str3    %3s                   Systolic BP: Trial 1
bp2             str3    %3s                   Systolic BP: Trial 2
bp3             str3    %3s                   Systolic BP: Trial 3
pl1             str3    %3s                   Pulse: Trial 1
pl2             str2    %3s                   Pulse: Trial 2
pl3             str3    %3s                   Pulse: Trial 3
income          str10   %10s                  Income
gender          str6    %6s                   Gender of person
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

Destring

destring is our main command for converting numerical characters to numerics. With destring, we must choose either to generate a new variable or to replace the current variable.

. destring age, gen(agen)
age: all characters numeric; agen generated as byte

. sum age agen

    Variable │        Obs        Mean    Std. Dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         age │          0
        agen │          5        37.4    10.83051         23         48

. drop agen

We can destring all of our numerics

. destring id-income, replace
id: all characters numeric; replaced as byte
wt: all characters numeric; replaced as double
age: all characters numeric; replaced as byte
bp1: all characters numeric; replaced as int
bp2: all characters numeric; replaced as int
bp3: all characters numeric; replaced as int
pl1: all characters numeric; replaced as int
(1 missing value generated)
pl2: all characters numeric; replaced as byte
(1 missing value generated)
pl3: contains nonnumeric characters; no replace
income: contains nonnumeric characters; no replace

Notice that the replace failed for income and pl3. pl3 contains a letter, so its values are not all digits. income contains $ and , characters, so it cannot be converted directly.

We can use the force option with pl3 to convert the X to missing

. list pl3

     ┌─────┐
     │ pl3 │
     ├─────┤
  1. │  93 │
  2. │ 125 │
  3. │   X │
  4. │  71 │
  5. │  52 │
     └─────┘

. destring pl3, replace force
pl3: contains nonnumeric characters; replaced as int
(1 missing value generated)

. list pl3

     ┌─────┐
     │ pl3 │
     ├─────┤
  1. │  93 │
  2. │ 125 │
  3. │   . │
  4. │  71 │
  5. │  52 │
     └─────┘

We should not use the force option with income, since forcing would set every value to missing and destroy real information. Instead, we need to get rid of the two non-numerical characters, $ and ,

We have two options: eliminate the “$” and “,” or use the ignore option in destring. The latter option is easier.
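For completeness, the first option could be sketched with subinstr(), which replaces every occurrence of a substring (income2 is a hypothetical working copy):

```stata
* remove every "$" and "," (the final argument . means all occurrences)
gen income2 = subinstr(income, "$", "", .)
replace income2 = subinstr(income2, ",", "", .)
destring income2, replace
```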

. destring income, replace ignore("$,")
income: characters $ , removed; replaced as double

. list

     ┌──────────────────────────────────────────────────────────────────────────┐
     │ id      wt   age   bp1   bp2   bp3   pl1   pl2   pl3     income   gender │
     ├──────────────────────────────────────────────────────────────────────────┤
  1. │  1   150.7    45   115    86   129    54    87    93   25308.92     male │
  2. │  2   186.3    23   123   136   107    92    88   125   46213.31     male │
  3. │  3   109.9    48   132   122   101   105    97     .   65234.11     male │
  4. │  4   183.4    29   105   115   121     .    79    71   89234.23     male │
  5. │  5   159.1    42   116   128   112    70     .    52   54989.87   female │
     └──────────────────────────────────────────────────────────────────────────┘

Encode

Sometimes we have categorical variables that come in as strings. We need to code them as categorical, but strings can be tricky, with leading or trailing spaces or misspellings.

One option is to use the encode command, which generates a new variable that codifies the string variable. This can be very helpful when our id variable in panel data is in string format.
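For example, assuming a hypothetical panel whose id arrives as a string variable country_name: xtset requires a numeric panel variable, and encode supplies one:

```stata
* codify the string id, then declare the panel structure
encode country_name, generate(country_id)
xtset country_id year
```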

. list id gender

     ┌─────────────┐
     │ id   gender │
     ├─────────────┤
  1. │  1     male │
  2. │  2     male │
  3. │  3     male │
  4. │  4     male │
  5. │  5   female │
     └─────────────┘

Just in case, trim any white space around the string:

strtrim() - removes leading and trailing white space
strrtrim() - removes trailing white space
strltrim() - removes leading white space
stritrim() - collapses consecutive internal white space into single blanks

. replace gender = stritrim(gender)
(0 real changes made)

. encode gender, generate(gendercode)

. list gender gendercode

     ┌───────────────────┐
     │ gender   gender~e │
     ├───────────────────┤
  1. │   male       male │
  2. │   male       male │
  3. │   male       male │
  4. │   male       male │
  5. │ female     female │
     └───────────────────┘

They look the same, but let’s use tabulation

. tab gender gendercode, nolabel

 Gender of │   Gender of person
    person │         1          2 │     Total
───────────┼──────────────────────┼──────────
    female │         1          0 │         1 
      male │         0          4 │         4 
───────────┼──────────────────────┼──────────
     Total │         1          4 │         5 

encode creates value labels for the new variable

. codebook gendercode

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
gendercode                                                                                       Gender of person
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (long)
                 label:  gendercode

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/5

            tabulation:  Freq.   Numeric  Label
                             1         1  female
                             4         2  male

6.14 Converting numerics to strings

In my experience, it is more common to destring, but there are situations where you need to convert a numeric to a string. The most common examples for me are ZIP codes and FIPS codes. As a side note, whenever working with or merging state-level data, always, always, always use FIPS codes rather than state names or state abbreviations.
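As a sketch of that advice, assuming a numeric state FIPS variable named fips, the string() function with a zero-padded format produces exact two-character codes for merging:

```stata
* "%02.0f" pads one-digit codes with a leading zero (e.g., 6 becomes "06")
gen stfips = string(fips, "%02.0f")
```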

. use "cardio3.dta", clear

. list zipcode

     ┌─────────┐
     │ zipcode │
     ├─────────┤
  1. │    1003 │
  2. │   90095 │
  3. │   43409 │
  4. │   23219 │
  5. │   66214 │
     └─────────┘

This ZIP code list is a problem and likely will not merge properly, especially for ZIP codes that have a leading 0.

. tostring zipcode, generate(zips)
zips generated as str5

. replace zips = "0" + zips if strlen(zips) == 4
(1 real change made)

. replace zips = "00" + zips if strlen(zips) == 3
(0 real changes made)

. list zipcode zips

     ┌─────────────────┐
     │ zipcode    zips │
     ├─────────────────┤
  1. │    1003   01003 │
  2. │   90095   90095 │
  3. │   43409   43409 │
  4. │   23219   23219 │
  5. │   66214   66214 │
     └─────────────────┘

I don’t recommend just formatting the data to 5 digits. The data should be exact for matching and merging datasets.

I recommend using tostring, but there is also decode, which is the opposite of encode. If we want to use the value labels instead of the numeric codes for some reason, the decode command can be used.

. decode famhist, gen(famhists)

. list famhist famhists, nolabel

     ┌────────────────────┐
     │ famhist   famhists │
     ├────────────────────┤
  1. │       0      No HD │
  2. │       1     Yes HD │
  3. │       0      No HD │
  4. │       1     Yes HD │
  5. │       1     Yes HD │
     └────────────────────┘

. codebook famhist famhists

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
famhist                                                                           Family history of heart disease
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  numeric (long)
                 label:  famhistl

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/5

            tabulation:  Freq.   Numeric  Label
                             2         0  No HD
                             3         1  Yes HD

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
famhists                                                                          Family history of heart disease
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

                  type:  string (str6)

         unique values:  2                        missing "":  0/5

            tabulation:  Freq.  Value
                             2  "No HD"
                             3  "Yes HD"

               warning:  variable has embedded blanks

6.15 Renaming and reordering

Reordering and renaming seem straightforward enough, but there are some useful tips that will help consolidate your scripts. For example, you don't need a new rename command for every single variable to be renamed. We can rename in groups.

. use "cardio2.dta", clear

. describe

Contains data from cardio2.dta
  obs:             5                          
 vars:            12                          22 Dec 2009 19:51
 size:           100                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 Identification variable
age             byte    %3.0f                 Age of person
pl1             int     %3.0f                 Pulse: Trial 1
bp1             int     %3.0f                 Systolic BP: Trial 1
pl2             byte    %3.0f                 Pulse: Trial 2
bp2             int     %3.0f                 Systolic BP: Trial 2
pl3             int     %3.0f                 Pulse: Trial 3
bp3             int     %3.0f                 Systolic BP: Trial 3
pl4             int     %3.0f                 Pulse: Trial 4
bp4             int     %3.0f                 Systolic BP: Trial 4
pl5             byte    %3.0f                 Pulse: Trial 5
bp5             int     %3.0f                 Systolic BP: Trial 5
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

Rename

The rename command is straightforward and allows us to rename our variable(s) of interest.

. rename age age_yrs

We can also do bulk renames if a group of variables shares common characters in the variable name. We can rename our pl group to pulse:

. rename pl? pulse? 

where the ? operator stands in for exactly one character

. describe

Contains data from cardio2.dta
  obs:             5                          
 vars:            12                          22 Dec 2009 19:51
 size:           100                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 Identification variable
age_yrs         byte    %3.0f                 Age of person
pulse1          int     %3.0f                 Pulse: Trial 1
bp1             int     %3.0f                 Systolic BP: Trial 1
pulse2          byte    %3.0f                 Pulse: Trial 2
bp2             int     %3.0f                 Systolic BP: Trial 2
pulse3          int     %3.0f                 Pulse: Trial 3
bp3             int     %3.0f                 Systolic BP: Trial 3
pulse4          int     %3.0f                 Pulse: Trial 4
bp4             int     %3.0f                 Systolic BP: Trial 4
pulse5          byte    %3.0f                 Pulse: Trial 5
bp5             int     %3.0f                 Systolic BP: Trial 5
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

An alternative is to rename by explicit lists, which prevents any unintended renaming from the ? or * wildcard operators

. rename (bp1 bp2 bp3 bp4 bp5) (bpress1 bpress2 bpress3 bpress4 bpress5)

. describe

Contains data from cardio2.dta
  obs:             5                          
 vars:            12                          22 Dec 2009 19:51
 size:           100                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 Identification variable
age_yrs         byte    %3.0f                 Age of person
pulse1          int     %3.0f                 Pulse: Trial 1
bpress1         int     %3.0f                 Systolic BP: Trial 1
pulse2          byte    %3.0f                 Pulse: Trial 2
bpress2         int     %3.0f                 Systolic BP: Trial 2
pulse3          int     %3.0f                 Pulse: Trial 3
bpress3         int     %3.0f                 Systolic BP: Trial 3
pulse4          int     %3.0f                 Pulse: Trial 4
bpress4         int     %3.0f                 Systolic BP: Trial 4
pulse5          byte    %3.0f                 Pulse: Trial 5
bpress5         int     %3.0f                 Systolic BP: Trial 5
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

Order

We can move our variables around, which can be especially helpful with panel data, since we want the unit id and the time period next to one another.

Our variables are out of order: pl and bp alternate, but we may want to keep the pl variables and the bp variables next to one another.

Reorder blood pressure and pulse

. order id age_yrs bp* pul*

Or,

. order pul* bp*, after(age_yrs) 

Or,

. order bp*, before(pulse1)

If you generate a new variable, it defaults to the end of the dataset, but we may want it somewhere else. Say we want age-squared, and we want it next to age. We can use the after (or before) option of the generate command.

. gen age2=age_yrs*age_yrs, after(age_yrs)

If you wanted to reorder the whole dataset alphabetically, then

. order _all, alphabetic

. describe

Contains data from cardio2.dta
  obs:             5                          
 vars:            13                          22 Dec 2009 19:51
 size:           140                          
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
age2            double  %10.0g                
age_yrs         byte    %3.0f                 Age of person
bpress1         int     %3.0f                 Systolic BP: Trial 1
bpress2         int     %3.0f                 Systolic BP: Trial 2
bpress3         int     %3.0f                 Systolic BP: Trial 3
bpress4         int     %3.0f                 Systolic BP: Trial 4
bpress5         int     %3.0f                 Systolic BP: Trial 5
id              byte    %3.0f                 Identification variable
pulse1          int     %3.0f                 Pulse: Trial 1
pulse2          byte    %3.0f                 Pulse: Trial 2
pulse3          int     %3.0f                 Pulse: Trial 3
pulse4          int     %3.0f                 Pulse: Trial 4
pulse5          byte    %3.0f                 Pulse: Trial 5
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

There is a sequential option as well if your var1, …, vark are out of order

. order bpress5 bpress1-bpress4, after(age2)

. order bp*, sequential