Chapter 4 Stata Programming Techniques

This chapter gives a brief overview of techniques that will be helpful for completing assignments and empirical projects: local macros, loops, tempfiles, bysort and egen, and weights.

  1. Macros
  2. Local Macros
  3. Global Macros
  4. Loops
  5. Dataframes and Tempfiles
  6. Appending
  7. Bysort and EGEN
  8. Survey Weights
  9. Accessing results in Stata’s stored memory

4.1 Stata Programming: Macros

Macros are one of the essential parts of Stata programming. Macro variables have many uses and can hold strings or numbers. We can use them in loops, in regressions, in calling temporary files, etc. There are two kinds of macros: local and global.

4.2 Local Macros

Local macros exist only within a single executed do-file (or the block of lines you run together) and are removed once that run finishes. If you run code line by line, you must run the line that initializes the macro together with the command that uses it.

You need to initialize the local macro, e.g., local i = 1. Key: you call the macro by wrapping its name between the ` key (the left quote key just above the Tab key) and the ' key, e.g., display "`i'" or replace x = 0 if y == `i'

local i = 1
display "`i'"
1
local k = `i' + 1
display "`k'"
2

4.3 Global Macros

Global macros are variables that stay in memory after a do-file finishes running, until you drop them or close Stata. These macro variables work across multiple do-files as well. You need to initialize a global macro, e.g.:

global j = 2

Next, you call the macro with $

display "$j"
2
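A common use of a global is to hold a project folder so that every do-file can build file paths from it. The lines below are only a minimal sketch: the name projdir is hypothetical, and it simply reuses Stata's current working directory (c(pwd)) so that it runs anywhere.

```stata
* Store the current working directory in a global (projdir is a made-up name)
global projdir "`c(pwd)'"

* Braces mark where the macro name ends when other text follows it
display "${projdir}/data"

* List the global, then drop it so it does not leak into later work
macro list projdir
macro drop projdir
```

Because globals persist for the whole Stata session, dropping them when you are done helps avoid one do-file silently picking up a value set by another.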

You can use local and global macros in qualifiers as well. Remember that a comparison in an if qualifier uses ==, not =.

replace z = 1 if y == $j

Another great use of macros is to collect the list of levels of a categorical variable and store it in a local macro.

levelsof varname, local(levels) 
foreach l of local levels {
  command if varname == `l'
}

We’ll demo this later, but it is quite useful.
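As a quick preview, here is a small runnable version of the levelsof pattern. It is a sketch that assumes the auto dataset shipped with Stata stands in for your own data; rep78 and price are variables in that example dataset, not in the CPS files.

```stata
* Assumption: auto.dta, which ships with Stata, stands in for your own data
sysuse auto, clear

* Store the distinct non-missing values of rep78 in the local macro levels
levelsof rep78, local(levels)

* Run a command once for each level
foreach l of local levels {
    quietly summarize price if rep78 == `l'
    display "rep78 = `l': mean price = " %9.2f r(mean)
}
```

The loop never hard-codes the category values, so it keeps working if the set of categories changes.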

You can also use local macros to test regression models. We’ll demo this later, as well.
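One way this can look, as a minimal sketch: keep each set of regressors in a local macro and reuse it across specifications. The macro names base and extra are hypothetical, and the auto dataset shipped with Stata is used only so the example runs on its own.

```stata
* Assumption: auto.dta, which ships with Stata, stands in for your own data
sysuse auto, clear

* Each local holds a list of regressors (names are illustrative)
local base  mpg weight
local extra `base' length foreign

* Specification 1: baseline controls only
regress price `base'

* Specification 2: baseline plus the additional controls
regress price `extra'
```

Changing a control in one place then updates every model that uses the list.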

4.4 Looping

Loops are useful whenever you want to apply a function or command over a repeated set of values. Stata offers forvalues and foreach. I typically use foreach given its versatility, but Stata says that forvalues can be more efficient. E.g.: Loop over years and increment the counter i.

4.4.1 Loop over a number

local i = 0
foreach num of numlist 2015/2019 {
  display "`num'"
  local i = `i'+1
  display "This is the `i' loop"
}
2015
This is the 1 loop
2016
This is the 2 loop
2017
This is the 3 loop
2018
This is the 4 loop
2019
This is the 5 loop
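For comparison, the same loop can be written with forvalues. This is a sketch of the equivalent syntax rather than a replacement for the example above; the second loop shows the start(step)end form of a range.

```stata
local i = 0
* forvalues takes a range directly instead of a numlist
forvalues year = 2015/2019 {
    local ++i
    display "`year' is loop number `i'"
}

* Step through a range in increments of 5: 0, 5, 10, 15, 20
forvalues n = 0(5)20 {
    display `n'
}
```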

4.4.2 Loop over a local macro

E.g.: Loop over months and years to build the names of the files to read in

local month jan feb mar apr may jun jul aug sep oct nov dec
foreach y of numlist 2018/2019 {
  foreach m of local month {
    local filename = "`m'`y'.dta"
    display "`filename'"
  }
}
jan2018.dta
feb2018.dta
mar2018.dta
apr2018.dta
may2018.dta
jun2018.dta
jul2018.dta
aug2018.dta
sep2018.dta
oct2018.dta
nov2018.dta
dec2018.dta
jan2019.dta
feb2019.dta
mar2019.dta
apr2019.dta
may2019.dta
jun2019.dta
jul2019.dta
aug2019.dta
sep2019.dta
oct2019.dta
nov2019.dta
dec2019.dta
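Before actually opening files inside a loop like this, it can help to check that each one exists. The sketch below uses confirm file for that; the file names are the same illustrative ones as above and are not assumed to exist on your machine, so the loop simply reports what it finds.

```stata
local month jan feb mar
local found = 0
foreach y of numlist 2018/2019 {
    foreach m of local month {
        local filename "`m'`y'.dta"
        * capture suppresses the error; _rc is 0 only if the file exists
        capture confirm file "`filename'"
        if _rc == 0 {
            display "found:   `filename'"
            local ++found
        }
        else {
            display "missing: `filename'"
        }
    }
}
display "`found' of 6 files found"
```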

4.5 Data Frames and Tempfiles

Data frames are a relatively new part of Stata (Stata 16+) that let you hold several datasets in memory and switch between them. Previously, you could only work on one dataset at a time. You can read an introduction to Stata data frames from Asjad Naqvi.
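A minimal sketch of the frame commands, assuming Stata 16 or newer. The frame name scratch is hypothetical, and the auto and census example datasets that ship with Stata stand in for real data.

```stata
* Requires Stata 16 or newer
sysuse auto, clear                    // loads into the current frame

* Make a second, empty frame (scratch is a made-up name)
capture frame drop scratch
frame create scratch

* Run commands in the other frame without leaving the current one
frame scratch: sysuse census, clear
frame scratch: summarize pop

* List the frames in memory; the current frame still holds auto
frames dir
summarize price

frame drop scratch
```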

For older versions of Stata, we have the tempfile workaround. Tempfiles get around the older Stata limitation of one dataset in memory at a time. Stata 16 introduced data frames, but the tempfile method still works well. Below, we append two years (2023 and 2024) of monthly CPS files into one dataset. Note: a tempfile name is really just a local macro that holds the path to a temporary file.
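To see that a tempfile is just a local macro, the sketch below displays the path it holds and then uses it to set part of a dataset aside and bring it back. The macro name foreign_cars is hypothetical, and the auto dataset shipped with Stata stands in for real data.

```stata
sysuse auto, clear

* tempfile creates a local macro holding a temporary file path
tempfile foreign_cars
display "`foreign_cars'"

* Save the foreign cars to the tempfile, then restore the full data
preserve
keep if foreign == 1
save "`foreign_cars'"
restore

* Keep the domestic cars and append the foreign cars back on
keep if foreign == 0
append using "`foreign_cars'"
count
```

Stata deletes the temporary file automatically when the do-file finishes.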

4.6 Appending multiple CPS files

Get CPS Data

The small CPS files contain only a few variables compared to the full files. The full Census CPS microdata files can be found at: https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html.

Set up an initial tempfile so we can add each month of CPS data to it. Save it with the emptyok option, which allows Stata to save a dataset with no observations.

tempfile cps
save `cps', emptyok

Create a local macro of month names, then loop over each year and month to append the monthly CPS files into one CPS file.

cd "/Users/Sam/Desktop/Data/CPS"
*Set month macro
local month jan feb mar apr may jun jul aug sep oct nov dec
local filecount = 0

*Start from an empty dataset so the tempfile begins with 0 observations
clear
tempfile cps
save `cps', emptyok

foreach y of numlist 23/24 {

  foreach m of local month {

    *Show the year and month
    display "`m'`y'"
    local filename "small`m'`y'pub.dta"
    display "`filename'"

    *Open the monthly data file
    use "`filename'", clear

    *Append the months accumulated so far in the cps tempfile
    quietly append using `cps'

    save `cps', replace
    clear

    *Count the number of monthly files appended
    local filecount = `filecount' + 1
  }
}
*Get the appended CPS File
use `cps', clear
tab hrmonth hryear4
save "cps23_24.dta", replace
/Users/Sam/Desktop/Data/CPS




(note: dataset contains 0 observations)
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved

jan23
smalljan23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
feb23
smallfeb23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
mar23
smallmar23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
apr23
smallapr23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
may23
smallmay23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jun23
smalljun23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jul23
smalljul23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
aug23
smallaug23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
sep23
smallsep23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
oct23
smalloct23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
nov23
smallnov23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
dec23
smalldec23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jan24
smalljan24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
feb24
smallfeb24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
mar24
smallmar24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
apr24
smallapr24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
may24
smallmay24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jun24
smalljun24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jul24
smalljul24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
aug24
smallaug24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
sep24
smallsep24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
oct24
smalloct24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
nov24
smallnov24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
dec24
smalldec24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved

           |        HRYEAR4
   HRMONTH |      2023       2024 |     Total
-----------+----------------------+----------
         1 |   125,982    126,802 |   252,784 
         2 |   124,754    126,784 |   251,538 
         3 |   124,477    124,581 |   249,058 
         4 |   126,798    126,850 |   253,648 
         5 |   127,559    126,945 |   254,504 
         6 |   127,491    126,112 |   253,603 
         7 |   127,424    126,158 |   253,582 
         8 |   127,930    127,031 |   254,961 
         9 |   126,665    126,590 |   253,255 
        10 |   127,941    126,387 |   254,328 
        11 |   126,917    126,686 |   253,603 
        12 |   126,832    126,743 |   253,575 
-----------+----------------------+----------
     Total | 1,520,770  1,517,669 | 3,038,439 


file cps23_24.dta saved

The variables are documented in the CPS Basic Data Dictionary.

Check the tabulation above to confirm that every month of both years is present.
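The same check can be automated instead of read off the table. This is a sketch that assumes the cps23_24.dta file saved above is in the current working directory; the variable name ym is hypothetical. If the file is not there, the code only prints a message.

```stata
* Assumption: cps23_24.dta (saved above) is in the current working directory
capture confirm file "cps23_24.dta"
if _rc == 0 {
    use "cps23_24.dta", clear
    * Number the distinct year-month combinations 1, 2, 3, ...
    egen ym = group(hryear4 hrmonth)
    quietly summarize ym
    display "Distinct year-month cells: " r(max)
    * Stop with an error unless all 24 months are present
    assert r(max) == 24
}
else {
    display "cps23_24.dta not found in `c(pwd)'"
}
```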

4.7 By Sort and EGEN

We will get into bysort and egen in more detail later in the semester. However, bysort: egen is such a powerful combination that it is worth bringing up now. You can summarize, replace, or create new variables by multiple groups with the bysort and egen commands. Let’s generate laborforce from pemlr, which is coded as follows (NILF means not in the labor force):

  1. 1 is employed at work;
  2. 2 is employed absent;
  3. 3 is unemployed layoff;
  4. 4 is unemployed looking;
  5. 5 is NILF retired;
  6. 6 is NILF disabled; and
  7. 7 is NILF other

gen laborforce = .
replace laborforce = 0 if pemlr >= 5 & pemlr <= 7
replace laborforce = 1 if pemlr >= 1 & pemlr <= 4
label define laborforce1 0 "NILF" 1 "Labor Force"
label values laborforce laborforce1
tab laborforce

gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed

Generate a race/ethnicity category from existing variables.

gen race_ethnicity = .
replace race_ethnicity = 1 if ptdtrace == 1 & pehspnon == 2
replace race_ethnicity = 2 if ptdtrace == 2 & pehspnon == 2 
replace race_ethnicity = 3 if pehspnon == 1
replace race_ethnicity = 4 if ptdtrace == 3 & pehspnon == 2
replace race_ethnicity = 5 if (ptdtrace == 4 | ptdtrace == 5) & pehspnon == 2
replace race_ethnicity = 6 if (ptdtrace >= 6 & ptdtrace <= 26) & pehspnon == 2
label define race_ethnicity1 1 "White NH" 2 "Black NH" 3 "Hispanic or Latino/a" ///
4 "Native American NH" 5 "Asian or Pacific Islander NH" 6 "Multiracial NH"
label values race_ethnicity race_ethnicity1
tab race_ethnicity 

Sort by Sex

sort pesex
*Summarize laborforce by sex
by pesex: sum laborforce
*bysort sorts and runs the by-group command in one step
bysort pesex: sum laborforce

Sort by Sex and Race.

sort pesex race_ethnicity
*Summarize laborforce by sex and race
by pesex race_ethnicity: sum laborforce
bysort pesex race_ethnicity: sum laborforce

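Generate an age indicator to identify individuals aged 16 and older.

gen over_16 = .
replace over_16 = 0 if prtage < 16
*Stata treats missing values as larger than any number, so exclude them explicitly
replace over_16 = 1 if prtage >= 16 & !missing(prtage)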
label define over_16a 0 "Under 16" 1 "16 and older"
label values over_16 over_16a
tab over_16

Generate the unweighted labor force participation rate for each group with sort, by, and egen. bysort works as well, but I usually put sort on one line and by on the next.

sort pesex race_ethnicity
by pesex race_ethnicity: egen mean_lfpr = mean(laborforce) if over_16 == 1
by pesex race_ethnicity: sum mean_lfpr
*Counting and indexing/subscripting within groups
gen idcount = .
by pesex race_ethnicity: replace idcount = _n
gen idcount2 = .
by pesex race_ethnicity: replace idcount2 = idcount[_N]
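The same group-wise pattern can be tried on a small dataset. The sketch below assumes the auto dataset shipped with Stata stands in for the CPS data, and all new variable names are hypothetical. It shows a group mean from egen and the within-group _n and _N counters.

```stata
sysuse auto, clear

* Group mean attached to every observation in the group
bysort foreign: egen mean_price = mean(price)

* _n is the observation's position within its group; _N is the group size
bysort foreign: gen obs_in_group = _n
bysort foreign: gen group_size = _N

* Flag one observation per group so each group is listed once
egen tag = tag(foreign)
list foreign mean_price group_size if tag == 1
```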

4.8 Retrieving estimates stored in Stata’s memory

Instead of manually entering summary or regression results, we can retrieve them from the scalars Stata keeps in memory. This reduces the human error that comes with copying and pasting.

For summary statistics we can use the return list command.

cd "/Users/Sam/Desktop/Data/CPS"
use smalljan24pub.dta, clear
sum pternwa if prerelg==1

return list

gen demean_earnings=pternwa-`r(mean)' if prerelg==1
summarize demean_earnings
/Users/Sam/Desktop/Data/CPS

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     pternwa |     10,666    137274.8    146839.5          0    1125627


scalars:
                  r(N) =  10666
              r(sum_w) =  10666
               r(mean) =  137274.7503281455
                r(Var) =  21561833661.67427
                 r(sd) =  146839.4826389492
                r(min) =  0
                r(max) =  1125627
                r(sum) =  1464172487

(116,136 missing values generated)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
demean_ear~s |     10,666   -3.79e-12    146839.5  -137274.8   988352.2
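Stored r() results are overwritten by the next command that returns results, so it is often safer to copy them first. The sketch below, which assumes the auto dataset shipped with Stata and uses hypothetical scalar names, copies the mean and standard deviation into scalars and uses them to standardize a variable.

```stata
sysuse auto, clear
quietly summarize price

* Copy the results right away: the next r-class command replaces r()
scalar price_mean = r(mean)
scalar price_sd   = r(sd)

* Standardize price using the saved scalars
gen z_price = (price - price_mean) / price_sd
summarize z_price
```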

We can also grab information about a regression with the ereturn list command.

cd "/Users/Sam/Desktop/Data/CPS"
use smalljan24pub.dta, clear
gen lnearn=ln(pternwa) if prerelg==1
gen exp=prtage-16
replace exp=0 if exp==-1

reg lnearn c.exp##c.exp i.peeduca i.peernlab if prerelg==1

ereturn list

matrix list e(b)

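*Coefficients and standard errors
*Note: 1.peernlab is the omitted base level, so both values display as 0; use 2.peernlab for the estimated coefficient
display _b[1.peernlab] _se[1.peernlab]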
/Users/Sam/Desktop/Data/CPS


(116,150 missing values generated)

(27,667 missing values generated)

(1,209 real changes made)

      Source |       SS           df       MS      Number of obs   =    10,652
-------------+----------------------------------   F(18, 10633)    =    238.75
       Model |  2310.98949        18  128.388305   Prob > F        =    0.0000
    Residual |  5717.80316    10,633  .537741292   R-squared       =    0.2878
-------------+----------------------------------   Adj R-squared   =    0.2866
       Total |  8028.79265    10,651  .753806465   Root MSE        =    .73331

------------------------------------------------------------------------------
      lnearn |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         exp |   .0654986   .0019087    34.31   0.000     .0617571    .0692401
             |
 c.exp#c.exp |  -.0010166   .0000324   -31.38   0.000    -.0010801   -.0009531
             |
     peeduca |
         32  |   .2151811   .2348611     0.92   0.360    -.2451906    .6755528
         33  |   .1297205   .2243499     0.58   0.563    -.3100474    .5694884
         34  |   .1947927   .2240222     0.87   0.385    -.2443328    .6339181
         35  |  -.0479483   .2153955    -0.22   0.824    -.4701637    .3742672
         36  |  -.1032521   .2123809    -0.49   0.627    -.5195583    .3130541
         37  |   .0602974   .2103972     0.29   0.774    -.3521206    .4727154
         38  |   .1407128   .2128599     0.66   0.509    -.2765325    .5579581
         39  |   .4391957    .203923     2.15   0.031     .0394684     .838923
         40  |   .4534936   .2043117     2.22   0.026     .0530045    .8539827
         41  |   .5707234   .2059659     2.77   0.006     .1669917     .974455
         42  |   .5189681   .2053085     2.53   0.011     .1165251    .9214112
         43  |    .914999   .2038914     4.49   0.000     .5153337    1.314664
         44  |   1.025085   .2044413     5.01   0.000     .6243415    1.425828
         45  |   1.463769   .2108507     6.94   0.000     1.050462    1.877076
         46  |     1.3309   .2076626     6.41   0.000     .9238427    1.737958
             |
  2.peernlab |  -.0496265    .024125    -2.06   0.040    -.0969159   -.0023371
       _cons |   10.07725   .2065816    48.78   0.000     9.672309    10.48219
------------------------------------------------------------------------------


scalars:
                  e(N) =  10652
               e(df_m) =  18
               e(df_r) =  10633
                  e(F) =  238.7547824846974
                 e(r2) =  .2878377352595289
               e(rmse) =  .7333084563678371
                e(mss) =  2310.989494457061
                e(rss) =  5717.803159756108
               e(r2_a) =  .2866321563292809
                 e(ll) =  -11800.89312136709
               e(ll_0) =  -13608.80112438194
               e(rank) =  19

macros:
            e(cmdline) : "regress lnearn c.exp##c.exp i.peeduca i.peernlab if prerelg==1"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "lnearn"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 21
                  e(V) :  21 x 21

functions:
             e(sample)   


e(b)[1,21]
                     c.exp#        31b.         32.         33.         34.         35.         36.         37.
           exp       c.exp     peeduca     peeduca     peeduca     peeduca     peeduca     peeduca     peeduca
y1   .06549863  -.00101659           0   .21518108   .12972052   .19479266  -.04794826   -.1032521   .06029741

            38.         39.         40.         41.         42.         43.         44.         45.         46.
       peeduca     peeduca     peeduca     peeduca     peeduca     peeduca     peeduca     peeduca     peeduca
y1    .1407128   .43919568    .4534936   .57072336   .51896812   .91499896   1.0250848    1.463769   1.3309002

            1b.          2.            
      peernlab    peernlab       _cons
y1           0  -.04962649   10.077247

00
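A compact sketch of the same idea on a small model, assuming the auto dataset shipped with Stata; the matrix name b is hypothetical. It pulls a coefficient and its standard error, computes the t statistic from them, and copies the coefficient vector out of e(b).

```stata
sysuse auto, clear
regress price mpg weight

* Coefficient, standard error, and the t statistic built from them
display "b = " _b[mpg] "  se = " _se[mpg]
display "t = " _b[mpg]/_se[mpg]

* Scalars stored by regress
display "R-squared = " %5.3f e(r2)

* Copy the coefficient vector to a matrix and read its first element
matrix b = e(b)
matrix list b
display b[1,1]
```

Putting a string such as " " between two displayed values keeps them from running together on one line.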

4.9 Survey Weights

We can use the Current Population Survey public use microdata and replicate published BLS numbers. We need to adjust our numbers with the appropriate weights in order to replicate the published numbers.

First, we need to rescale the weights. In the CSV files the weights need to be divided by 10,000, since the documentation implies 4 decimal places.

Composite weights (PWCMPWGT) are used for final BLS tabulations of the labor force.

gen cmpwgt2 = pwcmpwgt/10000

Outgoing rotation weights (PWORWGT) are used for earnings, which are collected only from persons in interviews 4 and 8.

gen orwgt2 = pworwgt/10000

PWSSWGT weights are used for general tabulations of sex, race, states, etc.

gen sswgt2 = pwsswgt/10000

PWVETWGT are used to study the veteran population

gen vetwgt2 = pwvetwgt/10000

PWLGWGT are weights used to study someone over multiple CPS interviews

gen lgwgt2 = pwlgwgt/10000
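Since all five weights are rescaled the same way, the step can also be written as a loop. The sketch below runs on made-up data: the raw value 25000000 is purely illustrative, chosen so that 4 implied decimals give a weight of 2,500.

```stata
* Toy data (not real CPS values): a raw weight of 25000000 means 2,500.0000
clear
set obs 3
foreach w in cmpwgt orwgt sswgt vetwgt lgwgt {
    gen double pw`w' = 25000000
}

* Divide every raw weight by 10,000 to remove the 4 implied decimals
foreach w in cmpwgt orwgt sswgt vetwgt lgwgt {
    gen double `w'2 = pw`w' / 10000
}
summarize cmpwgt2 lgwgt2
```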

If we are appending all months, we need to divide by 12 for the Basic CPS to get annual weights.

    replace cmpwgt2 = cmpwgt2/12

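If we are appending all months across multiple years and want one pooled average, we divide the rescaled weight by the total number of months appended (24 for two years). If we tabulate each year separately, as in the full example at the end of this section, we divide by 12 instead.

cd "/Users/Sam/Desktop/Data/CPS"
use cps23_24.dta, clear
gen cmpwgt3 = pwcmpwgt/10000/24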

Use the svyset command to set up the survey design.

    svyset [pw=cmpwgt3]

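Use the svy: prefix in front of a command to apply the survey design.

    svy: tab employed hryear4, count cellwidth(20) format(%20.2gc)

The full set of commands for the appended 2023 and 2024 file follows. Because the table reports each year separately, the weights are divided by 12 rather than 24.

cd "/Users/Sam/Desktop/Data/CPS"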
use cps23_24.dta
gen cmpwgt3 = pwcmpwgt/10000
replace cmpwgt3 = cmpwgt3/12
svyset [pw=cmpwgt3]
gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed
svy: tab employed hryear4, count cellwidth(20) format(%20.2gc)
/Users/Sam/Desktop/Data/CPS


(655,995 missing values generated)

(1,946,105 real changes made)

      pweight: cmpwgt3
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

(3,038,439 missing values generated)

(838,080 real changes made)

(1,138,576 real changes made)

    employed |      Freq.     Percent        Cum.
-------------+-----------------------------------
Not Employed |    838,080       42.40       42.40
    Employed |  1,138,576       57.60      100.00
-------------+-----------------------------------
       Total |  1,976,656      100.00

(running tabulate on estimation sample)

Number of strata   =         1                Number of obs     =    1,976,656
Number of PSUs     = 1,976,656                Population size   =  535,513,623
                                              Design df         =    1,976,655

----------------------------------------------------------------------------
          |                             HRYEAR4                             
 employed |                 2023                  2024                 Total
----------+-----------------------------------------------------------------
 Not Empl |          105,905,668           107,225,904           213,131,572
 Employed |          161,036,522           161,345,530           322,382,051
          | 
    Total |          266,942,190           268,571,433           535,513,623
----------------------------------------------------------------------------
  Key:  weighted count

  Pearson:
    Uncorrected   chi2(1)         =   12.9838
    Design-based  F(1, 1976655)   =    9.8919     P = 0.0017
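To see what the weighted count is doing, here is a tiny sketch on made-up data (four hypothetical respondents, not CPS records). A weighted count is just the sum of the weights in each cell, which the last lines confirm by hand.

```stata
* Toy data: each row is one respondent with a made-up weight
clear
input byte employed double wgt
1 1500
1 2500
0 1000
0 3000
end

* Declare the weight, then tabulate weighted counts
svyset [pw=wgt]
svy: tab employed, count format(%12.0fc)

* The weighted count of the employed equals the sum of their weights
quietly summarize wgt if employed == 1
display r(sum)
```

Here the two employed respondents stand for 1,500 + 2,500 = 4,000 people, which is the figure the weighted table reports for that row.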