Chapter 4 Stata Programming Techniques
This chapter gives a brief overview of techniques that will be helpful for completing assignments and empirical projects, including local macros, loops, tempfiles, bysort and egen, and weights.
- Macros
- Local Macros
- Global Macros
- Loops
- Dataframes and Tempfiles
- Appending
- Bysort and EGEN
- Survey Weights
- Accessing results in Stata’s stored memory
4.1 Stata Programming: Macros
Macros are an essential part of Stata programming. Macro variables have many uses and can contain strings or numbers. We can use them in loops, in regressions, in calling temporary files, etc. There are two kinds of macros: local and global.
4.2 Local Macros
Local macros exist only within the do-file (or program) in which they are defined and are removed once it finishes running. If you run a do-file piece by piece, you must run the line of code that initializes the macro together with the command that uses it, in the same execution.
You need to initialize the local macro before using it.
E.g.: local i = 1
Key: To call the macro, wrap its name in a left quote ` (the backtick key just above the Tab key) and a right quote ' (the apostrophe key). E.g.: display "`i'"
*E.g.: replace x = 0 if y == `i'
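The steps above can be sketched as follows. This is a minimal illustration that assumes Stata's built-in auto dataset rather than the course data, and the variable name is_foreign is hypothetical.

```stata
* Minimal sketch using the built-in auto dataset (an assumption, not the course data)
sysuse auto, clear

* Initialize a local macro holding a number
local i = 1

* Reference it with a left quote and a right quote
display "`i'"

* Use the macro inside an if qualifier (note the double equals sign)
gen byte is_foreign = 0
replace is_foreign = 1 if foreign == `i'
count if is_foreign == 1
```

Run the whole block at once: if the local line is executed separately from the lines that use it, the macro will be empty when it is referenced.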
4.3 Global Macros
Global macros stay in memory for the rest of the Stata session, even after the do-file that created them has finished running. These macro variables work across multiple do-files as well. You need to initialize a global macro before using it.
Next, you call the macro by prefixing its name with $.
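A minimal sketch of initializing and calling a global macro is shown here. The macro name j and the text "loop" are hypothetical choices for illustration only.

```stata
* Initialize a global macro (the name j is a hypothetical example)
global j = 2

* Call it with a dollar sign
display $j

* Curly braces mark where the macro name ends when other text follows
display "loop${j}"

* Drop the global when it is no longer needed
* macro drop j
```

Because globals persist across do-files, it is usually safer to give them distinctive names and drop them when finished.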
You can use local and global macro variables in if and in qualifiers as well.
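As a small sketch of macros inside qualifiers, the example below assumes the built-in auto dataset; the macro names cutoff, n, and origin are hypothetical.

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear

local cutoff = 6000     // hypothetical price threshold
local n = 5             // hypothetical number of rows to list
global origin = 1       // 1 = foreign in the auto dataset

* Local macro inside an in qualifier
list make price in 1/`n'

* Local and global macros inside an if qualifier
count if price > `cutoff' & foreign == $origin
```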
Another great use of macros is to find the list of levels of a categorical variable (for example, with levelsof and its local() option) and store it in a macro variable.
We'll demo this later, but it is quite useful.
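A short sketch of this idea, assuming the built-in auto dataset and a hypothetical macro name rep_levels:

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear

* Store the distinct non-missing values of rep78 in a local macro
levelsof rep78, local(rep_levels)
display "`rep_levels'"

* Loop over the stored levels
foreach lev of local rep_levels {
    quietly count if rep78 == `lev'
    display "rep78 = `lev': " r(N) " cars"
}
```

Note that levelsof skips missing values unless the missing option is specified.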
You can also use local macros to test regression models. We'll demo this later, as well.
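One common pattern, sketched here under the assumption of the built-in auto dataset, is to store a list of control variables in a local macro and reuse it across specifications. The macro name controls is hypothetical.

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear

* Store a set of control variables once
local controls weight length

* Build up specifications without retyping the variable list
regress price mpg
regress price mpg `controls'
regress price mpg `controls' i.foreign
```

Changing the control set then requires editing only the local line.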
4.4 Looping
Looping has many uses when you want to apply a function or command over a repeated set of values. There are two main loop commands, forvalues and foreach. I typically use foreach given its versatility, but Stata says that forvalues can be more efficient.
*E.g.: Loop over years and iterate i
4.4.1 Loop over a number
local i = 0
foreach num of numlist 2015/2019 {
display "`num'"
local i = `i'+1
display "This is the `i' loop"
}
2015
This is the 1 loop
2016
This is the 2 loop
2017
This is the 3 loop
2018
This is the 4 loop
2019
This is the 5 loop
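For comparison, the same loop can be sketched with forvalues, which only iterates over numeric ranges. The counter uses the local ++i shorthand, which is equivalent to local i = `i' + 1.

```stata
* Same loop written with forvalues instead of foreach
local i = 0
forvalues num = 2015/2019 {
    display "`num'"
    local ++i                       // increment the counter by 1
    display "This is the `i' loop"
}
```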
4.4.2 Loop over a local macro
*E.g.: Loop over months and years to read in new files
local month jan feb mar apr may jun jul aug sep oct nov dec
foreach y of numlist 2018/2019 {
foreach m of local month {
local filename = "`m'`y'.dta"
display "`filename'"
}
}
jan2018.dta
feb2018.dta
mar2018.dta
apr2018.dta
may2018.dta
jun2018.dta
jul2018.dta
aug2018.dta
sep2018.dta
oct2018.dta
nov2018.dta
dec2018.dta
jan2019.dta
feb2019.dta
mar2019.dta
apr2019.dta
may2019.dta
jun2019.dta
jul2019.dta
aug2019.dta
sep2019.dta
oct2019.dta
nov2019.dta
dec2019.dta
4.5 Data Frames and Tempfiles
Data frames are a relatively new part of Stata (Stata 16+) that let you switch between datasets. Before, you could only work with one dataset in memory at a time. You can read an introduction to Stata data frames from Asjad Naqvi.
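A minimal sketch of frames, assuming Stata 16 or later and the built-in auto and census datasets; the frame name other is hypothetical.

```stata
* Requires Stata 16+; dataset choices are assumptions for illustration
sysuse auto, clear            // loaded into the default frame

* Create a second frame and load a different dataset into it
frame create other
frame other: sysuse census, clear

* Run a command in the other frame without leaving the current one
frame other: describe, short

* Switch frames and switch back
frame change other
frame change default
```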
For older versions of Stata, we have the tempfile workaround. Tempfiles are useful because they get around the older Stata limitation of one dataset in memory at a time. Later versions of Stata introduced the data frame functionality, but the tempfile method still works well. We will append two years (2023 and 2024) of monthly CPS files into a single dataset. Note: Tempfiles are really just local macros that hold the path to a temporary file.
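The tempfile pattern can be sketched on a small scale as follows, assuming the built-in auto dataset; the tempfile name domestic is hypothetical.

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear

* tempfile creates a local macro that holds a temporary file path
tempfile domestic
display "`domestic'"

* Save one subset to the tempfile, then restore the full data
preserve
keep if foreign == 0
save `domestic'
restore

* Keep the other subset and append the saved one back
keep if foreign == 1
append using `domestic'
count
```

The temporary file is deleted automatically when the do-file finishes, and because its name is a local macro, the whole block must be run in one execution.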
4.6 Appending multiple CPS files
Get CPS Data
The small CPS files have only a few variables in them compared to the full files. The full Census CPS microdata files can be found at: https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html.
Set up an initial tempfile so we can add each month of CPS data to it. Save the empty dataset to the tempfile with the emptyok option.
Create a local macro to loop over each year and month and append the monthly CPS files into one CPS file.
cd "/Users/Sam/Desktop/Data/CPS"
*Set month macro
local month jan feb mar apr may jun jul aug sep oct nov dec
local filecount = 0
*Start from an empty dataset so the tempfile begins with 0 observations
clear
tempfile cps
save `cps', emptyok
foreach y of numlist 23/24 {
foreach m of local month {
*Show the year and month
display "`m'`y'"
local filename "small`m'`y'pub.dta"
display "`filename'"
*Open the monthly data file
use "`filename'", clear
*Append monthly data file to cps tempfile
quietly append using `cps'
save `cps', replace
clear
*Count the number of monthly files appended
local filecount = `filecount' + 1
}
}
*Get the appended CPS File
use `cps'
tab hrmonth hryear4
save "cps23_24.dta", replace
/Users/Sam/Desktop/Data/CPS
(note: dataset contains 0 observations)
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jan23
smalljan23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
feb23
smallfeb23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
mar23
smallmar23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
apr23
smallapr23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
may23
smallmay23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jun23
smalljun23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jul23
smalljul23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
aug23
smallaug23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
sep23
smallsep23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
oct23
smalloct23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
nov23
smallnov23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
dec23
smalldec23pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jan24
smalljan24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
feb24
smallfeb24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
mar24
smallmar24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
apr24
smallapr24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
may24
smallmay24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jun24
smalljun24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
jul24
smalljul24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
aug24
smallaug24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
sep24
smallsep24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
oct24
smalloct24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
nov24
smallnov24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
dec24
smalldec24pub.dta
file /var/folders/tz/fwgb6p8923dg6yvb_kt_pw3c0000gn/T//S_71868.000001 saved
           |       HRYEAR4
   HRMONTH |      2023       2024 |     Total
-----------+----------------------+----------
         1 |   125,982    126,802 |   252,784
         2 |   124,754    126,784 |   251,538
         3 |   124,477    124,581 |   249,058
         4 |   126,798    126,850 |   253,648
         5 |   127,559    126,945 |   254,504
         6 |   127,491    126,112 |   253,603
         7 |   127,424    126,158 |   253,582
         8 |   127,930    127,031 |   254,961
         9 |   126,665    126,590 |   253,255
        10 |   127,941    126,387 |   254,328
        11 |   126,917    126,686 |   253,603
        12 |   126,832    126,743 |   253,575
-----------+----------------------+----------
     Total | 1,520,770  1,517,669 | 3,038,439
file cps23_24.dta saved
The variable definitions can be found in the CPS Basic Data Dictionary.
Check that all years are there: the tab hrmonth hryear4 table above should show every month of 2023 and 2024.
4.7 By Sort and EGEN
Please note that we will get into bysort and egen later in the semester. However, bysort: egen is a powerful combination that is worth bringing up now. You can summarize, replace, or create new variables by multiple groups with the bysort and egen commands.
Let's generate laborforce from pemlr, where:
- 1 is employed at work;
- 2 is employed absent;
- 3 is unemployed layoff;
- 4 is unemployed looking;
- 5 is NILF retired;
- 6 is NILF disabled; and
- 7 is NILF other
gen laborforce = .
replace laborforce = 0 if pemlr >= 5 & pemlr <= 7
replace laborforce = 1 if pemlr >= 1 & pemlr <= 4
label define laborforce1 0 "NILF" 1 "Labor Force"
label values laborforce laborforce1
tab laborforce
gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed
Generate a race/ethnicity category from existing variables.
gen race_ethnicity = .
replace race_ethnicity = 1 if ptdtrace == 1 & pehspnon == 2
replace race_ethnicity = 2 if ptdtrace == 2 & pehspnon == 2
replace race_ethnicity = 3 if pehspnon == 1
replace race_ethnicity = 4 if ptdtrace == 3 & pehspnon == 2
replace race_ethnicity = 5 if (ptdtrace == 4 | ptdtrace == 5) & pehspnon == 2
replace race_ethnicity = 6 if (ptdtrace >= 6 & ptdtrace <= 26) & pehspnon == 2
label define race_ethnicity1 1 "White NH" 2 "Black NH" 3 "Hispanic or Latino/a" ///
4 "Native American NH" 5 "Asian or Pacific Islander NH" 6 "Multiracial NH"
label values race_ethnicity race_ethnicity1
tab race_ethnicity
Sort by sex and race.
sort pesex race_ethnicity
*Summarize laborforce by sex and race
by pesex race_ethnicity: sum laborforce
bysort pesex race_ethnicity: sum laborforce
Generate an age bin to find individuals aged 16 and older.
gen over_16 = .
replace over_16 = 0 if prtage < 16
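*Stata treats missing values as larger than any number, so exclude them explicitly
replace over_16 = 1 if prtage >= 16 & !missing(prtage)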
label define over_16a 0 "Under 16" 1 "16 and older"
label values over_16 over_16a
tab over_16
Generate the unweighted labor force participation rate for each group with bysort and egen. bysort works as well, but I usually use sort on one line and by on the next.
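A minimal sketch of this step is given here. To keep it self-contained it uses a tiny made-up dataset with the same variable names as above (pesex, race_ethnicity, laborforce, over_16); the values and the new variable names lfp_rate and lfp_rate2 are hypothetical, and with the real CPS file you would skip the input block.

```stata
* Toy data with the same variable names as the CPS example (values are made up)
clear
input pesex race_ethnicity laborforce over_16
1 1 1 1
1 1 0 1
1 2 1 1
2 1 1 1
2 1 1 1
2 2 0 1
2 2 . 0
end

* Two-line version: sort first, then use by
sort pesex race_ethnicity
by pesex race_ethnicity: egen lfp_rate = mean(laborforce) if over_16 == 1

* One-line version: bysort sorts and groups in a single step
bysort pesex race_ethnicity: egen lfp_rate2 = mean(laborforce) if over_16 == 1

list, sepby(pesex race_ethnicity)
```

Because laborforce is a 0/1 indicator, its group mean is the unweighted share of each sex and race group that is in the labor force.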
4.8 Retrieving estimates stored in Stata’s memory
Instead of manually entering summary or regression results, we can use the scalars stored in Stata's memory to retrieve the information. This reduces the human error that comes with copying and pasting.
For summary statistics we can use the return list command.
cd "/Users/Sam/Desktop/Data/CPS"
use smalljan24pub.dta, clear
sum pternwa if prerelg==1
return list
gen demean_earnings=pternwa-`r(mean)' if prerelg==1
summarize demean_earnings
/Users/Sam/Desktop/Data/CPS
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
pternwa | 10,666 137274.8 146839.5 0 1125627
scalars:
r(N) = 10666
r(sum_w) = 10666
r(mean) = 137274.7503281455
r(Var) = 21561833661.67427
r(sd) = 146839.4826389492
r(min) = 0
r(max) = 1125627
r(sum) = 1464172487
(116,136 missing values generated)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
demean_ear~s | 10,666 -3.79e-12 146839.5 -137274.8 988352.2
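One thing to keep in mind is that r() results are overwritten by the next r-class command, so it can help to copy them into a local macro or scalar right away. A minimal sketch, assuming the built-in auto dataset and hypothetical names mean_price, sd_price, and z_price:

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear

quietly summarize price

* Copy the stored results before another command replaces them
local mean_price = r(mean)
scalar sd_price = r(sd)

* Standardize price using the saved values
gen z_price = (price - `mean_price') / sd_price
summarize z_price
```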
We can also grab information about a regression with the ereturn list command.
cd "/Users/Sam/Desktop/Data/CPS"
use smalljan24pub.dta, clear
gen lnearn=ln(pternwa) if prerelg==1
gen exp=prtage-16
replace exp=0 if exp==-1
reg lnearn c.exp##c.exp i.peeduca i.peernlab if prerelg==1
ereturn list
matrix list e(b)
*Coefficients and standard errors
display _b[1.peernlab] _se[1.peernlab]
/Users/Sam/Desktop/Data/CPS
(116,150 missing values generated)
(27,667 missing values generated)
(1,209 real changes made)
Source | SS df MS Number of obs = 10,652
-------------+---------------------------------- F(18, 10633) = 238.75
Model | 2310.98949 18 128.388305 Prob > F = 0.0000
Residual | 5717.80316 10,633 .537741292 R-squared = 0.2878
-------------+---------------------------------- Adj R-squared = 0.2866
Total | 8028.79265 10,651 .753806465 Root MSE = .73331
------------------------------------------------------------------------------
lnearn | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
exp | .0654986 .0019087 34.31 0.000 .0617571 .0692401
|
c.exp#c.exp | -.0010166 .0000324 -31.38 0.000 -.0010801 -.0009531
|
peeduca |
32 | .2151811 .2348611 0.92 0.360 -.2451906 .6755528
33 | .1297205 .2243499 0.58 0.563 -.3100474 .5694884
34 | .1947927 .2240222 0.87 0.385 -.2443328 .6339181
35 | -.0479483 .2153955 -0.22 0.824 -.4701637 .3742672
36 | -.1032521 .2123809 -0.49 0.627 -.5195583 .3130541
37 | .0602974 .2103972 0.29 0.774 -.3521206 .4727154
38 | .1407128 .2128599 0.66 0.509 -.2765325 .5579581
39 | .4391957 .203923 2.15 0.031 .0394684 .838923
40 | .4534936 .2043117 2.22 0.026 .0530045 .8539827
41 | .5707234 .2059659 2.77 0.006 .1669917 .974455
42 | .5189681 .2053085 2.53 0.011 .1165251 .9214112
43 | .914999 .2038914 4.49 0.000 .5153337 1.314664
44 | 1.025085 .2044413 5.01 0.000 .6243415 1.425828
45 | 1.463769 .2108507 6.94 0.000 1.050462 1.877076
46 | 1.3309 .2076626 6.41 0.000 .9238427 1.737958
|
2.peernlab | -.0496265 .024125 -2.06 0.040 -.0969159 -.0023371
_cons | 10.07725 .2065816 48.78 0.000 9.672309 10.48219
------------------------------------------------------------------------------
scalars:
e(N) = 10652
e(df_m) = 18
e(df_r) = 10633
e(F) = 238.7547824846974
e(r2) = .2878377352595289
e(rmse) = .7333084563678371
e(mss) = 2310.989494457061
e(rss) = 5717.803159756108
e(r2_a) = .2866321563292809
e(ll) = -11800.89312136709
e(ll_0) = -13608.80112438194
e(rank) = 19
macros:
e(cmdline) : "regress lnearn c.exp##c.exp i.peeduca i.peernlab if prerelg==1"
e(title) : "Linear regression"
e(marginsok) : "XB default"
e(vce) : "ols"
e(depvar) : "lnearn"
e(cmd) : "regress"
e(properties) : "b V"
e(predict) : "regres_p"
e(estat_cmd) : "regress_estat"
matrices:
e(b) : 1 x 21
e(V) : 21 x 21
functions:
e(sample)
e(b)[1,21]
c.exp# 31b. 32. 33. 34. 35. 36. 37.
exp c.exp peeduca peeduca peeduca peeduca peeduca peeduca peeduca
y1 .06549863 -.00101659 0 .21518108 .12972052 .19479266 -.04794826 -.1032521 .06029741
38. 39. 40. 41. 42. 43. 44. 45. 46.
peeduca peeduca peeduca peeduca peeduca peeduca peeduca peeduca peeduca
y1 .1407128 .43919568 .4534936 .57072336 .51896812 .91499896 1.0250848 1.463769 1.3309002
1b. 2.
peernlab peernlab _cons
y1 0 -.04962649 10.077247
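00
Note: display prints 00 here because level 1 of peernlab is the omitted base category (shown as 1b.peernlab in e(b)), so both its coefficient and its standard error are 0, and the two values print with no space between them. To retrieve the estimated coefficient, reference the non-base level instead, e.g. _b[2.peernlab] and _se[2.peernlab].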
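A small sketch of retrieving a factor-variable coefficient with readable output, assuming the built-in auto dataset (where 0 is the base level of foreign):

```stata
* Sketch with the built-in auto dataset (an assumption for illustration)
sysuse auto, clear
regress price mpg i.foreign

* Separate the two values with a string so they do not run together
display _b[1.foreign] " " _se[1.foreign]

* Add labels and display formats for readability
display "coef = " %9.3f _b[1.foreign] "  se = " %9.3f _se[1.foreign]

* The base level is stored with a coefficient of zero
display _b[0.foreign]
```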
4.9 Survey Weights
We can use the Current Population Survey public use microdata and replicate published BLS numbers. We need to adjust our numbers with the appropriate weights in order to replicate the published numbers.
First, we need to adjust the weights. In the CSV files the weights need to be divided by 10,000, since the documentation says they have 4 implied decimals.
Composite weights (pwcmpwgt, rescaled below as cmpwgt3) are used for final BLS tabulations of the labor force.
Outgoing rotation weights (pworwgt) are used for earnings, which are collected only from persons in interviews 4 and 8.
PWSSWGT weights are used for general tabulations of sex, race, states, etc.
PWVETWGT weights are used to study the veteran population.
PWLGWGT weights are used to study someone over multiple CPS interviews.
If we are appending all months, we need to divide by 12 for the Basic CPS to get annual weights.
If we are appending all months across multiple years, we need to get a composite weight for all of the CPS files.
Use the svyset command to set up the survey design.
Use the svy: prefix before a command to utilize the survey design.
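The weighting steps can be sketched on a tiny made-up dataset. The variable names wgt_raw and wgt and all values are hypothetical; only the divide-by-10,000 step and the svyset and svy: syntax mirror the text.

```stata
* Toy data: raw weights stored with 4 implied decimals (values are made up)
clear
input employed wgt_raw
1 25000000
1 15000000
0 10000000
0 30000000
end

* Four implied decimals: divide by 10,000
gen wgt = wgt_raw / 10000

* Declare the survey design with probability weights only
svyset [pw = wgt]

* Weighted total and weighted share of employed persons
svy: total employed
svy: mean employed
```

With monthly files appended for a full year, the weight would also be divided by 12 before svyset, as described above.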
cd "/Users/Sam/Desktop/Data/CPS"
use cps23_24.dta, clear
gen cmpwgt3 = pwcmpwgt/10000
replace cmpwgt3 = cmpwgt3/12
svyset [pw=cmpwgt3]
gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed
svy: tab employed hryear4, count cellwidth(20) format(%20.2gc)
/Users/Sam/Desktop/Data/CPS
(655,995 missing values generated)
(1,946,105 real changes made)
pweight: cmpwgt3
VCE: linearized
Single unit: missing
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
(3,038,439 missing values generated)
(838,080 real changes made)
(1,138,576 real changes made)
employed | Freq. Percent Cum.
-------------+-----------------------------------
Not Employed | 838,080 42.40 42.40
Employed | 1,138,576 57.60 100.00
-------------+-----------------------------------
Total | 1,976,656 100.00
(running tabulate on estimation sample)
Number of strata = 1 Number of obs = 1,976,656
Number of PSUs = 1,976,656 Population size = 535,513,623
Design df = 1,976,655
----------------------------------------------------------------------------
| HRYEAR4
employed | 2023 2024 Total
----------+-----------------------------------------------------------------
Not Empl | 105,905,668 107,225,904 213,131,572
Employed | 161,036,522 161,345,530 322,382,051
|
Total | 266,942,190 268,571,433 535,513,623
----------------------------------------------------------------------------
Key: weighted count
Pearson:
Uncorrected chi2(1) = 12.9838
Design-based F(1, 1976655) = 9.8919 P = 0.0017