When you work with Stata, you should always use a do file. I find log files helpful, but do files are essential. A do file provides transparency by showing your work, and it facilitates replication.
Comments
One key feature of do files is comments.
*Comments are a key feature and important for documenting your work.
*Get the data.
sysuse auto, clear
*We now summarize the data.
summarize
(1978 Automobile Data)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
make | 0
price | 74 6165.257 2949.496 3291 15906
mpg | 74 21.2973 5.785503 12 41
rep78 | 69 3.405797 .9899323 1 5
headroom | 74 2.993243 .8459948 1.5 5
-------------+---------------------------------------------------------
trunk | 74 13.75676 4.277404 5 23
weight | 74 3019.459 777.1936 1760 4840
length | 74 187.9324 22.26634 142 233
turn | 74 39.64865 4.399354 31 51
displacement | 74 197.2973 91.83722 79 425
-------------+---------------------------------------------------------
gear_ratio | 74 3.014865 .4562871 2.19 3.89
foreign | 74 .2972973 .4601885 0 1
Comments are essential for two reasons: they document your work for anyone who needs to replicate it, and they remind your future self why each line is there.
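Stata supports several comment styles beyond the leading asterisk; a quick sketch:
* A full-line comment starts with an asterisk
sysuse auto, clear  // a trailing comment starts with two slashes
/* A block comment
   can span multiple lines */
summarize price mpg ///
 weight
*The /// continuation lets one long command span several lines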
A Note for “Point and Click”
Never use point and click. It is available in Stata, but don’t lower yourself to an SPSS standard. In case you are new to do files, we’ll start with a simple example: we’ll pull some data, summarize some data, and tabulate some data.
use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws.dta", clear
summarize age wage hours
tabulate married
(Working Women Survey)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 2,246 36.25111 5.437983 21 83
wage | 2,246 288.2885 9595.692 0 380000
hours | 2,242 37.21811 10.50914 1 80
married | Freq. Percent Cum.
------------+-----------------------------------
0 | 804 35.80 35.80
1 | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00
We can look at a similar file called example1.do:
type "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example1.do"
use wws2, clear
summarize age wage hours
tabulate married
We can actually run a do file within a do file:
do "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example1.do"
You can use the doedit command if you want to open the Do-file Editor, but you can also just click to open a do file (this is an exception to what I said above).
Since we already have the Do-file Editor open, running doedit will just open a new blank do file.
doedit
Note: you can write do files in Notepad or TextEdit, but you don’t get any of the benefits, such as syntax highlighting.
Let’s look at a do file that uses the log command:
type "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example2.do"
log using example2
use wws2, clear
summarize age wage hours
tabulate married
log close
Let’s run it.
*We will hold off on the log commands for now
*log using example2
use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws2.dta", clear
summarize age wage hours
tabulate married
*log close
(Working Women Survey w/fixes)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 2,246 36.22707 5.337859 21 48
wage | 2,244 7.796781 5.82459 0 40.74659
hours | 2,242 37.21811 10.50914 1 80
married | Freq. Percent Cum.
------------+-----------------------------------
0 | 804 35.80 35.80
1 | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00
A Note on Log Commands:
It is important to note that when you open a log, the do file must end with a “log close” command. If your do file bombs out (fails to finish) before reaching the log close command, your log will remain open!
Even after we close the log file, we’ll get an error if we try to run the do file again, because the log file already exists. So we need to add the replace option.
*log using example2, replace
use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws2.dta", clear
summarize age wage hours
tabulate married
*log close
(Working Women Survey w/fixes)
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 2,246 36.22707 5.337859 21 48
wage | 2,244 7.796781 5.82459 0 40.74659
hours | 2,242 37.21811 10.50914 1 80
married | Freq. Percent Cum.
------------+-----------------------------------
0 | 804 35.80 35.80
1 | 1,442 64.20 100.00
------------+-----------------------------------
Total | 2,246 100.00
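One defensive pattern (a sketch, not part of the example files): put capture log close at the top of your do file, so a log left open by a crashed run never blocks the next one.
*capture swallows the error if no log is currently open
capture log close
log using example2, replace
use wws2, clear
summarize age wage hours
log close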
A brief overview of techniques that will be helpful for completing assignments and the empirical project. Topics include local macros, loops, tempfiles, by-sort and egen, and weights.
First Tip
It is always a good idea to clear memory and turn more off at the start of a do file. I have seen some arguments against this from people who use R and R Projects; however, I recommend this method for single batch runs in Stata.
clear
set more off
Julian’s website provides very good tips about replicability and organization. She recommends:
Folder Structure
You should be able to pick up a folder, move it to another computer, and have everything run.
Stata makes saving graphs easy. If there is a minor fix to your scripts, all you need to do is rerun them and your graph will be updated. No pointing and clicking like in Excel!
Stata also has estout and outreg2 to output tables and regression results from Stata. I recommend that you avoid copying and pasting tables into documents after a command runs. Using a user-written command such as estout or outreg2 will be faster in the long run, even though copying and pasting is easy.
Estout
estout: estout documentation
ssc install estout
Outreg2
outreg2: outreg2 documentation
ssc install outreg2
Macros can contain strings or numerics, and can be used in dynamic programming.
Local Macros - local macros work within a single executed do file and are removed once the run finishes. If you use a local macro, you must run the line that initializes the macro together with the command that uses it.
You need to initialize the local macro. E.g.: local i = 1
You need to call the macro with a backtick and a right quote. E.g.: display "`i'" E.g.: replace x = 0 if y == `i'
local i = 1
display "`i'"
local k = `i' + 1
display "`k'"
1
2
Global Macros - stay in memory and can work across multiple do files. You need to initialize a global macro. E.g.: global j = 2 You need to call the macro with $. E.g.: display "$j" E.g.: replace z = 1 if y == $j
global j = 2
display "$j"
2
Now rerun without initializing the global macro; the global is still in memory:
display "$j"
You can loop over the levels of a categorical variable. E.g.:
levelsof varname, local(levels)
foreach l of local levels {
command if varname == `l'
}
We’ll demo this later, but it is quite useful. You can also use local macros to test regression models, but we’ll demo this later, as well.
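As a minimal sketch of the levelsof pattern, using the auto data from earlier:
sysuse auto, clear
levelsof rep78, local(levels)
*Summarize price separately for each repair record level
foreach l of local levels {
 summarize price if rep78 == `l'
}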
Looping has many uses when you want to apply a function or command over a repeated set of values.
There are forvalues and foreach - I typically use foreach given its versatility, but Stata says that forvalues can be more efficient.
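For a simple numeric range, a forvalues loop looks like this (a minimal sketch; it mirrors the year loop in the numlist example):
forvalues num = 2000/2019 {
 display "`num'"
}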
*E.g.: Loop over years and iterate i.
local i = 0
foreach num of numlist 2000/2019 {
display "`num'"
local i = `i'+1
display "`i'"
}
*E.g.: Loop over months and years to read in new files
local month jan feb mar apr may jun jul aug sep oct nov dec
foreach y of numlist 2018/2019 {
foreach m of local month {
local filename = "`m'`y'.dta"
display "`filename'"
}
}
2000
1
2001
2
2002
3
2003
4
2004
5
2005
6
2006
7
2007
8
2008
9
2009
10
2010
11
2011
12
2012
13
2013
14
2014
15
2015
16
2016
17
2017
18
2018
19
2019
20
jan2018.dta
feb2018.dta
mar2018.dta
apr2018.dta
may2018.dta
jun2018.dta
jul2018.dta
aug2018.dta
sep2018.dta
oct2018.dta
nov2018.dta
dec2018.dta
jan2019.dta
feb2019.dta
mar2019.dta
apr2019.dta
may2019.dta
jun2019.dta
jul2019.dta
aug2019.dta
sep2019.dta
oct2019.dta
nov2019.dta
dec2019.dta
For older versions of Stata, we have the tempfile workaround. Tempfiles are useful as a bypass around an older Stata limitation of one dataset in memory at a time. Later versions of Stata introduced frames, but the tempfile method still works well. We will append three years of CPS MORG data from NBER and generate short 1-year panels. Note: tempfiles are really just macros.
Let’s get Current Population Survey (CPS) Data from the NBER MORG
*Set link into macro
local url "/Users/Sam/Desktop/Econ 672/Course Material/Data/"
*These data are online too at:
*local url "https://github.com/rowesamuel/ECON672/blob/main/Data/Introduction/"
*https://github.com/rowesamuel/ECON672/blob/main/Data/Introduction/small_morg2017.dta?raw=true
*Set up initial tempfile so we can add each month with the tempfile command.
tempfile cps
*Save the macro and set it to emptyok
save `cps', emptyok
*Use Census CPS instead of NBER MORG
*Loop over each year and month to append monthly CPS files into 1 cps file
local month jan feb mar apr may jun jul aug sep oct nov dec
local filecount = 0
foreach y of numlist 20/21 {
foreach m of local month {
*Show the year and month
display "`m'`y'"
local filename "`url'small_`m'`y'pub.dta"
display "`filename'"
*Open the monthly data file
use "`filename'", clear
*Append monthly data file to cps tempfile
append using `cps'
*Save the tempfile with appended data
save `cps', replace
clear
*Count the number of monthly files appended
local filecount = `filecount' + 1
}
}
*Retrieve the tempfile
use `cps'
save "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", replace
Double check that all of our data were appended correctly:
use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*Check all years are there
tab hrmonth hryear4
| HRYEAR4
HRMONTH | 2020 2021 | Total
-----------+----------------------+----------
1 | 138,697 133,553 | 272,250
2 | 139,248 132,681 | 271,929
3 | 131,578 131,357 | 262,935
4 | 129,382 133,449 | 262,831
5 | 126,557 132,077 | 258,634
6 | 123,364 129,272 | 252,636
7 | 124,102 129,092 | 253,194
8 | 126,448 129,408 | 255,856
9 | 133,448 127,872 | 261,320
10 | 135,242 129,103 | 264,345
11 | 134,122 127,375 | 261,497
12 | 132,036 127,489 | 259,525
-----------+----------------------+----------
Total | 1,574,224 1,562,728 | 3,136,952
Small CPS files have only a few variables compared to the usual files. The full Census CPS microdata files can be found at: https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html.
The CPS Basic Data Dictionary can be found at: CPS Basic Data Dictionary. Please note that we will get into bysort and egen later in the semester.
You can summarize, replace, or create new variables by multiple groups with the bysort and egen commands. Let’s generate a labor force variable from pemlr, where 1 is employed, at work; 2 is employed, absent; 3 is unemployed, on layoff; 4 is unemployed, looking; 5 is NILF, retired; 6 is NILF, disabled; and 7 is NILF, other.
use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*Generate laborforce binary
gen laborforce = .
replace laborforce = 0 if pemlr >= 5 & pemlr <= 7
replace laborforce = 1 if pemlr >= 1 & pemlr <= 4
label define laborforce1 0 "NILF" 1 "Labor Force"
label values laborforce laborforce1
tab laborforce
*Generate employment binary
gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed
*Generate a race/ethnicity category from existing
gen race_ethnicity = .
replace race_ethnicity = 1 if ptdtrace == 1 & pehspnon == 2
replace race_ethnicity = 2 if ptdtrace == 2 & pehspnon == 2
replace race_ethnicity = 3 if pehspnon == 1
replace race_ethnicity = 4 if ptdtrace == 3 & pehspnon == 2
replace race_ethnicity = 5 if (ptdtrace == 4 | ptdtrace == 5) & pehspnon == 2
replace race_ethnicity = 6 if (ptdtrace >= 6 & ptdtrace <= 26) & pehspnon == 2
label define race_ethnicity1 1 "White NH" 2 "Black NH" 3 "Hispanic or Latino/a" ///
4 "Native American NH" 5 "Asian or Pacific Islander NH" 6 "Multiracial NH"
label values race_ethnicity race_ethnicity1
tab race_ethnicity
*Sort by Sex
sort pesex
*Summarize laborforce by sex
by pesex: sum laborforce
*Sort by Sex and Race
sort pesex race_ethnicity
*Summarize laborforce by sex and race
by pesex race_ethnicity: sum laborforce
*Generate Age Bin
gen over_16 = .
replace over_16 = 0 if prtage < 16
replace over_16 = 1 if prtage >= 16
label define over_16a 0 "Under 16" 1 "16 and older"
label values over_16 over_16a
tab over_16
*Generate unweighted laborforce participation rate for each group
*with by sort and egen
*bysort works, as well, but I will start with sort on one line and by on the other
sort pesex race_ethnicity
by pesex race_ethnicity: egen mean_lfpr = mean(laborforce) if over_16 == 1
by pesex race_ethnicity: sum mean_lfpr
*Counting and indexing/subscripting within groups
gen idcount = .
by pesex race_ethnicity: replace idcount = _n
gen idcount2 = .
by pesex race_ethnicity: replace idcount2 = idcount[_N]
save "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", replace
In order to replicate the BLS numbers, we need to adjust our numbers with the appropriate weights. First, we need to adjust the weights themselves: the documentation says the weights have four implied decimal places, so we divide them by 10,000.
use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*Composite weights (pwcmpwgt) are used for final BLS tabulations of the labor force
gen cmpwgt2 = pwcmpwgt/10000
*Outgoing rotation weights (pworwgt) are used for earnings, asked only of persons in interviews 4 and 8
gen orwgt2 = pworwgt/10000
*pwsswgt weights are used for general tabulations by sex, race, state, etc.
gen sswgt2 = pwsswgt/10000
*pwvetwgt weights are used to study the veteran population
gen vetwgt2 = pwvetwgt/10000
*pwlgwgt longitudinal weights are used to follow someone over multiple CPS interviews
gen lgwgt2 = pwlgwgt/10000
*Usually we would need to divide by 12 for the Basic CPS to get annual weights
replace cmpwgt2 = cmpwgt2/12
*Get a composite weight for all of the CPS files
tab hryear4
gen cmpwgt3 = pwcmpwgt/24
*Use the svyset command to set up the survey design
svyset [pw=cmpwgt2]
*Use the svy: prefix to utilize the survey design
svy: tab employed hryear4, count cellwidth(20) format(%20.2gc)
(583,399 missing values generated)
(583,399 missing values generated)
(583,399 missing values generated)
(583,399 missing values generated)
(583,399 missing values generated)
(2,064,239 real changes made)
HRYEAR4 | Freq. Percent Cum.
------------+-----------------------------------
2020 | 1,574,224 50.18 50.18
2021 | 1,562,728 49.82 100.00
------------+-----------------------------------
Total | 3,136,952 100.00
(583,399 missing values generated)
pweight: cmpwgt2
VCE: linearized
Single unit: missing
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
(running tabulate on estimation sample)
Number of strata = 1 Number of obs = 2,096,484
Number of PSUs = 2,096,484 Population size = 521,774,163
Design df = 2,096,483
----------------------------------------------------------------------------
| HRYEAR4
employed | 2020 2021 Total
----------+-----------------------------------------------------------------
Not Empl | 112,534,150 108,864,483 221,398,633
Employed | 147,794,858 152,580,672 300,375,530
|
Total | 260,329,008 261,445,155 521,774,163
----------------------------------------------------------------------------
Key: weighted count
Pearson:
Uncorrected chi2(1) = 541.1783
Design-based F(1, 2096483) = 406.1776 P = 0.0000
Important Note: Missing values are set to -1 in the CPS PUMS (public-use microdata set). If you do not account for missing values, your analysis will be incorrect.
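One way to handle these codes (a sketch; the do file below instead keeps only nonnegative values of pternwa) is to recode -1 to Stata’s missing value with mvdecode:
*Recode the CPS missing code -1 to Stata missing (.) for these variables
mvdecode pternwa peernlab, mv(-1)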
use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*First, let us generate some mutually exclusive categories of interest
*Recategorize Female
gen female = .
replace female = 0 if pesex == 1
replace female = 1 if pesex == 2
label define female1 0 "Male" 1 "Female"
label values female female1
*Generate Union
gen union = .
replace union = 0 if peernlab == 2
replace union = 1 if peernlab == 1
label define union1 0 "Nonunion" 1 "Union"
label values union union1
*Our outcome of interest: Earnings
*Note: Documentation says earnings have two implied decimals, so we need to divide by 100.
gen earnings = .
replace earnings = pternwa if pternwa >=0
*Divide by 100 for decimals
replace earnings = earnings/100
*Take the natural log
gen lnearnings = ln(earnings)
*Generate Educational Bins
tab peeduca
gen educ = .
*High School Drop Out: from Less than 1st Grade to 12th Grade No Diploma
replace educ = 1 if peeduca >= 31 & peeduca <= 38
*Graduated High School or GED
replace educ = 2 if peeduca == 39
*Some College
replace educ = 3 if peeduca == 40
*AA Degree: Vocational or Academic
replace educ = 4 if peeduca == 41 | peeduca == 42
*Bachelor Degree
replace educ = 5 if peeduca == 43
*Advanced Degree: Masters, Professional, or Doctorate
replace educ = 6 if peeduca >= 44 & peeduca <= 46
label define educ1 1 "High School Dropout" 2 "High School Graduate" ///
3 "Some College" 4 "Associates (VorA) Degree" ///
5 "Bachelor's Degree" 6 "Advanced Degree"
label values educ educ1
*Caveat with generating categorical variables in Stata:
*Do not do peeduca >= 44 without peeduca <= 46, since missing values are very large.
*If you do only peeduca >= 44 and peeduca is missing ("."), then missing observations
*will be categorized as Advanced Degree, which is a measurement error.
*Generate Potential Experience
gen exp = prtage - 16
gen exp2 = exp*exp
*You can use local macros for testing models
*Only 3 models
local rhs1 i.educ exp exp2 i.female i.union i.hryear4
local rhs2 i.educ exp exp2 i.female i.union i.hryear4 i.peio1icd
*Add interaction between female and union
local rhs3 i.educ exp exp2 i.female##i.union i.hryear4 i.peio1icd
*We use eststo to store each model so we can compare them.
est clear
quietly eststo reg1: reg lnearnings `rhs1'
quietly eststo reg2: reg lnearnings `rhs2'
quietly eststo reg3: reg lnearnings `rhs3'
*Output the results
esttab, title (Mincer Equation) r2 se noconstant star(* .10 ** .05 *** .01) ///
b(%10.3f) drop (*peio1icd 0.*) wide label
(3,136,952 missing values generated)
(1,244,479 real changes made)
(1,309,074 real changes made)
(3,136,952 missing values generated)
(240,875 real changes made)
(27,432 real changes made)
(3,136,952 missing values generated)
(268,307 real changes made)
(267,905 real changes made)
(2,869,047 missing values generated)
peeduca | Freq. Percent Cum.
------------+-----------------------------------
-1 | 448,365 17.56 17.56
31 | 5,354 0.21 17.77
32 | 9,435 0.37 18.14
33 | 18,658 0.73 18.87
34 | 35,457 1.39 20.26
35 | 47,949 1.88 22.13
36 | 57,474 2.25 24.39
37 | 62,605 2.45 26.84
38 | 30,453 1.19 28.03
39 | 588,436 23.04 51.07
40 | 345,152 13.52 64.59
41 | 89,862 3.52 68.11
42 | 116,914 4.58 72.69
43 | 437,520 17.13 89.82
44 | 191,168 7.49 97.31
45 | 29,103 1.14 98.45
46 | 39,648 1.55 100.00
------------+-----------------------------------
Total | 2,553,553 100.00
(3,136,952 missing values generated)
(236,932 real changes made)
(588,436 real changes made)
(345,152 real changes made)
(206,776 real changes made)
(437,520 real changes made)
(259,919 real changes made)
(583,399 missing values generated)
(583,399 missing values generated)
Mincer Equation
-----------------------------------------------------------------------------------------------------------
(1) (2) (3)
lnearnings lnearnings lnearnings
-----------------------------------------------------------------------------------------------------------
High School Dropout 0.000 (.) 0.000 (.) 0.000 (.)
High School Graduate 0.382*** (0.006) 0.316*** (0.006) 0.316*** (0.006)
Some College 0.425*** (0.006) 0.347*** (0.006) 0.347*** (0.006)
Associates (VorA) ~e 0.545*** (0.007) 0.434*** (0.007) 0.434*** (0.007)
Bachelor's Degree 0.860*** (0.006) 0.725*** (0.006) 0.725*** (0.006)
Advanced Degree 1.050*** (0.007) 0.947*** (0.007) 0.946*** (0.007)
exp 0.059*** (0.000) 0.050*** (0.000) 0.050*** (0.000)
exp2 -0.001*** (0.000) -0.001*** (0.000) -0.001*** (0.000)
Female -0.325*** (0.003) -0.232*** (0.003) -0.235*** (0.003)
Union 0.116*** (0.004) 0.146*** (0.005) 0.131*** (0.006)
HRYEAR4=2020 0.000 (.) 0.000 (.) 0.000 (.)
HRYEAR4=2021 0.039*** (0.003) 0.041*** (0.003) 0.041*** (0.003)
Female # Nonunion 0.000 (.)
Female # Union 0.032*** (0.009)
Constant 5.514*** (0.007) 5.568*** (0.018) 5.569*** (0.018)
-----------------------------------------------------------------------------------------------------------
Observations 265160 265160 265160
R-squared 0.295 0.358 0.358
-----------------------------------------------------------------------------------------------------------
Standard errors in parentheses
* p<.10, ** p<.05, *** p<.01
Use esttab for formatted results; see the esttab documentation.