10.3 Executing do-files and making log files

Using Do files

When you work with Stata, you should always use a do file. I find log files helpful, but do files are essential. A do file provides transparency by showing your work, and it supports replication.

Comments

One key feature of do files is comments.

*Comments are a key feature and important for documenting your work.
*Get the data.
sysuse auto, clear
*We now summarize the data.
summarize
(1978 Automobile Data)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1

Comments are essential for two reasons.

  1. They help you when you return to your own code by describing what is going on.
  2. They help with replication, since someone should be able to run your code top to bottom and get the same results that you have in your paper, report, or document.

A Note for “Point and Click”

Never use point and click. It is available in Stata, but don’t lower yourself to an SPSS standard. In case you are new to do files, we’ll start with a simple example: we’ll pull some data, summarize it, and tabulate it.

use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws.dta", clear
summarize age wage hours
tabulate married
(Working Women Survey)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,246    36.25111    5.437983         21         83
        wage |      2,246    288.2885    9595.692          0     380000
       hours |      2,242    37.21811    10.50914          1         80

    married |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        804       35.80       35.80
          1 |      1,442       64.20      100.00
------------+-----------------------------------
      Total |      2,246      100.00

We can look at a similar file called example1.do:

type "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example1.do"
use wws2, clear
summarize age wage hours
tabulate married

We can actually run a do file within a do file:

    do "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example1.do"

You can use the doedit command if you want to open the Do-file Editor, but you can also just click to open it (this is an exception to what I said above).

Since we already have the Do-file Editor open, running doedit will just open a new blank do file.

doedit
Note: you can write do files in Notepad or TextEdit, but you don’t get any of the benefits, such as syntax highlighting.

Let’s look at a do file that uses the log command

type "/Users/Sam/Desktop/Econ 645/Data/Mitchell/example2.do"
log using example2
use wws2, clear
summarize age wage hours
tabulate married
log close

Let’s run it.

*We will hold off on the log commands for now
*log using example2
use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws2.dta", clear
summarize age wage hours
tabulate married
*log close example2
(Working Women Survey w/fixes)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,246    36.22707    5.337859         21         48
        wage |      2,244    7.796781     5.82459          0   40.74659
       hours |      2,242    37.21811    10.50914          1         80

    married |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        804       35.80       35.80
          1 |      1,442       64.20      100.00
------------+-----------------------------------
      Total |      2,246      100.00

A Note on Log Commands:

It is important to note that every log using must eventually be paired with a log close command. If your do file bombs out (fails to finish) before reaching the log close command, your log will remain open!
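A common safeguard, not shown in the example above, is to begin the do file with capture log close, so that a log left open by a crashed run is closed before a new one is started. A sketch of example2.do with this guard added:

```stata
*Close any log left open by a previous failed run;
*capture suppresses the error log close would raise if no log is open
capture log close
log using example2, replace
use wws2, clear
summarize age wage hours
tabulate married
log close
```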

Even after the log is closed, the log file still exists on disk, so we’ll get an error if we try to run the do file again. So we need to add the replace option.

*log using example2, replace
use "/Users/Sam/Desktop/Econ 645/Data/Mitchell/wws2.dta", clear
summarize age wage hours
tabulate married
*log close
(Working Women Survey w/fixes)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,246    36.22707    5.337859         21         48
        wage |      2,244    7.796781     5.82459          0   40.74659
       hours |      2,242    37.21811    10.50914          1         80

    married |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        804       35.80       35.80
          1 |      1,442       64.20      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Useful Stata Commands: An Overview

A brief overview of techniques that will be helpful for completing assignments and the empirical project. Topics include: Local Macros, Loops, Tempfiles, By Sort and Egen, and Weights.

First Tip

It is always a good idea to clear memory and turn more off at the start of a do file. I have seen some arguments against this from people who use R and R Projects; however, I recommend this method for single batch runs in Stata.

clear
set more off

Coding guide for organizing your folders

https://julianreif.com/guide

Julian’s website provides very good tips about replicability and organization. The guide recommends the following:

Folder Structure

  • analysis
    • data
    • processed
    • results
      • figures
      • tables
    • scripts
      • process_raw_data.do
      • append_data.do
    • run.do
  • paper
    • manuscript.tex
    • figures
    • tables

You should be able to pick up the folder, move it to another computer, and have everything run.
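As a sketch of how the run.do master script in this structure might tie things together (the global name root is my own choice, and the path is illustrative):

```stata
*run.do - master script; set the project root once
global root "/Users/Sam/Desktop/Econ 645/analysis"
*All other scripts build paths from $root, so moving the
*project to a new computer only requires changing this one line
do "$root/scripts/process_raw_data.do"
do "$root/scripts/append_data.do"
```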

Automate Graphs, Figures, and Tables

Stata makes saving graphs easy. If there is a minor fix to your scripts, all you need to do is rerun them and your graph will be updated. No pointing and clicking like in Excel!

Stata also has the user-written commands estout and outreg2 to output tables and regression results from Stata. Avoid copying and pasting tables into documents after a command runs; even though copying and pasting is easy, using estout or outreg2 will be faster in the long run.

Estout

estout: estout documentation

ssc install estout

Outreg2

outreg2: outreg2 documentation

ssc install outreg2

Macros

Macros can contain strings or numerics, and can be used in dynamic programming.

Local Macros - local macros work within a single executed do file and are removed once the run finishes. If you are using local macros, you must run the line of code that initializes the macro together with the command that utilizes it.

You need to initialize the local macro. E.g.: local i = 1. You need to call the macro with ` and '. E.g.: display "`i'" or replace x = 0 if y == `i'

    local i = 1
    display "`i'"
    local k = `i' + 1
    display "`k'"
1


2

Global Macros - global macros stay in memory and can work across multiple do files. You need to initialize a global macro. E.g.: global j = 2. You need to call the macro with $. E.g.: display "$j" or replace z = 1 if y == $j

    global j = 2
    display "$j"
2

Now rerun without initializing the global macro (in a fresh Stata session, $j is empty, so display prints nothing):

display "$j"

You can store the levels of a categorical variable in a local macro and loop over them. E.g.:

levelsof varname, local(levels) 
foreach l of local levels {
  command if varname == `l'
 }

We’ll demo this later, but it is quite useful. You can also use local macros to test regression models, which we’ll demo later as well.
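As a concrete sketch of this pattern using the built-in auto data, with summarize standing in for the placeholder command:

```stata
sysuse auto, clear
*Store the distinct levels of rep78 in a local macro
levelsof rep78, local(levels)
*Summarize price separately for each repair-record level
foreach l of local levels {
  summarize price if rep78 == `l'
}
```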

Looping

Looping has many uses when you want to apply a function or command over a repeated set of values.

There are forvalues and foreach. I typically use foreach given its versatility, but Stata notes that forvalues can be more efficient.
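For instance, a forvalues version of a simple year loop (equivalent to looping foreach over a numlist):

```stata
*forvalues iterates directly over a numeric range
forvalues y = 2000/2019 {
  display `y'
}
```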

*E.g.: Loop over years and iterate i.
local i = 0
foreach num of numlist 2000/2019 {
  display "`num'"
  local i = `i'+1
  display "`i'"
}

*E.g.: Loop over months and years to read in new files
local month jan feb mar apr may jun jul aug sep oct nov dec
foreach y of numlist 2018/2019 {
      foreach m of local month {
        local filename = "`m'`y'.dta"
    display "`filename'"
      }
}
2000
1
2001
2
2002
3
2003
4
2004
5
2005
6
2006
7
2007
8
2008
9
2009
10
2010
11
2011
12
2012
13
2013
14
2014
15
2015
16
2016
17
2017
18
2018
19
2019
20


jan2018.dta
feb2018.dta
mar2018.dta
apr2018.dta
may2018.dta
jun2018.dta
jul2018.dta
aug2018.dta
sep2018.dta
oct2018.dta
nov2018.dta
dec2018.dta
jan2019.dta
feb2019.dta
mar2019.dta
apr2019.dta
may2019.dta
jun2019.dta
jul2019.dta
aug2019.dta
sep2019.dta
oct2019.dta
nov2019.dta
dec2019.dta

Data Frames and Tempfiles

Data frames are a relatively new feature of Stata (Stata 16+) that lets you hold several datasets in memory and switch between them. Before frames, you could work with only one dataset at a time. You can read an introduction to Stata data frames from Asjad Naqvi.
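A minimal sketch of the frames workflow (the frame name extra is arbitrary):

```stata
sysuse auto, clear
*Put a second dataset in its own frame
frame create extra
frame change extra
sysuse census, clear
*Switch back; the auto data are still loaded in the default frame
frame change default
describe, short
```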

For older versions of Stata, we have the tempfile workaround for the one-dataset-at-a-time limitation; the tempfile method still works well even in newer versions. Below, we will append two years of monthly CPS files into a single dataset. Note: tempfiles are really just macros that hold a temporary file path.

Appending multiple CPS files

Let’s get Current Population Survey (CPS) Data from the NBER MORG

*Set link into macro
local url "/Users/Sam/Desktop/Econ 672/Course Material/Data/"
*These data are online too at:
*local url "https://github.com/rowesamuel/ECON672/blob/main/Data/Introduction/"
*https://github.com/rowesamuel/ECON672/blob/main/Data/Introduction/small_morg2017.dta?raw=true

*Set up initial tempfile so we can add each month with the tempfile command.
tempfile cps
*Save the (empty) dataset to the tempfile; emptyok allows saving with zero observations
save `cps', emptyok

*Use Census CPS instead of NBER MORG

*Loop over each year and month to append monthly CPS files into 1 cps file
local month jan feb mar apr may jun jul aug sep oct nov dec
local filecount = 0

foreach y of numlist 20/21 {
  foreach m of local month {
        
     *Show the year and month
       display "`m'`y'"
       local filename "`url'small_`m'`y'pub.dta"
       display "`filename'"
        
       *Open the monthly data file
       use "`filename'", clear

       *Append monthly data file to cps tempfile
       append using `cps'

       *Save the tempfile with appended data

       save `cps', replace
       clear
       *Count the number of monthly files appended
       local filecount = `filecount' + 1
  }
}

*Retrieve the tempfile
use `cps'

save "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", replace

Double check all of our data were appended correctly

use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*Check all years are there
tab hrmonth hryear4
           |        HRYEAR4
   HRMONTH |      2020       2021 |     Total
-----------+----------------------+----------
         1 |   138,697    133,553 |   272,250 
         2 |   139,248    132,681 |   271,929 
         3 |   131,578    131,357 |   262,935 
         4 |   129,382    133,449 |   262,831 
         5 |   126,557    132,077 |   258,634 
         6 |   123,364    129,272 |   252,636 
         7 |   124,102    129,092 |   253,194 
         8 |   126,448    129,408 |   255,856 
         9 |   133,448    127,872 |   261,320 
        10 |   135,242    129,103 |   264,345 
        11 |   134,122    127,375 |   261,497 
        12 |   132,036    127,489 |   259,525 
-----------+----------------------+----------
     Total | 1,574,224  1,562,728 | 3,136,952 

The small CPS files used here contain only a few variables compared to the usual files. The full Census CPS microdata files can be found at: https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html.

CPS Basic Data Dictionary can be found at: CPS Basic Data Dictionary.

By Sort and EGEN

Please note that we will get into bysort and egen later in the semester.

You can summarize, replace, or create new variables by multiple groups with the bysort and egen commands. Let’s generate a labor force variable from pemlr, where 1 is employed, at work; 2 is employed, absent; 3 is unemployed, on layoff; 4 is unemployed, looking; 5 is NILF, retired; 6 is NILF, disabled; and 7 is NILF, other.

use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear

*Generate laborforce binary
gen laborforce = .
replace laborforce = 0 if pemlr >= 5 & pemlr <= 7
replace laborforce = 1 if pemlr >= 1 & pemlr <= 4
label define laborforce1 0 "NILF" 1 "Labor Force"
label values laborforce laborforce1
tab laborforce

*Generate employment binary
gen employed = .
replace employed = 0 if pemlr >= 3 & pemlr <= 7
replace employed = 1 if pemlr >= 1 & pemlr <= 2
label define employed1 0 "Not Employed" 1 "Employed"
label values employed employed1
tab employed

*Generate a race/ethnicity category from existing 
gen race_ethnicity = .
replace race_ethnicity = 1 if ptdtrace == 1 & pehspnon == 2
replace race_ethnicity = 2 if ptdtrace == 2 & pehspnon == 2 
replace race_ethnicity = 3 if pehspnon == 1
replace race_ethnicity = 4 if ptdtrace == 3 & pehspnon == 2
replace race_ethnicity = 5 if (ptdtrace == 4 | ptdtrace == 5) & pehspnon == 2
replace race_ethnicity = 6 if (ptdtrace >= 6 & ptdtrace <= 26) & pehspnon == 2
label define race_ethnicity1 1 "White NH" 2 "Black NH" 3 "Hispanic or Latino/a" ///
4 "Native American NH" 5 "Asian or Pacific Islander NH" 6 "Multiracial NH"
label values race_ethnicity race_ethnicity1
tab race_ethnicity 

*Sort by Sex
sort pesex
*Summarize laborforce by sex
by pesex: sum laborforce
*Sort by Sex and Race
sort pesex race_ethnicity
*Summarize laborforce by sex and race
by pesex race_ethnicity: sum laborforce
*Generate Age Bin
gen over_16 = .
replace over_16 = 0 if prtage < 16
replace over_16 = 1 if prtage >= 16
label define over_16a 0 "Under 16" 1 "16 and older"
label values over_16 over_16a
tab over_16

*Generate unweighted laborforce participation rate for each group
*with by sort and egen
*bysort works, as well, but I will start with sort on one line and by on the other
sort pesex race_ethnicity
by pesex race_ethnicity: egen mean_lfpr = mean(laborforce) if over_16 == 1
by pesex race_ethnicity: sum mean_lfpr

*Counting and indexing/subscripting within groups
gen idcount = .
by pesex race_ethnicity: replace idcount = _n
gen idcount2 = .
by pesex race_ethnicity: replace idcount2 = idcount[_N]

save "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", replace

Weights

In order to replicate the BLS numbers, we need to adjust our tabulations with the appropriate weights. First, we need to adjust the weights themselves: the documentation says the weights have four implied decimal places, so in the raw files they need to be divided by 10,000.

use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear
*Composite weights (pwcmpwgt) are used for final BLS tabulations of labor force
gen cmpwgt2 = pwcmpwgt/10000
*Outgoing rotation weights (pworwgt) are used for earnings; they apply only to persons in interviews 4 and 8
gen orwgt2 = pworwgt/10000
*pwsswgt weights are used for general tabulations by sex, race, state, etc.
gen sswgt2 = pwsswgt/10000
*pwvetwgt weights are used to study the veteran population
gen vetwgt2 = pwvetwgt/10000
*pwlgwgt weights are used to study someone over multiple CPS interviews
gen lgwgt2 = pwlgwgt/10000

*Usually we would need to divide by 12 for the Basic CPS to get annual weights
replace cmpwgt2 = cmpwgt2/12

*Get a composite weight across all 24 months of CPS files
tab hryear4
gen cmpwgt3 = pwcmpwgt/10000/24

*Use the svyset command to declare the survey design
svyset [pw=cmpwgt2]

*Use the svy: prefix to utilize the survey design
svy: tab employed hryear4, count cellwidth(20) format(%20.2gc)
(583,399 missing values generated)

(583,399 missing values generated)

(583,399 missing values generated)

(583,399 missing values generated)

(583,399 missing values generated)

(2,064,239 real changes made)

    HRYEAR4 |      Freq.     Percent        Cum.
------------+-----------------------------------
       2020 |  1,574,224       50.18       50.18
       2021 |  1,562,728       49.82      100.00
------------+-----------------------------------
      Total |  3,136,952      100.00

(583,399 missing values generated)

      pweight: cmpwgt2
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

(running tabulate on estimation sample)

Number of strata   =         1                Number of obs     =    2,096,484
Number of PSUs     = 2,096,484                Population size   =  521,774,163
                                              Design df         =    2,096,483

----------------------------------------------------------------------------
          |                             HRYEAR4                             
 employed |                 2020                  2021                 Total
----------+-----------------------------------------------------------------
 Not Empl |          112,534,150           108,864,483           221,398,633
 Employed |          147,794,858           152,580,672           300,375,530
          | 
    Total |          260,329,008           261,445,155           521,774,163
----------------------------------------------------------------------------
  Key:  weighted count

  Pearson:
    Uncorrected   chi2(1)         =  541.1783
    Design-based  F(1, 2096483)   =  406.1776     P = 0.0000

Mincer Equation Example - Replicate Tables

Important note: missing values are coded as -1 in the CPS PUMS (public-use microdata). If you do not account for missing values, your analysis will be incorrect.
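One way to handle the -1 codes up front (an alternative to the if conditions used below; shown here only for pternwa as an illustration) is mvdecode, which recodes a numeric value to Stata's missing value:

```stata
*Recode -1 to Stata missing (.) before analysis
mvdecode pternwa, mv(-1)
```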

use "/Users/Sam/Desktop/Econ 645/Data/CPS/week1cps.dta", clear

*First, let us generate some mutually exclusive categories of interest
*Recategorize Female
gen female = .
replace female = 0 if pesex == 1
replace female = 1 if pesex == 2
label define female1 0 "Male" 1 "Female"
label values female female1

*Generate Union
gen union = .
replace union = 0 if peernlab == 2
replace union = 1 if peernlab == 1
label define union1 0 "Nonunion" 1 "Union"
label values union union1

*Our outcome of interest: Earnings
*Note: the documentation says earnings have 2 implied decimals, so we need to divide by 100.
gen earnings = .
replace earnings = pternwa if pternwa >=0
*Divide by 100 for decimals
replace earnings = earnings/100
*Take the natural log
gen lnearnings = ln(earnings)

*Generate Educational Bins
tab peeduca
gen educ = .
*High School Drop Out: from Less than 1st Grade to 12th Grade No Diploma
replace educ = 1 if peeduca >= 31 & peeduca <38
*Graduated High School or GED
replace educ = 2 if peeduca == 39
*Some College
replace educ = 3 if peeduca == 40
*AA Degree: Vocational or Academic
replace educ = 4 if peeduca == 41 | peeduca == 42
*Bachelor Degree
replace educ = 5 if peeduca == 43
*Advanced Degree: Masters, Professional, or Doctorate
replace educ = 6 if peeduca >= 44 & peeduca <= 46
label define educ1 1 "High School Dropout" 2 "High School Graduate" ///
                   3 "Some College" 4 "Associates (VorA) Degree" ///
                           5 "Bachelor's Degree" 6 "Advanced Degree"
label values educ educ1

*Caveat when generating categorical variables in Stata:
*Do not use peeduca >= 44 without also requiring peeduca <= 46, since Stata
*treats missing (.) as larger than any number. With peeduca >= 44 alone,
*missing values would be categorized as Advanced Degree, a measurement error.

*Generate Potential Experience
gen exp = prtage - 16
gen exp2 = exp*exp

*You can use local macros for testing models
*Only 3 models
local rhs1 i.educ exp exp2 i.female i.union i.hryear4 
local rhs2 i.educ exp exp2 i.female i.union i.hryear4 i.peio1icd
*Add interaction between female and union
local rhs3 i.educ exp exp2 i.female##i.union i.hryear4 i.peio1icd

*We use eststo to save a model to compare it to another.
est clear
quietly eststo reg1: reg lnearnings `rhs1'
quietly eststo reg2: reg lnearnings `rhs2'
quietly eststo reg3: reg lnearnings `rhs3'

*Output the results
esttab, title (Mincer Equation) r2 se noconstant star(* .10 ** .05 *** .01) ///
b(%10.3f) drop (*peio1icd 0.*) wide label
(3,136,952 missing values generated)

(1,244,479 real changes made)

(1,309,074 real changes made)



(3,136,952 missing values generated)

(240,875 real changes made)

(27,432 real changes made)



(3,136,952 missing values generated)

(268,307 real changes made)

(267,905 real changes made)

(2,869,047 missing values generated)

    peeduca |      Freq.     Percent        Cum.
------------+-----------------------------------
         -1 |    448,365       17.56       17.56
         31 |      5,354        0.21       17.77
         32 |      9,435        0.37       18.14
         33 |     18,658        0.73       18.87
         34 |     35,457        1.39       20.26
         35 |     47,949        1.88       22.13
         36 |     57,474        2.25       24.39
         37 |     62,605        2.45       26.84
         38 |     30,453        1.19       28.03
         39 |    588,436       23.04       51.07
         40 |    345,152       13.52       64.59
         41 |     89,862        3.52       68.11
         42 |    116,914        4.58       72.69
         43 |    437,520       17.13       89.82
         44 |    191,168        7.49       97.31
         45 |     29,103        1.14       98.45
         46 |     39,648        1.55      100.00
------------+-----------------------------------
      Total |  2,553,553      100.00

(3,136,952 missing values generated)

(236,932 real changes made)

(588,436 real changes made)

(345,152 real changes made)

(206,776 real changes made)

(437,520 real changes made)

(259,919 real changes made)



(583,399 missing values generated)

Mincer Equation
-----------------------------------------------------------------------------------------------------------
                              (1)                          (2)                          (3)                
                       lnearnings                   lnearnings                   lnearnings                
-----------------------------------------------------------------------------------------------------------
High School Dropout         0.000             (.)        0.000             (.)        0.000             (.)
High School Graduate        0.382***      (0.006)        0.316***      (0.006)        0.316***      (0.006)
Some College                0.425***      (0.006)        0.347***      (0.006)        0.347***      (0.006)
Associates (VorA) ~e        0.545***      (0.007)        0.434***      (0.007)        0.434***      (0.007)
Bachelor's Degree           0.860***      (0.006)        0.725***      (0.006)        0.725***      (0.006)
Advanced Degree             1.050***      (0.007)        0.947***      (0.007)        0.946***      (0.007)
exp                         0.059***      (0.000)        0.050***      (0.000)        0.050***      (0.000)
exp2                       -0.001***      (0.000)       -0.001***      (0.000)       -0.001***      (0.000)
Female                     -0.325***      (0.003)       -0.232***      (0.003)       -0.235***      (0.003)
Union                       0.116***      (0.004)        0.146***      (0.005)        0.131***      (0.006)
HRYEAR4=2020                0.000             (.)        0.000             (.)        0.000             (.)
HRYEAR4=2021                0.039***      (0.003)        0.041***      (0.003)        0.041***      (0.003)
Female # Nonunion                                                                     0.000             (.)
Female # Union                                                                        0.032***      (0.009)
Constant                    5.514***      (0.007)        5.568***      (0.018)        5.569***      (0.018)
-----------------------------------------------------------------------------------------------------------
Observations               265160                       265160                       265160                
R-squared                   0.295                        0.358                        0.358                
-----------------------------------------------------------------------------------------------------------
Standard errors in parentheses
* p<.10, ** p<.05, *** p<.01
Use esttab for formatted results: esttab documentation.