Chapter 2 Partial Linear Model

For the partial linear model, \(D\) can be binary or continuous.

\[ Y=\delta D+g_0(X)+\varepsilon \]

\[ \delta^{ATE} = \frac{E[(Y-\ell_0(X))(D-m_0(X))]}{E[(D-m_0(X))^2]} \]

where \(\ell_0(X)=E[Y|X]\) and \(m_0(X)=E[D|X]\).
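This estimand is just a residual-on-residual regression. As a minimal Python sketch on simulated data, where the true nuisance functions are known (the data-generating process here is illustrative, not from the chapter), the sample analogue of the formula recovers \(\delta\):

```python
import numpy as np

# Illustrative simulation: Y = delta*D + g0(X) + eps, D = m0(X) + v,
# with g0 and m0 known, so the sample analogue of the formula can be
# computed directly.
rng = np.random.default_rng(0)
n = 100_000
delta = 2.0                               # true treatment effect
X = rng.normal(size=n)
g0 = lambda x: np.sin(x)                  # nuisance in the outcome equation
m0 = lambda x: 0.5 * x                    # E[D|X]
D = m0(X) + rng.normal(size=n)
Y = delta * D + g0(X) + rng.normal(size=n)

l0 = lambda x: delta * m0(x) + g0(x)      # E[Y|X] implied by the model

# Sample analogue of delta = E[(Y - l0(X))(D - m0(X))] / E[(D - m0(X))^2]
u = Y - l0(X)
v = D - m0(X)
delta_hat = np.mean(u * v) / np.mean(v ** 2)
```

In practice \(\ell_0\) and \(m_0\) are unknown; estimating them with machine learners is exactly what the rest of the chapter does.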

2.1 Assumptions

Our main assumption for the partial linear model is conditional orthogonality:

\[ E[Cov(\varepsilon,D|X)]=0 \]

2.2 Explore the data

Before we establish our partial linear model, we should inspect the data. Our outcome \(Y\) is net total financial assets, and our treatment of interest \(D\) is \(401(k)\) eligibility.

Our data come from the 1991 Survey of Income and Program Participation (SIPP).

macro drop _all

use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
global Y net_tfa
global D e401
global X tw age inc fsize educ db marr twoearn pira hown

sum $Y, detail
                 Net total financial assets
-------------------------------------------------------------
      Percentiles      Smallest
 1%       -23500        -502302
 5%        -9000        -409000
10%        -4757        -336789       Obs               9,915
25%         -500        -315701       Sum of wgt.       9,915

50%         1499                      Mean           18051.53
                        Largest       Std. dev.       63522.5
75%        16549        1317947
90%        54860        1324445       Variance       4.04e+09
95%        91999        1462115       Skewness       10.63845
99%       219948        1536798       Kurtosis       186.7729

Our mean net total financial assets are \(\$18,051.53\), while the median is \(\$1,499\), with a minimum of \(-\$502,302\) and a maximum of \(\$1,536,798\).

tab $D
     401(k) |
eligibility |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      6,233       62.86       62.86
          1 |      3,682       37.14      100.00
------------+-----------------------------------
      Total |      9,915      100.00
sum $X
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          tw |      9,915    63816.85    111529.7    -502302    2029910
         age |      9,915    41.06021     10.3445         25         64
         inc |      9,915    37200.62    24774.29      -2652     242124
       fsize |      9,915     2.86586    1.538937          1         13
        educ |      9,915    13.20625    2.810382          1         18
-------------+---------------------------------------------------------
          db |      9,915    .2710035    .4445003          0          1
        marr |      9,915    .6048411    .4889094          0          1
     twoearn |      9,915    .3808371    .4856171          0          1
        pira |      9,915    .2421583    .4284112          0          1
        hown |      9,915    .6351992    .4813985          0          1

2.3 Initiate the model

First, we need to set our seed so we can recreate our results; otherwise, we get a different randomization every time we run the syntax.

set seed 42

We use the command ddml init partial to initialize the partial linear model. One of the key options here is kfolds(), which tells Stata/Python how many cross-fitting folds to use. The default is 5 folds if you do not specify kfolds(\(k\)).

Please note that we also have the option reps(\(n\)) to specify the number of cross-fitting repetitions, that is, how often the cross-fitting procedure is repeated on newly randomized fold splits.

ddml init partial, kfolds(5)
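Conceptually, cross-fitting splits the sample into \(k\) folds, fits each learner on the other \(k-1\) folds, and forms residuals on the held-out fold, so every observation's prediction is out-of-sample. A minimal Python sketch with simulated data, using OLS as the learner for both equations (the data-generating process and helper function are illustrative, not ddml's internals):

```python
import numpy as np

# Simulated partial linear data: D and Y both depend linearly on X here,
# so OLS is an adequate learner for this illustration.
rng = np.random.default_rng(42)
n, k = 5000, 5
X = rng.normal(size=(n, 3))
D = X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)
Y = 2.0 * D + X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

def ols_fit_predict(X_tr, y_tr, X_te):
    """Fit OLS (with constant) on the training folds, predict the held-out fold."""
    Z_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    Z_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
    return Z_te @ beta

folds = np.arange(n) % k            # assign each observation to one of k folds
rng.shuffle(folds)
u = np.empty(n)                     # out-of-fold residuals Y - E[Y|X]
v = np.empty(n)                     # out-of-fold residuals D - E[D|X]
for j in range(k):
    te, tr = folds == j, folds != j
    u[te] = Y[te] - ols_fit_predict(X[tr], Y[tr], X[te])
    v[te] = D[te] - ols_fit_predict(X[tr], D[tr], X[te])

# Final step: residual-on-residual regression recovers the treatment effect
delta_hat = np.mean(u * v) / np.mean(v ** 2)
```

ddml automates exactly this bookkeeping, for any combination of learners, when we later run ddml crossfit.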

2.4 Set and train your model

We use the command ddml eq: command to add a supervised machine learning learner, where eq is the conditional expectation being modeled, \(E[Y|X]\) or \(E[D|X]\).

For \(E[Y|X]\), we will run an OLS regression and a random forest to estimate the parameters in this example. We will use ddml \(E[Y|X]\): reg for OLS, and we will use ddml \(E[Y|X]\): pystacked , type(reg) method(rf) for random forest.

For our OLS regression, the syntax is straightforward. We just need to add ddml \(E[Y|X]\): before reg $Y $X.

For our machine learning methods, we have some additional syntax.

First, we need to use the command pystacked before our model to call Python for our machine learning methods.

Next, we need to use the option type(). This option clarifies if the machine learning method is used as a regression with type(reg) or as a classifier with type(class).

Finally, we can specify the machine learning method with the method option. For this example, we will be using random forest. We can use random forest as a regression and a classifier.

ddml E[Y|X]: reg $Y $X
ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)

For \(E[D|X]\), we will run a logit and a random forest to estimate the parameters in this example.

ddml E[D|X]: logit $D $X
ddml E[D|X]: pystacked $D $X, type(class) method(rf)
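pystacked wraps Python's scikit-learn, so the two random-forest calls above correspond roughly to the following sketch (an assumption about the mapping; pystacked's exact defaults and data handling may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Illustrative data: a binary treatment D and a continuous outcome Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
D = (X[:, 0] + rng.normal(size=500) > 0).astype(int)
Y = 2.0 * D + X[:, 1] + rng.normal(size=500)

# type(reg): random forest used as a regression for E[Y|X]
yhat = RandomForestRegressor(random_state=0).fit(X, Y).predict(X)

# type(class): random forest used as a classifier for E[D|X];
# predict_proba gives the fitted probability P(D=1|X)
dhat = RandomForestClassifier(random_state=0).fit(X, D).predict_proba(X)[:, 1]
```

The key distinction is the output: the regressor predicts a conditional mean, while the classifier's predicted probability serves as the fitted value for a binary \(D\).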

If you need to check what learners you have used to train your model, use the ddml desc command.

ddml desc
Model:                  partial, crossfit folds k=5, resamples r=1
Mata global (mname):    m0
Dependent variable (Y): net_tfa
 net_tfa learners:      Y1_reg Y2_pystacked
D equations (1):        e401
 e401 learners:         D1_logit D2_pystacked

We have two training models for \(E[Y|X]\) and two training models for \(E[D|X]\).

2.5 Cross-fitting

Our next step is to run the cross-fitting procedure, which fits each learner on \(k-1\) folds and predicts the held-out fold so that every observation receives an out-of-sample prediction. Our main command is ddml crossfit.

ddml crossfit
Cross-fitting E[y|X] equation: net_tfa
Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting
Cross-fitting E[D|X] equation: e401
Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting

2.6 Estimation

We use the command ddml estimate to calculate the \(ATE\).
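Under the hood, this step regresses the cross-fitted residuals of \(Y\) on the cross-fitted residuals of \(D\). A minimal Python sketch with synthetic residuals, ignoring the constant term and using a simple sandwich formula for the heteroskedasticity-robust standard error (an illustration of the idea, not ddml's exact computation):

```python
import numpy as np

# Synthetic stand-ins for the cross-fitted residuals u = Y - E[Y|X]
# and v = D - E[D|X]; in ddml these come from the crossfit step.
rng = np.random.default_rng(1)
n = 2000
v = rng.normal(size=n)
u = 2.0 * v + rng.normal(size=n)

# Residual-on-residual coefficient
delta_hat = np.sum(u * v) / np.sum(v ** 2)

# Heteroskedasticity-robust (sandwich) standard error for delta_hat
e = u - delta_hat * v
se_robust = np.sqrt(np.sum((v * e) ** 2)) / np.sum(v ** 2)
```

The robust option in ddml estimate requests exactly this kind of heteroskedasticity-robust standard error; allcombos asks ddml to report every combination of the learners we specified.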

ddml estimate, robust allcombos
Model:                  partial, crossfit folds k=5, resamples r=1
Mata global (mname):    m0
Dependent variable (Y): net_tfa
 net_tfa learners:      Y1_reg Y2_pystacked
D equations (1):        e401
 e401 learners:         D1_logit D2_pystacked

DDML estimation results:
spec  r     Y learner     D learner         b        SE 
   1  1        Y1_reg      D1_logit  5309.230 (1075.670)
   2  1        Y1_reg  D2_pystacked  6654.235  (919.688)
*  3  1  Y2_pystacked      D1_logit  7095.605 (1010.501)
   4  1  Y2_pystacked  D2_pystacked  7031.370  (806.388)
* = minimum MSE specification for that resample.

Min MSE DDML model
y-E[y|X]  = y-Y2_pystacked_1                       Number of obs   =      9915
D-E[D|X]  = D-D1_logit_1 
------------------------------------------------------------------------------
             |               Robust
     net_tfa | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        e401 |   7095.605   1010.501     7.02   0.000      5115.06    9076.151
       _cons |  -316.1911   350.5837    -0.90   0.367    -1003.322    370.9403
------------------------------------------------------------------------------

We find that \(401(k)\) eligibility increases net total financial assets by about \(\$7,096\). We can also notice that the third specification was chosen as the minimum-MSE specification: random forest for \(E[Y|X]\) and logit for \(E[D|X]\). If we want to see the results of the other three specifications, we can rerun ddml estimate with the option spec(\(n\)), where \(n\) is the specification we want. We will try this with the first specification.

ddml estimate, robust allcombos spec(1) replay
Model:                  partial, crossfit folds k=5, resamples r=1
Mata global (mname):    m0
Dependent variable (Y): net_tfa
 net_tfa learners:      Y1_reg Y2_pystacked
D equations (1):        e401
 e401 learners:         D1_logit D2_pystacked

DDML estimation results:
spec  r     Y learner     D learner         b        SE 
   1  1        Y1_reg      D1_logit  5309.230 (1075.670)
   2  1        Y1_reg  D2_pystacked  6654.235  (919.688)
*  3  1  Y2_pystacked      D1_logit  7095.605 (1010.501)
   4  1  Y2_pystacked  D2_pystacked  7031.370  (806.388)
* = minimum MSE specification for that resample.

DDML model, specification 1
y-E[y|X]  = y-Y1_reg_1                             Number of obs   =      9915
D-E[D|X]  = D-D1_logit_1 
------------------------------------------------------------------------------
             |               Robust
     net_tfa | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        e401 |    5309.23    1075.67     4.94   0.000     3200.955    7417.505
       _cons |   -12.7658   394.0253    -0.03   0.974    -785.0412    759.5096
------------------------------------------------------------------------------

Using the first specification, which was OLS for \(E[Y|X]\) and logit for \(E[D|X]\), our \(\widehat{ATE}\) is equal to \(\$5,309\).