Chapter 2 Partial Linear Model
In the partial linear model, the treatment \(D\) can be binary or continuous.
\[ Y=\delta D+g_0(X)+\varepsilon \]
\[ \delta^{ATE} = \frac{E[(Y-l_0(X))(D-m_0(X))] }{E[(D-m_0(X))^2 ]} \]
where \(l_0(X)=E[Y|X]\) and \(m_0(X)=E[D|X]\) are the conditional expectations that we will estimate with machine learning.
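The estimand above follows from partialling out \(X\). Assuming \(E[\varepsilon \mid X]=0\), subtracting \(l_0(X)=E[Y\mid X]\) from both sides of the model and multiplying by the residualized treatment gives:

```latex
\begin{align*}
l_0(X) &= E[Y \mid X] = \delta\, m_0(X) + g_0(X) \\
Y - l_0(X) &= \delta\,\bigl(D - m_0(X)\bigr) + \varepsilon \\
E\bigl[(Y - l_0(X))(D - m_0(X))\bigr] &= \delta\, E\bigl[(D - m_0(X))^2\bigr] + E\bigl[\varepsilon\,(D - m_0(X))\bigr]
\end{align*}
```

The last term equals \(E[\mathrm{Cov}(\varepsilon, D \mid X)]\), which is zero under the conditional orthogonality assumption below, leaving the ratio formula for \(\delta^{ATE}\).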
2.1 Assumptions
Our main assumption for the partial linear model is conditional orthogonality:
\[ E[\mathrm{Cov}(\varepsilon,D \mid X)]=0 \]
where \(\varepsilon\) is the error term from the model above.
2.2 Explore the data
Before we establish our partial linear model, we should inspect the data. Our outcome \(Y\) is net total financial assets, and our treatment of interest \(D\) is \(401(k)\) eligibility.
Our data come from the 1991 Survey of Income and Program Participation (SIPP).
macro drop _all
use https://github.com/aahrens1/ddml/raw/master/data/sipp1991.dta, clear
global Y net_tfa
global D e401
global X tw age inc fsize educ db marr twoearn pira hown
sum $Y, detail

                   Net total financial assets
-------------------------------------------------------------
      Percentiles      Smallest
 1%       -23500        -502302
 5%        -9000        -409000
10%        -4757        -336789       Obs               9,915
25%         -500        -315701       Sum of wgt.       9,915
50%         1499                      Mean           18051.53
                        Largest       Std. dev.        63522.5
75%        16549        1317947
90%        54860        1324445       Variance       4.04e+09
95%        91999        1462115       Skewness       10.63845
99%       219948        1536798       Kurtosis       186.7729
Our mean net financial worth is \(\$18,051.53\), while our median net worth is \(\$1,499\), with a minimum of \(-\$502,302\) and a maximum of \(\$1,536,798\). Turning to our treatment, about \(37\%\) of the sample is eligible for a \(401(k)\):
     401(k) |
eligibility |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      6,233       62.86       62.86
          1 |      3,682       37.14      100.00
------------+-----------------------------------
      Total |      9,915      100.00
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          tw |      9,915    63816.85    111529.7    -502302    2029910
         age |      9,915    41.06021     10.3445         25         64
         inc |      9,915    37200.62    24774.29      -2652     242124
       fsize |      9,915     2.86586    1.538937          1         13
        educ |      9,915    13.20625    2.810382          1         18
          db |      9,915    .2710035    .4445003          0          1
        marr |      9,915    .6048411    .4889094          0          1
     twoearn |      9,915    .3808371    .4856171          0          1
        pira |      9,915    .2421583    .4284112          0          1
        hown |      9,915    .6351992    .4813985          0          1
2.3 Initiate the model
First, we need to set our seed so we can recreate our results; otherwise, the fold randomization differs every time we run the syntax.
We initialize the partial linear model with the command ddml init partial. One of the key options here is kfolds(\(k\)), which tells Stata/Python how many cross-validation/cross-fitting folds to use. The default is 5 if you do not specify kfolds(\(k\)).
Please note that we also have the option rep(\(n\)) to specify the number of cross-fitting resampling repetitions, that is, how often the cross-fitting procedure is repeated on newly randomized folds.
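Putting the initialization step together, it might look like the following sketch (the seed value is illustrative, and kfolds(5) simply makes the default explicit):

```stata
set seed 42
ddml init partial, kfolds(5)
```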
2.4 Set and train your model
We now use the ddml command to add the learners for each conditional expectation.
For \(E[Y|X]\), we will run OLS and a random forest to estimate the parameters in this example.
We will use ddml \(E[Y|X]\): reg for OLS, and we will use ddml \(E[Y|X]\): pystacked for the random forest.
For our OLS regression, the syntax is straightforward. We just need to add ddml \(E[Y|X]\): before reg $Y $X.
For our machine learning methods, we have some additional syntax.
First, we need to use the command pystacked, which calls Python, for our machine learning methods.
Next, we need to use the option type(). This option clarifies if the machine learning method is used as a regression with type(reg) or as a classifier with type(class).
Finally, we can specify the machine learning method with the method option. For this example, we will be using random forest. We can use random forest as a regression and a classifier.
For \(E[D|X]\), we will run a logit and a random forest to estimate the parameters in this example.
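Based on the descriptions above, the four learner specifications can be sketched as follows; method(rf) selects random forest, and type() distinguishes regression from classification as discussed:

```stata
ddml E[Y|X]: reg $Y $X
ddml E[Y|X]: pystacked $Y $X, type(reg) method(rf)
ddml E[D|X]: logit $D $X
ddml E[D|X]: pystacked $D $X, type(class) method(rf)
```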
If you need to check what learners you have used to train your model, use the ddml desc command.
Model: partial, crossfit folds k=5, resamples r=1
Mata global (mname): m0
Dependent variable (Y): net_tfa
net_tfa learners: Y1_reg Y2_pystacked
D equations (1): e401
e401 learners: D1_logit D2_pystacked
We have two training models for \(E[Y|X]\) and two training models for \(E[D|X]\).
2.5 Crossfitting/Cross-Validation
Our next step is to use the crossfitting/cross-validation method to test our model. Our main command will be ddml crossfit.
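The cross-fitting step itself is a single command, which produces the output shown below:

```stata
ddml crossfit
```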
Cross-fitting E[y|X] equation: net_tfa
Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting
Cross-fitting E[D|X] equation: e401
Cross-fitting fold 1 2 3 4 5 ...completed cross-fitting
2.6 Estimation
We use the command ddml estimate to calculate the \(ATE\).
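Since the output below reports robust standard errors, the estimation call presumably includes the robust option:

```stata
ddml estimate, robust
```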
Model: partial, crossfit folds k=5, resamples r=1
Mata global (mname): m0
Dependent variable (Y): net_tfa
net_tfa learners: Y1_reg Y2_pystacked
D equations (1): e401
e401 learners: D1_logit D2_pystacked
DDML estimation results:
spec r Y learner D learner b SE
1 1 Y1_reg D1_logit 5309.230 (1075.670)
2 1 Y1_reg D2_pystacked 6654.235 (919.688)
* 3 1 Y2_pystacked D1_logit 7095.605 (1010.501)
4 1 Y2_pystacked D2_pystacked 7031.370 (806.388)
* = minimum MSE specification for that resample.
Min MSE DDML model
y-E[y|X] = y-Y2_pystacked_1 Number of obs = 9915
D-E[D|X] = D-D1_logit_1
------------------------------------------------------------------------------
| Robust
net_tfa | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
e401 | 7095.605 1010.501 7.02 0.000 5115.06 9076.151
_cons | -316.1911 350.5837 -0.90 0.367 -1003.322 370.9403
------------------------------------------------------------------------------
We find that \(401(k)\) eligibility increases net financial worth by about \(\$7,096\). We can also see that the 3rd specification was chosen, which used the random forest (pystacked) for \(E[Y|X]\) and logit for \(E[D|X]\). If we want to see the results of the other three specifications, we can rerun ddml estimate with the option spec(\(n\)), where \(n\) is the specification we want. We will try this with the first specification.
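Replaying the first specification might look like this (our reading of the ddml syntax is that replay re-displays stored results without re-estimating):

```stata
ddml estimate, spec(1) replay
```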
Model: partial, crossfit folds k=5, resamples r=1
Mata global (mname): m0
Dependent variable (Y): net_tfa
net_tfa learners: Y1_reg Y2_pystacked
D equations (1): e401
e401 learners: D1_logit D2_pystacked
DDML estimation results:
spec r Y learner D learner b SE
1 1 Y1_reg D1_logit 5309.230 (1075.670)
2 1 Y1_reg D2_pystacked 6654.235 (919.688)
* 3 1 Y2_pystacked D1_logit 7095.605 (1010.501)
4 1 Y2_pystacked D2_pystacked 7031.370 (806.388)
* = minimum MSE specification for that resample.
DDML model, specification 1
y-E[y|X] = y-Y1_reg_1 Number of obs = 9915
D-E[D|X] = D-D1_logit_1
------------------------------------------------------------------------------
| Robust
net_tfa | Coefficient std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
e401 | 5309.23 1075.67 4.94 0.000 3200.955 7417.505
_cons | -12.7658 394.0253 -0.03 0.974 -785.0412 759.5096
------------------------------------------------------------------------------
Using the first specification, which was OLS for \(E[Y|X]\) and logit for \(E[D|X]\), our \(\widehat{ATE}\) is equal to \(\$5,309\), somewhat smaller than the minimum-MSE estimate.