Source: Wooldridge (2014), Long and Freese (2005)
Lesson: We should use a count model when the outcome of interest is a nonnegative count that a Poisson distribution can describe.
The coefficients are changes in the expected log count, so we need to transform them for interpretation. We want to look at arrest data on the number of times a man was arrested in 1986 (narr86). There are 1,970 zeros among the 2,725 men in the sample, and only eight observations are greater than 5. OLS does not account for this count structure, but a Poisson model does.
\[ narr86_{i} = \beta_{0} + \beta_{1} pcnv_{i} + \beta_{2} avgsen_{i} + \beta_{3} tottime_{i} + \beta_{4} ptime86_{i} + \beta_{5} qemp86_{i} + \beta_{6} inc86_{i} + \beta_{7} black_{i} + \beta_{8} hispan_{i} + \beta_{9} born60_{i} + u_{i} \]
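With so many zeros, a count distribution is the natural starting point. As a rough illustration (stdlib Python, not part of the Stata session; the mean of 0.4 for narr86 is an assumption for illustration, it is not reported in the output below), we can compare the observed share of zeros with what a Poisson distribution would imply:

```python
import math

# counts stated in the text above
zeros, n = 1970, 2725
print(round(zeros / n, 3))  # about 0.723 of the sample has zero arrests

# Under a Poisson with mean lam, P(y = 0) = exp(-lam).
lam = 0.4  # assumed sample mean of narr86 (illustrative, not reported above)
print(round(math.exp(-lam), 3))  # about 0.670
```

Even a Poisson somewhat understates the observed zeros here, which is one reason to check the equidispersion assumption later.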
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/crime1.dta", clear
reg narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
Source | SS df MS Number of obs = 2,725
-------------+---------------------------------- F(9, 2715) = 23.57
Model | 145.702778 9 16.1891976 Prob > F = 0.0000
Residual | 1864.64438 2,715 .686793509 R-squared = 0.0725
-------------+---------------------------------- Adj R-squared = 0.0694
Total | 2010.34716 2,724 .738012906 Root MSE = .82873
------------------------------------------------------------------------------
narr86 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.131886 .0404037 -3.26 0.001 -.2111112 -.0526609
avgsen | -.0113316 .0122413 -0.93 0.355 -.0353348 .0126717
tottime | .0120693 .0094364 1.28 0.201 -.006434 .0305725
ptime86 | -.0408735 .008813 -4.64 0.000 -.0581544 -.0235925
qemp86 | -.0513099 .0144862 -3.54 0.000 -.079715 -.0229047
inc86 | -.0014617 .000343 -4.26 0.000 -.0021343 -.0007891
black | .3270097 .0454264 7.20 0.000 .2379359 .4160835
hispan | .1938094 .0397156 4.88 0.000 .1159335 .2716853
born60 | -.022465 .0332945 -0.67 0.500 -.0877502 .0428202
_cons | .576566 .0378945 15.22 0.000 .502261 .6508711
------------------------------------------------------------------------------
Our coefficients are the change in expected log counts.
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/crime1.dta", clear
poisson narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
Iteration 0: log likelihood = -2249.0104
Iteration 1: log likelihood = -2248.7614
Iteration 2: log likelihood = -2248.7611
Iteration 3: log likelihood = -2248.7611
Poisson regression Number of obs = 2,725
LR chi2(9) = 386.32
Prob > chi2 = 0.0000
Log likelihood = -2248.7611 Pseudo R2 = 0.0791
------------------------------------------------------------------------------
narr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.4015713 .0849712 -4.73 0.000 -.5681117 -.2350308
avgsen | -.0237723 .019946 -1.19 0.233 -.0628658 .0153212
tottime | .0244904 .0147504 1.66 0.097 -.0044199 .0534006
ptime86 | -.0985584 .0206946 -4.76 0.000 -.1391192 -.0579977
qemp86 | -.0380187 .0290242 -1.31 0.190 -.0949051 .0188677
inc86 | -.0080807 .001041 -7.76 0.000 -.010121 -.0060404
black | .6608376 .0738342 8.95 0.000 .5161252 .80555
hispan | .4998133 .0739267 6.76 0.000 .3549196 .644707
born60 | -.0510286 .0640518 -0.80 0.426 -.1765678 .0745106
_cons | -.5995888 .0672501 -8.92 0.000 -.7313966 -.467781
------------------------------------------------------------------------------
Test the key assumption
The Poisson model assumes equidispersion: \[ E(y|x)=Var(y|x) \]
estat gof
Deviance goodness-of-fit = 2822.185
Prob > chi2(2715) = 0.0742
Pearson goodness-of-fit = 4118.08
Prob > chi2(2715) = 0.0000
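A quick way to read these statistics: the Pearson statistic divided by its degrees of freedom approximates the dispersion, and values well above 1 point to overdispersion (numbers copied from the estat gof output above; an arithmetic check in Python, not part of the Stata session):

```python
# Pearson goodness-of-fit statistic and its degrees of freedom,
# copied from the estat gof output above
pearson = 4118.08
df = 2715

dispersion = pearson / df
print(round(dispersion, 3))  # about 1.517; well above 1 suggests overdispersion
```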
pcnv is the proportion of prior arrests that led to a conviction, so changes are in percentage points: if \(\Delta x=0.1\), that is a 10 percentage point change; if \(\Delta x=.01\), a 1 percentage point change. For \(\Delta pcnv = .01\):
display (exp(-.402*.01)-1)*100
-.40119306
A 1 percentage point increase in the proportion of prior arrests leading to conviction decreases the expected number of arrests by about 0.4%. For a discrete change in a binary variable, consider the coefficient on black:
display (exp(0.6608)-1)*100
93.634079
This means that the expected number of arrests for a Black man is about 93.6% higher than the expected number of arrests for a White man, holding all else constant.
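Both transformations can be reproduced directly (coefficients copied from the Poisson output above; a quick arithmetic check in Python, not part of the Stata session):

```python
import math

b_pcnv = -0.4015713   # Poisson coefficient on pcnv
b_black = 0.6608376   # Poisson coefficient on black

# percent change for a 1 percentage point (0.01) increase in pcnv
print(round((math.exp(b_pcnv * 0.01) - 1) * 100, 2))  # about -0.40
# percent change for the discrete 0 -> 1 change in black
print(round((math.exp(b_black) - 1) * 100, 1))        # about 93.6
```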
We can also use the listcoef command, which reports \(e^\beta\), or \((e^\beta-1)\times 100\) when the percent option is specified.
ssc install listcoef
listcoef, help
listcoef, percent help
poisson (N=2725): Factor Change in Expected Count
Observed SD: .85907678
------------------------------------------------------------------
narr86 | b z P>|z| e^b e^bStdX SDofX
---------+--------------------------------------------------------
pcnv | -0.40157 -4.726 0.000 0.6693 0.8533 0.3952
avgsen | -0.02377 -1.192 0.233 0.9765 0.9200 3.5080
tottime | 0.02449 1.660 0.097 1.0248 1.1194 4.6070
ptime86 | -0.09856 -4.763 0.000 0.9061 0.8251 1.9501
qemp86 | -0.03802 -1.310 0.190 0.9627 0.9406 1.6104
inc86 | -0.00808 -7.762 0.000 0.9920 0.5837 66.6272
black | 0.66084 8.950 0.000 1.9364 1.2750 0.3677
hispan | 0.49981 6.761 0.000 1.6484 1.2291 0.4127
born60 | -0.05103 -0.797 0.426 0.9503 0.9758 0.4808
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
e^b = exp(b) = factor change in expected count for unit increase in X
e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
SDofX = standard deviation of X
poisson (N=2725): Percentage Change in Expected Count
Observed SD: .85907678
------------------------------------------------------------------
narr86 | b z P>|z| % %StdX SDofX
---------+--------------------------------------------------------
pcnv | -0.40157 -4.726 0.000 -33.1 -14.7 0.3952
avgsen | -0.02377 -1.192 0.233 -2.3 -8.0 3.5080
tottime | 0.02449 1.660 0.097 2.5 11.9 4.6070
ptime86 | -0.09856 -4.763 0.000 -9.4 -17.5 1.9501
qemp86 | -0.03802 -1.310 0.190 -3.7 -5.9 1.6104
inc86 | -0.00808 -7.762 0.000 -0.8 -41.6 66.6272
black | 0.66084 8.950 0.000 93.6 27.5 0.3677
hispan | 0.49981 6.761 0.000 64.8 22.9 0.4127
born60 | -0.05103 -0.797 0.426 -5.0 -2.4 0.4808
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
% = percent change in expected count for unit increase in X
%StdX = percent change in expected count for SD increase in X
SDofX = standard deviation of X
Average Marginal Effects
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/crime1.dta", clear
quietly poisson narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
margins, dydx(*)
Average marginal effects Number of obs = 2,725
Model VCE : OIM
Expression : Predicted number of events, predict()
dy/dx w.r.t. : pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1623969 .0347091 -4.68 0.000 -.2304256 -.0943682
avgsen | -.0096136 .0080714 -1.19 0.234 -.0254333 .0062061
tottime | .009904 .0059726 1.66 0.097 -.001802 .02161
ptime86 | -.0398574 .0084547 -4.71 0.000 -.0564283 -.0232865
qemp86 | -.0153749 .0117466 -1.31 0.191 -.0383979 .0076481
inc86 | -.0032679 .0004323 -7.56 0.000 -.0041152 -.0024205
black | .2672451 .0309251 8.64 0.000 .2066331 .3278571
hispan | .2021263 .03051 6.62 0.000 .1423279 .2619248
born60 | -.0206361 .0259102 -0.80 0.426 -.0714193 .030147
------------------------------------------------------------------------------
Interpretation: the marginal effect of a continuous variable depends on the expected value of y given x, so we must evaluate it somewhere, either at the means (the atmeans option) or averaged over the sample (average marginal effects, the default above).
Note: be careful with the change in percentage points for pcnv; we want a 1 or 10 percentage point change (0.01 or 0.1), not a change of 1.
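A useful check on the margins output: in a Poisson model the AME of each (non-factor) regressor equals its coefficient times the average predicted count, so dy/dx divided by the Poisson coefficient should give the same constant for every variable (values copied from the two output tables above; a sketch, not the author's code):

```python
# AME_j = b_j * mean(exp(x'b)) in a Poisson model, so AME_j / b_j
# recovers the same average predicted count for every regressor.
pairs = {
    "pcnv":   (-0.1623969, -0.4015713),
    "avgsen": (-0.0096136, -0.0237723),
    "inc86":  (-0.0032679, -0.0080807),
}
for name, (ame, b) in pairs.items():
    print(name, round(ame / b, 3))  # each about 0.404
```

With an intercept, the Poisson MLE makes the average predicted count equal the sample mean of narr86, which is what the common ratio recovers.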
Lesson: When overdispersion occurs, we can estimate a negative binomial regression model. If the key Poisson assumption mean = variance fails, the Poisson standard errors are typically too small, overstating significance.
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/crime1.dta", clear
nbreg narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
Fitting Poisson model:
Iteration 0: log likelihood = -2249.0104
Iteration 1: log likelihood = -2248.7614
Iteration 2: log likelihood = -2248.7611
Iteration 3: log likelihood = -2248.7611
Fitting constant-only model:
Iteration 0: log likelihood = -2297.3847
Iteration 1: log likelihood = -2291.4767
Iteration 2: log likelihood = -2290.6926
Iteration 3: log likelihood = -2290.6903
Iteration 4: log likelihood = -2290.6903
Fitting full model:
Iteration 0: log likelihood = -2181.2355
Iteration 1: log likelihood = -2158.2716
Iteration 2: log likelihood = -2157.6284
Iteration 3: log likelihood = -2157.628
Iteration 4: log likelihood = -2157.628
Negative binomial regression Number of obs = 2,725
LR chi2(9) = 266.12
Dispersion = mean Prob > chi2 = 0.0000
Log likelihood = -2157.628 Pseudo R2 = 0.0581
------------------------------------------------------------------------------
narr86 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.4770963 .1033295 -4.62 0.000 -.6796183 -.2745743
avgsen | -.0173385 .0261171 -0.66 0.507 -.0685272 .0338501
tottime | .0197394 .0192325 1.03 0.305 -.0179557 .0574344
ptime86 | -.1073997 .025074 -4.28 0.000 -.1565439 -.0582555
qemp86 | -.0504884 .0351857 -1.43 0.151 -.1194511 .0184743
inc86 | -.0077126 .0011465 -6.73 0.000 -.0099596 -.0054656
black | .6560406 .0923594 7.10 0.000 .4750195 .8370617
hispan | .5048465 .0895663 5.64 0.000 .3292998 .6803932
born60 | -.046412 .0776384 -0.60 0.550 -.1985804 .1057564
_cons | -.5637368 .0827121 -6.82 0.000 -.7258495 -.4016242
-------------+----------------------------------------------------------------
/lnalpha | -.0738912 .1177617 -.3046999 .1569175
-------------+----------------------------------------------------------------
alpha | .9287728 .1093739 .7373446 1.169899
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 182.27 Prob >= chibar2 = 0.000
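The chibar2 statistic can be reproduced from the two log likelihoods (values copied from the poisson and nbreg output above; a quick arithmetic check, not part of the Stata session):

```python
# log likelihoods copied from the poisson and nbreg output above
ll_poisson = -2248.7611
ll_nbreg = -2157.628

# likelihood-ratio statistic for H0: alpha = 0
lr = 2 * (ll_nbreg - ll_poisson)
print(round(lr, 2))  # about 182.27, matching the chibar2(01) statistic
```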
First, the coefficients are changes in the expected log count, so we need some transformations for interpretation. Second, the test for overdispersion appears at the bottom of the NBRM output as the LR test of alpha=0.
We can again use the listcoef command (from SSC) for alternative interpretations:
listcoef, percent help
listcoef, help
nbreg (N=2725): Percentage Change in Expected Count
Observed SD: .85907678
------------------------------------------------------------------
narr86 | b z P>|z| % %StdX SDofX
---------+--------------------------------------------------------
pcnv | -0.47710 -4.617 0.000 -37.9 -17.2 0.3952
avgsen | -0.01734 -0.664 0.507 -1.7 -5.9 3.5080
tottime | 0.01974 1.026 0.305 2.0 9.5 4.6070
ptime86 | -0.10740 -4.283 0.000 -10.2 -18.9 1.9501
qemp86 | -0.05049 -1.435 0.151 -4.9 -7.8 1.6104
inc86 | -0.00771 -6.727 0.000 -0.8 -40.2 66.6272
black | 0.65604 7.103 0.000 92.7 27.3 0.3677
hispan | 0.50485 5.637 0.000 65.7 23.2 0.4127
born60 | -0.04641 -0.598 0.550 -4.5 -2.2 0.4808
---------+--------------------------------------------------------
ln alpha | -0.07389 -0.627
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
% = percent change in expected count for unit increase in X
%StdX = percent change in expected count for SD increase in X
SDofX = standard deviation of X
nbreg (N=2725): Factor Change in Expected Count
Observed SD: .85907678
------------------------------------------------------------------
narr86 | b z P>|z| e^b e^bStdX SDofX
---------+--------------------------------------------------------
pcnv | -0.47710 -4.617 0.000 0.6206 0.8282 0.3952
avgsen | -0.01734 -0.664 0.507 0.9828 0.9410 3.5080
tottime | 0.01974 1.026 0.305 1.0199 1.0952 4.6070
ptime86 | -0.10740 -4.283 0.000 0.8982 0.8110 1.9501
qemp86 | -0.05049 -1.435 0.151 0.9508 0.9219 1.6104
inc86 | -0.00771 -6.727 0.000 0.9923 0.5982 66.6272
black | 0.65604 7.103 0.000 1.9271 1.2728 0.3677
hispan | 0.50485 5.637 0.000 1.6567 1.2316 0.4127
born60 | -0.04641 -0.598 0.550 0.9546 0.9779 0.4808
---------+--------------------------------------------------------
ln alpha | -0.07389 -0.627
------------------------------------------------------------------
b = raw coefficient
z = z-score for test of b=0
P>|z| = p-value for z-test
e^b = exp(b) = factor change in expected count for unit increase in X
e^bStdX = exp(b*SD of X) = change in expected count for SD increase in X
SDofX = standard deviation of X
Proportion of prior arrests that led to a conviction (pcnv)
display exp(-.4771*.1)-1
-.04658976
Interpretation: a 10 percentage point increase in the proportion of prior arrests that led to conviction decreases the expected number of arrests by about 4.7%, holding all else constant.
Total time served in prison from prior convictions, in months (tottime)
display (exp(0.01974)-1)
.01993612
Interpretation: one more month of time served in prison before 1986 increases the expected number of arrests by about 2.0%, holding all else constant (though the coefficient is not statistically significant).
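Both transformations can be reproduced (coefficients copied from the nbreg output above; a quick arithmetic check in Python, not part of the Stata session):

```python
import math

b_pcnv = -0.4770963    # nbreg coefficient on pcnv
b_tottime = 0.0197394  # nbreg coefficient on tottime

# 10 percentage point (0.1) increase in the conviction proportion
print(round((math.exp(b_pcnv * 0.1) - 1) * 100, 1))  # about -4.7
# one additional month of time served
print(round((math.exp(b_tottime) - 1) * 100, 1))     # about 2.0
```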
Compare Poisson and Negative Binomial Regression Models
est clear
eststo PRM: quietly poisson narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
eststo NBRM: quietly nbreg narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
esttab, mtitle
(1) (2)
PRM NBRM
--------------------------------------------
narr86
pcnv -0.402*** -0.477***
(-4.73) (-4.62)
avgsen -0.0238 -0.0173
(-1.19) (-0.66)
tottime 0.0245 0.0197
(1.66) (1.03)
ptime86 -0.0986*** -0.107***
(-4.76) (-4.28)
qemp86 -0.0380 -0.0505
(-1.31) (-1.43)
inc86 -0.00808*** -0.00771***
(-7.76) (-6.73)
black 0.661*** 0.656***
(8.95) (7.10)
hispan 0.500*** 0.505***
(6.76) (5.64)
born60 -0.0510 -0.0464
(-0.80) (-0.60)
_cons -0.600*** -0.564***
(-8.92) (-6.82)
--------------------------------------------
lnalpha
_cons -0.0739
(-0.63)
--------------------------------------------
N 2725 2725
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
The interpretation is similar to the Poisson model. The marginal effect of a continuous variable depends on the expected value of y given x, so we evaluate it either at the means (the atmeans option) or as an average marginal effect.
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/crime1.dta", clear
quietly nbreg narr86 pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
margins, dydx(*)
Average marginal effects Number of obs = 2,725
Model VCE : OIM
Expression : Predicted number of events, predict()
dy/dx w.r.t. : pcnv avgsen tottime ptime86 qemp86 inc86 black hispan born60
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pcnv | -.1933482 .0429858 -4.50 0.000 -.2775988 -.1090976
avgsen | -.0070266 .010587 -0.66 0.507 -.0277767 .0137235
tottime | .0079996 .0078006 1.03 0.305 -.0072894 .0232886
ptime86 | -.0435248 .0103794 -4.19 0.000 -.0638682 -.0231815
qemp86 | -.0204609 .0143136 -1.43 0.153 -.0485151 .0075932
inc86 | -.0031256 .0004821 -6.48 0.000 -.0040705 -.0021807
black | .2658672 .0395274 6.73 0.000 .1883949 .3433395
hispan | .2045942 .0375363 5.45 0.000 .1310244 .2781641
born60 | -.0188089 .0314753 -0.60 0.550 -.0804994 .0428815
------------------------------------------------------------------------------
Lesson: We can use a Tobit estimator for a right-censored outcome.
We first summarize the dependent variable to find the right-censoring value. Weekly earnings in the Current Population Survey are top-coded, or right-censored, at $2,884.61. Ignoring this may bias our estimates, so we compare an OLS model with a Tobit model for right-censored data. Remember that a Tobit estimator for a censored outcome is different from a Tobit for a corner solution.
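As a sketch of what the Tobit estimator maximizes (stdlib Python for illustration, not the Stata implementation): each uncensored observation contributes a normal density term, and each top-coded observation contributes the probability of being at or above the limit.

```python
import math

def tobit_ll_point(y, xb, sigma, ul):
    """Per-observation log likelihood for a right-censored (Tobit) model
    with upper limit ul. Uncensored values contribute the normal density;
    values at the top-code contribute P(y* >= ul)."""
    if y < ul:
        z = (y - xb) / sigma
        return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z
    zc = (ul - xb) / sigma
    # log of the normal survival function, 1 - Phi(zc)
    return math.log(0.5 * (1 - math.erf(zc / math.sqrt(2))))

# a top-coded observation whose index xb sits exactly at the limit
print(round(tobit_ll_point(2884.61, 2884.61, 1.0, 2884.61), 4))  # log(0.5), about -0.6931
```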
use "/Users/Sam/Desktop/Econ 645/Data/CPS/jan2024.dta", clear
sum earnings if prerelg==1, detail
histogram earnings if prerelg==1, normal title(Histogram of Weekly Earnings) caption("Source: Current Population Survey")
graph export "/Users/Sam/Desktop/Econ 645/R Markdown/week9_histogram_earnings.png", replace
Weekly Earnings: pternwa
-------------------------------------------------------------
Percentiles Smallest
1% 70 0
5% 225 0
10% 360 0 Obs 10,666
25% 656 0 Sum of Wgt. 10,666
50% 1000 Mean 1230.474
Largest Std. Dev. 788.1457
75% 1680 2884.61
90% 2692.3 2884.61 Variance 621173.7
95% 2884.61 2884.61 Skewness .7779644
99% 2884.61 2884.61 Kurtosis 2.648218
(bin=40, start=0, width=72.11525)
(file /Users/Sam/Desktop/Econ 645/R Markdown/week9_histogram_earnings.png written in PNG format)
Let us take a look at the natural log of earnings
use "/Users/Sam/Desktop/Econ 645/Data/CPS/jan2024.dta", clear
sum lnearnings, detail
histogram lnearnings if prerelg==1, normal title(Histogram of LN Weekly Earnings) caption("Source: Current Population Survey")
graph export "/Users/Sam/Desktop/Econ 645/R Markdown/week9_histogram_lnearnings.png", replace
Natural Log of Weekly Earnings
-------------------------------------------------------------
Percentiles Smallest
1% 4.356709 -3.506558
5% 5.420535 -3.506558
10% 5.886104 -3.506558 Obs 10,652
25% 6.49224 .48858 Sum of Wgt. 10,652
50% 6.907755 Mean 6.86502
Largest Std. Dev. .8181897
75% 7.426549 7.967145
90% 7.898151 7.967145 Variance .6694343
95% 7.967145 7.967145 Skewness -1.791008
99% 7.967145 7.967145 Kurtosis 13.92768
(bin=40, start=-3.5065579, width=.28684257)
(file /Users/Sam/Desktop/Econ 645/R Markdown/week9_histogram_lnearnings.png written in PNG format)
We’ll estimate the following Mincer equation: \[ ln(wwages_{i}) = \beta_{0} + \beta_{1} edu_{i} + \beta_{2} exp_{i} + \beta_{3} exp_{i}^{2} + \beta_{4} marital_{i} + \beta_{5} veteran_{i} + \beta_{6} union_{i} + \beta_{7} female_{i} + \beta_{8} race_{i} + u_{i} \] We’ll need to use the ul(right-censored-value) option with our Tobit estimator.
est clear
local maxval = 2884.61  // the CPS top-code; ul() needs the censoring value
eststo OLS: quietly reg lnearnings i.edu exp expsq i.marital i.veteran i.union i.female i.race
eststo Tobit: quietly tobit lnearnings i.edu exp expsq i.marital i.veteran i.union i.female i.race, ul(`maxval')
esttab OLS Tobit, drop(0.* 1.race* 1.mar* 1.edu) mtitle("OLS" "Tobit")
--------------------------------------------
(1) (2)
OLS Tobit
--------------------------------------------
main
2.edu 0.330*** 0.330***
(11.77) (11.04)
3.edu 0.442*** 0.445***
(13.48) (12.67)
4.edu 0.788*** 0.832***
(26.50) (26.11)
5.edu 0.951*** 1.046***
(30.01) (30.64)
exp 0.0558*** 0.0575***
(28.73) (27.60)
expsq -0.000887*** -0.000911***
(-28.27) (-27.03)
2.marital -0.0389 -0.0462*
(-1.92) (-2.12)
3.marital -0.118*** -0.130***
(-6.66) (-6.84)
1.veteran 0.0412 0.0428
(1.34) (1.28)
1.union 0.0670** 0.0448
(3.05) (1.90)
1.female -0.304*** -0.335***
(-22.81) (-23.25)
2.race_eth~y 0.0325 0.0551
(0.41) (0.65)
3.race_eth~y -0.0466 -0.0576
(-0.60) (-0.69)
4.race_eth~y 0.135 0.131
(1.05) (0.96)
5.race_eth~y 0.0232 0.0167
(0.30) (0.20)
6.race_eth~y 0.0730 0.0903
(0.82) (0.95)
7.race_eth~y 0.0549 0.0553
(0.73) (0.69)
_cons 5.813*** 5.817***
(69.47) (64.82)
--------------------------------------------
sigma
_cons 0.714***
(137.00)
--------------------------------------------
N 10568 10568
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
Interpretation: very similar to OLS, but the Tobit estimator requires normality and homoskedasticity of the error for consistency. With a log-linear model there is no additional scaling factor, so the returns to education only require the \((e^\beta -1)\times 100\) transformation.
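Applying the \((e^\beta-1)\times 100\) transformation to a few education coefficients from the esttab table above (values copied from the table; a quick arithmetic check in Python, not part of the Stata session):

```python
import math

# education coefficients copied from the esttab table above
coefs = {"2.edu (OLS)": 0.330, "5.edu (OLS)": 0.951, "5.edu (Tobit)": 1.046}
for label, b in coefs.items():
    # percent earnings premium relative to the omitted education category
    print(label, round((math.exp(b) - 1) * 100, 1))
```

The Tobit estimate for the top education category implies roughly a 185% premium relative to the base category, somewhat larger than the OLS estimate, consistent with top-coding compressing high earnings.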
Lesson: We can look at censored data for duration analysis, similar to Wooldridge’s example.
This differs from top-coded data, for which we just used a Tobit analysis.
We look at the duration, in months, between release from a North Carolina prison and the next arrest. We want to evaluate whether a work program is effective in lengthening the time before recidivism occurs.
Note: 893 inmates were not arrested during the period they were followed; these observations are censored. The censoring times differ among inmates, ranging from 70 to 81 months.
Our dependent variable, duration in months, is transformed by the natural logarithm (ldurat). Many observations are censored between 70 and 81 months.
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/recid.dta", clear
tab durat
Most (but not all) observations with durations of 70 months or more are censored.
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/recid.dta", clear
tab durat cens
| cens
durat | 0 1 | Total
-----------+----------------------+----------
1 | 8 0 | 8
2 | 15 0 | 15
3 | 14 0 | 14
4 | 13 0 | 13
5 | 16 0 | 16
6 | 18 0 | 18
7 | 18 0 | 18
8 | 16 0 | 16
9 | 18 0 | 18
10 | 22 0 | 22
11 | 11 0 | 11
12 | 14 0 | 14
13 | 15 0 | 15
14 | 16 0 | 16
15 | 23 0 | 23
16 | 11 0 | 11
17 | 9 0 | 9
18 | 16 0 | 16
19 | 9 0 | 9
20 | 8 0 | 8
21 | 13 0 | 13
22 | 7 0 | 7
23 | 16 0 | 16
24 | 12 0 | 12
25 | 13 0 | 13
26 | 8 0 | 8
27 | 11 0 | 11
28 | 9 0 | 9
29 | 8 0 | 8
30 | 7 0 | 7
31 | 6 0 | 6
32 | 6 0 | 6
33 | 6 0 | 6
34 | 4 0 | 4
35 | 5 0 | 5
36 | 6 0 | 6
37 | 6 0 | 6
38 | 4 0 | 4
39 | 4 0 | 4
40 | 2 0 | 2
41 | 7 0 | 7
42 | 5 0 | 5
43 | 5 0 | 5
44 | 4 0 | 4
45 | 4 0 | 4
46 | 7 0 | 7
47 | 4 0 | 4
48 | 1 0 | 1
49 | 4 0 | 4
50 | 5 0 | 5
51 | 2 0 | 2
52 | 2 0 | 2
53 | 8 0 | 8
54 | 3 0 | 3
55 | 5 0 | 5
56 | 2 0 | 2
57 | 4 0 | 4
58 | 1 0 | 1
59 | 5 0 | 5
60 | 3 0 | 3
62 | 3 0 | 3
63 | 2 0 | 2
64 | 1 0 | 1
65 | 2 0 | 2
66 | 3 0 | 3
67 | 3 0 | 3
68 | 4 0 | 4
69 | 2 0 | 2
70 | 2 103 | 105
71 | 2 88 | 90
72 | 1 84 | 85
73 | 1 107 | 108
74 | 1 71 | 72
75 | 0 44 | 44
76 | 0 105 | 105
77 | 1 60 | 61
78 | 0 54 | 54
79 | 0 60 | 60
80 | 0 69 | 69
81 | 0 48 | 48
-----------+----------------------+----------
Total | 552 893 | 1,445
We’ll use the stset command to declare the survival-time setup, with failure defined as cens==0 (an observed arrest).
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/recid.dta", clear
stset ldurat, failure(cens==0)
failure event: cens == 0
obs. time interval: (0, ldurat]
exit on or before: failure
------------------------------------------------------------------------------
1445 total observations
8 observations end on or before enter()
------------------------------------------------------------------------------
1437 observations remaining, representing
544 failures in single-record/single-failure data
5411.742 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 4.394449
We’ll use the streg command and set the distribution
streg workprg priors tserved felon alcohol drugs black married educ age, dist(weibull) nohr
failure _d: cens == 0
analysis time _t: ldurat
Fitting constant-only model:
Iteration 0: log likelihood = -1254.5107
Iteration 1: log likelihood = -1100.1536
Iteration 2: log likelihood = -1079.1128
Iteration 3: log likelihood = -1078.7957
Iteration 4: log likelihood = -1078.7957
Fitting full model:
Iteration 0: log likelihood = -1078.7957
Iteration 1: log likelihood = -1034.1821
Iteration 2: log likelihood = -1001.9186
Iteration 3: log likelihood = -1000.5996
Iteration 4: log likelihood = -1000.5919
Iteration 5: log likelihood = -1000.5919
Weibull regression -- log relative-hazard form
No. of subjects = 1,437 Number of obs = 1,437
No. of failures = 544
Time at risk = 5411.742317
LR chi2(10) = 156.41
Log likelihood = -1000.5919 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
workprg | .0780722 .091463 0.85 0.393 -.1011921 .2573364
priors | .0834203 .0139155 5.99 0.000 .0561465 .1106941
tserved | .0134545 .0016851 7.98 0.000 .0101519 .0167572
felon | -.287139 .106697 -2.69 0.007 -.4962614 -.0780167
alcohol | .4399456 .10658 4.13 0.000 .2310527 .6488385
drugs | .2920932 .0983695 2.97 0.003 .0992926 .4848938
black | .4515388 .0889883 5.07 0.000 .2771249 .6259527
married | -.146192 .1098131 -1.33 0.183 -.3614216 .0690377
educ | -.023948 .019578 -1.22 0.221 -.0623202 .0144241
age | -.0036431 .0005284 -6.89 0.000 -.0046788 -.0026073
_cons | -3.639277 .3077568 -11.83 0.000 -4.24247 -3.036085
-------------+----------------------------------------------------------------
/ln_p | .9214587 .0396737 23.23 0.000 .8436997 .9992178
-------------+----------------------------------------------------------------
p | 2.512953 .0996982 2.324953 2.716156
1/p | .3979381 .0157877 .3681673 .4301163
------------------------------------------------------------------------------
Interpretation: streg with the nohr option reports coefficients in log relative-hazard form, so \((e^\beta-1)\times 100\) gives the estimated percent change in the hazard of recidivism; a higher hazard means a shorter expected duration before the next arrest.
display (exp(.0780722)-1)*100
8.1200719
Participation in the work program is associated with an 8.1% higher hazard of recidivism (a shorter duration before rearrest), but the coefficient is not statistically significant.
display (exp(-.287139)-1)*100
-24.959259
Being a felon is associated with a 25% lower hazard of recidivism, that is, a longer duration before rearrest.
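In the log relative-hazard form reported above (the nohr option), these transformations give the percent change in the hazard of recidivism (coefficients copied from the streg output; a quick arithmetic check, not part of the Stata session):

```python
import math

# coefficients copied from the streg output (log relative-hazard form)
b_workprg = 0.0780722
b_felon = -0.287139

# percent change in the hazard of recidivism
print(round((math.exp(b_workprg) - 1) * 100, 1))  # about 8.1
print(round((math.exp(b_felon) - 1) * 100, 1))    # about -25.0
```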
Lesson: Similar to Tobit, when we don’t account for the truncation of the data, that is, for why certain parts of the population are not sampled, we commit a type of omitted variable bias by leaving out \(\hat{\lambda}\), the inverse Mills ratio. We can use the Heckit method for sample selection correction.
We want to see whether there is sample selection bias due to unobserved wage offers for non-working women.
We estimate a probit (or logit) to test and correct for sample selection bias due to the unobserved wage offer for nonworking women, using spousal income, education, experience, age, and number of kids under 6 in the selection equation. \[ ln(wages_{i})=\beta_{0}+ \mathbf{x'\beta} + u_{i} \] where \(\mathbf{x}\) is a vector that includes education, experience, and experience squared.
We use the heckman command to implement the Heckman method for sample selection correction.
\[ ln(wages_{i})=\beta_{0}+ \mathbf{x'\beta} +\mathbf{z'\delta} + u_{i} \] Where \(\mathbf{z}\) is a vector that includes spousal income, education, experience, age, and number of kids less than 6.
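The correction term that heckman reports as lambda is the inverse Mills ratio from the first-stage probit, \(\lambda(z)=\phi(z)/\Phi(z)\), evaluated at each woman's selection index. A minimal stdlib Python sketch of that function (illustrative only, not the heckman implementation):

```python
import math

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(z): the selection-correction term
    ("lambda" in the heckman output) added as a regressor in the
    second-stage wage equation."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))           # standard normal cdf
    return phi / Phi

print(round(inverse_mills(0.0), 4))  # 0.7979
```

A woman on the margin of labor force participation (index near 0) gets a large correction term, while one who works with near certainty (large index) gets a correction near zero.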
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/mroz.dta", clear
reg lwage educ exper expersq
Source | SS df MS Number of obs = 428
-------------+---------------------------------- F(3, 424) = 26.29
Model | 35.0222967 3 11.6740989 Prob > F = 0.0000
Residual | 188.305144 424 .444115906 R-squared = 0.1568
-------------+---------------------------------- Adj R-squared = 0.1509
Total | 223.327441 427 .523015084 Root MSE = .66642
------------------------------------------------------------------------------
lwage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
educ | .1074896 .0141465 7.60 0.000 .0796837 .1352956
exper | .0415665 .0131752 3.15 0.002 .0156697 .0674633
expersq | -.0008112 .0003932 -2.06 0.040 -.0015841 -.0000382
_cons | -.5220406 .1986321 -2.63 0.009 -.9124667 -.1316144
------------------------------------------------------------------------------
Notice: an exclusion restriction is imposed, in that spousal income, age, kids under 6, and kids over 6 enter the selection equation but are excluded from the main wage regression.
We use a subset of all the exogenous variables to answer two questions: 1. What factors are correlated with being in the labor force? 2. What is the impact of education and experience on wages?
use "/Users/Sam/Desktop/Econ 645/Data/Wooldridge/mroz.dta", clear
heckman lwage educ exper expersq, select(inlf=nwifeinc educ exper expersq age kidslt6 kidsge6) twostep
Heckman selection model -- two-step estimates Number of obs = 753
(regression model with sample selection) Censored obs = 325
Uncensored obs = 428
Wald chi2(3) = 51.53
Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lwage |
educ | .1090655 .015523 7.03 0.000 .0786411 .13949
exper | .0438873 .0162611 2.70 0.007 .0120163 .0757584
expersq | -.0008591 .0004389 -1.96 0.050 -.0017194 1.15e-06
_cons | -.5781032 .3050062 -1.90 0.058 -1.175904 .019698
-------------+----------------------------------------------------------------
inlf |
nwifeinc | -.0120237 .0048398 -2.48 0.013 -.0215096 -.0025378
educ | .1309047 .0252542 5.18 0.000 .0814074 .180402
exper | .1233476 .0187164 6.59 0.000 .0866641 .1600311
expersq | -.0018871 .0006 -3.15 0.002 -.003063 -.0007111
age | -.0528527 .0084772 -6.23 0.000 -.0694678 -.0362376
kidslt6 | -.8683285 .1185223 -7.33 0.000 -1.100628 -.636029
kidsge6 | .036005 .0434768 0.83 0.408 -.049208 .1212179
_cons | .2700768 .508593 0.53 0.595 -.7267473 1.266901
-------------+----------------------------------------------------------------
mills |
lambda | .0322619 .1336246 0.24 0.809 -.2296376 .2941613
-------------+----------------------------------------------------------------
rho | 0.04861
sigma | .66362875
------------------------------------------------------------------------------
est clear
eststo OLS: reg lwage educ exper expersq
eststo Heckman: heckman lwage educ exper expersq, select(inlf=nwifeinc educ exper expersq age kidslt6 kidsge6) twostep
esttab, mtitle
(1) (2)
OLS Heckman
--------------------------------------------
main
educ 0.107*** 0.109***
(7.60) (7.03)
exper 0.0416** 0.0439**
(3.15) (2.70)
expersq -0.000811* -0.000859
(-2.06) (-1.96)
_cons -0.522** -0.578
(-2.63) (-1.90)
--------------------------------------------
inlf
nwifeinc -0.0120*
(-2.48)
educ 0.131***
(5.18)
exper 0.123***
(6.59)
expersq -0.00189**
(-3.15)
age -0.0529***
(-6.23)
kidslt6 -0.868***
(-7.33)
kidsge6 0.0360
(0.83)
_cons 0.270
(0.53)
--------------------------------------------
mills
lambda 0.0323
(0.24)
--------------------------------------------
N 428 753
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
There is no real evidence of sample selection bias in the wage offer equation: \(\hat{\lambda}\) is not statistically significant, so we fail to reject \(H_0: \rho=0\). We also see very little difference between the OLS and Heckman estimates.
Interpretation: similar to OLS. With a log-linear model, the return is interpreted with \(e^\beta\) or \((e^\beta-1)\times 100\). \(\hat{\lambda}\) would be a potential omitted variable, but in this example we fail to reject the null hypothesis, so selection bias does not appear to be problematic.