Samuel Rowe Adapted from Wooldridge and Mitchell
Set Working Directory
. clear . set more off . cd "/Users/Sam/Desktop/Econ 645/Data/Wooldridge" /Users/Sam/Desktop/Econ 645/Data/Wooldridge
Let’s look at the data for married working women
. use mroz.dta, clear
Our labor supply equation hoursi = β10 + α1ln(wagei) + β11educi + β12agei + β13kidslt6i + β14nwifeinc + u1
Our labor demand equation in term of wages as a function of productivity ln(wagei) = β20 + α2hoursi + β21 * educi + β22experi + β23experi2 + u2
To estimate our labor supply we use educ, age, kidslt6, nwifeinc, exper, and exper-squared
. ivregress 2sls hours (lwage=exper expersq) educ age kidslt6 nwifeinc, robust
Instrumental variables (2SLS) regression Number of obs = 428
Wald chi2(5) = 12.60
Prob > chi2 = 0.0274
R-squared = .
Root MSE = 1344.7
─────────────┬────────────────────────────────────────────────────────────────
│ Robust
hours │ Coef. Std. Err. z P>|z| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
lwage │ 1639.556 593.3108 2.76 0.006 476.6879 2802.423
educ │ -183.7513 67.78742 -2.71 0.007 -316.6122 -50.89039
age │ -7.806092 10.48746 -0.74 0.457 -28.36114 12.74896
kidslt6 │ -198.1543 208.4247 -0.95 0.342 -606.6592 210.3506
nwifeinc │ -10.16959 5.287486 -1.92 0.054 -20.53287 .1936911
_cons │ 2225.662 603.0964 3.69 0.000 1043.615 3407.709
─────────────┴────────────────────────────────────────────────────────────────
Instrumented: lwage
Instruments: educ age kidslt6 nwifeinc exper expersq
Test exper and expersq as instruments
. reg lwage exper expersq educ age kidslt6 nwifeinc, robust
Linear regression Number of obs = 428
F(6, 421) = 14.76
Prob > F = 0.0000
R-squared = 0.1633
Root MSE = .66622
─────────────┬────────────────────────────────────────────────────────────────
│ Robust
lwage │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
exper │ .0418643 .0151135 2.77 0.006 .0121569 .0715718
expersq │ -.0007625 .0004065 -1.88 0.061 -.0015614 .0000365
educ │ .1011113 .0141358 7.15 0.000 .0733257 .128897
age │ -.0025561 .0059149 -0.43 0.666 -.0141826 .0090704
kidslt6 │ -.0532185 .1047899 -0.51 0.612 -.2591951 .1527581
nwifeinc │ .00556 .0027435 2.03 0.043 .0001673 .0109527
_cons │ -.4471607 .2889008 -1.55 0.122 -1.015028 .1207069
─────────────┴────────────────────────────────────────────────────────────────
. test exper expersq
( 1) exper = 0
( 2) expersq = 0
F( 2, 421) = 6.17
Prob > F = 0.0023
Note: Our instrument is rather weak. What is a possible problem with this first-stage?
Our results show that the labor supply curve slopes upward. The estimated coefficient is Δhours = 1640/100 * (%Δwages)
Our results can be interpreted through linear-log elasticity.
100 * (Δhours/hours) ≈ (1,640/hours)(%Δwages) Or %Δhours ≈ (1,640/hours)(%Δwages)
This implies that the labor supply elasticity with respect to wages is η = 1, 640/hours
The elasticity is not constant since it is a linear-log model instead of log-log model. Using average hours worked of 1,303 Our estimated elasticity = (1,640/hours)(%delta-wages) or 1,640/1,303=1.26
. display 1640/1303 1.2586339
Our estimated elasticity around mean hours is 1.26%, which means a 1% increase in wages around mean hours increases hours by 1.26%. This mean wage elasticity of supply is elastic (>1) around mean hours.
It’s not constant elasticity When hours are higher, full time 40 hours every week, a 1% increase in wages increase the supply of hours around 0.79%
. display 1640/(40*52) .78846154
Our wage elasticity of supply is inelastic since (<1)
At lower hours, our estimated wage elasticity of supply is more elastic When hours are equal to 800, then our wage elasticity of supply is almost 2
. display 1640/800 2.05
A 1% increase in wages increases hours by 2.05%
. use openness.dta, clear
Romer (1993) proposes that more open countries should have lower inflation rates. Romer (1993) tries to explain inflation rates in terms of a country’s average share of imports in gross domestic (or national) product since 1973. Average share of imports in GDP is his measure of opennes.
infit = β10 + α1openit + β11ln(pcincit) + u1 openit = β20 + α2infit + β21ln(pcinc) + β22ln(land) + u2
Where open is the average share of imports in terms of GDP, pcinc is the 1980 per capita income, and land is land area of a country in square miles
We can look at the reduced form equation to see the instruments impact on*openness
. reg open lland lpcinc, robust
Linear regression Number of obs = 114
F(2, 111) = 22.22
Prob > F = 0.0000
R-squared = 0.4487
Root MSE = 17.796
─────────────┬────────────────────────────────────────────────────────────────
│ Robust
open │ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
lland │ -7.567103 1.141798 -6.63 0.000 -9.829652 -5.304554
lpcinc │ .5464812 1.436115 0.38 0.704 -2.299276 3.392238
_cons │ 117.0845 18.24808 6.42 0.000 80.92473 153.2443
─────────────┴────────────────────────────────────────────────────────────────
Let’s test the instrument and use an F-test
. test lland
( 1) lland = 0
F( 1, 111) = 43.92
Prob > F = 0.0000
We want to see if a country’s openness impacts a country’s inflation rate using natural log of square miles as an instrument
. ivregress 2sls inf (open=lland) lpcinc, robust
Instrumental variables (2SLS) regression Number of obs = 114
Wald chi2(2) = 5.19
Prob > chi2 = 0.0745
R-squared = 0.0309
Root MSE = 23.52
─────────────┬────────────────────────────────────────────────────────────────
│ Robust
inf │ Coef. Std. Err. z P>|z| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
open │ -.3374871 .1504296 -2.24 0.025 -.6323238 -.0426504
lpcinc │ .3758247 1.360282 0.28 0.782 -2.29028 3.041929
_cons │ 26.89934 10.77526 2.50 0.013 5.780212 48.01846
─────────────┴────────────────────────────────────────────────────────────────
Instrumented: open
Instruments: lpcinc lland
For every percentage point increase in average share of imports of GDP, inflation decreases by 0.337 percentage points
Pull CPS data and recreate MROZ.dta with CPS data, expect use all individuals not just working women
Estimate linear-log and test elasticity at different hours. Estimate constant elasticity with a log-log model.
Set Working Directory
. cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell" /Users/Sam/Desktop/Econ 645/Data/Mitchell
From my experience, I feel that understanding and mastering the reshape command is invaluable. This is especially true when working with panel data. Panel data should not be used in a wide format. Panel data should be in a long format where you can use the xtset unit time command. After the xtset is established you can easily add lags or differences with the l.var or d.var
It is also helpful in terms of mergering datasets. A lot of time if I’m mergering BLS data, such as state unemployment, or inflation rates, I need to reshape the BLS data from wide to long to I can merge m:1
I typically work in long formats, but reshaping to wide formats does have its benefits. I find wide formats are helpful for people who work in excel, especially when years are across columns
We have two types of datasets: wide and long
Our wide datasets typically include years or variables types spread out across the columns.
. use cardio_wide, clear
. describe
Contains data from cardio_wide.dta
obs: 6
vars: 12 22 Dec 2009 20:43
size: 120
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
We have 12 variables idcode, age, and our 5 different blood pressure and pulse trials. We see that are 5 different incidents/trials of blood pressure and pulse are spread out across 5 different variables bp1-bp5 and pl1-pl5
. list
┌──────────────────────────────────────────────────────────────────────┐
│ id age bp1 bp2 bp3 bp4 bp5 pl1 pl2 pl3 pl4 pl5 │
├──────────────────────────────────────────────────────────────────────┤
1. │ 1 40 115 86 129 105 127 54 87 93 81 92 │
2. │ 2 30 123 136 107 111 120 92 88 125 87 58 │
3. │ 3 16 124 122 101 109 112 105 97 128 57 68 │
4. │ 4 23 105 115 121 129 137 52 79 71 106 39 │
5. │ 5 18 116 128 112 125 111 70 64 52 68 59 │
├──────────────────────────────────────────────────────────────────────┤
6. │ 6 27 108 126 124 131 107 74 78 92 99 80 │
└──────────────────────────────────────────────────────────────────────┘
With a wide format each row is an observation and all of our different trials are located on a single row
We’ll look at an example of the same dataset but in long format
. use cardio_long, clear
. describe
Contains data from cardio_long.dta
obs: 30
vars: 5 10 Feb 2020 23:02
size: 210
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trial byte %9.0g Trial number
age byte %3.0f Age of person
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trial
We only have 5 variables idcode, age, trial, blood pressure, and pulse.
. list
┌──────────────────────────────┐
│ id trial age bp pl │
├──────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├──────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
├──────────────────────────────┤
11. │ 3 1 16 124 105 │
12. │ 3 2 16 122 97 │
13. │ 3 3 16 101 128 │
14. │ 3 4 16 109 57 │
15. │ 3 5 16 112 68 │
├──────────────────────────────┤
16. │ 4 1 23 105 52 │
17. │ 4 2 23 115 79 │
18. │ 4 3 23 121 71 │
19. │ 4 4 23 129 106 │
20. │ 4 5 23 137 39 │
├──────────────────────────────┤
21. │ 5 1 18 116 70 │
22. │ 5 2 18 128 64 │
23. │ 5 3 18 112 52 │
24. │ 5 4 18 125 68 │
25. │ 5 5 18 111 59 │
├──────────────────────────────┤
26. │ 6 1 27 108 74 │
27. │ 6 2 27 126 78 │
28. │ 6 3 27 124 92 │
29. │ 6 4 27 131 99 │
30. │ 6 5 27 107 80 │
└──────────────────────────────┘
We also see that each individual has 5 observations. One for each trial that is sorted from 1 to 5. Notice: We have a cross-sectional unit variable with idcode and We also have a time component with trial. Our dataset is basically a panel data set. We observe the same person over 5 trials.
*hat kind of format should we use? It depends.Mitchell makes a good point of looking at correlations among the blood pressure trials.
. use cardio_wide, clear
. correlate bp1 bp2 bp3 bp4 bp5
(obs=6)
│ bp1 bp2 bp3 bp4 bp5
─────────────┼─────────────────────────────────────────────
bp1 │ 1.0000
bp2 │ 0.2427 1.0000
bp3 │ -0.7662 -0.6657 1.0000
bp4 │ -0.7644 0.3980 0.2644 1.0000
bp5 │ -0.3643 -0.4984 0.3694 -0.0966 1.0000
Mitchell also mentions that multivariate analysis with multiple trials is a bit easier with factor variables. If we wanted to see all of the different regressions for each trial, then
. mvreg bp* = age
Equation Obs Parms RMSE "R-sq" F P
──────────────────────────────────────────────────────────────────────────
bp1 6 2 8.516293 0.0160 .0651388 0.8111
bp2 6 2 15.34222 0.3858 2.512064 0.1882
bp3 6 2 8.846404 0.4597 3.402782 0.1388
bp4 6 2 11.60081 0.1554 .7357661 0.4394
bp5 6 2 11.81248 0.1360 .629676 0.4719
─────────────┬────────────────────────────────────────────────────────────────
│ Coef. Std. Err. t P>|t| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
bp1 │
age │ -.1107266 .4338427 -0.26 0.811 -1.315267 1.093814
_cons │ 118.0087 11.66545 10.12 0.001 85.62017 150.3971
─────────────┼────────────────────────────────────────────────────────────────
bp2 │
age │ -1.238754 .7815736 -1.58 0.188 -3.40875 .9312418
_cons │ 150.628 21.01547 7.17 0.002 92.27974 208.9763
─────────────┼────────────────────────────────────────────────────────────────
bp3 │
age │ .8313149 .4506595 1.84 0.139 -.4199164 2.082546
_cons │ 94.32958 12.11763 7.78 0.001 60.68565 127.9735
─────────────┼────────────────────────────────────────────────────────────────
bp4 │
age │ -.5069204 .5909761 -0.86 0.439 -2.147733 1.133892
_cons │ 131.3443 15.89055 8.27 0.001 87.22504 175.4635
─────────────┼────────────────────────────────────────────────────────────────
bp5 │
age │ .4775087 .6017591 0.79 0.472 -1.193242 2.14826
_cons │ 106.7439 16.1805 6.60 0.003 61.81969 151.6682
─────────────┴────────────────────────────────────────────────────────────────
If we want correlations between blood pressure and pulse, a long format is preferable
. use cardio_long, clear
. correlate bp pl
(obs=30)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ -0.0444 1.0000
Or
. bysort trial: correlate bp pl
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 1
(obs=6)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ 0.7958 1.0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 2
(obs=6)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ -0.1985 1.0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 3
(obs=6)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ -0.4790 1.0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 4
(obs=6)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ 0.5639 1.0000
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 5
(obs=6)
│ bp pl
─────────────┼──────────────────
bp │ 1.0000
pl │ -0.3911 1.0000
It is also more appropriate for panel data and the xt commands, such as xtset, xtreg, etc. I also prefer this for regular regressions where we can use factor variables i.var for different time periods.
. xtset id trial
panel variable: id (strongly balanced)
time variable: trial, 1 to 5
delta: 1 unit
. xtreg bp age
Random-effects GLS regression Number of obs = 30
Group variable: id Number of groups = 6
R-sq: Obs per group:
within = 0.0000 min = 5
between = 0.0726 avg = 5.0
overall = 0.0061 max = 5
Wald chi2(1) = 0.17
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.6778
─────────────┬────────────────────────────────────────────────────────────────
bp │ Coef. Std. Err. z P>|z| [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
age │ -.1095156 .2635863 -0.42 0.678 -.6261352 .4071041
_cons │ 120.2109 7.087482 16.96 0.000 106.3197 134.1021
─────────────┼────────────────────────────────────────────────────────────────
sigma_u │ 0
sigma_e │ 11.994443
rho │ 0 (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────
. xtline bp, overlay
It is also easier to recode in long format
. recode pl (min/89=0) (90/max=1), generate (plhi) (30 differences between pl and plhi)
Doing this in wide format is more of a pain You can see this on page 293 in Mitchell.
If you wanted to mean of each 5 trials, it just as easy with egen in long format.
. bysort id: egen pl_avg=mean(pl) . bysort id: egen bp_avg=mean(bp)
For wide-format you can use rowmean with the wildcard
. use cardio_wide, clear
We don’t need the bysort group, but we need to be careful which columns to include
. egen pl_avg = rowmean(pl*) . egen bp_avg = rowmean(bp*)
It easier to explicitly difference trails with wide-format, but requires more typing
. generate pldiff2 = pl2-pl1 . generate pldiff3 = pl3-pl2 . generate pldiff4 = pl4-pl3 . generate pldiff5 = pl5-pl4
But I prefer using subscripting/indexing within groups or using xtset with panel data since we have a lot of flexibilities.
. use cardio_long, clear . sort id trial
We can use indexing
. by id: generate pldiff = pl-pl[_n-1] (6 missing values generated)
We can difference from the first trial
. by id: generate pldiff_1trial = pl-pl[1]
We can use xtset as well with l.var
. xtset id trial
panel variable: id (strongly balanced)
time variable: trial, 1 to 5
delta: 1 unit
One lag
. gen pldiff_other = pl-l.pl (6 missing values generated)
We can use two lags
. gen pldiff2_other = l2.pl-l1.pl (12 missing values generated)
We can use three lags
. gen pldiff3_other = l3.pl-l2.pl (18 missing values generated)
Wide format can become unwieldy quickly and Stata does have a variable limit size, while your observation limit is typically your memory. In short, it is better in the long-run to learn to work with Stata in long-format.
However, Mitchell does bring up a good point that in our panel data, we may need multiple weights wt1-wtk. You can see this in the ACS PUMS with 250 or so replicate weights. We do not want to reshape our data by weights, we need to keep the data in a cross-sectional unit i and time period t format.
In short, I agree with Mitchell that you should learn to work with your data in a long-format data structure.
There may be scenarios where we need to reshape our long-format data into a wide format. For example, it might be easier to match and merge our data If our using data set is in wide format and our master data set is in wide format
Let’s look at an example
. use cardio_long, clear
. describe
Contains data from cardio_long.dta
obs: 30
vars: 5 10 Feb 2020 23:02
size: 210
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trial byte %9.0g Trial number
age byte %3.0f Age of person
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trial
. list
┌──────────────────────────────┐
│ id trial age bp pl │
├──────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├──────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
├──────────────────────────────┤
11. │ 3 1 16 124 105 │
12. │ 3 2 16 122 97 │
13. │ 3 3 16 101 128 │
14. │ 3 4 16 109 57 │
15. │ 3 5 16 112 68 │
├──────────────────────────────┤
16. │ 4 1 23 105 52 │
17. │ 4 2 23 115 79 │
18. │ 4 3 23 121 71 │
19. │ 4 4 23 129 106 │
20. │ 4 5 23 137 39 │
├──────────────────────────────┤
21. │ 5 1 18 116 70 │
22. │ 5 2 18 128 64 │
23. │ 5 3 18 112 52 │
24. │ 5 4 18 125 68 │
25. │ 5 5 18 111 59 │
├──────────────────────────────┤
26. │ 6 1 27 108 74 │
27. │ 6 2 27 126 78 │
28. │ 6 3 27 124 92 │
29. │ 6 4 27 131 99 │
30. │ 6 5 27 107 80 │
└──────────────────────────────┘
Our reshape command requires reshape What direction: wide or long Variables to reshape (bp and pl) Variables define an observation i(id) Variables that defines the repeated observations for each person or time variable j(trial) for reshaping long to wide Trial values will become the suffix values for our reshape values
. reshape wide bp pl, i(id) j(trial)
(note: j = 1 2 3 4 5)
Data long -> wide
─────────────────────────────────────────────────────────────────────────────
Number of obs. 30 -> 6
Number of variables 5 -> 12
j variable (5 values) trial -> (dropped)
xij variables:
bp -> bp1 bp2 ... bp5
pl -> pl1 pl2 ... pl5
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 6
vars: 12
size: 132
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
bp1 int %3.0f 1 bp
pl1 int %3.0f 1 pl
bp2 int %3.0f 2 bp
pl2 int %3.0f 2 pl
bp3 int %3.0f 3 bp
pl3 int %3.0f 3 pl
bp4 int %3.0f 4 bp
pl4 int %3.0f 4 pl
bp5 int %3.0f 5 bp
pl5 int %3.0f 5 pl
age byte %3.0f Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id
Note: Dataset has changed since last saved.
. list
┌──────────────────────────────────────────────────────────────────────┐
│ id bp1 pl1 bp2 pl2 bp3 pl3 bp4 pl4 bp5 pl5 age │
├──────────────────────────────────────────────────────────────────────┤
1. │ 1 115 54 86 87 129 93 105 81 127 92 40 │
2. │ 2 123 92 136 88 107 125 111 87 120 58 30 │
3. │ 3 124 105 122 97 101 128 109 57 112 68 16 │
4. │ 4 105 52 115 79 121 71 129 106 137 39 23 │
5. │ 5 116 70 128 64 112 52 125 68 111 59 18 │
├──────────────────────────────────────────────────────────────────────┤
6. │ 6 108 74 126 78 124 92 131 99 107 80 27 │
└──────────────────────────────────────────────────────────────────────┘
Notice we excluded age. It is constant within the cross-sectional unit. It does not define the cross-sectional unit or the*time dimension unit. We do not want to separate it into repeated observations either.
Without doing any additional work, we can reshape it back to long with simply reshape long.
. reshape long
(note: j = 1 2 3 4 5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trial
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 30
vars: 5
size: 210
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trial byte %9.0g Trial number
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
age byte %3.0f Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trial
Note: Dataset has changed since last saved.
Problem 1: a constant value within a cross-sectional unit is not constant
. use cardio_long2, clear
. list in 1/10
┌──────────────────────────────┐
│ id trial age bp pl │
├──────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 44 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├──────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
└──────────────────────────────┘
We see that cross-sectional unit 1 has a data entry problem with age in the 3rd trial. If we reshape, we will get an error
. capture reshape wide bp pl, i(id) j(trial)
Use the reshape error command to help with the problem
. reshape error
(note: j = 1 2 3 4 5)
i (id) indicates the top-level grouping such as subject id.
j (trial) indicates the subgrouping such as time.
xij variables are bp pl.
Thus, the following variable(s) should be constant within i:
age
age not constant within i (id) for 1 value of i:
┌──────────────────┐
│ id trial age │
├──────────────────┤
1. │ 1 1 40 │
2. │ 1 2 40 │
3. │ 1 3 44 │
4. │ 1 4 40 │
5. │ 1 5 40 │
└──────────────────┘
(data now sorted by id trial)
It tells us that our constant variable is not constant within cross-sectional units
Problem 2: excluding a non-constant that we need to reshape. If we excluded pl from our reshape, we will get a similar error
. capture reshape wide bp, i(id) j(trial)
. reshape error
(note: j = 1 2 3 4 5)
i (id) indicates the top-level grouping such as subject id.
j (trial) indicates the subgrouping such as time.
xij variable is bp.
Thus, the following variable(s) should be constant within i:
age pl
age not constant within i (id) for 1 value of i:
┌──────────────────┐
│ id trial age │
├──────────────────┤
1. │ 1 1 40 │
2. │ 1 2 40 │
3. │ 1 3 44 │
4. │ 1 4 40 │
5. │ 1 5 40 │
└──────────────────┘
pl not constant within i (id) for 6 values of i:
┌──────────────────┐
│ id trial pl │
├──────────────────┤
1. │ 1 1 54 │
2. │ 1 2 87 │
3. │ 1 3 93 │
4. │ 1 4 81 │
5. │ 1 5 92 │
├──────────────────┤
6. │ 2 1 92 │
7. │ 2 2 88 │
8. │ 2 3 125 │
9. │ 2 4 87 │
10. │ 2 5 58 │
├──────────────────┤
11. │ 3 1 105 │
12. │ 3 2 97 │
13. │ 3 3 128 │
14. │ 3 4 57 │
15. │ 3 5 68 │
├──────────────────┤
16. │ 4 1 52 │
17. │ 4 2 79 │
18. │ 4 3 71 │
19. │ 4 4 106 │
20. │ 4 5 39 │
├──────────────────┤
21. │ 5 1 70 │
22. │ 5 2 64 │
23. │ 5 3 52 │
24. │ 5 4 68 │
25. │ 5 5 59 │
├──────────────────┤
26. │ 6 1 74 │
27. │ 6 2 78 │
28. │ 6 3 92 │
29. │ 6 4 99 │
30. │ 6 5 80 │
└──────────────────┘
(data now sorted by id trial)
In my personal opinion, this reshaping wide to long is a more common occurance. It is important to get data into a long-format so we can conduct panel analysis. I had to do this for a few of the Wooldridge datasets that were panel data, but in wide format.
. use cardio_wide, clear
. describe
Contains data from cardio_wide.dta
obs: 6
vars: 12 22 Dec 2009 20:43
size: 120
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
. list
┌──────────────────────────────────────────────────────────────────────┐
│ id age bp1 bp2 bp3 bp4 bp5 pl1 pl2 pl3 pl4 pl5 │
├──────────────────────────────────────────────────────────────────────┤
1. │ 1 40 115 86 129 105 127 54 87 93 81 92 │
2. │ 2 30 123 136 107 111 120 92 88 125 87 58 │
3. │ 3 16 124 122 101 109 112 105 97 128 57 68 │
4. │ 4 23 105 115 121 129 137 52 79 71 106 39 │
5. │ 5 18 116 128 112 125 111 70 64 52 68 59 │
├──────────────────────────────────────────────────────────────────────┤
6. │ 6 27 108 126 124 131 107 74 78 92 99 80 │
└──────────────────────────────────────────────────────────────────────┘
Let’s reshape wide to long with Our reshape values that are not constant: bp and pl Our cross-sectional unit to reshape upon or our list of variables that uniquely identify a cross-sectional unit. For example, we may need a State FIPS code and a County FIPS code to unique identify a county cross-sectional unit (of course we can merge the 2-digit State FIPS code and the 3-digit County FIPS code to create a 5-digit FIPS code in one variable). Our time unit, where with reshape long, we need to create a new variable that takes on the values of the suffixes of bp and pl (For example trialnum). Our constant values within cross-sectional units do not need to be identified.
. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 30
vars: 5
size: 210
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
pl int %3.0f
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trialnum
Note: Dataset has changed since last saved.
. list in 1/10
┌─────────────────────────────────┐
│ id trialnum age bp pl │
├─────────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├─────────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
└─────────────────────────────────┘
To reshape it back to wide, we just simply state
. reshape wide
(note: j = 1 2 3 4 5)
Data long -> wide
─────────────────────────────────────────────────────────────────────────────
Number of obs. 30 -> 6
Number of variables 5 -> 12
j variable (5 values) trialnum -> (dropped)
xij variables:
bp -> bp1 bp2 ... bp5
pl -> pl1 pl2 ... pl5
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 6
vars: 12
size: 132
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
bp1 int %3.0f 1 bp
pl1 int %3.0f 1 pl
bp2 int %3.0f 2 bp
pl2 int %3.0f 2 pl
bp3 int %3.0f 3 bp
pl3 int %3.0f 3 pl
bp4 int %3.0f 4 bp
pl4 int %3.0f 4 pl
bp5 int %3.0f 5 bp
pl5 int %3.0f 5 pl
age byte %3.0f Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id
Note: Dataset has changed since last saved.
. list
┌──────────────────────────────────────────────────────────────────────┐
│ id bp1 pl1 bp2 pl2 bp3 pl3 bp4 pl4 bp5 pl5 age │
├──────────────────────────────────────────────────────────────────────┤
1. │ 1 115 54 86 87 129 93 105 81 127 92 40 │
2. │ 2 123 92 136 88 107 125 111 87 120 58 30 │
3. │ 3 124 105 122 97 101 128 109 57 112 68 16 │
4. │ 4 105 52 115 79 121 71 129 106 137 39 23 │
5. │ 5 116 70 128 64 112 52 125 68 111 59 18 │
├──────────────────────────────────────────────────────────────────────┤
6. │ 6 108 74 126 78 124 92 131 99 107 80 27 │
└──────────────────────────────────────────────────────────────────────┘
Problem 1: Failing to list all of the varying variables to be reshaped.
This is a dangerous kind of problem, since it doesn’t throw an error. This requires the user to inspect the data after reshaping to make sure everything worked properly.
. use cardio_wide, clear
. reshape long bp, i(id) j(trialnum)
(note: j = 1 2 3 4 5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 12 -> 9
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
─────────────────────────────────────────────────────────────────────────────
The remedy is easy: don’t forget all of your variables to be reshaped
. use cardio_wide, clear
. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
─────────────────────────────────────────────────────────────────────────────
Problem 2: Be careful when the time-varying value is embedded within the variable name and not at the end.
Typically you will find that the varying variable will have var1, var2, …, vark and you simply specify reshape long var, i(id) j(time). But, if the varying value is embedded like t1bp,…,t5bp and t1pl,…,t5pl, then we cannot simply use reshape long tbp tpl, i(id) j(trialnum), since Stata looks for the varying value at the end.
. use cardio_wide2, clear
. describe
Contains data from cardio_wide2.dta
obs: 6
vars: 12 31 Dec 2009 15:46
size: 120
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
age byte %3.0f Age of person
t1bp int %3.0f Blood pressure systolic Trial 1
t2bp int %3.0f Blood pressure systolic Trial 2
t3bp int %3.0f Blood pressure systolic Trial 3
t4bp int %3.0f Blood pressure systolic Trial 4
t5bp int %3.0f Blood pressure systolic Trial 5
t1pl int %3.0f Pulse: Trial 1
t2pl byte %3.0f Pulse: Trial 2
t3pl int %3.0f Pulse: Trial 3
t4pl int %3.0f Pulse: Trial 4
t5pl byte %3.0f Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
. capture reshape long tpl tpb, i(id) j(trialnum)
. reshape error
no xij variables found
You typed something like reshape wide a b, i(i) j(j).
reshape looked for existing variables named a# and b# but could not find any. Remember this picture:
long wide
┌───────────────┐ ┌──────────────────┐
│ i j a b │ │ i a1 a2 b1 b2 │
│───────────────│ <─── reshape ───> │──────────────────│
│ 1 1 1 2 │ │ 1 1 3 2 4 │
│ 1 2 3 4 │ │ 2 5 7 6 8 │
│ 2 1 5 6 │ └──────────────────┘
│ 2 2 7 8 │
└───────────────┘
long to wide: reshape wide a b, i(i) j(j) (j existing variable)
wide to long: reshape long a b, i(i) j(j) (j new variable)
We need the operator @ to indicate to Stata that our varying-values are in that part of the variable names.
. reshape long t@pl t@bp, i(id) j(trialnum)
(note: j = 1 2 3 4 5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trialnum
xij variables:
t1pl t2pl ... t5pl -> tpl
t1bp t2bp ... t5bp -> tbp
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 30
vars: 5
size: 210
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
tbp int %3.0f
tpl int %3.0f
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trialnum
Note: Dataset has changed since last saved.
. list in 1/10
┌─────────────────────────────────┐
│ id trialnum age tbp tpl │
├─────────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├─────────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
└─────────────────────────────────┘
Problem 3: Be specific with which variables to reshape and don’t mistakenly constant variables that have the same prefix as the varying variables.
. use cardio_wide3, clear . clist bp* pl*, noobs bp1 bp2 bp3 bp4 bp5 bp2005 pl1 pl2 pl3 pl4 pl5 pl2005 115 86 129 105 127 112 54 87 93 81 92 81 123 136 107 111 120 119 92 88 125 87 58 90 124 122 101 109 112 113 105 97 128 57 68 91 105 115 121 129 137 121 52 79 71 106 39 69 116 128 112 125 111 118 70 64 52 68 59 62 108 126 124 131 107 119 74 78 92 99 80 84
Our bp2005 and pl2005 are constant variables within our cross-section, but have the same prefix as our value-varying variables
. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5 2005)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 36
Number of variables 14 -> 5
j variable (6 values) -> trialnum
xij variables:
bp1 bp2 ... bp2005 -> bp
pl1 pl2 ... pl2005 -> pl
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 36
vars: 5
size: 432
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trialnum int %9.0g
age byte %3.0f Age of person
bp float %9.0g
pl float %9.0g
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trialnum
Note: Dataset has changed since last saved.
. list in 1/12, sepby(id)
┌─────────────────────────────────┐
│ id trialnum age bp pl │
├─────────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
6. │ 1 2005 40 112 81 │
├─────────────────────────────────┤
7. │ 2 1 30 123 92 │
8. │ 2 2 30 136 88 │
9. │ 2 3 30 107 125 │
10. │ 2 4 30 111 87 │
11. │ 2 5 30 120 58 │
12. │ 2 2005 30 119 90 │
└─────────────────────────────────┘
We did not intend to reshape bp2005 and pl2005.
To rememdy, we need to specify the values within j() with j(trialnum 1-5) to indicate that we only want to reshape suffixes with 1-5
. use cardio_wide3, clear
. reshape long bp pl, i(id) j(trialnum 1-5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 14 -> 7
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 30
vars: 7
size: 450
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
bp2005 float %9.0g
pl int %3.0f
pl2005 float %9.0g
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id trialnum
Note: Dataset has changed since last saved.
. list in 1/10, sepby(id)
┌───────────────────────────────────────────────────┐
│ id trialnum age bp bp2005 pl pl2005 │
├───────────────────────────────────────────────────┤
1. │ 1 1 40 115 112 54 81 │
2. │ 1 2 40 86 112 87 81 │
3. │ 1 3 40 129 112 93 81 │
4. │ 1 4 40 105 112 81 81 │
5. │ 1 5 40 127 112 92 81 │
├───────────────────────────────────────────────────┤
6. │ 2 1 30 123 119 92 90 │
7. │ 2 2 30 136 119 88 90 │
8. │ 2 3 30 107 119 125 90 │
9. │ 2 4 30 111 119 87 90 │
10. │ 2 5 30 120 119 58 90 │
└───────────────────────────────────────────────────┘
Problem 4: No id varible to identify the cross-sectional unit If each wide-format observation is a unique cross-sectional unit, then we can use _n to generate a unique id variable
. use cardio_wide3, clear
. describe
Contains data from cardio_wide3.dta
obs: 6
vars: 14 31 Dec 2009 15:46
size: 168
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
bp2005 float %9.0g
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
pl2005 float %9.0g
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
. gen idcode = _n
. describe
Contains data from cardio_wide3.dta
obs: 6
vars: 15 31 Dec 2009 15:46
size: 216
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
bp2005 float %9.0g
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
pl2005 float %9.0g
idcode double %10.0g
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by:
Note: Dataset has changed since last saved.
. reshape long bp pl, i(id idcode) j(trialnum 1-5)
Data wide -> long
─────────────────────────────────────────────────────────────────────────────
Number of obs. 6 -> 30
Number of variables 15 -> 8
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
─────────────────────────────────────────────────────────────────────────────
. describe
Contains data
obs: 30
vars: 8
size: 690
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
storage display value
variable name type format label variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id byte %3.0f ID of person
idcode double %10.0g
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
bp2005 float %9.0g
pl int %3.0f
pl2005 float %9.0g
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id idcode trialnum
Note: Dataset has changed since last saved.
. list in 1/10, sepby(id)
┌────────────────────────────────────────────────────────────┐
│ id idcode trialnum age bp bp2005 pl pl2005 │
├────────────────────────────────────────────────────────────┤
1. │ 1 1 1 40 115 112 54 81 │
2. │ 1 1 2 40 86 112 87 81 │
3. │ 1 1 3 40 129 112 93 81 │
4. │ 1 1 4 40 105 112 81 81 │
5. │ 1 1 5 40 127 112 92 81 │
├────────────────────────────────────────────────────────────┤
6. │ 2 2 1 30 123 119 92 90 │
7. │ 2 2 2 30 136 119 88 90 │
8. │ 2 2 3 30 107 119 125 90 │
9. │ 2 2 4 30 111 119 87 90 │
10. │ 2 2 5 30 120 119 58 90 │
└────────────────────────────────────────────────────────────┘
Mitchell discusses when we have multiple levels of data, such as cross-sectional unit i and time dimension t, where data is level 1 data are constant across time within unit, but vary across cross-sectional units and level 2 data are time-varying data that vary within a cross-sectional units
Sometimes we have multiple levels, such as school district, school, classroom, and student.
If you are interested, please review pages 311-314
The collapse command can be useful finding statistics at aggregate levels for different groups. You don’t have to use egen but you can if you would like. After using the egen command, such as egen sumvar = sum(var), or egen avgvar = mean(var) the collapse command can be useful. Let’s say we want to aggregate mean wages by state in the CPS, we can use the egen mean command to find average wages in the state for month m in year y. We can then keep state month year and avgvar for all 50 states for each month and year of interest.
We need to be careful with the collapse command, since your original data are lost from memory unless you use a tempfile or a data frame.
. use cardio_long, clear
. list, sepby(id)
┌──────────────────────────────┐
│ id trial age bp pl │
├──────────────────────────────┤
1. │ 1 1 40 115 54 │
2. │ 1 2 40 86 87 │
3. │ 1 3 40 129 93 │
4. │ 1 4 40 105 81 │
5. │ 1 5 40 127 92 │
├──────────────────────────────┤
6. │ 2 1 30 123 92 │
7. │ 2 2 30 136 88 │
8. │ 2 3 30 107 125 │
9. │ 2 4 30 111 87 │
10. │ 2 5 30 120 58 │
├──────────────────────────────┤
11. │ 3 1 16 124 105 │
12. │ 3 2 16 122 97 │
13. │ 3 3 16 101 128 │
14. │ 3 4 16 109 57 │
15. │ 3 5 16 112 68 │
├──────────────────────────────┤
16. │ 4 1 23 105 52 │
17. │ 4 2 23 115 79 │
18. │ 4 3 23 121 71 │
19. │ 4 4 23 129 106 │
20. │ 4 5 23 137 39 │
├──────────────────────────────┤
21. │ 5 1 18 116 70 │
22. │ 5 2 18 128 64 │
23. │ 5 3 18 112 52 │
24. │ 5 4 18 125 68 │
25. │ 5 5 18 111 59 │
├──────────────────────────────┤
26. │ 6 1 27 108 74 │
27. │ 6 2 27 126 78 │
28. │ 6 3 27 124 92 │
29. │ 6 4 27 131 99 │
30. │ 6 5 27 107 80 │
└──────────────────────────────┘
Let’s say we want to find mean blood pressure and pluse for each cross-sectional unit, then we can use the egen command or the collapse commands
. sort id . by id: egen meanbp = mean(bp) . by id: egen meanpl = mean(pl) . by id: egen maxbp = max(bp) . by id: egen maxpl = max(bp) . by id: egen minbp = min(bp) . by id: egen minpl = min(bp) . by id: egen medbp = pctile(bp), p(50) . by id: egen medpl = pctile(pl), p(50)
. collapse meanbp meanpl maxbp maxpl minbp minpl medbp medpl, by(id)
Or you can just use the collapse command
. use cardio_long, clear
. collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl /// > (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)
Notice that our data are rounded to the nearest 1, unlike our egen commands We can format our data to reflect that decimal point to two places instead of rounding up by default
. use cardio_long, clear . format bp* pl* %5.2f . collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl /// > (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)
Exercises Pull BLS monthly inflation data, reshape it into long, and merge with CPS