ECON 645: Week 7:

Chapter 16: Simultaneous Equation Models and Mitchell Chapter 9

Samuel Rowe Adapted from Wooldridge and Mitchell

Wooldridge

Set Working Directory

. clear

. set more off

. cd "/Users/Sam/Desktop/Econ 645/Data/Wooldridge"
/Users/Sam/Desktop/Econ 645/Data/Wooldridge

Labor Supply of Married, Working Women

Let’s look at the data for married working women

. use mroz.dta, clear

Our labor supply equation hoursi = β10 + α1ln(wagei) + β11educi + β12agei + β13kidslt6i + β14nwifeinc + u1

Our labor demand equation in term of wages as a function of productivity ln(wagei) = β20 + α2hoursi + β21 * educi + β22experi + β23experi2 + u2

To estimate our labor supply we use educ, age, kidslt6, nwifeinc, exper, and exper-squared

. ivregress 2sls hours (lwage=exper expersq) educ age kidslt6 nwifeinc, robust

Instrumental variables (2SLS) regression          Number of obs   =        428
                                                  Wald chi2(5)    =      12.60
                                                  Prob > chi2     =     0.0274
                                                  R-squared       =          .
                                                  Root MSE        =     1344.7

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
       hours │      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       lwage │   1639.556   593.3108     2.76   0.006     476.6879    2802.423
        educ │  -183.7513   67.78742    -2.71   0.007    -316.6122   -50.89039
         age │  -7.806092   10.48746    -0.74   0.457    -28.36114    12.74896
     kidslt6 │  -198.1543   208.4247    -0.95   0.342    -606.6592    210.3506
    nwifeinc │  -10.16959   5.287486    -1.92   0.054    -20.53287    .1936911
       _cons │   2225.662   603.0964     3.69   0.000     1043.615    3407.709
─────────────┴────────────────────────────────────────────────────────────────
Instrumented:  lwage
Instruments:   educ age kidslt6 nwifeinc exper expersq

Test exper and expersq as instruments

. reg lwage exper expersq educ age kidslt6 nwifeinc, robust

Linear regression                               Number of obs     =        428
                                                F(6, 421)         =      14.76
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1633
                                                Root MSE          =     .66622

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
       lwage │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       exper │   .0418643   .0151135     2.77   0.006     .0121569    .0715718
     expersq │  -.0007625   .0004065    -1.88   0.061    -.0015614    .0000365
        educ │   .1011113   .0141358     7.15   0.000     .0733257     .128897
         age │  -.0025561   .0059149    -0.43   0.666    -.0141826    .0090704
     kidslt6 │  -.0532185   .1047899    -0.51   0.612    -.2591951    .1527581
    nwifeinc │     .00556   .0027435     2.03   0.043     .0001673    .0109527
       _cons │  -.4471607   .2889008    -1.55   0.122    -1.015028    .1207069
─────────────┴────────────────────────────────────────────────────────────────

. test exper expersq

 ( 1)  exper = 0
 ( 2)  expersq = 0

       F(  2,   421) =    6.17
            Prob > F =    0.0023

Note: Our instrument is rather weak. What is a possible problem with this first-stage?

Interpretation:

Our results show that the labor supply curve slopes upward. The estimated coefficient is Δhours = 1640/100 * (%Δwages)

Our results can be interpreted through linear-log elasticity.

100 * (Δhours/hours) ≈ (1,640/hours)(%Δwages) Or %Δhours ≈ (1,640/hours)(%Δwages)

This implies that the labor supply elasticity with respect to wages is η = 1, 640/hours

The elasticity is not constant since it is a linear-log model instead of log-log model. Using average hours worked of 1,303 Our estimated elasticity = (1,640/hours)(%delta-wages) or 1,640/1,303=1.26

. display 1640/1303
1.2586339

Our estimated elasticity around mean hours is 1.26%, which means a 1% increase in wages around mean hours increases hours by 1.26%. This mean wage elasticity of supply is elastic (>1) around mean hours.

It’s not constant elasticity When hours are higher, full time 40 hours every week, a 1% increase in wages increase the supply of hours around 0.79%

. display 1640/(40*52)
.78846154

Our wage elasticity of supply is inelastic since (<1)

At lower hours, our estimated wage elasticity of supply is more elastic When hours are equal to 800, then our wage elasticity of supply is almost 2

. display 1640/800
2.05

A 1% increase in wages increases hours by 2.05%

Inflation and Openness

. use openness.dta, clear

Romer (1993) proposes that more open countries should have lower inflation rates. Romer (1993) tries to explain inflation rates in terms of a country’s average share of imports in gross domestic (or national) product since 1973. Average share of imports in GDP is his measure of opennes.

infit = β10 + α1openit + β11ln(pcincit) + u1 openit = β20 + α2infit + β21ln(pcinc) + β22ln(land) + u2

Where open is the average share of imports in terms of GDP, pcinc is the 1980 per capita income, and land is land area of a country in square miles

We can look at the reduced form equation to see the instruments impact on*openness

. reg open lland lpcinc, robust

Linear regression                               Number of obs     =        114
                                                F(2, 111)         =      22.22
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4487
                                                Root MSE          =     17.796

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
        open │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
       lland │  -7.567103   1.141798    -6.63   0.000    -9.829652   -5.304554
      lpcinc │   .5464812   1.436115     0.38   0.704    -2.299276    3.392238
       _cons │   117.0845   18.24808     6.42   0.000     80.92473    153.2443
─────────────┴────────────────────────────────────────────────────────────────

Let’s test the instrument and use an F-test

. test lland

 ( 1)  lland = 0

       F(  1,   111) =   43.92
            Prob > F =    0.0000

We want to see if a country’s openness impacts a country’s inflation rate using natural log of square miles as an instrument

. ivregress 2sls inf (open=lland) lpcinc, robust

Instrumental variables (2SLS) regression          Number of obs   =        114
                                                  Wald chi2(2)    =       5.19
                                                  Prob > chi2     =     0.0745
                                                  R-squared       =     0.0309
                                                  Root MSE        =      23.52

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
         inf │      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
        open │  -.3374871   .1504296    -2.24   0.025    -.6323238   -.0426504
      lpcinc │   .3758247   1.360282     0.28   0.782     -2.29028    3.041929
       _cons │   26.89934   10.77526     2.50   0.013     5.780212    48.01846
─────────────┴────────────────────────────────────────────────────────────────
Instrumented:  open
Instruments:   lpcinc lland

Interpretation

For every percentage point increase in average share of imports of GDP, inflation decreases by 0.337 percentage points

Exercises

Pull CPS data and recreate MROZ.dta with CPS data, expect use all individuals not just working women

Estimate linear-log and test elasticity at different hours. Estimate constant elasticity with a log-log model.

Mitchell

Chapter 9: Changing the shape of your data

Set Working Directory

. cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
/Users/Sam/Desktop/Econ 645/Data/Mitchell

Intro

From my experience, I feel that understanding and mastering the reshape command is invaluable. This is especially true when working with panel data. Panel data should not be used in a wide format. Panel data should be in a long format where you can use the xtset unit time command. After the xtset is established you can easily add lags or differences with the l.var or d.var

It is also helpful in terms of mergering datasets. A lot of time if I’m mergering BLS data, such as state unemployment, or inflation rates, I need to reshape the BLS data from wide to long to I can merge m:1

I typically work in long formats, but reshaping to wide formats does have its benefits. I find wide formats are helpful for people who work in excel, especially when years are across columns

9.2: Wide and long datasets

We have two types of datasets: wide and long

Wide Format

Our wide datasets typically include years or variables types spread out across the columns.

. use cardio_wide, clear

. describe

Contains data from cardio_wide.dta
  obs:             6                          
 vars:            12                          22 Dec 2009 20:43
 size:           120                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
age             byte    %3.0f                 Age of person
bp1             int     %3.0f                 Blood pressure systolic Trial 1
bp2             int     %3.0f                 Blood pressure systolic Trial 2
bp3             int     %3.0f                 Blood pressure systolic Trial 3
bp4             int     %3.0f                 Blood pressure systolic Trial 4
bp5             int     %3.0f                 Blood pressure systolic Trial 5
pl1             int     %3.0f                 Pulse: Trial 1
pl2             byte    %3.0f                 Pulse: Trial 2
pl3             int     %3.0f                 Pulse: Trial 3
pl4             int     %3.0f                 Pulse: Trial 4
pl5             byte    %3.0f                 Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

We have 12 variables idcode, age, and our 5 different blood pressure and pulse trials. We see that are 5 different incidents/trials of blood pressure and pulse are spread out across 5 different variables bp1-bp5 and pl1-pl5

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   age   bp1   bp2   bp3   bp4   bp5   pl1   pl2   pl3   pl4   pl5 │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1    40   115    86   129   105   127    54    87    93    81    92 │
  2. │  2    30   123   136   107   111   120    92    88   125    87    58 │
  3. │  3    16   124   122   101   109   112   105    97   128    57    68 │
  4. │  4    23   105   115   121   129   137    52    79    71   106    39 │
  5. │  5    18   116   128   112   125   111    70    64    52    68    59 │
     ├──────────────────────────────────────────────────────────────────────┤
  6. │  6    27   108   126   124   131   107    74    78    92    99    80 │
     └──────────────────────────────────────────────────────────────────────┘

With a wide format each row is an observation and all of our different trials are located on a single row

Long Format

We’ll look at an example of the same dataset but in long format

. use cardio_long, clear

. describe

Contains data from cardio_long.dta
  obs:            30                          
 vars:             5                          10 Feb 2020 23:02
 size:           210                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trial           byte    %9.0g                 Trial number
age             byte    %3.0f                 Age of person
bp              int     %3.0f                 Blood pressure (systolic)
pl              int     %3.0f                 Pulse
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trial

We only have 5 variables idcode, age, trial, blood pressure, and pulse.

. list

     ┌──────────────────────────────┐
     │ id   trial   age    bp    pl │
     ├──────────────────────────────┤
  1. │  1       1    40   115    54 │
  2. │  1       2    40    86    87 │
  3. │  1       3    40   129    93 │
  4. │  1       4    40   105    81 │
  5. │  1       5    40   127    92 │
     ├──────────────────────────────┤
  6. │  2       1    30   123    92 │
  7. │  2       2    30   136    88 │
  8. │  2       3    30   107   125 │
  9. │  2       4    30   111    87 │
 10. │  2       5    30   120    58 │
     ├──────────────────────────────┤
 11. │  3       1    16   124   105 │
 12. │  3       2    16   122    97 │
 13. │  3       3    16   101   128 │
 14. │  3       4    16   109    57 │
 15. │  3       5    16   112    68 │
     ├──────────────────────────────┤
 16. │  4       1    23   105    52 │
 17. │  4       2    23   115    79 │
 18. │  4       3    23   121    71 │
 19. │  4       4    23   129   106 │
 20. │  4       5    23   137    39 │
     ├──────────────────────────────┤
 21. │  5       1    18   116    70 │
 22. │  5       2    18   128    64 │
 23. │  5       3    18   112    52 │
 24. │  5       4    18   125    68 │
 25. │  5       5    18   111    59 │
     ├──────────────────────────────┤
 26. │  6       1    27   108    74 │
 27. │  6       2    27   126    78 │
 28. │  6       3    27   124    92 │
 29. │  6       4    27   131    99 │
 30. │  6       5    27   107    80 │
     └──────────────────────────────┘

We also see that each individual has 5 observations. One for each trial that is sorted from 1 to 5. Notice: We have a cross-sectional unit variable with idcode and We also have a time component with trial. Our dataset is basically a panel data set. We observe the same person over 5 trials.

*hat kind of format should we use? It depends.

Wide Format

Mitchell makes a good point of looking at correlations among the blood pressure trials.

. use cardio_wide, clear

. correlate bp1 bp2 bp3 bp4 bp5
(obs=6)

             │      bp1      bp2      bp3      bp4      bp5
─────────────┼─────────────────────────────────────────────
         bp1 │   1.0000
         bp2 │   0.2427   1.0000
         bp3 │  -0.7662  -0.6657   1.0000
         bp4 │  -0.7644   0.3980   0.2644   1.0000
         bp5 │  -0.3643  -0.4984   0.3694  -0.0966   1.0000

Mitchell also mentions that multivariate analysis with multiple trials is a bit easier with factor variables. If we wanted to see all of the different regressions for each trial, then

. mvreg bp* = age

Equation             Obs   Parms        RMSE    "R-sq"          F        P
──────────────────────────────────────────────────────────────────────────
bp1                    6       2    8.516293    0.0160   .0651388   0.8111
bp2                    6       2    15.34222    0.3858   2.512064   0.1882
bp3                    6       2    8.846404    0.4597   3.402782   0.1388
bp4                    6       2    11.60081    0.1554   .7357661   0.4394
bp5                    6       2    11.81248    0.1360    .629676   0.4719

─────────────┬────────────────────────────────────────────────────────────────
             │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
bp1          │
         age │  -.1107266   .4338427    -0.26   0.811    -1.315267    1.093814
       _cons │   118.0087   11.66545    10.12   0.001     85.62017    150.3971
─────────────┼────────────────────────────────────────────────────────────────
bp2          │
         age │  -1.238754   .7815736    -1.58   0.188     -3.40875    .9312418
       _cons │    150.628   21.01547     7.17   0.002     92.27974    208.9763
─────────────┼────────────────────────────────────────────────────────────────
bp3          │
         age │   .8313149   .4506595     1.84   0.139    -.4199164    2.082546
       _cons │   94.32958   12.11763     7.78   0.001     60.68565    127.9735
─────────────┼────────────────────────────────────────────────────────────────
bp4          │
         age │  -.5069204   .5909761    -0.86   0.439    -2.147733    1.133892
       _cons │   131.3443   15.89055     8.27   0.001     87.22504    175.4635
─────────────┼────────────────────────────────────────────────────────────────
bp5          │
         age │   .4775087   .6017591     0.79   0.472    -1.193242     2.14826
       _cons │   106.7439    16.1805     6.60   0.003     61.81969    151.6682
─────────────┴────────────────────────────────────────────────────────────────

Long Format:

If we want correlations between blood pressure and pulse, a long format is preferable

. use cardio_long, clear

. correlate bp pl
(obs=30)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │  -0.0444   1.0000

Or

. bysort trial: correlate bp pl

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 1
(obs=6)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │   0.7958   1.0000


───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 2
(obs=6)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │  -0.1985   1.0000


───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 3
(obs=6)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │  -0.4790   1.0000


───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 4
(obs=6)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │   0.5639   1.0000


───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
-> trial = 5
(obs=6)

             │       bp       pl
─────────────┼──────────────────
          bp │   1.0000
          pl │  -0.3911   1.0000

It is also more appropriate for panel data and the xt commands, such as xtset, xtreg, etc. I also prefer this for regular regressions where we can use factor variables i.var for different time periods.

. xtset id trial
       panel variable:  id (strongly balanced)
        time variable:  trial, 1 to 5
                delta:  1 unit

. xtreg bp age

Random-effects GLS regression                   Number of obs     =         30
Group variable: id                              Number of groups  =          6

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =          5
     between = 0.0726                                         avg =        5.0
     overall = 0.0061                                         max =          5

                                                Wald chi2(1)      =       0.17
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.6778

─────────────┬────────────────────────────────────────────────────────────────
          bp │      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │  -.1095156   .2635863    -0.42   0.678    -.6261352    .4071041
       _cons │   120.2109   7.087482    16.96   0.000     106.3197    134.1021
─────────────┼────────────────────────────────────────────────────────────────
     sigma_u │          0
     sigma_e │  11.994443
         rho │          0   (fraction of variance due to u_i)
─────────────┴────────────────────────────────────────────────────────────────

. xtline bp, overlay

It is also easier to recode in long format

. recode pl (min/89=0) (90/max=1), generate (plhi)
(30 differences between pl and plhi)

Doing this in wide format is more of a pain You can see this on page 293 in Mitchell.

If you wanted to mean of each 5 trials, it just as easy with egen in long format.

. bysort id: egen pl_avg=mean(pl)

. bysort id: egen bp_avg=mean(bp)

For wide-format you can use rowmean with the wildcard

. use cardio_wide, clear

We don’t need the bysort group, but we need to be careful which columns to include

. egen pl_avg = rowmean(pl*)

. egen bp_avg = rowmean(bp*)

It easier to explicitly difference trails with wide-format, but requires more typing

. generate pldiff2 = pl2-pl1

. generate pldiff3 = pl3-pl2

. generate pldiff4 = pl4-pl3

. generate pldiff5 = pl5-pl4

But I prefer using subscripting/indexing within groups or using xtset with panel data since we have a lot of flexibilities.

. use cardio_long, clear

. sort id trial

We can use indexing

. by id: generate pldiff = pl-pl[_n-1]
(6 missing values generated)

We can difference from the first trial

. by id: generate pldiff_1trial = pl-pl[1]

We can use xtset as well with l.var

. xtset id trial
       panel variable:  id (strongly balanced)
        time variable:  trial, 1 to 5
                delta:  1 unit

One lag

. gen pldiff_other = pl-l.pl
(6 missing values generated)

We can use two lags

. gen pldiff2_other = l2.pl-l1.pl
(12 missing values generated)

We can use three lags

. gen pldiff3_other = l3.pl-l2.pl
(18 missing values generated)

Wide format can become unwieldy quickly and Stata does have a variable limit size, while your observation limit is typically your memory. In short, it is better in the long-run to learn to work with Stata in long-format.

However, Mitchell does bring up a good point that in our panel data, we may need multiple weights wt1-wtk. You can see this in the ACS PUMS with 250 or so replicate weights. We do not want to reshape our data by weights, we need to keep the data in a cross-sectional unit i and time period t format.

In short, I agree with Mitchell that you should learn to work with your data in a long-format data structure.

9.3 Reshaping Long to Wide

There may be scenarios where we need to reshape our long-format data into a wide format. For example, it might be easier to match and merge our data If our using data set is in wide format and our master data set is in wide format

Let’s look at an example

. use cardio_long, clear

. describe 

Contains data from cardio_long.dta
  obs:            30                          
 vars:             5                          10 Feb 2020 23:02
 size:           210                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trial           byte    %9.0g                 Trial number
age             byte    %3.0f                 Age of person
bp              int     %3.0f                 Blood pressure (systolic)
pl              int     %3.0f                 Pulse
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trial

. list

     ┌──────────────────────────────┐
     │ id   trial   age    bp    pl │
     ├──────────────────────────────┤
  1. │  1       1    40   115    54 │
  2. │  1       2    40    86    87 │
  3. │  1       3    40   129    93 │
  4. │  1       4    40   105    81 │
  5. │  1       5    40   127    92 │
     ├──────────────────────────────┤
  6. │  2       1    30   123    92 │
  7. │  2       2    30   136    88 │
  8. │  2       3    30   107   125 │
  9. │  2       4    30   111    87 │
 10. │  2       5    30   120    58 │
     ├──────────────────────────────┤
 11. │  3       1    16   124   105 │
 12. │  3       2    16   122    97 │
 13. │  3       3    16   101   128 │
 14. │  3       4    16   109    57 │
 15. │  3       5    16   112    68 │
     ├──────────────────────────────┤
 16. │  4       1    23   105    52 │
 17. │  4       2    23   115    79 │
 18. │  4       3    23   121    71 │
 19. │  4       4    23   129   106 │
 20. │  4       5    23   137    39 │
     ├──────────────────────────────┤
 21. │  5       1    18   116    70 │
 22. │  5       2    18   128    64 │
 23. │  5       3    18   112    52 │
 24. │  5       4    18   125    68 │
 25. │  5       5    18   111    59 │
     ├──────────────────────────────┤
 26. │  6       1    27   108    74 │
 27. │  6       2    27   126    78 │
 28. │  6       3    27   124    92 │
 29. │  6       4    27   131    99 │
 30. │  6       5    27   107    80 │
     └──────────────────────────────┘

Our reshape command requires reshape What direction: wide or long Variables to reshape (bp and pl) Variables define an observation i(id) Variables that defines the repeated observations for each person or time variable j(trial) for reshaping long to wide Trial values will become the suffix values for our reshape values

. reshape wide bp pl, i(id) j(trial)
(note: j = 1 2 3 4 5)

Data                               long   ->   wide
─────────────────────────────────────────────────────────────────────────────
Number of obs.                       30   ->       6
Number of variables                   5   ->      12
j variable (5 values)             trial   ->   (dropped)
xij variables:
                                     bp   ->   bp1 bp2 ... bp5
                                     pl   ->   pl1 pl2 ... pl5
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:             6                          
 vars:            12                          
 size:           132                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
bp1             int     %3.0f                 1 bp
pl1             int     %3.0f                 1 pl
bp2             int     %3.0f                 2 bp
pl2             int     %3.0f                 2 pl
bp3             int     %3.0f                 3 bp
pl3             int     %3.0f                 3 pl
bp4             int     %3.0f                 4 bp
pl4             int     %3.0f                 4 pl
bp5             int     %3.0f                 5 bp
pl5             int     %3.0f                 5 pl
age             byte    %3.0f                 Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id
     Note: Dataset has changed since last saved.

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   bp1   pl1   bp2   pl2   bp3   pl3   bp4   pl4   bp5   pl5   age │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1   115    54    86    87   129    93   105    81   127    92    40 │
  2. │  2   123    92   136    88   107   125   111    87   120    58    30 │
  3. │  3   124   105   122    97   101   128   109    57   112    68    16 │
  4. │  4   105    52   115    79   121    71   129   106   137    39    23 │
  5. │  5   116    70   128    64   112    52   125    68   111    59    18 │
     ├──────────────────────────────────────────────────────────────────────┤
  6. │  6   108    74   126    78   124    92   131    99   107    80    27 │
     └──────────────────────────────────────────────────────────────────────┘

Notice we excluded age. It is constant within the cross-sectional unit. It does not define the cross-sectional unit or the*time dimension unit. We do not want to separate it into repeated observations either.

Without doing any additional work, we can reshape it back to long with simply reshape long.

. reshape long
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  12   ->       5
j variable (5 values)                     ->   trial
xij variables:
                        bp1 bp2 ... bp5   ->   bp
                        pl1 pl2 ... pl5   ->   pl
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            30                          
 vars:             5                          
 size:           210                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trial           byte    %9.0g                 Trial number
bp              int     %3.0f                 Blood pressure (systolic)
pl              int     %3.0f                 Pulse
age             byte    %3.0f                 Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trial
     Note: Dataset has changed since last saved.

9.4 Reshaping Long to Wide: Problems

Problem 1: a constant value within a cross-sectional unit is not constant

. use cardio_long2, clear

. list in 1/10

     ┌──────────────────────────────┐
     │ id   trial   age    bp    pl │
     ├──────────────────────────────┤
  1. │  1       1    40   115    54 │
  2. │  1       2    40    86    87 │
  3. │  1       3    44   129    93 │
  4. │  1       4    40   105    81 │
  5. │  1       5    40   127    92 │
     ├──────────────────────────────┤
  6. │  2       1    30   123    92 │
  7. │  2       2    30   136    88 │
  8. │  2       3    30   107   125 │
  9. │  2       4    30   111    87 │
 10. │  2       5    30   120    58 │
     └──────────────────────────────┘

We see that cross-sectional unit 1 has a data entry problem with age in the 3rd trial. If we reshape, we will get an error

. capture reshape wide bp pl, i(id) j(trial)

Use the reshape error command to help with the problem

. reshape error
(note: j = 1 2 3 4 5)

i (id) indicates the top-level grouping such as subject id.
j (trial) indicates the subgrouping such as time.
xij variables are bp pl.
Thus, the following variable(s) should be constant within i:
      age

age not constant within i (id) for 1 value of i:

     ┌──────────────────┐
     │ id   trial   age │
     ├──────────────────┤
  1. │  1       1    40 │
  2. │  1       2    40 │
  3. │  1       3    44 │
  4. │  1       4    40 │
  5. │  1       5    40 │
     └──────────────────┘

(data now sorted by id trial)

It tells us that our constant variable is not constant within cross-sectional units

Problem 2: excluding a non-constant that we need to reshape. If we excluded pl from our reshape, we will get a similar error

. capture reshape wide bp, i(id) j(trial)

. reshape error
(note: j = 1 2 3 4 5)

i (id) indicates the top-level grouping such as subject id.
j (trial) indicates the subgrouping such as time.
xij variable is bp.
Thus, the following variable(s) should be constant within i:
      age pl

age not constant within i (id) for 1 value of i:

     ┌──────────────────┐
     │ id   trial   age │
     ├──────────────────┤
  1. │  1       1    40 │
  2. │  1       2    40 │
  3. │  1       3    44 │
  4. │  1       4    40 │
  5. │  1       5    40 │
     └──────────────────┘

pl not constant within i (id) for 6 values of i:

     ┌──────────────────┐
     │ id   trial    pl │
     ├──────────────────┤
  1. │  1       1    54 │
  2. │  1       2    87 │
  3. │  1       3    93 │
  4. │  1       4    81 │
  5. │  1       5    92 │
     ├──────────────────┤
  6. │  2       1    92 │
  7. │  2       2    88 │
  8. │  2       3   125 │
  9. │  2       4    87 │
 10. │  2       5    58 │
     ├──────────────────┤
 11. │  3       1   105 │
 12. │  3       2    97 │
 13. │  3       3   128 │
 14. │  3       4    57 │
 15. │  3       5    68 │
     ├──────────────────┤
 16. │  4       1    52 │
 17. │  4       2    79 │
 18. │  4       3    71 │
 19. │  4       4   106 │
 20. │  4       5    39 │
     ├──────────────────┤
 21. │  5       1    70 │
 22. │  5       2    64 │
 23. │  5       3    52 │
 24. │  5       4    68 │
 25. │  5       5    59 │
     ├──────────────────┤
 26. │  6       1    74 │
 27. │  6       2    78 │
 28. │  6       3    92 │
 29. │  6       4    99 │
 30. │  6       5    80 │
     └──────────────────┘

(data now sorted by id trial)

9.5 Reshaping Wide to Long

In my personal opinion, this reshaping wide to long is a more common occurance. It is important to get data into a long-format so we can conduct panel analysis. I had to do this for a few of the Wooldridge datasets that were panel data, but in wide format.

. use cardio_wide, clear

. describe

Contains data from cardio_wide.dta
  obs:             6                          
 vars:            12                          22 Dec 2009 20:43
 size:           120                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
age             byte    %3.0f                 Age of person
bp1             int     %3.0f                 Blood pressure systolic Trial 1
bp2             int     %3.0f                 Blood pressure systolic Trial 2
bp3             int     %3.0f                 Blood pressure systolic Trial 3
bp4             int     %3.0f                 Blood pressure systolic Trial 4
bp5             int     %3.0f                 Blood pressure systolic Trial 5
pl1             int     %3.0f                 Pulse: Trial 1
pl2             byte    %3.0f                 Pulse: Trial 2
pl3             int     %3.0f                 Pulse: Trial 3
pl4             int     %3.0f                 Pulse: Trial 4
pl5             byte    %3.0f                 Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   age   bp1   bp2   bp3   bp4   bp5   pl1   pl2   pl3   pl4   pl5 │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1    40   115    86   129   105   127    54    87    93    81    92 │
  2. │  2    30   123   136   107   111   120    92    88   125    87    58 │
  3. │  3    16   124   122   101   109   112   105    97   128    57    68 │
  4. │  4    23   105   115   121   129   137    52    79    71   106    39 │
  5. │  5    18   116   128   112   125   111    70    64    52    68    59 │
     ├──────────────────────────────────────────────────────────────────────┤
  6. │  6    27   108   126   124   131   107    74    78    92    99    80 │
     └──────────────────────────────────────────────────────────────────────┘

Let’s reshape wide to long with Our reshape values that are not constant: bp and pl Our cross-sectional unit to reshape upon or our list of variables that uniquely identify a cross-sectional unit. For example, we may need a State FIPS code and a County FIPS code to unique identify a county cross-sectional unit (of course we can merge the 2-digit State FIPS code and the 3-digit County FIPS code to create a 5-digit FIPS code in one variable). Our time unit, where with reshape long, we need to create a new variable that takes on the values of the suffixes of bp and pl (For example trialnum). Our constant values within cross-sectional units do not need to be identified.

. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  12   ->       5
j variable (5 values)                     ->   trialnum
xij variables:
                        bp1 bp2 ... bp5   ->   bp
                        pl1 pl2 ... pl5   ->   pl
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            30                          
 vars:             5                          
 size:           210                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trialnum        byte    %9.0g                 
age             byte    %3.0f                 Age of person
bp              int     %3.0f                 
pl              int     %3.0f                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trialnum
     Note: Dataset has changed since last saved.

. list in 1/10

     ┌─────────────────────────────────┐
     │ id   trialnum   age    bp    pl │
     ├─────────────────────────────────┤
  1. │  1          1    40   115    54 │
  2. │  1          2    40    86    87 │
  3. │  1          3    40   129    93 │
  4. │  1          4    40   105    81 │
  5. │  1          5    40   127    92 │
     ├─────────────────────────────────┤
  6. │  2          1    30   123    92 │
  7. │  2          2    30   136    88 │
  8. │  2          3    30   107   125 │
  9. │  2          4    30   111    87 │
 10. │  2          5    30   120    58 │
     └─────────────────────────────────┘

To reshape it back to wide, we just simply state

. reshape wide
(note: j = 1 2 3 4 5)

Data                               long   ->   wide
─────────────────────────────────────────────────────────────────────────────
Number of obs.                       30   ->       6
Number of variables                   5   ->      12
j variable (5 values)          trialnum   ->   (dropped)
xij variables:
                                     bp   ->   bp1 bp2 ... bp5
                                     pl   ->   pl1 pl2 ... pl5
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:             6                          
 vars:            12                          
 size:           132                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
bp1             int     %3.0f                 1 bp
pl1             int     %3.0f                 1 pl
bp2             int     %3.0f                 2 bp
pl2             int     %3.0f                 2 pl
bp3             int     %3.0f                 3 bp
pl3             int     %3.0f                 3 pl
bp4             int     %3.0f                 4 bp
pl4             int     %3.0f                 4 pl
bp5             int     %3.0f                 5 bp
pl5             int     %3.0f                 5 pl
age             byte    %3.0f                 Age of person
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id
     Note: Dataset has changed since last saved.

. list

     ┌──────────────────────────────────────────────────────────────────────┐
     │ id   bp1   pl1   bp2   pl2   bp3   pl3   bp4   pl4   bp5   pl5   age │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  1   115    54    86    87   129    93   105    81   127    92    40 │
  2. │  2   123    92   136    88   107   125   111    87   120    58    30 │
  3. │  3   124   105   122    97   101   128   109    57   112    68    16 │
  4. │  4   105    52   115    79   121    71   129   106   137    39    23 │
  5. │  5   116    70   128    64   112    52   125    68   111    59    18 │
     ├──────────────────────────────────────────────────────────────────────┤
  6. │  6   108    74   126    78   124    92   131    99   107    80    27 │
     └──────────────────────────────────────────────────────────────────────┘

9.6 Reshaping wide to long problems

Problem 1: Failing to list all of the varying variables to be reshaped.

This is a dangerous kind of problem, since it doesn’t throw an error. This requires the user to inspect the data after reshaping to make sure everything worked properly.

. use cardio_wide, clear

. reshape long bp, i(id) j(trialnum)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  12   ->       9
j variable (5 values)                     ->   trialnum
xij variables:
                        bp1 bp2 ... bp5   ->   bp
─────────────────────────────────────────────────────────────────────────────

The remedy is easy: don’t forget all of your variables to be reshaped

. use cardio_wide, clear

. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  12   ->       5
j variable (5 values)                     ->   trialnum
xij variables:
                        bp1 bp2 ... bp5   ->   bp
                        pl1 pl2 ... pl5   ->   pl
─────────────────────────────────────────────────────────────────────────────

Problem 2: Be careful when the time-varying value is embedded within the variable name and not at the end.

Typically you will find that the varying variable will have var1, var2, …, vark and you simply specify reshape long var, i(id) j(time). But, if the varying value is embedded like t1bp,…,t5bp and t1pl,…,t5pl, then we cannot simply use reshape long tbp tpl, i(id) j(trialnum), since Stata looks for the varying value at the end.

. use cardio_wide2, clear

. describe

Contains data from cardio_wide2.dta
  obs:             6                          
 vars:            12                          31 Dec 2009 15:46
 size:           120                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
age             byte    %3.0f                 Age of person
t1bp            int     %3.0f                 Blood pressure systolic Trial 1
t2bp            int     %3.0f                 Blood pressure systolic Trial 2
t3bp            int     %3.0f                 Blood pressure systolic Trial 3
t4bp            int     %3.0f                 Blood pressure systolic Trial 4
t5bp            int     %3.0f                 Blood pressure systolic Trial 5
t1pl            int     %3.0f                 Pulse: Trial 1
t2pl            byte    %3.0f                 Pulse: Trial 2
t3pl            int     %3.0f                 Pulse: Trial 3
t4pl            int     %3.0f                 Pulse: Trial 4
t5pl            byte    %3.0f                 Pulse: Trial 5
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

. capture reshape long tpl tpb, i(id) j(trialnum)

. reshape error
no xij variables found
    You typed something like reshape wide a b, i(i) j(j).
    reshape looked for existing variables named a# and b# but could not find any.  Remember this picture:

         long                                wide
        ┌───────────────┐                   ┌──────────────────┐
        │ i   j   a   b │                   │ i   a1 a2  b1 b2 │
        │───────────────│ <─── reshape ───> │──────────────────│
        │ 1   1   1   2 │                   │ 1   1   3   2  4 │
        │ 1   2   3   4 │                   │ 2   5   7   6  8 │
        │ 2   1   5   6 │                   └──────────────────┘
        │ 2   2   7   8 │
        └───────────────┘

        long to wide: reshape wide a b, i(i) j(j)    (j existing variable)
        wide to long: reshape long a b, i(i) j(j)    (j    new   variable)

We need the operator @ to indicate to Stata that our varying-values are in that part of the variable names.

. reshape long t@pl t@bp, i(id) j(trialnum)
(note: j = 1 2 3 4 5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  12   ->       5
j variable (5 values)                     ->   trialnum
xij variables:
                     t1pl t2pl ... t5pl   ->   tpl
                     t1bp t2bp ... t5bp   ->   tbp
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            30                          
 vars:             5                          
 size:           210                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trialnum        byte    %9.0g                 
age             byte    %3.0f                 Age of person
tbp             int     %3.0f                 
tpl             int     %3.0f                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trialnum
     Note: Dataset has changed since last saved.

. list in 1/10

     ┌─────────────────────────────────┐
     │ id   trialnum   age   tbp   tpl │
     ├─────────────────────────────────┤
  1. │  1          1    40   115    54 │
  2. │  1          2    40    86    87 │
  3. │  1          3    40   129    93 │
  4. │  1          4    40   105    81 │
  5. │  1          5    40   127    92 │
     ├─────────────────────────────────┤
  6. │  2          1    30   123    92 │
  7. │  2          2    30   136    88 │
  8. │  2          3    30   107   125 │
  9. │  2          4    30   111    87 │
 10. │  2          5    30   120    58 │
     └─────────────────────────────────┘

Problem 3: Be specific with which variables to reshape and don’t mistakenly constant variables that have the same prefix as the varying variables.

. use cardio_wide3, clear

. clist bp* pl*, noobs

bp1  bp2  bp3  bp4  bp5     bp2005  pl1  pl2  pl3  pl4  pl5     pl2005
115   86  129  105  127        112   54   87   93   81   92         81
123  136  107  111  120        119   92   88  125   87   58         90
124  122  101  109  112        113  105   97  128   57   68         91
105  115  121  129  137        121   52   79   71  106   39         69
116  128  112  125  111        118   70   64   52   68   59         62
108  126  124  131  107        119   74   78   92   99   80         84

Our bp2005 and pl2005 are constant variables within our cross-section, but have the same prefix as our value-varying variables

. reshape long bp pl, i(id) j(trialnum)
(note: j = 1 2 3 4 5 2005)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      36
Number of variables                  14   ->       5
j variable (6 values)                     ->   trialnum
xij variables:
                     bp1 bp2 ... bp2005   ->   bp
                     pl1 pl2 ... pl2005   ->   pl
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            36                          
 vars:             5                          
 size:           432                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trialnum        int     %9.0g                 
age             byte    %3.0f                 Age of person
bp              float   %9.0g                 
pl              float   %9.0g                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trialnum
     Note: Dataset has changed since last saved.

. list in 1/12, sepby(id)

     ┌─────────────────────────────────┐
     │ id   trialnum   age    bp    pl │
     ├─────────────────────────────────┤
  1. │  1          1    40   115    54 │
  2. │  1          2    40    86    87 │
  3. │  1          3    40   129    93 │
  4. │  1          4    40   105    81 │
  5. │  1          5    40   127    92 │
  6. │  1       2005    40   112    81 │
     ├─────────────────────────────────┤
  7. │  2          1    30   123    92 │
  8. │  2          2    30   136    88 │
  9. │  2          3    30   107   125 │
 10. │  2          4    30   111    87 │
 11. │  2          5    30   120    58 │
 12. │  2       2005    30   119    90 │
     └─────────────────────────────────┘

We did not intend to reshape bp2005 and pl2005.

To rememdy, we need to specify the values within j() with j(trialnum 1-5) to indicate that we only want to reshape suffixes with 1-5

. use cardio_wide3, clear

. reshape long bp pl, i(id) j(trialnum 1-5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  14   ->       7
j variable (5 values)                     ->   trialnum
xij variables:
                        bp1 bp2 ... bp5   ->   bp
                        pl1 pl2 ... pl5   ->   pl
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            30                          
 vars:             7                          
 size:           450                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
trialnum        byte    %9.0g                 
age             byte    %3.0f                 Age of person
bp              int     %3.0f                 
bp2005          float   %9.0g                 
pl              int     %3.0f                 
pl2005          float   %9.0g                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  trialnum
     Note: Dataset has changed since last saved.

. list in 1/10, sepby(id)

     ┌───────────────────────────────────────────────────┐
     │ id   trialnum   age    bp   bp2005    pl   pl2005 │
     ├───────────────────────────────────────────────────┤
  1. │  1          1    40   115      112    54       81 │
  2. │  1          2    40    86      112    87       81 │
  3. │  1          3    40   129      112    93       81 │
  4. │  1          4    40   105      112    81       81 │
  5. │  1          5    40   127      112    92       81 │
     ├───────────────────────────────────────────────────┤
  6. │  2          1    30   123      119    92       90 │
  7. │  2          2    30   136      119    88       90 │
  8. │  2          3    30   107      119   125       90 │
  9. │  2          4    30   111      119    87       90 │
 10. │  2          5    30   120      119    58       90 │
     └───────────────────────────────────────────────────┘

Problem 4: No id varible to identify the cross-sectional unit If each wide-format observation is a unique cross-sectional unit, then we can use _n to generate a unique id variable

. use cardio_wide3, clear

. describe

Contains data from cardio_wide3.dta
  obs:             6                          
 vars:            14                          31 Dec 2009 15:46
 size:           168                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
age             byte    %3.0f                 Age of person
bp1             int     %3.0f                 Blood pressure systolic Trial 1
bp2             int     %3.0f                 Blood pressure systolic Trial 2
bp3             int     %3.0f                 Blood pressure systolic Trial 3
bp4             int     %3.0f                 Blood pressure systolic Trial 4
bp5             int     %3.0f                 Blood pressure systolic Trial 5
bp2005          float   %9.0g                 
pl1             int     %3.0f                 Pulse: Trial 1
pl2             byte    %3.0f                 Pulse: Trial 2
pl3             int     %3.0f                 Pulse: Trial 3
pl4             int     %3.0f                 Pulse: Trial 4
pl5             byte    %3.0f                 Pulse: Trial 5
pl2005          float   %9.0g                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 

. gen idcode = _n

. describe

Contains data from cardio_wide3.dta
  obs:             6                          
 vars:            15                          31 Dec 2009 15:46
 size:           216                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
age             byte    %3.0f                 Age of person
bp1             int     %3.0f                 Blood pressure systolic Trial 1
bp2             int     %3.0f                 Blood pressure systolic Trial 2
bp3             int     %3.0f                 Blood pressure systolic Trial 3
bp4             int     %3.0f                 Blood pressure systolic Trial 4
bp5             int     %3.0f                 Blood pressure systolic Trial 5
bp2005          float   %9.0g                 
pl1             int     %3.0f                 Pulse: Trial 1
pl2             byte    %3.0f                 Pulse: Trial 2
pl3             int     %3.0f                 Pulse: Trial 3
pl4             int     %3.0f                 Pulse: Trial 4
pl5             byte    %3.0f                 Pulse: Trial 5
pl2005          float   %9.0g                 
idcode          double  %10.0g                
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

. reshape long bp pl, i(id idcode) j(trialnum 1-5)

Data                               wide   ->   long
─────────────────────────────────────────────────────────────────────────────
Number of obs.                        6   ->      30
Number of variables                  15   ->       8
j variable (5 values)                     ->   trialnum
xij variables:
                        bp1 bp2 ... bp5   ->   bp
                        pl1 pl2 ... pl5   ->   pl
─────────────────────────────────────────────────────────────────────────────

. describe

Contains data
  obs:            30                          
 vars:             8                          
 size:           690                          
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
id              byte    %3.0f                 ID of person
idcode          double  %10.0g                
trialnum        byte    %9.0g                 
age             byte    %3.0f                 Age of person
bp              int     %3.0f                 
bp2005          float   %9.0g                 
pl              int     %3.0f                 
pl2005          float   %9.0g                 
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: id  idcode  trialnum
     Note: Dataset has changed since last saved.

. list in 1/10, sepby(id)

     ┌────────────────────────────────────────────────────────────┐
     │ id   idcode   trialnum   age    bp   bp2005    pl   pl2005 │
     ├────────────────────────────────────────────────────────────┤
  1. │  1        1          1    40   115      112    54       81 │
  2. │  1        1          2    40    86      112    87       81 │
  3. │  1        1          3    40   129      112    93       81 │
  4. │  1        1          4    40   105      112    81       81 │
  5. │  1        1          5    40   127      112    92       81 │
     ├────────────────────────────────────────────────────────────┤
  6. │  2        2          1    30   123      119    92       90 │
  7. │  2        2          2    30   136      119    88       90 │
  8. │  2        2          3    30   107      119   125       90 │
  9. │  2        2          4    30   111      119    87       90 │
 10. │  2        2          5    30   120      119    58       90 │
     └────────────────────────────────────────────────────────────┘

9.7 Multilevel datasets

Mitchell discusses when we have multiple levels of data, such as cross-sectional unit i and time dimension t, where data is level 1 data are constant across time within unit, but vary across cross-sectional units and level 2 data are time-varying data that vary within a cross-sectional units

Sometimes we have multiple levels, such as school district, school, classroom, and student.

If you are interested, please review pages 311-314

9.8 Collapsing datasets

The collapse command can be useful finding statistics at aggregate levels for different groups. You don’t have to use egen but you can if you would like. After using the egen command, such as egen sumvar = sum(var), or egen avgvar = mean(var) the collapse command can be useful. Let’s say we want to aggregate mean wages by state in the CPS, we can use the egen mean command to find average wages in the state for month m in year y. We can then keep state month year and avgvar for all 50 states for each month and year of interest.

We need to be careful with the collapse command, since your original data are lost from memory unless you use a tempfile or a data frame.

. use cardio_long, clear

. list, sepby(id)

     ┌──────────────────────────────┐
     │ id   trial   age    bp    pl │
     ├──────────────────────────────┤
  1. │  1       1    40   115    54 │
  2. │  1       2    40    86    87 │
  3. │  1       3    40   129    93 │
  4. │  1       4    40   105    81 │
  5. │  1       5    40   127    92 │
     ├──────────────────────────────┤
  6. │  2       1    30   123    92 │
  7. │  2       2    30   136    88 │
  8. │  2       3    30   107   125 │
  9. │  2       4    30   111    87 │
 10. │  2       5    30   120    58 │
     ├──────────────────────────────┤
 11. │  3       1    16   124   105 │
 12. │  3       2    16   122    97 │
 13. │  3       3    16   101   128 │
 14. │  3       4    16   109    57 │
 15. │  3       5    16   112    68 │
     ├──────────────────────────────┤
 16. │  4       1    23   105    52 │
 17. │  4       2    23   115    79 │
 18. │  4       3    23   121    71 │
 19. │  4       4    23   129   106 │
 20. │  4       5    23   137    39 │
     ├──────────────────────────────┤
 21. │  5       1    18   116    70 │
 22. │  5       2    18   128    64 │
 23. │  5       3    18   112    52 │
 24. │  5       4    18   125    68 │
 25. │  5       5    18   111    59 │
     ├──────────────────────────────┤
 26. │  6       1    27   108    74 │
 27. │  6       2    27   126    78 │
 28. │  6       3    27   124    92 │
 29. │  6       4    27   131    99 │
 30. │  6       5    27   107    80 │
     └──────────────────────────────┘

Let’s say we want to find mean blood pressure and pluse for each cross-sectional unit, then we can use the egen command or the collapse commands

. sort id

. by id: egen meanbp = mean(bp)

. by id: egen meanpl = mean(pl)

. by id: egen maxbp = max(bp)

. by id: egen maxpl = max(bp)

. by id: egen minbp = min(bp)

. by id: egen minpl = min(bp)

. by id: egen medbp = pctile(bp), p(50)

. by id: egen medpl = pctile(pl), p(50)
. collapse meanbp meanpl maxbp maxpl minbp minpl medbp medpl, by(id)

Or you can just use the collapse command

. use cardio_long, clear
. collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl ///
>     (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)

Notice that our data are rounded to the nearest 1, unlike our egen commands We can format our data to reflect that decimal point to two places instead of rounding up by default

. use cardio_long, clear

. format bp* pl* %5.2f

. collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl ///
>      (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)

Exercises Pull BLS monthly inflation data, reshape it into long, and merge with CPS