Chapter 4 Reshape
From my experience, I feel that understanding and mastering the reshape command is invaluable. This is especially true when working with panel data. Panel data should not be used in a wide format. Panel data should be in a long format where you can use the xtset unit time command. After the xtset is established you can easily add lags or differences with the l.var or d.var
It is also helpful in terms of merging datasets. A lot of time if I’m merging BLS data, such as state unemployment, or inflation rates, I need to reshape the BLS data from wide to long to I can merge m:1
I typically work in long formats, but reshaping to wide formats does have its benefits. I find wide formats are helpful for people who work in excel, especially when years are across columns.
4.1 Wide and Long Datasets
We have two types of datasets: wide and long
Wide Format
Our wide datasets typically include years or variables types spread out across the columns.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_wide.dta
obs: 6
vars: 12 22 Dec 2009 20:43
size: 120
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
----------------------------------------------------------------------------------------------------------------------
Sorted by:
We have 12 variables idcode, age, and our 5 different blood pressure and pulse trials. We see that are 5 different incidents/trials of blood pressure and pulse are spread out across 5 different variables \(bp1-bp5\) and \(pl1-pl5\)*
| id age bp1 bp2 bp3 bp4 bp5 pl1 pl2 pl3 pl4 pl5 |
|----------------------------------------------------------------------|
1. | 1 40 115 86 129 105 127 54 87 93 81 92 |
2. | 2 30 123 136 107 111 120 92 88 125 87 58 |
3. | 3 16 124 122 101 109 112 105 97 128 57 68 |
4. | 4 23 105 115 121 129 137 52 79 71 106 39 |
5. | 5 18 116 128 112 125 111 70 64 52 68 59 |
|----------------------------------------------------------------------|
6. | 6 27 108 126 124 131 107 74 78 92 99 80 |
+----------------------------------------------------------------------+
With a wide format each row is an observation and all of our different trials are located on a single row
Long Format
We’ll look at an example of the same dataset but in long format
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_long.dta
obs: 30
vars: 5 10 Feb 2020 23:02
size: 210
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trial byte %9.0g Trial number
age byte %3.0f Age of person
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trial
We only have 5 variables idcode, age, trial, blood pressure, and pulse.
| id trial age bp pl |
|------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
|------------------------------|
11. | 3 1 16 124 105 |
12. | 3 2 16 122 97 |
13. | 3 3 16 101 128 |
14. | 3 4 16 109 57 |
15. | 3 5 16 112 68 |
|------------------------------|
16. | 4 1 23 105 52 |
17. | 4 2 23 115 79 |
18. | 4 3 23 121 71 |
19. | 4 4 23 129 106 |
20. | 4 5 23 137 39 |
|------------------------------|
21. | 5 1 18 116 70 |
22. | 5 2 18 128 64 |
23. | 5 3 18 112 52 |
24. | 5 4 18 125 68 |
25. | 5 5 18 111 59 |
|------------------------------|
26. | 6 1 27 108 74 |
27. | 6 2 27 126 78 |
28. | 6 3 27 124 92 |
29. | 6 4 27 131 99 |
30. | 6 5 27 107 80 |
+------------------------------+
We also see that each individual has 5 observations. One for each trial that is sorted from 1 to 5.
Notice: We have a cross-sectional unit variable with \(idcode\) and we also have a time component with \(trial\). Our dataset is basically a panel data set. We observe the same person over 5 trials.
What kind of format should we use?
It depends.
Wide Format
Mitchell makes a good point of looking at correlations among the blood pressure trials.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
(obs=6)
| bp1 bp2 bp3 bp4 bp5
-------------+---------------------------------------------
bp1 | 1.0000
bp2 | 0.2427 1.0000
bp3 | -0.7662 -0.6657 1.0000
bp4 | -0.7644 0.3980 0.2644 1.0000
bp5 | -0.3643 -0.4984 0.3694 -0.0966 1.0000
Mitchell also mentions that multivariate analysis with multiple trials is a bit easier with factor variables. If we wanted to see all of the different regressions for each trial, then
Equation Obs Parms RMSE "R-sq" F P
--------------------------------------------------------------------------
bp1 6 2 8.516293 0.0160 .0651388 0.8111
bp2 6 2 15.34222 0.3858 2.512064 0.1882
bp3 6 2 8.846404 0.4597 3.402782 0.1388
bp4 6 2 11.60081 0.1554 .7357661 0.4394
bp5 6 2 11.81248 0.1360 .629676 0.4719
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
bp1 |
age | -.1107266 .4338427 -0.26 0.811 -1.315267 1.093814
_cons | 118.0087 11.66545 10.12 0.001 85.62017 150.3971
-------------+----------------------------------------------------------------
bp2 |
age | -1.238754 .7815736 -1.58 0.188 -3.40875 .9312418
_cons | 150.628 21.01547 7.17 0.002 92.27974 208.9763
-------------+----------------------------------------------------------------
bp3 |
age | .8313149 .4506595 1.84 0.139 -.4199164 2.082546
_cons | 94.32958 12.11763 7.78 0.001 60.68565 127.9735
-------------+----------------------------------------------------------------
bp4 |
age | -.5069204 .5909761 -0.86 0.439 -2.147733 1.133892
_cons | 131.3443 15.89055 8.27 0.001 87.22504 175.4635
-------------+----------------------------------------------------------------
bp5 |
age | .4775087 .6017591 0.79 0.472 -1.193242 2.14826
_cons | 106.7439 16.1805 6.60 0.003 61.81969 151.6682
------------------------------------------------------------------------------
Long Format
There are several reasons we would want to use long format over wide format
Correlations
If we want correlations between blood pressure and pulse, a long format is preferable.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
(obs=30)
| bp pl
-------------+------------------
bp | 1.0000
pl | -0.0444 1.0000
Or
-> trial = 1
(obs=6)
| bp pl
-------------+------------------
bp | 1.0000
pl | 0.7958 1.0000
----------------------------------------------------------------------------------------------------------------------
-> trial = 2
(obs=6)
| bp pl
-------------+------------------
bp | 1.0000
pl | -0.1985 1.0000
----------------------------------------------------------------------------------------------------------------------
-> trial = 3
(obs=6)
| bp pl
-------------+------------------
bp | 1.0000
pl | -0.4790 1.0000
----------------------------------------------------------------------------------------------------------------------
-> trial = 4
(obs=6)
| bp pl
-------------+------------------
bp | 1.0000
pl | 0.5639 1.0000
----------------------------------------------------------------------------------------------------------------------
-> trial = 5
(obs=6)
| bp pl
-------------+------------------
bp | 1.0000
pl | -0.3911 1.0000
Panel data
It is also more appropriate for panel data and the xt commands, such as xtset, xtreg, etc. I also prefer this for regular regressions where we can use factor variables \(i.var\) for different time periods.
Recoding
It is also easier to recode in long format.
Recoding among many variables in wide format is more of a pain You can see this on page 293 in Mitchell.
Egen is easier
If you wanted to mean of each 5 trials, it just as easy with egen in long format.
For wide-format you can use rowmean with the wildcard
We don’t need the bysort group, but we need to be careful which columns to include
It easier to explicitly difference trails with wide-format, but requires more typing.
4.1.0.1 Using subscripting/indexing, differences, lags with ease
I prefer using subscripting/indexing within groups or using xtset with panel data since we have a lot of flexibilities.
We can use indexing
We can difference from the first trial
We can use xtset as well with l.var
panel variable: id (strongly balanced)
time variable: trial, 1 to 5
delta: 1 unit
One lag
We can use two lags
We can use three lags
(18 missing values generated)
+-----------------------------------------------------------------------------------+
| id trial age bp pl pldiff pldiff~l ~f_other p~2_ot~r p~3_ot~r |
|-----------------------------------------------------------------------------------|
1. | 1 1 40 115 54 . 0 . . . |
2. | 1 2 40 86 87 33 33 33 . . |
3. | 1 3 40 129 93 6 39 6 -33 . |
4. | 1 4 40 105 81 -12 27 -12 -6 -33 |
5. | 1 5 40 127 92 11 38 11 12 -6 |
|-----------------------------------------------------------------------------------|
6. | 2 1 30 123 92 . 0 . . . |
7. | 2 2 30 136 88 -4 -4 -4 . . |
8. | 2 3 30 107 125 37 33 37 4 . |
9. | 2 4 30 111 87 -38 -5 -38 -37 4 |
10. | 2 5 30 120 58 -29 -34 -29 38 -37 |
|-----------------------------------------------------------------------------------|
11. | 3 1 16 124 105 . 0 . . . |
12. | 3 2 16 122 97 -8 -8 -8 . . |
13. | 3 3 16 101 128 31 23 31 8 . |
14. | 3 4 16 109 57 -71 -48 -71 -31 8 |
15. | 3 5 16 112 68 11 -37 11 71 -31 |
|-----------------------------------------------------------------------------------|
16. | 4 1 23 105 52 . 0 . . . |
17. | 4 2 23 115 79 27 27 27 . . |
18. | 4 3 23 121 71 -8 19 -8 -27 . |
19. | 4 4 23 129 106 35 54 35 8 -27 |
20. | 4 5 23 137 39 -67 -13 -67 -35 8 |
|-----------------------------------------------------------------------------------|
21. | 5 1 18 116 70 . 0 . . . |
22. | 5 2 18 128 64 -6 -6 -6 . . |
23. | 5 3 18 112 52 -12 -18 -12 6 . |
24. | 5 4 18 125 68 16 -2 16 12 6 |
25. | 5 5 18 111 59 -9 -11 -9 -16 12 |
|-----------------------------------------------------------------------------------|
26. | 6 1 27 108 74 . 0 . . . |
27. | 6 2 27 126 78 4 4 4 . . |
28. | 6 3 27 124 92 14 18 14 -4 . |
29. | 6 4 27 131 99 7 25 7 -14 -4 |
30. | 6 5 27 107 80 -19 6 -19 -7 -14 |
+-----------------------------------------------------------------------------------+
Wide format can become unwieldy quickly and Stata does have a variable limit size, while your observation limit is typically your memory.In short, it is better in the long-run to learn to work with Stata in long-format.
However, Mitchell does bring up a good point that in our panel data, we mayneed multiple weights wt1-wtk. You can see this in the ACS PUMS with 250 or soreplicate weights. We do not want to reshape our data by weights, we need to keep the data in a cross-sectional unit i and time period t format.
In short, I agree with Mitchell that you should learn to work with your data in a long-format data structure.
4.2 Reshaping Long to Wide
There may be scenarios where we need to reshape our long-format data into a wide format. For example, it might be easier to match and merge our data If our using data set is in wide format and our master data set is in wide format
Let’s look at an example
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_long.dta
obs: 30
vars: 5 10 Feb 2020 23:02
size: 210
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trial byte %9.0g Trial number
age byte %3.0f Age of person
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trial
+------------------------------+
| id trial age bp pl |
|------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
|------------------------------|
11. | 3 1 16 124 105 |
12. | 3 2 16 122 97 |
13. | 3 3 16 101 128 |
14. | 3 4 16 109 57 |
15. | 3 5 16 112 68 |
|------------------------------|
16. | 4 1 23 105 52 |
17. | 4 2 23 115 79 |
18. | 4 3 23 121 71 |
19. | 4 4 23 129 106 |
20. | 4 5 23 137 39 |
|------------------------------|
21. | 5 1 18 116 70 |
22. | 5 2 18 128 64 |
23. | 5 3 18 112 52 |
24. | 5 4 18 125 68 |
25. | 5 5 18 111 59 |
|------------------------------|
26. | 6 1 27 108 74 |
27. | 6 2 27 126 78 |
28. | 6 3 27 124 92 |
29. | 6 4 27 131 99 |
30. | 6 5 27 107 80 |
+------------------------------+
Our reshape command requires reshape 1. What direction: wide or long 2. Variables to reshape (bp and pl) 3. Variables define an observation i(id) 4. Variables that defines the repeated observations for each person orvtime variable j(trial) for reshaping long to wide
Trial values will become the suffix values for our reshape values
(note: j = 1 2 3 4 5)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 30 -> 6
Number of variables 5 -> 12
j variable (5 values) trial -> (dropped)
xij variables:
bp -> bp1 bp2 ... bp5
pl -> pl1 pl2 ... pl5
-----------------------------------------------------------------------------
Contains data
obs: 6
vars: 12
size: 132
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
bp1 int %3.0f 1 bp
pl1 int %3.0f 1 pl
bp2 int %3.0f 2 bp
pl2 int %3.0f 2 pl
bp3 int %3.0f 3 bp
pl3 int %3.0f 3 pl
bp4 int %3.0f 4 bp
pl4 int %3.0f 4 pl
bp5 int %3.0f 5 bp
pl5 int %3.0f 5 pl
age byte %3.0f Age of person
----------------------------------------------------------------------------------------------------------------------
Sorted by: id
Note: Dataset has changed since last saved.
+----------------------------------------------------------------------+
| id bp1 pl1 bp2 pl2 bp3 pl3 bp4 pl4 bp5 pl5 age |
|----------------------------------------------------------------------|
1. | 1 115 54 86 87 129 93 105 81 127 92 40 |
2. | 2 123 92 136 88 107 125 111 87 120 58 30 |
3. | 3 124 105 122 97 101 128 109 57 112 68 16 |
4. | 4 105 52 115 79 121 71 129 106 137 39 23 |
5. | 5 116 70 128 64 112 52 125 68 111 59 18 |
|----------------------------------------------------------------------|
6. | 6 108 74 126 78 124 92 131 99 107 80 27 |
+----------------------------------------------------------------------+
Notice we excluded age. It is constant within the cross-sectional unit. It does not define the cross-sectional unit or the*time dimension unit. We do not want to separate it into repeated observations either.
Without doing any additional work, we can reshape it back to long with simply reshape long.
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trial
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
-----------------------------------------------------------------------------
Contains data
obs: 30
vars: 5
size: 210
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trial byte %9.0g Trial number
bp int %3.0f Blood pressure (systolic)
pl int %3.0f Pulse
age byte %3.0f Age of person
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trial
Note: Dataset has changed since last saved.
4.3 Reshaping Long to Wide: Problems
Problem 1: A constant value within a cross-sectional unit is not constant
/Users/Sam/Desktop/Econ 645/Data/Mitchell
+------------------------------+
| id trial age bp pl |
|------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 44 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
+------------------------------+
We see that cross-sectional unit 1 has a data entry problem with age in the 3rd trial. If we reshape, we will get an error
(note: j = 1 2 3 4 5)
variable age not constant within id
Your data are currently long. You are performing a reshape wide. You typed something like
. reshape wide a b, i(id) j(trial)
There are variables other than a, b, id, trial in your data. They must be constant within id because that is
the only way they can fit into wide data without loss of information.
The variable or variables listed above are not constant within id. Perhaps the values are in error. Type
reshape error for a list of the problem observations.
Either that, or the values vary because they should vary, in which case you must either add the variables to the
list of xij variables to be reshaped, or drop them.
r(9);
r(9);
Use the reshape error command to help with the problem
It tells us that our constant variable is not constant within cross-sectional units.
Problem 2: Excluding a non-constant that we need to reshape
If we excluded pl from our reshape, we will get a similar error.
(note: j = 1 2 3 4 5)
i (id) indicates the top-level grouping such as subject id.
j (trial) indicates the subgrouping such as time.
xij variable is bp.
Thus, the following variable(s) should be constant within i:
age pl
age not constant within i (id) for 1 value of i:
+------------------+
| id trial age |
|------------------|
1. | 1 1 40 |
2. | 1 2 40 |
3. | 1 3 44 |
4. | 1 4 40 |
5. | 1 5 40 |
+------------------+
pl not constant within i (id) for 6 values of i:
+------------------+
| id trial pl |
|------------------|
1. | 1 1 54 |
2. | 1 2 87 |
3. | 1 3 93 |
4. | 1 4 81 |
5. | 1 5 92 |
|------------------|
6. | 2 1 92 |
7. | 2 2 88 |
8. | 2 3 125 |
9. | 2 4 87 |
10. | 2 5 58 |
|------------------|
11. | 3 1 105 |
12. | 3 2 97 |
13. | 3 3 128 |
14. | 3 4 57 |
15. | 3 5 68 |
|------------------|
16. | 4 1 52 |
17. | 4 2 79 |
18. | 4 3 71 |
19. | 4 4 106 |
20. | 4 5 39 |
|------------------|
21. | 5 1 70 |
22. | 5 2 64 |
23. | 5 3 52 |
24. | 5 4 68 |
25. | 5 5 59 |
|------------------|
26. | 6 1 74 |
27. | 6 2 78 |
28. | 6 3 92 |
29. | 6 4 99 |
30. | 6 5 80 |
+------------------+
(data now sorted by id trial)
4.4 Reshaping Wide to Long
In my personal opinion, this reshaping wide to long is a more common occurance. It is important to get data into a long-format so we can conduct panel analysis. I had to do this for a few of the Wooldridge datasets that were panel data, but in wide format.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_wide.dta
obs: 6
vars: 12 22 Dec 2009 20:43
size: 120
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
----------------------------------------------------------------------------------------------------------------------
Sorted by:
+----------------------------------------------------------------------+
| id age bp1 bp2 bp3 bp4 bp5 pl1 pl2 pl3 pl4 pl5 |
|----------------------------------------------------------------------|
1. | 1 40 115 86 129 105 127 54 87 93 81 92 |
2. | 2 30 123 136 107 111 120 92 88 125 87 58 |
3. | 3 16 124 122 101 109 112 105 97 128 57 68 |
4. | 4 23 105 115 121 129 137 52 79 71 106 39 |
5. | 5 18 116 128 112 125 111 70 64 52 68 59 |
|----------------------------------------------------------------------|
6. | 6 27 108 126 124 131 107 74 78 92 99 80 |
+----------------------------------------------------------------------+
Let’s reshape wide to long.
Our reshape values that are not constant: bp and pl.
Our cross-sectional unit to reshape upon or our list of variables that uniquely identify a cross-sectional unit. For example, we may need a State FIPS code and a County FIPS code to unique identify a county cross-sectional unit (of course we can merge the 2-digit State FIPS code and the 3-digit County FIPS code to create a 5-digit FIPS code in one variable).
Our time unit, where with reshape long, we need to create a new variable that takes on the values of the suffixes of bp and pl (For example trialnum). Our constant values within cross-sectional units do not need to be identified.
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
-----------------------------------------------------------------------------
Contains data
obs: 30
vars: 5
size: 210
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
pl int %3.0f
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trialnum
Note: Dataset has changed since last saved.
+---------------------------------+
| id trialnum age bp pl |
|---------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|---------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
+---------------------------------+
To reshape it back to wide, we just simply state
(note: j = 1 2 3 4 5)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 30 -> 6
Number of variables 5 -> 12
j variable (5 values) trialnum -> (dropped)
xij variables:
bp -> bp1 bp2 ... bp5
pl -> pl1 pl2 ... pl5
-----------------------------------------------------------------------------
Contains data
obs: 6
vars: 12
size: 132
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
bp1 int %3.0f 1 bp
pl1 int %3.0f 1 pl
bp2 int %3.0f 2 bp
pl2 int %3.0f 2 pl
bp3 int %3.0f 3 bp
pl3 int %3.0f 3 pl
bp4 int %3.0f 4 bp
pl4 int %3.0f 4 pl
bp5 int %3.0f 5 bp
pl5 int %3.0f 5 pl
age byte %3.0f Age of person
----------------------------------------------------------------------------------------------------------------------
Sorted by: id
Note: Dataset has changed since last saved.
+----------------------------------------------------------------------+
| id bp1 pl1 bp2 pl2 bp3 pl3 bp4 pl4 bp5 pl5 age |
|----------------------------------------------------------------------|
1. | 1 115 54 86 87 129 93 105 81 127 92 40 |
2. | 2 123 92 136 88 107 125 111 87 120 58 30 |
3. | 3 124 105 122 97 101 128 109 57 112 68 16 |
4. | 4 105 52 115 79 121 71 129 106 137 39 23 |
5. | 5 116 70 128 64 112 52 125 68 111 59 18 |
|----------------------------------------------------------------------|
6. | 6 108 74 126 78 124 92 131 99 107 80 27 |
+----------------------------------------------------------------------+
4.5 Reshaping wide to long problems
Problem 1: Failing to list all of the varying variables to be reshaped.
This is a dangerous kind of problem, since it doesn’t throw an error. This requires the user to inspect the data after reshaping to make sure everything worked properly.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_wide, clear
reshape long bp, i(id) j(trialnum)
list/Users/Sam/Desktop/Econ 645/Data/Mitchell
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 12 -> 9
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
-----------------------------------------------------------------------------
+---------------------------------------------------------+
| id trialnum age bp pl1 pl2 pl3 pl4 pl5 |
|---------------------------------------------------------|
1. | 1 1 40 115 54 87 93 81 92 |
2. | 1 2 40 86 54 87 93 81 92 |
3. | 1 3 40 129 54 87 93 81 92 |
4. | 1 4 40 105 54 87 93 81 92 |
5. | 1 5 40 127 54 87 93 81 92 |
|---------------------------------------------------------|
6. | 2 1 30 123 92 88 125 87 58 |
7. | 2 2 30 136 92 88 125 87 58 |
8. | 2 3 30 107 92 88 125 87 58 |
9. | 2 4 30 111 92 88 125 87 58 |
10. | 2 5 30 120 92 88 125 87 58 |
|---------------------------------------------------------|
11. | 3 1 16 124 105 97 128 57 68 |
12. | 3 2 16 122 105 97 128 57 68 |
13. | 3 3 16 101 105 97 128 57 68 |
14. | 3 4 16 109 105 97 128 57 68 |
15. | 3 5 16 112 105 97 128 57 68 |
|---------------------------------------------------------|
16. | 4 1 23 105 52 79 71 106 39 |
17. | 4 2 23 115 52 79 71 106 39 |
18. | 4 3 23 121 52 79 71 106 39 |
19. | 4 4 23 129 52 79 71 106 39 |
20. | 4 5 23 137 52 79 71 106 39 |
|---------------------------------------------------------|
21. | 5 1 18 116 70 64 52 68 59 |
22. | 5 2 18 128 70 64 52 68 59 |
23. | 5 3 18 112 70 64 52 68 59 |
24. | 5 4 18 125 70 64 52 68 59 |
25. | 5 5 18 111 70 64 52 68 59 |
|---------------------------------------------------------|
26. | 6 1 27 108 74 78 92 99 80 |
27. | 6 2 27 126 74 78 92 99 80 |
28. | 6 3 27 124 74 78 92 99 80 |
29. | 6 4 27 131 74 78 92 99 80 |
30. | 6 5 27 107 74 78 92 99 80 |
+---------------------------------------------------------+
The remedy is easy: don’t forget all of your variables to be reshaped
Problem 2: Be careful when the time-varying value is embedded within the variable name and not at the end.
Typically you will find that the varying variable will have \(var_1, var_2, ..., var_k\) and you simply specify reshape long var, i(id) j(time). But, if the varying value is embedded like t1bp,…,t5bp and t1pl,…,t5pl, then we cannot simply use reshape long tbp tpl, i(id) j(trialnum), since Stata looks for the varying value at the end.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_wide2, clear
describe
capture reshape long tpl tpb, i(id) j(trialnum)
reshape error/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_wide2.dta
obs: 6
vars: 12 31 Dec 2009 15:46
size: 120
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
age byte %3.0f Age of person
t1bp int %3.0f Blood pressure systolic Trial 1
t2bp int %3.0f Blood pressure systolic Trial 2
t3bp int %3.0f Blood pressure systolic Trial 3
t4bp int %3.0f Blood pressure systolic Trial 4
t5bp int %3.0f Blood pressure systolic Trial 5
t1pl int %3.0f Pulse: Trial 1
t2pl byte %3.0f Pulse: Trial 2
t3pl int %3.0f Pulse: Trial 3
t4pl int %3.0f Pulse: Trial 4
t5pl byte %3.0f Pulse: Trial 5
----------------------------------------------------------------------------------------------------------------------
Sorted by:
no xij variables found
You typed something like reshape wide a b, i(i) j(j).
reshape looked for existing variables named a# and b# but could not find any. Remember this picture:
long wide
+---------------+ +------------------+
| i j a b | | i a1 a2 b1 b2 |
|---------------| <--- reshape ---> |------------------|
| 1 1 1 2 | | 1 1 3 2 4 |
| 1 2 3 4 | | 2 5 7 6 8 |
| 2 1 5 6 | +------------------+
| 2 2 7 8 |
+---------------+
long to wide: reshape wide a b, i(i) j(j) (j existing variable)
wide to long: reshape long a b, i(i) j(j) (j new variable)
We need the operator @ to indicate to Stata that our varying-values are in that part of the variable names.
(note: j = 1 2 3 4 5)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 12 -> 5
j variable (5 values) -> trialnum
xij variables:
t1pl t2pl ... t5pl -> tpl
t1bp t2bp ... t5bp -> tbp
-----------------------------------------------------------------------------
Contains data
obs: 30
vars: 5
size: 210
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
tbp int %3.0f
tpl int %3.0f
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trialnum
Note: Dataset has changed since last saved.
+---------------------------------+
| id trialnum age tbp tpl |
|---------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|---------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
+---------------------------------+
Problem 3: Be specific with which variables to reshape and don’t mistakenly constant variables that have the same prefix as the varying variables
/Users/Sam/Desktop/Econ 645/Data/Mitchell
bp1 bp2 bp3 bp4 bp5 bp2005 pl1 pl2 pl3 pl4 pl5 pl2005
115 86 129 105 127 112 54 87 93 81 92 81
123 136 107 111 120 119 92 88 125 87 58 90
124 122 101 109 112 113 105 97 128 57 68 91
105 115 121 129 137 121 52 79 71 106 39 69
116 128 112 125 111 118 70 64 52 68 59 62
108 126 124 131 107 119 74 78 92 99 80 84
Our \(bp2005\) and \(pl2005\) are constant variables within our cross-section, but have the same prefix as our value-varying variables.
(note: j = 1 2 3 4 5 2005)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 36
Number of variables 14 -> 5
j variable (6 values) -> trialnum
xij variables:
bp1 bp2 ... bp2005 -> bp
pl1 pl2 ... pl2005 -> pl
-----------------------------------------------------------------------------
Contains data
obs: 36
vars: 5
size: 432
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trialnum int %9.0g
age byte %3.0f Age of person
bp float %9.0g
pl float %9.0g
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trialnum
Note: Dataset has changed since last saved.
+---------------------------------+
| id trialnum age bp pl |
|---------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
6. | 1 2005 40 112 81 |
|---------------------------------|
7. | 2 1 30 123 92 |
8. | 2 2 30 136 88 |
9. | 2 3 30 107 125 |
10. | 2 4 30 111 87 |
11. | 2 5 30 120 58 |
12. | 2 2005 30 119 90 |
+---------------------------------+
We did not intend to reshape \(bp2005\) and \(pl2005\).
To rememdy, we need to specify the values within j() with j(trialnum 1-5) to indicate that we only want to reshape suffixes with 1-5.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_wide3, clear
reshape long bp pl, i(id) j(trialnum 1-5)
describe
list in 1/10, sepby(id)/Users/Sam/Desktop/Econ 645/Data/Mitchell
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 14 -> 7
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
-----------------------------------------------------------------------------
Contains data
obs: 30
vars: 7
size: 450
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
bp2005 float %9.0g
pl int %3.0f
pl2005 float %9.0g
----------------------------------------------------------------------------------------------------------------------
Sorted by: id trialnum
Note: Dataset has changed since last saved.
+---------------------------------------------------+
| id trialnum age bp bp2005 pl pl2005 |
|---------------------------------------------------|
1. | 1 1 40 115 112 54 81 |
2. | 1 2 40 86 112 87 81 |
3. | 1 3 40 129 112 93 81 |
4. | 1 4 40 105 112 81 81 |
5. | 1 5 40 127 112 92 81 |
|---------------------------------------------------|
6. | 2 1 30 123 119 92 90 |
7. | 2 2 30 136 119 88 90 |
8. | 2 3 30 107 119 125 90 |
9. | 2 4 30 111 119 87 90 |
10. | 2 5 30 120 119 58 90 |
+---------------------------------------------------+
Problem 4: No id varible to identify the cross-sectional unit
If each wide-format observation is a unique cross-sectional unit, then we can use _n to generate a unique id variable.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_wide3, clear
describe
gen idcode = _n
describe
reshape long bp pl, i(id idcode) j(trialnum 1-5)
describe
list in 1/10, sepby(id)/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio_wide3.dta
obs: 6
vars: 14 31 Dec 2009 15:46
size: 168
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
bp2005 float %9.0g
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
pl2005 float %9.0g
----------------------------------------------------------------------------------------------------------------------
Sorted by:
Contains data from cardio_wide3.dta
obs: 6
vars: 15 31 Dec 2009 15:46
size: 216
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
age byte %3.0f Age of person
bp1 int %3.0f Blood pressure systolic Trial 1
bp2 int %3.0f Blood pressure systolic Trial 2
bp3 int %3.0f Blood pressure systolic Trial 3
bp4 int %3.0f Blood pressure systolic Trial 4
bp5 int %3.0f Blood pressure systolic Trial 5
bp2005 float %9.0g
pl1 int %3.0f Pulse: Trial 1
pl2 byte %3.0f Pulse: Trial 2
pl3 int %3.0f Pulse: Trial 3
pl4 int %3.0f Pulse: Trial 4
pl5 byte %3.0f Pulse: Trial 5
pl2005 float %9.0g
idcode double %10.0g
----------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 6 -> 30
Number of variables 15 -> 8
j variable (5 values) -> trialnum
xij variables:
bp1 bp2 ... bp5 -> bp
pl1 pl2 ... pl5 -> pl
-----------------------------------------------------------------------------
Contains data
obs: 30
vars: 8
size: 690
----------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------
id byte %3.0f ID of person
idcode double %10.0g
trialnum byte %9.0g
age byte %3.0f Age of person
bp int %3.0f
bp2005 float %9.0g
pl int %3.0f
pl2005 float %9.0g
----------------------------------------------------------------------------------------------------------------------
Sorted by: id idcode trialnum
Note: Dataset has changed since last saved.
+------------------------------------------------------------+
| id idcode trialnum age bp bp2005 pl pl2005 |
|------------------------------------------------------------|
1. | 1 1 1 40 115 112 54 81 |
2. | 1 1 2 40 86 112 87 81 |
3. | 1 1 3 40 129 112 93 81 |
4. | 1 1 4 40 105 112 81 81 |
5. | 1 1 5 40 127 112 92 81 |
|------------------------------------------------------------|
6. | 2 2 1 30 123 119 92 90 |
7. | 2 2 2 30 136 119 88 90 |
8. | 2 2 3 30 107 119 125 90 |
9. | 2 2 4 30 111 119 87 90 |
10. | 2 2 5 30 120 119 58 90 |
+------------------------------------------------------------+
4.6 Multilevel datasets
Mitchell discusses when we have multiple levels of data, such as cross-sectional unit i and time dimension t, where data is level 1 data are constant across time within unit, but vary across cross-sectional units and level 2 data are time-varying data that vary within a cross-sectional units
Sometimes we have multiple levels, such as school district, school, classroom, and student.
If you are interested, please review pages 311-314
4.7 Collapsing datasets
The collapse command can be useful finding statistics at aggregate levels for different groups. You don’t have to use egen but you can if you would like. After using the egen command, such as egen sumvar = sum(var), or egen avgvar = mean(var) the collapse command can be useful.
Let’s say we want to aggregate mean wages by state in the CPS, we can use the egen mean command to find average wages in the state for month m in year y. We can then keep state month year and avgvar for all 50 states for each month and year of interest.
We need to be careful with the collapse command, since your original data are lost from memory unless you use a tempfile or a data frame.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
+------------------------------+
| id trial age bp pl |
|------------------------------|
1. | 1 1 40 115 54 |
2. | 1 2 40 86 87 |
3. | 1 3 40 129 93 |
4. | 1 4 40 105 81 |
5. | 1 5 40 127 92 |
|------------------------------|
6. | 2 1 30 123 92 |
7. | 2 2 30 136 88 |
8. | 2 3 30 107 125 |
9. | 2 4 30 111 87 |
10. | 2 5 30 120 58 |
|------------------------------|
11. | 3 1 16 124 105 |
12. | 3 2 16 122 97 |
13. | 3 3 16 101 128 |
14. | 3 4 16 109 57 |
15. | 3 5 16 112 68 |
|------------------------------|
16. | 4 1 23 105 52 |
17. | 4 2 23 115 79 |
18. | 4 3 23 121 71 |
19. | 4 4 23 129 106 |
20. | 4 5 23 137 39 |
|------------------------------|
21. | 5 1 18 116 70 |
22. | 5 2 18 128 64 |
23. | 5 3 18 112 52 |
24. | 5 4 18 125 68 |
25. | 5 5 18 111 59 |
|------------------------------|
26. | 6 1 27 108 74 |
27. | 6 2 27 126 78 |
28. | 6 3 27 124 92 |
29. | 6 4 27 131 99 |
30. | 6 5 27 107 80 |
+------------------------------+
Let’s say we want to find mean blood pressure and pluse for each cross-sectional unit, then we can use the egen command or the collapse commands.
sort id
by id: egen meanbp = mean(bp)
by id: egen meanpl = mean(pl)
by id: egen maxbp = max(bp)
by id: egen maxpl = max(bp)
by id: egen minbp = min(bp)
by id: egen minpl = min(bp)
by id: egen medbp = pctile(bp), p(50)
by id: egen medpl = pctile(pl), p(50)
collapse meanbp meanpl maxbp maxpl minbp minpl medbp medpl, by(id)Or you can just use the collapse command
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_long, clear
collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)/Users/Sam/Desktop/Econ 645/Data/Mitchell
Notice that our data are rounded to the nearest 1, unlike our egen commands, we can format our data to reflect that decimal point to two places instead of rounding up by default
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use cardio_long, clear
format bp* pl* %5.2f
collapse (mean) meanbp=bp meanpl=pl (max) maxbp=bp maxpl=pl (min) minbp=bp minpl=pl (p50) medbp=bp medpl=pl, by(id)/Users/Sam/Desktop/Econ 645/Data/Mitchell