Chapter 4 Processing observations across subgroups

One thing Stata provides with ease is a set of commands and options for subgroup analysis. We can use the by prefix command to create and analyze subgroups or cross-sectional units in a panel data set.

4.1 Obtaining separate results for subgroups

Tabulate is a very helpful command for analyzing categorical variables, or occasionally for looking through continuous variables (as long as there aren’t too many values).

The tabulate command has an option to summarize a continuous variable when tabulating categorical variables.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use wws2, clear
tabulate married, summarize(wage)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

(Working Women Survey w/fixes)

            |       Summary of hourly wage
    married |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |   8.0920006    6.354849         804
          1 |   7.6319496   5.5017864       1,440
------------+------------------------------------
      Total |   7.7967807   5.8245895       2,244

Another option is using the bysort prefix command.

bysort married: summarize wage
-> married = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |        804    8.092001    6.354849          0   40.19808

------------------------------------------------------------------------------------------------------------------
-> married = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      1,440     7.63195    5.501786   1.004952   40.74659

We can also correlate data within groups, rather than repeating commands with if qualifiers.

With qualifiers

correlate wage age if married == 0
correlate wage age if married == 1
(obs=804)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |  -0.0185   1.0000


(obs=1,440)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |   0.0049   1.0000

Using bysort accomplishes this in one command:

bysort married: correlate wage age
-> married = 0
(obs=804)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |  -0.0185   1.0000


------------------------------------------------------------------------------------------------------------------
-> married = 1
(obs=1,440)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |   0.0049   1.0000

Using bysort is even faster when we have a categorical variable with many categories:

bysort race: correlate wage age
-> race = 1
(obs=1,637)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |   0.0017   1.0000


------------------------------------------------------------------------------------------------------------------
-> race = 2
(obs=581)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |  -0.0331   1.0000


------------------------------------------------------------------------------------------------------------------
-> race = 3
(obs=26)

             |     wage      age
-------------+------------------
        wage |   1.0000
         age |  -0.2194   1.0000

4.2 Computing values separately by subgroups

The by prefix command and the egen command are a powerful combination that makes aggregating group statistics much easier than in other statistical software packages.

For comparison, R has aggregate, which is flexible and powerful but requires more coding than bysort with egen.

I’m not the biggest fan of Mitchell’s examples with bysort var: egen, but they get the job done. I would like us to use some CPS examples with bysort and egen.

With bysort var: egen we can calculate subgroup statistics, counts, and summations with one easy line of code.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +--------------------------------------------+
     | kidid          dt   female   wt   tv   vac |
     |--------------------------------------------|
  1. |     1   07jan2002        1   53    1     1 |
  2. |     1   08jan2002        1   55    3     1 |
     |--------------------------------------------|
  3. |     2   16jan2002        1   58    8     1 |
     |--------------------------------------------|
  4. |     3   18jan2002        0   60    2     0 |
  5. |     3   19jan2002        0   63    5     1 |
  6. |     3   21jan2002        0   66    1     1 |
  7. |     3   22jan2002        0   64    6     0 |
     |--------------------------------------------|
  8. |     4   10jan2002        1   62    7     0 |
  9. |     4   11jan2002        1   58    1     0 |
 10. |     4   13jan2002        1   55    4     0 |
     +--------------------------------------------+

If we want to calculate the average tv time for each kid, we first sort by the child id and then use egen. With multiple grouping levels, bysort accepts several identifiers, such as id, year, and month of year.

bysort kidid: egen avgtv = mean(tv)
sort kidid
list kidid tv avgtv, sepby(kidid)
     | kidid   tv   avgtv |
     |--------------------|
  1. |     1    1       2 |
  2. |     1    3       2 |
     |--------------------|
  3. |     2    8       8 |
     |--------------------|
  4. |     3    2     3.5 |
  5. |     3    5     3.5 |
  6. |     3    1     3.5 |
  7. |     3    6     3.5 |
     |--------------------|
  8. |     4    7       4 |
  9. |     4    1       4 |
 10. |     4    4       4 |
     +--------------------+

We may also want the standard deviation of each child’s tv watching. Let’s generate some z-scores and look at the statistics.

bysort kidid: egen sdtv = sd(tv)
generate ztv = (tv-avgtv)/sdtv
list 
(1 missing value generated)

(1 missing value generated)

     +-----------------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   avgtv        sdtv          ztv |
     |-----------------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1       2   1.4142136   -.70710678 |
  2. |     1   08jan2002        1   55    3     1       2   1.4142136    .70710678 |
  3. |     2   16jan2002        1   58    8     1       8           .            . |
  4. |     3   18jan2002        0   60    2     0     3.5   2.3804761   -.63012604 |
  5. |     3   19jan2002        0   63    5     1     3.5   2.3804761    .63012604 |
     |-----------------------------------------------------------------------------|
  6. |     3   21jan2002        0   66    1     1     3.5   2.3804761   -1.0502101 |
  7. |     3   22jan2002        0   64    6     0     3.5   2.3804761    1.0502101 |
  8. |     4   10jan2002        1   62    7     0       4           3            1 |
  9. |     4   11jan2002        1   58    1     0       4           3           -1 |
 10. |     4   13jan2002        1   55    4     0       4           3            0 |
     +-----------------------------------------------------------------------------+

Let’s generate some subgroup statistics for the binary variable vac, where vac = 0 if the kid was not on vacation and vac = 1 if the kid was on vacation.

bysort kidid: egen vac_total = total(vac)
bysort kidid: egen vac_sd = sd(vac)
bysort kidid: egen vac_min = min(vac)
bysort kidid: egen vac_max = max(vac)
list kidid vac*, sepby(kidid) abb(10)
(1 missing value generated)

     +---------------------------------------------------------+
     | kidid   vac   vac_total      vac_sd   vac_min   vac_max |
     |---------------------------------------------------------|
  1. |     1     1           2           0         1         1 |
  2. |     1     1           2           0         1         1 |
     |---------------------------------------------------------|
  3. |     2     1           1           .         1         1 |
     |---------------------------------------------------------|
  4. |     3     0           2   .57735027         0         1 |
  5. |     3     1           2   .57735027         0         1 |
  6. |     3     1           2   .57735027         0         1 |
  7. |     3     0           2   .57735027         0         1 |
     |---------------------------------------------------------|
  8. |     4     0           0           0         0         0 |
  9. |     4     0           0           0         0         0 |
 10. |     4     0           0           0         0         0 |
     +---------------------------------------------------------+

Let’s see if some kids watch less than 4 hours of tv per day. We’ll generate a binary/dummy variable equal to 1 if tv time is less than 4 hours a day and 0 if it is 4 hours or more.

generate tvlo = (tv < 4) if !missing(tv)

We can generate individual-level subgroup statistics with bysort and egen on binary variables:

bysort kidid: egen tvlocnt = count(tvlo)
bysort kidid: egen tvlototal = total(tvlo)
bysort kidid: egen tvlosum = sum(tvlo)
bysort kidid: gen tvlosum2 = sum(tvlo)
bysort kidid: egen tvlosame = sd(tvlo)
bysort kidid: egen tvloall = min(tvlo)
bysort kidid: egen tvloever = max(tvlo)
list kidid tv tvlo*, sepby(kidid) abb(20)
(1 missing value generated)

     +-----------------------------------------------------------------------------------------------+
     | kidid   tv   tvlo   tvlocnt   tvlototal   tvlosum   tvlosum2    tvlosame   tvloall   tvloever |
     |-----------------------------------------------------------------------------------------------|
  1. |     1    1      1         2           2         2          1           0         1          1 |
  2. |     1    3      1         2           2         2          2           0         1          1 |
     |-----------------------------------------------------------------------------------------------|
  3. |     2    8      0         1           0         0          0           .         0          0 |
     |-----------------------------------------------------------------------------------------------|
  4. |     3    2      1         4           2         2          1   .57735027         0          1 |
  5. |     3    5      0         4           2         2          1   .57735027         0          1 |
  6. |     3    1      1         4           2         2          2   .57735027         0          1 |
  7. |     3    6      0         4           2         2          2   .57735027         0          1 |
     |-----------------------------------------------------------------------------------------------|
  8. |     4    7      0         3           1         1          0   .57735027         0          1 |
  9. |     4    1      1         3           1         1          1   .57735027         0          1 |
 10. |     4    4      0         3           1         1          1   .57735027         0          1 |
     +-----------------------------------------------------------------------------------------------+

Notice how count() provides the number of observations for each kid, while total() returns a constant equal to the group sum, and so does egen sum(). The catch is that plain gen var = sum(var2) is a different function that returns a running sum, which we see in tvlosum2. Use sum() with caution: if you want a constant group total, use egen total(); if you want a cumulative (running) sum, use generate with sum().

We have our central tendency functions with

  1. mean(),
  2. median(),
  3. mode().

We can find percentiles with

  1. pctile(var), p(#).

We have other egen functions that may be of help, such as

  1. Interquartile Range iqr(),
  2. Median Absolute Deviation mad(),
  3. Mean Absolute Deviation mdev(),
  4. Kurtosis kurt(),
  5. Skewness skew().

For the full list of functions, see

help egen
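To illustrate a few of these, here is a minimal sketch (assuming the tv1 data from above is in memory; the new variable names are my own):

```stata
* Group median, 25th percentile, and interquartile range of tv time per kid
bysort kidid: egen tvmed = median(tv)
bysort kidid: egen tvp25 = pctile(tv), p(25)
bysort kidid: egen tviqr = iqr(tv)
```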

Mitchell has a good note here:

Egen’s mean() takes an expression as its argument, not a varlist, so if you type bysort idvar: egen meanvars1_5=mean(var1-var5), mean() will return not the means of vars 1 through 5, but the mean of var1 minus var5.
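A minimal sketch of the pitfall and a workaround (idvar and var1-var5 are hypothetical names):

```stata
* WRONG for per-variable means: mean() evaluates var1-var5 as subtraction
bysort idvar: egen wrongmean = mean(var1-var5)   // mean of (var1 minus var5)

* To get the group mean of each variable, loop over the varlist instead
foreach v of varlist var1-var5 {
    bysort idvar: egen mean_`v' = mean(`v')
}
```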

4.3 Subscripting or Indexing: Computing values within subgroups

Unsolicited Opinion Alert: Subscripting (or I may accidentally call it indexing) is a very powerful tool that I personally think makes Stata the top paid statistical software (I do think that R is more powerful and more flexible, but Stata balances power, flexibility, and ease of learning).

Each variable is a vector, x = (x[1], x[2], x[3], …, x[N]), for observations i = 1, …, N. We can use a subscript (or index) to call the part of the vector we want to return.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +--------------------------------------------+
     | kidid          dt   female   wt   tv   vac |
     |--------------------------------------------|
  1. |     1   07jan2002        1   53    1     1 |
  2. |     1   08jan2002        1   55    3     1 |
     |--------------------------------------------|
  3. |     2   16jan2002        1   58    8     1 |
     |--------------------------------------------|
  4. |     3   18jan2002        0   60    2     0 |
  5. |     3   19jan2002        0   63    5     1 |
  6. |     3   21jan2002        0   66    1     1 |
  7. |     3   22jan2002        0   64    6     0 |
     |--------------------------------------------|
  8. |     4   10jan2002        1   62    7     0 |
  9. |     4   11jan2002        1   58    1     0 |
 10. |     4   13jan2002        1   55    4     0 |
     +--------------------------------------------+

If we want the first observation in our tv vector we can call it with [1]

display tv[1]
1

We can look at the first observation’s kid id, date, sex, and tv hours (note that dt displays as its underlying numeric value, the number of days since 01jan1960, unless we apply a date format).

display "kid: " kidid[1] ", Date: " dt[1] ", Sex: " female[1] ", TV Hours: " tv[1]
kid: 1, Date: 15347, Sex: 1, TV Hours: 1

We can see the second observation

display tv[2]
3

We can see the difference between the two observations

display tv[2]-tv[1]
2

Note: we have some very useful system variables, _N and _n. _N is the total number of observations (within the current by-group, if one is in effect), and used as a subscript it returns the last observation.

_n is the current observation number, and used as a subscript it returns the current observation (almost like a loop counter i = i + 1).

help system variables
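A quick sketch of these system variables in action (using Stata’s shipped auto dataset, not our tv data):

```stata
sysuse auto, clear
display _N                       // total number of observations in memory
generate obsnum = _n             // running observation number: 1, 2, 3, ...
generate lastprice = price[_N]   // the last observation's price, for every row
```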

Subscripting is very helpful when working with panel data. You can index within cross-sectional units over time with ease: the subscript (or index) returns the nth observation within each group.

If we want to compare the first observation to all observations:

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid: gen tv_1ob = tv[1]
list kidid tv tv_1ob, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +---------------------+
     | kidid   tv   tv_1ob |
     |---------------------|
  1. |     1    1        1 |
  2. |     1    3        1 |
     |---------------------|
  3. |     2    8        8 |
     |---------------------|
  4. |     3    2        2 |
  5. |     3    5        2 |
  6. |     3    1        2 |
  7. |     3    6        2 |
     |---------------------|
  8. |     4    7        7 |
  9. |     4    1        7 |
 10. |     4    4        7 |
     +---------------------+

If we want to compare the last observation to all observations

bysort kidid: gen tv_lastob = tv[_N]
list kidid tv tv_lastob, sepby(kidid)
     | kidid   tv   tv_las~b |
     |-----------------------|
  1. |     1    1          3 |
  2. |     1    3          3 |
     |-----------------------|
  3. |     2    8          8 |
     |-----------------------|
  4. |     3    2          6 |
  5. |     3    5          6 |
  6. |     3    1          6 |
  7. |     3    6          6 |
     |-----------------------|
  8. |     4    7          4 |
  9. |     4    1          4 |
 10. |     4    4          4 |
     +-----------------------+

If we want the second to last observation

bysort kidid: gen tv_2tolastob = tv[_N-1]
list kidid tv tv_2tolastob, sepby(kidid)
(1 missing value generated)

     +-----------------------+
     | kidid   tv   tv_2to~b |
     |-----------------------|
  1. |     1    1          1 |
  2. |     1    3          1 |
     |-----------------------|
  3. |     2    8          . |
     |-----------------------|
  4. |     3    2          1 |
  5. |     3    5          1 |
  6. |     3    1          1 |
  7. |     3    6          1 |
     |-----------------------|
  8. |     4    7          1 |
  9. |     4    1          1 |
 10. |     4    4          1 |
     +-----------------------+

If we want the prior observation (lag of 1)

bysort kidid: gen tv_lagob = tv[_n-1]
list kidid tv tv_lagob, sepby(kidid)
(4 missing values generated)

     +-----------------------+
     | kidid   tv   tv_lagob |
     |-----------------------|
  1. |     1    1          . |
  2. |     1    3          1 |
     |-----------------------|
  3. |     2    8          . |
     |-----------------------|
  4. |     3    2          . |
  5. |     3    5          2 |
  6. |     3    1          5 |
  7. |     3    6          1 |
     |-----------------------|
  8. |     4    7          . |
  9. |     4    1          7 |
 10. |     4    4          1 |
     +-----------------------+

If we want the next observation (lead of 1)

bysort kidid: gen tv_leadob = tv[_n+1]
list kidid tv tv_leadob, sepby(kidid)
(4 missing values generated)

     +-----------------------+
     | kidid   tv   tv_lea~b |
     |-----------------------|
  1. |     1    1          3 |
  2. |     1    3          . |
     |-----------------------|
  3. |     2    8          . |
     |-----------------------|
  4. |     3    2          5 |
  5. |     3    5          1 |
  6. |     3    1          6 |
  7. |     3    6          . |
     |-----------------------|
  8. |     4    7          1 |
  9. |     4    1          4 |
 10. |     4    4          . |
     +-----------------------+

You can use bysort kidid (dt) to tell Stata to order by kid id and date, but NOT INCLUDE dt in the grouping.

If we instead group by kidid AND dt, as in bysort kidid dt:, then we look within kid id AND date. Since there is only 1 observation per kid per date, we will only have 1 observation in each grouping. If we want to compare the first observation to all observations:

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): gen tv_1ob1 = tv[1]
bysort kidid dt: gen tv_1ob2 = tv[1]
list kidid tv tv_1ob*, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +--------------------------------+
     | kidid   tv   tv_1ob1   tv_1ob2 |
     |--------------------------------|
  1. |     1    1         1         1 |
  2. |     1    3         1         3 |
     |--------------------------------|
  3. |     2    8         8         8 |
     |--------------------------------|
  4. |     3    2         2         2 |
  5. |     3    5         2         5 |
  6. |     3    1         2         1 |
  7. |     3    6         2         6 |
     |--------------------------------|
  8. |     4    7         7         7 |
  9. |     4    1         7         1 |
 10. |     4    4         7         4 |
     +--------------------------------+

If we want to compare the last observation to all observations

bysort kidid (dt): gen tv_lastob1 = tv[_N]
bysort kidid dt: gen tv_lastob2 = tv[_N]
list kidid tv tv_lastob*, sepby(kidid)
     | kidid   tv   tv_las~1   tv_las~2 |
     |----------------------------------|
  1. |     1    1          3          1 |
  2. |     1    3          3          3 |
     |----------------------------------|
  3. |     2    8          8          8 |
     |----------------------------------|
  4. |     3    2          6          2 |
  5. |     3    5          6          5 |
  6. |     3    1          6          1 |
  7. |     3    6          6          6 |
     |----------------------------------|
  8. |     4    7          4          7 |
  9. |     4    1          4          1 |
 10. |     4    4          4          4 |
     +----------------------------------+

If we want the prior observation (lag of 1)

quietly {
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): gen tv_1ob1 = tv[1]
bysort kidid dt: gen tv_1ob2 = tv[1]
bysort kidid (dt): gen tv_lastob1 = tv[_N]
bysort kidid dt: gen tv_lastob2 = tv[_N]
}
bysort kidid (dt): gen tv_lagob1 = tv[_n-1]
bysort kidid dt: gen tv_lagob2 = tv[_n-1]
list kidid tv tv_lagob*, sepby(kidid)
(4 missing values generated)

(10 missing values generated)

     +----------------------------------+
     | kidid   tv   tv_lag~1   tv_lag~2 |
     |----------------------------------|
  1. |     1    1          .          . |
  2. |     1    3          1          . |
     |----------------------------------|
  3. |     2    8          .          . |
     |----------------------------------|
  4. |     3    2          .          . |
  5. |     3    5          2          . |
  6. |     3    1          5          . |
  7. |     3    6          1          . |
     |----------------------------------|
  8. |     4    7          .          . |
  9. |     4    1          7          . |
 10. |     4    4          1          . |
     +----------------------------------+

4.4 Computing values within subgroups: Computations across observations

Another powerful combination with subscripting/indexing is that we can use the generate command to create new variables that perform mathematical operations across different observations WITHIN the vector.

Difference in tv time between current period and prior period

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate tvdfp = tv - tv[_n-1]
list
/Users/Sam/Desktop/Econ 645/Data/Mitchell


(4 missing values generated)

     +----------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   tvdfp |
     |----------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1       . |
  2. |     1   08jan2002        1   55    3     1       2 |
  3. |     2   16jan2002        1   58    8     1       . |
  4. |     3   18jan2002        0   60    2     0       . |
  5. |     3   19jan2002        0   63    5     1       3 |
     |----------------------------------------------------|
  6. |     3   21jan2002        0   66    1     1      -4 |
  7. |     3   22jan2002        0   64    6     0       5 |
  8. |     4   10jan2002        1   62    7     0       . |
  9. |     4   11jan2002        1   58    1     0      -6 |
 10. |     4   13jan2002        1   55    4     0       3 |
     +----------------------------------------------------+

Difference in tv time between current period and next period

bysort kidid (dt): generate tvdfs = tv - tv[_n+1]

Difference in tv time between current period and first period

bysort kidid (dt): generate tvdff = tv - tv[1]

Difference in tv time between current period and last period

bysort kidid (dt): generate tvdfl = tv - tv[_N]

Centered 3-observation moving average of tv time

bysort kidid (dt): generate tv3avg = (tv[_n-1] + tv[_n] + tv[_n+1])/3
list kidid dt tvd* tv3avg
(7 missing values generated)

     +---------------------------------------------------------------+
     | kidid          dt   tvdfp   tvdfs   tvdff   tvdfl      tv3avg |
     |---------------------------------------------------------------|
  1. |     1   07jan2002       .      -2       0      -2           . |
  2. |     1   08jan2002       2       .       2       0           . |
  3. |     2   16jan2002       .       .       0       0           . |
  4. |     3   18jan2002       .      -3       0      -4           . |
  5. |     3   19jan2002       3       4       3      -1   2.6666667 |
     |---------------------------------------------------------------|
  6. |     3   21jan2002      -4      -5      -1      -5           4 |
  7. |     3   22jan2002       5       .       4       0           . |
  8. |     4   10jan2002       .       6       0       3           . |
  9. |     4   11jan2002      -6      -3      -6      -3           4 |
 10. |     4   13jan2002       3       .      -3       0           . |
     +---------------------------------------------------------------+

We can also rebase a series. For example, we can rebase a price deflator to express values in the period dollars we want.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
import excel using "cpi_1993_2023.xlsx", cellrange(A12:P42) firstrow clear
keep Year Annual
/Users/Sam/Desktop/Econ 645/Data/Mitchell

Rebase in 1993 Dollars

gen rebase93 = Annual/Annual[1]*100

Rebase in 2022 Dollars

gen rebase22 = Annual/Annual[_N]*100

Rebase to 2012

gen rebase12 = Annual/Annual[_N-10]*100
list
     | Year    Annual    rebase93    rebase22    rebase12 |
     |----------------------------------------------------|
  1. | 1993     144.5         100   49.375545   62.937185 |
  2. | 1994     148.2   102.56055   50.639832   64.548725 |
  3. | 1995     152.4   105.46713   52.074969   66.378041 |
  4. | 1996     156.9   108.58131   53.612616   68.338023 |
  5. | 1997     160.5   111.07266   54.842733   69.906008 |
     |----------------------------------------------------|
  6. | 1998       163   112.80277   55.696981   70.994887 |
  7. | 1999     166.6   115.29412   56.927098   72.562872 |
  8. | 2000     172.2   119.16955   58.840614    75.00196 |
  9. | 2001     177.1   122.56055   60.514941   77.136162 |
 10. | 2002     179.9   124.49827   61.471699   78.355706 |
     |----------------------------------------------------|
 11. | 2003       184   127.33564   62.872666   80.141467 |
 12. | 2004     188.9   130.72664   64.546992   82.275669 |
 13. | 2005     195.3   135.15571   66.733868   85.063199 |
 14. | 2006     201.6   139.51557   68.886573   87.807173 |
 15. | 2007   207.342   143.48927    70.84861   90.308109 |
     |----------------------------------------------------|
 16. | 2008   215.303   148.99862   73.568878   93.775534 |
 17. | 2009   214.537   148.46851   73.307136   93.441902 |
 18. | 2010   218.056   150.90381   74.509576   94.974607 |
 19. | 2011   224.939   155.66713   76.861492   97.972508 |
 20. | 2012   229.594   158.88858   78.452102         100 |
     |----------------------------------------------------|
 21. | 2013   232.957   161.21592   79.601237   101.46476 |
 22. | 2014   236.736   163.83114   80.892518   103.11071 |
 23. | 2015   237.017   164.02561   80.988536    103.2331 |
 24. | 2016   240.007   166.09481   82.010217    104.5354 |
 25. | 2017    245.12   169.63322   83.757325   106.76237 |
     |----------------------------------------------------|
 26. | 2018   251.107   173.77647   85.803079   109.37002 |
 27. | 2019   255.657   176.92526    87.35781   111.35178 |
 28. | 2020   258.811   179.10796    88.43553   112.72551 |
 29. | 2021    270.97   187.52249   92.590251   118.02138 |
 30. | 2022   292.655   202.52941         100   127.46631 |
     +----------------------------------------------------+
graph twoway line Annual Year || line rebase93 Year || ///
    line rebase22 Year, yline(100) ///
    legend(order(1 "100=1982" 2 "100=1993" 3 "100=2022"))
graph export "/Users/Sam/Desktop/Econ 645/Stata/week8_bls.png", replace
Indexed CPI

4.5 Computing values within subgroups: Running sums

As we mentioned earlier, egen sum and gen sum give different results. Egen sum() behaves like egen total(), which can be confusing: when we use gen with sum(), we generate a RUNNING sum, not a constant total.

We can generate the tv running sum across all individuals over time

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
generate tvrunsum = sum(tv)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

We can generate the tv running sum within an individual’s time period

bysort kidid (dt): generate bytvrunsum=sum(tv)

We can generate the total sum within an individual’s time period

bysort kidid (dt): egen bytvsum=total(tv)

We can generate the total sum for all individuals over time

egen tvsum = total(tv)

We can also calculate a running average

bysort kidid (dt): generate bytvrunavg=sum(tv)/_n

We can compute the individual’s overall average

bysort kidid (dt): egen bytvavg = mean(tv)
list kidid tv tv* by*, sepby(kidid)
     | kidid   tv   tv   tvrunsum   tvsum   bytvru~m   bytvsum   bytvrun~g   bytvavg |
     |-------------------------------------------------------------------------------|
  1. |     1    1    1          1      38          1         4           1         2 |
  2. |     1    3    3          4      38          4         4           2         2 |
     |-------------------------------------------------------------------------------|
  3. |     2    8    8         12      38          8         8           8         8 |
     |-------------------------------------------------------------------------------|
  4. |     3    2    2         14      38          2        14           2       3.5 |
  5. |     3    5    5         19      38          7        14         3.5       3.5 |
  6. |     3    1    1         20      38          8        14   2.6666667       3.5 |
  7. |     3    6    6         26      38         14        14         3.5       3.5 |
     |-------------------------------------------------------------------------------|
  8. |     4    7    7         33      38          7        12           7         4 |
  9. |     4    1    1         34      38          8        12           4         4 |
 10. |     4    4    4         38      38         12        12           4         4 |
     +-------------------------------------------------------------------------------+

4.6 Computing values within subgroups: More examples

There are other useful calculations we can do with subscripting/indexing. Some do overlap with egen, but it is helpful to know the differences.

Counting

Count the number of observations: this can be done with subscripting or with egen, depending on what we want.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear

*Generate total observation count per individual missing or not missing:
bysort kidid (dt): generate idcount=_N

*Generate the total observation without missing
bysort kidid (dt): egen idcount_nomiss = count(tv)

*Generate a running count of an observation
bysort kidid (dt): gen idruncount = _n
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +----------------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   idcount   idcoun~s   idrunc~t |
     |----------------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1         2          2          1 |
  2. |     1   08jan2002        1   55    3     1         2          2          2 |
     |----------------------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1         1          1          1 |
     |----------------------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0         4          4          1 |
  5. |     3   19jan2002        0   63    5     1         4          4          2 |
  6. |     3   21jan2002        0   66    1     1         4          4          3 |
  7. |     3   22jan2002        0   64    6     0         4          4          4 |
     |----------------------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0         3          3          1 |
  9. |     4   11jan2002        1   58    1     0         3          3          2 |
 10. |     4   13jan2002        1   55    4     0         3          3          3 |
     +----------------------------------------------------------------------------+

Generate Binaries

We can generate binary variables to find first, last, or nth observations. This differs from the id counts above: here we are generating binaries that equal 1 when the qualifier is true.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear

*Find individuals with only one observation
bysort kidid (dt): generate singleob = (_N==1)

*Find the first observation of an individual
bysort kidid (dt): generate firstob = (_n==1)

*Find the last observation of an individual 
bysort kidid (dt): generate lastob = (_n==_N)
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +--------------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   singleob   firstob   lastob |
     |--------------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1          0         1        0 |
  2. |     1   08jan2002        1   55    3     1          0         0        1 |
     |--------------------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1          1         1        1 |
     |--------------------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0          0         1        0 |
  5. |     3   19jan2002        0   63    5     1          0         0        0 |
  6. |     3   21jan2002        0   66    1     1          0         0        0 |
  7. |     3   22jan2002        0   64    6     0          0         0        1 |
     |--------------------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0          0         1        0 |
  9. |     4   11jan2002        1   58    1     0          0         0        0 |
 10. |     4   13jan2002        1   55    4     0          0         0        1 |
     +--------------------------------------------------------------------------+

We can also create binaries based on leads and lags. Here we look for a change in vac:

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate vacstart=(vac==1) & (vac[_n-1]==0)
bysort kidid (dt): generate vacend=(vac==1) & (vac[_n+1]==0)
list kidid dt vac*, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +---------------------------------------------+
     | kidid          dt   vac   vacstart   vacend |
     |---------------------------------------------|
  1. |     1   07jan2002     1          0        0 |
  2. |     1   08jan2002     1          0        0 |
     |---------------------------------------------|
  3. |     2   16jan2002     1          0        0 |
     |---------------------------------------------|
  4. |     3   18jan2002     0          0        0 |
  5. |     3   19jan2002     1          1        0 |
  6. |     3   21jan2002     1          0        1 |
  7. |     3   22jan2002     0          0        0 |
     |---------------------------------------------|
  8. |     4   10jan2002     0          0        0 |
  9. |     4   11jan2002     0          0        0 |
 10. |     4   13jan2002     0          0        0 |
     +---------------------------------------------+

Fill in Missing

Another useful tool, to be used with caution, is filling in missing values. This should really only be applied when we have a variable that is constant over time.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv2, clear
sort kidid dt
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell

     +--------------------------------------------+
     | kidid          dt   female   wt   tv   vac |
     |--------------------------------------------|
  1. |     1   07jan2002        1   53    1     1 |
  2. |     1   08jan2002        1   55    3     1 |
     |--------------------------------------------|
  3. |     2   16jan2002        1   58    8     1 |
     |--------------------------------------------|
  4. |     3   18jan2002        0   60    2     0 |
  5. |     3   19jan2002        0    .    .     . |
  6. |     3   21jan2002        0   66    .     1 |
  7. |     3   22jan2002        0   64    6     0 |
     |--------------------------------------------|
  8. |     4   10jan2002        1   62    7     0 |
  9. |     4   11jan2002        1   58    .     . |
 10. |     4   13jan2002        1    .    4     0 |
     +--------------------------------------------+

We can fill in a missing observation with the most recent nonmissing observation (carrying the last value forward). First generate a copy of the variable with the missing values.

generate tvimp1 = tv
bysort kidid (dt): replace tvimp1 = tv[_n-1] if missing(tv)
list kidid dt tv tvimp1, sepby(kidid)
(3 missing values generated)

(2 real changes made)

     +---------------------------------+
     | kidid          dt   tv   tvimp1 |
     |---------------------------------|
  1. |     1   07jan2002    1        1 |
  2. |     1   08jan2002    3        3 |
     |---------------------------------|
  3. |     2   16jan2002    8        8 |
     |---------------------------------|
  4. |     3   18jan2002    2        2 |
  5. |     3   19jan2002    .        2 |
  6. |     3   21jan2002    .        . |
  7. |     3   22jan2002    6        6 |
     |---------------------------------|
  8. |     4   10jan2002    7        7 |
  9. |     4   11jan2002    .        7 |
 10. |     4   13jan2002    4        4 |
     +---------------------------------+

Notice that we are still missing the 3rd observation for the 3rd kid. It cannot be filled from the second observation, since the second observation is itself missing. There are a couple of strategies to handle this. We can generate a new variable, as Mitchell does.

generate tvimp2 = tvimp1
bysort kidid (dt): replace tvimp2 = tvimp2[_n-1] if missing(tvimp2)
list kidid tv tvimp*, sepby(kidid)
(1 missing value generated)

(1 real change made)

     +------------------------------+
     | kidid   tv   tvimp1   tvimp2 |
     |------------------------------|
  1. |     1    1        1        1 |
  2. |     1    3        3        3 |
     |------------------------------|
  3. |     2    8        8        8 |
     |------------------------------|
  4. |     3    2        2        2 |
  5. |     3    .        2        2 |
  6. |     3    .        .        2 |
  7. |     3    6        6        6 |
     |------------------------------|
  8. |     4    7        7        7 |
  9. |     4    .        7        7 |
 10. |     4    4        4        4 |
     +------------------------------+

Instead of generating a new variable, you can simply replace tvimp1 twice; that choice is up to the user. You would use tv[_n-1] for the first replace and tvimp1[_n-1] for the second replace.
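A sketch of that two-replace alternative, assuming the same tv1 setup as above (the second pass reads the already-filled tvimp1, so it can chain through consecutive missings):

```stata
generate tvimp1 = tv
bysort kidid (dt): replace tvimp1 = tv[_n-1]     if missing(tvimp1)
bysort kidid (dt): replace tvimp1 = tvimp1[_n-1] if missing(tvimp1)
```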

Interpolation

We may need to interpolate between 2 known values and assume a linear trend.

generate tvimp3=tv
*Interpolate for 1 missing value between two known values
bysort kidid (dt): replace tvimp3 = (tv[_n-1]+tv[_n+1])/2 if missing(tv)
list kidid tv tvimp3, sepby(kidid)
(3 missing values generated)

(1 real change made)

     +---------------------+
     | kidid   tv   tvimp3 |
     |---------------------|
  1. |     1    1        1 |
  2. |     1    3        3 |
     |---------------------|
  3. |     2    8        8 |
     |---------------------|
  4. |     3    2        2 |
  5. |     3    .        . |
  6. |     3    .        . |
  7. |     3    6        6 |
     |---------------------|
  8. |     4    7        7 |
  9. |     4    .      5.5 |
 10. |     4    4        4 |
     +---------------------+

This involves a bit of hard coding, but you can interpolate across more than one missing value. Here the step size is the total change between the first and fourth observations for kid 3, divided by the three intervals between them.

bysort kidid (dt): replace tvimp3 = ((tvimp3[4]-tvimp3[1])/3)+tvimp3[_n-1] if missing(tvimp3) & kidid==3
list kidid tv tvimp3, sepby(kidid)
(2 real changes made)

     +------------------------+
     | kidid   tv      tvimp3 |
     |------------------------|
  1. |     1    1           1 |
  2. |     1    3           3 |
     |------------------------|
  3. |     2    8           8 |
     |------------------------|
  4. |     3    2           2 |
  5. |     3    .   3.3333333 |
  6. |     3    .   4.6666667 |
  7. |     3    6           6 |
     |------------------------|
  8. |     4    7           7 |
  9. |     4    .         5.5 |
 10. |     4    4           4 |
     +------------------------+
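Stata also ships with the ipolate command, which handles linear interpolation within groups without hard coding. A minimal sketch (note that ipolate uses the actual spacing of dt, so for kid 3, whose dates skip 20jan2002, the interpolated values will differ from the equal-step version above):

```stata
* Linear interpolation of tv over dt within each kid
bysort kidid: ipolate tv dt, generate(tvipol)
```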

Indicators

What if we want to find outliers in time-varying differences? We can generate indicator variables that flag when a variable changes by more than a set limit. For example, suppose we want to know if tv viewing habits drop by more than 2 hours.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate tvchange = tv[_n]-tv[_n-1]
bysort kidid (dt): generate tvchangerate = ((tv[_n]-tv[_n-1])/tv[_n-1])
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell


(4 missing values generated)

(4 missing values generated)

     +--------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   tvchange   tvchange~e |
     |--------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1          .            . |
  2. |     1   08jan2002        1   55    3     1          2            2 |
     |--------------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1          .            . |
     |--------------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0          .            . |
  5. |     3   19jan2002        0   63    5     1          3          1.5 |
  6. |     3   21jan2002        0   66    1     1         -4          -.8 |
  7. |     3   22jan2002        0   64    6     0          5            5 |
     |--------------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0          .            . |
  9. |     4   11jan2002        1   58    1     0         -6   -.85714286 |
 10. |     4   13jan2002        1   55    4     0          3            3 |
     +--------------------------------------------------------------------+

Generate an indicator variable flagging whether tvchange is -2 or less. Eyeballing the list is fine for small datasets, but an indicator becomes important with larger datasets such as the CPS.

gen tvchangeid=(tvchange <= -2) if !missing(tvchange)
list, sepby(kidid)
(4 missing values generated)

     +-------------------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   tvchange   tvchange~e   tvchan~d |
     |-------------------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1          .            .          . |
  2. |     1   08jan2002        1   55    3     1          2            2          0 |
     |-------------------------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1          .            .          . |
     |-------------------------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0          .            .          . |
  5. |     3   19jan2002        0   63    5     1          3          1.5          0 |
  6. |     3   21jan2002        0   66    1     1         -4          -.8          1 |
  7. |     3   22jan2002        0   64    6     0          5            5          0 |
     |-------------------------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0          .            .          . |
  9. |     4   11jan2002        1   58    1     0         -6   -.85714286          1 |
 10. |     4   13jan2002        1   55    4     0          3            3          0 |
     +-------------------------------------------------------------------------------+

4.7 Comparing the by, tsset, xtset commands

As another way to compute differences within groups, we can use the tsset or xtset command to declare a time series (tsset) or panel data (xtset) structure. First, recall how we would do this with bysort and subscripting/indexing.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate ltv = tv[_n-1]
list, sepby(kidid)
/Users/Sam/Desktop/Econ 645/Data/Mitchell


(4 missing values generated)

     +--------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   ltv |
     |--------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1     . |
  2. |     1   08jan2002        1   55    3     1     1 |
     |--------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1     . |
     |--------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0     . |
  5. |     3   19jan2002        0   63    5     1     2 |
  6. |     3   21jan2002        0   66    1     1     5 |
  7. |     3   22jan2002        0   64    6     0     1 |
     |--------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0     . |
  9. |     4   11jan2002        1   58    1     0     7 |
 10. |     4   13jan2002        1   55    4     0     1 |
     +--------------------------------------------------+

tsset

Instead, we can establish a time series with tsset.

We need to specify that our cross-sectional group is kidid and that our date variable is dt. The daily option tells Stata the time period is daily, as opposed to weekly, monthly, or yearly; alternatively, we can specify delta(1) for a step of one day.

We can use the operator L.var to take a lag of 1 day.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
tsset kidid dt, daily delta(1)
generate lagtv = L.tv
/Users/Sam/Desktop/Econ 645/Data/Mitchell

       panel variable:  kidid (unbalanced)
        time variable:  dt, 07jan2002 to 22jan2002, but with gaps
                delta:  1 day

(6 missing values generated)

We can use the operator F.var to take a lead of 1 day.

generate leadtv = F.tv
list, sepby(kidid)
(6 missing values generated)

     +-------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   lagtv   leadtv |
     |-------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1       .        3 |
  2. |     1   08jan2002        1   55    3     1       1        . |
     |-------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1       .        . |
     |-------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0       .        5 |
  5. |     3   19jan2002        0   63    5     1       2        . |
  6. |     3   21jan2002        0   66    1     1       .        6 |
  7. |     3   22jan2002        0   64    6     0       1        . |
     |-------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0       .        1 |
  9. |     4   11jan2002        1   58    1     0       7        . |
 10. |     4   13jan2002        1   55    4     0       .        . |
     +-------------------------------------------------------------+

xtset

We can establish a panel data set with xtset.

We need to specify that our cross-sectional group is kidid and that our time variable is dt. We use delta(1) to specify that the step is 1 day; alternatively, we can use the daily option.

Generate a lag with the l.var operator and a lead with the f.var operator.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
xtset kidid dt, daily delta(1)
generate lagtv = l.tv
generate leadtv = f.tv
/Users/Sam/Desktop/Econ 645/Data/Mitchell

       panel variable:  kidid (unbalanced)
        time variable:  dt, 07jan2002 to 22jan2002, but with gaps
                delta:  1 day

(6 missing values generated)

(6 missing values generated)

What do you notice? Some leads and lags are missing. Why? The panel is unbalanced, and daily differences cannot be computed when days are missing. In this case, we can use bysort with subscripting/indexing instead.

list, sepby(kidid)
bysort kidid (dt): gen bylagtv=tv[_n-1]
bysort kidid (dt): gen byleadtv=tv[_n+1]
list, sepby(kidid)
     +-------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   lagtv   leadtv |
     |-------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1       .        3 |
  2. |     1   08jan2002        1   55    3     1       1        . |
     |-------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1       .        . |
     |-------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0       .        5 |
  5. |     3   19jan2002        0   63    5     1       2        . |
  6. |     3   21jan2002        0   66    1     1       .        6 |
  7. |     3   22jan2002        0   64    6     0       1        . |
     |-------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0       .        1 |
  9. |     4   11jan2002        1   58    1     0       7        . |
 10. |     4   13jan2002        1   55    4     0       .        . |
     +-------------------------------------------------------------+

(4 missing values generated)

(4 missing values generated)

     +----------------------------------------------------------------------------------+
     | kidid          dt   female   wt   tv   vac   lagtv   leadtv   bylagtv   byleadtv |
     |----------------------------------------------------------------------------------|
  1. |     1   07jan2002        1   53    1     1       .        3         .          3 |
  2. |     1   08jan2002        1   55    3     1       1        .         1          . |
     |----------------------------------------------------------------------------------|
  3. |     2   16jan2002        1   58    8     1       .        .         .          . |
     |----------------------------------------------------------------------------------|
  4. |     3   18jan2002        0   60    2     0       .        5         .          5 |
  5. |     3   19jan2002        0   63    5     1       2        .         2          1 |
  6. |     3   21jan2002        0   66    1     1       .        6         5          6 |
  7. |     3   22jan2002        0   64    6     0       1        .         1          . |
     |----------------------------------------------------------------------------------|
  8. |     4   10jan2002        1   62    7     0       .        1         .          1 |
  9. |     4   11jan2002        1   58    1     0       7        .         7          4 |
 10. |     4   13jan2002        1   55    4     0       .        .         1          . |
     +----------------------------------------------------------------------------------+

What is the difference between xtset and tsset?

From Nick Cox: xtset allows a panel identifier only. tsset allows a time identifier only. Where they overlap is when two variables are supplied in which case the first is treated as a panel identifier and the second as a time variable.
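A quick sketch of the three forms, using the tv1 variables from above (the time-only form would actually fail on this dataset, since dt repeats across kids):

```stata
tsset dt          // time identifier only -- errors here: repeated time values
xtset kidid       // panel identifier only
xtset kidid dt    // panel identifier first, time variable second
```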

4.8 Exercises

Let’s grab the CPS and run some subgroup analyses. We will be using unweighted data for simplicity.

1. What are the average wages by sex?
2. What are the average wages by state?
3. What are the median wages by sex?
4. What are the median wages by state?
5. What is the 75th percentile of wages by race?
6. What is the 25th percentile of wages by marital status?

use "/Users/Sam/Desktop/Econ 645/Data/CPS/jun23pub.dta",replace

Hint: use syntax like bysort state: egen totalvar = total(var1)
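A hedged starter for the exercises; the wage, sex, and race variable names below are placeholders, so check the June 2023 CPS codebook for the actual names in jun23pub.dta:

```stata
* Hypothetical variable names -- substitute the real CPS names
bysort sex:  egen meanwage = mean(wage)
bysort sex:  egen medwage  = median(wage)
bysort race: egen p75wage  = pctile(wage), p(75)
tabulate sex, summarize(wage)
```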