Chapter 4 Subgroup Functions Part II
4.1 Computing values within subgroups: Running sums
As we mentioned earlier, egen's sum() and generate's sum() give different results. egen sum() is an older synonym for egen total(), which can be confusing: when we use generate with sum(), we get a RUNNING sum, not a constant total.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*We can generate the tv running sum across all individuals over time
generate tvrunsum = sum(tv)
*We can generate the tv running sum within an individual's time period
bysort kidid (dt): generate bytvrunsum = sum(tv)
*We can generate the total sum within an individual's time period
bysort kidid (dt): egen bytvsum = total(tv)
*We can generate the total sum for all individuals over time
egen tvsum = total(tv)
*We can also calculate a running average
bysort kidid (dt): generate bytvrunavg = sum(tv)/_n
*We can compute the individual's average
bysort kidid (dt): egen bytvavg = mean(tv)
list kidid tv tvrunsum tvsum bytvrunsum bytvsum bytvrunavg bytvavg, sepby(kidid)
     +--------------------------------------------------------------------------+
     | kidid   tv   tvrunsum   tvsum   bytvru~m   bytvsum   bytvrun~g   bytvavg |
     |--------------------------------------------------------------------------|
  1. |     1    1          1      38          1         4           1         2 |
  2. |     1    3          4      38          4         4           2         2 |
     |--------------------------------------------------------------------------|
  3. |     2    8         12      38          8         8           8         8 |
     |--------------------------------------------------------------------------|
  4. |     3    2         14      38          2        14           2       3.5 |
  5. |     3    5         19      38          7        14         3.5       3.5 |
  6. |     3    1         20      38          8        14   2.6666667       3.5 |
  7. |     3    6         26      38         14        14         3.5       3.5 |
     |--------------------------------------------------------------------------|
  8. |     4    7         33      38          7        12           7         4 |
  9. |     4    1         34      38          8        12           4         4 |
 10. |     4    4         38      38         12        12           4         4 |
     +--------------------------------------------------------------------------+
4.2 Computing values within subgroups: More examples
There are other useful calculations we can do with subscripting/indexing. Some do overlap with egen, but it is helpful to know the differences.
Counting
Count the number of observations: this can be done with subscripting (_n, _N) or with egen's count(), depending on what we want.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*Generate the total observation count per individual, whether tv is missing or not:
bysort kidid (dt): generate idcount=_N
*Generate the observation count per individual, excluding missing values of tv:
bysort kidid (dt): egen idcount_nomiss = count(tv)
*Generate a running count of observations:
bysort kidid (dt): gen idruncount = _n
list, sepby(kidid)
+----------------------------------------------------------------------------+
| kidid dt female wt tv vac idcount idcoun~s idrunc~t |
|----------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 2 2 1 |
2. | 1 08jan2002 1 55 3 1 2 2 2 |
|----------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 1 1 1 |
|----------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 4 4 1 |
5. | 3 19jan2002 0 63 5 1 4 4 2 |
6. | 3 21jan2002 0 66 1 1 4 4 3 |
7. | 3 22jan2002 0 64 6 0 4 4 4 |
|----------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 3 3 1 |
9. | 4 11jan2002 1 58 1 0 3 3 2 |
10. | 4 13jan2002 1 55 4 0 3 3 3 |
+----------------------------------------------------------------------------+
Generate Binaries
We can generate binary variables to flag first and last observations, or the nth observation. This differs from the id counts above: here we generate indicators that equal 1 when the qualifier is true.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*Find individuals with only one observation
bysort kidid (dt): generate singleob = (_N==1)
*Find the first observation of an individual
bysort kidid (dt): generate firstob = (_n==1)
*Find the last observation of an individual
bysort kidid (dt): generate lastob = (_n==_N)
list, sepby(kidid)
+--------------------------------------------------------------------------+
| kidid dt female wt tv vac singleob firstob lastob |
|--------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 0 1 0 |
2. | 1 08jan2002 1 55 3 1 0 0 1 |
|--------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 1 1 1 |
|--------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 0 1 0 |
5. | 3 19jan2002 0 63 5 1 0 0 0 |
6. | 3 21jan2002 0 66 1 1 0 0 0 |
7. | 3 22jan2002 0 64 6 0 0 0 1 |
|--------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 0 1 0 |
9. | 4 11jan2002 1 58 1 0 0 0 0 |
10. | 4 13jan2002 1 55 4 0 0 0 1 |
+--------------------------------------------------------------------------+
We can also create binaries based on leads and lags. Here we look for a change in vac:
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate vacstart=(vac==1) & (vac[_n-1]==0)
bysort kidid (dt): generate vacend=(vac==1) & (vac[_n+1]==0)
list kidid dt vac*, sepby(kidid)
+---------------------------------------------+
| kidid dt vac vacstart vacend |
|---------------------------------------------|
1. | 1 07jan2002 1 0 0 |
2. | 1 08jan2002 1 0 0 |
|---------------------------------------------|
3. | 2 16jan2002 1 0 0 |
|---------------------------------------------|
4. | 3 18jan2002 0 0 0 |
5. | 3 19jan2002 1 1 0 |
6. | 3 21jan2002 1 0 1 |
7. | 3 22jan2002 0 0 0 |
|---------------------------------------------|
8. | 4 10jan2002 0 0 0 |
9. | 4 11jan2002 0 0 0 |
10. | 4 13jan2002 0 0 0 |
+---------------------------------------------+
Fill in Missing
Another useful tool, one to use with caution, is filling in missing values. This should really only be applied when we have a constant variable that does not change over time.
+--------------------------------------------+
| kidid dt female wt tv vac |
|--------------------------------------------|
1. | 1 07jan2002 1 53 1 1 |
2. | 1 08jan2002 1 55 3 1 |
|--------------------------------------------|
3. | 2 16jan2002 1 58 8 1 |
|--------------------------------------------|
4. | 3 18jan2002 0 60 2 0 |
5. | 3 19jan2002 0 . . . |
6. | 3 21jan2002 0 66 . 1 |
7. | 3 22jan2002 0 64 6 0 |
|--------------------------------------------|
8. | 4 10jan2002 1 62 7 0 |
9. | 4 11jan2002 1 58 . . |
10. | 4 13jan2002 1 . 4 0 |
+--------------------------------------------+
We can fill in a missing observation by carrying forward the last nonmissing observation. First, generate a copy of the variable that contains the missing values.
generate tvimp1 = tv
(3 missing values generated)
bysort kidid (dt): replace tvimp1 = tv[_n-1] if missing(tv)
(2 real changes made)
list kidid dt tv tvimp1, sepby(kidid)
+---------------------------------+
| kidid dt tv tvimp1 |
|---------------------------------|
1. | 1 07jan2002 1 1 |
2. | 1 08jan2002 3 3 |
|---------------------------------|
3. | 2 16jan2002 8 8 |
|---------------------------------|
4. | 3 18jan2002 2 2 |
5. | 3 19jan2002 . 2 |
6. | 3 21jan2002 . . |
7. | 3 22jan2002 6 6 |
|---------------------------------|
8. | 4 10jan2002 7 7 |
9. | 4 11jan2002 . 7 |
10. | 4 13jan2002 4 4 |
+---------------------------------+
Notice that we are still missing the 3rd observation for the 3rd kid. It cannot be filled from the 2nd observation, since the 2nd observation is itself missing. There are a couple of strategies to handle this. We can generate a new variable, as Mitchell does.
generate tvimp2 = tvimp1
(1 missing value generated)
bysort kidid (dt): replace tvimp2 = tvimp2[_n-1] if missing(tvimp2)
(1 real change made)
list kidid tv tvimp*, sepby(kidid)
+------------------------------+
| kidid tv tvimp1 tvimp2 |
|------------------------------|
1. | 1 1 1 1 |
2. | 1 3 3 3 |
|------------------------------|
3. | 2 8 8 8 |
|------------------------------|
4. | 3 2 2 2 |
5. | 3 . 2 2 |
6. | 3 . . 2 |
7. | 3 6 6 6 |
|------------------------------|
8. | 4 7 7 7 |
9. | 4 . 7 7 |
10. | 4 4 4 4 |
+------------------------------+
You can just replace tvimp1 twice instead of generating a new variable, but that is up to the user. You would use tv[_n-1] for the first replace and tvimp1[_n-1] for the second replace.
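Concretely, the replace-twice approach described above would look like this (a sketch, starting from a fresh copy of tv):
generate tvimp1 = tv
bysort kidid (dt): replace tvimp1 = tv[_n-1] if missing(tvimp1)
bysort kidid (dt): replace tvimp1 = tvimp1[_n-1] if missing(tvimp1)
The second replace can fill the gap the first one could not, because by then tvimp1[_n-1] already holds the carried-forward value.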
Interpolation
We may need to interpolate between 2 known values and assume a linear trend.
generate tvimp3 = tv
(3 missing values generated)
*Interpolate for 1 missing value between two known values
bysort kidid (dt): replace tvimp3 = (tv[_n-1]+tv[_n+1])/2 if missing(tv)
(1 real change made)
list kidid tv tvimp3, sepby(kidid)
+---------------------+
| kidid tv tvimp3 |
|---------------------|
1. | 1 1 1 |
2. | 1 3 3 |
|---------------------|
3. | 2 8 8 |
|---------------------|
4. | 3 2 2 |
5. | 3 . . |
6. | 3 . . |
7. | 3 6 6 |
|---------------------|
8. | 4 7 7 |
9. | 4 . 5.5 |
10. | 4 4 4 |
+---------------------+
This is a bit of hard coding, but you can interpolate across more than 1 missing value.
bysort kidid (dt): replace tvimp3 = ((tvimp3[4]-tvimp3[1])/3)+tvimp3[_n-1] if missing(tvimp3) & kidid==3
(2 real changes made)
list kidid tv tvimp3, sepby(kidid)
+------------------------+
| kidid tv tvimp3 |
|------------------------|
1. | 1 1 1 |
2. | 1 3 3 |
|------------------------|
3. | 2 8 8 |
|------------------------|
4. | 3 2 2 |
5. | 3 . 3.3333333 |
6. | 3 . 4.6666667 |
7. | 3 6 6 |
|------------------------|
8. | 4 7 7 |
9. | 4 . 5.5 |
10. | 4 4 4 |
+------------------------+
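Instead of hard coding, Stata's built-in ipolate command linearly interpolates (and, with the epolate option, extrapolates) using the actual spacing of the time variable. A sketch, where tvipo is a hypothetical variable name:
*Linearly interpolate tv over dt within each kid
bysort kidid: ipolate tv dt, gen(tvipo)
Note that ipolate uses the real date spacing, so with the gap on 20jan2002 for kid 3 it will give slightly different values than the equal-step calculation above.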
Indicators
What if we want to find outliers in time-varying differences? We can generate indicator variables to flag when a variable changes by more than a set limit. For example, suppose we want to know whether tv viewing habits drop by more than 2 hours.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate tvchange = tv[_n]-tv[_n-1]
(4 missing values generated)
bysort kidid (dt): generate tvchangerate = ((tv[_n]-tv[_n-1])/tv[_n-1])
(4 missing values generated)
list, sepby(kidid)
+--------------------------------------------------------------------+
| kidid dt female wt tv vac tvchange tvchange~e |
|--------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . . |
2. | 1 08jan2002 1 55 3 1 2 2 |
|--------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|--------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . . |
5. | 3 19jan2002 0 63 5 1 3 1.5 |
6. | 3 21jan2002 0 66 1 1 -4 -.8 |
7. | 3 22jan2002 0 64 6 0 5 5 |
|--------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . . |
9. | 4 11jan2002 1 58 1 0 -6 -.85714286 |
10. | 4 13jan2002 1 55 4 0 3 3 |
+--------------------------------------------------------------------+
Generate an indicator variable to see if tvchange is less than -2. This is not very helpful with small datasets, but it is important with larger datasets such as the CPS.
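The generating command does not appear in the notes; it was presumably something like the following, where tvchanged is a guess at the abbreviated column name tvchan~d shown below:
bysort kidid (dt): generate tvchanged = (tvchange < -2) if !missing(tvchange)
The if qualifier keeps the indicator missing wherever tvchange itself is missing, matching the listing.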
(4 missing values generated)
+-------------------------------------------------------------------------------+
| kidid dt female wt tv vac tvchange tvchange~e tvchan~d |
|-------------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . . . |
2. | 1 08jan2002 1 55 3 1 2 2 0 |
|-------------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . . |
|-------------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . . . |
5. | 3 19jan2002 0 63 5 1 3 1.5 0 |
6. | 3 21jan2002 0 66 1 1 -4 -.8 1 |
7. | 3 22jan2002 0 64 6 0 5 5 0 |
|-------------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . . . |
9. | 4 11jan2002 1 58 1 0 -6 -.85714286 1 |
10. | 4 13jan2002 1 55 4 0 3 3 0 |
+-------------------------------------------------------------------------------+
4.3 Comparing the by, tsset, xtset commands
As another way to compute differences within groups, we can use the tsset or xtset command to declare the data as a time series (tsset) or panel data (xtset). We can also continue to use bysort with subscripting/indexing.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate ltv = tv[_n-1]
(4 missing values generated)
list, sepby(kidid)
+--------------------------------------------------+
| kidid dt female wt tv vac ltv |
|--------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . |
2. | 1 08jan2002 1 55 3 1 1 |
|--------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . |
|--------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . |
5. | 3 19jan2002 0 63 5 1 2 |
6. | 3 21jan2002 0 66 1 1 5 |
7. | 3 22jan2002 0 64 6 0 1 |
|--------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . |
9. | 4 11jan2002 1 58 1 0 7 |
10. | 4 13jan2002 1 55 4 0 1 |
+--------------------------------------------------+
tsset
Instead, we can establish a time series with tsset.
We’ll need to specify that our cross-sectional group is kidid and that our date variable is dt. We’ll use the option daily to specify that the time period is daily, as opposed to weeks, months, or years. Or we can specify delta(1) for one day.
We can use the operator L.var to specify that we want a lag of 1 day.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
tsset kidid dt, daily delta(1)
panel variable: kidid (unbalanced)
time variable: dt, 07jan2002 to 22jan2002, but with gaps
delta: 1 day
generate lagtv = L.tv
(6 missing values generated)
We can use the operator F.var to specify that we want a lead of 1 day.
generate leadtv = F.tv
(6 missing values generated)
list, sepby(kidid)
+-------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv |
|-------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . |
|-------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|-------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . |
6. | 3 21jan2002 0 66 1 1 . 6 |
7. | 3 22jan2002 0 64 6 0 1 . |
|-------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . |
10. | 4 13jan2002 1 55 4 0 . . |
+-------------------------------------------------------------+
xtset
We can establish a panel series with xtset
We’ll need to specify that our cross-sectional group is kidid and that our time variable is dt. We’ll use delta(1) to specify that the difference is 1 day; or we can use the daily option as well.
Generate a lag with the l.var operator and a lead with the f.var operator.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
xtset kidid dt, daily delta(1)
panel variable: kidid (unbalanced)
time variable: dt, 07jan2002 to 22jan2002, but with gaps
delta: 1 day
generate lagtv = l.tv
(6 missing values generated)
generate leadtv = f.tv
(6 missing values generated)
What do you notice? Some leads and lags are missing. Why? Because the panel is unbalanced, and the one-day differences cannot be computed when days are missing. In this case, we can use bysort with subscripting/indexing instead.
list, sepby(kidid)
+-------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv |
|-------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . |
|-------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|-------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . |
6. | 3 21jan2002 0 66 1 1 . 6 |
7. | 3 22jan2002 0 64 6 0 1 . |
|-------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . |
10. | 4 13jan2002 1 55 4 0 . . |
+-------------------------------------------------------------+
bysort kidid (dt): gen bylagtv=tv[_n-1]
(4 missing values generated)
bysort kidid (dt): gen byleadtv=tv[_n+1]
(4 missing values generated)
list, sepby(kidid)
+----------------------------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv bylagtv byleadtv |
|----------------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . 1 . |
|----------------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . . . |
|----------------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . 2 1 |
6. | 3 21jan2002 0 66 1 1 . 6 5 6 |
7. | 3 22jan2002 0 64 6 0 1 . 1 . |
|----------------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . 7 4 |
10. | 4 13jan2002 1 55 4 0 . . 1 . |
+----------------------------------------------------------------------------------+
4.4 Exercises
Let’s grab the CPS and do some subgroup analysis. We will be using unweighted data for simplicity.
1. What are the average wages by sex?
2. What are the average wages by state?
3. What are the median wages by sex?
4. What are the median wages by state?
5. What is the 75th percentile of wages by race?
6. What is the 25th percentile of wages by marital status?
Hint: use bysort state: egen totalvar = total(var1)
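A sketch of the bysort/egen pattern for the first few exercises, assuming the CPS extract has variables named wage, sex, state, and race (the variable names are assumptions):
bysort sex: egen meanwage_sex = mean(wage)
bysort state: egen medwage_state = median(wage)
bysort race: egen p75wage_race = pctile(wage), p(75)
Each egen call assigns the group-level statistic to every observation in the group, just as with the tv examples above.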