Chapter 4 Subgroup Functions Part II
4.1 Computing values within subgroups: Running sums
As we mentioned earlier, egen's sum() and generate's sum() give different results. egen sum() is an older synonym for egen total(), which can be confusing: when we use generate with sum(), we get a RUNNING sum, not a constant total.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*We can generate the tv running sum across all individuals over time
generate tvrunsum = sum(tv)
*We can generate the tv running sum within an individual's time period
bysort kidid (dt): generate bytvrunsum = sum(tv)
*We can generate the total sum within an individual's time period
bysort kidid (dt): egen bytvsum = total(tv)
*We can generate the total sum for all individuals over time
egen tvsum = total(tv)
*We can also calculate a running average
bysort kidid (dt): generate bytvrunavg = sum(tv)/_n
*We can compute the individual's average
bysort kidid (dt): egen bytvavg = mean(tv)
list kidid tv tvrunsum tvsum bytvrunsum bytvsum bytvrunavg bytvavg, sepby(kidid)
     +--------------------------------------------------------------------------+
     | kidid   tv   tvrunsum   tvsum   bytvru~m   bytvsum   bytvrun~g   bytvavg |
     |--------------------------------------------------------------------------|
  1. |     1    1          1      38          1         4           1         2 |
  2. |     1    3          4      38          4         4           2         2 |
     |--------------------------------------------------------------------------|
  3. |     2    8         12      38          8         8           8         8 |
     |--------------------------------------------------------------------------|
  4. |     3    2         14      38          2        14           2       3.5 |
  5. |     3    5         19      38          7        14         3.5       3.5 |
  6. |     3    1         20      38          8        14   2.6666667       3.5 |
  7. |     3    6         26      38         14        14         3.5       3.5 |
     |--------------------------------------------------------------------------|
  8. |     4    7         33      38          7        12           7         4 |
  9. |     4    1         34      38          8        12           4         4 |
 10. |     4    4         38      38         12        12           4         4 |
     +--------------------------------------------------------------------------+
4.2 Computing values within subgroups: More examples
There are other useful calculations we can do with subscripting/indexing. Some do overlap with egen, but it is helpful to know the differences.
Counting
Count the number of observations: this can be done with subscripting (_n, _N) or with egen's count(), depending on what we want.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*Generate the total observation count per individual, whether tv is missing or not:
bysort kidid (dt): generate idcount=_N
*Generate the observation count per individual, excluding missing values of tv:
bysort kidid (dt): egen idcount_nomiss = count(tv)
*Generate a running count of observations:
bysort kidid (dt): gen idruncount = _n
list, sepby(kidid)
+----------------------------------------------------------------------------+
| kidid dt female wt tv vac idcount idcoun~s idrunc~t |
|----------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 2 2 1 |
2. | 1 08jan2002 1 55 3 1 2 2 2 |
|----------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 1 1 1 |
|----------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 4 4 1 |
5. | 3 19jan2002 0 63 5 1 4 4 2 |
6. | 3 21jan2002 0 66 1 1 4 4 3 |
7. | 3 22jan2002 0 64 6 0 4 4 4 |
|----------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 3 3 1 |
9. | 4 11jan2002 1 58 1 0 3 3 2 |
10. | 4 13jan2002 1 55 4 0 3 3 3 |
+----------------------------------------------------------------------------+
Generate Binaries
We can generate binary variables to flag first and last observations, or the nth observation. This differs from the id counts above: here we generate indicators that equal 1 when the qualifier is true.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
*Find individuals with only one observation
bysort kidid (dt): generate singleob = (_N==1)
*Find the first observation of an individual
bysort kidid (dt): generate firstob = (_n==1)
*Find the last observation of an individual
bysort kidid (dt): generate lastob = (_n==_N)
list, sepby(kidid)
+--------------------------------------------------------------------------+
| kidid dt female wt tv vac singleob firstob lastob |
|--------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 0 1 0 |
2. | 1 08jan2002 1 55 3 1 0 0 1 |
|--------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 1 1 1 |
|--------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 0 1 0 |
5. | 3 19jan2002 0 63 5 1 0 0 0 |
6. | 3 21jan2002 0 66 1 1 0 0 0 |
7. | 3 22jan2002 0 64 6 0 0 0 1 |
|--------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 0 1 0 |
9. | 4 11jan2002 1 58 1 0 0 0 0 |
10. | 4 13jan2002 1 55 4 0 0 0 1 |
+--------------------------------------------------------------------------+
We can also create binaries based on leads and lags. Here we look for a change in vac:
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate vacstart=(vac==1) & (vac[_n-1]==0)
bysort kidid (dt): generate vacend=(vac==1) & (vac[_n+1]==0)
list kidid dt vac*, sepby(kidid)
+---------------------------------------------+
| kidid dt vac vacstart vacend |
|---------------------------------------------|
1. | 1 07jan2002 1 0 0 |
2. | 1 08jan2002 1 0 0 |
|---------------------------------------------|
3. | 2 16jan2002 1 0 0 |
|---------------------------------------------|
4. | 3 18jan2002 0 0 0 |
5. | 3 19jan2002 1 1 0 |
6. | 3 21jan2002 1 0 1 |
7. | 3 22jan2002 0 0 0 |
|---------------------------------------------|
8. | 4 10jan2002 0 0 0 |
9. | 4 11jan2002 0 0 0 |
10. | 4 13jan2002 0 0 0 |
+---------------------------------------------+
Fill in Missing
Another useful tool, one to use with caution, is filling in missing values. This should really only be applied when we have a constant variable that does not change over time.
+--------------------------------------------+
| kidid dt female wt tv vac |
|--------------------------------------------|
1. | 1 07jan2002 1 53 1 1 |
2. | 1 08jan2002 1 55 3 1 |
|--------------------------------------------|
3. | 2 16jan2002 1 58 8 1 |
|--------------------------------------------|
4. | 3 18jan2002 0 60 2 0 |
5. | 3 19jan2002 0 . . . |
6. | 3 21jan2002 0 66 . 1 |
7. | 3 22jan2002 0 64 6 0 |
|--------------------------------------------|
8. | 4 10jan2002 1 62 7 0 |
9. | 4 11jan2002 1 58 . . |
10. | 4 13jan2002 1 . 4 0 |
+--------------------------------------------+
We can fill in a missing observation by carrying forward the last nonmissing observation. First, generate a copy of the variable that contains the missing values.
generate tvimp1 = tv
(3 missing values generated)
bysort kidid (dt): replace tvimp1 = tv[_n-1] if missing(tv)
(2 real changes made)
list kidid dt tv tvimp1, sepby(kidid)
+---------------------------------+
| kidid dt tv tvimp1 |
|---------------------------------|
1. | 1 07jan2002 1 1 |
2. | 1 08jan2002 3 3 |
|---------------------------------|
3. | 2 16jan2002 8 8 |
|---------------------------------|
4. | 3 18jan2002 2 2 |
5. | 3 19jan2002 . 2 |
6. | 3 21jan2002 . . |
7. | 3 22jan2002 6 6 |
|---------------------------------|
8. | 4 10jan2002 7 7 |
9. | 4 11jan2002 . 7 |
10. | 4 13jan2002 4 4 |
+---------------------------------+
Notice that we are still missing the 3rd observation for the 3rd kid. It cannot be filled from the 2nd observation, since the 2nd observation is itself missing. There are a couple of strategies to handle this. We can generate a new variable, as Mitchell does.
generate tvimp2 = tvimp1
(1 missing value generated)
bysort kidid (dt): replace tvimp2 = tvimp2[_n-1] if missing(tvimp2)
(1 real change made)
list kidid tv tvimp*, sepby(kidid)
+------------------------------+
| kidid tv tvimp1 tvimp2 |
|------------------------------|
1. | 1 1 1 1 |
2. | 1 3 3 3 |
|------------------------------|
3. | 2 8 8 8 |
|------------------------------|
4. | 3 2 2 2 |
5. | 3 . 2 2 |
6. | 3 . . 2 |
7. | 3 6 6 6 |
|------------------------------|
8. | 4 7 7 7 |
9. | 4 . 7 7 |
10. | 4 4 4 4 |
+------------------------------+
You can just replace tvimp1 twice instead of generating a new variable, but that is up to the user. You would use tv[_n-1] for the first replace and tvimp1[_n-1] for the second replace.
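Concretely, the replace-twice approach described above would look like this (a sketch, starting from a fresh copy of tv):
generate tvimp1 = tv
bysort kidid (dt): replace tvimp1 = tv[_n-1] if missing(tvimp1)
bysort kidid (dt): replace tvimp1 = tvimp1[_n-1] if missing(tvimp1)
The second replace can fill the gap the first one could not, because by then tvimp1[_n-1] already holds the carried-forward value.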
Interpolation
We may need to interpolate between 2 known values and assume a linear trend.
generate tvimp3 = tv
(3 missing values generated)
*Interpolate for 1 missing value between two known values
bysort kidid (dt): replace tvimp3 = (tv[_n-1]+tv[_n+1])/2 if missing(tv)
(1 real change made)
list kidid tv tvimp3, sepby(kidid)
+---------------------+
| kidid tv tvimp3 |
|---------------------|
1. | 1 1 1 |
2. | 1 3 3 |
|---------------------|
3. | 2 8 8 |
|---------------------|
4. | 3 2 2 |
5. | 3 . . |
6. | 3 . . |
7. | 3 6 6 |
|---------------------|
8. | 4 7 7 |
9. | 4 . 5.5 |
10. | 4 4 4 |
+---------------------+
This is a bit of hard coding, but you can interpolate across more than 1 missing value.
bysort kidid (dt): replace tvimp3 = ((tvimp3[4]-tvimp3[1])/3)+tvimp3[_n-1] if missing(tvimp3) & kidid==3
(2 real changes made)
list kidid tv tvimp3, sepby(kidid)
+------------------------+
| kidid tv tvimp3 |
|------------------------|
1. | 1 1 1 |
2. | 1 3 3 |
|------------------------|
3. | 2 8 8 |
|------------------------|
4. | 3 2 2 |
5. | 3 . 3.3333333 |
6. | 3 . 4.6666667 |
7. | 3 6 6 |
|------------------------|
8. | 4 7 7 |
9. | 4 . 5.5 |
10. | 4 4 4 |
+------------------------+
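Instead of hard coding, Stata's built-in ipolate command linearly interpolates (and, with the epolate option, extrapolates) using the actual spacing of the time variable. A sketch, where tvipo is a hypothetical variable name:
*Linearly interpolate tv over dt within each kid
bysort kidid: ipolate tv dt, gen(tvipo)
Note that ipolate uses the real date spacing, so with the gap on 20jan2002 for kid 3 it will give slightly different values than the equal-step calculation above.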
Indicators
What if we want to find outliers in time-varying differences? We can generate indicator variables to flag when a variable changes by more than a set limit. For example, suppose we want to know whether tv viewing habits drop by more than 2 hours.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate tvchange = tv[_n]-tv[_n-1]
(4 missing values generated)
bysort kidid (dt): generate tvchangerate = ((tv[_n]-tv[_n-1])/tv[_n-1])
(4 missing values generated)
list, sepby(kidid)
+--------------------------------------------------------------------+
| kidid dt female wt tv vac tvchange tvchange~e |
|--------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . . |
2. | 1 08jan2002 1 55 3 1 2 2 |
|--------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|--------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . . |
5. | 3 19jan2002 0 63 5 1 3 1.5 |
6. | 3 21jan2002 0 66 1 1 -4 -.8 |
7. | 3 22jan2002 0 64 6 0 5 5 |
|--------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . . |
9. | 4 11jan2002 1 58 1 0 -6 -.85714286 |
10. | 4 13jan2002 1 55 4 0 3 3 |
+--------------------------------------------------------------------+
Generate an indicator variable to see if tvchange is less than -2. This is not very helpful with small datasets, but it is important with larger datasets such as the CPS.
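The generating command does not appear in the notes; it was presumably something like the following, where tvchanged is a guess at the abbreviated column name tvchan~d shown below:
bysort kidid (dt): generate tvchanged = (tvchange < -2) if !missing(tvchange)
The if qualifier keeps the indicator missing wherever tvchange itself is missing, matching the listing.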
(4 missing values generated)
+-------------------------------------------------------------------------------+
| kidid dt female wt tv vac tvchange tvchange~e tvchan~d |
|-------------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . . . |
2. | 1 08jan2002 1 55 3 1 2 2 0 |
|-------------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . . |
|-------------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . . . |
5. | 3 19jan2002 0 63 5 1 3 1.5 0 |
6. | 3 21jan2002 0 66 1 1 -4 -.8 1 |
7. | 3 22jan2002 0 64 6 0 5 5 0 |
|-------------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . . . |
9. | 4 11jan2002 1 58 1 0 -6 -.85714286 1 |
10. | 4 13jan2002 1 55 4 0 3 3 0 |
+-------------------------------------------------------------------------------+
4.3 Comparing the by, tsset, xtset commands
As another way to compute differences within groups, we can use the tsset or xtset command to declare the data as a time series (tsset) or panel data (xtset). We can also continue to use bysort with subscripting/indexing.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
bysort kidid (dt): generate ltv = tv[_n-1]
(4 missing values generated)
list, sepby(kidid)
+--------------------------------------------------+
| kidid dt female wt tv vac ltv |
|--------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . |
2. | 1 08jan2002 1 55 3 1 1 |
|--------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . |
|--------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . |
5. | 3 19jan2002 0 63 5 1 2 |
6. | 3 21jan2002 0 66 1 1 5 |
7. | 3 22jan2002 0 64 6 0 1 |
|--------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . |
9. | 4 11jan2002 1 58 1 0 7 |
10. | 4 13jan2002 1 55 4 0 1 |
+--------------------------------------------------+
tsset
Instead, we can establish a time series with tsset.
We’ll need to specify that our cross-sectional group is kidid and that our date variable is dt. We’ll use the option daily to specify that the time period is daily, as opposed to weeks, months, or years. Or we can specify delta(1) for one day.
We can use the operator L.var to specify that we want a lag of 1 day.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
tsset kidid dt, daily delta(1)
panel variable: kidid (unbalanced)
time variable: dt, 07jan2002 to 22jan2002, but with gaps
delta: 1 day
generate lagtv = L.tv
(6 missing values generated)
We can use the operator F.var to specify that we want a lead of 1 day.
generate leadtv = F.tv
(6 missing values generated)
list, sepby(kidid)
+-------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv |
|-------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . |
|-------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|-------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . |
6. | 3 21jan2002 0 66 1 1 . 6 |
7. | 3 22jan2002 0 64 6 0 1 . |
|-------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . |
10. | 4 13jan2002 1 55 4 0 . . |
+-------------------------------------------------------------+
xtset
We can establish a panel series with xtset
We’ll need to specify that our cross-sectional group is kidid and that our time variable is dt. We’ll use delta(1) to specify that the difference is 1 day; or we can use the daily option as well.
Generate a lag with the l.var operator and a lead with the f.var operator.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use tv1, clear
sort kidid dt
xtset kidid dt, daily delta(1)
panel variable: kidid (unbalanced)
time variable: dt, 07jan2002 to 22jan2002, but with gaps
delta: 1 day
generate lagtv = l.tv
(6 missing values generated)
generate leadtv = f.tv
(6 missing values generated)
What do you notice? Some leads and lags are missing. Why? Because the panel is unbalanced, and the one-day differences cannot be computed when days are missing. In this case, we can use bysort with subscripting/indexing instead.
list, sepby(kidid)
+-------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv |
|-------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . |
|-------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . |
|-------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . |
6. | 3 21jan2002 0 66 1 1 . 6 |
7. | 3 22jan2002 0 64 6 0 1 . |
|-------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . |
10. | 4 13jan2002 1 55 4 0 . . |
+-------------------------------------------------------------+
bysort kidid (dt): gen bylagtv=tv[_n-1]
(4 missing values generated)
bysort kidid (dt): gen byleadtv=tv[_n+1]
(4 missing values generated)
list, sepby(kidid)
+----------------------------------------------------------------------------------+
| kidid dt female wt tv vac lagtv leadtv bylagtv byleadtv |
|----------------------------------------------------------------------------------|
1. | 1 07jan2002 1 53 1 1 . 3 . 3 |
2. | 1 08jan2002 1 55 3 1 1 . 1 . |
|----------------------------------------------------------------------------------|
3. | 2 16jan2002 1 58 8 1 . . . . |
|----------------------------------------------------------------------------------|
4. | 3 18jan2002 0 60 2 0 . 5 . 5 |
5. | 3 19jan2002 0 63 5 1 2 . 2 1 |
6. | 3 21jan2002 0 66 1 1 . 6 5 6 |
7. | 3 22jan2002 0 64 6 0 1 . 1 . |
|----------------------------------------------------------------------------------|
8. | 4 10jan2002 1 62 7 0 . 1 . 1 |
9. | 4 11jan2002 1 58 1 0 7 . 7 4 |
10. | 4 13jan2002 1 55 4 0 . . 1 . |
+----------------------------------------------------------------------------------+
4.4 Exercises
Let’s grab the CPS and do some subgroup analysis. We will be using unweighted data for simplicity.
1. What are the average wages by sex?
2. What are the average wages by state?
3. What are the median wages by sex?
4. What are the median wages by state?
5. What is the 75th percentile of wages by race?
6. What is the 25th percentile of wages by marital status?
Hint: use bysort state: egen totalvar = total(var1)
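A sketch of the bysort/egen pattern for the first few exercises, assuming the CPS extract has variables named wage, sex, state, and race (the variable names are assumptions):
bysort sex: egen meanwage_sex = mean(wage)
bysort state: egen medwage_state = median(wage)
bysort race: egen p75wage_race = pctile(wage), p(75)
Each egen call assigns the group-level statistic to every observation in the group, just as with the tv examples above.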