Chapter 3 EGEN Command
3.1 EGEN Computations across variables
The egen command is something that I miss in other statistical packages since it is so useful. We can do computation across columns or we can do computation across observations with egen
For computations across columns/variables, we can look at row means, row mins, row maxs, row missing, row nonmissing, etc. On a personal note, I do not use computations across variables too often. If you have have panel data set, it should be in a long-form format instead of a wide-form format. When you use data in a long-form format, egen across variables has limited uses.
Let’s find the mean across pl1-pl5 with row mean instead of gen avgpl=(pl1+…+pl5)/5
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
use "cardio2miss.dta", clear
egen avgpl=rowmean(pl1-pl5)
list id pl1-pl5 avgpl/Users/Sam/Desktop/Econ 645/Data/Mitchell
+------------------------------------------+
| id pl1 pl2 pl3 pl4 pl5 avgpl |
|------------------------------------------|
1. | 1 54 115 87 86 93 87 |
2. | 2 92 123 88 136 125 112.8 |
3. | 3 105 .a 97 122 128 113 |
4. | 4 52 105 79 115 71 84.4 |
5. | 5 70 116 .a 128 52 91.5 |
+------------------------------------------+
rowmeans() will ignore the missing values. (I wish other statistical packages did this)
| id bp1 bp2 bp3 bp4 bp5 avgbp |
|------------------------------------------|
1. | 1 129 81 105 .b .b 105 |
2. | 2 107 87 111 58 120 96.6 |
3. | 3 101 57 109 68 112 89.4 |
4. | 4 121 106 129 39 137 106.4 |
5. | 5 112 68 125 59 111 95 |
+------------------------------------------+
rowmin() returns the row minimum, while rowmax() returns the row maximum.
| id bp1 bp2 bp3 bp4 bp5 avgbp maxbp minbp |
|----------------------------------------------------------|
1. | 1 129 81 105 .b .b 105 129 81 |
2. | 2 107 87 111 58 120 96.6 120 58 |
3. | 3 101 57 109 68 112 89.4 112 57 |
4. | 4 121 106 129 39 137 106.4 137 39 |
5. | 5 112 68 125 59 111 95 125 59 |
+----------------------------------------------------------+
We can find missing or not missing with the rowmiss() and rownonmiss()
Note: the ? operator differs from the wildcard * operator. The ? operator is a wildcard for 1 character in between so bp? will pick up bp1, bp2, bp3, bp4, and bp5, and bp? will exlude bpavg, bpmin, bpmax Examples: 1. my*var variables starting with my & ending with var with any number of other characters between. 2. my~var one variable starting with my & ending with var with any number of other characters between. 3. my?var variables starting with my & ending with var with one other character between.
Recap: 1. myvar is just one variable 2. myvar thisvar thatvar are three variables 3. myvar are all variables starting with myvar 4. var are all variables ending with var 5. myvar1-myvar6* includes myvar1, myvar2, …, myvar6 6. this-that includes variables this through that, inclusive
3.2 EGEN Computation across observations
EGEN and bysort are a powerful combination for working across groups of observations. This can be very helpful when working with panel data or time series by groups. We will sort by id and then time and perform our egen commands We have our egen sum, egen total, egen max, egen min, egen mean which will be the main egen commands.
Let’s use egen to find the average for groups. Egen without a bysort will return the mean for the entire column, which may or may not be helpful. If we want a mean within a group (or an individual’s panel data), we want to sort by groups (or individuals) first and then perform the egen mean.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
use "gasctrysmall.dta", clear
egen avggas = mean(gas)
bysort ctry: egen avggas_ctry = mean(gas)
list ctry year gas avggas avggas_ctry, sepby(ctry)/Users/Sam/Desktop/Econ 645/Data/Mitchell
+----------------------------------------+
| ctry year gas avggas avggas_~y |
|----------------------------------------|
1. | 1 1974 .78 .7675 .805 |
2. | 1 1975 .83 .7675 .805 |
|----------------------------------------|
3. | 2 1971 .69 .7675 .78333333 |
4. | 2 1971 .77 .7675 .78333333 |
5. | 2 1973 .89 .7675 .78333333 |
|----------------------------------------|
6. | 3 1974 .42 .7675 .42 |
|----------------------------------------|
7. | 4 1974 .82 .7675 .88 |
8. | 4 1975 .94 .7675 .88 |
+----------------------------------------+
Let’s get the max and min for each country.
We can count our observations.
| ctry year gas infl avggas avggas_~y mingas~y maxgas~y count_~y |
|--------------------------------------------------------------------------------|
1. | 1 1974 .78 1.32 .7675 .805 .78 .83 2 |
2. | 1 1975 .83 1.4 .7675 .805 .78 .83 2 |
|--------------------------------------------------------------------------------|
3. | 2 1971 .69 1.15 .7675 .78333333 .69 .89 3 |
4. | 2 1971 .77 1.15 .7675 .78333333 .69 .89 3 |
5. | 2 1973 .89 1.29 .7675 .78333333 .69 .89 3 |
|--------------------------------------------------------------------------------|
6. | 3 1974 .42 1.14 .7675 .42 .42 .42 1 |
|--------------------------------------------------------------------------------|
7. | 4 1974 .82 1.12 .7675 .88 .82 .94 2 |
8. | 4 1975 .94 1.18 .7675 .88 .82 .94 2 |
+--------------------------------------------------------------------------------+
We also have egen count(), egen iqr(), egen median(), and egen pctile(#,p(#)) See more egen commands:
request ignored because of batch mode
3.2.1 Subscripting
(Head start to Mitchell 8.4)
Bysort can be combined with indexes/subscripting to find the first, last, or # observaration within a group
Find first year and last year
| ctry year gas infl firstyr lastyr |
|---------------------------------------------|
1. | 1 1974 .78 1.32 1974 1975 |
2. | 1 1975 .83 1.4 1974 1975 |
|---------------------------------------------|
3. | 2 1971 .69 1.15 1971 1973 |
4. | 2 1971 .77 1.15 1971 1973 |
5. | 2 1973 .89 1.29 1971 1973 |
|---------------------------------------------|
6. | 3 1974 .42 1.14 1974 1974 |
|---------------------------------------------|
7. | 4 1974 .82 1.12 1974 1975 |
8. | 4 1975 .94 1.18 1974 1975 |
+---------------------------------------------+
Take a difference between periods (same as l.var)
(4 missing values generated)
+-------------------------------------------------------+
| ctry year gas infl firstyr lastyr diffgas |
|-------------------------------------------------------|
1. | 1 1974 .78 1.32 1974 1975 . |
2. | 1 1975 .83 1.4 1974 1975 .05 |
|-------------------------------------------------------|
3. | 2 1971 .69 1.15 1971 1973 . |
4. | 2 1971 .77 1.15 1971 1973 .08 |
5. | 2 1973 .89 1.29 1971 1973 .12 |
|-------------------------------------------------------|
6. | 3 1974 .42 1.14 1974 1974 . |
|-------------------------------------------------------|
7. | 4 1974 .82 1.12 1974 1975 . |
8. | 4 1975 .94 1.18 1974 1975 .12 |
+-------------------------------------------------------+
Create Indexes for Rate of change to base year
| ctry year gas infl firstyr lastyr diffgas gasindex |
|-------------------------------------------------------------------|
1. | 1 1974 .78 1.32 1974 1975 . 100 |
2. | 1 1975 .83 1.4 1974 1975 .05 106.41026 |
|-------------------------------------------------------------------|
3. | 2 1971 .69 1.15 1971 1973 . 100 |
4. | 2 1971 .77 1.15 1971 1973 .08 111.5942 |
5. | 2 1973 .89 1.29 1971 1973 .12 128.98551 |
|-------------------------------------------------------------------|
6. | 3 1974 .42 1.14 1974 1974 . 100 |
|-------------------------------------------------------------------|
7. | 4 1974 .82 1.12 1974 1975 . 100 |
8. | 4 1975 .94 1.18 1974 1975 .12 114.63415 |
+-------------------------------------------------------------------+
3.3 Converting Strings to numerics
These next two section have a lot of practical implications that you find when working with survey and especially administrative data. First we will cover converting numerical characters in strings to numerical data to analyze.
Let’s summarize our data, but it comes back blank.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
wt | 0
bp1 | 0
bp2 | 0
bp3 | 0
All of our numerical data are in string formats.
Contains data from cardio1str.dta
obs: 5
vars: 11 22 Dec 2009 19:51
size: 205
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id str1 %3s Identification variable
wt str5 %5s Weight of person
age str2 %2s Age of person
bp1 str3 %3s Systolic BP: Trial 1
bp2 str3 %3s Systolic BP: Trial 2
bp3 str3 %3s Systolic BP: Trial 3
pl1 str3 %3s Pulse: Trial 1
pl2 str2 %3s Pulse: Trial 2
pl3 str3 %3s Pulse: Trial 3
income str10 %10s Income
gender str6 %6s Gender of person
----------------------------------------------------------------------------------------------
Sorted by:
3.3.1 Destring
The destring command is our main command to convert numerical characters to numerics. With destring, we have to choose an option of generating a new variable or replace the current variable
age: all characters numeric; agen generated as byte
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
age | 0
agen | 5 37.4 10.83051 23 48
We can destring all of our numerics.
quietly {
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "cardio1str.dta", clear
}
destring id-income, replaceNotice that the replace failed for income and pl3. This is becuase pl3 has a letter, so all of the data are not all digits. Income has $ and , so it can not convert the data.
We can use the force option with pl3 to convert the X to missing.
| pl3 |
|-----|
1. | 93 |
2. | 125 |
3. | X |
4. | 71 |
5. | 52 |
+-----+
pl3: contains nonnumeric characters; replaced as int
(1 missing value generated)
+-----+
| pl3 |
|-----|
1. | 93 |
2. | 125 |
3. | . |
4. | 71 |
5. | 52 |
+-----+
We cannot use the force option with income, since it contains important information. We need to get rid of the two non-numerical characters of $ and ,
We have two options: eliminate the “$” and “,” or use the ignore option in destring. The latter option is easier.
| income |
|------------|
1. | $25,308.92 |
2. | $46,213.31 |
3. | $65,234.11 |
4. | $89,234.23 |
5. | $54,989.87 |
+------------+
income: characters $ , removed; replaced as double
+----------+
| income |
|----------|
1. | 25308.92 |
2. | 46213.31 |
3. | 65234.11 |
4. | 89234.23 |
5. | 54989.87 |
+----------+
3.3.2 Encode
Sometimes we have categorical variables or ID variables that come in as strings. We need to code them as categorical, but strings can be tricky with leading or lagging zeros or misspellings.
One option is to use the encode command, which will generate a new variable that codifies the string variable. This can be very helpful when our id variable in panel data are in string format.
Remember: trim any white space around the string just in case, 1. strtrim() - remove leading and lagging white space 2. strrtrim() - removes lagging white spaces 3. strltrim() - removes leading white spaces 4. stritrim() - removes leading, lagging, and consecutive white spaces
list id gender
replace gender = stritrim(gender)
encode gender, generate(gendercode)
list gender gendercode | id gender |
|-------------|
1. | 1 male |
2. | 2 male |
3. | 3 male |
4. | 4 male |
5. | 5 female |
+-------------+
(0 real changes made)
+-------------------+
| gender gender~e |
|-------------------|
1. | male male |
2. | male male |
3. | male male |
4. | male male |
5. | female female |
+-------------------+
They look the same, but let’s use tabulation
Gender of | Gender of person
person | 1 2 | Total
-----------+----------------------+----------
female | 1 0 | 1
male | 0 4 | 4
-----------+----------------------+----------
Total | 1 4 | 5
The encode creates label values around the new variable
gendercode Gender of person
----------------------------------------------------------------------------------------------
type: numeric (long)
label: gendercode
range: [1,2] units: 1
unique values: 2 missing .: 0/5
tabulation: Freq. Numeric Label
1 1 female
4 2 male
3.3.3 Converting numerics to strings
In my experience, it is more common to destring, but there are situtations where you need to convert a numeric to a string. The most common examples from me are zip codes and FIPS codes. We use the tostring command for this process.
As a side note, whenever working with or merging with state-level data, always, always use FIPS codes and not state names or state abbreviations. The probability of a problematic merge is much higher matching on strings than it is on numerics.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
+---------+
| zipcode |
|---------|
1. | 1003 |
2. | 90095 |
3. | 43409 |
4. | 23219 |
5. | 66214 |
+---------+
This zip code list is a problem and likely will not merge properly, since zipcodes that have a leading 0 will not match.
tostring zipcode, generate(zips)
replace zips = "0" + zips if strlen(zips) == 4
replace zips = "00" + zips if strlen(zips) == 3
list zipcode zipszips generated as str5
(1 real change made)
(0 real changes made)
+-----------------+
| zipcode zips |
|-----------------|
1. | 1003 01003 |
2. | 90095 90095 |
3. | 43409 43409 |
4. | 23219 23219 |
5. | 66214 66214 |
+-----------------+
I don’t recommend just formatting the data to 5 digits. The data should be formatted exactly for matching and merging between datasets on the key merging variable.
I recommend using tostring, but there is decode which is the opposite of encode. If we want to use the variable labels instead of the numeric for some reason, the decode command can be used.
| famhist famhists |
|--------------------|
1. | 0 No HD |
2. | 1 Yes HD |
3. | 0 No HD |
4. | 1 Yes HD |
5. | 1 Yes HD |
+--------------------+
----------------------------------------------------------------------------------------------
famhist Family history of heart disease
----------------------------------------------------------------------------------------------
type: numeric (long)
label: famhistl
range: [0,1] units: 1
unique values: 2 missing .: 0/5
tabulation: Freq. Numeric Label
2 0 No HD
3 1 Yes HD
----------------------------------------------------------------------------------------------
famhists Family history of heart disease
----------------------------------------------------------------------------------------------
type: string (str6)
unique values: 2 missing "": 0/5
tabulation: Freq. Value
2 "No HD"
3 "Yes HD"
warning: variable has embedded blanks
3.4 Renaming and reordering
Reordering and renaming seems straightforward enough, but there are some useful tips that will help consolidate your scripts. For example, you don’t need to a new rename command for every single variable to be renamed. We can use group renaming.
3.4.1 Rename
The rename command is straightfoward, and allows us to rename our variable(s) of interest.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
use "cardio2.dta", clear
describe
rename age age_yrs
describe/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from cardio2.dta
obs: 5
vars: 12 22 Dec 2009 19:51
size: 100
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id byte %3.0f Identification variable
age byte %3.0f Age of person
pl1 int %3.0f Pulse: Trial 1
bp1 int %3.0f Systolic BP: Trial 1
pl2 byte %3.0f Pulse: Trial 2
bp2 int %3.0f Systolic BP: Trial 2
pl3 int %3.0f Pulse: Trial 3
bp3 int %3.0f Systolic BP: Trial 3
pl4 int %3.0f Pulse: Trial 4
bp4 int %3.0f Systolic BP: Trial 4
pl5 byte %3.0f Pulse: Trial 5
bp5 int %3.0f Systolic BP: Trial 5
----------------------------------------------------------------------------------------------
Sorted by:
Contains data from cardio2.dta
obs: 5
vars: 12 22 Dec 2009 19:51
size: 100
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id byte %3.0f Identification variable
age_yrs byte %3.0f Age of person
pl1 int %3.0f Pulse: Trial 1
bp1 int %3.0f Systolic BP: Trial 1
pl2 byte %3.0f Pulse: Trial 2
bp2 int %3.0f Systolic BP: Trial 2
pl3 int %3.0f Pulse: Trial 3
bp3 int %3.0f Systolic BP: Trial 3
pl4 int %3.0f Pulse: Trial 4
bp4 int %3.0f Systolic BP: Trial 4
pl5 byte %3.0f Pulse: Trial 5
bp5 int %3.0f Systolic BP: Trial 5
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
We also can do bulk renames if the group of variables have common characters in the variable name. We can rename our pl group to pulse
Contains data from cardio2.dta
obs: 5
vars: 12 22 Dec 2009 19:51
size: 100
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id byte %3.0f Identification variable
age_yrs byte %3.0f Age of person
pulse1 int %3.0f Pulse: Trial 1
bp1 int %3.0f Systolic BP: Trial 1
pulse2 byte %3.0f Pulse: Trial 2
bp2 int %3.0f Systolic BP: Trial 2
pulse3 int %3.0f Pulse: Trial 3
bp3 int %3.0f Systolic BP: Trial 3
pulse4 int %3.0f Pulse: Trial 4
bp4 int %3.0f Systolic BP: Trial 4
pulse5 byte %3.0f Pulse: Trial 5
bp5 int %3.0f Systolic BP: Trial 5
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Where ? is an operator for just one character to be different.
An alternate way is to rename by grouping and it prevents any unintended renaming with the ? or * operators.
Contains data from cardio2.dta
obs: 5
vars: 12 22 Dec 2009 19:51
size: 100
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id byte %3.0f Identification variable
age_yrs byte %3.0f Age of person
pulse1 int %3.0f Pulse: Trial 1
bpress1 int %3.0f Systolic BP: Trial 1
pulse2 byte %3.0f Pulse: Trial 2
bpress2 int %3.0f Systolic BP: Trial 2
pulse3 int %3.0f Pulse: Trial 3
bpress3 int %3.0f Systolic BP: Trial 3
pulse4 int %3.0f Pulse: Trial 4
bpress4 int %3.0f Systolic BP: Trial 4
pulse5 byte %3.0f Pulse: Trial 5
bpress5 int %3.0f Systolic BP: Trial 5
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
3.4.2 Order
We can move our variables around with the order command. This command can be especially helpful with panel data, since we want the unit id and the time period to be next to one another.
Our variables are out of order and pl and bp alternate, but we may want to keep our pl variables and bp variables next to one another.
Reorder blood pressure and pulse
quietly {
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell/"
use "cardio2.dta", clear
rename age age_yrs
rename pl? pulse?
rename (bp1 bp2 bp3 bp4 bp5) (bpress1 bpress2 bpress3 bpress4 bpress5)
}
order id age_yrs bp* pul*Or,
Or,
If you generate a new variable, it will default to the end, but we may want it somewhere else. Let’s say we want age-squared, but we want age-squared to be next to age. We can use the after (or before) option in the generate command.
If you wanted to reorder the whole dataset alphabetically, then
Contains data from cardio2.dta
obs: 5
vars: 13 22 Dec 2009 19:51
size: 140
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
age2 double %10.0g
age_yrs byte %3.0f Age of person
bpress1 int %3.0f Systolic BP: Trial 1
bpress2 int %3.0f Systolic BP: Trial 2
bpress3 int %3.0f Systolic BP: Trial 3
bpress4 int %3.0f Systolic BP: Trial 4
bpress5 int %3.0f Systolic BP: Trial 5
id byte %3.0f Identification variable
pulse1 int %3.0f Pulse: Trial 1
pulse2 byte %3.0f Pulse: Trial 2
pulse3 int %3.0f Pulse: Trial 3
pulse4 int %3.0f Pulse: Trial 4
pulse5 byte %3.0f Pulse: Trial 5
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
There is a sequential option as well if your var1, …, vark are out of order
Contains data from cardio2.dta
obs: 5
vars: 13 22 Dec 2009 19:51
size: 140
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id byte %3.0f Identification variable
age_yrs byte %3.0f Age of person
pulse1 int %3.0f Pulse: Trial 1
pulse2 byte %3.0f Pulse: Trial 2
pulse3 int %3.0f Pulse: Trial 3
pulse4 int %3.0f Pulse: Trial 4
pulse5 byte %3.0f Pulse: Trial 5
age2 double %10.0g
bpress5 int %3.0f Systolic BP: Trial 5
bpress1 int %3.0f Systolic BP: Trial 1
bpress2 int %3.0f Systolic BP: Trial 2
bpress3 int %3.0f Systolic BP: Trial 3
bpress4 int %3.0f Systolic BP: Trial 4
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Contains data from cardio2.dta
obs: 5
vars: 13 22 Dec 2009 19:51
size: 140
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
bpress1 int %3.0f Systolic BP: Trial 1
bpress2 int %3.0f Systolic BP: Trial 2
bpress3 int %3.0f Systolic BP: Trial 3
bpress4 int %3.0f Systolic BP: Trial 4
bpress5 int %3.0f Systolic BP: Trial 5
id byte %3.0f Identification variable
age_yrs byte %3.0f Age of person
pulse1 int %3.0f Pulse: Trial 1
pulse2 byte %3.0f Pulse: Trial 2
pulse3 int %3.0f Pulse: Trial 3
pulse4 int %3.0f Pulse: Trial 4
pulse5 byte %3.0f Pulse: Trial 5
age2 double %10.0g
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.