Chapter 3 Labeling Data
Mitchell’s fifth chapter covers a lot of useful material, but the most important parts are label define (with its add and modify options), label values, and date formatting.
3.1 Describing datasets
We have seen the describe command before, but it is a very useful command to begin working with data. It provides the variable name, storage type, display format, value label, and variable label. Let’s get some data on the survey of graduate students.
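The commands that produce the output below are not shown, so here is a minimal sketch of what they likely look like. The dataset name survey7 is taken from the describe output, and the path is the author's working directory; adjust it for your own machine.

```stata
* Move to the folder that holds Mitchell's example datasets (author's path)
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
* Load the survey of graduate students and describe every variable
use survey7, clear
describe
```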
/Users/Sam/Desktop/Econ 645/Data/Mitchell
(Survey of graduate students)
Contains data from survey7.dta
obs: 8 Survey of graduate students
vars: 11 5 May 2020 14:37
size: 400 (_dta has notes)
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id float %9.0g Unique identification variable
STUDENTVARS float %9.0g STUDENT VARIABLES ===============
gender float %9.0g mf Gender of student
race float %19.0g racelab * Race of student
bday float %td.. Date of birth of student
income float %11.1fc Income of student
havechild float %18.0g havelab * Given birth to a child?
KIDVARS float %9.0g KID VARIABLES ===================
kidname str10 %-10s Name of child
ksex float %15.0g mfkid * Sex of child
kbday float %td.. Date of birth of child
* indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by:
We also have a short option, but it only reports general information about the dataset.
Contains data from survey7.dta
obs: 8 Survey of graduate students
vars: 11 5 May 2020 14:37
size: 400
Sorted by:
We can also describe just a subset of variables by listing them after describe.
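As a sketch, the short option and the variable subset could be requested as follows. The three variable names come from the output shown next, and survey7 is assumed to be in memory.

```stata
* General dataset information only
describe, short
* Describe only the listed variables
describe id gender race
```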
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id float %9.0g Unique identification variable
gender float %9.0g mf Gender of student
race float %19.0g racelab * Race of student
Finally, the command codebook provides a deep dive into your dataset. This is very useful for looking at the value labels. We only see the value label name in the describe command, but the codebook command provides more information, such as type of variable, label name, range of values, unique values, missing, value labels, missing value labels (if any).
id Unique identification variable
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,8] units: 1
unique values: 8 missing .: 0/8
tabulation: Freq. Value
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
----------------------------------------------------------------------------------------------
STUDENTVARS STUDENT VARIABLES ===============
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [.,.] units: .
unique values: 0 missing .: 8/8
tabulation: Freq. Value
8 .
----------------------------------------------------------------------------------------------
gender Gender of student
----------------------------------------------------------------------------------------------
type: numeric (float)
label: mf
range: [1,2] units: 1
unique values: 2 missing .: 0/8
tabulation: Freq. Numeric Label
3 1 Male
5 2 Female
----------------------------------------------------------------------------------------------
race Race of student
----------------------------------------------------------------------------------------------
type: numeric (float)
label: racelab
range: [1,5] units: 1
unique values: 5 missing .: 0/8
tabulation: Freq. Numeric Label
2 1 White
2 2 Asian
2 3 Hispanic
1 4 African American
1 5 Other
----------------------------------------------------------------------------------------------
bday Date of birth of student
----------------------------------------------------------------------------------------------
type: numeric daily date (float)
range: [389,7935] units: 1
or equivalently: [24jan1961,22sep1981] units: days
unique values: 8 missing .: 0/8
tabulation: Freq. Value
1 389 24jan1961
1 3027 15apr1968
1 4160 23may1971
1 4924 25jun1973
1 5036 15oct1973
1 6059 03aug1976
1 6391 01jul1977
1 7935 22sep1981
----------------------------------------------------------------------------------------------
income Income of student
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [545.23,1284354.5] units: .01
unique values: 8 missing .: 0/8
tabulation: Freq. Value
1 545.22998
1 4500.9199
1 10500.93
1 45234.129
1 109452.11
1 120102.32
1 124313.45
1 1284354.5
----------------------------------------------------------------------------------------------
havechild Given birth to a child?
----------------------------------------------------------------------------------------------
type: numeric (float)
label: havelab
range: [0,1] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 1 missing .*: 3/8
tabulation: Freq. Numeric Label
1 0 Dont Have Child
4 1 Have Child
3 .n NA
----------------------------------------------------------------------------------------------
KIDVARS KID VARIABLES ===================
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [.,.] units: .
unique values: 0 missing .: 8/8
tabulation: Freq. Value
8 .
----------------------------------------------------------------------------------------------
kidname Name of child
----------------------------------------------------------------------------------------------
type: string (str10), but longest is str9
unique values: 5 missing "": 0/8
tabulation: Freq. Value
4 ""
1 "Catherine"
1 "Robin"
1 "Sally"
1 "Samuell"
warning: variable has leading and trailing blanks
----------------------------------------------------------------------------------------------
ksex Sex of child
----------------------------------------------------------------------------------------------
type: numeric (float)
label: mfkid
range: [1,2] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 2 missing .*: 5/8
tabulation: Freq. Numeric Label
1 1 Male
2 2 Female
4 .n NA
1 .u Unknown
----------------------------------------------------------------------------------------------
kbday Date of birth of child
----------------------------------------------------------------------------------------------
type: numeric daily date (float)
range: [12888,15932] units: 1
or equivalently: [15apr1995,15aug2003] units: days
unique values: 4 missing .: 4/8
tabulation: Freq. Value
1 12888 15apr1995
1 14019 20may1998
1 14256 12jan1999
1 15932 15aug2003
4 . .
We can also restrict codebook to specific variables.
race Race of student
----------------------------------------------------------------------------------------------
type: numeric (float)
label: racelab
range: [1,5] units: 1
unique values: 5 missing .: 0/8
tabulation: Freq. Numeric Label
2 1 White
2 2 Asian
2 3 Hispanic
1 4 African American
1 5 Other
We can also display a variable together with its notes by using the notes option.
havechild Given birth to a child?
----------------------------------------------------------------------------------------------
type: numeric (float)
label: havelab
range: [0,1] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 1 missing .*: 3/8
tabulation: Freq. Numeric Label
1 0 Dont Have Child
4 1 Have Child
3 .n NA
havechild:
1. This variable measures whether a woman has given birth to a child, not just whether she
is a parent.
2. The .n (NA) missing code is used for males, because they cannot bear children.
3. The .u (Unknown) missing code for a female indicating it is unknown if she has a child.
We can look at the pattern of missing values with the mv option, which reports, for example, that ksex is missing whenever havechild is missing. I recommend that you don’t label the missing values unless it is absolutely necessary. Different types of missing values besides “.” cause problems down the road, especially with the marginsplot command.
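The codebook variants discussed in this section can be sketched as below. The variable names are the ones that appear in the outputs, and survey7 is assumed to be in memory.

```stata
* Codebook for every variable in the dataset
codebook
* Codebook for a single variable
codebook race
* Codebook plus any notes attached to the variable
codebook havechild, notes
* Codebook plus the pattern of missing values
codebook ksex, mv
```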
ksex Sex of child
----------------------------------------------------------------------------------------------
type: numeric (float)
label: mfkid
range: [1,2] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 2 missing .*: 5/8
tabulation: Freq. Numeric Label
1 1 Male
2 2 Female
4 .n NA
1 .u Unknown
missing values: havechild==mv --> ksex==mv
kbday==mv --> ksex==mv
If you are interested in labels in different languages, see page 112.
The lookfor command returns all variables whose name or variable label contains the search word. This is a bit redundant, since the same information is available in the Variables window, but it provides more space to look.
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
bday float %td.. Date of birth of student
havechild float %18.0g havelab * Given birth to a child?
kbday float %td.. Date of birth of child
We can also search the notes for a search word with notes search.
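The search word used for these outputs is not shown. The sketch below assumes it was birth, which would match the three variables listed by lookfor and the havechild note.

```stata
* Find variables whose name or label contains "birth" (assumed search word)
lookfor birth
* Search all notes for the same word
notes search birth
```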
havechild:
1. This variable measures whether a woman has given birth to a child, not just whether she
is a parent.
We can see the formats of the variables as well, both in how list displays the values and in the display format column of describe.
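A minimal sketch of the two commands behind the following output, assuming survey7 is still in memory:

```stata
* Values are displayed using each variable's display format
list income bday
* The display format itself is reported by describe
describe income bday
```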
| income bday |
|------------------------|
1. | 10,500.9 01/24/61 |
2. | 45,234.1 04/15/68 |
3. | 1,284,354.5 05/23/71 |
4. | 124,313.5 06/25/73 |
5. | 120,102.3 09/22/81 |
|------------------------|
6. | 545.2 10/15/73 |
7. | 109,452.1 07/01/77 |
8. | 4,500.9 08/03/76 |
+------------------------+
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
income float %11.1fc Income of student
bday float %td.. Date of birth of student
We can see that the format for income is %11.1fc and the format for bday is %td
3.2 Labeling variables
Labeling the variables is a very helpful shortcut to describe what the variables contain without having to go back to the data dictionary. Sometimes we want a short and concise label if we are exporting labels to regression tables, and sometimes we want longer variable labels to give us context for the variable.
Let’s get some data on graduate students and use the describe command.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
Contains data from survey1.dta
obs: 8
vars: 9 1 Jan 2010 12:13
size: 432
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id float %9.0g
gender float %9.0g
race float %9.0g
havechild float %9.0g
ksex float %9.0g
bdays str10 %10s
income float %9.0g
kbdays str10 %10s
kidname str10 %10s
----------------------------------------------------------------------------------------------
Sorted by:
We have no variable labels, so we will need to provide some so future users understand what the data are. We will use the label variable command to describe each variable.
label variable id "Identification variable"
label variable gender "Gender of the student"
label variable race "Race of the student"
label variable havechild "Given birth to a child"
label variable ksex "Sex of child"
label variable bdays "Birthday of student"
label variable income "Income of student"
label variable kbdays "Birthday of child"
label variable kidname "Name of child"
describe
Contains data from survey1.dta
obs: 8
vars: 9 1 Jan 2010 12:13
size: 432
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id float %9.0g Identification variable
gender float %9.0g Gender of the student
race float %9.0g Race of the student
havechild float %9.0g Given birth to a child
ksex float %9.0g Sex of child
bdays str10 %10s Birthday of student
income float %9.0g Income of student
kbdays str10 %10s Birthday of child
kidname str10 %10s Name of child
----------------------------------------------------------------------------------------------
Sorted by:
We can change a variable label simply by running the command again with the new variable label.
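As a sketch, relabeling id and saving the result could look like this. The new label text and the file name survey2 are read off the output below; the replace option is an assumption so the command can be rerun safely.

```stata
* Running label variable again overwrites the old label
label variable id "Unique identification variable"
describe
* Save the labeled data under a new name
save survey2, replace
```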
Contains data from survey1.dta
obs: 8
vars: 9 1 Jan 2010 12:13
size: 432
----------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
id float %9.0g Unique identification variable
gender float %9.0g Gender of the student
race float %9.0g Race of the student
havechild float %9.0g Given birth to a child
ksex float %9.0g Sex of child
bdays str10 %10s Birthday of student
income float %9.0g Income of student
kbdays str10 %10s Birthday of child
kidname str10 %10s Name of child
----------------------------------------------------------------------------------------------
Sorted by:
file survey2.dta saved
3.3 Labeling values
Labeling values is a very practical way of analyzing data without having to go back to the data dictionary. Labeling values requires two commands:
- label define to define a label that can be used across many different variables.
- label values to attach a defined label to a variable.
Labeling values is a bit different from labeling variables, since a label that has already been defined cannot simply be redefined. If you try to change a defined label, you will get an error unless you use the add, modify, or replace option.
Let’s look at our codebook
/Users/Sam/Desktop/Econ 645/Data/Mitchell
----------------------------------------------------------------------------------------------
id Unique identification variable
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,8] units: 1
unique values: 8 missing .: 0/8
tabulation: Freq. Value
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
----------------------------------------------------------------------------------------------
gender Gender of the student
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,2] units: 1
unique values: 2 missing .: 0/8
tabulation: Freq. Value
3 1
5 2
----------------------------------------------------------------------------------------------
race Race of the student
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,5] units: 1
unique values: 5 missing .: 0/8
tabulation: Freq. Value
2 1
2 2
2 3
1 4
1 5
----------------------------------------------------------------------------------------------
havechild Given birth to a child
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [0,1] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 1 missing .*: 3/8
tabulation: Freq. Value
1 0
4 1
3 .n
----------------------------------------------------------------------------------------------
ksex Sex of child
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,2] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 2 missing .*: 5/8
tabulation: Freq. Value
1 1
2 2
4 .n
1 .u
----------------------------------------------------------------------------------------------
bdays Birthday of student
----------------------------------------------------------------------------------------------
type: string (str10)
unique values: 8 missing "": 0/8
tabulation: Freq. Value
1 "1/24/1961"
1 "10/15/1973"
1 "4/15/1968"
1 "5/23/1971"
1 "6/25/1973"
1 "7/1/1977"
1 "8/3/1976"
1 "9/22/1981"
----------------------------------------------------------------------------------------------
income Income of student
----------------------------------------------------------------------------------------------
type: numeric (float)
range: [545.23,1284354.5] units: .01
unique values: 8 missing .: 0/8
tabulation: Freq. Value
1 545.22998
1 4500.9199
1 10500.93
1 45234.129
1 109452.11
1 120102.32
1 124313.45
1 1284354.5
----------------------------------------------------------------------------------------------
kbdays Birthday of child
----------------------------------------------------------------------------------------------
type: string (str10), but longest is str9
unique values: 5 missing "": 0/8
tabulation: Freq. Value
4 ""
1 "1/12/1999"
1 "4/15/1995"
1 "5/20/1998"
1 "8/15/2003"
warning: variable has leading and trailing blanks
----------------------------------------------------------------------------------------------
kidname Name of child
----------------------------------------------------------------------------------------------
type: string (str10), but longest is str9
unique values: 5 missing "": 0/8
tabulation: Freq. Value
4 ""
1 "Catherine"
1 "Robin"
1 "Sally"
1 "Samuell"
warning: variable has leading and trailing blanks
We have our variable labels from 5.3, but now we need to label the values so replicators can know what the data are without having to reference the data dictionary for every variable. We will use the label define command to create a new label.
First we need to define a label with label define
Next we need to label the values of the variable with label values command and we’ll look at the codebook again.
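The two steps just described can be sketched as follows. The label name racelabel and the four labels are taken from the codebook output below, where value 5 is still unlabeled.

```stata
* Step 1: define the value label (value 5 is left out for now)
label define racelabel 1 "White" 2 "Asian" 3 "Hispanic" 4 "Black"
* Step 2: attach the label to the variable
label values race racelabel
codebook race
```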
race Race of the student
----------------------------------------------------------------------------------------------
type: numeric (float)
label: racelabel, but 1 nonmissing value is not labeled
range: [1,5] units: 1
unique values: 5 missing .: 0/8
tabulation: Freq. Numeric Label
2 1 White
2 2 Asian
2 3 Hispanic
1 4 Black
1 5
We are still missing a value label for 5, which is Other, so we need to update our defined label racelabel. If we simply run label define racelabel again, Stata returns an error because the label is already defined. Instead, we can use the add option in label define to add the new value.
If we want to modify an existing label, we can use the modify option in label define.
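A short sketch of both options, using the label text that appears in the codebook output below:

```stata
* add: attach a label to a value that has none yet
label define racelabel 5 "Other", add
* modify: change the text of an existing value's label
label define racelabel 4 "African American", modify
codebook race
```

The add option refuses to overwrite an existing entry, while modify will both change existing entries and add new ones.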
race Race of the student
----------------------------------------------------------------------------------------------
type: numeric (float)
label: racelabel
range: [1,5] units: 1
unique values: 5 missing .: 0/8
tabulation: Freq. Numeric Label
2 1 White
2 2 Asian
2 3 Hispanic
1 4 African American
1 5 Other
Labeling missing is something that I don’t recommend, but we’ll show an example here.
label define mfkid 1 "Male" 2 "Female" .u "Unknown" .n "NA"
label values ksex mfkid
codebook ksex
label define havechildlabel 0 "Don't have a child" 1 "Have a child" .u "Unknown" .n "NA"
label values havechild havechildlabel
codebook havechild
ksex Sex of child
----------------------------------------------------------------------------------------------
type: numeric (float)
label: mfkid
range: [1,2] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 2 missing .*: 5/8
tabulation: Freq. Numeric Label
1 1 Male
2 2 Female
4 .n NA
1 .u Unknown
----------------------------------------------------------------------------------------------
havechild Given birth to a child
----------------------------------------------------------------------------------------------
type: numeric (float)
label: havechildlabel
range: [0,1] units: 1
unique values: 2 missing .: 0/8
unique mv codes: 1 missing .*: 3/8
tabulation: Freq. Numeric Label
1 0 Don't have a child
4 1 Have a child
3 .n NA
We can use the label list command to see the labels we have defined so far.
havechildlabel:
0 Don't have a child
1 Have a child
.n NA
.u Unknown
mfkid:
1 Male
2 Female
.n NA
.u Unknown
racelabel:
1 White
2 Asian
3 Hispanic
4 African American
5 Other
The numlabel command is an interesting command. It takes the guesswork out of knowing the numeric value of a category by prefixing each value label with its numeric value.
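A minimal sketch of adding the numeric prefixes, assuming the racelabel label defined earlier is attached to race:

```stata
* Prefix each label in racelabel with its numeric value, e.g. "1. White"
numlabel racelabel, add
label list racelabel
tabulate race
```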
racelabel:
1 1. White
2 2. Asian
3 3. Hispanic
4 4. African American
5 5. Other
Race of the |
student | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 25.00 25.00
2 | 2 25.00 50.00
3 | 2 25.00 75.00
4 | 1 12.50 87.50
5 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
And, if we don’t like it or don’t need it any more, we can remove the numeric values with the remove option.
Race of the |
student | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 25.00 25.00
2 | 2 25.00 50.00
3 | 2 25.00 75.00
4 | 1 12.50 87.50
5 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
We can add additional characters with numlabel as well, such as “#=” or “#)” with the mask option
Race of the |
student | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 25.00 25.00
2 | 2 25.00 50.00
3 | 2 25.00 75.00
4 | 1 12.50 87.50
5 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
We can remove the mask with remove plus the mask option
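The remove and mask steps described above can be sketched as one sequence. The "#=" mask is one of the two examples mentioned in the text; which mask the original run used is an assumption.

```stata
* Add the default prefix ("1. White"), then take it off again
numlabel racelabel, add
numlabel racelabel, remove
* Add a custom prefix ("1=White"); the # stands for the numeric value
numlabel racelabel, add mask("#=")
* A custom prefix must be removed with the same mask
numlabel racelabel, remove mask("#=")
```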
Race of the |
student | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 25.00 25.00
2 | 2 25.00 50.00
3 | 2 25.00 75.00
4 | 1 12.50 87.50
5 | 1 12.50 100.00
------------+-----------------------------------
Total | 8 100.00
3.4 Labeling utilities
Stata has label utilities to manage the defined labels. The first one is label dir, which shows the available label names in a quicker and more concise way than the codebook command.
I think label list will be the most useful command here.
Quick check of your label directory
/Users/Sam/Desktop/Econ 645/Data/Mitchell
mfkid
havechildlabel
racelabel
The label list command gives a more comprehensive view of your labels that includes the value labels associated with the value label name
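Both utilities take no arguments when we want every label, as in this short sketch:

```stata
* Names of all value labels in memory
label dir
* Names plus the value-to-text mapping of every label
label list
```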
mfkid:
1 Male
2 Female
.n NA
.u Unknown
havechildlabel:
0 Don't have a child
1 Have a child
.n NA
.u Unknown
racelabel:
1 White
2 Asian
3 Hispanic
4 African American
5 Other
The label save command will save your labels into a do file for future use. Our do file name is stated after the using statement.
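A sketch of the command, where surveylabels is a hypothetical file name since the original file name is not shown. Leaving out the label names saves every label in memory.

```stata
* Write label define commands for all labels to surveylabels.do
label save using surveylabels, replace
* To save only some labels, name them before using
label save havechildlabel racelabel using surveylabels, replace
```

Running the saved do-file in another session recreates the labels, which can then be attached with label values.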
label define havechildlabel 0 `"Don't have a child"', modify
label define havechildlabel 1 `"Have a child"', modify
label define havechildlabel .n `"NA"', modify
label define havechildlabel .u `"Unknown"', modify
label define racelabel 1 `"White"', modify
label define racelabel 2 `"Asian"', modify
label define racelabel 3 `"Hispanic"', modify
label define racelabel 4 `"African American"', modify
label define racelabel 5 `"Other"', modify
The labelbook command provides information similar to codebook but only for the labels that are defined.
value label havechildlabel
----------------------------------------------------------------------------------------------
values labels
range: [0,1] string length: [2,18]
N: 4 unique at full length: yes
gaps: no unique at length 12: yes
missing .*: 2 null string: no
leading/trailing blanks: no
numeric -> numeric: no
definition
0 Don't have a child
1 Have a child
.n NA
.u Unknown
variables: havechild
----------------------------------------------------------------------------------------------
value label mfkid
----------------------------------------------------------------------------------------------
values labels
range: [1,2] string length: [2,7]
N: 4 unique at full length: yes
gaps: no unique at length 12: yes
missing .*: 2 null string: no
leading/trailing blanks: no
numeric -> numeric: no
definition
1 Male
2 Female
.n NA
.u Unknown
variables: ksex
----------------------------------------------------------------------------------------------
value label racelabel
----------------------------------------------------------------------------------------------
values labels
range: [1,5] string length: [5,16]
N: 5 unique at full length: yes
gaps: no unique at length 12: yes
missing .*: 0 null string: no
leading/trailing blanks: no
numeric -> numeric: no
definition
1 White
2 Asian
3 Hispanic
4 African American
5 Other
variables: race
----------------------------------------------------------------------------------------------
value label racelabel
----------------------------------------------------------------------------------------------
values labels
range: [1,5] string length: [5,16]
N: 5 unique at full length: yes
gaps: no unique at length 12: yes
missing .*: 0 null string: no
leading/trailing blanks: no
numeric -> numeric: no
definition
1 White
2 Asian
3 Hispanic
4 African American
5 Other
variables: race
The problems option for labelbook alerts the user to any potential problems with the labels.
no potential problems in dataset survey3.dta
We can have a more detailed look with the detail and problems options.
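The labelbook variants in this section can be sketched as follows, assuming the three labels defined earlier are in memory:

```stata
* Report on every value label
labelbook
* Report on one value label
labelbook racelabel
* Check all labels for potential problems
labelbook, problems
* Detailed report plus the problem check for one label
labelbook racelabel, problems detail
```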
value label racelabel
----------------------------------------------------------------------------------------------
values labels
range: [1,5] string length: [5,16]
N: 5 unique at full length: yes
gaps: no unique at length 12: yes
missing .*: 0 null string: no
leading/trailing blanks: no
numeric -> numeric: no
definition
1 White
2 Asian
3 Hispanic
4 African American
5 Other
variables: race
no potential problems in dataset survey3.dta
3.5 Labeling variables and values in different languages
We will not be covering this, but if you are interested, please review pages 127-132.
3.6 Using Notes
The note command can be helpful for future users or for replicators. If you use the note command without specifying a variable, then it is a general note that will show up under the _dta notes. If you name a variable after note, like note var1:, then the note is attached to that variable.
Let’s add some general notes.
Adding TS to the end adds a timestamp, which is a nice feature.
note: The missing values for havechild and childage were coded using -1 and -2 but were converted to .n and .u TS
Let’s call our notes with the notes command.
_dta:
1. This was based on the dataset called survey1.txt
2. The missing values for havechild and childage were coded using -1 and -2 but were
converted to .n and .u 24 Sep 2025 17:33
Let’s add some notes to our variables
note race: The other category includes people who specified multiple races
note race: This is another note
note race: This is a third note
We can list the notes for just one variable.
race:
1. The other category includes people who specified multiple races
2. This is another note
3. This is a third note
Let’s say we added an unhelpful note. We can drop it with the notes drop command; here we want to drop the second note.
(1 note dropped)
race:
1. The other category includes people who specified multiple races
3. This is a third note
Notice that we have a gap in the sequence of numbering. We can fix that with the notes renumber command.
race:
1. The other category includes people who specified multiple races
2. This is a third note
We can also search notes with the notes search “string” command.
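The note-management steps in this section can be sketched together. The search text .u is an assumption chosen because it appears in the matching note; the original search word is not shown.

```stata
* Drop the second note attached to race, then list what is left
notes drop race in 2
notes race
* Close the gap in the numbering
notes renumber race
notes race
* Search every note for the text ".u" (assumed search text)
notes search .u
```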
_dta:
2. The missing values for havechild and childage were coded using -1 and -2 but were
converted to .n and .u 24 Sep 2025 17:33
3.7 Formatting the display of variables
Formatting data will be more common than you expect. It can be a pain when dealing with numbers in the millions or billions and you lack commas. We can format our data with the format command.
3.7.1 Format numerics
Let’s get our survey data and list the first 5 observations for id and income
Let’s look at the format of income.
cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use survey5, clear
describe income
list id income in 1/5
/Users/Sam/Desktop/Econ 645/Data/Mitchell
(Survey of graduate students)
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
income float %9.0g Income of student
+---------------+
| id income |
|---------------|
1. | 1 10500.93 |
2. | 2 45234.13 |
3. | 3 1284355 |
4. | 4 124313.5 |
5. | 5 120102.3 |
+---------------+
The format is %9.0g. Every format begins with %. The g stands for general: Stata uses a width of nine characters and decides for us the best way to display the values.
The manual is helpful for formatting: https://www.stata.com/manuals/dformat.pdf
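Example 1: format v1 %10.0g - Width of 10 characters; Stata decides how many decimals to show.
Example 2: format v2 %4.1f - Width of 4 characters, including the decimal point, with 1 decimal (for example, 12.3).
Example 3: format v3 %6.1fc - Width of 6 characters with 1 decimal and commas; the commas and the decimal point count toward the width.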
Let’s get more control over the income format and use the %w.df format. We want a total width of 12 with 2 decimal places, which leaves 9 characters to the left of the “.”, since the decimal point itself uses one.
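A minimal sketch of this step, assuming the format is %12.2f as the description suggests:

```stata
* Fixed format: total width 12, 2 decimal places
format income %12.2f
list income in 1/5
```

The format command changes only how values are displayed; the stored values are untouched.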
| income |
|------------|
1. | 10500.93 |
2. | 45234.13 |
3. | 1284354.50 |
4. | 124313.45 |
5. | 120102.32 |
+------------+
Notice that we now can see observation #3’s decimal places.
If we don’t care to see the decimal places (even though they are still stored), we can display zero decimals.
| income |
|---------|
1. | 10501 |
2. | 45234 |
3. | 1284354 |
4. | 124313 |
5. | 120102 |
+---------+
We can also show one decimal place.
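The two displays in this passage, zero decimals and then one decimal, can be sketched as below. The widths 7 and 9 are assumptions that fit the largest income shown; the original widths are not given.

```stata
* Zero decimal places (values are rounded for display only)
format income %7.0f
list income in 1/5
* One decimal place
format income %9.1f
list income in 1/5
```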
| income |
|-----------|
1. | 10500.9 |
2. | 45234.1 |
3. | 1284354.5 |
4. | 124313.5 |
5. | 120102.3 |
+-----------+
Now let’s add commas. Each comma uses one character of width, so we need to allow two additional characters for the two commas, and we’ll show two decimal places.
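As a sketch, adding the c suffix turns on the commas. The width of 12 is an assumption; it is just enough for the largest value, 1,284,354.50, which takes 12 characters.

```stata
* Commas plus two decimal places
format income %12.2fc
list income in 1/5
```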
| income |
|--------------|
1. | 10,500.93 |
2. | 45,234.13 |
3. | 1,284,354.50 |
4. | 124,313.45 |
5. | 120,102.32 |
+--------------+
3.7.2 Format Strings
Let’s use the format command with strings.
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------
kidname str10 %10s Name of child
The format is %10s, which is a (s)tring of 10 characters wide that is right-justified.
| kidname |
|-----------|
1. | |
2. | Sally |
3. | Catherine |
4. | |
5. | Samuell |
|-----------|
6. | |
7. | Robin |
8. | |
+-----------+
If we want to left-justify the string, we can add a ‘-’ between the % and the width.
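A short sketch of the left-justified format for kidname:

```stata
* The minus sign left-justifies the 10-character string display
format kidname %-10s
list kidname
```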
| kidname |
|-----------|
1. | |
2. | Sally |
3. | Catherine |
4. | |
5. | Samuell |
|-----------|
6. | |
7. | Robin |
8. | |
+-----------+
3.7.3 Format dates
Dates in Stata are a bit of a pain, so learning how to format the dates will be helpful in the future.
| bdays kbdays |
|------------------------|
1. | 1/24/1961 |
2. | 4/15/1968 4/15/1995 |
3. | 5/23/1971 8/15/2003 |
4. | 6/25/1973 |
5. | 9/22/1981 1/12/1999 |
|------------------------|
6. | 10/15/1973 |
7. | 7/1/1977 5/20/1998 |
8. | 8/3/1976 |
+------------------------+
Our birthdays are currently strings in a MM/DD/YYYY format. Let’s generate new variables with the date function. The date function converts a string that holds a date into a Stata date, but the result still needs to be formatted. The “MDY” argument tells Stata that the string is in Month-Day-Year order.
Let’s list the days.
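The conversion can be sketched as follows, using the variable names that appear in the listing below:

```stata
* Convert the string dates to numeric Stata dates
generate bday = date(bdays, "MDY")
generate kbday = date(kbdays, "MDY")
* Empty strings in kbdays become missing values in kbday
list bdays bday kbdays kbday
```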
| bdays bday kbdays kbday |
|---------------------------------------|
1. | 1/24/1961 389 . |
2. | 4/15/1968 3027 4/15/1995 12888 |
3. | 5/23/1971 4160 8/15/2003 15932 |
4. | 6/25/1973 4924 . |
5. | 9/22/1981 7935 1/12/1999 14256 |
|---------------------------------------|
6. | 10/15/1973 5036 . |
7. | 7/1/1977 6391 5/20/1998 14019 |
8. | 8/3/1976 6059 . |
+---------------------------------------+
The Stata dates are actually stored as the number of days from Jan 1, 1960. This method is convenient for the computer storing and performing date computations, but is difficult for us to read.
Let’s use the %td format - for example 01Jan2000
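A minimal sketch of applying the default daily date format to bday, while kbday is left unformatted for comparison:

```stata
* Display bday as a date such as 24jan1961
format bday %td
list bdays bday kbdays kbday
```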
     +--------------------------------------------+
     |      bdays        bday      kbdays   kbday |
     |--------------------------------------------|
  1. |  1/24/1961   24jan1961                   . |
  2. |  4/15/1968   15apr1968   4/15/1995   12888 |
  3. |  5/23/1971   23may1971   8/15/2003   15932 |
  4. |  6/25/1973   25jun1973                   . |
  5. |  9/22/1981   22sep1981   1/12/1999   14256 |
     |--------------------------------------------|
  6. | 10/15/1973   15oct1973                   . |
  7. |   7/1/1977   01jul1977   5/20/1998   14019 |
  8. |   8/3/1976   03aug1976                   . |
     +--------------------------------------------+
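A minimal sketch of the commands for this step, assuming bday already holds a Stata date. Only bday is formatted, which is why kbday still shows raw day counts:

```stata
* Apply the default daily date format to bday only
format bday %td
list bdays bday kbdays kbday
```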
Let’s use the %tdNN/DD/YY format. NN displays the month as 01-12 (nn displays 1-12), DD displays the day as 01-31, and YY displays the last two digits of the year.
     +-------------------------------------------+
     |      bdays       bday      kbdays   kbday |
     |-------------------------------------------|
  1. |  1/24/1961   01/24/61                   . |
  2. |  4/15/1968   04/15/68   4/15/1995   12888 |
  3. |  5/23/1971   05/23/71   8/15/2003   15932 |
  4. |  6/25/1973   06/25/73                   . |
  5. |  9/22/1981   09/22/81   1/12/1999   14256 |
     |-------------------------------------------|
  6. | 10/15/1973   10/15/73                   . |
  7. |   7/1/1977   07/01/77   5/20/1998   14019 |
  8. |   8/3/1976   08/03/76                   . |
     +-------------------------------------------+
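In command form, this step would look roughly like the following, again assuming bday is a numeric Stata date:

```stata
* Two-digit month, two-digit day, two-digit year, separated by slashes
format bday %tdNN/DD/YY
list bdays bday kbdays kbday
```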
Mon displays the abbreviated month name (Jan-Dec), and Month displays the full month name (January-December).
     +--------------------------------------------------+
     |      bdays              bday      kbdays   kbday |
     |--------------------------------------------------|
  1. |  1/24/1961     January/24/61                   . |
  2. |  4/15/1968       April/15/68   4/15/1995   12888 |
  3. |  5/23/1971         May/23/71   8/15/2003   15932 |
  4. |  6/25/1973        June/25/73                   . |
  5. |  9/22/1981   September/22/81   1/12/1999   14256 |
     |--------------------------------------------------|
  6. | 10/15/1973     October/15/73                   . |
  7. |   7/1/1977        July/01/77   5/20/1998   14019 |
  8. |   8/3/1976      August/03/76                   . |
     +--------------------------------------------------+
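The format string is not shown with this listing. Judging from the output (full month name, slashes, two-digit year), it was probably something like this sketch:

```stata
* Full month name in place of the numeric month
format bday %tdMonth/DD/YY
list bdays bday kbdays kbday
```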
We can display a standard Month DD, YYYY date with the format %tdMonth_DD,CCYY, where Month is the full name of the month, the underscore displays a space, DD is the two-digit day, CC is the century (such as 19 or 20), and YY is the two-digit year (such as 88 or 97).
     +----------------------------------------------------+
     |      bdays                bday      kbdays   kbday |
     |----------------------------------------------------|
  1. |  1/24/1961     January 24,1961                   . |
  2. |  4/15/1968       April 15,1968   4/15/1995   12888 |
  3. |  5/23/1971         May 23,1971   8/15/2003   15932 |
  4. |  6/25/1973        June 25,1973                   . |
  5. |  9/22/1981   September 22,1981   1/12/1999   14256 |
     |----------------------------------------------------|
  6. | 10/15/1973     October 15,1973                   . |
  7. |   7/1/1977        July 01,1977   5/20/1998   14019 |
  8. |   8/3/1976      August 03,1976                   . |
     +----------------------------------------------------+
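As a sketch, the commands for this listing would be along these lines (bday is assumed to be a numeric Stata date):

```stata
* Underscore prints a space; CCYY prints the full four-digit year
format bday %tdMonth_DD,CCYY
list bdays bday kbdays kbday
```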
Let’s use a standard MM/DD/YYYY display, but don’t write YYYY in the format: Stata reads it as YY twice, so it just repeats the two-digit year.
     +---------------------------------------------+
     |      bdays         bday      kbdays   kbday |
     |---------------------------------------------|
  1. |  1/24/1961   01/24/6161                   . |
  2. |  4/15/1968   04/15/6868   4/15/1995   12888 |
  3. |  5/23/1971   05/23/7171   8/15/2003   15932 |
  4. |  6/25/1973   06/25/7373                   . |
  5. |  9/22/1981   09/22/8181   1/12/1999   14256 |
     |---------------------------------------------|
  6. | 10/15/1973   10/15/7373                   . |
  7. |   7/1/1977   07/01/7777   5/20/1998   14019 |
  8. |   8/3/1976   08/03/7676                   . |
     +---------------------------------------------+
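The doubled years in this listing are what we would expect from a format like the one below. This is a sketch of the mistake, inferred from the output rather than copied from the original commands:

```stata
* YYYY is parsed as YY followed by YY, so 1961 displays as 6161
format bday %tdNN/DD/YYYY
list bdays bday kbdays kbday
```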
Use %tdNN/DD/CCYY instead for the desired result.
format bday %tdNN/DD/CCYY
list bdays bday kbdays kbday
label variable bday "Date of birth of student"
label variable kbdays "Date of birth of child"
     +---------------------------------------------+
     |      bdays         bday      kbdays   kbday |
     |---------------------------------------------|
  1. |  1/24/1961   01/24/1961                   . |
  2. |  4/15/1968   04/15/1968   4/15/1995   12888 |
  3. |  5/23/1971   05/23/1971   8/15/2003   15932 |
  4. |  6/25/1973   06/25/1973                   . |
  5. |  9/22/1981   09/22/1981   1/12/1999   14256 |
     |---------------------------------------------|
  6. | 10/15/1973   10/15/1973                   . |
  7. |   7/1/1977   07/01/1977   5/20/1998   14019 |
  8. |   8/3/1976   08/03/1976                   . |
     +---------------------------------------------+
bday and bdays are redundant, so we’ll only keep one.
3.8 Changing the order of variables in a dataset
I personally find it useful to reorder variables with the order command. This is especially true when working with panel data. I like to put the cross-sectional unit first (such as person id or firm id) and the time period second, so that our N and T are next to one another.
Let’s load our survey of graduate students and describe the dataset.
/Users/Sam/Desktop/Econ 645/Data/Mitchell
(Survey of graduate students)
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
havechild       float   %18.0g     havelab  * Given birth to a child?
ksex            float   %15.0g     mfkid    * Sex of child
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
kidname         str10   %-10s                 Name of child
kbday           double  %td
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by:
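The commands for this step are not shown. A minimal sketch, assuming the working directory is already the Mitchell data folder printed above:

```stata
* Load the survey data and show variable names, types, formats, and labels
use survey6, clear
describe
```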
We might want to group similar variables together. This is helpful when you have a large dataset with hundreds of variables, such as the CPS.
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
ksex            float   %15.0g     mfkid    * Sex of child
kidname         str10   %-10s                 Name of child
kbday           double  %td
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by:
Variables that we leave out of the order command keep their previous relative order and follow the variables we moved to the front.
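The reordered describe output is consistent with a command like the one below. The variable list is inferred from the output, so treat it as a sketch:

```stata
* Listed variables move to the front in this order;
* ksex, kidname, and kbday are not listed and keep their relative order
order id gender race bdays income havechild
describe
```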
With the before() option, we can move one or more variables in front of a specified variable. Let’s move kidname before ksex.
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
kidname         str10   %-10s                 Name of child
ksex            float   %15.0g     mfkid    * Sex of child
kbday           double  %td
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by:
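In command form, the move described above can be sketched as:

```stata
* Place kidname immediately in front of ksex; other variables stay put
order kidname, before(ksex)
describe
```

The after() option works the same way, so order kidname, after(havechild) would give the same result in this dataset.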
We can also position newly created variables with the before() and after() options of the generate command.
(8 missing values generated)
(8 missing values generated)
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:            11                          11 Mar 2024 14:40
 size:           544                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
STUDENTVARS     double  %10.0g
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
KIDSVARS        double  %10.0g
kidname         str10   %-10s                 Name of child
ksex            float   %15.0g     mfkid    * Sex of child
kbday           double  %td
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
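The generate commands themselves are not shown. Here is a sketch that would give this variable order. The double storage type is written explicitly to match the output above, which is an assumption, since the original may instead have relied on a default type of double:

```stata
* Empty placeholder variables that act as section dividers in the variable list
generate double STUDENTVARS = ., before(gender)
generate double KIDSVARS = ., before(kidname)
describe
```

Each generate reports (8 missing values generated) because every observation is set to missing.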
3.9 Practice
Let’s bring in the CPS:
use "/Users/Sam/Desktop/Econ 645/Data/CPS/jun23pub.dta", clear
- Generate a new variable from pemlr called employed where employed = 1 if the individual is employed (present or absent) and employed = 0 if the individual is unemployed. The value should be missing if the individual is not in the labor force.
- Label the variable “Currently employed”.
- Label the values: 0 “Not employed”, 1 “Employed”, and . “Not in the Labor Force”.
- Move the variable after pemlr.
- Generate a date that appends hrmonth (month of interview), the string “12”, and hryear4 (year of interview). We use 12 because the week of the 12th is the reference period.
- Now format the date so that it looks like 06/12/2023.