Chapter 3 Labeling Data

Mitchell’s chapter 5 covers a lot of useful material, but the most important parts are label define (with its add and modify options), label values, and time formatting.

3.1 Describing datasets

We have seen the describe command before, but it is a very useful command when you begin working with data. It provides the variable name, storage type, display format, value label, and variable label. Let’s get some data on the survey of graduate students.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "survey7.dta", clear
describe
/Users/Sam/Desktop/Econ 645/Data/Mitchell

(Survey of graduate students)


Contains data from survey7.dta
  obs:             8                          Survey of graduate students
 vars:            11                          5 May 2020 14:37
 size:           400                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
STUDENTVARS     float   %9.0g                 STUDENT VARIABLES ===============
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bday            float   %td..                 Date of birth of student
income          float   %11.1fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
KIDVARS         float   %9.0g                 KID VARIABLES ===================
kidname         str10   %-10s                 Name of child
ksex            float   %15.0g     mfkid    * Sex of child
kbday           float   %td..                 Date of birth of child
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by: 

We also have a short option, but it only contains general information about the dataset.

describe, short
Contains data from survey7.dta
  obs:             8                          Survey of graduate students
 vars:            11                          5 May 2020 14:37
 size:           400                          
Sorted by: 

We can also describe only a subset of the variables.

describe id gender race
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
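
As a small sketch of these describe variants, the example below builds a hypothetical three-variable toy dataset (not the survey file) so that it can run on its own. It also uses the simple option and the r(N) and r(k) results that describe leaves behind, neither of which is covered above.

```stata
* Toy data (hypothetical values), built only for illustration
clear
set obs 8
generate id = _n
generate gender = 1 + mod(_n, 2)
generate race = 1 + mod(_n, 5)
label variable id "Unique identification variable"

describe, simple        // variable names only
describe id gender      // full detail for a subset of variables
describe, short         // general information about the dataset only

* describe stores the number of observations and variables in r()
display "observations: " r(N) "   variables: " r(k)
```

The stored results are handy in do-files when we want to check that a dataset has the size we expect before continuing.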

Finally, the codebook command provides a deep dive into your dataset. This is very useful for looking at the value labels. We only see the value label name in the describe command, but the codebook command provides more information, such as the type of variable, label name, range of values, number of unique values, number of missing values, value labels, and missing value labels (if any).

codebook
id                                                              Unique identification variable
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,8]                        units:  1
         unique values:  8                        missing .:  0/8

            tabulation:  Freq.  Value
                             1  1
                             1  2
                             1  3
                             1  4
                             1  5
                             1  6
                             1  7
                             1  8

----------------------------------------------------------------------------------------------
STUDENTVARS                                                  STUDENT VARIABLES ===============
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [.,.]                        units:  .
         unique values:  0                        missing .:  8/8

            tabulation:  Freq.  Value
                             8  .

----------------------------------------------------------------------------------------------
gender                                                                       Gender of student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  mf

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8

            tabulation:  Freq.   Numeric  Label
                             3         1  Male
                             5         2  Female

----------------------------------------------------------------------------------------------
race                                                                           Race of student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  racelab

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  0/8

            tabulation:  Freq.   Numeric  Label
                             2         1  White
                             2         2  Asian
                             2         3  Hispanic
                             1         4  African American
                             1         5  Other

----------------------------------------------------------------------------------------------
bday                                                                  Date of birth of student
----------------------------------------------------------------------------------------------

                  type:  numeric daily date (float)

                 range:  [389,7935]                   units:  1
       or equivalently:  [24jan1961,22sep1981]        units:  days
         unique values:  8                        missing .:  0/8

            tabulation:  Freq.  Value
                             1  389    24jan1961
                             1  3027   15apr1968
                             1  4160   23may1971
                             1  4924   25jun1973
                             1  5036   15oct1973
                             1  6059   03aug1976
                             1  6391   01jul1977
                             1  7935   22sep1981

----------------------------------------------------------------------------------------------
income                                                                       Income of student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [545.23,1284354.5]           units:  .01
         unique values:  8                        missing .:  0/8

            tabulation:  Freq.  Value
                             1  545.22998
                             1  4500.9199
                             1  10500.93
                             1  45234.129
                             1  109452.11
                             1  120102.32
                             1  124313.45
                             1  1284354.5

----------------------------------------------------------------------------------------------
havechild                                                              Given birth to a child?
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  havelab

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  1                       missing .*:  3/8

            tabulation:  Freq.   Numeric  Label
                             1         0  Dont Have Child
                             4         1  Have Child
                             3        .n  NA

----------------------------------------------------------------------------------------------
KIDVARS                                                      KID VARIABLES ===================
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [.,.]                        units:  .
         unique values:  0                        missing .:  8/8

            tabulation:  Freq.  Value
                             8  .

----------------------------------------------------------------------------------------------
kidname                                                                          Name of child
----------------------------------------------------------------------------------------------

                  type:  string (str10), but longest is str9

         unique values:  5                        missing "":  0/8

            tabulation:  Freq.  Value
                             4  ""
                             1  "Catherine"
                             1  "Robin"
                             1  "Sally"
                             1  "Samuell"

               warning:  variable has leading and trailing blanks

----------------------------------------------------------------------------------------------
ksex                                                                              Sex of child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  mfkid

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  2                       missing .*:  5/8

            tabulation:  Freq.   Numeric  Label
                             1         1  Male
                             2         2  Female
                             4        .n  NA
                             1        .u  Unknown

----------------------------------------------------------------------------------------------
kbday                                                                   Date of birth of child
----------------------------------------------------------------------------------------------

                  type:  numeric daily date (float)

                 range:  [12888,15932]                units:  1
       or equivalently:  [15apr1995,15aug2003]        units:  days
         unique values:  4                        missing .:  4/8

            tabulation:  Freq.  Value
                             1  12888  15apr1995
                             1  14019  20may1998
                             1  14256  12jan1999
                             1  15932  15aug2003
                             4  .              .

We can also run codebook on specific variables.

codebook race
race                                                                           Race of student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  racelab

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  0/8

            tabulation:  Freq.   Numeric  Label
                             2         1  White
                             2         2  Asian
                             2         3  Hispanic
                             1         4  African American
                             1         5  Other

We can also display the notes attached to a variable with the notes option.

codebook havechild, notes
havechild                                                              Given birth to a child?
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  havelab

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  1                       missing .*:  3/8

            tabulation:  Freq.   Numeric  Label
                             1         0  Dont Have Child
                             4         1  Have Child
                             3        .n  NA

havechild:
  1.  This variable measures whether a woman has given birth to a child, not just whether she
      is a parent.
  2.  The .n (NA) missing code is used for males, because they cannot bear children.
  3.  The .u (Unknown) missing code for a female indicating it is unknown if she has a child.

We can look at the pattern of missing values with the mv option, which reports when missing values in other variables always coincide with missing values in this variable. I recommend that you don’t label the missing values unless it is absolutely necessary. Different types of missing values besides “.” cause problems down the road, especially with the marginsplot command.

codebook ksex, mv
ksex                                                                              Sex of child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  mfkid

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  2                       missing .*:  5/8

            tabulation:  Freq.   Numeric  Label
                             1         1  Male
                             2         2  Female
                             4        .n  NA
                             1        .u  Unknown

        missing values:     havechild==mv --> ksex==mv
                                kbday==mv --> ksex==mv
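
One reason extended missing values cause trouble can be sketched with a hypothetical toy variable (these are not the survey values). The codebook output above reports "missing ." and "missing .*" separately, and the same distinction shows up in if conditions: comparing with == . only catches the system missing value, while missing() catches all of them.

```stata
* Toy data (hypothetical values), built only for illustration
clear
input id ksex
1 1
2 2
3 .n
4 .u
5 .
end

count if ksex == .        // 1: only the system missing value
count if missing(ksex)    // 3: ., .n, and .u
count if ksex >= .        // 3: every missing value sorts above all numbers
count if ksex > 1         // 4: one real value (2) plus the three missing values
```

The last line is the usual pitfall: a condition such as ksex > 1 silently includes missing values unless we add & !missing(ksex).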

If you are interested in labeling data in different languages, see page 112.

The lookfor command returns all variables whose name or variable label contains the search word. This is a bit redundant, since the same information is available in the variable window, but it provides more space to look.

lookfor birth
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
bday            float   %td..                 Date of birth of student
havechild       float   %18.0g     havelab  * Given birth to a child?
kbday           float   %td..                 Date of birth of child
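
A minimal sketch of how lookfor can feed other commands, assuming a hypothetical toy dataset whose labels mimic the survey data: lookfor leaves the matching variable names in r(varlist), so we can pass them on without retyping them.

```stata
* Toy data with labels that mimic the survey (values are hypothetical)
clear
set obs 1
generate bday = 0
generate havechild = 1
generate income = 100
label variable bday "Date of birth of student"
label variable havechild "Given birth to a child?"
label variable income "Income of student"

lookfor birth                 // searches variable names and variable labels
local found `r(varlist)'      // save the matches before another command overwrites r()
display "`found'"             // bday havechild
summarize `found'
```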

We can also search the notes for a word with notes search.

notes search birth
havechild:
  1.  This variable measures whether a woman has given birth to a child, not just whether she
      is a parent.

We can see the display formats of the variables by combining list and describe.

list income bday
describe income bday
     |      income       bday |
     |------------------------|
  1. |    10,500.9   01/24/61 |
  2. |    45,234.1   04/15/68 |
  3. | 1,284,354.5   05/23/71 |
  4. |   124,313.5   06/25/73 |
  5. |   120,102.3   09/22/81 |
     |------------------------|
  6. |       545.2   10/15/73 |
  7. |   109,452.1   07/01/77 |
  8. |     4,500.9   08/03/76 |
     +------------------------+

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
income          float   %11.1fc               Income of student
bday            float   %td..                 Date of birth of student

We can see that the format for income is %11.1fc (11 characters wide, one decimal place, with commas) and that bday uses a %td daily date format.
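
Display formats can be changed with the format command. The sketch below uses two rows of hypothetical toy data whose date values (389 and 4160) are taken from the codebook output above. The specific date format %tdNN/DD/YY is an assumption chosen to reproduce the month/day/year look of the listing, since describe abbreviates the actual format.

```stata
* Toy data: two income values and two daily dates (days since 01jan1960)
clear
input income bday
10500.93 389
1284354.5 4160
end

format income %11.1fc        // commas and one decimal place
format bday %td              // default daily date format
list                         // bday appears as 24jan1961 and 23may1971

format bday %tdNN/DD/YY      // month/day/two-digit year
list                         // bday appears as 01/24/61 and 05/23/71

* The same formats can be applied to a single number with display
display %td 389
display %tdNN/DD/YY 389
```

Changing the format only changes how values are shown; the underlying numbers stay the same.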

3.2 Labeling variables

Labeling variables is a very helpful shortcut for describing what a variable contains without having to go back to the data dictionary. Sometimes we want a short and concise label if we are exporting labels to regression tables, and sometimes we want a longer variable label to give us context for the variable.

Let’s get some data on graduate students and use the describe command.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "survey1.dta", clear
describe 
/Users/Sam/Desktop/Econ 645/Data/Mitchell



Contains data from survey1.dta
  obs:             8                          
 vars:             9                          1 Jan 2010 12:13
 size:           432                          
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 
gender          float   %9.0g                 
race            float   %9.0g                 
havechild       float   %9.0g                 
ksex            float   %9.0g                 
bdays           str10   %10s                  
income          float   %9.0g                 
kbdays          str10   %10s                  
kidname         str10   %10s                  
----------------------------------------------------------------------------------------------
Sorted by: 

We have no variable labels, so we will need to provide some so that future users understand what the data are. We will use the label variable command to describe each variable.

label variable id "Identification variable"
label variable gender "Gender of the student"
label variable race "Race of the student"
label variable havechild "Given birth to a child"
label variable ksex "Sex of child"
label variable bdays "Birthday of student"
label variable income "Income of student"
label variable kbdays "Birthday of child"
label variable kidname "Name of child"
describe
Contains data from survey1.dta
  obs:             8                          
 vars:             9                          1 Jan 2010 12:13
 size:           432                          
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Identification variable
gender          float   %9.0g                 Gender of the student
race            float   %9.0g                 Race of the student
havechild       float   %9.0g                 Given birth to a child
ksex            float   %9.0g                 Sex of child
bdays           str10   %10s                  Birthday of student
income          float   %9.0g                 Income of student
kbdays          str10   %10s                  Birthday of child
kidname         str10   %10s                  Name of child
----------------------------------------------------------------------------------------------
Sorted by: 

We can change a variable label simply by running the command again with the new variable label.

label variable id "Unique identification variable"
describe
save survey2, replace
Contains data from survey1.dta
  obs:             8                          
 vars:             9                          1 Jan 2010 12:13
 size:           432                          
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g                 Gender of the student
race            float   %9.0g                 Race of the student
havechild       float   %9.0g                 Given birth to a child
ksex            float   %9.0g                 Sex of child
bdays           str10   %10s                  Birthday of student
income          float   %9.0g                 Income of student
kbdays          str10   %10s                  Birthday of child
kidname         str10   %10s                  Name of child
----------------------------------------------------------------------------------------------
Sorted by: 

file survey2.dta saved
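
A short sketch of the same idea on a hypothetical one-variable toy dataset. It also reads the label back with the extended macro function `: variable label`, which is not covered above but is useful when we want to reuse a label in a graph title or a table.

```stata
* Toy data (hypothetical), built only for illustration
clear
set obs 1
generate id = 1

label variable id "Identification variable"
local lbl : variable label id
display "`lbl'"               // Identification variable

* Running label variable again overwrites the label; no option is needed
label variable id "Unique identification variable"
local lbl : variable label id
display "`lbl'"               // Unique identification variable
```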

3.3 Labeling values

Labeling values is a very practical way of analyzing data without having to go back to the data dictionary. Labeling values requires two commands:

  1. label define to define a label that can be used across many different variables.
  2. label values to attach a defined label to a variable.

Labeling values is a bit different from labeling variables: once a label has been defined, we cannot simply run label define again to change it. If you try to redefine an existing label, you will get an error unless you use the add, modify, or replace option.
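
The two steps and the error can be sketched on a hypothetical toy dataset. The variable names q1 and q2 and the label name yesno are made up for illustration.

```stata
* Toy data (hypothetical), built only for illustration
clear
input q1 q2
1 0
0 1
2 1
end

* Step 1: define the label once
label define yesno 0 "No" 1 "Yes"

* Step 2: attach the same label to several variables
label values q1 q2 yesno

* Defining it again without an option fails because yesno already exists
capture noisily label define yesno 0 "No" 1 "Yes" 2 "Maybe"
local rc = _rc
display "return code: `rc'"   // nonzero return code

* The add option lets us add the new value to the existing label
label define yesno 2 "Maybe", add
label list yesno
```

capture noisily is used here only so the do-file keeps running after the error message is shown.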

Let’s look at our codebook.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use survey2, clear
codebook 
/Users/Sam/Desktop/Econ 645/Data/Mitchell



----------------------------------------------------------------------------------------------
id                                                              Unique identification variable
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,8]                        units:  1
         unique values:  8                        missing .:  0/8

            tabulation:  Freq.  Value
                             1  1
                             1  2
                             1  3
                             1  4
                             1  5
                             1  6
                             1  7
                             1  8

----------------------------------------------------------------------------------------------
gender                                                                   Gender of the student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8

            tabulation:  Freq.  Value
                             3  1
                             5  2

----------------------------------------------------------------------------------------------
race                                                                       Race of the student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  0/8

            tabulation:  Freq.  Value
                             2  1
                             2  2
                             2  3
                             1  4
                             1  5

----------------------------------------------------------------------------------------------
havechild                                                               Given birth to a child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  1                       missing .*:  3/8

            tabulation:  Freq.  Value
                             1  0
                             4  1
                             3  .n

----------------------------------------------------------------------------------------------
ksex                                                                              Sex of child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  2                       missing .*:  5/8

            tabulation:  Freq.  Value
                             1  1
                             2  2
                             4  .n
                             1  .u

----------------------------------------------------------------------------------------------
bdays                                                                      Birthday of student
----------------------------------------------------------------------------------------------

                  type:  string (str10)

         unique values:  8                        missing "":  0/8

            tabulation:  Freq.  Value
                             1  "1/24/1961"
                             1  "10/15/1973"
                             1  "4/15/1968"
                             1  "5/23/1971"
                             1  "6/25/1973"
                             1  "7/1/1977"
                             1  "8/3/1976"
                             1  "9/22/1981"

----------------------------------------------------------------------------------------------
income                                                                       Income of student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [545.23,1284354.5]           units:  .01
         unique values:  8                        missing .:  0/8

            tabulation:  Freq.  Value
                             1  545.22998
                             1  4500.9199
                             1  10500.93
                             1  45234.129
                             1  109452.11
                             1  120102.32
                             1  124313.45
                             1  1284354.5

----------------------------------------------------------------------------------------------
kbdays                                                                       Birthday of child
----------------------------------------------------------------------------------------------

                  type:  string (str10), but longest is str9

         unique values:  5                        missing "":  0/8

            tabulation:  Freq.  Value
                             4  ""
                             1  "1/12/1999"
                             1  "4/15/1995"
                             1  "5/20/1998"
                             1  "8/15/2003"

               warning:  variable has leading and trailing blanks

----------------------------------------------------------------------------------------------
kidname                                                                          Name of child
----------------------------------------------------------------------------------------------

                  type:  string (str10), but longest is str9

         unique values:  5                        missing "":  0/8

            tabulation:  Freq.  Value
                             4  ""
                             1  "Catherine"
                             1  "Robin"
                             1  "Sally"
                             1  "Samuell"

               warning:  variable has leading and trailing blanks

We have our variable labels from section 3.2, but now we need to label the values so replicators can know what the data are without having to reference the data dictionary for every variable. We will use the label define command to create a new label.

First, we need to define a label with label define.

label define racelabel 1 "White" 2 "Asian" 3 "Hispanic" 4 "Black"

Next, we need to attach the label to the values of the variable with the label values command, and then we’ll look at the codebook again.

label values race racelabel
codebook race
race                                                                       Race of the student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  racelabel, but 1 nonmissing value is not labeled

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  0/8

            tabulation:  Freq.   Numeric  Label
                             2         1  White
                             2         2  Asian
                             2         3  Hispanic
                             1         4  Black
                             1         5  

We are still missing a value label for 5, which is Other, so we need to update our defined label racelabel. If we simply rerun label define racelabel without an option, we will get an error because the label already exists. Instead, we can use the add option in label define.

label define racelabel 5 "Other", add

If we want to modify an existing label, we can use the modify option in label define.

label define racelabel 4 "African American", modify
codebook race
race                                                                       Race of the student
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  racelabel

                 range:  [1,5]                        units:  1
         unique values:  5                        missing .:  0/8

            tabulation:  Freq.   Numeric  Label
                             2         1  White
                             2         2  Asian
                             2         3  Hispanic
                             1         4  African American
                             1         5  Other
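
As a sketch under the same setup, using hypothetical toy data rather than the survey file: modify can change an existing value and add a new one in a single step, while replace discards the old definition and rewrites the whole label.

```stata
* Toy data (hypothetical), built only for illustration
clear
input race
1
4
5
end

label define racelabel 1 "White" 4 "Black"
label values race racelabel

* modify: change the text for 4 and add 5 in one command
label define racelabel 4 "African American" 5 "Other", modify
label list racelabel

* replace: rewrite the complete definition from scratch
label define racelabel 1 "White" 4 "African American" 5 "Other", replace

* The label stays attached to race because the attachment is by label name
local four : label racelabel 4
local five : label racelabel 5
display "`four' / `five'"     // African American / Other
```

With replace, any value left out of the new definition loses its label, so modify is the safer choice for small corrections.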

Labeling missing values is something that I don’t recommend, but we’ll show an example here.

label define mfkid 1 "Male" 2 "Female" .u "Unknown" .n "NA"
label values ksex mfkid
codebook ksex

label define havechildlabel 0 "Don't have a child" 1 "Have a child" .u "Unknown" .n "NA"
label values havechild havechildlabel
codebook havechild
ksex                                                                              Sex of child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  mfkid

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  2                       missing .*:  5/8

            tabulation:  Freq.   Numeric  Label
                             1         1  Male
                             2         2  Female
                             4        .n  NA
                             1        .u  Unknown




----------------------------------------------------------------------------------------------
havechild                                                               Given birth to a child
----------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  havechildlabel

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/8
       unique mv codes:  1                       missing .*:  3/8

            tabulation:  Freq.   Numeric  Label
                             1         0  Don't have a child
                             4         1  Have a child
                             3        .n  NA

We can use the label list command to see the labels we have defined so far.

label list
havechildlabel:
           0 Don't have a child
           1 Have a child
          .n NA
          .u Unknown
mfkid:
           1 Male
           2 Female
          .n NA
          .u Unknown
racelabel:
           1 White
           2 Asian
           3 Hispanic
           4 African American
           5 Other

The numlabel command is an interesting command. It takes the guesswork out of knowing the numeric value of each category by prefixing the label text with the numeric value.

numlabel racelabel, add
label list racelabel
tabulate race
racelabel:
           1 1. White
           2 2. Asian
           3 3. Hispanic
           4 4. African American
           5 5. Other


Race of the |
    student |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       25.00       25.00
          2 |          2       25.00       50.00
          3 |          2       25.00       75.00
          4 |          1       12.50       87.50
          5 |          1       12.50      100.00
------------+-----------------------------------
      Total |          8      100.00

And, if we don’t like it or don’t need it any more, we can remove the numeric values with the remove option.

numlabel racelabel, remove
tabulate race
Race of the |
    student |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       25.00       25.00
          2 |          2       25.00       50.00
          3 |          2       25.00       75.00
          4 |          1       12.50       87.50
          5 |          1       12.50      100.00
------------+-----------------------------------
      Total |          8      100.00

We can also change how the number is shown, such as “#=” or “#)”, with the mask option.

numlabel racelabel, add mask("#) ")
tabulate race
Race of the |
    student |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       25.00       25.00
          2 |          2       25.00       50.00
          3 |          2       25.00       75.00
          4 |          1       12.50       87.50
          5 |          1       12.50      100.00
------------+-----------------------------------
      Total |          8      100.00

We can remove the mask with remove plus the same mask option.

numlabel racelabel, remove mask("#) ")
tabulate race
Race of the |
    student |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       25.00       25.00
          2 |          2       25.00       50.00
          3 |          2       25.00       75.00
          4 |          1       12.50       87.50
          5 |          1       12.50      100.00
------------+-----------------------------------
      Total |          8      100.00
save survey3, replace
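
The numlabel steps above can also be tried on Stata’s built-in auto dataset, where the value label origin is attached to the variable foreign. This is only a sketch for practice on a different dataset, not part of the survey workflow.

* Load the example dataset that ships with Stata
sysuse auto, clear
label list origin
* Prefix each label with its number using the default mask
numlabel origin, add
tabulate foreign
* Take the prefix off again
numlabel origin, remove
* Add the prefix with a custom mask, then remove it with the same mask
numlabel origin, add mask("#) ")
tabulate foreign
numlabel origin, remove mask("#) ")
label list origin

After the first add, the labels should read 0. Domestic and 1. Foreign, and the final label list should show the original text again.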

3.4 Labeling utilities

Stata has label utilities to manage the labels we have defined. The first one is label dir, which shows the names of the defined value labels more quickly and concisely than the codebook command.

I think that label list will be the most useful command here.

Let’s start with a quick check of the label directory.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use survey3, clear
label dir
/Users/Sam/Desktop/Econ 645/Data/Mitchell


mfkid
havechildlabel
racelabel

The label list command gives a more comprehensive view: it shows each value and its label under every value label name.

label list
mfkid:
           1 Male
           2 Female
          .n NA
          .u Unknown
havechildlabel:
           0 Don't have a child
           1 Have a child
          .n NA
          .u Unknown
racelabel:
           1 White
           2 Asian
           3 Hispanic
           4 African American
           5 Other

The label save command will save your labels into a do-file for future use. The do-file name is given after using.

label save havechildlabel racelabel using surveylabs, replace
type surveylabs.do
label define havechildlabel 0 `"Don't have a child"', modify
label define havechildlabel 1 `"Have a child"', modify
label define havechildlabel .n `"NA"', modify
label define havechildlabel .u `"Unknown"', modify
label define racelabel 1 `"White"', modify
label define racelabel 2 `"Asian"', modify
label define racelabel 3 `"Hispanic"', modify
label define racelabel 4 `"African American"', modify
label define racelabel 5 `"Other"', modify
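
A saved label file is handy when another dataset needs the same labels. The lines below are a sketch of that reuse. They assume surveylabs.do from the step above is in the working directory, and the three-row dataset is made up for illustration.

clear
* Made-up data with a numeric race variable and no value labels yet
input id race
1 1
2 3
3 5
end
* Run the saved do-file to recreate havechildlabel and racelabel
do surveylabs
* Attach the recreated label to the variable and check the result
label values race racelabel
tabulate race

Because the saved file uses label define with the modify option, it can be run whether or not the labels already exist.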

The labelbook command provides information similar to codebook but only for the labels that are defined.

labelbook
labelbook racelabel
----------------------------------------------------------------------------------------------
value label havechildlabel 
----------------------------------------------------------------------------------------------

      values                                    labels
       range:  [0,1]                     string length:  [2,18]
           N:  4                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  2                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           0   Don't have a child
           1   Have a child
          .n   NA
          .u   Unknown

   variables:  havechild


----------------------------------------------------------------------------------------------
value label mfkid 
----------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,2]                     string length:  [2,7]
           N:  4                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  2                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   Male
           2   Female
          .n   NA
          .u   Unknown

   variables:  ksex


----------------------------------------------------------------------------------------------
value label racelabel 
----------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,5]                     string length:  [5,16]
           N:  5                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   White
           2   Asian
           3   Hispanic
           4   African American
           5   Other

   variables:  race



----------------------------------------------------------------------------------------------
value label racelabel 
----------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,5]                     string length:  [5,16]
           N:  5                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   White
           2   Asian
           3   Hispanic
           4   African American
           5   Other

   variables:  race

The problem option for labelbook alerts us to any potential problems with the labels.

labelbook, problem
no potential problems in dataset survey3.dta

We can have a more detailed look with the detail and problem options

labelbook racelabel, problem detail
----------------------------------------------------------------------------------------------
value label racelabel 
----------------------------------------------------------------------------------------------

      values                                    labels
       range:  [1,5]                     string length:  [5,16]
           N:  5                 unique at full length:  yes
        gaps:  no                  unique at length 12:  yes
  missing .*:  0                           null string:  no
                               leading/trailing blanks:  no
                                    numeric -> numeric:  no
  definition
           1   White
           2   Asian
           3   Hispanic
           4   African American
           5   Other

   variables:  race


no potential problems in dataset survey3.dta
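
Our labels are clean, so the report above is empty. To see what a flagged label might look like, here is a minimal sketch with a deliberately flawed label. The name demolab is hypothetical; the label skips the value 3 and has a leading blank in one of its texts.

* Toy label with a gap in its values (no 3) and a leading blank in " High"
label define demolab 1 "Low" 2 "Medium" 4 " High"
* Ask labelbook to report potential problems for this label only
labelbook demolab, problem
* Remove the toy label so it does not clutter the dataset
label drop demolab

labelbook should point out the gap and the leading blank, and it may also mention that the label is not attached to any variable.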

3.5 Labeling variables and values in different languages

We will not be covering this, but if you are interested, please review pages 127-132.

3.6 Using Notes

The note command can be helpful for future users or for replicators. If you use the note command without specifying a variable, then it is a general note that will show up under _dta. If you put a variable name between note and the colon, like note var1:, then the note is attached to that variable.

Let’s add some general notes.

note: This was based on the dataset called survey1.txt

Adding TS to the end adds a timestamp, which is a nice feature.

note: The missing values for havechild and childage were coded using -1 and -2 but were converted to .n and .u TS

Let’s call our notes with the notes command.

notes
_dta:
  1.  This was based on the dataset called survey1.txt
  2.  The missing values for havechild and childage were coded using -1 and -2 but were
      converted to .n and .u 24 Sep 2025 17:33

Let’s add some notes to one of our variables.

note race: The other category includes people who specified multiple races
note race: This is another note
note race: This is a third note

We can also list the notes for just one variable.

notes race

race:
  1.  The other category includes people who specified multiple races
  2.  This is another note
  3.  This is a third note

Let’s say we added an unhelpful note. We can drop it with the notes drop command. Here we drop the second note.

notes drop race in 2
notes race
  (1 note dropped)


race:
  1.  The other category includes people who specified multiple races
  3.  This is a third note

Notice that we have a gap in the sequence of numbering. We can fix that with the notes renumber command.

notes renumber race
notes race
race:
  1.  The other category includes people who specified multiple races
  2.  This is a third note

We can also search notes with the notes search “string” command.

notes search .u
_dta:
  2.  The missing values for havechild and childage were coded using -1 and -2 but were
      converted to .n and .u 24 Sep 2025 17:33
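
If a note only needs its wording fixed, it can be replaced in place rather than dropped and renumbered. A minimal sketch on Stata’s built-in auto dataset; the note texts are made up for illustration.

sysuse auto, clear
* Attach two notes to mpg
note mpg: First note about mileage
note mpg: Second note with a typo
* Overwrite note 2 without changing the numbering
notes replace mpg in 2 : Second note, corrected wording
* List the notes for mpg to confirm the change
notes mpg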

3.7 Formatting the display of variables

Formatting data will be more common than you expect. It can be a pain when dealing with numbers in the millions or billions and you lack commas. We can format our data with the format command.

3.7.1 Format numerics

Let’s get our survey data, look at the format of income, and list the first 5 observations for id and income.

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use survey5, clear
describe income
list id income in 1/5
/Users/Sam/Desktop/Econ 645/Data/Mitchell

(Survey of graduate students)

              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
income          float   %9.0g                 Income of student

     +---------------+
     | id     income |
     |---------------|
  1. |  1   10500.93 |
  2. |  2   45234.13 |
  3. |  3    1284355 |
  4. |  4   124313.5 |
  5. |  5   120102.3 |
     +---------------+

The format is %9.0g. Every format begins with %. The g means general: Stata uses a width of nine characters and decides for us the best way to display the values.

%w.0g means general - w is the width, and Stata finds the best way to show the decimals.

Note: %w.dg will change the format to exponent if necessary. Also, d is usually set to 0 with g.

%w.df means fixed - w is the width, d is the decimals, and f means fixed

%w.dfc means fixed with commas - w is the width, d is the decimals, f means fixed, and c means comma

%w.0gc means general with commas - w is the width, setting 0 means Stata will decide the decimals, g means general, and c means comma

The manual is helpful for formatting: https://www.stata.com/manuals/dformat.pdf

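Example 1: format v1 %10.0g - Width of 10 characters, and Stata decides the decimals.

Example 2: format v2 %4.1f - Width of 4 characters in total for v2: 2 digits, the decimal point, and 1 decimal, as in 12.3

Example 3: format v3 %6.1fc - Width of 6 characters in total for v3, counting any comma and the decimal point, with 1 decimal. A value such as 1,234.5 needs a width of 7.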
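
One way to try a format before applying it to a variable is the display command, which accepts a format in front of a value. The lines below are a minimal sketch. The numbers are made up, except the last one, which matches the income of observation 3.

* Try formats on literal values without touching the data
display %10.0g 1284354.5
display %4.1f 12.34
display %7.1fc 1234.5
display %12.2fc 1284354.5

The last line should print 1,284,354.50, which uses the full width of 12 characters.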

Let’s get more control over the income format and use the %w.df format. We want a total width of 12 characters with 2 decimal places. The decimal point takes one character, which leaves 9 characters to the left of the “.”

format income %12.2f
list income in 1/5
     +------------+
     |     income |
     |------------|
  1. |   10500.93 |
  2. |   45234.13 |
  3. | 1284354.50 |
  4. |  124313.45 |
  5. |  120102.32 |
     +------------+

Notice that we now can see observation #3’s decimal places.

If we don’t care to see the decimal places (even though they are still stored), we can set the decimals to 0.

format income %7.0f
list income in 1/5
     +---------+
     |  income |
     |---------|
  1. |   10501 |
  2. |   45234 |
  3. | 1284354 |
  4. |  124313 |
  5. |  120102 |
     +---------+

If we want to see one decimal place, we can use %9.1f.

format income %9.1f
list income in 1/5
     +-----------+
     |    income |
     |-----------|
  1. |   10500.9 |
  2. |   45234.1 |
  3. | 1284354.5 |
  4. |  124313.5 |
  5. |  120102.3 |
     +-----------+

Now let’s add commas. We need two additional characters of width for the commas, and we’ll show two decimal places.

format income %12.2fc
list income in 1/5
     +--------------+
     |       income |
     |--------------|
  1. |    10,500.93 |
  2. |    45,234.13 |
  3. | 1,284,354.50 |
  4. |   124,313.45 |
  5. |   120,102.32 |
     +--------------+

3.7.2 Format Strings

Let’s use the format command with strings.

describe kidname
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
kidname         str10   %10s                  Name of child

The format is %10s, which is a (s)tring format that is 10 characters wide and right-justified.

list kidname
     +-----------+
     |   kidname |
     |-----------|
  1. |           |
  2. |     Sally |
  3. | Catherine |
  4. |           |
  5. |   Samuell |
     |-----------|
  6. |           |
  7. |     Robin |
  8. |           |
     +-----------+

If we want to left-justify the string, we can add a ‘-’ between the % and the width.

format kidname %-10s
list kidname
     +-----------+
     | kidname   |
     |-----------|
  1. |           |
  2. | Sally     |
  3. | Catherine |
  4. |           |
  5. | Samuell   |
     |-----------|
  6. |           |
  7. | Robin     |
  8. |           |
     +-----------+

3.7.3 Format dates

Dates in Stata are a bit of a pain, so learning how to format the dates will be helpful in the future.

list bdays kbdays
     +------------------------+
     |      bdays      kbdays |
     |------------------------|
  1. |  1/24/1961             |
  2. |  4/15/1968   4/15/1995 |
  3. |  5/23/1971   8/15/2003 |
  4. |  6/25/1973             |
  5. |  9/22/1981   1/12/1999 |
     |------------------------|
  6. | 10/15/1973             |
  7. |   7/1/1977   5/20/1998 |
  8. |   8/3/1976             |
     +------------------------+

Our birthdays are currently strings in a MM/DD/YYYY format. Let’s generate a new variable with the date function. The date function converts a string that holds a date into a Stata date, but the result still needs to be formatted. The second argument, “MDY”, tells Stata that the string is in month-day-year order.

generate bday = date(bdays, "MDY")
generate kbday = date(kbdays, "MDY")

Let’s list the days.

list bdays bday kbdays kbday
     +---------------------------------------+
     |      bdays   bday      kbdays   kbday |
     |---------------------------------------|
  1. |  1/24/1961    389                   . |
  2. |  4/15/1968   3027   4/15/1995   12888 |
  3. |  5/23/1971   4160   8/15/2003   15932 |
  4. |  6/25/1973   4924                   . |
  5. |  9/22/1981   7935   1/12/1999   14256 |
     |---------------------------------------|
  6. | 10/15/1973   5036                   . |
  7. |   7/1/1977   6391   5/20/1998   14019 |
  8. |   8/3/1976   6059                   . |
     +---------------------------------------+

The Stata dates are actually stored as the number of days from Jan 1, 1960. This method is convenient for the computer storing and performing date computations, but is difficult for us to read.
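
The day counts are easy to check by hand with display. A minimal sketch, using the first birthday in the list above:

* Day 0 is 1 January 1960
display mdy(1, 1, 1960)
* The string 1/24/1961 converts to the stored number of days
display date("1/24/1961", "MDY")
* Applying a date format to that number shows it as a calendar date
display %td 389

These lines should print 0, 389, and 24jan1961. The 389 matches the first row above: 1960 was a leap year with 366 days, and 24 January is 23 days after 1 January 1961.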

Let’s use the %td format, which displays dates like 01jan2000.

format bday %td 
list bdays bday kbdays kbday
     +--------------------------------------------+
     |      bdays        bday      kbdays   kbday |
     |--------------------------------------------|
  1. |  1/24/1961   24jan1961                   . |
  2. |  4/15/1968   15apr1968   4/15/1995   12888 |
  3. |  5/23/1971   23may1971   8/15/2003   15932 |
  4. |  6/25/1973   25jun1973                   . |
  5. |  9/22/1981   22sep1981   1/12/1999   14256 |
     |--------------------------------------------|
  6. | 10/15/1973   15oct1973                   . |
  7. |   7/1/1977   01jul1977   5/20/1998   14019 |
  8. |   8/3/1976   03aug1976                   . |
     +--------------------------------------------+

Let’s use the %tdNN/DD/YY format. NN shows the month as 01-12 (nn shows 1-12), DD shows the day as 01-31, and YY shows the last two digits of the year.

format bday %tdNN/DD/YY
list bdays bday kbdays kbday
     +-------------------------------------------+
     |      bdays       bday      kbdays   kbday |
     |-------------------------------------------|
  1. |  1/24/1961   01/24/61                   . |
  2. |  4/15/1968   04/15/68   4/15/1995   12888 |
  3. |  5/23/1971   05/23/71   8/15/2003   15932 |
  4. |  6/25/1973   06/25/73                   . |
  5. |  9/22/1981   09/22/81   1/12/1999   14256 |
     |-------------------------------------------|
  6. | 10/15/1973   10/15/73                   . |
  7. |   7/1/1977   07/01/77   5/20/1998   14019 |
  8. |   8/3/1976   08/03/76                   . |
     +-------------------------------------------+

We can also show the month by name: Mon is Jan-Dec, and Month is January-December.

format bday %tdMonth/DD/YY 
list bdays bday kbdays kbday
     +--------------------------------------------------+
     |      bdays              bday      kbdays   kbday |
     |--------------------------------------------------|
  1. |  1/24/1961     January/24/61                   . |
  2. |  4/15/1968       April/15/68   4/15/1995   12888 |
  3. |  5/23/1971         May/23/71   8/15/2003   15932 |
  4. |  6/25/1973        June/25/73                   . |
  5. |  9/22/1981   September/22/81   1/12/1999   14256 |
     |--------------------------------------------------|
  6. | 10/15/1973     October/15/73                   . |
  7. |   7/1/1977        July/01/77   5/20/1998   14019 |
  8. |   8/3/1976      August/03/76                   . |
     +--------------------------------------------------+

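We can show a standard Month DD, YYYY date with the format %tdMonth_DD,CCYY. Here Month is the full name of the month, the underscore displays as a space, DD is the two-digit day, CC is the century, such as 19 or 20, and YY is the two-digit year, such as 88 or 97. To get a space after the comma as well, add another underscore: %tdMonth_DD,_CCYY.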

format bday %tdMonth_DD,CCYY
list bdays bday kbdays kbday
     +----------------------------------------------------+
     |      bdays                bday      kbdays   kbday |
     |----------------------------------------------------|
  1. |  1/24/1961     January 24,1961                   . |
  2. |  4/15/1968       April 15,1968   4/15/1995   12888 |
  3. |  5/23/1971         May 23,1971   8/15/2003   15932 |
  4. |  6/25/1973        June 25,1973                   . |
  5. |  9/22/1981   September 22,1981   1/12/1999   14256 |
     |----------------------------------------------------|
  6. | 10/15/1973     October 15,1973                   . |
  7. |   7/1/1977        July 01,1977   5/20/1998   14019 |
  8. |   8/3/1976      August 03,1976                   . |
     +----------------------------------------------------+

Let’s try a numeric format with a four-digit year. Don’t use YYYY for this, because it just repeats the 2-digit year twice.

format bday %tdNN/DD/YYYY
list bdays bday kbdays kbday
     +---------------------------------------------+
     |      bdays         bday      kbdays   kbday |
     |---------------------------------------------|
  1. |  1/24/1961   01/24/6161                   . |
  2. |  4/15/1968   04/15/6868   4/15/1995   12888 |
  3. |  5/23/1971   05/23/7171   8/15/2003   15932 |
  4. |  6/25/1973   06/25/7373                   . |
  5. |  9/22/1981   09/22/8181   1/12/1999   14256 |
     |---------------------------------------------|
  6. | 10/15/1973   10/15/7373                   . |
  7. |   7/1/1977   07/01/7777   5/20/1998   14019 |
  8. |   8/3/1976   08/03/7676                   . |
     +---------------------------------------------+

Use %tdNN/DD/CCYY instead for the desired result.

format bday %tdNN/DD/CCYY
list bdays bday kbdays kbday

label variable bday "Date of birth of student"
label variable kbdays "Date of birth of child"
     +---------------------------------------------+
     |      bdays         bday      kbdays   kbday |
     |---------------------------------------------|
  1. |  1/24/1961   01/24/1961                   . |
  2. |  4/15/1968   04/15/1968   4/15/1995   12888 |
  3. |  5/23/1971   05/23/1971   8/15/2003   15932 |
  4. |  6/25/1973   06/25/1973                   . |
  5. |  9/22/1981   09/22/1981   1/12/1999   14256 |
     |---------------------------------------------|
  6. | 10/15/1973   10/15/1973                   . |
  7. |   7/1/1977   07/01/1977   5/20/1998   14019 |
  8. |   8/3/1976   08/03/1976                   . |
     +---------------------------------------------+
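
Once both dates are numeric, they can be formatted together and used in arithmetic. Here is a minimal sketch with two rows typed in from the list above. The variable age_at_birth is a made-up name, and dividing by 365.25 is only an approximation of years.

clear
* Two rows copied from the survey list
input str10 bdays str10 kbdays
"4/15/1968" "4/15/1995"
"5/23/1971" "8/15/2003"
end
* Convert both strings to Stata dates
generate bday = date(bdays, "MDY")
generate kbday = date(kbdays, "MDY")
* One format command can cover several variables
format bday kbday %tdNN/DD/CCYY
* Subtracting two dates gives the number of days between them
generate age_at_birth = (kbday - bday) / 365.25
list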

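bday and bdays are redundant, as are kbday and kbdays, so we’ll only keep one of each pair. Note that the command below drops the numeric bday and the string kbdays, so the saved file keeps the string bdays and the numeric kbday. kbday is saved without a variable label because the label above was attached to kbdays.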

drop bday kbdays
save survey6, replace

3.8 Changing the order of variables in a dataset

I personally find reordering variables with the order command to be useful. This is especially true when working with panel data. I like to put the cross-sectional unit first, such as person id or firm id, and the time period second, so we have our N and T next to one another.

Let’s pull our survey of graduate students and describe our dataset

cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use survey6, clear
describe
/Users/Sam/Desktop/Econ 645/Data/Mitchell

(Survey of graduate students)


Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
havechild       float   %18.0g     havelab  * Given birth to a child?
ksex            float   %15.0g     mfkid    * Sex of child
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
kidname         str10   %-10s                 Name of child
kbday           double  %td                   
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by: 

We might want to group our variables with similar types of variables. This can be helpful when you have a large dataset with hundreds of variables, such as the CPS.

order id gender race bday income havechild
describe
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
ksex            float   %15.0g     mfkid    * Sex of child
kidname         str10   %-10s                 Name of child
kbday           double  %td                   
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by: 

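The variables that we leave off keep their previous order, after the listed variables are moved to the left. Note that bday in the command matched bdays through Stata’s variable-name abbreviation, since survey6 has no variable named exactly bday.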

With the before option, we can move variable(s) before a defined variable. Let’s move kidname before ksex

order kidname, before(ksex)
describe
Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:             9                          11 Mar 2024 14:40
 size:           416                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
kidname         str10   %-10s                 Name of child
ksex            float   %15.0g     mfkid    * Sex of child
kbday           double  %td                   
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by: 

We can also place newly created variables with the before and after options of the generate command.

generate STUDENTVARS = ., before(gender)
generate KIDSVARS = ., after(havechild)
describe
(8 missing values generated)

(8 missing values generated)


Contains data from survey6.dta
  obs:             8                          Survey of graduate students
 vars:            11                          11 Mar 2024 14:40
 size:           544                          (_dta has notes)
----------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------
id              float   %9.0g                 Unique identification variable
STUDENTVARS     double  %10.0g                
gender          float   %9.0g      mf         Gender of student
race            float   %19.0g     racelab  * Race of student
bdays           str10   %10s                  Birthday of student
income          float   %12.2fc               Income of student
havechild       float   %18.0g     havelab  * Given birth to a child?
KIDSVARS        double  %10.0g                
kidname         str10   %-10s                 Name of child
ksex            float   %15.0g     mfkid    * Sex of child
kbday           double  %td                   
                                            * indicated variables have notes
----------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.
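
The order command has a few more placement options that are worth trying. Here is a minimal sketch on Stata’s built-in auto dataset, chosen only so the lines run without the survey files.

sysuse auto, clear
* Move make and foreign to the front; the rest keep their order
order make foreign
* Place price right after mpg
order price, after(mpg)
* Send the repair record to the end of the dataset
order rep78, last
* Show just the variable names in their new order
describe, simple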

3.9 Practice

Let’s bring in the CPS

use "/Users/Sam/Desktop/Econ 645/Data/CPS/jun23pub.dta", clear

  1. Generate a new variable from pemlr called employed where employed = 1 if the individual is employed (present or absent) and employed = 0 if the individual is unemployed. The value should be missing if the individual is not in the labor force.
  2. Label the variable “Currently employed”.
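  3. Label the values 0 “Not employed” and 1 “Employed”. Stata cannot attach a value label to the system missing value (.), so to show “Not in the Labor Force”, code those observations with an extended missing value such as .n and label that.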
  4. Move the variable after pemlr.
  5. Generate a date that appends hrmonth (month of interview), the string “12”, and the hryear4 (year of interview). We use 12 because the week of the 12th is the reference period.
  6. Now format the date so it is like 06/12/2023