Chapter 4 Cleaning and Checking Data

Mitchell Chapter 4

Good note from Mitchell at the beginning: garbage in, garbage out.

Before doing any analysis, you need to check your data for quality issues.

4.1 Checking individual variables

We start by looking for problems in individual variables. Mitchell uses the Working Women’s Survey (WWS) dataset; we’ll use Current Population Survey (CPS) data as well.


The describe command gives a useful overview of your data, showing variable labels, storage types, and value labels (if any).

quietly cd "/Users/Sam/Desktop/Data/CPS/"
use smalljul25pub
describe
Contains data from smalljul25pub.dta
  obs:       122,252                          
 vars:            50                          12 Aug 2025 19:17
 size:     8,313,136                          
------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------
hrhhid2         int     %8.0g                 
hrmis           byte    %8.0g                 HRMIS
hrmonth         byte    %8.0g                 HRMONTH
hryear4         int     %8.0g                 HRYEAR4
gestfips        byte    %8.0g                 
gediv           byte    %8.0g                 
pulineno        byte    %8.0g                 PULINENO
pudis           byte    %8.0g                 PUDIS
puiodp1         byte    %8.0g                 PUIODP1
puiodp2         byte    %8.0g                 PUIODP2
puiodp3         byte    %8.0g                 PUIODP3
perrp           byte    %8.0g                 
pesex           byte    %8.0g                 
pemaritl        byte    %8.0g                 
pehspnon        byte    %8.0g                 
peeduca         byte    %8.0g                 
peafever        byte    %8.0g                 
ptdtrace        byte    %8.0g                 
prpertyp        byte    %8.0g                 
prtage          byte    %8.0g                 
pehractt        int     %8.0g                 
pehract1        byte    %8.0g                 
pehract2        byte    %8.0g                 
pemlr           byte    %8.0g                 
peio1icd        int     %8.0g                 
peio1cow        byte    %8.0g                 
peio2cow        byte    %8.0g                 
prmjind1        byte    %8.0g                 
prmjind2        byte    %8.0g                 
prmjocc1        byte    %8.0g                 
prmjocc2        byte    %8.0g                 
prdtind1        byte    %8.0g                 
prdtind2        byte    %8.0g                 
prdtocc1        byte    %8.0g                 
prdtocc2        byte    %8.0g                 
prsjmj          byte    %8.0g                 
peernhry        byte    %8.0g                 
peernhro        byte    %8.0g                 
peernlab        byte    %8.0g                 
prerelg         byte    %8.0g                 
prdisflg        byte    %8.0g                 
pecert1         byte    %8.0g                 
pecert2         byte    %8.0g                 
pecert3         byte    %8.0g                 
pwcmpwgt        long    %12.0g                
pternwa         long    %12.0g                
ptio1ocd        int     %8.0g                 
hrhhid          double  %10.0g                
ptwk            byte    %8.0g                 
gtmetsta        byte    %8.0g                 
------------------------------------------------------------------------------------------------------------------
Sorted by: 

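If you keep a parallel copy of an extract in Python, a rough analogue of describe is DataFrame.dtypes together with the row/column counts. This is a minimal sketch with invented values; only the column names are taken from the listing above:

```python
import pandas as pd

# Toy stand-in for the CPS extract; the values are invented for illustration.
df = pd.DataFrame({
    "hryear4": [2025, 2025, 2025],
    "pesex": [1, 2, 2],
    "pternwa": [98000, 150000, -1],  # weekly earnings x100; -1 = not in universe
})

# Roughly what describe reports in its "storage type" column
print(df.dtypes)

# Number of observations and variables (the obs/vars header in describe)
n_obs, n_vars = df.shape
print(n_obs, n_vars)
```

Unlike Stata, pandas has no built-in variable or value labels, so the dictionary lookup step below has no direct equivalent.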
You will notice that we lack value labels and variable labels. Luckily, there is a data dictionary on the Census Bureau’s website. Let’s label some of the variables and run the describe command again.

label variable gestfips "State"
label variable gediv "Census Region"
label variable pternwa "Weekly earnings"
label variable peio1cow "Class of worker in Main Job"
label variable peio2cow "Class of worker in Second Job"
label variable pemlr "Labor Force Status"
describe
Contains data from smalljul25pub.dta
  obs:       122,252                          
 vars:            50                          12 Aug 2025 19:17
 size:     8,313,136                          
------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------
hrhhid2         int     %8.0g                 
hrmis           byte    %8.0g                 HRMIS
hrmonth         byte    %8.0g                 HRMONTH
hryear4         int     %8.0g                 HRYEAR4
gestfips        byte    %8.0g                 State
gediv           byte    %8.0g                 Census Region
pulineno        byte    %8.0g                 PULINENO
pudis           byte    %8.0g                 PUDIS
puiodp1         byte    %8.0g                 PUIODP1
puiodp2         byte    %8.0g                 PUIODP2
puiodp3         byte    %8.0g                 PUIODP3
perrp           byte    %8.0g                 
pesex           byte    %8.0g                 
pemaritl        byte    %8.0g                 
pehspnon        byte    %8.0g                 
peeduca         byte    %8.0g                 
peafever        byte    %8.0g                 
ptdtrace        byte    %8.0g                 
prpertyp        byte    %8.0g                 
prtage          byte    %8.0g                 
pehractt        int     %8.0g                 
pehract1        byte    %8.0g                 
pehract2        byte    %8.0g                 
pemlr           byte    %8.0g                 Labor Force Status
peio1icd        int     %8.0g                 
peio1cow        byte    %8.0g                 Class of worker in Main Job
peio2cow        byte    %8.0g                 Class of worker in Second Job
prmjind1        byte    %8.0g                 
prmjind2        byte    %8.0g                 
prmjocc1        byte    %8.0g                 
prmjocc2        byte    %8.0g                 
prdtind1        byte    %8.0g                 
prdtind2        byte    %8.0g                 
prdtocc1        byte    %8.0g                 
prdtocc2        byte    %8.0g                 
prsjmj          byte    %8.0g                 
peernhry        byte    %8.0g                 
peernhro        byte    %8.0g                 
peernlab        byte    %8.0g                 
prerelg         byte    %8.0g                 
prdisflg        byte    %8.0g                 
pecert1         byte    %8.0g                 
pecert2         byte    %8.0g                 
pecert3         byte    %8.0g                 
pwcmpwgt        long    %12.0g                
pternwa         long    %12.0g                Weekly earnings
ptio1ocd        int     %8.0g                 
hrhhid          double  %10.0g                
ptwk            byte    %8.0g                 
gtmetsta        byte    %8.0g                 
------------------------------------------------------------------------------------------------------------------
Sorted by: 

4.2 One-way tabulation

The tabulation or tab command is very useful for inspecting categorical variables. Let’s look at collgrad, a binary for college-graduate status in the wws data, and at race, a categorical variable. We include the missing option to make sure no observations have missing data.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use wws, clear
tabulate collgrad, missing
tabulate race, missing
(Working Women Survey)

    college |
   graduate |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,713       76.27       76.27
          1 |        533       23.73      100.00
------------+-----------------------------------
      Total |      2,246      100.00

       race |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,636       72.84       72.84
          2 |        583       25.96       98.80
          3 |         26        1.16       99.96
          4 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Race should only be coded 1 through 3, so we have one miscoded observation. Let’s find that observation’s idcode.

list idcode race if race==4
(Working Women Survey)

      +---------------+
      | idcode   race |
      |---------------|
2013. |    543      4 |
      +---------------+
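The same check translates naturally to pandas: value_counts(dropna=False) plays the role of tabulate with the missing option, and boolean indexing plays the role of list. A sketch with an invented four-row mini-dataset:

```python
import pandas as pd

# Invented mini-dataset mirroring the race check above
df = pd.DataFrame({"idcode": [10, 11, 12, 543], "race": [1, 2, 3, 4]})

# tabulate race, missing  ->  value_counts with dropna=False
counts = df["race"].value_counts(dropna=False).sort_index()
print(counts)

# list idcode race if race==4  ->  boolean row selection
bad = df.loc[df["race"] > 3, ["idcode", "race"]]
print(bad)
```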

4.3 Summarize

The summarize or sum command is very useful for inspecting continuous variables. Let’s look at unempins, the amount of unemployment insurance received last week, which should range between 0 and 300 dollars.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use wws, clear
summarize unempins
(Working Women Survey)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    unempins |      2,246    30.50401    73.16682          0        299

Let’s look at weekly earnings in CPS.

quietly cd "/Users/Sam/Desktop/Data/CPS/"
use smalljul25pub.dta, clear
gen earnings=pternwa/100
summarize earnings if prerelg==1
(28,617 missing values generated)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    earnings |      9,635    1408.951     1251.25          0    6930.64

The data range between $0 and $6,930.64 in weekly earnings, with a mean of $1,408.95 and a standard deviation of $1,251.25. The max seems a bit high. Let’s add the detail option to get more information.

summarize earnings if prerelg==1, detail
                          earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%           88              0
 5%          250              0
10%          384              0       Obs               9,635
25%          680              0       Sum of Wgt.       9,635

50%         1060                      Mean           1408.951
                        Largest       Std. Dev.       1251.25
75%         1730        6930.64
90%         2700        6930.64       Variance        1565626
95%         3460        6930.64       Skewness       2.688217
99%      6930.64        6930.64       Kurtosis        11.9109

We can see that the mean is $1,408.95, while the median is $1,060.00. The distribution appears right-skewed, with a skewness of 2.7 and a kurtosis of 11.9. The 95th percentile is $3,460.00, while the 99th percentile is $6,930.64.
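The detail output above (extra percentiles plus skewness and kurtosis) can be approximated in pandas. One caveat worth a comment: Stata reports kurtosis on a scale where a normal variable is 3, while pandas reports excess kurtosis, where a normal variable is about 0. A sketch using randomly generated right-skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Invented right-skewed earnings-like data, purely for illustration
earnings = pd.Series(rng.lognormal(mean=7, sigma=0.6, size=1000))

# summarize, detail -> describe() with extra percentiles
detail = earnings.describe(percentiles=[.01, .05, .10, .25, .50, .75, .90, .95, .99])
print(detail)

# pandas kurt() is EXCESS kurtosis: normal ~ 0 here, vs. 3 on Stata's scale
print(earnings.skew(), earnings.kurt())
```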

Let’s look at the wage data in the wws dataset.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use wws, clear
summarize wage, detail
(Working Women Survey)

                         hourly wage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.892108              0
 5%     2.801002       1.004952
10%     3.220612       1.032247       Obs               2,246
25%     4.259257       1.151368       Sum of Wgt.       2,246

50%     6.276297                      Mean           288.2885
                        Largest       Std. Dev.      9595.692
75%     9.661837       40.19808
90%     12.77777       40.74659       Variance       9.21e+07
95%     16.73912         250000       Skewness       35.45839
99%     38.70926         380000       Kurtosis       1297.042

For our wws data, the mean of 288.29 dollars/hour appears highly right-skewed and heavily influenced by outliers. The median hourly wage is about 6.28 dollars per hour and the 99th percentile is 38.71 dollars per hour, yet the mean is 288.29 dollars/hour. The skewness of 35 means the wage data are highly skewed, and the kurtosis shows that the largest outliers are driving the results: a normally distributed variable has a kurtosis of 3, while ours is 1,297. Looking at the two largest observations, they are 250,000 and 380,000 dollars/hour.

This means we have found potential data measurement issues. Let’s look at the outliers.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
list idcode wage if wage > 1000
      | idcode     wage |
      |-----------------|
 893. |   3145   380000 |
1241. |   2341   250000 |
      +-----------------+

Let’s look at age, which should range from 21 to 50 years old. We can use both the summarize and tabulate commands. Tabulate can be very useful with continuous variables when the range is not too large.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
summarize age
tabulate age
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         age |      2,246    36.25111    5.437983         21         83

     age in |
    current |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         21 |          2        0.09        0.09
         22 |          8        0.36        0.45
         23 |         16        0.71        1.16
         24 |         35        1.56        2.72
         25 |         44        1.96        4.67
         26 |         50        2.23        6.90
         27 |         50        2.23        9.13
         28 |         63        2.80       11.93
         29 |         49        2.18       14.11
         30 |         55        2.45       16.56
         31 |         57        2.54       19.10
         32 |         60        2.67       21.77
         33 |         56        2.49       24.27
         34 |         95        4.23       28.50
         35 |        213        9.48       37.98
         36 |        224        9.97       47.95
         37 |        171        7.61       55.57
         38 |        175        7.79       63.36
         39 |        167        7.44       70.79
         40 |        139        6.19       76.98
         41 |        148        6.59       83.57
         42 |        111        4.94       88.51
         43 |        110        4.90       93.41
         44 |         98        4.36       97.77
         45 |         45        2.00       99.78
         46 |          1        0.04       99.82
         47 |          1        0.04       99.87
         48 |          1        0.04       99.91
         54 |          1        0.04       99.96
         83 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look at the outliers.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
list idcode age if age > 50
      | idcode   age |
      |--------------|
2205. |     80    54 |
2219. |     51    83 |
      +--------------+

We have again found potential data measurement problems. The data are supposed to range from 21 to 50, but two observations show 54 and 83. It is possible the digits were transposed at entry and the values should be 45 and 38, respectively.
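This kind of documented-range check is easy to automate. A hedged pandas sketch (the three rows are invented; 21–50 is the survey’s documented range from above):

```python
import pandas as pd

# Invented sample; the survey's documented age range is 21-50
df = pd.DataFrame({"idcode": [80, 51, 7], "age": [54, 83, 36]})

# list idcode age if age > 50 (or below 21) -> flag anything outside the range
out_of_range = df.loc[~df["age"].between(21, 50), ["idcode", "age"]]
print(out_of_range)
```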

4.4 Crosstabulation - categorical by categorical variables

The two-way tabulation, again through the tabulate command, is a very useful way to look for data quality problems or to double-check binaries you have created. Let’s look at metro, a binary for whether the observation lives in a metropolitan area, and ccity, a binary for whether the observation lives in the city center.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabulate metro ccity, missing
Does woman |
   live in |  Does woman live in
     metro |     city center?
     area? |         0          1 |     Total
-----------+----------------------+----------
         0 |       665          0 |       665 
         1 |       926        655 |     1,581 
-----------+----------------------+----------
     Total |     1,591        655 |     2,246 

An alternative is to count the number of observations that shouldn’t be there, so we’ll count the observations not in a metro area but in a city center. I personally don’t use this approach much, but it has its uses.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
count if metro==0 & ccity==1
  0

Let’s look at married and nevermarried. No individual should appear as both married and never married.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabulate married nevermarried, missing
           |   Woman never been
           |        married
   married |         0          1 |     Total
-----------+----------------------+----------
         0 |       570        234 |       804 
         1 |     1,440          2 |     1,442 
-----------+----------------------+----------
     Total |     2,010        236 |     2,246 

There may be observations that were married before but are not currently married; however, there should be no observations that are both married and never married.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
count if married == 1 & nevermarried == 1
list idcode married nevermarried if married == 1 & nevermarried == 1
  2

      +-----------------------------+
      | idcode   married   neverm~d |
      |-----------------------------|
1523. |   1758         1          1 |
2231. |     22         1          1 |
      +-----------------------------+

We have found another potential data measurement issue: 2 observations are both married and never married.
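The contradictory-binaries check can be sketched in pandas with crosstab plus a boolean count (the four rows are invented; two of them reproduce the contradiction found above):

```python
import pandas as pd

# Invented sample with two contradictory rows (married AND never married)
df = pd.DataFrame({
    "idcode": [1758, 22, 5, 9],
    "married": [1, 1, 0, 1],
    "nevermarried": [1, 1, 1, 0],
})

# tabulate married nevermarried -> two-way crosstabulation
print(pd.crosstab(df["married"], df["nevermarried"]))

# count if married==1 & nevermarried==1
n_bad = ((df["married"] == 1) & (df["nevermarried"] == 1)).sum()
print(n_bad)
```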

Let’s look at college graduates by years of school completed. We can use the tabulate or table commands. I personally prefer tabulate, since we can look for missing values.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabulate yrschool collgrad, missing
  Years of |
    school |   college graduate
 completed |         0          1 |     Total
-----------+----------------------+----------
         8 |        69          1 |        70 
         9 |        55          0 |        55 
        10 |        84          0 |        84 
        11 |       123          0 |       123 
        12 |       943          0 |       943 
        13 |       174          2 |       176 
        14 |       180          7 |       187 
        15 |        81         11 |        92 
        16 |         0        252 |       252 
        17 |         0        106 |       106 
        18 |         0        154 |       154 
         . |         4          0 |         4 
-----------+----------------------+----------
     Total |     1,713        533 |     2,246 

We can also use the table command.

table yrschool collgrad
Years of  |  college  
school    |  graduate 
completed |    0     1
----------+-----------
        8 |   69     1
        9 |   55      
       10 |   84      
       11 |  123      
       12 |  943      
       13 |  174     2
       14 |  180     7
       15 |   81    11
       16 |        252
       17 |        106
       18 |        154
----------------------

Now let’s find the idcodes with potential problems.

list idcode if yrschool < 16 & collgrad == 1
      | idcode |
      |--------|
 195. |   4721 |
 369. |   4334 |
 464. |   4131 |
 553. |   3929 |
 690. |   3589 |
      |--------|
1092. |   2681 |
1098. |   2674 |
1114. |   2640 |
1124. |   2613 |
1221. |   2384 |
      |--------|
1493. |   1810 |
1829. |    993 |
1843. |    972 |
1972. |    689 |
2114. |    312 |
      |--------|
2174. |    172 |
2198. |    107 |
2215. |     63 |
2228. |     25 |
2229. |     24 |
      |--------|
2230. |     23 |
      +--------+

There is at least one potential measurement problem: an observation with 8 years of schooling completed is listed as a college graduate. We also see 2 observations with 13 years, 7 with 14 years, and 11 with 15 years. These may be two-year college graduates, so those observations may not be problematic.

4.5 Checking categorical by continuous variables

It is also helpful to look at continuous variables across categories. Let’s look at uniondues by union, a binary for whether the observation is a union member.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
summarize uniondues, detail
                  Union Dues paid last week
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               2,242
25%            0              0       Sum of Wgt.       2,242

50%            0                      Mean           5.603479
                        Largest       Std. Dev.      9.029045
75%           10             29
90%           22             29       Variance       81.52365
95%           26             29       Skewness        1.35268
99%           29             29       Kurtosis       3.339635

The median union dues are 0 dollars, which makes sense. The mean is 5.60 dollars, while the maximum value is 29 dollars.

Let’s use the bysort prefix, which sorts the data by the specified categorical variable and runs the command separately for each group.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
bysort union: summarize uniondues
-> union = 0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |      1,413     .094126    1.502237          0         27

------------------------------------------------------------------------------------------------------------------
-> union = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |        461    14.65944    8.707759          0         29

------------------------------------------------------------------------------------------------------------------
-> union = .

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   uniondues |        368    15.41304    8.815582          0         29

The mean for those not in a union is less than 1 dollar, with 1,413 observations. The mean for union members is about 15 dollars, with 461 observations. The mean for those with missing union status is about 15 dollars, with 368 observations.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabulate uniondues if union==0, missing
 Union Dues |
  paid last |
       week |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,407       99.29       99.29
         10 |          1        0.07       99.36
         17 |          1        0.07       99.44
         26 |          2        0.14       99.58
         27 |          2        0.14       99.72
          . |          4        0.28      100.00
------------+-----------------------------------
      Total |      1,417      100.00

We see that 6 non-union observations report positive union dues (and 4 are missing). We may or may not need to recode them to 0: it is possible that someone is not a union member but still has to pay an agency fee. Let’s assume they are not in a 14(b) state.

The recode command can create a new variable indicating whether a person pays union dues. Now let’s compare union members to those who pay union dues.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
recode uniondues (0=0) (1/max=1), generate(paysdues)
tabulate union paysdues, missing
(784 differences between uniondues and paysdues)

           | RECODE of uniondues (Union Dues
     union |         paid last week)
    worker |         0          1          . |     Total
-----------+---------------------------------+----------
         0 |     1,407          6          4 |     1,417 
         1 |        17        444          0 |       461 
         . |         7        361          0 |       368 
-----------+---------------------------------+----------
     Total |     1,431        811          4 |     2,246 

Let’s list the non-union members paying union dues. Note: we use the abb(20) option so that variable names in the listing are abbreviated to no fewer than 20 characters (keeping uniondues readable).

list idcode union uniondues if union==0 & (uniondues > 0) & !missing(uniondues), abb(20)
      | idcode   union   uniondues |
      |----------------------------|
 561. |   3905       0          10 |
 582. |   3848       0          26 |
 736. |   3464       0          17 |
1158. |   2541       0          27 |
1668. |   1411       0          27 |
      |----------------------------|
2100. |    345       0          26 |
      +----------------------------+
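The recode step above has a rough pandas analogue using np.where in place of Stata’s recode rules. A sketch with invented dues values, including a missing one to show that missings stay missing:

```python
import numpy as np
import pandas as pd

# Invented dues values, including one missing observation
uniondues = pd.Series([0, 0, 10, 27, np.nan])

# recode uniondues (0=0) (1/max=1), generate(paysdues)
# Check notna() first so missing stays missing instead of becoming 0
paysdues = pd.Series(
    np.where(uniondues.isna(), np.nan, (uniondues > 0).astype(float))
)
print(paysdues.tolist())
```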

Let’s look at another great command for observing categorical and continuous variables together: tabstat.

First, though, we could use tab with the sum() option.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tab married, sum(marriedyrs)
            |  Summary of Years married (rounded
            |          to nearest year)
    married |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |           0           0         804
          1 |   5.5409154   3.5521383       1,442
------------+------------------------------------
      Total |   3.5574354   3.8933494       2,246

Or, we can use tabstat, which gives us more options.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabstat marriedyrs, by(married) statistics(n mean sd min max p25 p50 p75) missing
Summary for variables: marriedyrs
     by categories of: married (married)

 married |         N      mean        sd       min       max       p25       p50       p75
---------+--------------------------------------------------------------------------------
       0 |       804         0         0         0         0         0         0         0
       1 |      1442  5.540915  3.552138         0        11         2         6         9
---------+--------------------------------------------------------------------------------
   Total |      2246  3.557435  3.893349         0        11         0         2         7
------------------------------------------------------------------------------------------

No observations that said they were never married reported years of marriage.

Let’s look at current years of experience by the everworked binary.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabstat currexp, by(everworked) statistics(n mean sd min max) missing
Summary for variables: currexp
     by categories of: everworked (Has woman ever worked?)

everworked |         N      mean        sd       min       max
-----------+--------------------------------------------------
         0 |        60         0         0         0         0
         1 |      2171  5.328881  5.042181         0        26
-----------+--------------------------------------------------
     Total |      2231  5.185567  5.048073         0        26
--------------------------------------------------------------

Let’s compute total years of experience, which is currexp plus prevexp.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
generate totexp=currexp+prevexp
tabstat totexp, by(everworked) statistics(n mean sd min max) missing
(15 missing values generated)


Summary for variables: totexp
     by categories of: everworked (Has woman ever worked?)

everworked |         N      mean        sd       min       max
-----------+--------------------------------------------------
         0 |        60         0         0         0         0
         1 |      2171  11.57761  4.552392         1        29
-----------+--------------------------------------------------
     Total |      2231  11.26625  4.865816         0        29
--------------------------------------------------------------

Everyone who has ever worked has at least one year of total experience.
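The tabstat-by-group pattern maps onto pandas groupby-aggregate. A sketch with an invented five-row sample in which currexp should be 0 whenever everworked is 0:

```python
import pandas as pd

# Invented mini-sample: currexp should be 0 whenever everworked == 0
df = pd.DataFrame({
    "everworked": [0, 0, 1, 1, 1],
    "currexp": [0, 0, 3, 7, 0],
})

# tabstat currexp, by(everworked) statistics(n mean sd min max)
stats = df.groupby("everworked")["currexp"].agg(["count", "mean", "std", "min", "max"])
print(stats)
```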

Note: three ways to summarize a continuous variable by a categorical one:

1. bysort x_cat: summarize x_con
2. tab x_cat, sum(x_con)
3. tabstat x_con, by(x_cat) statistics(…)

4.6 Checking continuous by continuous variables

Now we will compare continuous variables to continuous variables. Let’s look at unemployment insurance received last week when hours worked are greater than 30 and hours are not missing.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
summarize unempins if hours > 30 & !missing(hours)
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    unempins |      1,800    1.333333    16.04617          0        287

The mean is around 1.3, and the values range from 0 to 287, within the expected bounds.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
count if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
  19

Let’s look at the hours count

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
tabulate hours if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
usual hours |
     worked |      Freq.     Percent        Cum.
------------+-----------------------------------
         35 |          1        5.26        5.26
         38 |          2       10.53       15.79
         40 |         14       73.68       89.47
         65 |          1        5.26       94.74
         70 |          1        5.26      100.00
------------+-----------------------------------
      Total |         19      100.00

It seems that 19 observations worked more than 30 hours and received unemployment insurance. These could be data issues, or the person may have received unemployment insurance before returning to work; it depends on the data collection and data generation process.
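The count-with-conditions check translates to a boolean mask in pandas. A sketch with four invented rows, one of which is a full-time worker reporting benefits:

```python
import numpy as np
import pandas as pd

# Invented rows; flag full-time workers (hours > 30) who also report UI benefits
df = pd.DataFrame({
    "hours": [40, 20, 35, np.nan],
    "unempins": [0, 150, 87, 50],
})

# count if (hours>30) & !missing(hours) & (unempins>0) & !missing(unempins)
mask = (df["hours"] > 30) & df["hours"].notna() \
       & (df["unempins"] > 0) & df["unempins"].notna()
print(mask.sum())
```

Note that in pandas a comparison against NaN is already False, so the notna() terms are belt-and-suspenders; in Stata, by contrast, missing counts as larger than any number, which is why the !missing() guards are essential there.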

Let’s look at age and years married to find any unusual observations.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
generate agewhenmarried = age - marriedyrs
summarize agewhenmarried 
tabulate agewhenmarried
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
agewhenmar~d |      2,246    32.69368    6.709341         13         73


agewhenmarr |
        ied |      Freq.     Percent        Cum.
------------+-----------------------------------
         13 |          1        0.04        0.04
         14 |          4        0.18        0.22
         15 |         11        0.49        0.71
         16 |          8        0.36        1.07
         17 |         18        0.80        1.87
         18 |         19        0.85        2.72
         19 |         14        0.62        3.34
         20 |         20        0.89        4.23
         21 |         36        1.60        5.83
         22 |         42        1.87        7.70
         23 |         52        2.32       10.02
         24 |         65        2.89       12.91
         25 |         67        2.98       15.89
         26 |         81        3.61       19.50
         27 |         73        3.25       22.75
         28 |         93        4.14       26.89
         29 |         90        4.01       30.90
         30 |        118        5.25       36.15
         31 |         91        4.05       40.20
         32 |        124        5.52       45.73
         33 |        109        4.85       50.58
         34 |        109        4.85       55.43
         35 |        149        6.63       62.07
         36 |        138        6.14       68.21
         37 |        119        5.30       73.51
         38 |        124        5.52       79.03
         39 |        119        5.30       84.33
         40 |         89        3.96       88.29
         41 |         80        3.56       91.85
         42 |         62        2.76       94.61
         43 |         50        2.23       96.84
         44 |         43        1.91       98.75
         45 |         23        1.02       99.78
         46 |          2        0.09       99.87
         47 |          1        0.04       99.91
         48 |          1        0.04       99.96
         73 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

One person was reportedly 73 when she married. However, the dataset covers women aged 21 to 50, so this value was likely entered incorrectly; the correct value may be 37, with the digits transposed.

Let’s look for anyone married under age 18. Some states allow marriage under 18, but these values are still suspicious.

tabulate agewhenmarried if agewhenmarried < 18
agewhenmarr |
        ied |      Freq.     Percent        Cum.
------------+-----------------------------------
         13 |          1        2.38        2.38
         14 |          4        9.52       11.90
         15 |         11       26.19       38.10
         16 |          8       19.05       57.14
         17 |         18       42.86      100.00
------------+-----------------------------------
      Total |         42      100.00

Let’s use the same strategy, subtracting total years of experience from age to find the age at which each woman started working.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
generate agewhenstartwork = age - (prevexp + currexp)
tabulate agewhenstartwork if agewhenstartwork < 16
(15 missing values generated)


agewhenstar |
      twork |      Freq.     Percent        Cum.
------------+-----------------------------------
          8 |          1        1.49        1.49
          9 |          1        1.49        2.99
         12 |          1        1.49        4.48
         14 |         20       29.85       34.33
         15 |         44       65.67      100.00
------------+-----------------------------------
      Total |         67      100.00

Some states allow work at age 15, but anything below that looks suspicious and would need to be investigated for potential data issues.
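Once a rule like this is agreed upon with the data stewards, Stata’s assert command can turn it into an automated check that halts a do-file when the rule is violated. A minimal sketch, where the floor of 8 is an illustrative assumption rather than a documented rule:

```stata
* Halt the do-file if any start-work age falls below a plausible floor.
* The floor of 8 is an illustrative assumption, not a documented rule.
assert agewhenstartwork >= 8 if !missing(agewhenstartwork)
```

If the assertion fails, Stata reports how many observations violate the condition and stops with an error, which is exactly what you want in a production cleaning script.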

We can also check the ages of the children. We should expect that the third child is never older than the second child, and the second never older than the first.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
table kidage2 kidage3 if numkids==3
count if (kidage3 > kidage2) & (numkids==3) & !missing(kidage3)
count if (kidage2 > kidage1) & (numkids>=2) & !missing(kidage2)
Age of    |
second    |               Age of third child              
child     |    0     1     2     3     4     5     6     7
----------+-----------------------------------------------
        0 |   12                                          
        1 |   10     9                                    
        2 |   11     8    10                              
        3 |   10    12     6     8                        
        4 |   10    12    10     7     5                  
        5 |   12    11     9     3     6     8            
        6 |    9     8    10     6     5     6     6      
        7 |    7     6     7     9     4    14    12     6
        8 |          5    11     7     6    14     6    11
        9 |                8    13    10     7    12     9
       10 |                     15     3    10     6    12
       11 |                            9     8     3    13
       12 |                                 16     9     6
       13 |                                       11     5
       14 |                                              8
----------------------------------------------------------

  0

  0

Now let’s check each woman’s age when her first child was born.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
quietly generate agewhenfirstkid = age - kidage1
tabulate agewhenfirstkid if agewhenfirstkid < 18
agewhenfirs |
       tkid |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          1        0.51        0.51
          5 |          2        1.01        1.52
          7 |          2        1.01        2.53
          8 |          5        2.53        5.05
          9 |          8        4.04        9.09
         10 |          7        3.54       12.63
         11 |         10        5.05       17.68
         12 |         10        5.05       22.73
         13 |         20       10.10       32.83
         14 |         30       15.15       47.98
         15 |         27       13.64       61.62
         16 |         39       19.70       81.31
         17 |         37       18.69      100.00
------------+-----------------------------------
      Total |        198      100.00

This tabulation reveals some suspicious observations, which would require investigation into data quality issues.
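To investigate, we can list identifying information for the suspicious records so they can be checked against the source data. A sketch, where the cutoff of 14 is an arbitrary choice for illustration:

```stata
* List identifying information for implausibly young mothers so the
* records can be checked against the source; the cutoff of 14 is arbitrary.
list idcode age kidage1 agewhenfirstkid if agewhenfirstkid < 14
```

Note that missing values of agewhenfirstkid are treated as very large in Stata, so the condition `< 14` automatically excludes them.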

Scatter plots can be helpful as well

twoway scatter hours wage if hours > 0 & wage >0 & wage < 1000
graph export "/Users/Sam/Desktop/Econ 645/Stata/hour_wage.png", width(500) replace
(Figure: Hours and Wages scatter plot)

4.7 Fixing and correcting errors in data

Let’s go ahead and fix some of the errors we have found. A word of caution: you may need to talk to the data stewards about the best ways to correct the data. You don’t want an attempted fix of measurement error to introduce additional problems. Institutional knowledge of the data is very helpful before correcting errors.

Let’s fix the observation where race equals 4 even though there are only 3 categories, and add a note for future users, which is good practice for replication.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
list idcode race if race==4
replace race=1 if idcode == 543
note race: race changed to 1 (from 4) for idcode 543
tabulate race
      | idcode   race |
      |---------------|
2013. |    543      4 |
      +---------------+

(1 real change made)

       race |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,637       72.89       72.89
          2 |        583       25.96       98.84
          3 |         26        1.16      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s fix the observation flagged as a college graduate despite having only 8 years of school.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
list idcode collgrad if yrschool < 12 & collgrad == 1
replace collgrad = 0 if idcode == 107

note collgrad: collgrad changed to 0 (from 1) for idcode 107
tab yrschool collgrad
      | idcode   collgrad |
      |-------------------|
2198. |    107          1 |
      +-------------------+

(1 real change made)



  Years of |
    school |   college graduate
 completed |         0          1 |     Total
-----------+----------------------+----------
         8 |        70          0 |        70 
         9 |        55          0 |        55 
        10 |        84          0 |        84 
        11 |       123          0 |       123 
        12 |       943          0 |       943 
        13 |       174          2 |       176 
        14 |       180          7 |       187 
        15 |        81         11 |        92 
        16 |         0        252 |       252 
        17 |         0        106 |       106 
        18 |         0        154 |       154 
-----------+----------------------+----------
     Total |     1,710        532 |     2,242 

Let’s fix the ages where the digits were transposed, and make a note of it.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use wws, clear
list idcode age if age > 50
replace age = 38 if idcode == 51
replace age = 45 if idcode == 80
note age: the value of 83 was corrected to 38 for idcode 51
note age: the value of 54 was corrected to 45 for idcode 80
tab age
      | idcode   age |
      |--------------|
2205. |     80    54 |
2219. |     51    83 |
      +--------------+

(1 real change made)

(1 real change made)

     age in |
    current |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         21 |          2        0.09        0.09
         22 |          8        0.36        0.45
         23 |         16        0.71        1.16
         24 |         35        1.56        2.72
         25 |         44        1.96        4.67
         26 |         50        2.23        6.90
         27 |         50        2.23        9.13
         28 |         63        2.80       11.93
         29 |         49        2.18       14.11
         30 |         55        2.45       16.56
         31 |         57        2.54       19.10
         32 |         60        2.67       21.77
         33 |         56        2.49       24.27
         34 |         95        4.23       28.50
         35 |        213        9.48       37.98
         36 |        224        9.97       47.95
         37 |        171        7.61       55.57
         38 |        176        7.84       63.40
         39 |        167        7.44       70.84
         40 |        139        6.19       77.03
         41 |        148        6.59       83.62
         42 |        111        4.94       88.56
         43 |        110        4.90       93.46
         44 |         98        4.36       97.82
         45 |         46        2.05       99.87
         46 |          1        0.04       99.91
         47 |          1        0.04       99.96
         48 |          1        0.04      100.00
------------+-----------------------------------
      Total |      2,246      100.00

Let’s look at our notes with the note command.

note
_dta:
  1.  This is a hypothetical dataset and should not be used for analysis purposes

age:
  1.  the value of 83 was corrected to 38 for idcode 51
  2.  the value of 54 was corrected to 45 for idcode 80

race:
  1.  race changed to 1 (from 4) for idcode 543

collgrad:
  1.  collgrad changed to 0 (from 1) for idcode 107

4.8 Identifying duplicates

The duplicates command can be helpful, but it will only find observations that are exactly the same on all variables unless you specify a variable list, as in duplicates list varlist. We’ll cover this a bit later.

We need to be careful with duplicate observations. When working with panel data, we will want multiple observations of the same cross-sectional unit, but those observations should cover different time periods, never the same one.
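For panel data, the check is that the combination of unit and time identifiers is unique. A sketch, where id and year are hypothetical variable names:

```stata
* In a panel, each unit-time pair should appear exactly once;
* isid errors out if the pair does not uniquely identify observations.
* id and year are hypothetical variable names.
isid id year

* Or, without halting on failure, list pairs that appear more than once.
duplicates list id year
```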

We can start with the duplicates list command to find rows that are completely identical. This command lists every duplicate observation. Note: for records to count as duplicates, every variable must have the same value.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use "dentists_dups.dta", clear
duplicates list
Duplicates in terms of all variables

  +-----------------------------------------------------------+
  | group:   obs:             name   years   fulltime   recom |
  |-----------------------------------------------------------|
  |      1      4       Mike Avity     8.5          0       0 |
  |      1      6       Mike Avity     8.5          0       0 |
  |      1      8       Mike Avity     8.5          0       0 |
  |      2      1   Olive Tu'Drill   10.25          1       1 |
  |      2     11   Olive Tu'Drill   10.25          1       1 |
  |-----------------------------------------------------------|
  |      3      2     Ruth Canaale      22          1       1 |
  |      3      3     Ruth Canaale      22          1       1 |
  +-----------------------------------------------------------+

We can get a more condensed view with duplicates examples, which shows one example record for each group of duplicates.

duplicates examples 
Duplicates in terms of all variables

  +-------------------------------------------------------------------+
  | group:   #   e.g. obs             name   years   fulltime   recom |
  |-------------------------------------------------------------------|
  |      1   3          4       Mike Avity     8.5          0       0 |
  |      2   2          1   Olive Tu'Drill   10.25          1       1 |
  |      3   2          2     Ruth Canaale      22          1       1 |
  +-------------------------------------------------------------------+

We can get a concise summary of duplicate records with duplicates report.

duplicates report
Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |            4             0
        2 |            4             2
        3 |            3             2
--------------------------------------

The duplicates tag command generates a variable giving the number of duplicates of each observation, which is 0 for unique records. (Note: the separator(#) option of list draws a line in the table after every # rows.)

duplicates tag, generate(dup)
list, separator(2)
Duplicates in terms of all variables

     +----------------------------------------------------+
     |              name   years   fulltime   recom   dup |
     |----------------------------------------------------|
  1. |    Olive Tu'Drill   10.25          1       1     1 |
  2. |      Ruth Canaale      22          1       1     1 |
     |----------------------------------------------------|
  3. |      Ruth Canaale      22          1       1     1 |
  4. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  5. |        Mary Smith       3          1       1     0 |
  6. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  7. | Y. Don Uflossmore    7.25          0       1     0 |
  8. |        Mike Avity     8.5          0       0     2 |
     |----------------------------------------------------|
  9. |        Mary Smith      27          0       0     0 |
 10. | Isaac O'Yerbreath   32.75          1       1     0 |
     |----------------------------------------------------|
 11. |    Olive Tu'Drill   10.25          1       1     1 |
     +----------------------------------------------------+

Let’s sort our data for better legibility, and then we will add lines in our output table whenever there is a new group.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use "dentists_dups.dta", clear
sort name years
list, sepby(name years)
     |              name   years   fulltime   recom |
     |----------------------------------------------|
  1. | Isaac O'Yerbreath   32.75          1       1 |
     |----------------------------------------------|
  2. |        Mary Smith       3          1       1 |
     |----------------------------------------------|
  3. |        Mary Smith      27          0       0 |
     |----------------------------------------------|
  4. |        Mike Avity     8.5          0       0 |
  5. |        Mike Avity     8.5          0       0 |
  6. |        Mike Avity     8.5          0       0 |
     |----------------------------------------------|
  7. |    Olive Tu'Drill   10.25          1       1 |
  8. |    Olive Tu'Drill   10.25          1       1 |
     |----------------------------------------------|
  9. |      Ruth Canaale      22          1       1 |
 10. |      Ruth Canaale      22          1       1 |
     |----------------------------------------------|
 11. | Y. Don Uflossmore    7.25          0       1 |
     +----------------------------------------------+

Notice! There are two observations for Mary Smith, but one has 3 years of experience and the other has 27 years, so they do not appear as duplicates.

If there were too many variables to list conveniently, we could use the Data Browser instead.

browse if dup > 0

We will want to drop our duplicates while keeping one copy of each. The duplicates drop command drops the surplus records but keeps one observation from each group of duplicates.

duplicates drop

Let’s use a new dataset that is a bit more practical.

use "wws.dta", replace

Let’s check for duplicate idcodes with the isid command. If there is a duplicate idcode, the command will return an error.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "wws.dta", replace
isid idcode
(Working Women Survey)

Or just use duplicates list for the variable of interest

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
quietly use "wws.dta", replace
duplicates list idcode
Duplicates in terms of idcode

(0 observations are duplicates)

There are no duplicate idcodes, so we should expect that duplicates list (with no variable list) will also return no duplicates, since every single value needs to be the same for a record to count as a duplicate.

Let’s use a dataset that does have duplicates. Let’s check for duplicate idcodes.

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "wws_dups.dta", clear
isid idcode
variable idcode does not uniquely identify the observations
r(459);

Or

quietly cd "/Users/Sam/Desktop/Econ 645/Data/Mitchell"
use "wws_dups.dta", clear
duplicates list idcode, sepby(idcode)
Duplicates in terms of idcode

  +------------------------+
  | group:   obs:   idcode |
  |------------------------|
  |      1   1088     2831 |
  |      1   2248     2831 |
  |------------------------|
  |      2   1244     3905 |
  |      2   1245     3905 |
  |------------------------|
  |      3    277     4214 |
  |      3   2247     4214 |
  +------------------------+

I prefer duplicates list varname over isid, since the error thrown by isid will break a do-file.
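If you do want isid’s strictness without breaking the do-file, you can wrap it in capture and inspect the return code yourself:

```stata
* capture swallows the error; the return code is stored in _rc
* (isid returns 459 when the variable is not a unique identifier).
capture isid idcode
if _rc != 0 {
    display as error "idcode does not uniquely identify the observations"
}
```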

Let’s generate a dup variable recording the number of duplicates of each idcode.

duplicates tag idcode, generate(iddup)

Let’s generate a different duplicate variable to find complete duplicate records. Notice that we do not specify a variable after tag.

duplicates tag, generate(alldup)

Let’s list the observations that duplicate an idcode but are not complete duplicates.

list idcode age race yrschool occupation wage if iddup==1 & alldup==0
Duplicates in terms of idcode


Duplicates in terms of all variables

      +------------------------------------------------------+
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
1244. |   3905    36      1         14         11   4.339774 |
1245. |   3905    41      1         10          5   7.004828 |
      +------------------------------------------------------+

It seems that idcode 3905 is a different person with a possible incorrect idcode. Let’s fix that.

replace idcode=5160 if idcode==3905 & age==41
list idcode age race yrschool occupation wage if iddup==1 & alldup==0
(1 real change made)

      +------------------------------------------------------+
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
1244. |   3905    36      1         14         11   4.339774 |
1245. |   5160    41      1         10          5   7.004828 |
      +------------------------------------------------------+

Let’s look at the observations that are complete duplicates.

duplicates report
list idcode age race yrschool occupation wage if iddup==1 & alldup==1
Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |         2244             0
        2 |            4             2
--------------------------------------

      +------------------------------------------------------+
      | idcode   age   race   yrschool   occupa~n       wage |
      |------------------------------------------------------|
 277. |   4214    35      1         17         13   11.27214 |
1088. |   2831    37      2          8          8   2.697261 |
2247. |   4214    35      1         17         13   11.27214 |
2248. |   2831    37      2          8          8   2.697261 |
      +------------------------------------------------------+

We need to keep one observation from each group of duplicates. Use the duplicates drop command.

duplicates drop
duplicates report
Duplicates in terms of all variables

(2 observations deleted)


Duplicates in terms of all variables

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |         2246             0
--------------------------------------

4.9 Practice

Let’s work with the Census CPS data.

Pull the Census CPS:
- You can download the file and import it, or
- You can use Stata to pull the file directly.

import delimited using "https://www2.census.gov/programs-surveys/cps/datasets/2025/basic/jul25pub.csv", clear

Let’s save a copy of the csv file in Stata format.
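The import delimited call above leaves the data in memory, so a save is all that is needed. A sketch, where the filename is an assumption:

```stata
* Save the imported CPS extract in Stata format for faster reloading.
* The filename is an assumption; use whatever name fits your project.
save "jul25pub.dta", replace
```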

Let’s grab the data dictionary. Hint: search for “Census CPS 2025 data dictionary”.

Let’s check our data:
- What variable contains our labor force status?
- What variable contains our weekly earnings?
- Does everyone employed have weekly earnings? Hint: Try summarize.
- What flag associated with weekly earnings provides information about data availability?
- Try tabstat with the flag and weekly earnings. What do you find?
- Try tabstat with statistics(n mean sd p25 p50 p75).
- What are our identifier variable(s)? How many did you find?
- Are there any duplicate records?