Chapter 4 Missing values

In this section, we will look at the patterns in missing data. We expect to have many missing values in this Mobile App User Dataset as the data are collected from a survey where respondents may skip questions that they prefer not to answer.

4.1 Column pattern

## PriceSensitive      Education       YearsEdu     Occupation         Gender 
##           7104           4892           4892           4892           4842 
##            Age    Nationality       AppStore 
##           4842           4842           2310

Here, we define PriceSensitive as people who took price into consideration when choosing apps to download (those who chose Price in Q9). On the other hand, NAs in this column represent people who skipped the question or those who didn’t choose Price for the question. In addition to 7,104 missing values in PriceSensitive, 4,892 people didn’t reply to education, years of education and occupation, while 4,842 values are missing for gender, age and nationalities. It also shows that 2,310 people skipped the question of “which app store do you use?” in the survey.

4.2 Value patterns

4.2.1 Missing values by app store

## # A tibble: 11 x 7
##    AppStore        Gender   Age Nationality Education YearsEdu Occupation
##    <fct>            <int> <int>       <int>     <int>    <int>      <int>
##  1 iOS                106   106         106       113      113        113
##  2 Do not use apps     24    24          24        24       24         24
##  3 Blackberry          28    28          28        35       35         35
##  4 Android            167   167         167       181      181        181
##  5 Nokia              102   102         102       105      105        105
##  6 Samsung             76    76          76        80       80         80
##  7 Windows             23    23          23        23       23         23
##  8 None              1595  1595        1595      1595     1595       1595
##  9 Not sure           390   390         390       403      403        403
## 10 Other               21    21          21        23       23         23
## 11 <NA>              2310  2310        2310      2310     2310       2310

When we group the data by different App stores, some missing data appear to be highly correlated: Gender, Age and Nationality share the same patterns of missing values, while Education, YearsEdu and Occupation have exact same numbers of NAs in each App store. This shows that respondents who didn’t answer education-related questions also skipped the one for occupation. People who prefer not to answer their genders and ages avoided answering nationalities too.

If we visualize missing values of Education by App store, we can see that the largest number of NAs in education appears in the group NA in App store. We should consider removing these NAs since they have little information for both variables.

4.2.2 Missing values by app store & price sensitivity

## # A tibble: 11 x 4
##    AppStore        num_appstore num_na pct_na
##    <fct>                  <int>  <int>  <dbl>
##  1 <NA>                    2310   2310   1   
##  2 None                    1653   1633   0.99
##  3 Do not use apps           76     67   0.88
##  4 Not sure                 737    608   0.82
##  5 Nokia                    830    438   0.53
##  6 Other                    150     78   0.52
##  7 Samsung                  722    357   0.49
##  8 Android                 2066    990   0.48
##  9 Windows                  162     72   0.44
## 10 iOS                     1169    428   0.37
## 11 Blackberry               333    123   0.37

All people who skipped writing down their App stores have an NA in PriceSensitive. People using iOS (Apple Store) appear to have the smallest number of missing values for the price question, so they are the most price-sensitive. We can also see that Android/Google and Samsung market users have similar percentages of NAs.

4.3 Missing values by variable

The most obvious pattern is the complete case (no NA in any column), indicating that over 2,500 observations contain all the information we need. The second most frequent missing pattern appears in the PriceSensitive column. It is interesting that many people didn’t provide any of their personal information in the survey, as we can see from the third, fourth & fifth patterns.