Chapter 3 Data transformation

Many columns in this dataset were already pre-processed by the survey's authors, so our work is mainly to select the variables we need and to rename and reorder the factor levels.

We cleaned the dataset in the following steps:

  • Select AppStore, PriceSensitive, WhyDownload, Currency, and the demographic columns (Gender, Age, Nationality, Education, Occupation, etc.)

  • Convert categorical character variables to factors

  • Rename the factor levels with readable labels

  • Reorder the levels for future visualization
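The four steps above can be sketched with dplyr and forcats. This is a minimal illustration on a toy tibble, not the code used for the report: the raw codes ("apple", "m", and so on), the `Unused` column, and the object names `raw` and `cleaned` are assumptions made up for the example.

```r
library(dplyr)
library(forcats)

# Toy stand-in for the raw survey data; the raw codes below are hypothetical.
raw <- tibble(
  ID       = c("1", "2", "3", "4"),
  AppStore = c("apple", "android", "apple", "none"),
  Gender   = c("m", "f", "m", "f"),
  Age      = c("21", "37", "29", "31"),
  Unused   = c("a", "b", "c", "d")  # hypothetical column we do not need
)

cleaned <- raw %>%
  # Step 1: keep only the variables of interest
  select(ID, AppStore, Gender, Age) %>%
  # Step 2: convert the categorical character variables to factors
  mutate(across(c(AppStore, Gender), as.factor)) %>%
  # Step 3: rename the levels with readable labels (new = "old")
  mutate(
    AppStore = fct_recode(AppStore, iOS = "apple", Android = "android", None = "none"),
    Gender   = fct_recode(Gender, Male = "m", Female = "f")
  ) %>%
  # Step 4: put the levels in the order we want on plot axes
  mutate(AppStore = fct_relevel(AppStore, "iOS", "Android", "None"))

levels(cleaned$AppStore)
```

Naming the columns explicitly inside `across()` leaves free-text and numeric-like columns such as Age as character, which matches the mix of `<fct>` and `<chr>` columns in the printed table. The toy tibble is only illustrative; the output that follows comes from the actual survey data.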

## # A tibble: 10 x 12
##    ID    AppStore PriceSensitive WhyDownload Gender Age   Nationality Education
##    <chr> <fct>    <fct>          <chr>       <fct>  <chr> <fct>       <chr>    
##  1 1     <NA>     <NA>           <NA>        <NA>   <NA>  <NA>        <NA>     
##  2 2     iOS      Yes            <NA>        Male   21    American    5        
##  3 3     iOS      Yes            <NA>        Male   23    Other       6        
##  4 4     Android  Yes            <NA>        Male   37    Italian     7        
##  5 5     iOS      <NA>           Papers      Male   30    Mexican     7        
##  6 6     iOS      <NA>           navigation  Female 29    British     7        
##  7 7     iOS      Yes            EagleEyes   Male   29    Other       5        
##  8 8     Android  Yes            <NA>        Male   31    Australian  5        
##  9 9     iOS      Yes            英辞郎      Male   28    Japanese    7        
## 10 10    None     <NA>           <NA>        Female 31    Japanese    7        
## # ... with 4 more variables: YearsEdu <chr>, Occupation <fct>, Currency <chr>,
## #   HH.Income <fct>

The output above shows the first 10 rows of the main table. We keep the NAs for now and explore the missing values in the next section. For the HH.Income column, the survey's authors had already divided household income into 12 levels to address the problem of inconsistent income levels across countries (1 for the lowest financial level, 11 for the highest, and 12 for “Prefer not to say”). We therefore decided not to convert the currencies into a uniform unit and instead to use the original income levels. We explain this choice in detail in the results part.
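To make the HH.Income coding concrete, the following base R sketch shows one way to keep the 12 levels in numeric order while giving level 12 a readable label. The vector `income_raw` holds made-up values, and storing the codes as character strings is an assumption about the raw export, not something the survey documentation states.

```r
# Hypothetical income codes 1-12, stored as character strings (assumption).
income_raw <- c("3", "12", "11", "1", NA, "7")

# Levels 1-11 keep their numbers; level 12 gets a readable label.
income_labels <- c(as.character(1:11), "Prefer not to say")

# Passing `levels` explicitly fixes the order as 1, 2, ..., 12 instead of
# the alphabetical order ("1", "10", "11", "12", "2", ...) R would infer.
hh_income <- factor(income_raw,
                    levels = as.character(1:12),
                    labels = income_labels)

levels(hh_income)
table(hh_income, useNA = "ifany")
```

The factor is left unordered on purpose: levels 1 to 11 rank financial level, but “Prefer not to say” does not belong on that scale, so it is safer to treat it as a separate category when comparing incomes.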