Chapter 3 Data transformation
Many columns in this dataset were already pre-processed by the survey authors, so our work is limited to selecting the variables we need and renaming and reordering the factor levels.
We cleaned the dataset in the following steps:
1. Select AppStore, PriceSensitive, WhyDownload, Currency and the demographic columns (Gender, Age, Nationality, Education, Occupation, etc.)
2. Mutate character variables to factors
3. Rename the levels with readable information
4. Reorder the levels for future visualization
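The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used for this chapter: the toy table `raw`, its raw codes (`"ios"`, `"m"`, ...) and the `Unused` column are hypothetical, and the use of `dplyr` and `forcats` is our assumption about a convenient toolset.

```r
library(dplyr)
library(forcats)

# Toy stand-in for the raw survey table; values and raw codes are made up
raw <- tibble(
  ID = c("1", "2", "3"),
  AppStore = c("ios", "android", "none"),
  Gender = c("m", "f", "m"),
  Age = c("21", "37", "31"),
  Unused = c("x", "y", "z")
)

clean <- raw %>%
  # Step 1: keep only the variables of interest
  select(ID, AppStore, Gender, Age) %>%
  # Step 2: convert character variables to factors
  mutate(across(c(AppStore, Gender), as.factor)) %>%
  # Step 3: rename levels with readable labels (new = "old")
  mutate(
    AppStore = fct_recode(AppStore, iOS = "ios", Android = "android", None = "none"),
    Gender = fct_recode(Gender, Male = "m", Female = "f")
  ) %>%
  # Step 4: put the levels in the order wanted for plots
  mutate(AppStore = fct_relevel(AppStore, "iOS", "Android", "None"))
```

The printed table that follows comes from the actual survey data, not from this toy example.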
## # A tibble: 10 x 12
## ID AppStore PriceSensitive WhyDownload Gender Age Nationality Education
## <chr> <fct> <fct> <chr> <fct> <chr> <fct> <chr>
## 1 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 2 iOS Yes <NA> Male 21 American 5
## 3 3 iOS Yes <NA> Male 23 Other 6
## 4 4 Android Yes <NA> Male 37 Italian 7
## 5 5 iOS <NA> Papers Male 30 Mexican 7
## 6 6 iOS <NA> navigation Female 29 British 7
## 7 7 iOS Yes EagleEyes Male 29 Other 5
## 8 8 Android Yes <NA> Male 31 Australian 5
## 9 9 iOS Yes 英辞郎 Male 28 Japanese 7
## 10 10 None <NA> <NA> Female 31 Japanese 7
## # ... with 4 more variables: YearsEdu <chr>, Occupation <fct>, Currency <chr>,
## # HH.Income <fct>
Here are the first 10 rows of the main table. We keep the NAs for now and will explore the missing values in the next section. For the HH.Income column, the survey authors already divided the data into 12 levels to address the problem of inconsistent income levels across countries (1 for the lowest and 11 for the highest financial level, plus 12 for “Prefer not to say”). We therefore decided not to convert the currencies into a uniform unit and to use the original income levels instead. We explain this in detail in the results part.
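As a rough illustration of working with the original income levels, the sketch below fixes the level order numerically and labels level 12. The vector `income_raw` is hypothetical toy data, and the helper name `income_rank` is ours; this is one possible treatment under those assumptions, not necessarily the one used in this analysis.

```r
# Hypothetical raw codes; the real column holds the authors' levels 1-12
income_raw <- c("3", "12", "1", "11", NA)

# Fix the level order numerically so that "10" does not sort before "2"
income <- factor(income_raw, levels = as.character(1:12))

# Give level 12 a readable label; levels 1-11 keep their codes
levels(income)[12] <- "Prefer not to say"

# Only levels 1-11 carry a rank, so drop level 12 when an ordering is needed
income_rank <- ifelse(income == "Prefer not to say", NA, as.integer(income))
```

Keeping "Prefer not to say" as its own level, rather than recoding it to NA up front, leaves it distinguishable from genuinely missing answers.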