6 Data wrangling in R

For advanced, and fast, data handling with large R objects and lots of flexibility, two lines of work are available:

the RStudio line (with Hadley Wickham) offering the packages from the tidyverse: see tidyverse
the data.table line developed by Matt Dowle, see e.g. DataCamp’s course on data.table.

Both have a very specific syntax, with a demanding learning curve.

6.1 Ideas from the `tidyverse`

6.1.1 A `tibble` instead of a `data.frame`

Within the tidyverse tibbles are a modern take on data frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e. converting character vectors to factors). (Quote from tibble vignette) You can use tibble to create a new tibble and as_tibble transforms an object (e.g. a data frame) into a tibble.

library(ggplot2)
diamonds

# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows

6.1.2 Pipes in R

Read the story behind the pipe operator in R in this tutorial from DataCamp pipes in R. In R, the pipe operator is %>%. You can think of this operator as being similar to the + in a ggplot2 statement, as introduced in Chapter 5. It takes the output of one statement and makes it the input of the next statement. When describing it, you can think of it as a “THEN”.

6.1.3 Filter observations using `filter`

Here is a first example of using the pipe in R.

library(ggplot2)
library(dplyr)


Attaching package: 'dplyr'

The following object is masked from 'package:car':

    recode

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

diamonds %>% filter(cut == "Ideal")

# A tibble: 21,551 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
 3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
 4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
 5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
 6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
 7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
 8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
 9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
# ... with 21,541 more rows

The code chunk above will translate to something like “you take the diamonds data, then you subset the data”. This is one of the most powerful things about the tidyverse. In fact, having a standardized chain of processing actions is called “a pipeline”.

Here is another example where you now filter diamonds based on two characteristics.

diamonds %>% filter(cut == "Ideal" & color == "E")

# A tibble: 3,903 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.26 Ideal E     VVS2     62.9    58   554  4.02  4.06  2.54
 3  0.7  Ideal E     SI1      62.5    57  2757  5.7   5.72  3.57
 4  0.59 Ideal E     VVS2     62      55  2761  5.38  5.43  3.35
 5  0.74 Ideal E     SI2      62.2    56  2761  5.8   5.84  3.62
 6  0.7  Ideal E     VS2      60.7    58  2762  5.73  5.76  3.49
 7  0.74 Ideal E     SI1      62.3    54  2762  5.8   5.83  3.62
 8  0.7  Ideal E     SI1      60.9    57  2768  5.73  5.76  3.5 
 9  0.6  Ideal E     VS1      61.7    55  2774  5.41  5.44  3.35
10  0.7  Ideal E     SI1      62.7    55  2774  5.68  5.74  3.58
# ... with 3,893 more rows

diamonds %>% filter(cut == "Ideal" & color %in% c("E", "D"))

# A tibble: 6,737 x 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.3  Ideal D     SI1      62.5    57   552  4.29  4.32  2.69
 3  0.3  Ideal D     SI1      62.1    56   552  4.3   4.33  2.68
 4  0.26 Ideal E     VVS2     62.9    58   554  4.02  4.06  2.54
 5  0.7  Ideal E     SI1      62.5    57  2757  5.7   5.72  3.57
 6  0.59 Ideal E     VVS2     62      55  2761  5.38  5.43  3.35
 7  0.74 Ideal E     SI2      62.2    56  2761  5.8   5.84  3.62
 8  0.7  Ideal E     VS2      60.7    58  2762  5.73  5.76  3.49
 9  0.71 Ideal D     SI2      62.3    56  2762  5.73  5.69  3.56
10  0.74 Ideal E     SI1      62.3    54  2762  5.8   5.83  3.62
# ... with 6,727 more rows

6.1.4 Summarize variables using `summarize`

The code chunk below will translate to something like “you take the diamonds data, then you subset the data and then you calculate mean and standard deviation of these data”.

diamonds %>% filter(cut == "Ideal") %>% summarize(mean = mean(price), std_dev = sd(price))

# A tibble: 1 x 2
   mean std_dev
  <dbl>   <dbl>
1    NA      NA

6.1.5 Summarize based on groupings of another variable

So, here is what you’d like to do.

# base R way with aggregate
aggregate(price ~ cut, diamonds, mean)

        cut  price
1      Fair 4358.8
2      Good 3928.7
3 Very Good 3981.8
4   Premium 4584.3
5     Ideal 3457.0

How can you do this with the pipe?

diamonds %>% group_by(cut) %>% summarize(mean = mean(price))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 5 x 2
  cut        mean
  <ord>     <dbl>
1 Fair      4359.
2 Good        NA 
3 Very Good 3982.
4 Premium   4584.
5 Ideal       NA

Now you want to group by multiple variables.

diamonds %>% group_by(cut, color) %>% summarize(price = mean(price))

`summarise()` regrouping output by 'cut' (override with `.groups` argument)

# A tibble: 35 x 3
# Groups:   cut [5]
   cut   color price
   <ord> <ord> <dbl>
 1 Fair  D     4291.
 2 Fair  E     3682.
 3 Fair  F     3827.
 4 Fair  G     4239.
 5 Fair  H     5136.
 6 Fair  I     4685.
 7 Fair  J     4976.
 8 Good  D     3405.
 9 Good  E       NA 
10 Good  F     3496.
# ... with 25 more rows

Now you want to calculate multiple metrics.

diamonds %>% group_by(cut) %>% summarize(price = mean(price), carat = mean(carat))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 5 x 3
  cut       price carat
  <ord>     <dbl> <dbl>
1 Fair      4359. 1.05 
2 Good        NA  0.849
3 Very Good 3982. 0.806
4 Premium   4584. 0.892
5 Ideal       NA  0.703

And finally, multiple metrics and multiple grouping variables.

diamonds %>% group_by(cut, color) %>% summarize(price = mean(price), carat = mean(carat))

`summarise()` regrouping output by 'cut' (override with `.groups` argument)

# A tibble: 35 x 4
# Groups:   cut [5]
   cut   color price carat
   <ord> <ord> <dbl> <dbl>
 1 Fair  D     4291. 0.920
 2 Fair  E     3682. 0.857
 3 Fair  F     3827. 0.905
 4 Fair  G     4239. 1.02 
 5 Fair  H     5136. 1.22 
 6 Fair  I     4685. 1.20 
 7 Fair  J     4976. 1.34 
 8 Good  D     3405. 0.745
 9 Good  E       NA  0.745
10 Good  F     3496. 0.776
# ... with 25 more rows

6.1.6 Joining tibbles

Now you want to add the mean price and mean carat per cut to the original tibble. You use the variable cut as the key to identify observations.

d <- diamonds %>% group_by(cut) %>% summarize(price = mean(price), carat = mean(carat))

`summarise()` ungrouping output (override with `.groups` argument)

new_diamonds <- diamonds %>% inner_join(d, by = "cut")
View(diamonds)
View(new_diamonds)

6.2 Data science the `data.table` way

6.2.1 Speed junkies love `data.table`

data.table is a package designed for speed junkies. “The R data.table package is rapidly making its name as the number one choice for handling large datasets in R.” It extends and exchanges the functionality of the basic data.frame in R. The syntax is different and you’ll have to get used to it. A data.table cheat sheet is available here.

6.2.2 What is a `data.table`?

Here you see some basic illustrations with the diamonds data.

library(data.table)

data.table 1.13.2 using 6 threads (see ?getDTthreads).  Latest news: r-datatable.com


Attaching package: 'data.table'

The following objects are masked from 'package:dplyr':

    between, first, last

library(ggplot2)
str(diamonds)

tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

diamonds_DT <- data.table(diamonds)
diamonds_DT # notice intelligent printing of this DT

       carat       cut color clarity depth table price    x    y    z
    1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    2:  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
   ---                                                               
53936:  0.72     Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
53937:  0.72      Good     D     SI1  63.1    55  2757 5.69 5.75 3.61
53938:  0.70 Very Good     D     SI1  62.8    60  2757 5.66 5.68 3.56
53939:  0.86   Premium     H     SI2  61.0    58  2757 6.15 6.12 3.74
53940:  0.75     Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

summary(diamonds_DT$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

6.2.3 Identify keys

Instead of using subset from the base R, you will use the setkey to extract the observations you want to have.

# key is used to index the data.table and will provide the extra speed
setkey(diamonds_DT, cut)
tables()

          NAME   NROW NCOL MB                                    COLS KEY
1: diamonds_DT 53,940   10  3 carat,cut,color,clarity,depth,table,... cut
Total: 3MB

diamonds_DT[J("Ideal"), ]

       carat   cut color clarity depth table price    x    y    z
    1:  0.23 Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    2:  0.23 Ideal     J     VS1  62.8    56   340 3.93 3.90 2.46
    3:  0.31 Ideal     J     SI2  62.2    54   344 4.35 4.37 2.71
    4:  0.30 Ideal     I     SI2  62.0    54   348 4.31 4.34 2.68
    5:  0.33 Ideal     I     SI2  61.8    55   403 4.49 4.51 2.78
   ---                                                           
21547:  0.79 Ideal     I     SI1  61.6    56  2756 5.95 5.97 3.67
21548:  0.71 Ideal     E     SI1  61.9    56  2756 5.71 5.73 3.54
21549:  0.71 Ideal     G     VS1  61.4    56  2756 5.76 5.73 3.53
21550:  0.72 Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
21551:  0.75 Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

# more than one column can be set as key
setkey(diamonds_DT, cut, color)
tables()

          NAME   NROW NCOL MB                                    COLS       KEY
1: diamonds_DT 53,940   10  3 carat,cut,color,clarity,depth,table,... cut,color
Total: 3MB

# access rows according to both keys, use function 'J'
diamonds_DT[J("Ideal", "E"), ]

      carat   cut color clarity depth table price    x    y    z
   1:  0.23 Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
   2:  0.26 Ideal     E    VVS2  62.9    58   554 4.02 4.06 2.54
   3:  0.70 Ideal     E     SI1  62.5    57  2757 5.70 5.72 3.57
   4:  0.59 Ideal     E    VVS2  62.0    55  2761 5.38 5.43 3.35
   5:  0.74 Ideal     E     SI2  62.2    56  2761 5.80 5.84 3.62
  ---                                                           
3899:  0.70 Ideal     E     SI1  61.7    55  2745 5.71 5.74 3.53
3900:  0.51 Ideal     E    VVS1  61.9    54  2745 5.17 5.11 3.18
3901:  0.56 Ideal     E    VVS1  62.1    56  2750 5.28 5.29 3.28
3902:  0.77 Ideal     E     SI2  62.1    56  2753 5.84 5.86 3.63
3903:  0.71 Ideal     E     SI1  61.9    56  2756 5.71 5.73 3.54

diamonds_DT[J("Ideal", c("E", "D")), ]

      carat   cut color clarity depth table price    x    y    z
   1:  0.23 Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
   2:  0.26 Ideal     E    VVS2  62.9    58   554 4.02 4.06 2.54
   3:  0.70 Ideal     E     SI1  62.5    57  2757 5.70 5.72 3.57
   4:  0.59 Ideal     E    VVS2  62.0    55  2761 5.38 5.43 3.35
   5:  0.74 Ideal     E     SI2  62.2    56  2761 5.80 5.84 3.62
  ---                                                           
6733:  0.51 Ideal     D    VVS2  61.7    56  2742 5.16 5.14 3.18
6734:  0.51 Ideal     D    VVS2  61.3    57  2742 5.17 5.14 3.16
6735:  0.81 Ideal     D     SI1  61.5    57  2748 6.00 6.03 3.70
6736:  0.72 Ideal     D     SI1  60.8    57  2757 5.75 5.76 3.50
6737:  0.75 Ideal     D     SI2  62.2    55  2757 5.83 5.87 3.64

# what would be the alternative with base R?
subset(diamonds, diamonds$cut == "Ideal" && diamonds$color == c("E", "D"))

# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows

6.2.4 Alternative and faster ways to `aggregate`

Instead of using aggregate from the base R, you will identify the by variable(s).

# base R way with aggregate
aggregate(price ~ cut, diamonds, mean)

        cut  price
1      Fair 4358.8
2      Good 3928.7
3 Very Good 3981.8
4   Premium 4584.3
5     Ideal 3457.0

system.time(aggregate(price ~ cut, diamonds, mean))

   user  system elapsed 
   0.01    0.00    0.03

# aggregation with data.table
# will go faster thanks to indexing
diamonds_DT[ , mean(price), by=cut]

         cut     V1
1:      Fair 4358.8
2:      Good     NA
3: Very Good 3981.8
4:   Premium 4584.3
5:     Ideal     NA

system.time(diamonds_DT[ , mean(price), by=cut])

   user  system elapsed 
   0.01    0.00    0.02

# give variable names in the create date.table
diamonds_DT[ , list(price = mean(price)), by=cut]

         cut  price
1:      Fair 4358.8
2:      Good     NA
3: Very Good 3981.8
4:   Premium 4584.3
5:     Ideal     NA

# aggregate on multiple columns
diamonds_DT[ , mean(price), by=list(cut,color)]

          cut color     V1
 1:      Fair     D 4291.1
 2:      Fair     E 3682.3
 3:      Fair     F 3827.0
 4:      Fair     G 4239.3
 5:      Fair     H 5135.7
 6:      Fair     I 4685.4
 7:      Fair     J 4975.7
 8:      Good     D 3405.4
 9:      Good     E     NA
10:      Good     F 3495.8
11:      Good     G 4123.5
12:      Good     H 4276.3
13:      Good     I 5078.5
14:      Good     J 4574.2
15: Very Good     D 3470.5
16: Very Good     E 3214.7
17: Very Good     F 3778.8
18: Very Good     G 3872.8
19: Very Good     H 4535.4
20: Very Good     I 5255.9
21: Very Good     J 5103.5
22:   Premium     D 3631.3
23:   Premium     E 3538.9
24:   Premium     F 4324.9
25:   Premium     G 4500.7
26:   Premium     H 5216.7
27:   Premium     I 5946.2
28:   Premium     J 6294.6
29:     Ideal     D 2629.1
30:     Ideal     E 2597.6
31:     Ideal     F 3374.9
32:     Ideal     G 3720.7
33:     Ideal     H     NA
34:     Ideal     I 4452.0
35:     Ideal     J 4918.2
          cut color     V1

# aggregate multiple arguments
diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = cut]

         cut  price   carat
1:      Fair 4358.8 1.04614
2:      Good     NA 0.84918
3: Very Good 3981.8 0.80638
4:   Premium 4584.3 0.89195
5:     Ideal     NA 0.70284

diamonds_DT[ , list(price = mean(price), carat = mean(carat), caratSum = sum(carat)), by=cut]

         cut  price   carat caratSum
1:      Fair 4358.8 1.04614   1684.3
2:      Good     NA 0.84918   4166.1
3: Very Good 3981.8 0.80638   9742.7
4:   Premium 4584.3 0.89195  12301.0
5:     Ideal     NA 0.70284  15146.8

# multiple metrics and multiple grouping variables
diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = list(cut, color)]

          cut color  price   carat
 1:      Fair     D 4291.1 0.92012
 2:      Fair     E 3682.3 0.85661
 3:      Fair     F 3827.0 0.90471
 4:      Fair     G 4239.3 1.02382
 5:      Fair     H 5135.7 1.21917
 6:      Fair     I 4685.4 1.19806
 7:      Fair     J 4975.7 1.34118
 8:      Good     D 3405.4 0.74452
 9:      Good     E     NA 0.74513
10:      Good     F 3495.8 0.77593
11:      Good     G 4123.5 0.85090
12:      Good     H 4276.3 0.91473
13:      Good     I 5078.5 1.05722
14:      Good     J 4574.2 1.09954
15: Very Good     D 3470.5 0.69642
16: Very Good     E 3214.7 0.67632
17: Very Good     F 3778.8 0.74096
18: Very Good     G 3872.8 0.76680
19: Very Good     H 4535.4 0.91595
20: Very Good     I 5255.9 1.04695
21: Very Good     J 5103.5 1.13322
22:   Premium     D 3631.3 0.72155
23:   Premium     E 3538.9 0.71774
24:   Premium     F 4324.9 0.82704
25:   Premium     G 4500.7 0.84149
26:   Premium     H 5216.7 1.01645
27:   Premium     I 5946.2 1.14494
28:   Premium     J 6294.6 1.29309
29:     Ideal     D 2629.1 0.56577
30:     Ideal     E 2597.6 0.57840
31:     Ideal     F 3374.9 0.65583
32:     Ideal     G 3720.7 0.70071
33:     Ideal     H     NA 0.79952
34:     Ideal     I 4452.0 0.91303
35:     Ideal     J 4918.2 1.06359
          cut color  price   carat

6.2.5 Joining `data.tables`

How to join data.tables?

# join two data.tables
d <- diamonds_DT[ , list(price = mean(price), carat = mean(carat)), by = cut]
d

         cut  price   carat
1:      Fair 4358.8 1.04614
2:      Good     NA 0.84918
3: Very Good 3981.8 0.80638
4:   Premium 4584.3 0.89195
5:     Ideal     NA 0.70284

setkey(diamonds_DT, cut)
dmerge <- diamonds_DT[d]
dmerge

       carat   cut color clarity depth table price    x    y    z i.price
    1:  0.75  Fair     D     SI2  64.6    57  2848 5.74 5.72 3.70  4358.8
    2:  0.71  Fair     D     VS2  56.9    65  2858 5.89 5.84 3.34  4358.8
    3:  0.90  Fair     D     SI2  66.9    57  2885 6.02 5.90 3.99  4358.8
    4:  1.00  Fair     D     SI2  69.3    58  2974 5.96 5.87 4.10  4358.8
    5:  1.01  Fair     D     SI2  64.6    56  3003 6.31 6.24 4.05  4358.8
   ---                                                                   
53936:  0.71 Ideal     J     SI1  60.6    57  2700 5.78 5.83 3.52      NA
53937:  0.81 Ideal     J     VS2  62.1    56  2708 5.92 5.97 3.69      NA
53938:  0.84 Ideal     J     VS2  61.1    57  2709 6.09 6.12 3.73      NA
53939:  0.82 Ideal     J     VS2  61.6    56  2741 6.00 6.04 3.71      NA
53940:  0.83 Ideal     J     VS2  62.3    55  2742 6.01 6.03 3.75      NA
       i.carat
    1: 1.04614
    2: 1.04614
    3: 1.04614
    4: 1.04614
    5: 1.04614
   ---        
53936: 0.70284
53937: 0.70284
53938: 0.70284
53939: 0.70284
53940: 0.70284

6.3 Exercises

Learning check

(An exercise taken from (Kleiber and Zeileis 2008)) “PARADE” is the Sunday newspaper magazine supplementing the Sunday or weekend edition of some 500 daily newspapers in the United States of America. An important yearly feature is an article providing information on some 120150 “randomly” selected US citizens, indicating their profession, hometown and state, and their yearly earnings. The Parade2005 (in library AER) data contain the 2005 version, amended by a variable indicating celebrity status (motivated by substantial oversampling of celebrities in these data). For the Parade2005 data and by using %>% answer the following questions.

Load the data Parade2005 from the AER package, use data("Parade2005") to make the data accessible.
Determine the mean earnings in California.
Determine the number of individuals residing in Idaho.
Determine the mean and the median earnings of celebrities.

6 Data wrangling in R

6.1 Ideas from the tidyverse

6.1.1 A tibble instead of a data.frame

6.1.2 Pipes in R

6.1.3 Filter observations using filter

6.1.4 Summarize variables using summarize