Trying out tidyr 0.4

I heard through the grapevine that tidyr 0.4 had been released, so I gave it a try.


nest() and unnest() have been overhauled to support a useful way of structuring data frames: the nested data frame. In a grouped data frame, you have one row per observation, and additional metadata define the groups. In a nested data frame, you have one row per group, and the individual observations are stored in a column that is a list of data frames. This is a useful structure when you have lists of other objects (like models) with one element per group.

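This change is a tough one. I can't come up with a concrete example of "when you have lists of other objects (like models) with one element per group", so for a proper explanation I'll wait for the RStudio blog post. For now, I have only traced how the functions behave.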


nest() now produces a single list of data frames


# Presumably the output before 0.4: each column is nested separately
iris %>% nest(-Species)
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#>      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>       (fctr)        (chr)       (chr)        (chr)       (chr)
#> 1     setosa    <dbl[50]>   <dbl[50]>    <dbl[50]>   <dbl[50]>
#> 2 versicolor    <dbl[50]>   <dbl[50]>    <dbl[50]>   <dbl[50]>
#> 3  virginica    <dbl[50]>   <dbl[50]>    <dbl[50]>   <dbl[50]>


# With 0.4: a single list column named data
iris %>% nest(-Species)
#>      Species
#> 1     setosa
#> 2 versicolor
#> 3  virginica
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     data
#> 1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5.1, 5.0, 4.5, 4.4, 5.0, 5.1, 4.8, 5.1, 4.6, 5.3, 5.0, 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3.0, 3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.6, 3.0, 3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3.0, 3.8, 3.2, 3.7, 3.3, 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.9, 1.6, 1.6, 1.5, 1.4, 1.6, 1.6, 1.5, 1.5, 1.4, 1.5, 1.2, 1.3, 1.4, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6, 1.9, 1.4, 1.6, 1.4, 1.5, 1.4, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.4, 0.2, 0.5, 0.2, 0.2, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4, 0.1, 0.2, 0.2, 0.2, 0.2, 0.1, 0.2, 0.2, 0.3, 0.3, 0.2, 0.6, 0.4, 0.3, 0.2, 0.2, 0.2, 0.2
#> 2 7.0, 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5.0, 5.9, 6.0, 6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7, 6.0, 5.7, 5.5, 5.5, 5.8, 6.0, 5.4, 6.0, 6.7, 6.3, 5.6, 5.5, 5.5, 6.1, 5.8, 5.0, 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 3.2, 3.2, 3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2.0, 3.0, 2.2, 2.9, 2.9, 3.1, 3.0, 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3.0, 2.8, 3.0, 2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3.0, 3.4, 3.1, 2.3, 3.0, 2.5, 2.6, 3.0, 2.6, 2.3, 2.7, 3.0, 2.9, 2.9, 2.5, 2.8, 4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 4.6, 3.9, 3.5, 4.2, 4.0, 4.7, 3.6, 4.4, 4.5, 4.1, 4.5, 3.9, 4.8, 4.0, 4.9, 4.7, 4.3, 4.4, 4.8, 5.0, 4.5, 3.5, 3.8, 3.7, 3.9, 5.1, 4.5, 4.5, 4.7, 4.4, 4.1, 4.0, 4.4, 4.6, 4.0, 3.3, 4.2, 4.2, 4.2, 4.3, 3.0, 4.1, 1.4, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6, 1.0, 1.3, 1.4, 1.0, 1.5, 1.0, 1.4, 1.3, 1.4, 1.5, 1.0, 1.5, 1.1, 1.8, 1.3, 1.5, 1.2, 1.3, 1.4, 1.4, 1.7, 1.5, 1.0, 1.1, 1.0, 1.2, 1.6, 1.5, 1.6, 1.5, 1.3, 1.3, 1.3, 1.2, 1.4, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1, 1.3
#> 3 6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6.0, 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0, 6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9, 3.3, 2.7, 3.0, 2.9, 3.0, 3.0, 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3.0, 2.5, 2.8, 3.2, 3.0, 3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3.0, 2.8, 3.0, 2.8, 3.8, 2.8, 2.8, 2.6, 3.0, 3.4, 3.1, 3.0, 3.1, 3.1, 3.1, 2.7, 3.2, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0, 6.0, 5.1, 5.9, 5.6, 5.8, 6.6, 4.5, 6.3, 5.8, 6.1, 5.1, 5.3, 5.5, 5.0, 5.1, 5.3, 5.5, 6.7, 6.9, 5.0, 5.7, 4.9, 6.7, 4.9, 5.7, 6.0, 4.8, 4.9, 5.6, 5.8, 6.1, 6.4, 5.6, 5.1, 5.6, 6.1, 5.6, 5.5, 4.8, 5.4, 5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1, 2.5, 1.9, 2.1, 1.8, 2.2, 2.1, 1.7, 1.8, 1.8, 2.5, 2.0, 1.9, 2.1, 2.0, 2.4, 2.3, 1.8, 2.2, 2.3, 1.5, 2.3, 2.0, 2.0, 1.8, 2.1, 1.8, 1.8, 1.8, 2.1, 1.6, 1.9, 2.0, 2.2, 1.5, 1.4, 2.3, 2.4, 1.8, 1.8, 2.1, 2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8


iris %>% nest(-Species) %>% str(max.level = 2L)
#> 'data.frame':  3 obs. of  2 variables:
#>  $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
#>  $ data   :List of 3
#>   ..$ :'data.frame':   50 obs. of  4 variables:
#>   ..$ :'data.frame':   50 obs. of  4 variables:
#>   ..$ :'data.frame':   50 obs. of  4 variables:



gapminder %>%
  group_by(country, continent) %>%
  nest()
#> Source: local data frame [142 x 3]
#>        country continent            data
#>         (fctr)    (fctr)          (list)
#> 1  Afghanistan      Asia <tbl_df [12,4]>
#> 2      Albania    Europe <tbl_df [12,4]>
#> 3      Algeria    Africa <tbl_df [12,4]>
#> 4       Angola    Africa <tbl_df [12,4]>
#> 5    Argentina  Americas <tbl_df [12,4]>
#> ..         ...       ...             ...


# Without group_by(): nest every column except country and continent
gapminder %>%
  nest(-country, -continent)
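
As for the release notes' "lists of other objects (like models) with one element per group", one plausible reading is fitting one model per group and keeping the fits next to the nested data. The following is only a sketch under that assumption: the choice of lm(), the formula, and the column name model are my own illustration, not something taken from the release notes.

```r
library(tidyr)
library(dplyr)

# One row per species; the observations live in the list column `data`
by_species <- iris %>%
  group_by(Species) %>%
  nest()

# Hypothetical example: fit one linear model per group and store the fits
# in a new list column (`model` is an illustrative name)
by_species$model <- lapply(
  by_species$data,
  function(d) lm(Sepal.Length ~ Sepal.Width, data = d)
)

# Pull one number out of each model, still one element per group
slopes <- sapply(by_species$model, function(m) coef(m)[["Sepal.Width"]])
```

The point of the structure seems to be that the data, the model, and anything derived from the model stay aligned row by row, one row per group.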


unnest() gains a .drop argument which controls what happens to other list columns.

When a data frame has two or more nested columns, unnest() by default drops the list columns you did not specify. With .drop = FALSE, the unspecified nested columns are kept as they are.


df <- data_frame(
  a = list(c("a", "b"), "c"),
  b = list(1:2, 3),
  c = c(11, 22)
)

# If b is not specified, b is dropped
df %>% unnest(a)
#> Source: local data frame [3 x 2]
#>       c     a
#>   (dbl) (chr)
#> 1    11     a
#> 2    11     b
#> 3    22     c

# With .drop = FALSE, b is kept
df %>% unnest(a, .drop = FALSE)
#> Source: local data frame [3 x 3]
#>          b     c     a
#>     (list) (dbl) (chr)
#> 1 <int[2]>    11     a
#> 2 <int[2]>    11     b
#> 3 <dbl[1]>    22     c


df %>% unnest(a, b)

df %>% unnest(a, .drop = FALSE) %>% unnest(b)



expand() once again allows you to evaluate arbitrary expressions like full_seq(year).
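
To see what "evaluate arbitrary expressions" means here, a small sketch may help. The data frame d and its columns are made up for illustration; the expression full_seq(year, 1) is evaluated against the columns of d.

```r
library(tidyr)

# Hypothetical data with gaps in `year` (2012 and 2013 are missing)
d <- data.frame(year = c(2010, 2011, 2014), value = c(1, 2, 3))

# The expression is evaluated inside the data frame, so the result
# has one row per year in the completed sequence
all_years <- expand(d, year = full_seq(year, 1))
```

Here all_years should contain the five years from 2010 to 2014, including the two that never appear in d.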



nesting() and crossing() allow you to create nested and crossed data frames from individual vectors. crossing() is similar to base::expand.grid()


nesting(x = 1:3, y = 3:1)
#> Source: local data frame [3 x 2]
#>       x     y
#>   (int) (int)
#> 1     1     3
#> 2     2     2
#> 3     3     1

# Same as expand(data.frame(x = 1:3, y = 3:1), x, y)
crossing(x = 1:3, y = 3:1)
#> Source: local data frame [9 x 2]
#>       x     y
#>   (int) (int)
#> 1     1     1
#> 2     1     2
#> 3     1     3
#> 4     2     1
#> 5     2     2
#> 6     2     3
#> 7     3     1
#> 8     3     2
#> 9     3     3
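
Since the release notes compare crossing() to base::expand.grid(), here is a rough side-by-side sketch. The variable names are mine, and the comments describe what I would expect from the output shown above rather than anything stated in the release notes.

```r
library(tidyr)

crossed <- crossing(x = 1:3, y = 3:1)
gridded <- expand.grid(x = 1:3, y = 3:1)

# Both contain all 9 combinations
nrow(crossed)
nrow(gridded)

# crossing() sorts the values, so the first row is x = 1, y = 1;
# expand.grid() keeps the input order and varies x fastest,
# so its first row is x = 1, y = 3
crossed[1, ]
gridded[1, ]
```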

nesting() is handy inside expand(). For example, when you want only the combinations of x and y that actually appear in the data, specify nesting(x, y).

d <- data.frame(x = 1:3, y = 3:1, z = c(1, 2, 1))

expand(d, x, y, z)
#> Source: local data frame [18 x 3]
#>        x     y     z
#>    (int) (int) (dbl)
#> 1      1     1     1
#> 2      1     1     2
#> 3      1     2     1
#> 4      1     2     2
#> 5      1     3     1
#> 6      1     3     2
#> 7      2     1     1
#> 8      2     1     2
#> 9      2     2     1
#> 10     2     2     2
#> ...

expand(d, nesting(x, y), z)
#> Source: local data frame [6 x 3]
#>       x     y     z
#>   (int) (int) (dbl)
#> 1     1     3     1
#> 2     1     3     2
#> 3     2     2     1
#> 4     2     2     2
#> 5     3     1     1
#> 6     3     1     2


full_seq(x, period) creates the full sequence of values from min(x) to max(x) every period values.


full_seq(c(1, 2, 4, 5, 10), period = 1)
#>  [1]  1  2  3  4  5  6  7  8  9 10





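Peeking at the Issues, the person with the Tintin avatar, a familiar face around purrr, shows up here and there. I get the impression that these ideas came out of tinkering with purrr. It looks like dplyr might get breaking changes in the same way...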