Using Elasticsearch from R with the elastic package

At the last Tokyo.R I heard that there is an R package for working with Elasticsearch. I looked it up, and it turns out the ever-reliable rOpenSci has built one.

<a href="https://github.com/ropensci/elastic">ropensci/elastic</a>

So, let's give it a quick try.

Installing Elasticsearch

Installing locally

The README of the elastic package describes several ways to do this. If you have Docker at hand, there is an official image, which is probably the easiest route.

https://github.com/ropensci/elastic#install-elasticsearch

Installing on another machine (a VPS, for example)

I wrote up some notes on this, so please refer to them.

<a href="http://notchained.hatenablog.com/entry/2015/03/29/134033">Notes on installing Elasticsearch 1.5 on Ubuntu 14.04 - Technically, technophobic.</a>

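If you want Elasticsearch to be reachable from outside, it is probably safer to change the port from the default. Add the following setting to /etc/elasticsearch/elasticsearch.yml, replacing XXXX with a port number of your choice.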

http.port: XXXX

After changing the configuration, restart Elasticsearch.

sudo service elasticsearch restart

Also, configure the firewall to allow traffic on that port. With ufw, it looks like this.

sudo ufw allow in XXXX/tcp

Installing elastic

Reference: https://github.com/ropensci/elastic#install-elasticsearch

It doesn't seem to be on CRAN, so install it with devtools.

devtools::install_github("ropensci/elastic")

Configuring elastic

Reference: https://github.com/ropensci/elastic#initialization

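Before using the Elasticsearch API, you apparently specify the IP address and port with a function called connect(). (If Elasticsearch is running locally, the defaults work as they are, so this step is unnecessary.)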

library(elastic)
connect(es_base = "http://XX.XX.XX.XX", es_port = XXXX)

Trying it out

To start with, let's load some test data.

plosdat <- system.file("examples", "plos_data.json", package = "elastic")
docs_bulk(plosdat)

For example, to fetch a single document by its id, use docs_get().

docs_get(index = 'plos', type = 'article', id = 4)

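This is a wrapper for the Get API in the Document APIs, and most of the API wrappers are provided under names of the form

APItype_operation()

There seem to be a few irregular ones, such as Search() and explain(), but it looks like most operations can be done from R.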

ls('package:elastic')
#>  [1] "alias_create"          "alias_delete"          "aliases_get"          
#>  [4] "alias_exists"          "alias_get"             "cat_"                 
#>  [7] "cat_aliases"           "cat_allocation"        "cat_count"            
#> [10] "cat_fielddata"         "cat_health"            "cat_indices"          
#> [13] "cat_master"            "cat_nodes"             "cat_pending_tasks"    
#> [16] "cat_plugins"           "cat_recovery"          "cat_segments"         
#> [19] "cat_shards"            "cat_thread_pool"       "cluster_health"       
#> [22] "cluster_pending_tasks" "cluster_reroute"       "cluster_settings"     
#> [25] "cluster_state"         "cluster_stats"         "connect"              
#> [28] "connection"            "count"                 "docs_bulk"            
#> [31] "docs_create"           "docs_delete"           "docs_get"             
#> [34] "docs_mget"             "es_parse"              "explain"              
#> [37] "field_mapping_get"     "index_analyze"         "index_clear_cache"    
#> [40] "index_close"           "index_create"          "index_delete"         
#> [43] "index_exists"          "index_flush"           "index_get"            
#> [46] "index_open"            "index_optimize"        "index_recovery"       
#> [49] "index_segments"        "index_settings"        "index_stats"          
#> [52] "index_status"          "index_upgrade"         "mapping_create"       
#> [55] "mapping_delete"        "mapping_get"           "mlt"                  
#> [58] "nodes_hot_threads"     "nodes_info"            "nodes_stats"          
#> [61] "ping"                  "scroll"                "Search"               
#> [64] "search_shards"         "tokenizer_set"         "type_exists"
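
The naming convention is easy to see if you group the function names by whatever comes before the first underscore. The following is only a rough sketch: the names are a hand-picked subset of the listing above, hard-coded so that it runs without the elastic package installed, and the variable names (fn_names, by_family) are made up for illustration.

```r
# A subset of the exported names listed above, hard-coded so the snippet runs
# without the elastic package. With the package attached, ls("package:elastic")
# could be used instead.
fn_names <- c("alias_create", "alias_delete", "cat_health", "cat_indices",
              "docs_bulk", "docs_get", "index_create", "index_delete",
              "Search", "explain")

# Treat everything before the first underscore as the API family.
# Names without an underscore (Search, explain) end up as their own group.
prefix <- sub("_.*$", "", fn_names)
by_family <- split(fn_names, prefix)

print(by_family)
```

The irregular names end up in one-element groups of their own, so they are easy to spot.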

A query can be given either as a list or as a JSON string. The two examples below are equivalent.

body <- list(query = list(query_string = list(query = "cell")),
             highlight = list(fields = list(title = list(number_of_fragments = 2))))
out <- Search('plos', 'article', body=body)

body <- '{
 "query": {
   "query_string": {
     "query" : "cell"
   }
 },
 "highlight": {
   "fields": {
     "title": {"number_of_fragments": 2}
   }
 }
}'
out <- Search('plos', 'article', body=body)
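
To convince yourself that the two forms describe the same query, you can compare them without touching Elasticsearch at all. This is a minimal sketch that assumes the jsonlite package is installed. It only shows that the list and the JSON string serialize to the same JSON, and it makes no claim about how Search() handles the body internally. The names body_list and body_json are made up for this example.

```r
library(jsonlite)

# The query as an R list (same content as the first example above).
body_list <- list(
  query = list(query_string = list(query = "cell")),
  highlight = list(fields = list(title = list(number_of_fragments = 2)))
)

# The query as a JSON string (same content as the second example above).
body_json <- '{
  "query": {"query_string": {"query": "cell"}},
  "highlight": {"fields": {"title": {"number_of_fragments": 2}}}
}'

# Serialize the list, and round-trip the string through a parser, so that
# both end up as compact JSON in the same key order.
json_from_list <- as.character(toJSON(body_list, auto_unbox = TRUE))
json_from_string <- as.character(
  toJSON(fromJSON(body_json, simplifyVector = FALSE), auto_unbox = TRUE)
)

print(identical(json_from_list, json_from_string))
```

Note that auto_unbox = TRUE matters here: without it, jsonlite writes length-one vectors as JSON arrays, which is not what the query expects.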

So, it looks quite convenient. However, the data comes back as a nested list, so it seems some preprocessing is needed before it can be used for analysis in R.

I'll put down my pen here, in the hope that the hardcore rlist users will take it from there.

str(out, max.level = 3)
#> List of 4
#>  $ took     : int 169
#>  $ timed_out: logi FALSE
#>  $ _shards  :List of 3
#>   ..$ total     : int 5
#>   ..$ successful: int 5
#>   ..$ failed    : int 0
#>  $ hits     :List of 3
#>   ..$ total    : int 58
#>   ..$ max_score: num 1.25
#>   ..$ hits     :List of 10
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
#>   .. ..$ :List of 7
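
For what it's worth, the preprocessing can be done in base R along these lines. This is only a sketch on a hand-made mock object, not on real output: the field names inside each hit (`_id`, `_score`, `_source$title`) and the titles are assumptions made up for illustration, so check the actual structure with str() before adapting it.

```r
# Mock of the nested structure returned by Search(). The fields inside each
# hit are assumptions for illustration, not taken from real output.
out_mock <- list(
  hits = list(
    total = 2L,
    hits = list(
      list(`_id` = "1", `_score` = 1.25,
           `_source` = list(title = "Example title A")),
      list(`_id` = "7", `_score` = 0.80,
           `_source` = list(title = "Example title B"))
    )
  )
)

# Turn a single hit into a one-row data frame.
hit_to_row <- function(hit) {
  data.frame(
    id    = hit[["_id"]],
    score = hit[["_score"]],
    title = hit[["_source"]][["title"]],
    stringsAsFactors = FALSE
  )
}

# Convert every hit and stack the rows into one data frame.
hits_df <- do.call(rbind, lapply(out_mock$hits$hits, hit_to_row))

print(hits_df)
```

With the real result, out$hits$hits would take the place of out_mock$hits$hits, provided every hit actually has the fields you pick out.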