Memo: using stringr to truncate a query to just under the maximum length the GitHub Search API accepts

Just as you might dig through a record store for obscure CDs, sometimes you want to dig through the GitHub Search API for obscure files. We all have days like that.

So, in order to exclude the major players, I first list the R users who own a large number of repositories.

library(purrr)
library(gh)

estimate_pages <- function(threshold = 100L) {
  res <- gh("/search/users",
            q = glue::glue("repos:>{threshold} language:R"),
            sort = "repositories",
            page = 1L,
            per_page = 1L)
  message("total_count: ", res$total_count)
  ceiling(res$total_count / 100L)
}

do_search_user <- function(page, threshold = 100L) {
  res <- gh("/search/users",
            q = glue::glue("repos:>{threshold} language:R"),
            sort = "repositories",
            page = page,
            per_page = 100L)
  
  login_names <- map_chr(res$items, "login")
  types <- map_chr(res$items, "type")
  
  # Pause between requests so as not to hit the search API rate limit
  Sys.sleep(20)
  
  tibble::tibble(
    login_names,
    types
  )
}

# First check how many pages need to be fetched (8 pages at the time of writing)
pages <- estimate_pages(50L)
# Fetch them all in one go
users <- map_df(seq_len(pages), do_search_user, threshold = 50L)
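
The page count used above is just the total hit count divided by the page size, rounded up. As a minimal sketch, the arithmetic can be checked without calling the API. The helper name `pages_needed`, the `max_results` cap, and the input numbers are illustrative assumptions, not part of the code above; the cap reflects the fact that GitHub search only returns the first 1,000 results.

```r
# Hypothetical helper: number of pages to request for a given hit count.
# `max_results = 1000L` assumes the search API's 1,000-result ceiling.
pages_needed <- function(total_count, per_page = 100L, max_results = 1000L) {
  reachable <- min(total_count, max_results)
  as.integer(ceiling(reachable / per_page))
}

pages_needed(745L)    # made-up hit count: 8 pages
pages_needed(19641L)  # capped at 10 pages by the 1,000-result ceiling
```

With a cap like this, `seq_len(pages)` never asks for pages the API would refuse to serve.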

The result looks like this.

head(users)
#> # A tibble: 6 x 2
#>           login_names        types
#>                 <chr>        <chr>
#> 1                cran Organization
#> 2 Bioconductor-mirror Organization
#> 3            Libardo1         User
#> 4             defc0n1         User
#> 5             jeperez         User
#> 6             chaabni         User

In the GitHub Search API, prefixing a qualifier with -, as in

-user:yutannihilation

turns it into a NOT search.

So let's put all of the users found above into the search query. Users are specified with user: and organizations with org:.

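users_query <- users %>%
  # Map the account type to the matching search qualifier
  dplyr::mutate(types = dplyr::recode(types,
                                      User = "user",
                                      Organization = "org")) %>%
  # Build one negated qualifier per account, e.g. "-org:cran"
  { sprintf("-%s:%s", .$types, .$login_names) } %>%
  paste(collapse = " ")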

stringr::str_sub(users_query, 1, 100)
#> [1] "-org:cran -org:Bioconductor-mirror -user:Libardo1 -user:defc0n1 -user:jeperez -user:chaabni -user:mi"

nchar(users_query)
#> [1] 12733
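
To make the construction of the exclusion string easier to follow, here is a small base R sketch on toy data. The data frame `toy_users` and the lookup vector `qualifier` are made up for illustration; only the `-user:` / `-org:` syntax comes from the text above.

```r
# Toy data mirroring the shape of `users` (two rows only).
toy_users <- data.frame(
  login_names = c("cran", "Libardo1"),
  types       = c("Organization", "User"),
  stringsAsFactors = FALSE
)

# Look up the search qualifier for each account type.
qualifier <- unname(c(Organization = "org", User = "user")[toy_users$types])

# Negate each qualifier with "-" and join the terms with spaces.
toy_query <- paste(sprintf("-%s:%s", qualifier, toy_users$login_names),
                   collapse = " ")
toy_query
#> [1] "-org:cran -user:Libardo1"
```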

Now let's search code with this query.

res <- gh::gh("/search/code",
              q = glue::glue("filename:NAMESPACE fork:false export {users_query}"),
              page = 1,
              per_page = 100L)
#> No encoding supplied: defaulting to UTF-8.
#> No encoding supplied: defaulting to UTF-8.
#> Error in gh::gh("/search/code", q = glue::glue("filename:NAMESPACE fork:false export {users_query}"),  : 
#>   GitHub API error (414): 
#>   <html>
#> <head><title>414 Request-URI Too Large</title></head>
#> <body bgcolor="white">
#> <center><h1>414 Request-URI Too Large</h1></center>
#> <hr><center>nginx</center>
#> </body>
#> </html>
#> 
#> In addition: Warning message:
#> Response came back as html :(

Well, would you look at that. The API complained that the request is too long. Fair enough.

After some trial and error, the limit seems to lie somewhere between 6000 and 8000 characters.
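
Since a 414 is about the length of the request URI, it is plausibly the percent-encoded length of the query that counts, not `nchar()` of the raw string, which might be one reason the limit looks fuzzy. The sketch below uses base R's `utils::URLencode()` on a made-up two-term query; how gh actually encodes the query is an assumption here, so treat the numbers as a rough indication only.

```r
raw_query <- "-org:cran -user:foo"  # made-up two-term query

# Percent-encode reserved characters as well (":" -> "%3A", " " -> "%20").
encoded <- utils::URLencode(raw_query, reserved = TRUE)

nchar(raw_query)  # length of the raw string
nchar(encoded)    # longer: every ":" and " " now takes three characters
```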

So I truncate the query to about 6000 characters. First, get the positions of the spaces.

users_query_boundaries <- stringr::str_locate_all(users_query, " ")[[1]][, "start"]
head(users_query_boundaries)
#> [1] 10 35 50 64 78 92

Then find the largest space position that is no greater than 6000.

users_query_boundaries %>%
  keep(`<=`, 6000) %>%
  max
#> [1] 5987

Truncate the query there with str_sub(), stopping one character before the space.

users_query_6000 <- stringr::str_sub(users_query, 1, 5987 - 1)
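
The three steps above (locate the spaces, pick the last one within the limit, cut just before it) can be wrapped into one function. This is a minimal sketch; the name `truncate_at_space` and its error handling are illustrative assumptions rather than anything from the original code.

```r
library(stringr)

# Hypothetical helper: cut `query` at the last space at or before `max_len`,
# so that no "-user:..." term is split in the middle.
truncate_at_space <- function(query, max_len) {
  if (str_length(query) <= max_len) {
    return(query)
  }
  spaces <- str_locate_all(query, fixed(" "))[[1]][, "start"]
  spaces <- spaces[spaces <= max_len]
  if (length(spaces) == 0L) {
    stop("no space found within the first ", max_len, " characters")
  }
  # Stop one character before the space to drop the trailing blank.
  str_sub(query, 1L, max(spaces) - 1L)
}

truncate_at_space("-org:cran -user:foo -user:bar", 22)
#> [1] "-org:cran -user:foo"
```

Called as `truncate_at_space(users_query, 6000)`, it should give the same string as the hard-coded version, without having to copy the position by hand.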

Sending this to the search API seems to work, more or less.

res <- gh::gh("/search/code",
              q = glue::glue("filename:NAMESPACE fork:false export {users_query_6000}"),
              page = 1,
              per_page = 1L)
res$total_count
#> [1] 19641

Hmm, even so there are still nearly 20,000 hits... (GitHub search only lets you retrieve the first 1,000 results.)