Memo: GitLab leaving the cloud

Notes to myself on the article about GitLab leaving the cloud and moving to bare metal. The problem seems to be storage performance.

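They switched to CephFS, but CephFS depends on stable latency. In the cloud, however, there is no SLA that guarantees a given level of latency, for either IO or the network, and that makes things painful.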

This problem is amplified because we hosted the system in the cloud where there is not a minimum SLA for IO latency.

On our server, GitLab can only perform 20,000 IOPS but the low limit is 0. With this performance capacity, we became the "noisy neighbors" on the shared machines, using all of the resources. We became the neighbor who plays their music loud and really late. So, we were punished with latencies. Providers don't provide a minimum IOPS, so they can just drop you.

GitLab only uses 20,000 IOPS, yet it is treated as a "noisy neighbor" and penalized with higher latency.
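To make "stable latency" a bit more concrete, here is a minimal sketch of how one might check how much write latency varies on a given volume. It is not from the GitLab article; the function names `measure_write_latencies` and `summarize`, the 4 KiB block size, and the sample count are all my own assumptions for illustration. The idea is just that a large gap between the median and the tail (p99, max) is the kind of instability the article is complaining about.

```python
import os
import tempfile
import time


def measure_write_latencies(directory, samples=200, block_size=4096):
    """Time small synced writes in `directory`; returns latencies in ms.

    Hypothetical helper for illustration, not a real benchmark tool.
    """
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush the write to the storage layer
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return median, tail, and worst-case latency from a list of samples."""
    ordered = sorted(latencies)

    def pct(p):
        # Simple index-based percentile; good enough for a rough look.
        return ordered[min(len(ordered) - 1, int(p * len(ordered)))]

    return {"p50": pct(0.50), "p99": pct(0.99), "max": ordered[-1]}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(summarize(measure_write_latencies(scratch)))
```

Running this against a local disk and against a network-backed cloud volume would give a rough feel for the difference, though a proper tool such as fio is the better choice for real measurements.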

At this point, moving to dedicated hardware makes sense for us. From a cost perspective, it is more economical and reliable because of how the culture of the cloud works and the level of performance we need.

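As several people point out in the comments, the noisy neighbor problem can be solved by using a dedicated host. But from a cost standpoint, the argument goes, buying your own servers is cheaper.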

The infrastructure side is covered in more detail in the slightly older blog post below. Apparently they were on Azure.

about.gitlab.com

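As for why they were able to do this, I suspect GitLab is a special case: an "Infrastructure mindset" had taken root across the organization. The blog post below is about that. If they had been talking about "DevOps" instead, I suspect they would not have been able to leave the cloud.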

about.gitlab.com

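And why had the infrastructure mindset taken root? Probably because what GitLab is actually trying to do is challenging from an infrastructure standpoint. There is a comment from GitLab's co-founder in the comment section of the first article.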

But I think filesystems are hard in the cloud. Object storage is fine but block stage is hard. For example see this comment https://news.ycombinator.com/item?id=12942337

Our competitors (GitHub and Atlassian) are both on metal too.
(https://about.gitlab.com/2016/11/10/why-choose-bare-metal/#comment-2999756265)

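Rather than being about the cloud as such, I think it all comes down to distributed block storage still being painful. I don't know enough to tell whether this problem is specific to CephFS, but my impression is that storage-heavy services like this will fairly often have no choice but to move to bare metal.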

(I'm not very familiar with the cloud or CephFS, so if I got anything wrong, I'd appreciate a gentle correction.)