Notes on the Let's Encrypt outage of 2017/05/19

Let’s Encrypt had an outage on 5/19. These are my notes on its cause and on the OCSP Stapling trouble that apparently came along with it.


The outage apparently began around 6:52AM UTC on 5/19.

May 19, 2017 6:52AM UTC [Investigating] We are investigating a problem with issuance


May 19, 2017 8:07AM UTC [Identified] We have identified the problem affecting issuance and are working to restore availability for all users. Most users should be able to issue. A minority of users will still be unable to reach the service


May 19, 2017 4:13PM UTC [Monitoring] We continue to see API connection problems for many users and have narrowed down the disruption to our CDN. We are working hard to resolve this.

It was finally resolved at 11:05PM UTC, roughly 16 hours after the outage began.

May 19, 2017 11:05PM UTC [Monitoring] Services are now operational and being monitored

No postmortem has been published yet, so the exact details are not known, but a comment that someone from Let’s Encrypt posted in the Hacker News thread below gives some insight.

Comment from someone at Let’s Encrypt

Josh from Let’s Encrypt here. First, my apologies for the trouble this has cause… | Hacker News https://news.ycombinator.com/item?id=14375728



OCSP requests that use the GET method use standard base64 encoding, which can contain two slashes one after another. While debugging why a small number of OCSP requests consistently failed our engineers observed a rather odd, but standard, web server behavior.


HTTP-based OCSP requests can use either the GET or the POST method to submit their requests. To enable HTTP caching, small requests (that after encoding are less than 255 bytes) MAY be submitted using GET.

An OCSP request using the GET method is constructed as follows:

GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}
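The construction quoted above can be sketched in a few lines of Python. This is only an illustration: `ocsp_get_url` is a hypothetical helper, `ocsp.example.test` is a made-up responder, and the request bytes are an arbitrary placeholder rather than a real DER-encoded OCSPRequest.

```python
import base64
from urllib.parse import quote


def ocsp_get_url(responder_url: str, der_request: bytes) -> str:
    """Build {url}/{url-encoding of base64 of DER request} (hypothetical helper)."""
    # Standard base64 alphabet: may contain '+', '/' and '='.
    b64 = base64.b64encode(der_request).decode("ascii")
    # safe="" forces '/' to be percent-encoded as %2F as well.
    return responder_url + "/" + quote(b64, safe="")


# Placeholder bytes chosen so that the base64 form is "////".
print(ocsp_get_url("http://ocsp.example.test", b"\xff\xff\xff"))
```

With the url-encoding step applied, every slash in the payload becomes `%2F`. If that step is skipped, or the server decodes it early, the raw base64 slashes end up in the path.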



When a server receives a request with multiple slashes one after another they will collapse them into a single slash.
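To see why collapsing is harmful for these requests, here is a minimal sketch. `collapse_slashes` only imitates the behavior described in the comment (it is not any particular web server's code), and the input bytes are a placeholder, not a real OCSPRequest.

```python
import base64
import re


def collapse_slashes(path: str) -> str:
    """Imitate a server that merges runs of '/' into a single '/'."""
    return re.sub(r"/{2,}", "/", path)


def survives_collapsing(der_request: bytes) -> bool:
    """Return True if the base64 payload still decodes to the same bytes."""
    encoded = base64.b64encode(der_request).decode("ascii")
    collapsed = collapse_slashes(encoded)
    try:
        return base64.b64decode(collapsed, validate=True) == der_request
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False


print(survives_collapsing(b"\x00\xff\xff"))  # payload "AP//" loses a character
print(survives_collapsing(b"hello"))         # payload "aGVsbG8=" has no slashes
```

Only payloads that happen to contain consecutive slashes are damaged, which fits the observation that a small number of requests failed consistently.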



But some applications will decode the percent-encoding too early in the process of normalizing, security-escaping, and processing the URL. Encoded slashes in URLs are problematic [1][2][3][4][5].


The fix seemed quite simple: disable the slash collapsing behavior.



Unfortunately, stopping this behavior surfaced a more serious issue. The AIA extension that we include in certificates we issue contains a URI for our OCSP server. This URI contains a trailing slash.

Certificates issued by Let’s Encrypt specify the OCSP server as follows.

OCSP: URI: http://ocsp.int-x3.letsencrypt.org/


GET {url}/{url-encoding of base-64 encoding of the DER encoding of the OCSPRequest}


A number of user agents take this quite literally and will construct the URL without inspecting the contents of the AIA extension meaning that they ended up with a double slash between the host name and the base64 encoded OCSP request.
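A small sketch of how the double slash arises. Both functions are hypothetical stand-ins for client behavior, and `MFQwUjBQ` is a made-up payload fragment. Stripping the trailing slash is just one possible way for a client to avoid the problem, not something the comment prescribes.

```python
from urllib.parse import urlsplit

AIA_OCSP_URI = "http://ocsp.int-x3.letsencrypt.org/"  # as found in the certificate


def literal_client_url(aia_uri: str, encoded_request: str) -> str:
    """Follow the template literally: {url} + "/" + payload."""
    return aia_uri + "/" + encoded_request


def careful_client_url(aia_uri: str, encoded_request: str) -> str:
    """Assumed alternative: drop trailing slashes before appending."""
    return aia_uri.rstrip("/") + "/" + encoded_request


payload = "MFQwUjBQ"  # hypothetical, already url-encoded payload
print(urlsplit(literal_client_url(AIA_OCSP_URI, payload)).path)  # //MFQwUjBQ
print(urlsplit(careful_client_url(AIA_OCSP_URI, payload)).path)  # /MFQwUjBQ
```

While the server collapsed slashes, both forms reached the same resource. Once collapsing was disabled, the literal form no longer matched.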






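These user agents get a 400 Bad Request response, and since that response must of course not be cached, it carries a no-cache header. The CDN honored the header and did not cache these responses, so nearly every request went straight through to Let’s Encrypt’s OCSP origin servers.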

Because we were responding with ‘400 Bad Request’ responses we were setting explicit no-cache headers which meant we had a near 0% cache (CDN) offload rate and were hit with the full brunt of our OCSP request load at our origin servers.
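The effect on the offload rate can be illustrated with a toy model. `TinyCdn` and `origin` are hypothetical and heavily simplified (no expiry, no headers, one boolean for cacheability); they are not how the actual CDN or the Let’s Encrypt origin work.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, bool]  # (status code, cacheable?)


class TinyCdn:
    """Toy cache in front of an origin; stores only cacheable responses."""

    def __init__(self, origin: Callable[[str], Response]) -> None:
        self.origin = origin
        self.cache: Dict[str, int] = {}
        self.origin_hits = 0
        self.total = 0

    def get(self, path: str) -> int:
        self.total += 1
        if path in self.cache:
            return self.cache[path]
        self.origin_hits += 1
        status, cacheable = self.origin(path)
        if cacheable:
            self.cache[path] = status
        return status

    def offload_rate(self) -> float:
        """Fraction of requests answered without contacting the origin."""
        return 1 - self.origin_hits / self.total if self.total else 0.0


def origin(path: str) -> Response:
    # Assumption: a leading double slash is rejected with 400 + no-cache.
    if path.startswith("//"):
        return 400, False
    return 200, True


cdn = TinyCdn(origin)
for _ in range(10):
    cdn.get("//MFQwUjBQ")
print(cdn.origin_hits, cdn.offload_rate())
```

In this model, ten identical requests that end in an uncacheable 400 cause ten origin hits, whereas ten identical cacheable requests would cause one.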


This caused our whole infrastructure to get bogged down.



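The blog post below compactly summarizes the discussion around OCSP Stapling and Must Staple while explaining the implementation problems in Apache (and probably nginx as well), and I found it easy to follow (that said, I still do not fully understand it...).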

The Problem with OCSP Stapling and Must Staple and why Certificate Revocation is still broken - Hanno’s blog https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-Stapling-and-Must-Staple-and-why-Certificate-Revocation-is-still-broken.html

Even when Let’s Encrypt has a problem like this, OCSP responses can be cached for several days, so most people should never see an error. However, when an OCSP request fails, Apache throws away the still-valid OCSP response it has cached.

If Apache tries to renew the OCSP response and gets an error from the OCSP server – e. g. because it’s currently malfunctioning – it will throw away the existing, still valid OCSP response and replace it with the error. It will then send out stapled OCSP errors. Which makes zero sense. Firefox will show an error if it sees this. This has been reported in 2014 and is still unfixed.

(…snip…) I still got complaints that Firefox users were seeing errors. That’s because in this case the OCSP server wasn’t sending out errors, it was completely unavailable. For that situation Apache has a feature that will fake a tryLater error to send out to the client. If you’re wondering how that makes any sense: It doesn’t. The “tryLater” error of OCSP isn’t useful at all in TLS, because you can’t try later during a handshake which only lasts seconds. (https://blog.hboeck.de/archives/886-The-Problem-with-OCSP-Stapling-and-Must-Staple-and-why-Certificate-Revocation-is-still-broken.html)
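For contrast, here is a minimal sketch of the behavior the blog post argues for: keep the last good response until it expires, whatever the responder does in the meantime. This is not Apache's code. `StapleCache`, `OcspResponse`, and their fields are hypothetical names, and real OCSP responses carry far more than a status and an expiry time.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OcspResponse:
    status: str         # e.g. "successful" or "tryLater"
    next_update: float  # expiry as a Unix timestamp


class StapleCache:
    """Keep the last good OCSP response until it actually expires."""

    def __init__(self, fetch: Callable[[], OcspResponse]) -> None:
        self.fetch = fetch
        self.current: Optional[OcspResponse] = None

    def refresh(self) -> None:
        try:
            fresh = self.fetch()
        except OSError:
            return  # responder unreachable: keep what we already have
        if fresh.status == "successful":
            self.current = fresh
        # An error response never replaces a stored good response.

    def staple(self, now: float) -> Optional[OcspResponse]:
        """Return the response to staple, or None to staple nothing."""
        if self.current is not None and now < self.current.next_update:
            return self.current
        return None
```

In this sketch a failed refresh leaves the stored response untouched, and once that response expires the server staples nothing rather than stapling an error.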

I could not quite follow the second half, but is the reason it says "Firefox users" that Firefox does not give up when OCSP Stapling fails, fetches the OCSP response itself, and shows an error if that also fails? (unverified)

Firefox is, today, the only major browser still fetching OCSP by default for DV certificates. (https://bugzilla.mozilla.org/show_bug.cgi?id=1366100)