Facebook outage: the whole thing was a BGP misconfiguration, nothing to do with DNS

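Reading the coverage really hurts my eyes. After FB went down, a lot of people translated "BGP advertisements" as "BGP廣播" (BGP broadcast). To a network engineer, "broadcast" has a specific technical meaning, and the everyday word also implies sending continuously. BGP route announcements cannot work that way: the global IPv4 routing table now holds about 860,000 routes, and BGP only sends updates to a peer when something changes (incremental update).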
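To make the difference from "broadcasting" concrete, here is a minimal sketch of incremental updates. It is a toy model, not real BGP: `ToySpeaker`, `sync`, and the prefixes (taken from documentation address ranges) are all hypothetical names made up for illustration. The point is only that a speaker remembers what it already told a peer and sends nothing when nothing has changed.

```python
from dataclasses import dataclass, field


@dataclass
class ToySpeaker:
    """Hypothetical toy model of what one BGP speaker has advertised to one peer."""
    advertised: dict = field(default_factory=dict)  # prefix -> next hop

    def sync(self, rib: dict) -> list:
        """Return only the messages needed to bring the peer up to date."""
        messages = []
        # Announce prefixes that are new or whose next hop changed.
        for prefix, next_hop in rib.items():
            if self.advertised.get(prefix) != next_hop:
                messages.append(("announce", prefix))
        # Withdraw prefixes that are no longer in the local table.
        for prefix in self.advertised:
            if prefix not in rib:
                messages.append(("withdraw", prefix))
        self.advertised = dict(rib)
        return messages


speaker = ToySpeaker()
rib = {"192.0.2.0/24": "10.0.0.1", "198.51.100.0/24": "10.0.0.1"}

first = speaker.sync(rib)    # initial exchange: both prefixes are announced
second = speaker.sync(rib)   # nothing changed, so no messages are sent
del rib["192.0.2.0/24"]
third = speaker.sync(rib)    # a single withdrawal and nothing else
```

With 860,000 routes in the table, the second call is the normal case: the session stays up and stays quiet until a route actually changes.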

https://www.networkworld.com/article/3635811/facebook-outage-was-a-series-of-unfortunate-events.html?utm_medium=social&utm_source=facebook&utm_campaign=organic&utm_content=content&fbclid=IwAR15da3AOdfZX6rAY-ApSOFIbDvbZCiAIHWDnf94OYmT0LZ0vIJpuPNaev8

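This article is the most complete one so far, and its terminology is fairly accurate (though it still gets one wrong: "DNS, or directory name services" should be Domain Name System).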

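-The self-hosted DNS lives inside the network, so with the whole network unreachable it is understandable that DNS clients could not resolve anything.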
-"And when server availability went to zero because the network went down, they decommissioned all their DNS servers."
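But if the network is already down, why take the DNS servers offline (by withdrawing their routes)? Were they afraid a flood of queries would knock the servers over?
"DNS was a single point of failure": I think DNS is the real crux of this incident. An inspection command that turned out to change configuration, followed by an audit safeguard that failed, is still only a matter of SOP.
Architecturally, though, having the DNS used by internal services and the DNS that answers external queries be one and the same system is a serious problem.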
"For example, Amazon, whose AWS offers a DNS service, uses two external services—Dyn and UltraDNS—for its DNS, according to Medina."
The only service on AWS that dares to claim "100% Available" is Route 53, so this turns out to be the reason.


So why did Facebook withdraw routes to its service in the first place?
That is my question too.


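A real DNS expert has spoken. Real BGP experts are too rare. I do not understand why so many people who do not specialize in networking keep wanting to discuss this incident. Discussing DNS is harder still, and mixing DNS and BGP together adds even more variables.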

It seems sad that this NBC report was far more informative than the corporate blather that Facebook posted as their statement from engineering

I wonder about that too.

But this form of disappearance in the DNS is a form that raises the ire of the DNS gods. In this situation, where the name servers all go offline, then the result of a query is not an NXDOMAIN response code ("I'm sorry but that name does not exist in the DNS, go away!") but a far more indeterminate timeout to a query with no response whatsoever. A recursive resolver will retry the query using all the name server IP addresses stored in the parent zone (.com in this case), and then return the SERVFAIL response code (which means something like: "I couldn’t resolve this name, but maybe it’s me, so you might want to try other resolvers before giving up!"). So, the client's stub resolver then asks the same question to all the other recursive resolvers that it has been configured with. As the Cloudflare post points out: "So now, because Facebook and their sites are so big, we have DNS resolvers worldwide handling 30x more queries than usual and potentially causing latency and timeout issues to other platforms."
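The retry behaviour described in that quote can be sketched as a small simulation. This is a rough illustration under made-up assumptions, not a model of any real resolver: the function names, the four name server addresses, and the three recursive resolvers are all hypothetical, and caching, timers, and application-level retries are ignored. It only counts how many upstream queries one client lookup causes when every name server is silent.

```python
def recursive_resolve(ns_addresses, reachable):
    """Toy recursive resolver: try each name server address in turn.

    Returns (rcode, queries_sent). A silent server counts as a timeout.
    """
    queries = 0
    for addr in ns_addresses:
        queries += 1
        if addr in reachable:
            return "NOERROR", queries
    # Every address timed out, so the resolver gives up with SERVFAIL.
    return "SERVFAIL", queries


def stub_resolve(recursive_resolvers, ns_addresses, reachable):
    """Toy stub resolver: on SERVFAIL, ask the next configured recursive resolver."""
    total = 0
    rcode = "SERVFAIL"
    for _ in recursive_resolvers:
        rcode, queries = recursive_resolve(ns_addresses, reachable)
        total += queries
        if rcode != "SERVFAIL":
            break
    return rcode, total


# Hypothetical setup: 4 name server addresses, 3 recursive resolvers per client.
ns = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
resolvers = ["resolver-a", "resolver-b", "resolver-c"]

healthy = stub_resolve(resolvers, ns, reachable=set(ns))  # ("NOERROR", 1)
outage = stub_resolve(resolvers, ns, reachable=set())     # ("SERVFAIL", 12)
```

In this toy setup a lookup that normally costs one upstream query costs twelve during the outage, and it still fails, so the client is likely to ask again. The exact multiplier here is an artifact of the invented numbers; the quoted 30x figure comes from Cloudflare's own measurements.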

