On Mon, Jun 24, 2019 at 08:03:26PM -0400, Tom Beecher wrote:
You are 100% right that 701 should have had some sort of protection mechanism in place to prevent this. But do we know they didn't? Do we know it was there and just set up wrong? Did another change at another time break what was there? I used 701 many jobs ago and they absolutely had filtering in place; it saved my bacon when I screwed up once and started readvertising a full table from a 2nd provider. They smacked my session down and I got a nice call about it.
In my past (and current) dealings with AS701, I do agree that they have generally been good about filtering customer sessions and running a tight ship. But, manual config changes being what they are, I suppose an honest mistake or oversight occurred at 701 today that made them contribute significantly to the outage.
It also would have been nice, in my opinion, to take a harder stance on the BGP optimizer that generated the bogus routes, and the steel company that failed BGP 101 and just gladly reannounced one upstream to another. 701 is culpable for their mistakes, but there doesn't seem to be much appetite to shame the other contributors.
I think the biggest question to be asked here -- why the hell is a BGP optimizer (Noction in this case) injecting fake more specifics to steer traffic? And why did a regional provider providing IP transit (DQE) use such a dangerous accident-waiting-to-happen tool in their network, especially when they have other ASNs taking transit feeds from them, with all these fake man-in-the-middle routes being injected?

I get that BGP optimizers can have some use cases, but IMO, in most situations (especially if you are a network provider selling transit and taking peering yourself), a well crafted routing policy and interconnection strategy eliminates the need for implementing flawed route selection optimizers in your network.

The notion of a BGP optimizer generating fake more specifics is absurd, and it is definitely not a tool that is designed to "fail -> safe". Instead of failing safe, it has failed epically and catastrophically today. I remember a long time ago, when Internap used to sell their FCP product, Internap SEs were advising the customer to make appropriate adjustments to local-preference to prefer the FCP generated routes to ensure optimal selection. That is a much more sane design choice than injecting man-in-the-middle attacks and relying on customers to prevent a disaster.

Any time I have a sit down with any engineer who "outsources" responsibility for maintaining the robustness principle onto their customer, it makes me want to puke.

James