On Sat, Aug 28, 2010 at 02:19:28PM +0200, Florian Weimer wrote:
* Claudio Jeker:
I think you blame the wrong people. The vendor should make sure that their implementation does not violate the very basics of the BGP protocol.
The curious thing here is that the peer that resets the session, as required by the spec, causes the actual damage (the session reset), and not the peer producing the wrong update.
This whole thread is quite schizophrenic because the consensus appears to be that (a) a *researcher is not to blame* for sending out a BGP message which eventually leads to session resets, and (b) an *implementor is to blame* for sending out a BGP message which eventually leads to session resets. You really can't have it both ways.
The researcher is not to blame because all the BGP messages he sent out were properly formed. The implementor is to blame because the code he wrote sent out BGP messages which were not properly formed.
I'm fed up with this situation, and we will fix it this time. My take is that if you reset the session, you're part of the problem, and consequently deserve part of the blame. So if you receive a properly-framed BGP update message you cannot parse, you should just log it, but not take down the session.
If you get your wish, and that gets implemented, in some number of years there will be a NANOG posting (perhaps from you, perhaps not) arguing that any malformed BGP message should result in the session being torn down. This will be after a router develops a failure that causes it to send many incorrect messages, but only some of them malformed. So the malformed ones will be discarded, and the remainder will be propagated throughout the Internet. If the ones that are incorrect but not malformed are, say, filled with more specifics for large portions of the Internet, someone will be asking: "How could all the other routers accept these advertisements from a router known to be broken ... it was sending malformed advertisements, but instead of tearing down the sessions, you decided to trust all the validly formed messages from this known-to-be-broken router".

My point is: we can't always look at the most recent failure to decide what the correct policy is. We have good data on the cases where NOTIFY on any malformed packet has caused significant outages in the Internet. We don't have nearly as good data on the cases where NOTIFY-on-any-malformed-packet saved the Internet from a significant outage.

I don't claim to know which is the bigger problem. But any serious argument to change the behavior needs to consider the risk from propagating information received from a router known to be broken, on the theory that the brokenness only causes malformed messages (which can be discarded) and does not also cause incorrect but correctly formed messages to be sent.

-- Brett
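
To make the trade-off above concrete, here is a minimal toy sketch of the two policies being argued about. It is not real BGP code and not any vendor's implementation: `Policy`, `Update`, `Session` and `run` are hypothetical names invented for illustration, the prefixes are documentation ranges, and the `correct` flag stands for something a real receiver cannot observe.

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    RESET_ON_MALFORMED = "reset"    # send NOTIFICATION, tear down the session
    DISCARD_MALFORMED = "discard"   # log the message, keep the session up


@dataclass
class Update:
    prefix: str
    well_formed: bool  # can the receiver parse it?
    correct: bool      # does it reflect real routing state? (hidden from the receiver)


@dataclass
class Session:
    policy: Policy
    up: bool = True
    accepted: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def receive(self, update: Update) -> None:
        if not self.up:
            return  # session is down, nothing more is taken from this peer
        if not update.well_formed:
            self.log.append(f"malformed update for {update.prefix}")
            if self.policy is Policy.RESET_ON_MALFORMED:
                self.up = False
                self.accepted.clear()  # routes learned from the peer go away with the session
            return  # the malformed update itself is never accepted
        # The receiver only sees that the update parses; it cannot check `correct`.
        self.accepted.append(update)


def run(policy: Policy, updates: list) -> Session:
    """Feed a sequence of updates from one peer into a fresh session."""
    session = Session(policy)
    for update in updates:
        session.receive(update)
    return session


# A broken router: one good update, one malformed, one well-formed but wrong.
broken_router = [
    Update("192.0.2.0/24", well_formed=True, correct=True),
    Update("198.51.100.0/24", well_formed=False, correct=False),
    Update("203.0.113.0/25", well_formed=True, correct=False),
]

reset = run(Policy.RESET_ON_MALFORMED, broken_router)
discard = run(Policy.DISCARD_MALFORMED, broken_router)
```

Under the reset policy the session goes down and nothing from the peer remains, which is the outage case. Under the discard policy the session stays up, and the well-formed but incorrect more-specific ends up in `discard.accepted`, which is the propagation risk described above. The sketch says nothing about which outcome is more common in practice.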