On Mon, Aug 30, 2010 at 15:55, Jack Bates <jbates@brightok.net> wrote:
...
As good a place to break in on the thread as any, I guess. Randy and others believe more testing should have been done. I'm not completely sure they didn't test against XR. They very likely could have tested in a 1 on 1 connection and everything looked fine.
I don't know the full details, but at what point did the corruption appear, and was it visible? We know the update was corrupt on output, which caused the peer resets, but was it necessarily visible in the router itself?
Do we require a researcher to set up a chain of every vendor's BGP speaker in every possible configuration and order to verify a bug doesn't cause things to break? In this case, one very likely would need an XR receiving and transmitting updates to detect the failure, so no fewer than 3 routers with the XR in the middle.
What about individual configurations? Perhaps the update is received and altered by one vendor due to specific configurations, sent to the next vendor, accepted and altered again (because of the first alteration, whereas it wouldn't have been altered had the original update been received), which causes the next vendor to reset. Add to this that it may pass silently through several intermediate vendors' routers without problems, and we realize the scope of such problems and why connecting to the Internet is so unpredictable.
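To put a rough number on the combinatorics described above, here is a minimal sketch that counts ordered chains of BGP speakers. The vendor labels and chain lengths are illustrative assumptions, not an inventory of real implementations, and the count ignores per-router configuration, which would multiply it further.

```python
from itertools import product

# Hypothetical implementation labels, for illustration only.
VENDORS = ["vendor_a", "vendor_b", "vendor_c", "vendor_d"]


def count_chains(vendors, length):
    """Count ordered chains of `length` routers, allowing repeated vendors."""
    return len(vendors) ** length


def chains_with_middle(vendors, middle):
    """List every 3-router chain that has `middle` as the transit router."""
    return [(first, middle, last) for first, last in product(vendors, repeat=2)]


if __name__ == "__main__":
    # The number of chains grows exponentially with chain length.
    for length in range(2, 6):
        print(length, count_chains(VENDORS, length))
    # Chains needed just to exercise one implementation in the middle position.
    print(len(chains_with_middle(VENDORS, "vendor_b")))
```

Even with only four implementations and no configuration differences, fixing one implementation in the middle of a 3-router chain still leaves one chain per pair of neighbors to test.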
I am not aware that anyone has yet provided the complete details, which would include any test plans that may have been carried out. From what I have been able to discern, it seems likely that a test plan capable of catching this would almost have had to know of the specific issue in advance. More testing would have been better, but there is just too much variability out there to be sure of a complete test.
I am also not aware that the introduction of the attribute was announced to the usual operational lists in advance "just in case" (OK, in this case, I mean NANOG). This, in my mind, is actually the bigger faux pas. An "Oh S***" moment has happened to most of us. It will probably happen again to many of us. But letting people know in advance of scheduled changes is the important thing.
I would hope that in the future researchers will commit to test plans against (at least) all the major vendor BGP speakers (which, I admit, would likely not have caught this issue), and that before introducing such "new" attributes into the "Internet", they would announce it to the usual operational lists, again, "just in case". But my hopes are often dashed.
Gary