NANOG,

We would like to inform you of an experiment to evaluate alternatives for speeding up adoption of BGP route origin validation (research paper with details [A]).

Our plan is to announce prefix 184.164.224.0/24 with a valid, standards-compliant, unassigned BGP attribute from routers operated by the PEERING testbed [B, C]. The attribute will have flags 0xe0 (optional transitive [RFC 4271, S4.3]), type 0xff (reserved for development), and size 0x20 (32 bytes, i.e. 256 bits).

Our collaborators recently ran an equivalent experiment with no complaints or known issues [A], so we do not anticipate any arising. Back in 2010, an experiment using unassigned attributes by RIPE and Duke University caused disruption in Internet routing due to a bug in Cisco routers [D, CVE-2010-3035]. Since then, this and other similar bugs have been patched [e.g., CVE-2013-6051], and new BGP attributes have been assigned (BGPsec-path) and adopted (large communities). We have successfully tested propagation of the announcements on Cisco IOS-based routers running versions 12.2(33)SRA and 15.3(1)S, Quagga 0.99.23.1 and 1.1.1, as well as BIRD 1.4.5 and 1.6.3.

We plan to announce 184.164.224.0/24 from 8 PEERING locations for a predefined period of 15 minutes starting 14:30 GMT, from Monday to Thursday, between the 7th and 22nd of January, 2019 (full schedule and locations [E]). We will stop the experiment immediately if any issues arise.

Although we do not expect the experiment to cause disruption, we welcome feedback on its safety and especially on how to make it safer. We can be reached at disco-experiment@googlegroups.com.

Amir Herzberg, University of Connecticut
Ethan Katz-Bassett, Columbia University
Haya Shulman, Fraunhofer SIT
Ítalo Cunha, Universidade Federal de Minas Gerais
Michael Schapira, Hebrew University of Jerusalem
Tomas Hlavacek, Fraunhofer SIT
Yossi Gilad, MIT

[A] https://conferences.sigcomm.org/hotnets/2018/program.html
[B] http://peering.usc.edu
[C] https://goo.gl/AFR1Cn
[D] https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experime...
[E] https://goo.gl/nJhmx1
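For readers less familiar with the wire format described above, here is a minimal illustrative sketch (not the experimenters' tooling) of how such a path attribute is laid out per RFC 4271, Section 4.3. The 32-byte payload below is a placeholder; flags 0xe0 set the Optional (0x80), Transitive (0x40), and Partial (0x20) bits, with Extended Length clear.

    # Illustrative sketch only; not the announcement tooling.
    FLAGS = 0xE0        # Optional (0x80) + Transitive (0x40) + Partial (0x20), Extended Length clear
    TYPE_CODE = 0xFF    # reserved for development
    VALUE = bytes(32)   # 0x20 = 32 bytes = 256 bits; placeholder payload

    def encode_path_attribute(flags: int, type_code: int, value: bytes) -> bytes:
        """Encode a BGP path attribute with a one-octet length field (RFC 4271, S4.3)."""
        if len(value) > 255:
            raise ValueError("use the Extended Length flag for values longer than 255 octets")
        return bytes([flags, type_code, len(value)]) + value

    attribute = encode_path_attribute(FLAGS, TYPE_CODE, VALUE)
    assert len(attribute) == 3 + 32
    print(attribute.hex())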
Dear Italo,

Thanks for giving the community a heads-up on your plan! I think announcements like these are the best anyone can do when trying legal-but-new BGP path attributes. I'll forward this message to other NOGs and make sure that our NOC adds it to their calendar.

Kind regards,

Job

On Thu, Dec 20, 2018 at 6:39 PM Italo Cunha <cunha@dcc.ufmg.br> wrote:
NANOG,

We performed the first announcement in this experiment yesterday and, despite the announcement being compliant with BGP standards, FRR routers reset their sessions upon receiving it. Upon notice of the problem, we halted the experiments.

The FRR developers confirmed that the issue is specific to how FRR handles the attribute type 0xFF (reserved for development) we used. The FRR developers have already merged a fix and notified users.

We plan to resume the experiments January 16th (next Wednesday) and have updated the experiment schedule [A] accordingly. As always, we welcome your feedback.

[A] https://goo.gl/nJhmx1

On Tue, Dec 18, 2018 at 10:05 AM Italo Cunha <cunha@dcc.ufmg.br> wrote:
* cunha@dcc.ufmg.br (Italo Cunha) [Tue 08 Jan 2019, 17:42 CET]:
For the archives, since goo.gl will cease to exist soon, this links to https://docs.google.com/spreadsheets/d/1U42-HCi3RzXkqVxd8e2yLdK9okFZl77tWZv1... After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf. -- Niels.
On Tue, Jan 8, 2019, 11:50 AM <niels=nanog@bakker.net> wrote:
* cunha@dcc.ufmg.br (Italo Cunha) [Tue 08 Jan 2019, 17:42 CET]:
For the archives, since goo.gl will cease to exist soon, this links to
https://docs.google.com/spreadsheets/d/1U42-HCi3RzXkqVxd8e2yLdK9okFZl77tWZv1...
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
-- Niels.
There are a fair number of open source BGP implementations now. It would require additional effort to test all of them. Tom
* thomasammon@gmail.com (Tom Ammon) [Tue 08 Jan 2019, 17:59 CET]:
There are a fair number of open source BGP implementations now. It would require additional effort to test all of them.
In the real world, doing the correct thing is often harder than doing an incorrect thing, yes. -- Niels.
On Jan 8, 2019, at 12:10 PM, niels=nanog@bakker.net wrote:
* thomasammon@gmail.com (Tom Ammon) [Tue 08 Jan 2019, 17:59 CET]:
There are a fair number of open source BGP implementations now. It would require additional effort to test all of them.
In the real world, doing the correct thing is often harder than doing an incorrect thing, yes.
And other times you just get BGP as art https://twitter.com/powerdns_bert/status/878291436034170881 - jared
There is no such thing as a fully RFC-compliant BGP implementation:

https://www.juniper.net/documentation/en_US/junos/topics/reference/standards... does not list RFC 7606.

Cisco Bug: CSCvf06327 - Error Handling for RFC 7606 not implemented for NXOS

This is as of today and a two-second Google search. Anyone running code from before RFC 7606 (2015) would also not be compliant. I did not see Juniper on the list of BGP speakers tested.

Töma Gavrichenkov wrote on 1/8/19 9:31 AM:
8 Jan. 2019, 20:19 <niels=nanog@bakker.net>:
In the real world, doing the correct thing
— such as writing RFC compliant code —
is often harder than doing an incorrect thing, yes.
Evidently, yes.
On 1/8/19 9:31 AM, Töma Gavrichenkov wrote:
8 Jan. 2019, 20:19 <niels=nanog@bakker.net>:
In the real world, doing the correct thing
— such as writing RFC compliant code —
is often harder than doing an incorrect thing, yes.
Evidently, yes.
I "grew up" during the early days of PPP. As a member of the press I attended an "inter-op" session at Telebit's campus, and watched as a collection of engineers and programmers matched up implementations of PPP and found bugs in both the Proposed Standard and in the implementations thereof. Watching these guys with all sorts of data monitors trying to figure out who goofed was an interesting and fascinating experience. During my stint with the Telecommunications Industry Associate TR-30 committee hashing out modem standards like V.32 et al and V.25 ter was a similar exercise -- one that lead to me being in a near fight in a parking lot in San Jose with a Microsoft enginner over clarity problems with the proposed Standard for side-channel protocol. "Can you do better?" "Yes." "Prove it." And I did. My proposal was accepted by all, even the Microsoft guy. (We continued to collaborate until he cashed out of the company.)
On Tue, Jan 8, 2019 at 19:59 Tom Ammon <thomasammon@gmail.com> wrote:
On Tue, Jan 8, 2019, 11:50 AM <niels=nanog@bakker.net> wrote:
* cunha@dcc.ufmg.br (Italo Cunha) [Tue 08 Jan 2019, 17:42 CET]:
For the archives, since goo.gl will cease to exist soon, this links to
https://docs.google.com/spreadsheets/d/1U42-HCi3RzXkqVxd8e2yLdK9okFZl77tWZv1...
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
There are a fair number of open source BGP implementations now. It would require additional effort to test all of them.
Not just every implementation, but also every version, and every configuration permutation. This type of black-box testing is not scalable. It is not feasible work, nor the job of these researchers. It's the job of the software developer to ensure the product is standards compliant.

In the case of FRR:

- improper use of the 0xFF codepoint
- FRR is not compliant with RFC 7606 (the devs indicated they will be working on this)

Ultimately, the developers are responsible for their product, not random other internet users. This situation was avoidable if standards had been followed. I'm happy the FRR developers quickly identified the issue and published a fix. We can now all move on.

Kind regards,

Job
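For context on the RFC 7606 point above: the revised error-handling rules prefer dropping the affected routes ("treat-as-withdraw") or discarding the offending attribute over tearing down the session. The sketch below is purely illustrative, with hypothetical objects and a made-up validity check; it is not FRR code.

    from dataclasses import dataclass

    class MalformedAttribute(Exception):
        pass

    @dataclass
    class PathAttribute:
        flags: int
        type_code: int
        value: bytes

        @property
        def optional_transitive(self) -> bool:
            return bool(self.flags & 0x80) and bool(self.flags & 0x40)

    def validate(attr: PathAttribute) -> None:
        # Stand-in for a real per-attribute parser; here we only check that a
        # type 0xFF attribute carries the 32-byte value used in the experiment.
        if attr.type_code == 0xFF and len(attr.value) != 32:
            raise MalformedAttribute("unexpected length for development attribute")

    def handle_update(attrs, announced_prefixes, session):
        """RFC 7606-style disposition: prefer treat-as-withdraw over a session reset."""
        for attr in attrs:
            try:
                validate(attr)
            except MalformedAttribute as err:
                if attr.optional_transitive:
                    # Drop the routes carried in this UPDATE but keep the session up.
                    session["withdrawn"] += announced_prefixes
                    session["log"].append(f"treat-as-withdraw, type {attr.type_code}: {err}")
                else:
                    # Last resort for critical attributes: NOTIFICATION and reset.
                    session["log"].append("session reset: UPDATE Message Error")
                    session["up"] = False
                return
        session["installed"] += announced_prefixes

    session = {"up": True, "installed": [], "withdrawn": [], "log": []}
    bad_attr = PathAttribute(flags=0xE0, type_code=0xFF, value=bytes(16))  # wrong length on purpose
    handle_update([bad_attr], ["184.164.224.0/24"], session)
    print(session)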
Hi Niels,

We did run the experiment in a controlled environment with different versions of Cisco, BIRD, and Quagga routers and observed no issues. We added FRR to the test suite yesterday for future tests.

On Tue, Jan 8, 2019 at 11:49 AM <niels=nanog@bakker.net> wrote:
* cunha@dcc.ufmg.br (Italo Cunha) [Tue 08 Jan 2019, 17:42 CET]:
For the archives, since goo.gl will cease to exist soon, this links to https://docs.google.com/spreadsheets/d/1U42-HCi3RzXkqVxd8e2yLdK9okFZl77tWZv1...
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
-- Niels.
Hey,
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
We probably should avoid anything which might demotivate future good guys from finding breaking bugs and reporting them, while sending perfectly standards-compliant messages. The only ones who will win are the bad guys who collect libraries of how to break the internet. There are certainly several transit packets-of-death and BGP parser bugs in each implementation; I'd rather have a good guy trigger them and give me details on why my network broke than have a bad guy store them for future use.

-- ++ytti
Hi Saku,
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
We probably should avoid anything which might demotivate future good guys from finding breaking bugs and reporting them, while sending perfectly standard-compliant messages. Only ones who will win are bad guys who collect libraries of how-to-break-internet. There are certainly several transit packet of deaths and BGP parser bugs in each implementation, I'd rather have good guy trigger them and give me details why my network broke, than have bad guy store them for future use.
I fully agree with you. However, this doesn't give 'good guys' carte blanche to break stuff. I'm glad they've already taken action to improve their practices as confirmed by Italo Cunha in his earlier mail. -- Niels.
On Tue, 08 Jan 2019 17:48:46 +0100, niels=nanog@bakker.net said:
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
Perhaps you'd like to supply the researchers (and us) with a *complete* list of all BGP-speaking software in use on the Internet? (Personally, I'd never heard of FRR before)
On Jan 8, 2019, at 12:06 PM, valdis.kletnieks@vt.edu wrote:
On Tue, 08 Jan 2019 17:48:46 +0100, niels=nanog@bakker.net said:
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
Perhaps you'd like to supply the researchers (and us) with a *complete* list of all BGP-speaking software in use on the Internet? (Personally, I'd never heard of FRR before)
Yeah, I think it also gets complicated as some of us have our own internal BGP speakers as well. Taking MRT files from route-views or RIPE RIS and replaying them is certainly helpful to simulate certain events. I've found a lot of interesting "new attribute" experiments when I had a poorly written MRT parser that would trigger periodically when something new hit the internet. (FRR is a descendant of the Zebra/Quagga world.)

- Jared
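As a concrete illustration of what scanning archived routing data involves (not Jared's parser; the filename is a placeholder), the sketch below walks the RFC 6396 MRT common header of a route-views/RIS dump and tallies record types; decoding the BGP4MP payloads to spot unknown path attributes would be the next step.

    import struct
    from collections import Counter

    def iter_mrt_records(path):
        """Yield (timestamp, type, subtype, payload) for each record in an MRT dump.

        MRT common header (RFC 6396): 4-byte timestamp, 2-byte type,
        2-byte subtype, 4-byte payload length, all big-endian.
        """
        with open(path, "rb") as f:
            while True:
                header = f.read(12)
                if len(header) < 12:
                    break  # end of file
                timestamp, mrt_type, subtype, length = struct.unpack(">IHHI", header)
                payload = f.read(length)
                if len(payload) < length:
                    break  # truncated dump
                yield timestamp, mrt_type, subtype, payload

    counts = Counter()
    # "updates.20190108.1430" is a placeholder name for a RIS/route-views update dump.
    for _ts, mrt_type, subtype, _payload in iter_mrt_records("updates.20190108.1430"):
        counts[(mrt_type, subtype)] += 1
    print(counts)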
On Jan 8, 2019, at 09:06 , valdis.kletnieks@vt.edu wrote:
On Tue, 08 Jan 2019 17:48:46 +0100, niels=nanog@bakker.net said:
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
Perhaps you'd like to supply the researchers (and us) with a *complete* list of all BGP-speaking software in use on the Internet? (Personally, I'd never heard of FRR before)
+1
niels=nanog@bakker.net wrote on 08/01/2019 16:48:
After seeing this initial result I'm wondering why the researchers couldn't set up their own sandbox first before breaking code on the internet. I believe FRR is a free download and comes with GNU autoconf.
the researchers didn't break code - their test unearthed broken code. That code has now been fixed, so this is a good result. Nick
FRR is undergoing a fairly rapid pace of development, thanks to the cloud-scale operators and hosting providers which are using it in production. https://cumulusnetworks.com/blog/welcoming-frrouting-to-the-linux-foundation... On Tue, Jan 8, 2019 at 11:55 AM Randy Bush <randy@psg.com> wrote:
We plan to resume the experiments January 16th (next Wednesday), and have updated the experiment schedule [A] accordingly. As always, we welcome your feedback.
i did not realize that frr updates propagated so quickly. very cool.
randy
We plan to resume the experiments January 16th (next Wednesday), and have updated the experiment schedule [A] accordingly. As always, we welcome your feedback. i did not realize that frr updates propagated so quickly. very cool.
FRR is undergoing a fairly rapid pace of development
that is impressive but irrelevant. the question is how soon the frr users out on the internet will upgrade. there are a lot of studies on this. it sure isn't on the order of a week. randy
On Wed, Jan 9, 2019 at 9:55 Randy Bush <randy@psg.com> wrote:
We plan to resume the experiments January 16th (next Wednesday), and have updated the experiment schedule [A] accordingly. As always, we welcome your feedback. i did not realize that frr updates propagated so quickly. very cool.
FRR is undergoing a fairly rapid pace of development
that is impressive but irrelevant. the question is how soon the frr users out on the internet will upgrade. there are a lot of studies on this. it sure isn't on the order of a week.
Given the severity of the bug, there is a strong incentive for people to upgrade ASAP. Kind regards, Job
* Job Snijders
Given the severity of the bug, there is a strong incentive for people to upgrade ASAP.
The buggy code path can also be disabled without upgrading, by building FRR with the --disable-bgp-vnc configure option, as I understand it. I've been told that this is the default in Cumulus Linux. Tore
9 Jan. 2019, 9:56 Randy Bush <randy@psg.com>:
the question is how soon the frr users out on the internet will upgrade. there are a lot of studies on this. it sure isn't on the order of a week
Which is, as usual, a pity, because, generally, synchronizing a piece of software with upstream security updates less frequently than once or twice a week belongs in Jurassic Park today; and doing it hardly more frequently than once every 6 months, as ISPs usually do, clearly belongs in a bughouse.

(I wonder if this FRR update has got a CVE number, though.)
On Wed, 9 Jan 2019 at 19:54, Töma Gavrichenkov <ximaera@gmail.com> wrote:
Which is, as usual, a pity, because, generally, synchronizing a piece of software with upstream security updates less frequently than once to twice in a week belongs in Jurassic Park today; and doing it hardly more frequently than once in 6 months, as ISPs usually do, clearly belongs in a bughouse.
Not disputing the bughouse (or bog house) as the ideal location for said policy, I just want to explain my perspective on why it is so.

SPs are making a reasonable effort to produce the product that customers want to buy. Hitless upgrades are not really a thing yet, even though they've been marketed for 20 years now. Customers have expectations about how often their links flap, and those are mutually exclusive with rapid upgrade cycles.

And mostly all this is for show: the code is very broken, all of it, and the configurations are very broken, all of them. We regularly break the Internet without trying; BGP parsing crashes are a bi-annual thing. I'm holding, without any motivation or attempt to find it, a transit packet-of-death for JNPR applicable to ~all JNPR backbones, and JNPR isn't an outlier here. People happily deploy new devices which cannot be protected against even trivial (<10 Mbps) control-plane attacks. The only reason things work as well as they do is that the bad guys are not trying to DoS the infrastructure with BGP or packets-of-death; it would be very cheap if someone were so motivated.

If this is something we think should be fixed, then we should have good guys intentionally fuzzing _public internet_ BGP and transit packets-of-death, with good reporting. But likely it doesn't actually matter at all that the configurations and implementations are fragile; if they are abused, the Internet will fix those in no more than days, and trying to guarantee it cannot happen is probably a fool's errand.

If anything, I suspect that if it's cheaper to enter the market with inferior security and quality, then that is likely a good business case: the internet works so well that consumers are not willing to pay more for better, but would gladly sacrifice uptime for a cheaper price.

-- ++ytti
On Wed, Jan 9, 2019 at 9:07 PM Saku Ytti <saku@ytti.fi> wrote:
Not disputing bug or bog house as ideal location for said policy, just want to explain my perspective why it is so.
So, network device vendors releasing security advisories twice a year isn't a big part of the explanation?
Hitless upgrades are not really a thing yet, even though they've been marketed for 20 years now.
This is correct; on the flip side, hitless vulnerabilities haven't even been marketed, much less invented.
Only reason things work as well as they do, is because bad guys are not trying to DoS the infrastructure with BGP or packet-of-deaths
Err... don't they? My experience is quite the opposite.
If this is something we think should be fixed, then we should have good guys intentionally fuzzing _public internet_ BGP and transit-packet-of-deaths with good reporting.
If we could be sure that after such fuzzing there would still be a working transport infrastructure to report on top of, then yes.
if they are abused, Internet will fix those in no more than days
— just like we did with IoT in 2016 —
and trying to guarantee it cannot happen probably is fools errant
If anything, I suspect if it's cheaper to enter the market with inferior security and quality then that is likely good business case
This is also correct so far. I wonder if it's here to stay. -- Töma
On Wed, 9 Jan 2019 at 20:24, Töma Gavrichenkov <ximaera@gmail.com> wrote:
So, network device vendors releasing security advisories twice a year isn't a big part of the explanation?
Those are scheduled; they have to meet some criteria to be pushed in the scheduled lot. There are also out-of-cycle SIRTs. And yes, vendors are delaying them, because customers don't want to upgrade often, because customers' customers don't want to see connections down often.
Err... don't they? My experience is quite the opposite.
Well, that is an odd experience, considering anyone with a rudimentary understanding of control-plane policing can bring the internet down from a single VPS. The majority of deployed devices _cannot_ be protected against a DoS-motivated attacker, and I'm not talking link congestion, I'm talking control-plane congestion with a few Mbps.
If we could be sure that after such fuzzing there would still be a working transport infrastructure to report on top of, then yes.
If it's important to get right, we should have good guys actively and persistently trying to prove it wrong; at least then reporting and statistics can be produced. But I'm not sure it is important to get right; the market seems to indicate security does not matter.
— just like we did with IoT in 2016 —
Internet still running, I'm still getting paid.
If anything, I suspect if it's cheaper to enter the market with inferior security and quality then that is likely good business case
This is also correct so far. I wonder if it's here to stay.
We'd need the current security posture to be sufficiently unmarketable. But the motivation to simply DoS the internet doesn't really exist. DoS is aimed at service end points; infrastructure is a trivial target, but for some reason not really targeted. I'm sure state actors have a library of DoS transit packets and BGP UPDATE packets to be deployed when strategy requires a given network or region to be disrupted. Because if we, the internet plumbers, keep finding those without trying, while just trying to keep the network working, what can someone find who is funded and motivated to find them?

-- ++ytti
On Wed, Jan 9, 2019 at 9:32 PM Saku Ytti <saku@ytti.fi> wrote:
Those are scheduled, they have to meet some criteria to be pushed on scheduled lot. There are also out of cycle SIRTs. And yes, vendors are delaying them, because customers don't want to upgrade often, because customer's customers don't want to see connections down often.
Yep. The same happened before, e.g. to MSFT products and Adobe Flash, for a decade before the former started to update in days no matter what, and before the latter was effectively pushed out of most market niches.
— just like we did with IoT in 2016 — Internet still running, I'm still getting paid.
Well, I know a couple of guys who aren't.
But motivation to simply DoS internet doesn't really exist.
Except for hacktivism, fun, gathering a rep within a cracker society, gathering a rep within one's middle school community, et cetera. But anyway,
DoS is against service end points, infrastucture is trivial target, but for some reason not really targeted.
It really is. ISPs don't get that quite frequently for now, but end-user network services sometimes do.
I'm sure state actors have library of DoS transit packets and BGP UPDATE packets to be deployed when strategy requires given network or region to be disrupted.
There's hardly a reason to rely on your next-door neighbor's kid not chatting on the same Darknet forums where those "state actors" get their data from. The "state actor" thing is highly overrated today. They are certainly powerful, but hardly more powerful than a skilled team of anonymous blackhat researchers going in for ransom money.

-- Töma
On Jan 9, 2019, at 09:51 , Töma Gavrichenkov <ximaera@gmail.com> wrote:
9 Jan. 2019, 9:56 Randy Bush <randy@psg.com>:
the question is how soon the frr users out on the internet will upgrade. there are a lot of studies on this. it sure isn't on the order of a week
Which is, as usual, a pity, because, generally, synchronizing a piece of software with upstream security updates less frequently than once to twice in a week belongs in Jurassic Park today; and doing it hardly more frequently than once in 6 months, as ISPs usually do, clearly belongs in a bughouse.
(wonder if this FRR update has got a CVE number though)
So if I understand you correctly, your statement is that everyone should be (potentially) rebooting every core, backbone, edge, and other router at least once or twice a week… To quote Randy Bush… I encourage my competitors to try this. Owen
On Wed, Jan 9, 2019 at 9:31 PM Owen DeLong <owen@delong.com> wrote:
So if I understand you correctly, your statement is that everyone should be (potentially) rebooting every core, backbone, edge, and other router at least once or twice a week…
Nope, this is a misunderstanding. One has to *check* for advisories at least once or twice a week and only update (and reboot if necessary) if there *is* a vulnerability. Checking is quite different from actually updating.

What you may want to encourage your competition to do is to deploy a piece of software which actually *gets* a severe CVE twice a week; that will certainly bring you a bunch of new customers.

-- Töma
On Wed, 9 Jan 2019 at 20:45, Töma Gavrichenkov <ximaera@gmail.com> wrote:
Nope, this is a misunderstanding. One has to *check* for advisories at least once or twice a week and only update (and reboot is necessary) if there *is* a vulnerability.
I think this contains some assumptions:

1. discovering security issues in network devices is expensive (and thus only those you glean from vendor notices realistically exist)
2. the downside of being affected by a network device security issue is expensive

I'm very skeptical that either is true. I think it's very cheap to find security issues in network devices, particularly DoS issues. And I don't think the downside is expensive; maybe it's a bad 4 hours and a lot of angry customers, but ultimately not that expensive. I think a lot of this is self-organising, with a delay, around rules and justifications no one understands, and we're not upgrading often because it's not (currently) a sensible approach.

-- ++ytti
On Wed, Jan 9, 2019 at 9:51 PM Saku Ytti <saku@ytti.fi> wrote:
I think this contains some assumptions
1. discovering security issues in network devices is expensive (and thus only those you glean from vendor notices realistically exist) 2. downside of being affected by network device security issue is expensive
I'm very skeptical if either are true.
Well, it's significantly harder to look for vulns in closed-source firmware which only runs on certain expensive devices. My point is that e.g. FRR is open source software which is designed to run on the same Intel-based systems as the one which probably powers your laptop.

I've received a note from the FRR devs stating that they're going to get a CVE number soon. It's a good sign, though it should have happened a bit before roughly a thousand of this mailing list's subscribers were informed about the issue, but anyway.

-- Töma
Hey,
firmware which only runs on certain expensive devices. My point is that e.g. FRR is an open source software which is designed to run on the same Intel-based systems as the one which probably powers your laptop.
Most vendors have a virtual image for your laptop; all of the modern routers run Linux and some vendor binary blob, with the exception of Nokia running their own booting OS (forked off of VxWorks ages ago). Finding control-plane bugs, like a BGP UPDATE crash, is cheap for a hobbyist: you can download the images off the Internet and run them on your laptop. Finding forwarding issues is indeed harder due to the limited access to devices, so a bit of security through obscurity, I guess.

-- ++ytti
On Wed, Jan 9, 2019 at 10:03 PM Saku Ytti <saku@ytti.fi> wrote:
Finding forwarding issues indeed is harder due to the limited access to devices, so bit of security through obscurity I guess.
Or, rather, security by complexity. Today's network infrastructure is complex enough that few people dive into it looking for all the underlying issues. Right, it still saves the day for us, though today's Web JS frontends are also quite complex and there it's of no help.

-- Töma
Töma Gavrichenkov Sent: Wednesday, January 9, 2019 7:08 PM
On Wed, Jan 9, 2019 at 10:03 PM Saku Ytti <saku@ytti.fi> wrote:
Finding forwarding issues indeed is harder due to the limited access to devices, so bit of security through obscurity I guess.
Or, rather, security by complexity. Today's network infrastructure is complex enough for people to dive into it, looking for all the underlying issues. Right, it still saves us the day, though today's Web JS frontend is also quite complex but it's of no help.
I don't know about that. All modern NPUs are based on models explained to a sufficient level in Stanford lectures and other materials which you can download freely. Once you learn what an ideal routing system architecture should look like, you can start discovering flaws in the vendor NPU blueprints. But the best fun is to put an IXIA or Spirent in a loop with your favourite carrier router.

What I'm trying to say is that it's not that complex to find these routing system architecture shortcomings - they will come out of basic platform testing.

adam
Is that a competition in sarcasm? Because I can do better than that!

10 Jan. 2019, 2:41 <adamv0025@netconsultings.com>:
Töma Gavrichenkov Sent: Wednesday, January 9, 2019 7:08 PM
On Wed, Jan 9, 2019 at 10:03 PM Saku Ytti <saku@ytti.fi> wrote:
Finding forwarding issues indeed is harder due to the limited access to devices, so bit of security through obscurity I guess.
Or, rather, security by complexity. Today's network infrastructure is complex enough for people to dive into it, looking for all the underlying issues. Right, it still saves us the day, though today's Web JS frontend is also quite complex but it's of no help.
I don't know about that, All modern NPUs are based on the models well explained to the sufficient level in Stanford lectures and other materials which you can download freely. Once you learn how an ideal routing system architecture should look like you can start discovering flaws in the vendor NPU blueprints. But the best fun is to put IXIA or Spirent in loop with your favourite carrier router. What I'm trying to say is it's not that complex to find these routing system architecture shortcomings - they will come out of the basic platform testing.
adam
On Jan 9, 2019, at 10:51 , Saku Ytti <saku@ytti.fi> wrote:
On Wed, 9 Jan 2019 at 20:45, Töma Gavrichenkov <ximaera@gmail.com> wrote:
Nope, this is a misunderstanding. One has to *check* for advisories at least once or twice a week and only update (and reboot is necessary) if there *is* a vulnerability.
I think this contains some assumptions
1. discovering security issues in network devices is expensive (and thus only those you glean from vendor notices realistically exist)
Not really… I think the assumption here is that you can’t resolve an issue until the vendor publishes the fix. Outside of the open-source routing solutions (and even for most deployments, including those), I would say this is a valid assertion. (It’s more of an assertion than an assumption, IMHO).
2. downside of being affected by network device security issue is expensive
This depends on the issue, right? Owen
On Jan 9, 2019, at 10:37 , Töma Gavrichenkov <ximaera@gmail.com> wrote:
On Wed, Jan 9, 2019 at 9:31 PM Owen DeLong <owen@delong.com> wrote:
So if I understand you correctly, your statement is that everyone should be (potentially) rebooting every core, backbone, edge, and other router at least once or twice a week…
Nope, this is a misunderstanding. One has to *check* for advisories at least once or twice a week and only update (and reboot is necessary) if there *is* a vulnerability.
Checking is quite different from, actually, updating. What you may want to encourage your competition to do is to deploy a piece of software which actually *gets* a severe CVE twice in a week; that will certainly bring you a bunch of new customers.
Fair enough, but the frequency of vulnerability announcements, even in some of the best implementations, is still more often than I think my customers will tolerate reboots.

At the end of the day, this is really about risk analysis, and it helps to put things into 1 of 4 risk quadrants based on two axes… Axis 1 is the likelihood of the vulnerability being exploited, while axis 2 is the severity of the cost/consequences of exploitation.

Obviously something that scores high on both axes will have me rolling out the upgrades as rapidly as possible, likely within 24 hours to at least the majority of the network. Something that scores low on both axes, conversely, is likely not worth the customer disruption and support call volume (not to mention SLA credits, etc.) that come from doing that level of maintenance on short notice (or without notice).

The other two quadrants are a grey area that becomes more of a judgment call, where other factors specific to each operator and their customer profile will come into play. Some operators may have a high tolerance for a high-probability, low-cost problem, while others may find this very urgent, for example.

Owen
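A minimal sketch of the two-axis classification Owen describes (illustrative only; the 0.5 threshold and the suggested actions are assumptions, not anything from this thread):

    def risk_quadrant(likelihood, impact, threshold=0.5):
        """Place a vulnerability into one of four quadrants.

        likelihood and impact are scores in [0, 1]; the threshold is an
        arbitrary illustrative cut-off, not an operational recommendation.
        """
        high_likelihood = likelihood >= threshold
        high_impact = impact >= threshold
        if high_likelihood and high_impact:
            return "upgrade ASAP (e.g., within 24 hours)"
        if not high_likelihood and not high_impact:
            return "fold into the regular maintenance cycle"
        # The two mixed quadrants are judgment calls that depend on the
        # operator and its customer profile.
        return "grey area: weigh customer disruption against exposure"

    # Example: a remotely triggerable session reset in widely deployed code.
    print(risk_quadrant(likelihood=0.9, impact=0.8))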
On Wed, Jan 9, 2019 at 10:33 PM Owen DeLong <owen@delong.com> wrote:
At the end of the day, this is really about risk analysis and it helps to put things into 1 of 4 risk quadrants based on two axes… Axis 1 is the likelihood of the vulnerability being exploited, while axis 2 is the severity of the cost/consequences of exploitation.
Obviously something that scores high on both axes will have me rolling out the upgrades as rapidly as possible, likely within 24 hours to at least the majority of the network.
Good for you (not kidding). Not quite the same on average, as far as I can see.
The other two quadrants are a grey area that becomes more of a judgment call where other factors specific to each operator and their customer profile will come into play. Some operators may have a high tolerance for high-probability low-cost problem, while others may find this very urgent, for example.
I agree with you; however, it's the other quadrant (high cost, seemingly low probability) which is the real gray area IMO, one which allows for collateral damage at a Hollywood blockbuster scale.

-- Töma
On Wed, Jan 9, 2019 at 10:33 PM Owen DeLong <owen@delong.com> wrote:
Fair enough, but the frequency of vulnerability announcements even in some of the best implementations is still more often than I think my customers will tolerated reboots.
Well, thinking about it a second time, I can't help pointing out that there are long-lived efforts from OS developers to come up with live patching, especially embedded and RTOS developers. As the recent you-know-which downtime has shown us, there are Internet-based services, like 911 telephony, which really do start to treat the Internet as a whole as a real-time system.

The question here is whether this encourages e.g. the aforementioned FRR developers (along with device vendors who actually get paid for uninterruptible BGP availability) to accept this challenge.

-- Töma
NANOG,

The FRR devs have released binary packages including the fix and announced it on the FRR mailing lists. After considering the feedback on the list and discussing with the FRR devs, we will postpone the experiments until Jan. 23rd, and have updated the schedule to reflect the delayed start and shorter timeline [A]. We will follow up with the FRR devs and mailing lists/users.

[A] https://goo.gl/nJhmx1

On Tue, Jan 8, 2019 at 11:41 AM Italo Cunha <cunha@dcc.ufmg.br> wrote:
NANOG,

This is a reminder that this experiment will resume tomorrow (Wednesday, Jan. 23rd). We will announce 184.164.224.0/24 carrying a BGP attribute of type 0xff (reserved for development) between 14:00 and 14:15 GMT.

On Tue, Dec 18, 2018 at 10:05 AM Italo Cunha <cunha@dcc.ufmg.br> wrote:
Can you stop this?

You again caused a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia were affected by your "experiment" and had no idea what was happening or why.

Get a sandbox like every other researcher. As of now we have blackholed and filtered your whole ASN, and have recommended others do the same.

On Wed, 23 Jan 2019 at 1:19 am, Italo Cunha <cunha@dcc.ufmg.br> wrote:
--
Ben Cooper
Chief Executive Officer
PacketGG - Multicast
M(Telstra): 0410 411 301
M(Optus): 0434 336 743
E: ben@packet.gg & ben@multicast.net.au
W: https://packet.gg
W: https://multicast.net.au
Ben, NANOG,

We have canceled this experiment permanently.

On Wed, Jan 23, 2019 at 12:00 PM Ben Cooper <ben@packet.gg> wrote:
Dear Ben, all,

I'm not sure this experiment should be canceled. On the public Internet we MUST assume BGP speakers are compliant with the BGP-4 protocol. Broken BGP-4 speakers are what they are: broken. They must be fixed, or the operator must accept the consequences.

"Get a sandbox like every other researcher" is not a fair statement; one can also posit "Get a compliant BGP-4 implementation like every other network operator".

When bad guys explicitly seek to target these Asian and Australian operators you reference (who apparently have not upgraded to the vendor-recommended release), using *valid* BGP updates, will a politely emailed request help resolve the situation? Of course not!

Stopping the experiment is only treating symptoms; the root cause must be addressed: broken software.

Kind regards,

Job

On Wed, Jan 23, 2019 at 12:19:09PM -0500, Italo Cunha wrote:
I would be very interested in hearing Ben's definition of something that is "massive", if announcing or withdrawing a single /24 from the global routing table constitutes, quote, "a massive prefix spike/flap". Individual /24s are moved around all the time by fully automated systems. On Wed, Jan 23, 2019 at 9:42 AM Job Snijders <job@ntt.net> wrote:
Dear Ben, all,
I'm not sure this experiment should be canceled. On the public Internet we MUST assume BGP speakers are compliant with the BGP-4 protocol. Broken BGP-4 speakers are what they are: broken. They must be fixed, or the operator must accept the consequences.
"Get a sandbox like every other researcher" is not a fair statement, one can also posit "Get a compliant BGP-4 implementation like every other network operator".
When bad guys explicitly seek to target these Asian and Australian operators you reference (who apparently have not upgraded to the vendor recommended release), using *valid* BGP updates, will a politely emailed request help resolve the situation? Of course not!
Stopping the experiment is only treating symptoms, the root cause must be addressed: broken software.
Kind regards,
Job
On Wed, Jan 23, 2019 at 12:19:09PM -0500, Italo Cunha wrote:
Ben, NANOG,
We have canceled this experiment permanently.
On Wed, Jan 23, 2019 at 12:00 PM Ben Cooper <ben@packet.gg> wrote:
Can you stop this?
You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.
On Wed, 23 Jan 2019 at 1:19 am, Italo Cunha <cunha@dcc.ufmg.br> wrote:
NANOG,
This is a reminder that this experiment will resume tomorrow (Wednesday, Jan. 23rd). We will announce 184.164.224.0/24 carrying a BGP attribute of type 0xff (reserved for development) between 14:00 and 14:15 GMT.
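For readers unfamiliar with the wire format under discussion, here is a minimal sketch, in Python, of how a path attribute like the one used in this experiment (flags 0xe0, optional transitive; type 0xff, reserved for development; a 32-byte value) is laid out per RFC 4271, section 4.3. The helper name and the random payload are illustrative only.

    import os
    import struct

    def encode_path_attribute(flags: int, type_code: int, value: bytes) -> bytes:
        """Encode one BGP path attribute TLV as laid out in RFC 4271, section 4.3."""
        if flags & 0x10:  # Extended Length bit set: two-octet length field
            header = struct.pack("!BBH", flags, type_code, len(value))
        else:             # otherwise the length field is a single octet
            if len(value) > 255:
                raise ValueError("value too long for a one-octet length field")
            header = struct.pack("!BBB", flags, type_code, len(value))
        return header + value

    # Flags 0xe0 = Optional (0x80) | Transitive (0x40) | Partial (0x20);
    # type 0xff is reserved for development; the value here is 32 (0x20) bytes.
    attr = encode_path_attribute(0xE0, 0xFF, os.urandom(32))
    print(attr.hex())  # prints "e0ff20" followed by 64 hex digits of payload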
I hope you are as critical of your hardware vendor that cannot accept BGP4 compliant attributes or have you just not updated your code? You can black hole anything you want but as long as the "Internet" is sending you an RFC compliant BGP you better be able to handle it.

Steven Naslund Chicago IL

On Wed, Jan 23, 2019 at 12:00 PM Ben Cooper <ben@packet.gg> wrote:

Can you stop this?

You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.

Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.
On Wed, 23 Jan 2019 at 17:58, Naslund, Steve <SNaslund@medline.com> wrote:
I hope you are as critical of your hardware vendor that cannot accept BGP4 compliant attributes or have you just not updated your code? You can black hole anything you want but as long as the “Internet” is sending you an RFC compliant BGP you better be able to handle it.
I'd go further and say that as long as you're connected to the Internet, your equipment better be resilient when receiving packets with any combination of bits set, RFC compliant or not. Aled
On Wed, Jan 23, 2019 at 9:05 PM Aled Morris via NANOG <nanog@nanog.org> wrote:
I'd go further and say that as long as you're connected to the Internet, your equipment better be resilient when receiving packets with any combination of bits set, RFC compliant or not.
Well, here, when you receive this particular attribute and if you're vulnerable, your equipment automatically gets disconnected from the Internet, so the issue kinda solves itself. -- Töma
Contact your hardware vendor. That is not acceptable behavior. If it is not RFC compliant they need to accept the attribute, if it's not RFC compliant they should gracefully ignore it. Now we all know that anyone using that gear is vulnerable to a DoS attack. Won't be long until anyone else sends that to you. Steven Naslund Chicago IL
Well, here, when you receive this particular attribute and if you're vulnerable, your equipment automatically gets disconnected from the Internet, so the issue kinda solves itself.
-- Töma
Sorry. Correction. If it IS RFC compliant they should accept the attribute. If it is NOT, they should drop (and maybe log it). Steve
Contact your hardware vendor. That is not acceptable behavior. If it is not RFC compliant they need to accept the attribute, if it's not RFC compliant they should gracefully ignore it. Now we all know that anyone using that gear is vulnerable to a DoS attack. Won't be long until anyone else sends that to you.
Steven Naslund Chicago IL
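Since the thread keeps circling back to what a compliant speaker is supposed to do with an attribute it does not recognize, here is a minimal sketch, in Python, of the decision RFC 4271 prescribes (sections 5 and 6.3). The constant values are the standard flag bits; the function and its return values are illustrative only, and note that RFC 7606 has since relaxed the session-reset behaviour for many UPDATE errors.

    OPTIONAL   = 0x80  # attribute flag bits, per RFC 4271 section 4.3
    TRANSITIVE = 0x40
    PARTIAL    = 0x20

    def handle_unrecognized_attribute(flags: int) -> tuple:
        """Decide what to do with a path attribute whose type code we do not
        implement, following RFC 4271 sections 5 and 6.3."""
        if not flags & OPTIONAL:
            # Unrecognized *well-known* attribute: UPDATE message error
            # (Unrecognized Well-known Attribute), i.e. NOTIFICATION and session close.
            return ("notify-and-close", flags)
        if flags & TRANSITIVE:
            # Unrecognized optional transitive attribute: keep it, set the Partial
            # bit, and pass it on to other peers together with the route.
            return ("propagate", flags | PARTIAL)
        # Unrecognized optional non-transitive attribute: quietly ignore it.
        return ("ignore", flags)

    # The attribute used in the experiment (flags 0xe0) is optional transitive,
    # so a compliant speaker simply propagates it:
    print(handle_unrecognized_attribute(0xE0))  # ('propagate', 224)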
On Wed, Jan 23, 2019 at 12:00 PM Ben Cooper <ben@packet.gg> wrote: You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher
The whole thing reminds me of a decades-old story which can be found on Google with the search term "The Spider of Doom". What if, next time, it were the bad guys doing this? Would you urge them to also get a sandbox? -- Töma
This experiment should be continued. It's the only way to get people to patch stuff. And if all it takes to break things is a single announcement, then that's something that should definitely be fixed.

Blacklisting an ASN is not a solution, that's ignorance.

Regards, Filip Hruska

On 23 January 2019 18:19:09 CET, Italo Cunha <cunha@dcc.ufmg.br> wrote:
Ben, NANOG,
We have canceled this experiment permanently.
Agreed. Do you think you will not see that attribute again, now that the public knows that you are vulnerable to this DoS method? Expect to see an attack based on this method shortly. They just did you a favor by exposing your vulnerability; you should take it as such. I would be putting in emergency patches tonight if available. Steven Naslund Chicago IL
This experiment should be continued.
It's the only way to get people to patch stuff. And if all it takes to break things is a single announcement, then that's something that should definitely be fixed.
Blacklisting an ASN is not a solution, that's ignorance.
Regards, Filip Hruska
On Wed, Jan 23, 2019 at 10:16 AM Filip Hruska <fhr@fhrnet.eu> wrote:
This experiment should be continued. It's the only way to get people to patch stuff.
Agreed. But be gentle. Wait a couple months between repeats to give folks a generous amount of time to patch their gear. If you create the emergency that a hacker could but hasn't, you've done the hacker's job for him. Regards, Bill Herrin -- William Herrin ................ herrin@dirtside.com bill@herrin.us Dirtside Systems ......... Web: <http://www.dirtside.com/>
On Jan 23, 2019, at 10:16 , Filip Hruska <fhr@fhrnet.eu> wrote:
This experiment should be continued.
It's the only way to get people to patch stuff. And if all it takes to break things is a single announcement, then that's something that should definitely be fixed.
Blacklisting an ASN is not a solution, that's ignorance.
Actually, at the point where you blacklist the ASN, you’ve moved from ignorance to willful ignorance (aka stupidity). Owen
On 23/01/2019 18:19, Italo Cunha wrote:
We have canceled this experiment permanently.
Sad to hear! :/ My impression is that if you continue, more users will become aware of bugs in running BGP implementations. Not all follow announcements from $vendor, and many will continue to run older, broken code and complain to $vendor about errors in the software.
Throwing my support behind continuing the experiment also. A singular complaint from a company advertising unallocated ASN and IPv4 resources (the irony) does not warrant cessation of the experiment. The experiment is in compliance with the relevant RFCs; the affected “vendor” has released fixed software and announced it to their notifications list. I can only hope this and future research continue.
On Jan 23, 2019, at 1:39 PM, Christoffer Hansen <christoffer@netravnen.de> wrote:
On 23/01/2019 18:19, Italo Cunha wrote: We have canceled this experiment permanently.
Sad to hear! :/
My impression is that if you continue, more users will become aware of bugs in running BGP implementations. Not all follow announcements from $vendor, and many will continue to run older, broken code and complain to $vendor about errors in the software.
On Wed, Jan 23, 2019 at 06:45:50PM +0000, Nikolas Geyer wrote:
Throwing my support behind continuing the experiment also. A singular complaint from a company advertising unallocated ASN and IPv4 resources (the irony) does not warrant cessation of the experiment.
Agreed; Please resume the experiment. We're all operators here, and we MUST have confidence that BGP speakers on our network are compliant with protocol specification. Experiments like this are opportunities for a real-life validation of how our devices handle messages that are out of the norm, and help us identify issues. Kudos to researchers by the way, for sending courtesy announcements in advance, and testing against some common platforms available to them (Cisco, Quagga & BIRD) prior to the experiment. James
On 23/01/2019 20:01, James Jun wrote:
Kudos to researchers by the way, for sending courtesy announcements in advance, and testing against some common platforms available to them (Cisco, Quagga & BIRD) prior to the experiment.

On 18/12/2018 16:05, Italo Cunha wrote:

We have successfully tested propagation of the announcements on Cisco IOS-based routers running versions 12.2(33)SRA and 15.3(1)S, Quagga 0.99.23.1 and 1.1.1, as well as BIRD 1.4.5 and 1.6.3.

If the tests were done with _only_ Quagga, BIRD, and Cisco IOS, I can only conclude that a limited set of implementations was covered. (But then again, the $_vendors and $_versions of $_software out there cannot be counted in small numbers.)
On 23/Jan/19 21:01, James Jun wrote:
Agreed; Please resume the experiment. We're all operators here, and we MUST have confidence that BGP speakers on our network are compliant with protocol specification. Experiments like this are opportunities for a real-life validation of how our devices handle messages that are out of the norm, and help us identify issues.
Kudos to researchers by the way, for sending courtesy announcements in advance, and testing against some common platforms available to them (Cisco, Quagga & BIRD) prior to the experiment.
+1. Mark.
From: James Jun Sent: Wednesday, January 23, 2019 7:02 PM
Agreed; Please resume the experiment. We're all operators here, and we MUST have confidence that BGP speakers on our network are compliant with protocol specification. Experiments like this are opportunities for a real-life validation of how our devices handle messages that are out of the norm, and help us identify issues.
Kudos to researchers by the way, for sending courtesy announcements in advance, and testing against some common platforms available to them (Cisco, Quagga & BIRD) prior to the experiment.
This actually makes me think that it might be worthwhile to include these types of tests in the regression testing suite. So that every time we evaluate new code or a new vendor, we don't only test for functionality, performance and scalability, but also for robustness, i.e. sending a whole heap of trash down the sockets that are accessible from the Internet (via the iACL holes), to limit the scope of the test. Rather than relying on experiments to notify us the hard way that something is not right.

adam
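A sketch of what one such regression case could look like in a pytest-style suite follows. The lab_harness module and its open_lab_bgp_session() helper are hypothetical placeholders for whatever lab tooling is actually available, and the addresses and AS numbers are documentation values; only the attribute bytes correspond to the one used in this experiment.

    import pytest

    # Hypothetical lab helper, NOT a real library: assumed to bring up an eBGP
    # session with the device under test and to expose send_update() and
    # is_established(). Substitute whatever your own lab tooling provides.
    from lab_harness import open_lab_bgp_session

    # A standards-compliant but unassigned optional transitive path attribute:
    # flags 0xe0, type 0xff (reserved for development), 32 bytes of value.
    UNASSIGNED_ATTR = bytes([0xE0, 0xFF, 0x20]) + bytes(32)

    # More edge cases can be appended to this list as they are identified.
    @pytest.mark.parametrize("extra_attr", [UNASSIGNED_ATTR])
    def test_session_survives_unknown_attribute(extra_attr):
        """The device under test must keep the session up when an UPDATE carries
        a well-formed path attribute it does not recognize."""
        with open_lab_bgp_session(peer="192.0.2.1", local_as=64512, peer_as=64513) as s:
            s.send_update(prefix="198.51.100.0/24", extra_path_attribute=extra_attr)
            assert s.is_established(), "session dropped on a compliant UPDATE"

The point is simply that the "heap of trash" becomes a repeatable, versioned test case rather than a one-off event on the public Internet.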
On Thu, Jan 24, 2019 at 03:49:46PM -0000, adamv0025@netconsultings.com wrote:
This actually makes me think that it might be worthwhile to include these types of tests in the regression testing suite. So that every time we evaluate new code or a new vendor, we don't only test for functionality, performance and scalability, but also for robustness, i.e. sending a whole heap of trash down the sockets that are accessible from the Internet (via the iACL holes), to limit the scope of the test.
Rather than relying on experiments to notify us the hard way that something is not right.
adam
I agree. It seems to me that testing with almost-valid data (well formed, but with disallowed values) as well as fuzz-testing are essential parts of software quality control. - Brian
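As a concrete illustration of testing with almost-valid data, here is a small Python sketch that starts from a well-formed optional transitive attribute and yields variants that each deviate in exactly one way; the particular mutations are examples only, not an exhaustive or authoritative list.

    import struct

    def make_attr(flags: int = 0xC0, type_code: int = 0xFF, value: bytes = bytes(4)) -> bytes:
        """A syntactically valid path attribute with a one-octet length field."""
        return struct.pack("!BBB", flags, type_code, len(value)) + value

    def almost_valid_variants():
        """Yield attributes that are well formed except for one deliberate deviation."""
        base = make_attr()
        # 1. Length field larger than the value actually present (truncated attribute).
        yield base[:2] + bytes([len(base)]) + base[3:]
        # 2. Length field of zero while value bytes still follow (trailing garbage).
        yield base[:2] + bytes([0]) + base[3:]
        # 3. Well-known (Optional bit clear) but with the Partial bit set:
        #    a flag combination RFC 4271 does not allow.
        yield make_attr(flags=0x60)
        # 4. Optional non-transitive with the Partial bit set: also disallowed.
        yield make_attr(flags=0xA0)
        # 5. Extended Length bit set, but only a one-octet length supplied.
        yield bytes([0xD0, 0xFF, 0x04]) + bytes(4)

    for i, attr in enumerate(almost_valid_variants(), 1):
        print(i, attr.hex())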
From: Brian Kantor Sent: Thursday, January 24, 2019 3:58 PM
I agree.
It seems to me that testing with almost-valid data (well formed, but with disallowed values) as well as fuzz-testing are essential parts of software quality control.
To be frank: I have blasted packets at the platforms from Ixias and Spirents to see if they break gracefully, loaded millions of routes and thousands of VRFs and BGP sessions to see what happens, even designed the backbones with separate Internet and VPN RRs, and enabled enhanced error handling. But have I ever sat down and generated BGP packets with slight deviations to see how the BGP session, the process, or the whole RPD copes with these? I've got to say no, never. And judging from the overly positive (or even negative) responses to the BGP Experiment, I'm not alone in this. Otherwise everyone would be like: nah, I don't care, as I have all my bases covered and I know how my BGP behaves when processing exceptions.

adam
On Thu, 24 Jan 2019 at 17:52, <adamv0025@netconsultings.com> wrote:
This actually makes me thing that it might be worthwhile including these types of test to the regression testing suite.
I seem to recall one newish entrant to the SP market explaining that they are limited by wall-time in blackbox testing. They would have no particular challenge testing everything, but the number of permutations and the wall-time to execute a single test simply make it impossible to test comprehensively. So if you can't test everything, what do you test? How do you predict what is more likely to be broken?

Focus on MTBF is a fool's errand; maybe someone like FB, AMZN, MSFT can do statistical analysis on the outcome of a change, but the rest of us are just guessing whether what we did increased MTBF, as we don't have enough failures to actually know. Focus should be on MTTR.

There are some commercial BGP fuzzers, I've only tested one of them: https://www.synopsys.com/software-integrity/security-testing/fuzz-testing/de...

-- ++ytti
From: Saku Ytti <saku@ytti.fi> Sent: Thursday, January 24, 2019 4:28 PM
On Thu, 24 Jan 2019 at 17:52, <adamv0025@netconsultings.com> wrote:
This actually makes me thing that it might be worthwhile including these types of test to the regression testing suite.
I seem to recall one newish entrant to the SP market explaining that they are limited by wall-time in blackbox testing. They would have no particular challenge testing everything, but the number of permutations and the wall-time to execute a single test simply make it impossible to test comprehensively. So if you can't test everything, what do you test? How do you predict what is more likely to be broken?
We fight with that all the time. I'd say that of the whole Design->Certify->Deploy->Verify->Monitor service lifecycle time budget, the service certification testing is almost half of it. That's why I'm so interested in a model-driven design and testing approach. I really need to have this ever-growing library of test cases that the automation will churn through with very little human intervention, in order to reduce the testing from months to days or weeks at least.
There are some commercial BGP fuzzers, I've only tested one of them: https://www.synopsys.com/software-integrity/security-testing/fuzz-testing/defensics/protocols/bgp4-server.html
Thank you very much for the link. adam
On Thu, 24 Jan 2019 at 18:43, <adamv0025@netconsultings.com> wrote:
We fight with that all the time. I'd say that of the whole Design->Certify->Deploy->Verify->Monitor service lifecycle time budget, the service certification testing is almost half of it. That's why I'm so interested in a model-driven design and testing approach.
This shop has 100% automated blackbox testing, and still they have to cherry-pick what to test. Do you have statistics on how often you find show-stopper issues and how far into the test they were found? I expect this to be an exponential curve: upgrading the box, getting your signalling protocols up, and pushing one packet through each service you sell is easy and fast; I wonder whether a massive amount of additional work increases confidence significantly beyond that. The issues I tend to find in production are issues which are not trivial to recreate in the lab even once we know what they are, which implies that finding them a priori is a bit of a naive expectation. So, assumptions:

a) blackbox testing has exponentially diminishing returns; quickly you need to expend massively more effort to gain slightly more confidence

b) you can never say 'x works', you can only say 'I found a way to confirm x is not broken in this very specific case'; the way x will end up being broken may be very complex

c) if recreating issues you know about is hard, then finding issues you don't know about is massively more difficult

d) testing likely increases your comfort to deploy more than your probability of success

Hopefully we'll enter a NOS future where we download the NOS from github and compile it for our devices, allowing the whole community to contribute to unit testing and use cases, and letting you run minimal-bug-surface code in your environment.

I see very little future in blackbox testing a vendor NOS at the operator site, beyond a quick poke in the lab; it seems like poor value. Rather, have a pessimistic deployment plan: lab => staging => 2-3 low-risk sites => 2-3 high-risk sites => slow roll-out
I really need to have this ever-growing library of test cases that the automation will churn through with very little human intervention, in order to reduce the testing from months to days or weeks at least.
Lots of vendors, maybe all, accept your configurations and test them for their releases. I think this is the only viable solution vendors have for blackbox testing: gather configs from customers and test those, instead of trying to guess what to test. I've done that with Cisco in two companies; unfortunately I can't really tell if it impacted quality, but I like to think it did. -- ++ytti
From: Saku Ytti <saku@ytti.fi> Sent: Friday, January 25, 2019 7:59 AM
On Thu, 24 Jan 2019 at 18:43, <adamv0025@netconsultings.com> wrote:
We fight with that all the time, I'd say that from the whole Design->Certify->Deploy->Verify->Monitor service lifecycle time budget, the service certification testing is almost half of it. That's why I'm so interested in a model driven design and testing approach.
This shop has 100% automated blackbox testing, and still they have to cherry-pick what to test.
Sure, one tests only for the few specific current and near-future use cases.
Do you have statistics on how often you find show-stopper issues and how far into the test they were found?
I don't keep those statistics, but running bug scrubs in order to determine the code for regression testing is usually a good starting point to avoid show-stoppers; what is then found later on during the testing is usually patched. So yes, you end up with a brand-new code and several patches related to your use cases (PEs, Ps, etc.).
I expect this to be an exponential curve: upgrading the box, getting your signalling protocols up, and pushing one packet through each service you sell is easy and fast; I wonder whether a massive amount of additional work increases confidence significantly beyond that.
Yes it will.
The issues I tend to find in production are issues which are not trivial to recreate in the lab even once we know what they are, which implies that finding them a priori is a bit of a naive expectation. So, assumptions:
This is because you did your due diligence during the testing. Do you have statistics on the probability of these "complex" bugs occurring?
Hopefully we'll enter a NOS future where we download the NOS from github and compile it for our devices, allowing the whole community to contribute to unit testing and use cases, and letting you run minimal-bug-surface code in your environment.
Not there yet, but you can compile your own routing protocols and run those on vendor OS.
I see very little future in blackbox testing a vendor NOS at the operator site, beyond a quick poke in the lab; it seems like poor value. Rather, have a pessimistic deployment plan: lab => staging => 2-3 low-risk sites => 2-3 high-risk sites => slow roll-out
Yes, that's also a possibility, and one of the strong arguments for massive disaggregation at the edge: to reduce the fallout of a potential critical failure. Depends on the shop, really.
I really need to have this ever-growing library of test cases that the automation will churn through with very little human intervention, in order to reduce the testing from months to days or weeks at least.
Lots of vendors, maybe all, accept your configurations and test them for their releases. I think this is the only viable solution vendors have for blackbox testing: gather configs from customers and test those, instead of trying to guess what to test. I've done that with Cisco in two companies; unfortunately I can't really tell if it impacted quality, but I like to think it did.
Did that with Juniper partners and now directly with Juniper. The thing is, though, they are using our test plan... adam
Hey,
This is because you did your due diligence during the testing. Do you have statistics on the probability of these "complex" bugs occurring?
No. I wish I had, and I hope to change that. Try to translate how good an investment testing is: how many customer outages it has saved, etc. I suspect simple bugs are found by vendor, complex bugs are not economic to find. And testing is more proof of work than business case. -- ++ytti
I suspect simple bugs are found by vendor, complex bugs are not economic to find.
the running internet is complex and has a horrifying number of special cases compounded by kiddies being clever. no one, independent of resource requirements, could build a lab to the scale needed to test. and then there is ewd's famous quote about testing. randy
From: Randy Bush <randy@psg.com> Sent: Thursday, January 31, 2019 6:56 PM
I suspect simple bugs are found by vendor, complex bugs are not economic to find.
the running internet is complex and has a horrifying number of special cases compounded by kiddies being clever. no one, independent of resource requirements, could build a lab to the scale needed to test.
Yes, what can break will break, yet here we are exchanging emails. I think your statement assumes a vast search space. No need to solve the whole thing, just to make my tiny part a bit better; no need to solve my tiny part for eternity, just for the near term. Yes, there will always be a long tail, but with what one would deem a sufficiently low probability, in the intersection of the above search spaces.
and then there is ewd's famous quote about testing.
Yes human brains have their limits, hence we invented AI to help us solve complexity. Though in a sense it's just shifting the complexity to yet another layer above... adam
On Wed, Jan 23, 2019 at 12:00 PM Ben Cooper <ben@packet.gg> wrote:
Can you stop this?
You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
You could(?) be helpful by propagating the announcement to other $lists, or by pointing out to the authors that they are not spreading their announcements about the experiment widely enough.
Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.

Probably the previously sent announcement about the prefix announcements going out today should have gone to more mailing lists at RIRs and NOGs. (Encountered an FRR user today, too, who had missed the previous announcements.)
-Christoffer
Replying to throw in my support behind continuing the experiment as well. Assurance that my gear will NOT fall over under adversarial situations is paramount; thank you for the research that you're doing to ensure that.

Ben, you may wish to re-evaluate how "rock solid" [1] your networking truly is if you're being taken down by random BGP updates. As others have noted, the right target to be angry at is your equipment vendor.

[1]: https://packet.gg/

On 1/24/2019 02:19 AM, Italo Cunha wrote:
Ben, NANOG,
We have canceled this experiment permanently.
On Thu, Jan 24, 2019 at 3:23 AM Paul S. <contact@winterei.se> wrote:
As others have noted, the right target to be angry at is your equipment vendor.
...whose name I'd personally be quite delighted to finally hear. Is it just the same FRR that got a patch someone failed to apply, or is the issue more serious this time?
Replying to throw in my support behind continuing the experiment as well.
+1. I've yet to see any disruptions caused by the experiment in my area. -- Töma
On 1/23/2019 8:27 PM, Töma Gavrichenkov wrote:
Replying to throw in my support behind continuing the experiment as well.

+1. I've yet to see any disruptions caused by the experiment in my area.
Speaking of which, were there any statistics gathered and published before, during and after the experiment about the size of the global routing table and how many ASNs were impacted ? ---Mike -- ------------------- Mike Tancsa, tel +1 519 651 3400 x203 Sentex Communications, mike@sentex.net Providing Internet services since 1994 www.sentex.net Cambridge, Ontario Canada
On Thu, Jan 24, 2019, 5:40 PM Mike Tancsa <mike@sentex.net> wrote:
On 1/23/2019 8:27 PM, Töma Gavrichenkov wrote:
+1. I've yet to see any disruptions caused by the experiment in my area.
Speaking of which, were there any statistics gathered and published before, during and after the experiment about the size of the global routing table and how many ASNs were impacted ?
No, sorry. Like I said, Radar has seen nothing more than a few minor interruptions during that period. If we had detected anything serious, we would have posted it to our blog. I wonder if the experiment organizers have also collected any data. -- Töma
Or you could simply fix your gear rather than leaving a big hole in your infrastructure. *shrug* On Thu, Jan 24, 2019, 7:25 AM Ben Cooper <ben@packet.gg> wrote:
Can you stop this?
You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.
On Thu, 24 Jan 2019 04:00:27 +1100, Ben Cooper said:
You caused again a massive prefix spike/flap,
That's twice now you've said that without any numbers or details. Care to explain what you mean by "massive" in a world where the IPv4 table has like 700K+ routes? And as perceived by what point(s) in the topology? Knowing where there are pockets of network admins shooting themselves in the foot drastically improves the ability of organizations like NetDoctors Without Borders to give proper aid where needed...
If I understand this thread correctly, the test caused no actual change in the routing table size or route announcements. That was all a result of the incorrect behavior of the software.

Instead of throwing rocks, how about some data? We can collaborate and better understand the whole thing, make it better, and move on to the next thing. Yelling about "North America" when 4 of the 7 listed researchers on the test are NOT IN NORTH AMERICA doesn't really help anything.

On Thu, Jan 24, 2019 at 10:25 AM Ben Cooper <ben@packet.gg> wrote:
Can you stop this?
You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.
OP is yet to clarify how a single /24 advertisement caused a "massive-prefix spike/flap"; in OP's words.

The Experiment should continue.

-Randy

On Friday, January 25, 2019, 2:32:47 PM PST, Tom Beecher <beecher@beecher.cc> wrote:

If I understand this thread correctly, the test caused no actual change in the routing table size or route announcements. That was all a result of the incorrect behavior of the software.

Instead of throwing rocks, how about some data? We can collaborate and better understand the whole thing, make it better, and move on to the next thing. Yelling about "North America" when 4 of the 7 listed researchers on the test are NOT IN NORTH AMERICA doesn't really help anything.

On Thu, Jan 24, 2019 at 10:25 AM Ben Cooper <ben@packet.gg> wrote:
Can you stop this?
You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher, as of now we have black holed and filtered your whole ASN, and have reccomended others do the same.
I might be reading this wrong, but it appears only one person has raised an issue and then not actually backed it up with data.

Out of the eyes that have views inside the major networks, did anyone see any issues?

Surely cross-posting this to other NOG lists is sufficient.

On Sat, 26 Jan 2019 at 09:15, Randy via NANOG <nanog@nanog.org> wrote:
OP is yet to clarify how a single /24 advertisement caused a "massive-prefix spike/flap"; in OP's words.
The Experiment should continue. -Randy
On Friday, January 25, 2019, 2:32:47 PM PST, Tom Beecher <beecher@beecher.cc> wrote:
If I understand this thread correctly, the test caused no actual change in the routing table size or route announcements. That was all a result of the incorrect behavior of the software.

Instead of throwing rocks, how about some data? We can collaborate and better understand the whole thing, make it better, and move on to the next thing. Yelling about "North America" when 4 of the 7 listed researchers on the test are NOT IN NORTH AMERICA doesn't really help anything.
On Thu, Jan 24, 2019 at 10:25 AM Ben Cooper <ben@packet.gg> wrote:

Can you stop this?

You caused again a massive prefix spike/flap, and as the internet is not centered around NA (shock horror!) a number of operators in Asia and Australia go effected by your “expirment” and had no idea what was happening or why.
Get a sandbox like every other researcher, as of now we have black holed
and filtered your whole ASN, and have reccomended others do the same.
-- Regards, Mark L. Tees
I did realise a little after sending this that it would be a no-no to discuss this from a security standpoint. On Sat, 26 Jan 2019 at 12:47, Mark Tees <marktees@gmail.com> wrote:
I might be reading this wrong, but it appears only one person has raised an issue and then not actually backed it up with data.

Out of the eyes that have views inside the major networks, did anyone see any issues?

Surely cross-posting this to other NOG lists is sufficient.
-- Regards, Mark L. Tees
i just want to make sure that folk are really in agreement with what i think i have been hearing from a lot of strident voices here.

if you know of an out-of-spec vulnerability or bug in deployed router, switch, server, ... ops and researchers should exploit it as much as possible in order to encourage fixing of the hole.

given the number of bugs/vulns, are you comfortable that this is going to scale well? and this is prudent when our primary responsibility is a running internet?

just checkin'

randy

PS: if you think this, speak up so i can note to never hire or recommend you.

PPS: Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush; "Disco: Fast, Good, and Cheap Outage Detection"; TMA 2017 ^^^^^ :)
I think that’s a bit of reductio ad absurdum from what has been said. I would prefer that researchers collaborate to:

1. Compile a list of lists that should be notified of such experiments in advance. Try to get the word out to as much of the community as possible through various NOGs and other relevant industry lists.

2. Use said list of lists to provide at least 7 days advance notice of such testing, ideally with links to the details of the vulnerability in question and known vulnerable and known good code bases for as many software/hardware platforms as feasible. (Ideally list unknowns and solicit feedback as well).

3. Provide contact information for reporting test-related problems, issues, affected software versions, etc. Ideally an email address for after-action reports of data and a phone number that will be monitored during active testing for emergent reports of test-related service disruptions.

4. Conduct the test for incrementally longer periods over time. e.g. start with a 15 minute test on the first try and then run 30, 60, and multi-hour tests on later dates after addressing any reported problems during earlier tests.

I think such behavior would provide the best intersection of encouraging patching/fixing while also minimizing disruption and harm to innocent third parties.

Owen
On Jan 26, 2019, at 8:15 AM, Randy Bush <randy@psg.com> wrote:
i just want to make sure that folk are really in agreement with what i think i have been hearing from a lot of strident voices here.
if you know of an out-of-spec vulnerability or bug in deployed router, switch, server, ... ops and researchers should exploit it as much as possible in order to encourage fixing of the hole.
given the number of bugs/vulns, are you comfortable that this is going to scale well? and this is prudent when our primary responsibility is a running internet?
just checkin'
randy
PS: if you think this, speak up so i can note to never hire or recommend you.
PPS: Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush; "Disco: Fast, Good, and Cheap Outage Detection"; TMA 2017 ^^^^^ :)
On Sat, 26 Jan 2019 11:37:05 -0800, Owen DeLong said:
1. Compile a list of lists that should be notified of such experiments in advance. Try to get the word out to as much of the community as possible through various NOGs and other relevant industry lists.
As we've discovered after many such events, the overlap between the people who read those lists and the people running outdated vulnerable software isn't very large.
On Jan 26, 2019, at 16:48, valdis.kletnieks@vt.edu wrote:
On Sat, 26 Jan 2019 11:37:05 -0800, Owen DeLong said:
1. Compile a list of lists that should be notified of such experiments in advance. Try to get the word out to as much of the community as possible through various NOGs and other relevant industry lists.
As we've discovered after many such events, the overlap between the people who read those lists and the people running outdated vulnerable software isn't very large.
While this may be true, if you have a better suggestion for how to reach them, I’m all ears. Otherwise, doing the best we can to disseminate the information as widely as possible seems the most practicable approach currently available. Owen
As we've discovered after many such events, the overlap between the people who read those lists and the people running outdated vulnerable software isn't very large.
to steal from a reply to a private message: there are a jillion folk at the edges of the net running with low end gear, low margins, and 312 pressures. *knowingly* abusing them into an update a week is just not reasonable ops behavior.

and, at the other extreme, big core isps have a pre-deployment test window of six or more months. the only win here is that public embarrassment does help to get the big vendors to give us a fix with which to start the lab test cycle. bug reports to tac seem not to.

randy
I think a better question is, once a vulnerability has become widespread public knowledge, do you expect malicious actors, malware authors and intelligence agencies of autocratic nation-states to obey a gentlemen's agreement not to exploit something? There is not a great deal of Venn-diagram overlap between "organizations that will pay $2 million for a zero-day remote exploit on the latest version of iOS" and "people who care about whether Randy Bush recommends them for a job". On Sat, Jan 26, 2019 at 8:16 AM Randy Bush <randy@psg.com> wrote:
i just want to make sure that folk are really in agreement with what i think i have been hearing from a lot of strident voices here.
if you know of an out-of-spec vulnerability or bug in deployed router, switch, server, ... ops and researchers should exploit it as much as possible in order to encourage fixing of the hole.
given the number of bugs/vulns, are you comfortable that this is going to scale well? and this is prudent when our primary responsibility is a running internet?
just checkin'
randy
PS: if you think this, speak up so i can note to never hire or recommend you.
PPS: Anant Shah, Romain Fontugne, Emile Aben, Cristel Pelsser, and Randy Bush; "Disco: Fast, Good, and Cheap Outage Detection"; TMA 2017 ^^^^^ :)
I think a better question is, once a vulnerability has become widespread public knowledge, do you expect malicious actors, malware authors and intelligence agencies of autocratic nation-states to obey a gentlemen's agreement not to exploit something?
false analogy, or maybe just a subject switch. the 'attacker' was not a nation state nor intentionally malicious. it was a naïve researcher meaning no harm. in fact, i have co-authored with ítalo, and he is a very well meaning, and usually cautious, researcher. he just fell in with a crew with a rep for ops cluelessness that needed to demonstrate it once again.

to nick's point. as nick knows, i am a naggumite; one of my few disagreements with dr postel. but there is a difference between writing protocol specs/code, and sending packets on the global internet. rigor in the former, prudence in the latter.

while it is tragically true that someone will be willing to load mrs schächter on the cattle car, it damned well ain't gonna be me.

randy
On 1/26/19 6:37 PM, Randy Bush wrote:
to nick's point. as nick knows, i am a naggumite; one of my few disagreements with dr postel. but there is a difference between writing protocol specs/code, and sending packets on the global internet. rigor in the former, prudence in the latter.
OK, Randy, you piqued my interest: what is a naggumite?

Many of us disagreed with Jon Postel from time to time, but he usually understood the alternative points of view.
On 27/01/2019 19:21, William Allen Simpson wrote:
OK, Randy, you piqued my interest: what is a naggumite?
... (scouring the internet)
o https://www.nanog.org/mailinglist/mailarchives/old_archive/2006-01/msg00250....
o https://en.wikipedia.org/wiki/Erik_Naggum
o https://www.dictionary.com/browse/-ite
?
OK, Randy, you piqued my interest: what is a naggumite?
erik naggum, an early and strong proponent of being strict. you've been around long enough you should remember erik.
Many of us disagreed with Jon Postel from time to time, but he usually understood the alternative points of view.
oh, i have been dealing with network cowboys (and yes, unsurprisingly pretty universally boys) for enough decades to mostly understand. but the lack of prudence and level of irresponsibility occasionally surprise me.

randy
William Allen Simpson wrote on 27/01/2019 18:21:
OK, Randy, you piqued my interest: what is a naggumite?
http://naggum.no/worse-is-better.html a.k.a. "perfect is the enemy of good enough". Nick
On Sun, Jan 27, 2019 at 01:21:56PM -0500, William Allen Simpson wrote:
On 1/26/19 6:37 PM, Randy Bush wrote:
to nick's point. as nick knows, i am a naggumite; one of my few disagreements with dr postel. but there is a difference between writing protocol specs/code, and sending packets on the global internet. rigor in the former, prudence in the latter.
OK, Randy, you piqued my interest: what is a naggumite?
Many of us disagreed with Jon Postel from time to time, but he usually understood the alternative points of view.
I fondly recall that Erik could be quite acerbic, as I think is well exemplified by this:

"If I had to deal with you professionally, I would have told you to hold the onions and give me large fries." - Erik Naggum

Unfortunately, I don't recall to whom he said that; I suppose I am lucky that it wasn't me.

- Brian
Randy Bush wrote on 26/01/2019 16:15:
if you know of an out-of-spec vulnerability or bug in a deployed router, switch, server, ... ops and researchers should exploit it as much as possible in order to encourage fixing of the hole.
It came out as "please continue", but the sentiment sounded less like malice / ignorance, and more like a lack of sympathy for people who leave equipment connected to the dfz which shouldn't be connected to the dfz.
given the number of bugs/vulns, are you comfortable that this is going to scale well? and is this prudent when our primary responsibility is a running internet?
This isn't the first time that an implementation mishandling an IANA BGP attribute has caused service loss, and it's unlikely to be the last time either. https://sempf.net/post/On-Testing1

Some time in the future, it will be acceptable to continue the DISCO experiment along its current lines, because BGP stack authors will remember the time that attribute 255 caused things to explode and their code bases will be resilient to this problem.

When this happens, will it be acceptable to announce prefixes with arbitrary unassigned attributes with random contents? Where does the boundary lie between what is and what is not acceptable? Do we assign a time limit after which it's considered generally acceptable to announce attributes or capabilities which are known to cause problems? If someone were to set up a beacon system which announced prefixes with unassigned attributes and garbage content, is that a useful community service or simply a nuisance?

The research people acted correctly in stopping the experiment. They could engage with the IETF IDR working group to get a temporary attribute code point rather than using 255, and it would be interesting to see results from this. But I'm not convinced that it's feasible for the internet community to assert that any particular machination of BGP announcement is out of bounds in perpetuity - in the longer term, this will promote systemic infrastructural weakness rather than doing what we all aspire to, namely creating a more resilient internet.

Nick
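(For concreteness, a minimal sketch in Python, not taken from the thread: it shows how an optional transitive path attribute using the unassigned type code 255, "reserved for development", is laid out on the wire per RFC 4271 section 4.3. The encode_path_attribute helper and the 32-byte all-zero payload are illustrative assumptions, not anything the experimenters published.)

# Minimal sketch, not from the thread: the flags/type/length/value layout of a
# BGP path attribute per RFC 4271 section 4.3, using type 255 ("reserved for
# development").  The 32-byte zero payload is a placeholder.
import struct

ATTR_FLAG_OPTIONAL = 0x80    # bit 0: optional
ATTR_FLAG_TRANSITIVE = 0x40  # bit 1: transitive (propagate even if unrecognized)
ATTR_FLAG_EXT_LENGTH = 0x10  # bit 3: two-octet length field

def encode_path_attribute(attr_type, value):
    """Return one path attribute encoded as flags, type, length, value."""
    flags = ATTR_FLAG_OPTIONAL | ATTR_FLAG_TRANSITIVE
    if len(value) > 255:
        flags |= ATTR_FLAG_EXT_LENGTH
        return struct.pack("!BBH", flags, attr_type, len(value)) + value
    return struct.pack("!BBB", flags, attr_type, len(value)) + value

attr = encode_path_attribute(255, bytes(32))
print(attr.hex())  # c0ff20 followed by 64 zero hex digits

Per RFC 4271, a router that does not recognize an optional transitive attribute is expected to accept it, set the Partial flag, and pass it along unchanged; the argument in this thread is essentially about what happens when an implementation gets that handling wrong.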
participants (35)
- adamv0025@netconsultings.com
- Aled Morris
- Ben Cooper
- Brian Kantor
- Christoffer Hansen
- Eric Kuhnke
- Filip Hruska
- Hansen, Christoffer
- Italo Cunha
- James Jun
- Jared Mauch
- Job Snijders
- Job Snijders
- Mark Tees
- Mark Tinka
- Mike Hale
- Mike Tancsa
- Naslund, Steve
- Nick Hilliard
- niels=nanog@bakker.net
- Nikolas Geyer
- Owen DeLong
- Paul S.
- Randy
- Randy Bush
- Saku Ytti
- Stephen Satchell
- Steve Noble
- Tom Ammon
- Tom Beecher
- Tore Anderson
- Töma Gavrichenkov
- valdis.kletnieks@vt.edu
- William Allen Simpson
- William Herrin