how would a sidr-enabled routing infrastructure have fared in yesterday's routing circus?

/bill
On Nov 8, 2011, at 5:56 PM, <bmanning@vacation.karoshi.com> wrote:
how would a sidr-enabled routing infrastructure have fared in yesterday's routing circus?
The effects of large amounts of route-churn on the auth chain - perhaps DANE? - might've been interesting . . .

-----------------------------------------------------------------------
Roland Dobbins <rdobbins@arbor.net> // <http://www.arbornetworks.com>

The basis of optimism is sheer terror.

-- Oscar Wilde
We saw an increase in IPv6 traffic which correlated time-wise with the onset of this IPv4 incident. Happy eyeballs in action, automatically shifting what it could.

Mike.

On 11/8/11 2:56 AM, bmanning@vacation.karoshi.com wrote:
how would a sidr-enabled routing infrastructure have fared in yesterday's routing circus?
/bill
that was/is kind of orthogonal to the question... would the sidr plan for routing security have been a help in this event? nice to know unsecured IPv6 took some of the load when the unsecured IPv4 path failed.

the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...

/bill

On Tue, Nov 08, 2011 at 10:01:04AM -0800, Mike Leber wrote:
We saw an increase in IPv6 traffic which correlated time wise with the onset of this IPv4 incident.
Happy eyeballs in action, automatically shifting what it could.
Mike.
On 11/8/11 2:56 AM, bmanning@vacation.karoshi.com wrote:
how would a sidr-enabled routing infrastructure have fared in yesterday's routing circus?
/bill
On Nov 9, 2011, at 1:14 AM, <bmanning@vacation.karoshi.com> wrote:
that was/is kind of orthogonal to the question... would the sidr plan for routing security have been a help in this event?
SIDR is intended to provide route-origination validation - it isn't intended to be, nor can it possibly be, a remedy for vendor-specific implementation problems.

Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.
On Nov 9, 2011, at 1:22 AM, Dobbins, Roland wrote:
Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.
To be clear, I was alluding to some discussion centering around DANE or a DANE-like mechanism to handle SIDR-type route validation. Recursive dependencies make this a non-starter, IMHO.
On Tue, Nov 08, 2011 at 06:25:36PM +0000, Dobbins, Roland wrote:
On Nov 9, 2011, at 1:22 AM, Dobbins, Roland wrote:
Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.
To be clear, I was alluding to some discussion centering around DANE or a DANE-like mechanism to handle SIDR-type route validation. Recursive dependencies make this a non-starter, IMHO.
well... you're still stuck with knowing where your CA is...

/bill
On 8 Nov 2011, at 18:24, "Dobbins, Roland" <rdobbins@arbor.net> wrote:
On Nov 9, 2011, at 1:14 AM, <bmanning@vacation.karoshi.com> wrote:
that was/is kind of orthogonal to the question... would the sidr plan for routing security have been a help in this event?
SIDR is intended to provide route-origination validation - it isn't intended to be nor can it possibly be a remedy for vendor-specific implementation problems.
Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.
Indeed, we can expect new and exciting ways to blow up networks with SIDR.

-- Leigh
On Tue, Nov 8, 2011 at 4:08 PM, Leigh Porter <leigh.porter@ukbroadband.com> wrote:
On 8 Nov 2011, at 18:24, "Dobbins, Roland" <rdobbins@arbor.net> wrote:
Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.
Indeed, we can expect new and exciting ways to blow up networks with SIDR.
or, the same old ways... only with crypto!

really, there was some care taken in the process to create this and NOT stomp all over how networks currently work. comments welcome though.
On 08/11/2011 18:14, bmanning@vacation.karoshi.com wrote:
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
i'm curious about sidr cold bootup, specifically when you are attempting to validate prefixes from an rpki CA or cache to which you do not necessarily have network connectivity because your igp is not yet fully up. The phrases "layering violation" and "chicken and egg" come to mind.

Nick
On Tue, Nov 08, 2011 at 06:48:12PM +0000, Nick Hilliard wrote:
On 08/11/2011 18:14, bmanning@vacation.karoshi.com wrote:
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
i'm curious about sidr cold bootup, specifically when you are attempting to validate prefixes from an rpki CA or cache to which you do not necessarily have network connectivity because your igp is not yet fully up. The phrases "layering violation" and "chicken and egg" come to mind.
Nick
yeah... there is that.

/bill
i'm curious about sidr cold bootup, specifically when you are attempting to validate prefixes from an rpki CA or cache to which you do not necessarily have network connectivity because your igp is not yet fully up. The phrases "layering violation" and "chicken and egg" come to mind.
what comes to my mind is that NotFound is the default and it is recommended to route on it.

i know boys are not allowed to read the manual, but this is starting to get boring.

randy
On 08/11/2011 19:19, Randy Bush wrote:
what comes to my mind is that NotFound is the default and it is recommended to route on it.
I understand what the manual says (actually, i read it). I'm just curious as to how this is going to work in real life.

Let's say you have a router cold boot with a bunch of ibgp peers, a transit or two and an rpki cache which is located on a non-connected network - e.g. small transit pop / AS boundary scenario. The cache is not necessarily going to be reachable until it sees an update for its connected network. Until this happens, there will be no connectivity from the router to the cache, and consequently prefixes received in from the transit may be subject to an incorrect and potentially inconsistent routing policy with respect to the rest of the network.

Ok, they'll be revalidated once the cache comes on line, but what do you do with them in the interim? Route traffic to them, knowing that they might or might not be correct? Drop until the cache comes online, from the point of view of the router? Forward potentially incorrect UPDATEs to your other ibgp peers, and forward validated updates when the cache comes on-line again? If so, then what if your incorrect new policy takes precedence over an existing path in your ibgp mesh? And what if your RP is low on memory from storing an unvalidated adj-rib-in?

You could argue to have a local cache in every pop, but that may not be feasible either - a cache will require storage with a high write life-cycle (i.e. forget about using lots of types of flash), and you cannot be guaranteed that this is going to be available on a router.

Look, i understand that you're designing rpki <-> router interactivity such that things will at least work in some fashion when your routers lose sight of their rpki caches. The problem is that this approach weakens rpki's strengths - e.g. the ability to help stop youtube-like incidents from recurring by ignoring invalid prefix injection.

Nick
On Tue, 08 Nov 2011 20:51:00 GMT, Nick Hilliard said:
I understand what the manual says (actually, i read it). I'm just curious as to how this is going to work in real life. Let's say you have a router cold boot with a bunch of ibgp peers, a transit or two and an rpki cache which is located on a non-connected network
Anybody who puts their rpki cache someplace that isn't accessible until they get the rpki initialized gets what they deserve. Once you realize this, the rest of the "what do we do for routing until it comes up" concern trolling in the rest of that paragraph becomes pretty easy to sort out...
You could argue to have a local cache in every pop but may not be feasible either - a cache will require storage with a high write life-cycle (i.e. forget about using lots of types of flash), and you cannot be guaranteed that this is going to be available on a router.
Caching just enough to validate the routes you need to get to a more capable rpki server shouldn't have a high write life-cycle. Heck, you could just manually configure a host route pointing to the rpki server...

And it would hardly be the first time that people have been unable to deploy feature XYZ because it wouldn't fit in the flash on older boxes still in production.
On 08/11/2011 21:32, Valdis.Kletnieks@vt.edu wrote:
Anybody who puts their rpki cache someplace that isn't accessible until they get the rpki initialized gets what they deserve.
One solution is to have directly-connected rpki caches available to all your bgp edge routers throughout your entire network. This may turn out to be expensive capex-wise, and will turn out to be yet another critical infrastructure item to maintain, increasing opex.

Alternatively, you host rpki caches on all your AS-edge routers, which means upgrades - and lots of currently-sold kit will simply not handle this sort of thing properly.
Once you realize this, the rest of the "what do we do for routing until it comes up" concern trolling in the rest of that paragraph becomes pretty easy to sort out...
I humbly apologise for expressing concern about the wisdom of imposing a hierarchical, higher-layer validation structure for forwarding-info management on a pre-existing lower layer fully distributed system which is already pretty damned complex... What's that principle called again? Was it "Keep It Complex, Stupid"? I can't seem to remember :-)
Caching just enough to validate the routes you need to get to a more capable rpki server shouldn't have a high write life-cycle.
Lots of older flash isn't going to like this => higher implementation cost due to upgrades.
Heck, you could just manually configure a host route pointing to the rpki server...
Yep, hard coding things - good idea, that.
And it would hardly be the first time that people have been unable to deploy feature XYZ because it wouldn't fit in the flash on older boxes still in production.
This is one of several points I'm making: there is a cost factor here, and it's not clear how large it is.

Nick
On Nov 9, 2011, at 5:19 AM, Nick Hilliard wrote:
One solution is to have directly-connected rpki caches available to all your bgp edge routers throughout your entire network.
They don't have to be directly-connected - they could be on the DCN, which ought to have at least some static 'hints' to critical resources.
In a message written on Tue, Nov 08, 2011 at 10:19:24PM +0000, Nick Hilliard wrote:
One solution is to have directly-connected rpki caches available to all your bgp edge routers throughout your entire network. This may turn out to be expensive capex-wise, and will turn out to be yet another critical infrastructure item to maintain, increasing opex.
Couldn't you just have a couple of these boxes on your network and route them in your IGP, removing any BGP dependency? KISS.

-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/
I understand what the manual says (actually, i read it).
cheating!!!!
I'm just curious as to how this is going to work in real life. Let's say you have a router cold boot with a bunch of ibgp peers, a transit or two and an rpki cache which is located on a non-connected network - e.g. small transit pop / AS boundary scenario. The cache is not necessarily going to be reachable until it sees an update for its connected network.
once again,

 o when you have no connection to a cache or no covering roa for a prefix, the result is specified as NotFound
 o we recommend you route on NotFound

so the result is the same as today.
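For illustration, the Valid / Invalid / NotFound outcomes described above can be sketched as a toy model of RFC 6811-style origin validation. The ROAs, prefixes, and AS numbers below are made up for the example; this is not any real implementation:

```python
import ipaddress

def validate_origin(prefix, origin_as, roas):
    """Return 'Valid', 'Invalid', or 'NotFound' for a route announcement.

    roas: list of (roa_prefix, max_length, asn) tuples held by the cache.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.version == roa_net.version and route.subnet_of(roa_net):
            covered = True  # at least one ROA covers this prefix
            if asn == origin_as and route.prefixlen <= max_length:
                return "Valid"
    # Covered by a ROA but no match -> Invalid; no covering ROA -> NotFound.
    return "Invalid" if covered else "NotFound"

# Example ROA set (illustrative):
roas = [("192.0.2.0/24", 24, 64496)]

print(validate_origin("192.0.2.0/24", 64496, roas))    # matching origin
print(validate_origin("192.0.2.0/24", 64511, roas))    # covered, wrong AS
print(validate_origin("198.51.100.0/24", 64496, roas)) # no covering ROA
print(validate_origin("192.0.2.0/24", 64496, []))      # empty cache
```

Note that with no cache data at all (the empty ROA set), everything comes back NotFound - which is the point: routing on NotFound leaves behaviour unchanged from today.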
Until this happens, there will be no connectivity from the router to the cache
false
Look, i understand that you're designing rpki <-> interactivity such that things will at least work in some fashion when your routers lose sight of their rpki caches. The problem is that this approach weakens rpki's strengths - e.g. the ability to help stop youtube-like incidents from recurring by ignoring invalid prefix injection.
you can't have your cake and eat it too. you can not detect invalid originations until you have the data to do so.

randy
On 09/11/2011 03:14, Randy Bush wrote:
once again, o when you have no connection to a cache or no covering roa for a a prefix, the result is specified as NotFound o we recommend you route on NotFound
so the result is the same as today.
Well no, not really, because when the cache becomes reachable again, you need to revalidate everything which got a NotFound. This will cause extra bgp churn where revalidation caused a local policy change.

Even if you have a local cache, this will still cause problems due to the problem you summarised in draft-ietf-sidr-origin-ops, section 6:

  "Like the DNS, the global RPKI presents only a loosely consistent view, depending on timing, updating, fetching, etc. Thus, one cache or router may have different data about a particular prefix than another cache or router. There is no 'fix' for this, it is the nature of distributed data with distributed caches."

Local caches may miss updates due to interior unreachability. Routers will not revalidate after cache updates. So this loosely consistent view will propagate into your routers' bgp views.

Do I really want this? Or, more to the point, is a perpetually inconsistent bgp network view better or worse than the occasional more serious reachability problem that rpki is attempting to solve? This isn't clear to me.
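The revalidation churn at issue here can be sketched as a simple state diff: every route held as NotFound while the cache was down gets re-checked when the cache returns, and each state change implies a possible policy change (and hence churn). Names and data below are illustrative only:

```python
def revalidation_churn(rib, validate):
    """Return prefixes whose validation state changes once the cache is back.

    rib: dict mapping prefix -> state assigned while the cache was down
         (everything 'NotFound', per the recommended default).
    validate: callable giving the state with the cache available again.
    """
    return {p: (old, validate(p)) for p, old in rib.items()
            if validate(p) != old}

# While the cache was unreachable, everything was routed as NotFound:
rib = {"192.0.2.0/24": "NotFound", "198.51.100.0/24": "NotFound"}

# Hypothetical states once the cache is reachable:
states = {"192.0.2.0/24": "Invalid", "198.51.100.0/24": "NotFound"}

changed = revalidation_churn(rib, lambda p: states.get(p, "NotFound"))
print(changed)  # only routes whose state changed need a policy update
```

Only the routes in `changed` trigger policy re-evaluation; the size of that set after an outage is what determines how much churn the reconnect causes.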
Until this happens, there will be no connectivity from the router to the cache
false
Not false in the scenario I described. Please read what I said, not what your straw man whispers in your ear. :-)

Nick
On Tue, Nov 8, 2011 at 1:48 PM, Nick Hilliard <nick@foobar.org> wrote:
On 08/11/2011 18:14, bmanning@vacation.karoshi.com wrote:
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
i'm curious about sidr cold bootup, specifically when you are attempting to validate prefixes from an rpki CA or cache to which you do not necessarily have network connectivity because your igp is not yet fully up. The phrases "layering violation" and "chicken and egg" come to mind.
'lazy validation' - prefer to get at least somewhat converged, then validate.
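A rough sketch of what such lazy validation might look like: accept and route on NotFound immediately, then batch-validate once convergence gives you a path to the cache. The class and method names here are hypothetical, not from any SIDR document:

```python
import collections

class LazyValidator:
    """Defer RPKI lookups until the cache is reachable (illustrative)."""

    def __init__(self):
        self.pending = collections.deque()
        self.state = {}

    def on_route(self, prefix, origin_as):
        # Route on NotFound now; queue the cache lookup for later.
        self.state[prefix] = "NotFound"
        self.pending.append((prefix, origin_as))

    def on_cache_up(self, validate):
        # Drain the queue once the cache is reachable; return what changed.
        changed = []
        while self.pending:
            prefix, origin_as = self.pending.popleft()
            new = validate(prefix, origin_as)
            if new != self.state[prefix]:
                self.state[prefix] = new
                changed.append(prefix)
        return changed

v = LazyValidator()
v.on_route("192.0.2.0/24", 64496)
v.on_route("198.51.100.0/24", 64511)
# Hypothetical validator available after convergence:
print(v.on_cache_up(lambda p, a: "Valid" if a == 64496 else "Invalid"))
```

The trade-off is the one discussed elsewhere in this thread: routes are usable immediately, but anything that later validates differently has to be re-processed, which is where the churn comes from.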
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
utter bullshit. maybe you would benefit by actually reading the doccos and understanding the protocols.
On Tue, Nov 08, 2011 at 08:16:10PM +0100, Randy Bush wrote:
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
utter bullshit. maybe you would benefit by actually reading the doccos and understanding the protocols.
are they actually coherent enough to be read & understood?

/bill
On Tue, 8 Nov 2011, bmanning@vacation.karoshi.com wrote:
On Tue, Nov 08, 2011 at 08:16:10PM +0100, Randy Bush wrote:
the answer seems to be NO, it would not have helped and would have actually contributed to network instability with large numbers of validation requests sent to the sidr/ca nodes...
utter bullshit. maybe you would benefit by actually reading the doccos and understanding the protocols.
are they actually coherent enough to be read & understood?
I think so: at least a Bachelor student of mine got along with them for his thesis.

Btw: There is also a very nice overview by Geoff published in Cisco IPJ:

* http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_14-2/142_bgp....

Cheers
  matthias

-- 
Matthias Waehlisch
.  Freie Universitaet Berlin, Inst. fuer Informatik, AG CST
.  Takustr. 9, D-14195 Berlin, Germany
.. mailto:waehlisch@ieee.org .. http://www.inf.fu-berlin.de/~waehl
:. Also: http://inet.cpt.haw-hamburg.de .. http://www.link-lab.net
fwiw, we have not tested the scaling of rpki-rtr performance as much as we might have. we synthesized an rpki cache with roas for all the prefixes in a current table, 370k of them or whatever, and let routers load that cache from zip to full. for low-end routers and a mediocre cache server, either local or across noam, it took less than five seconds. this was small enough that we moved on to other stuff.

randy
On Nov 8, 2011, at 7:28 PM, Randy Bush wrote:
fwiw, we have not tested the scaling of rpki-rtr performance as much as we might have. we synthesized an rpki cache with roas for all the prefixes in a current table, 370k of them or whatever, and let routers load that cache from zip to full. for low-end routers and a mediocre cache server, either local or across noam, it took less than five seconds. this was small enough that we moved on to other stuff.
randy
Did you do this on routers that already had fully converged tables, or did you bootstrap the table load into the routers at the same time, as would be the case in a power failure, post-crash reboot, software upgrade, etc.? If only the former, may I suggest that at least doing some level of the latter might prove a useful exercise?

I apologize for this mildly operational question. Y'all can go back to Randy's FUD-laden black helicopters now.

Owen
participants (11)

- bmanning@vacation.karoshi.com
- Christopher Morrow
- Dobbins, Roland
- Leigh Porter
- Leo Bicknell
- Matthias Waehlisch
- Mike Leber
- Nick Hilliard
- Owen DeLong
- Randy Bush
- Valdis.Kletnieks@vt.edu