If the problem we're talking about is that the number of paths an exchange point router will soon be seeing is too large, and having a single route server at each exchange is not politically acceptable, then here is a possible solution.

One way to cut down the number of paths each XP router sees would be for every XP NSP/ISP/whatever to install a router _and_ a BGP4 "proxy" server. That is, each XP router would peer via IBGP with a large (64-128MB of RAM) computer, which in turn would peer, via EBGP, with all of that AS' peers' routers or route servers. This way each XP router so configured would have as many paths learned at the XP as it would have routes learned at the XP, which would be a substantial reduction from the number of paths most XP routers tend to carry nowadays; the computer next to the router would be the one to absorb the heavy load of handling so many neighbors, routes and paths, and these computers can be upgraded with more ease than can the routers.

This approach has two major benefits:

- The load on the XP routers would be lowered considerably by relieving the router of part of the path selection process as well as the need to have enough memory to carry hundreds of thousands of paths. The pressure on router vendors to beef up their products' "path" capacity would be considerably reduced, thus lowering router costs in the long run.

- NAP members would be freed from having to wait for their router vendor to implement better route filtering features. After all, each member would then have better control over the software used for the BGP4 server (there's gated, there could be new PD/free/shareware BGP4 router daemons, or even commercial ones if everyone tried this setup), and would be able, perhaps, to hack their BGP4 daemon any which way they desire.
A Pentium-class PC running some sort of Unix or Unix-like OS, or a Sun or DEC Alpha, or something like that, populated with 64MB of RAM or more would do for such a server; these, unlike Cisco routers, tend to be easily upgradable to larger amounts of RAM or faster CPUs too!

If you think about it, freeing ourselves from our router vendors' BGP4 limitations would allow more experimenting with route dampening, and even a model that allows for some temporary entropy within the /18 model. For example, a BGP4 daemon could be developed that keeps track of each prefix's route flaps, thus allowing policies that filter unstable prefixes/paths, or that would temporarily allow in prefix announcements that are too long (this would allow a /18+Entropy model, which would help ASes deal with rare AS partitioning incidents and the like). The number and size of holes in large aggregates could also be controlled; a policy could specify that no more than 4 /24 holes in /18s be accepted, for example. All of the above features would take some time to implement, but probably less time than it would take router vendors (after all, many of us would have an interest in helping to develop these features).

One other, less important, benefit of doing this would be that of saving router vendors the headache of having to add more and more CPU and RAM capacity to their routers. Why would this be good? Well, IPv6, if it ever is accepted by the Internet (and I bet it will be), will only require XP routers to carry a few hundred routes at most (IPv6 would be used with CIDR from the beginning). As a result, the large amounts of RAM and fast CPUs everyone will have put in their XP routers by the time IPv6 replaces IPv4 would be wasted; general purpose servers can easily be reused, and if anything, are far less expensive than large routers.
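To make the "no more than 4 /24 holes in /18s" policy concrete, here is a minimal sketch of how a proxy daemon might apply it. Nothing here comes from gated or any real BGP4 daemon: the function name `filter_holes`, its arguments, and the choice to count any more-specific announcement inside a known aggregate as a hole are all assumptions made for illustration.

```python
import ipaddress
from collections import defaultdict


def filter_holes(aggregates, announcements, max_holes=4):
    """Accept at most `max_holes` more-specific prefixes per aggregate.

    aggregates:    prefixes (strings) already accepted as aggregates.
    announcements: candidate prefixes (strings), in arrival order.
    Returns (accepted, rejected) as two lists of strings.
    Hypothetical policy: announcements outside every known aggregate
    pass through untouched, since this sketch only limits holes.
    """
    aggs = [ipaddress.ip_network(a) for a in aggregates]
    holes = defaultdict(int)  # aggregate -> number of holes accepted so far
    accepted, rejected = [], []
    for text in announcements:
        net = ipaddress.ip_network(text)
        # A hole is a strictly longer prefix contained in a known aggregate.
        parent = next(
            (a for a in aggs
             if net.prefixlen > a.prefixlen and net.subnet_of(a)),
            None,
        )
        if parent is not None:
            if holes[parent] >= max_holes:
                rejected.append(text)
                continue
            holes[parent] += 1
        accepted.append(text)
    return accepted, rejected
```

A real daemon would presumably also age holes out again and combine this with flap tracking; the sketch only shows the counting step.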
NAPs could even segregate all BGP4 traffic off of the FDDI or ATM switches, since members' XP route servers could be configured to peer over a separate, lower bandwidth LAN.

Would this exchange point configuration be acceptable?

Nick

PS: A Cisco 4500 with two high speed interface boards and one ether card, along with a Pentium-class, 128MB box in a small case with no monitor or keyboard and with one ethernet interface, would be enough of a start-up kit for most NAP newcomers, and it would all fit in the limited rack space Sprint offers at its NAP. I'd love to see lower entry barriers.
the number of paths most XP routers tend to carry nowadays; the computer next to the router would be the one to absorb the heavy load of handling so many neighbors, routes and paths, and these computers can be upgraded with more ease than can the routers.
I absolutely agree that the time has come to remove complex routing protocols from routers and use workstations.
Jon,
I absolutely agree that the time has come to remove complex routing protocols from routers and use workstations.
Two points:

1. I am not sure whether BGP could be classified as a "complex routing protocol".

2. The proposed scheme does not remove BGP from the routers. In fact, BGP is used to communicate forwarding information between the routers and the route servers.

So, if we're to assume that BGP is "a complex routing protocol", then using route servers does not allow "to remove complex routing protocols from routers". And if we're to assume that BGP is not a complex routing protocol, then we don't have a problem of removing "complex routing protocols from routers".

So, it seems that using route servers would have no impact wrt to removing "complex routing protocols from routers".

Yakov.
Perhaps "policies, procedures & data" is a better word than "protocols" used in the generic sense. I want to route things differently based on the time of day, whether or not a particular network is seeing heavy traffic, and what my horoscope said today. I can implement this much more easily on a workstation where I have source code than I can convince Cisco to implement it.
So, it seems that using route servers would have no impact wrt to removing "complex routing protocols from routers".
Jon,
Perhaps "policies, procedures & data" is a better word than "protocols" used in the generic sense.
I want to route things differently based on the time of day, whether or not a particular network is seeing heavy traffic, and what my horoscope said today. I can implement this much more easily on a workstation where I have source code than I can convince Cisco to implement it.
I wonder how many ISP operations have a requirement to support internet-wide routing based on such factors as traffic load or your horoscope. Let me personally assure you that if there is enough demand, then there will be enough supply.

Yakov.
Yakov Rekhter previously wrote:
Jon,
I absolutely agree that the time has come to remove complex routing protocols from routers and use workstations.
Two points:
1. I am not sure whether BGP could be classified as a "complex routing protocol".
The information it is used to carry at the XPs is complex; so are the routing policies some folk want to use.
2. The proposed scheme does not remove BGP from the routers. In fact, BGP is used to communicate forwarding information between the routers and the route servers.
Of course, but the server condenses all of the paths it learns at the XP to just one path per prefix learned at the XP, thus significantly reducing the load on the router: now the router has several times fewer paths to choose from and store in memory.
So, if we're to assume that BGP is "a complex routing protocol", then using route servers does not allow "to remove complex routing protocols from routers". And if we're to assume that BGP is not a complex routing protocol, then we don't have a problem of removing "complex routing protocols from routers".
So, it seems that using route servers would have no impact wrt to removing "complex routing protocols from routers".
But it would have an impact on the load on them, by reducing it.
Yakov.
Frankly, I think every NAP should require this sort of setup, at least until routers get good enough in these respects, and even require that BGP4 NAP traffic be pushed off onto a lower bandwidth LAN.

Nick

PS: If I remember correctly, every peer at some NAPs gets two IP host addresses to use at the NAP; this means that using a BGP4 "proxy" (if I call it a route server some might think I'm referring to the RA route server) is a possibility for all, and no one is stopping anyone from using one at those NAPs.
Nick,
Of course, but the server condenses all of the paths it learns at the XP to just one path per prefix learned at the XP, thus significantly reducing the load on the router: now the router has several times fewer paths to choose from and store in memory.
The amount of saving depends entirely on the average number of paths per destination at the XP. Perhaps some empirical data in this area would help to quantify the possible gain. Yakov.
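The gain Yakov asks about can be stated as a ratio: paths heard at the XP divided by distinct prefixes heard at the XP. The sketch below shows the condensing step and that ratio under a loudly simplified assumption: the proxy picks the path with the shortest AS path and nothing else. The real BGP4 decision process looks at more attributes than that, and the function names `best_paths` and `reduction_factor` are made up for this illustration.

```python
def best_paths(paths):
    """Collapse many (prefix, as_path) pairs to one path per prefix.

    paths: iterable of (prefix, as_path), where as_path is a tuple of
    AS numbers. Simplifying assumption: shortest AS path wins and ties
    keep the first path seen. Real BGP4 path selection has more steps.
    """
    best = {}
    for prefix, as_path in paths:
        current = best.get(prefix)
        if current is None or len(as_path) < len(current):
            best[prefix] = as_path
    return best


def reduction_factor(paths):
    """Average number of paths per prefix, i.e. the saving at the router."""
    paths = list(paths)
    if not paths:
        raise ValueError("no paths given")
    return len(paths) / len(best_paths(paths))


# Made-up sample data using private-use AS numbers.
SAMPLE = [
    ("10.0.0.0/18", (64512, 64513, 64514)),
    ("10.0.0.0/18", (64515, 64514)),
    ("10.0.0.0/18", (64516, 64517, 64518, 64514)),
    ("10.1.0.0/18", (64512, 64520)),
    ("10.1.0.0/18", (64515, 64519, 64520)),
    ("10.2.0.0/18", (64516, 64521)),
]
```

With this sample the router behind the proxy would hold 3 paths instead of 6. Only measurements at a real XP can say what the factor is in practice.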
As others have pointed out, if all the router has to store is simple routes, say one entry containing a byte representing the interface for every /24 address there is, life is easy. With 16MB (one byte for each of the 2^24 possible /24s) you can efficiently route 16M routes on a router with much simpler software and CPU demands.

--
*** Internet Marketing and Development ***
Jon Zeeff
Branch Internet Services Inc.
jon@branch.com (313) 741-4442
http://branch.com/ gopher branch.com
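The one-byte-per-/24 table described in the message above can be sketched in a few lines. This is an illustration only, not how any router does forwarding: the names `new_table`, `set_route` and `lookup` are invented here, interface number 0 is assumed to mean "no route", and prefixes longer than /24 are simply refused.

```python
import ipaddress

SLOTS = 1 << 24  # one slot per possible /24, so 16M one-byte entries


def new_table():
    """Allocate the 16 MiB table; 0 is assumed to mean 'no route'."""
    return bytearray(SLOTS)


def set_route(table, prefix, interface):
    """Point every /24 covered by `prefix` at `interface` (1..255).

    Install shorter prefixes first: a later, longer prefix overwrites
    the slots it covers, which gives longest-match behaviour down to /24.
    """
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > 24:
        raise ValueError("this sketch cannot represent prefixes longer than /24")
    first = int(net.network_address) >> 8
    count = 1 << (24 - net.prefixlen)
    table[first:first + count] = bytes([interface]) * count


def lookup(table, address):
    """Return the interface byte for an IPv4 address: one array index."""
    return table[int(ipaddress.ip_address(address)) >> 8]
```

The lookup is a single shift and index, which is the point being made about simpler software and CPU demands; the cost is that the table holds only the final forwarding decision and none of the paths behind it.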
As Paul Traina took the time to pound into my skull some time last year, all difficult performance problems (CPU, memory, whatever) having to do with BGP4 are due to the number of views, not the number of prefixes. 30,000 prefixes will fit in a 16MB router -- if you have only one view of each prefix. You can do route processing for 30,000 prefixes using a CSC3 CPU -- if you only have one view of each prefix.

If the multiple AS paths are only known to a colocated workstation running GateD, then the iBGP from that workstation to the router(s) will only include one view of each prefix. And the workstation can fall back on VM during times when the number of prefixes or views exceeds planning. And finally, the workstation's memory is probably not going to be limited to 64MB.

I clearly think that colocated workstations are better than route processors inside the routers themselves. I'm less certain that they are better than route servers and a unified/recursive/realtime RADB. I'm not sure at all that any interconnect can, should, or ever shall require this kind of dual-routing setup for its members. In other words, why are we discussing this?
Paul A Vixie previously wrote:
If the multiple AS paths are only known to a colocated workstation running GateD, then the iBGP from that workstation to the router(s) will only include one view of each prefix. And the workstation can fall back on VM during times when the number of prefixes or views exceeds planning. And finally, the workstation's memory is probably not going to be limited to 64MB.
Precisely my point. I think it's a neat solution.
I clearly think that colocated workstations are better than route processors inside the routers themselves. I'm less certain that they are better than route servers and a unified/recursive/realtime RADB. I'm not sure at all that any interconnect can, should, or ever shall require this kind of dual-routing setup for its members. In other words, why are we discussing this?
We're discussing this precisely because you're not sure, as you say, what course of action is best.

Now, the folks at Sprint don't like the idea of a "unified" route server, and I don't blame them, since it does reduce one's independence when it comes to setting a routing policy: the "unified" route server only gives everyone one "view," the same view, when we'd all rather be able to make the decision ourselves of what "view" is best. Yes, the route server would be configured from the RADB, and we'd all have the right to register whatever policy for our ASes we think is best, but how often would the route server be reconfigured to reflect updates to the RADB? And should we really trust such a route server? Its implementation? Its administrators?

As for whether an interconnect can require everyone to collocate a route server for their AS, I don't see why not. It's only a few inches of rack space; the collocated route servers don't have to be connected to the high speed (FDDI or ATM or whatever comes along) interconnect either, because, as I've already pointed out, the collocated route servers could be on a totally separate, parallel, low bandwidth network (say, Ethernet).

Does anyone see a reason, political or technical, why collocating per-AS route servers is not viable? Yes, the scalability does not improve as opposed to what we have now, but PCs get better much faster than Cisco's or Wellfleet's or NSC's or <fill in the blank>'s routers, meaning scalability problems are pushed off far into the future; and it is far better, and easier to handle and administrate, than planning and administrating proxy aggregation.

[A fast PC with 1GB of RAM should be able to handle any NAP for the next year to two years, by which time IPv6 could be hitting the scene]

Nick
Precisely my point. I think it's a neat solution.
So do I, in the absence of ...
route servers and a unified/recursive/realtime RADB. [...]
We're discussing this precisely because you're not sure, as you say, what course of action is best.
"I'm not sure" in this context is a euphemism for "that's a really bad idea."
Now, the folks at Sprint don't like the idea of a "unified" route server, and I don't blame them, since it does reduce one's independence when it comes to setting a routing policy: the "unified" route server only gives everyone one "view," the same view, when we'd all rather be able to make the decision ourselves of what "view" is best.
Nope. As you're about to explain yourself, the RS architecture makes it possible to spout different truths out of each and every orifice. It is very definitely _not_ nec'y for every RS peer to have the same view, or that the Internet have a single great and context-insensitive "Truth".
Yes, the route server would be configured from the RADB, and we'd all have the right to register whatever policy for our ASes we think is best, but how often would the route server be reconfigured to reflect updates to the RADB?
Right now, this is the thing that makes the RS unusable. At the RPS WG in Stockholm I heard Daniel talk about a way to improve RADB update times, and Bill Manning and I proposed a DNS-based RADB whose rollups could be done by anyone needing the information rather than by a central body. While I admit that the RS doesn't have realtime updates right now, I know that it's coming. Knowing Sean for who he is, I'm fairly sure that no RADB or RS will ever be suitable to him. In particular...
and should we really trust such a route server? its implementation? its administrators?
...while I would trust those things if given sufficient reason to, I know of at least one network/routing engineer who wouldn't no matter how sufficient the reasons seem to the rest of us. So your point is valid on that score.
As for whether an interconnect can require everyone to collocate a route server for their AS, I don't see why not. It's only a few inches of rack space; the collocated route servers don't have to be connected to the high speed (FDDI or ATM or whatever comes along) interconnect either, because, as I've already pointed out, the collocated route servers could be on a totally separate, parallel, low bandwidth network (say, Ethernet).
In legal terms, we cannot add contract terms now that a lot of peering points are in use. There is, quite literally, "no way to require" colo'd workstations at peering points. It doesn't matter how little rack space it takes, or that the BGP4 traffic would be on different media like Ethernet, or whether it is (I'm not going to take a position) a wonderful idea or not. Legally, we cannot require it. Practically, the peering points are "open" to the extent that folks are expected to use their GIGAhose "as they see fit." Requiring colo'd workstations, even if we gave them away, is as impossible as requiring that folks sign an MLPA.
Does anyone see a reason, political or technical, why collocating per-AS route servers is not viable? Yes, the scalability does not improve as opposed to what we have now, but PCs get better much faster than Cisco's or Wellfleet's or NSC's or <fill in the blank>'s routers, meaning scalability problems are pushed off far into the future; and it is far better, and easier to handle and administrate, than planning and administrating proxy aggregation.
I think this is a reasonable architecture and if I am asked to recommend an architecture to someone connected to a MAE or NAP, I will mention this one. I recommend that you do the same. Perhaps you can live to see your ideal done up in practice. But banish all expectations that either a peering point administrator will help enforce your ideology, or that you will ever get full voluntary buy-in from every peer. N**2 BGP4 sessions are bad for likely values of N (100, maybe.) That won't change just because we've got a 1GB-RAM DEC Alpha with a 300MHz processor instead of a Cisco to do our route processing. N**2 BGP4 sessions is a bad design no matter what you're implementing it with. In that sense, your idea is not "viable" since it doesn't solve some of the real problems coming up.
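The session-count argument above can be put in numbers. A rough sketch, assuming every one of N peers at an exchange peers bilaterally with every other peer (the worst case) versus every peer holding a single session to one shared route server; the function names are invented for this illustration and N = 100 is the "likely value" mentioned above.

```python
def full_mesh_sessions(n):
    """Total bilateral BGP4 sessions if all n peers peer with each other.

    Each peer holds n - 1 sessions; the exchange as a whole carries
    n * (n - 1) / 2, which grows as N**2.
    """
    return n * (n - 1) // 2


def per_peer_sessions_full_mesh(n):
    """Sessions each peer (or its colocated workstation) must maintain."""
    return n - 1


def shared_route_server_sessions(n):
    """Total sessions if every peer talks only to one shared route server."""
    return n


N = 100  # the 'likely value' of N from the discussion
mesh_total = full_mesh_sessions(N)          # sessions across the exchange
mesh_each = per_peer_sessions_full_mesh(N)  # sessions per peer
shared_total = shared_route_server_sessions(N)
```

Moving the mesh from routers to colocated workstations changes which box holds the n - 1 sessions, not how many there are, which is the objection being raised.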
[A fast PC with 1GB of RAM should be able to handle any NAP for the next year to two years, by which time IPv6 could be hitting the scene]
I can't even begin to comment on that, I'm sure that I'd offend you.
Paul A Vixie previously wrote:
Precisely my point. I think it's a neat solution.
So do I, in the absence of ...
route servers and a unified/recursive/realtime RADB. [...]
We're discussing this precisely because you're not sure, as you say, what course of action is best.
"I'm not sure" in this context is a euphemism for "that's a really bad idea."
I figured. :)
Nope. As you're about to explain yourself, the RS architecture makes it possible to spout different truths out of each and every orifice. It is very definitely _not_ nec'y for every RS peer to have the same view, or that the Internet have a single great and context-insensitive "Truth".
Yes, indeed. But one RS will have to be far larger than each of the collocated RSes, and its software would be far more complicated. Mind you, I like the idea of a central RS configured from a routing DB that is up to date and populated with correct information.
Right now, this is the thing that makes the RS unusable. At the RPS WG in Stockholm I heard Daniel talk about a way to improve RADB update times, and Bill Manning and I proposed a DNS-based RADB whose rollups could be done by anyone needing the information rather than by a central body. While I admit that the RS doesn't have realtime updates right now, I know that it's coming.
Well, the RADB can be updated very quickly nowadays via e-mail; that's not the point. The point is that the RS itself needs to reflect RADB changes very quickly.
Knowing Sean for who he is, I'm fairly sure that no RADB or RS will ever be suitable to him. In particular...
Agreed.
and should we really trust such a route server? its implementation? its administrators?
...while I would trust those things if given sufficient reason to, I know of at least one network/routing engineer who wouldn't no matter how sufficient the reasons seem to the rest of us. So your point is valid on that score.
An RS that implements every AS' policy and responds to changes to its routing DB quickly is quite an undertaking; its software will be fairly complicated, which is why some people will not trust it for a long time. Others won't trust the RS because it is not them running it.
In legal terms, we cannot add contract terms now that a lot of peering points are in use. There is, quite literally, "no way to require" colo'd workstations at peering points. It doesn't matter how little rack space it takes, or that the BGP4 traffic would be on different media like Ethernet, or whether it is (I'm not going to take a position) a wonderful idea or not. Legally, we cannot require it. Practically, the peering points are "open" to the extent that folks are expected to use their GIGAhose "as they see fit."
Actually, the only ASes that would benefit immediately from this collocated RS configuration would be those that are seeing 100k or more paths now (i.e. folk who are located at multiple NAPs, mostly do non-transit peering, peer with others like themselves and buy little transit from others). The rest could continue peering at the XP as they already do, since they only add small numbers of paths and tend to buy transit from others to the routes that they don't get via non-transit peering.

Anyways, no one has to be forced to use collocated RSes (if they can't be asked to use a collocated RS, how can they be asked to use the RA RS?); if it's the only acceptable way to prevent routers from falling over, then many will choose to implement this. Besides, where peering is arranged on a one-by-one basis rather than multilaterally (as at the MAEs), any carrier of sufficient size can refuse to peer with anyone not willing to use collocated RSes.
I think this is a reasonable architecture and if I am asked to recommend an architecture to someone connected to a MAE or NAP, I will mention this one. I recommend that you do the same. Perhaps you can live to see your ideal done up in practice. But banish all expectations that either a peering point administrator will help enforce your ideology, or that you will ever get full voluntary buy-in from every peer.
See above.
N**2 BGP4 sessions are bad for likely values of N (100, maybe.) That won't change just because we've got a 1GB-RAM DEC Alpha with a 300MHz processor instead of a Cisco to do our route processing. N**2 BGP4 sessions is a bad design no matter what you're implementing it with. In that sense, your idea is not "viable" since it doesn't solve some of the real problems coming up.
Oh, of course, and I did mention that collocated route servers do not solve any scalability problem; they just get around the Cisco memory capacity problem. Apart from this, it is also worth noting that few routers at each XP announce a full set of routes to some or all neighbors: most announce what they originate, transit ASes announce what they originate as well as what their customers originate; those who do not get enough routes from non-transit peering buy transit for what they cannot hear from bigger carriers. The result is that with 20 neighbors a given router likely won't hear 20x full-Internet paths. :)

Does a central RS per-NAP have better scalability than collocated per-AS route servers do? This is an honest question; I would venture that it doesn't, but I have not studied the question enough.

The whole point of my suggesting collocated per-AS route servers was that Sprint refuses to go with the RA RS idea, and Sprint is large enough that there is now going to be a bit of the Pittsburgh NANOG conference dedicated to this problem; my hope was that collocated route servers would be acceptable to Sprint, thus buying us much time before even that scheme became unusable due to scalability problems.
[A fast PC with 1GB of RAM should be able to handle any NAP for the next year to two years, by which time IPv6 could be hitting the scene]
I can't even begin to comment on that, I'm sure that I'd offend you.
That's ok. If I'm ignorant of something, I'd like to learn; it won't offend me to be shown I don't know something (at worst I will feel embarrassed). What bothers you about that comment? I know how quickly the Internet is growing and I'm betting that collocated route servers will scale for the next year at least, probably longer.

As for IPv6, I freely admit that I'm not up to date on what is happening with it, but I understand it uses CIDR from the beginning. That IPv4 used that awful class scheme for so long is the main reason we're talking about 100k paths and routers falling over; if IPv6 can start with CIDR from the beginning, along with no IP address portability across carriers, then we'll be ok. IPv6 could benefit from better host, router and name server configuration protocols so as to simplify renumbering; so could IPv4.

Nick
Yakov Rekhter previously wrote:
Nick,
Of course, but the server condenses all of the paths it learns at the XP to just one path per prefix learned at the XP, thus significantly reducing the load on the router: now the router has several times fewer paths to choose from and store in memory.
The amount of saving depends entirely on the average number of paths per destination at the XP. Perhaps some empirical data in this area would help to quantify the possible gain.
I don't have this data right now, but no matter: I'm not proposing this just because I think it might help routers cope, I'm proposing it because I _know_ it would limit the paths an XP router hears at the XP to the number of prefixes at the XP. There are close to 30,000 routes in the Internet right now, which means that an XP router will only hear close to 30,000 paths at that XP; it may hear more from other places, internal as well as external, but this routing info and calculation load could also be offloaded to a PC-based route server. CIDR is already slowing route table growth enough to keep 1x full routing manageable for a long time to come.

BTW, 2x Internet routing BGP4 info fits in 4-6MB on a Cisco today, meaning that an AS that is multi-homed and hears full Internet routes at each of its N borders will only have to carry in its routers N times the number of Internet routes at most; an AS connected to all NAPs would have no problem running 32MB routers today and for some time into the future (64MB routers are available from Cisco, which had predicted a route explosion and could, and should, have had routers with larger memory capacity out a while ago).

I don't believe there is any impending routing meltdown. If collocating per-AS route servers at the NAPs helps our routers significantly, then we're doing our job well and we can all sleep at night. But if it's not enough, then I doubt any of the other proposals (centralized per-XP route servers, forced proxy route aggregation), with the exception of moving to IPv6, could help either.

Anyone who thinks that unilateral proxy aggregation would help at all is probably wrong; anyone who thinks proxy aggregation by committee can be pulled off has not been paying enough attention to the politics of the Internet. Additionally, unilateral proxy aggregation is a very dangerous and heavy handed approach that could bring quite a few lawsuits as well as government regulation into the game.
Or are the proponents of forced proxy aggregation counting on their team of lawyers to scare smaller fry away and settle with the mid-size carriers?
Yakov.
Nick
I absolutely agree that the time has come to remove complex routing protocols from routers and use workstations.
High Speed Motherboards .... Just saw a nice 75 - 200 MHz Pentium board in San Jose, nice chipset and features for under 300 dollars (no CPU). Purchased Quantum drives at $0.26 per MB.... Memory still around $500 per 16 MB..... 64 bit bus architectures are common and an accepted standard... In a year we may be seeing 180 MHz new version Pentiums... drives and storage cheaper.... and hopefully memory prices will fall.

It is not that hard to begin thinking of ways to build AS' routing architectures that are more exciting than the current computationally limited routers in the market place...... with computational hardware dropping in price and increasing in performance at an accelerated rate.... why design inter-domain systems based on the constraints of slow moving vendors ? Why constrain routing algorithms on the same H/W limitations?

Agreed.

Tim

--
+--------------------------------------------------------------------------+
| Tim Bass                           | #include<campfire.h>                |
| Principal Network Systems Engineer | for(beer=100;beer>1;beer++){        |
| The Silk Road Group, Ltd.          |   take_one_down();                  |
|                                    |   pass_it_around();                 |
| http://www.silkroad.com/           | }                                   |
|                                    | back_to_work(); /*never reached */  |
+--------------------------------------------------------------------------+
participants (7)
- jon@branch.com
- Nicolas Williams
- Nicolas Williams
- Paul A Vixie
- Tim Bass
- William B. Norton
- Yakov Rekhter