Blaine Christian wrote:
Another thing, it would be interesting to hear of any work on breaking the "router code" into multiple threads. Being able to truly take advantage of multiple processors when receiving 2M updates would be the cat's pajamas. Has anyone seen this? I suppose MBGP could be rather straightforward, as opposed to one big table, in a multi-processor implementation.
You may want to read this thread from the beginning. The problem is not the routing plane or routing protocol but the forwarding plane, the ASICs, or whatever. The two have very different scaling properties. The forwarding plane is at a disadvantage here because it faces growth in table size and, at the same time, less time to perform a lookup. With current CPUs you can handle a 2M prefix DFZ quite well without killing the budget. For the forwarding hardware this ain't the case, unfortunately.
Hi Andre...
I hear what you are saying but don't agree with the above statement. The problem is with the system as a whole and I believe that was the point Vladis, and others, were making as well. The forwarding plane is only one part of the puzzle. How do you get the updates into the forwarding plane? How do you get the updates into the router in the first place and how fast can you do that? I have seen at least one case where the issue did not appear to be the ASICs but getting the information into them rapidly. If you go and create a new ASIC without taking into account the manner in which you get the data into it you probably won't sell many routers <grin>.
Sure, if you have a bottleneck at FIB insertion you fail much earlier. I'd say if that happens it's an engineering oversight or a design tradeoff. However, I don't think this is the choke point in the entire routing table size equation. Depending on the type of prefix churn, you don't have that many transactions reaching the FIB. Most far-away churn doesn't change the next hop, for example. Local churn, when direct neighbors flap, mostly just changes the next hop (egress interface). In a high-performance FIB, be it ASIC, TCAM, or whatever, a next hop change can be done quite trivially. A prefix drop can be handled by marking the entry invalid and garbage collecting it later. A prefix insertion may either salvage an invalidated entry or has to go through a full insert. The insertion time depends on the algorithms of the FIB table implementation. For all practical purposes a FIB can be designed to be quite speedy in this regard without busting the budget. The link speed between two DFZ routers has seldom been the limit for initial routing table exchanges. Neither has TCP. The exchange is mostly dominated by the algorithm choice and CPU of the RIB processor on both ends.
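To make the update paths above concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the names FibEntry, SketchFib, announce, withdraw and collect_garbage are hypothetical, and a dict stands in for whatever structure a real ASIC or TCAM FIB uses.

```python
from dataclasses import dataclass


@dataclass
class FibEntry:
    next_hop: str
    valid: bool = True


class SketchFib:
    """Toy FIB: a dict stands in for the hardware table (assumption)."""

    def __init__(self):
        self.entries = {}
        self.full_inserts = 0  # counts the expensive path only

    def announce(self, prefix, next_hop):
        entry = self.entries.get(prefix)
        if entry is not None:
            # Cheap path: rewrite the next hop in place. If the entry was
            # invalidated earlier, this salvages it instead of re-inserting.
            result = "updated" if entry.valid else "salvaged"
            entry.next_hop = next_hop
            entry.valid = True
            return result
        # Expensive path: the prefix has no slot yet, so do a full insert.
        self.entries[prefix] = FibEntry(next_hop)
        self.full_inserts += 1
        return "inserted"

    def withdraw(self, prefix):
        # A drop only marks the entry invalid; the slot is reclaimed later.
        entry = self.entries.get(prefix)
        if entry is not None:
            entry.valid = False

    def lookup(self, prefix):
        # Exact-match lookup only; longest-prefix match is out of scope here.
        entry = self.entries.get(prefix)
        return entry.next_hop if entry is not None and entry.valid else None

    def collect_garbage(self):
        # Deferred cleanup of invalidated entries, run off the fast path.
        dead = [p for p, e in self.entries.items() if not e.valid]
        for p in dead:
            del self.entries[p]
        return len(dead)
```

In this sketch a flapping prefix costs one full insert plus a few flag and next hop rewrites, which is the point being made about churn rarely reaching the expensive FIB path.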
BTW, I do agree that spinning new ASICs is a non-trivial task and is certainly the task you want to get started quickly when building a new system.
It is non-trivial for its prefix storage size and ultra-fast lookup times. Longest prefix match is probably the most difficult thing to scale properly as a search always must be done over a number of overlapping prefixes. To scale this much better and remove the bottleneck you may drop the 'overlapping' part or the 'longest-match' part and the world suddenly looks much brighter. This is the crucial thing that got forgotten during the IPng design phase which brought us IPv6. So far we have learned that limiting the number of IPv[46] prefixes in the DFZ is not an option for commercial and socio-technical reasons. That leaves only the other option of changing the routing lookup to something with better scaling properties.
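A small Python sketch of why longest prefix match is the hard part. This is not how any particular router does it; LpmTable and its methods are hypothetical names, it handles IPv4 only, and it uses one hash table per prefix length, probed from longest to shortest.

```python
import ipaddress


class LpmTable:
    """Toy IPv4 longest-prefix-match table, one dict per prefix length."""

    def __init__(self):
        self.by_length = {}  # prefix length -> {network as int: next hop}

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        table = self.by_length.setdefault(net.prefixlen, {})
        table[int(net.network_address)] = next_hop

    def lookup(self, address):
        addr = int(ipaddress.ip_address(address))
        probes = 0
        # Overlapping prefixes force a search: try the longest length first
        # and fall back to shorter ones until something matches.
        for length in sorted(self.by_length, reverse=True):
            probes += 1
            mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
            hit = self.by_length[length].get(addr & mask)
            if hit is not None:
                return hit, probes
        return None, probes


table = LpmTable()
table.insert("10.0.0.0/8", "A")
table.insert("10.1.0.0/16", "B")
table.insert("10.1.2.0/24", "C")
print(table.lookup("10.1.2.3"))  # ('C', 1)
print(table.lookup("10.9.9.9"))  # ('A', 3)
```

The probe count grows with the number of distinct prefix lengths in the table. If prefixes never overlapped, or if only one fixed length had to be matched, the lookup would collapse to a single exact-match probe, which is the scaling gain described above.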
I did read your comment on BGP lending itself to SMP. Can you elaborate on where you might have seen this? It has been a pretty monolithic implementation for as long as I can remember. In fact, that was why I asked the question, to see if anyone had actually observed a functioning multi-processor implementation of the BGP process.
I can make the SMP statement with some authority as I have done the internal design of the OpenBGPd RDE and my co-worker Claudio has implemented it. Given proper locking of the RIB, a number of CPUs can crunch on it and handle neighbor communication independently of each other. If you look at Oracle databases, they manage to scale performance by a factor of 1.9-1.97 per CPU. There is no reason to believe we can't do this with the BGP 'database'. -- Andre
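As a rough illustration of the "proper locking of the RIB" idea, here is a sketch in Python with one worker thread per neighbor and a lock per RIB shard. It is not the OpenBGPd RDE design; LockedRib, feed, run_neighbors and the shard count are all assumptions of mine, and best path selection is reduced to shortest AS path length with the neighbor name as tie-breaker. Note that CPython's global interpreter lock means these threads will not actually run in parallel on several CPUs, so this shows the locking structure only, not the speedup.

```python
import threading


class LockedRib:
    """Toy RIB split into shards, each protected by its own lock."""

    def __init__(self, shards=8):
        self.shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, prefix):
        return self.shards[hash(prefix) % len(self.shards)]

    def update(self, neighbor, prefix, as_path_len):
        routes, lock = self._shard(prefix)
        # Only workers touching the same shard contend for this lock.
        with lock:
            routes.setdefault(prefix, {})[neighbor] = as_path_len

    def best(self, prefix):
        routes, lock = self._shard(prefix)
        with lock:
            candidates = routes.get(prefix)
            if not candidates:
                return None
            # Simplified decision: shortest AS path, then neighbor name.
            return min(candidates.items(), key=lambda kv: (kv[1], kv[0]))[0]


def feed(rib, neighbor, updates):
    # One worker per neighbor session, independent of the other sessions.
    for prefix, as_path_len in updates:
        rib.update(neighbor, prefix, as_path_len)


def run_neighbors(rib, feeds):
    threads = [
        threading.Thread(target=feed, args=(rib, neighbor, updates))
        for neighbor, updates in feeds.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Sharding the lock by prefix is one possible design choice: neighbors announcing different parts of the table rarely block each other, while updates for the same prefix are still serialized.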