
On Sat, Feb 25, 2012 at 12:20 PM, <Valdis.Kletnieks@vt.edu> wrote:
> On Fri, 24 Feb 2012 21:39:37 EST, Christopher Morrow said:
>> The knobs available are sort of harsh all the way around though today :(
> So what would be a good knob if it was available? I've seen about forty-leven people say the current knobs suck, but no real proposals of "what would really rock is if we could...."
I'm not sure... here are a few ideas, though, to toss on the fire of thought:

1) break the process up inside the router, providing another set of
   places to block and tackle the problem
2) provide better metrics on the problem for operations staff
3) automate the problem 'better' (inside a set of sane boundaries)

For 1, I want to be assured that inbound data from one bgp peer will
not cause problems for all other peers on the same device. Keep the
parsing, memory and cpu management separate from the main routing
management inside the router, and provide controls on these points
configurable at a per-peer level. That way you could limit things like:

- each peer able to take a maximum amount of RAM, start discarding
  routes over that limit, alarm at a configurable percentage of the
  limit (rough sketch of the accounting below the sig)
- each peer able to consume only a set percentage of CPU resources;
  better would be the ability to pin bgp peer usage to a particular
  CPU (or set of CPUs) and other route processing to another
  CPU/set-of-CPUs
- interfaces between the bgp speaker, receiver, ingest and databases
  could all be standardized, simple and auditable as well

If a peer sent a malformed update, only that peering session would
die; if parsing the update caused a meltdown, again only the single
peer would be affected. The interface between the code speaking to
the peer and the RIB could be more robust and more resilient to
errors.

For 2, I think having more data available about avg rate of increase,
max rate of increase, average burst size and predicted time to
overrun would be helpful. Most of this one could gather with some
smart SNMP tricks, I suspect... on the other hand, just reacting to
the syslog messages in a timely fashion works :) (sketch of the
overrun math below as well)

For 3, automate the reaction to syslog/snmp messages, increasing the
thresholds if there hasn't been an increase in the last X hours and
the limit is not already above Y percent of a full table (and send a
note to the NOC ticket system for historical preservation). (third
sketch below)

These too have flaws... I'm not sure there's a good answer to this
though :(

-chris
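
None of this would be router code, obviously; it's just a rough
Python sketch to make the per-peer accounting idea in 1 concrete.
All the names and numbers here are made up:

# Rough sketch of per-peer resource limits: each peer gets its own
# route budget, overruns are discarded instead of pressuring the
# shared RIB, and an alarm fires once at a configurable percentage.
# All names and limits are hypothetical.

class PeerBudget:
    def __init__(self, peer, max_routes, alarm_pct=80):
        self.peer = peer
        self.max_routes = max_routes                  # hard per-peer cap
        self.alarm_at = max_routes * alarm_pct // 100
        self.routes = set()
        self.alarmed = False

    def accept(self, prefix):
        """Return True if the route fits this peer's budget."""
        if len(self.routes) >= self.max_routes:
            return False                              # discard over the limit
        self.routes.add(prefix)
        if len(self.routes) >= self.alarm_at and not self.alarmed:
            self.alarmed = True                       # alarm once, not per-route
            print("ALARM: %s at %d/%d routes"
                  % (self.peer, len(self.routes), self.max_routes))
        return True

# one budget per peer, so a misbehaving peer only hurts itself
budgets = {p: PeerBudget(p, max_routes=10000) for p in ("peerA", "peerB")}
for i in range(12000):                                # peerA overruns, gets capped
    budgets["peerA"].accept("10.%d.%d.0/24" % (i // 256, i % 256))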
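
For the metrics in 2, "predicted time to overrun" is basically a
linear extrapolation over polled route counts; a sketch, where the
sample numbers, 5-minute interval and limit are all assumptions:

# Sketch of the point-2 metrics: given per-peer route counts polled
# at a fixed interval (SNMP or otherwise), derive avg/max rate of
# increase, average burst size, and a naive linear time-to-overrun.

def growth_metrics(samples, interval_s, limit):
    """samples: route counts, oldest first, one per poll interval."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    bursts = [d for d in deltas if d > 0]
    avg_rate = (samples[-1] - samples[0]) / (len(deltas) * interval_s)
    max_rate = max(deltas) / interval_s
    avg_burst = sum(bursts) / len(bursts) if bursts else 0
    headroom = limit - samples[-1]
    eta_s = headroom / avg_rate if avg_rate > 0 else float("inf")
    return avg_rate, max_rate, avg_burst, eta_s

samples = [400_000, 401_200, 402_100, 404_900, 405_800]  # 5-min polls
avg, peak, burst, eta = growth_metrics(samples, 300, limit=450_000)
print("avg %.2f r/s, max %.2f r/s, avg burst %d, overrun in ~%.1fh"
      % (avg, peak, burst, eta / 3600))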
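
And for 3, a hypothetical sketch of the reaction loop; the X-hours
and Y-percent knobs, the 10% bump, and the ticket hook are all
invented for illustration:

# Sketch of point 3: when syslog/SNMP says a peer is nearing its
# prefix limit, auto-raise the threshold only if it has been quiet
# for X hours and the new limit stays under Y percent of a full
# table; either way, note it in the NOC ticket system.

import time

FULL_TABLE = 900_000      # assumed "full table" size
MAX_PCT = 80              # Y: never auto-raise past this % of full
QUIET_HOURS = 4           # X: require this much quiet before raising
BUMP = 1.10               # raise limits by 10% at a time

last_raise = {}           # peer -> unix time of the last auto-raise

def open_ticket(peer, old, new):
    # stand-in for the NOC ticket system (historical preservation)
    print("TICKET: %s limit %d -> %d" % (peer, old, new))

def on_limit_warning(peer, current_limit, now=None):
    """React to a 'peer nearing prefix limit' syslog/trap."""
    now = time.time() if now is None else now
    quiet = now - last_raise.get(peer, 0) >= QUIET_HOURS * 3600
    new_limit = int(current_limit * BUMP)
    if quiet and new_limit <= FULL_TABLE * MAX_PCT // 100:
        last_raise[peer] = now
        open_ticket(peer, current_limit, new_limit)
        return new_limit                      # push this back to the router
    open_ticket(peer, current_limit, current_limit)   # no change; escalate
    return current_limit

on_limit_warning("peerA", 500_000)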