
On Sat, Feb 25, 2012 at 12:20 PM, <Valdis.Kletnieks@vt.edu> wrote:
> On Fri, 24 Feb 2012 21:39:37 EST, Christopher Morrow said:
>> The knobs available are sort of harsh all the way around though today :(
> So what would be a good knob if it was available? I've seen about forty-leven people say the current knobs suck, but no real proposals of "what would really rock is if we could...."
I'm not sure... here are a few ideas, though, to toss on the fire of thought:

1) break the process up inside the router, providing another set of
   places to block and tackle the problem
2) provide better metrics on the problem for operations staff
3) automate the problem 'better' (inside a set of sane boundaries)

For 1, I want to be assured that inbound data from one bgp peer will
not cause problems for all other peers on the same device. Keep the
parsing, memory and cpu management separate from the main routing
management inside the router, and provide controls on these points
configurable at a per-peer level. That way you could limit things like:

- each peer able to take a maximum amount of RAM, start discarding
  routes over that limit, alarm at a configurable percentage of the
  limit (rough sketch of the accounting below the sig)
- each peer able to consume only a set percentage of CPU resources;
  better would be the ability to pin bgp peer usage to a particular
  CPU (or set of CPUs) and other route processing to another
  CPU/set-of-CPUs
- interfaces between the bgp speaker, receiver, ingest and databases
  could all be standardized, simple and auditable as well

If a peer sent a malformed update, only that peering session would
die; if parsing the update caused a meltdown, again only the single
peer would be affected. The interface between the code speaking to
the peer and the RIB could be more robust and more resilient to
errors.

For 2, I think having more data available about avg rate of increase,
max rate of increase, average burst size and predicted time to
overrun would be helpful. Most of this one could gather with some
smart SNMP tricks, I suspect... on the other hand, just reacting to
the syslog messages in a timely fashion works :) (sketch of the
overrun math below as well)

For 3, automate the reaction to syslog/snmp messages, increasing the
thresholds if there hasn't been an increase in the last X hours and
the limit is not already above Y percent of a full table (and send a
note to the NOC ticket system for historical preservation). (third
sketch below)

These too have flaws... I'm not sure there's a good answer to this
though :(

-chris
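
None of this would be router code, obviously; it's just a rough
Python sketch to make the per-peer accounting idea in 1 concrete.
All the names and numbers here are made up:

# Rough sketch of per-peer resource limits: each peer gets its own
# route budget, overruns are discarded instead of pressuring the
# shared RIB, and an alarm fires once at a configurable percentage.
# All names and limits are hypothetical.

class PeerBudget:
    def __init__(self, peer, max_routes, alarm_pct=80):
        self.peer = peer
        self.max_routes = max_routes                  # hard per-peer cap
        self.alarm_at = max_routes * alarm_pct // 100
        self.routes = set()
        self.alarmed = False

    def accept(self, prefix):
        """Return True if the route fits this peer's budget."""
        if len(self.routes) >= self.max_routes:
            return False                              # discard over the limit
        self.routes.add(prefix)
        if len(self.routes) >= self.alarm_at and not self.alarmed:
            self.alarmed = True                       # alarm once, not per-route
            print("ALARM: %s at %d/%d routes"
                  % (self.peer, len(self.routes), self.max_routes))
        return True

# one budget per peer, so a misbehaving peer only hurts itself
budgets = {p: PeerBudget(p, max_routes=10000) for p in ("peerA", "peerB")}
for i in range(12000):                                # peerA overruns, gets capped
    budgets["peerA"].accept("10.%d.%d.0/24" % (i // 256, i % 256))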
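
For the metrics in 2, "predicted time to overrun" is basically a
linear extrapolation over polled route counts; a sketch, where the
sample numbers, 5-minute interval and limit are all assumptions:

# Sketch of the point-2 metrics: given per-peer route counts polled
# at a fixed interval (SNMP or otherwise), derive avg/max rate of
# increase, average burst size, and a naive linear time-to-overrun.

def growth_metrics(samples, interval_s, limit):
    """samples: route counts, oldest first, one per poll interval."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    bursts = [d for d in deltas if d > 0]
    avg_rate = (samples[-1] - samples[0]) / (len(deltas) * interval_s)
    max_rate = max(deltas) / interval_s
    avg_burst = sum(bursts) / len(bursts) if bursts else 0
    headroom = limit - samples[-1]
    eta_s = headroom / avg_rate if avg_rate > 0 else float("inf")
    return avg_rate, max_rate, avg_burst, eta_s

samples = [400_000, 401_200, 402_100, 404_900, 405_800]  # 5-min polls
avg, peak, burst, eta = growth_metrics(samples, 300, limit=450_000)
print("avg %.2f r/s, max %.2f r/s, avg burst %d, overrun in ~%.1fh"
      % (avg, peak, burst, eta / 3600))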
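
And for 3, a hypothetical sketch of the reaction loop; the X-hours
and Y-percent knobs, the 10% bump, and the ticket hook are all
invented for illustration:

# Sketch of point 3: when syslog/SNMP says a peer is nearing its
# prefix limit, auto-raise the threshold only if it has been quiet
# for X hours and the new limit stays under Y percent of a full
# table; either way, note it in the NOC ticket system.

import time

FULL_TABLE = 900_000      # assumed "full table" size
MAX_PCT = 80              # Y: never auto-raise past this % of full
QUIET_HOURS = 4           # X: require this much quiet before raising
BUMP = 1.10               # raise limits by 10% at a time

last_raise = {}           # peer -> unix time of the last auto-raise

def open_ticket(peer, old, new):
    # stand-in for the NOC ticket system (historical preservation)
    print("TICKET: %s limit %d -> %d" % (peer, old, new))

def on_limit_warning(peer, current_limit, now=None):
    """React to a 'peer nearing prefix limit' syslog/trap."""
    now = time.time() if now is None else now
    quiet = now - last_raise.get(peer, 0) >= QUIET_HOURS * 3600
    new_limit = int(current_limit * BUMP)
    if quiet and new_limit <= FULL_TABLE * MAX_PCT // 100:
        last_raise[peer] = now
        open_ticket(peer, current_limit, new_limit)
        return new_limit                      # push this back to the router
    open_ticket(peer, current_limit, current_limit)   # no change; escalate
    return current_limit

on_limit_warning("peerA", 500_000)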