> Depending on what failure cases you actually see from your peers in the wild, I can see (at least as a thought experiment), a two-bucket solution - "transit" and "everyone else". (Excluding downstream customers, who you obviously hold some responsibility for the hygiene of.)
Although I didn't say it clearly, that's exactly what we do. The described 'bucket' logic is only applied to the 'everyone else' pile; our transit stuff gets its own special care and feeding.

> How often do folks see a failure case that's "deaggregated something and announced you 1000 /24s, rather than the expected/configured 100 max", vs "fat-fingered being a transit provider, and announced you the global table"?
I can count on one hand the number of times I can remember a peer going on a deagg party and running over limits. Maybe twice in the last 8 years? It's possible it's happened more often without my being aware of it. We have additional protections in place for that second scenario: if a generic peer tries to send us a route with a transit provider in the AS path, we just toss the route on the floor. That protection has been much more useful than prefix limits, IMO.

On Wed, Aug 18, 2021 at 11:37 AM tim@pelican.org <tim@pelican.org> wrote:
> On Wednesday, 18 August, 2021 14:21, "Tom Beecher" <beecher@beecher.cc> said:
>
> > We created 5 or 6 different buckets of limit values (for v4 and v6, of course). Depending on what you have published in PeeringDB (or told us directly what to expect), you're placed in a bucket that gives you a decent amount of headroom to that bucket's max. If your ASN reaches 90% of your limit, our ops folks just move you up to the next bucket. If you start to get up there in the last bucket, then we'll take a manual look and decide what is appropriate. This covers well over 95% of our non-transit sessions, and has dramatically reduced the volume of tickets and changes our ops team has had to sort through.
>
> Depending on what failure cases you actually see from your peers in the wild, I can see (at least as a thought experiment), a two-bucket solution - "transit" and "everyone else". (Excluding downstream customers, who you obviously hold some responsibility for the hygiene of.)
>
> How often do folks see a failure case that's "deaggregated something and announced you 1000 /24s, rather than the expected/configured 100 max", vs "fat-fingered being a transit provider, and announced you the global table"?
>
> My gut says it's the latter case that breaks things and you need to make damn sure doesn't happen. Curious to hear others' experience.
>
> Thanks, Tim.
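For anyone curious, the "toss routes with a transit provider in the AS path" protection mentioned above can be expressed as a simple path check. This is a minimal Python sketch, not anyone's production policy; the transit ASN list and the `accept_route` / `peer_is_transit` names are illustrative assumptions:

```python
# Sketch: reject routes from non-transit peers whose AS path traverses a
# known transit provider. The ASN set here is just an example list.
TRANSIT_ASNS = {174, 1299, 3356, 6939}

def accept_route(as_path, peer_is_transit=False):
    """Return True if the route should be kept.

    as_path: list of ASNs from the BGP UPDATE, leftmost = neighbor.
    A generic (non-transit) peer announcing a path that goes through a
    transit provider is almost certainly leaking, so drop the route.
    """
    if peer_is_transit:
        return True  # full table expected; no path screening here
    return not any(asn in TRANSIT_ASNS for asn in as_path)
```

So `accept_route([64500, 64501])` keeps a normal peer route, while `accept_route([64500, 3356, 64501])` drops it as a leak. In real deployments this would be an as-path filter in router policy rather than application code.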
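The bucket-escalation scheme described in the quoted message could be sketched as below. The bucket sizes are invented for illustration; only the 90% trigger and the "manual look at the last bucket" behavior come from the description:

```python
# Sketch of the prefix-limit bucket logic. Limits are made-up examples,
# not anyone's real values; there would be a parallel table for v6.
BUCKETS_V4 = [100, 500, 2000, 10000, 50000]  # max-prefix per bucket

def next_action(current_prefixes, bucket_index, buckets=BUCKETS_V4):
    """Decide what ops should do as a peer's announcement count grows.

    Returns ("stay" | "bump" | "manual-review", resulting bucket index).
    """
    limit = buckets[bucket_index]
    if current_prefixes < 0.9 * limit:
        return ("stay", bucket_index)          # still has headroom
    if bucket_index + 1 < len(buckets):
        return ("bump", bucket_index + 1)      # move up to the next bucket
    return ("manual-review", bucket_index)     # last bucket: human decides
```

The appeal of the scheme is that the common case (a peer slowly outgrowing its limit) becomes a mechanical bump rather than a ticket requiring judgment.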