On Fri, 13 Sep 2002, Iljitsch van Beijnum wrote:
On Fri, 13 Sep 2002, Stephen J. Wilcox wrote:
At what point does one build redundancy into the network?
No, it doesn't necessarily use IXes. If there is no peered path across an IX, traffic flows from the originator to their upstream "tier 1" over a private transit link; that tier 1 peers with the destination's upstream tier 1 over a private fat pipe, and from there the traffic reaches the destination over their own private transit link.
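To make that fallback concrete, here is a minimal Python sketch of the preference logic involved (the names, local-preference values and data structure are invented for illustration, not anyone's actual policy): routes learned over peering are preferred while the session is up, and when the IX path disappears the route via the transit/tier-1 chain takes over.

    # Minimal sketch of peering-preferred path selection with transit fallback.
    # Names and local-preference values are illustrative, not real policy.

    PEER_PREF = 200     # routes learned over IX / private peering
    TRANSIT_PREF = 100  # routes learned from the upstream "tier 1"

    def best_path(paths):
        """Pick the available path with the highest local preference."""
        available = [p for p in paths if p["up"]]
        return max(available, key=lambda p: p["local_pref"]) if available else None

    paths_to_dest = [
        {"via": "IX peering with DestISP", "local_pref": PEER_PREF, "up": True},
        {"via": "transit -> Tier1A -> Tier1B -> DestISP", "local_pref": TRANSIT_PREF, "up": True},
    ]

    print(best_path(paths_to_dest)["via"])  # peering wins while the IX path is up

    paths_to_dest[0]["up"] = False          # IX outage: peering session drops
    print(best_path(paths_to_dest)["via"])  # traffic reroutes over the transit chain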
But will these links have enough spare capacity so congestion doesn't happen?
Well, the policy among major ISPs tends to be around 50% maximum utilisation per circuit, so they should have capacity to reroute. You're most likely to hit issues on the local ISP's transit connection, which is unlikely to have the capacity to absorb a large amount of their peered traffic, although medium-sized ISPs can probably reroute a large amount to another IXP anyway.
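As a back-of-the-envelope illustration of that 50% rule (the circuit sizes and traffic figures below are invented), a circuit run at half its capacity can absorb roughly its own normal load again before congesting, while a transit pipe that is already running hot cannot take on much rerouted peering traffic:

    # Back-of-the-envelope headroom check under the ~50% utilisation rule.
    # Capacities and traffic volumes are invented for illustration.

    def congested(capacity_gbps, normal_load_gbps, rerouted_gbps):
        """True if normal traffic plus rerouted peering traffic exceeds capacity."""
        return normal_load_gbps + rerouted_gbps > capacity_gbps

    # Major ISP backbone circuit kept at ~50% utilisation:
    print(congested(capacity_gbps=10, normal_load_gbps=5, rerouted_gbps=4))    # False: fits in the headroom

    # Small ISP transit link already running hot, with a lot of peered traffic to shift:
    print(congested(capacity_gbps=1, normal_load_gbps=0.7, rerouted_gbps=0.8)) # True: congestion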
I'm only aware of a few providers who transit across IXes, and I think the consensus is that it's a bad thing, so it tends to be just small players for whom the cost of a private link is relatively high.
I apologize in advance for naming names here, but I think it is important for making my point.
A while back (I think last year, but I'm not sure) the AMS-IX had a huge outage because the power failed in two of the main locations. One of the locations didn't at that time have battery- or generator-backed power (although it used three diversely routed feeds from the power company), and the other location only had batteries, which didn't last long.
Nearly everything was still reachable over transit rather than peering, with only minor congestion. However, some networks got their transit in the same buildings where they connect to the AMS-IX, so both their peering and their transit were gone and they were unreachable. If you think this was only true for small networks: think again. Surfnet suffered the same problem. Surfnet is one of the largest (if not _the_ largest) Dutch networks, connecting all the universities in the country at multi-gigabit speeds. However, they only connected to other networks in a single building at that time. I don't know if this is still the case.
Yes, there is a large amount of that happening in London, where I'm more familiar with individual ISPs' networks. They tend to exist in one or two locations and pass traffic through a single location because of economies in bandwidth scaling, although I don't know of any medium or large ones like that. I personally have always maintained multiple sites with sufficient capacity to handle the failure of another site since day one, but perhaps I was lucky enough to be able to draw on a company with enough cash to be willing to do that. I regularly (every month or two) see something major happen at a site, and on the whole things continue working just fine around it! Steve
Now this is only one big network and a few small ones that suffered. However, things could have been much worse for people in the rest of the Netherlands, because even with all the rerouting going on, almost all traffic still flowed through Amsterdam. So any outage in Amsterdam that takes down more than a single building would cripple the majority of Dutch networks. Obviously, something like this doesn't happen all the time, but luck has a tendency to run out from time to time. A plane crash (a 747 went down in an Amsterdam suburb 10 years ago) or a good-sized flood (lots of stuff is below sea level in NL) will do it.
I suspect the catch would be that, in the event of major switching nodes being taken out, there would be considerable congestion on the transit links, and most likely on the tier 1s' private peering as well.
I'm more worried about long distance fiber running through rural areas. Much more bang for your backhoe renting buck.
I'm not sure I'd call it a "poor job" for not planning for all possible failure modes, or for not having links in place for them.
Well, the trouble is that in the real world we can't have the budgets we'd like to implement our plans, so we end up compromising.. there's the catch.
I don't think it's just a matter of money. In 1999, I helped roll out a completely new network. EVERYTHING in it, except the ports customers connect to, had a backup. Management originally wanted to connect every location to at least three others. (We got this requirement dropped because it essentially means you're buying a third circuit that doesn't do anything useful until the other two are down; traffic engineering for both regular operation and all the different failure modes is too complex.) Still, I couldn't convince them to move the second transit connection to another city where both our network and the transit network were also present in the same building.
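As a toy illustration of that planning burden (the counting here is my own, not part of the original design discussion): every extra circuit per site multiplies the number of partial-failure scenarios you have to capacity-plan for.

    # Toy count of the "some circuits down, site still up" scenarios to traffic-engineer per site.
    from itertools import combinations

    def failure_scenarios(num_circuits):
        """All ways 1..n-1 circuits can be down while the site stays reachable."""
        circuits = range(num_circuits)
        scenarios = []
        for k in range(1, num_circuits):
            scenarios.extend(combinations(circuits, k))
        return scenarios

    print(len(failure_scenarios(2)))  # 2 scenarios per dual-homed site
    print(len(failure_scenarios(3)))  # 6 scenarios once a third circuit is added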
A year or so after I left, I was in the building where that entire network connects to its transit network over two independent routers at both ends, and the power went down and they couldn't get the generators online... Eventually utility power came back before the batteries were empty. All of this is on the ground floor, in a place that's below sea level, only a block or so from a river.