2008.02.18 NANOG 42 IX operator BOF notes
Notes from the IX operator's BOF--last set before I get beer and gear. :)
Apologies for the gap, I nodded off partway through, but hopefully didn't miss too much of the important content. ^_^;;

Matt

2008.02.18
IX operator's panel

Louie Lee from Equinix starts the ball rolling.
Welcome to the IXP BOF; anyone is welcome--it's mainly for exchanges, but all customers are welcome.

AGENDA:
 Greg Dendy, Equinix -- forklift upgrade learning process
 Niels Bakker, AMS-IX -- cabling challenges for large chassis
 Greg Dendy, Equinix -- L2 hygiene
 Jeff d'Ambly, Equinix -- BFD experiments; ring protocols out there, hear what we'd like to see
 Mike Hughes, LINX -- sFlow portal work

Hopefully this will fill the 90 minutes they have.
It will be discussion style--they don't want to just have people talk at you, they'll make it interactive.

Greg Dendy is first up.
Congrats to Louie for 8 years at Equinix.

Forklift upgrades at all the public exchange points for Equinix. Why?
 Increasing demand for 10 gig port capacity not supported on the current platform.
 Current feature set less stable and resilient than preferred.
 Current platform didn't support newer technologies like DWDM optics.
 Service continuity for all ports during the upgrade is paramount.

It's a long process. First steps, 6-24 months ahead of time:
 test available platforms that do or can support the required features
 work with vendors to develop platforms into IX production candidates; the IX will be visible, but not high-volume sales
 make the vendor decision--rather like jumping off a cliff

3-6 months before the upgrade:
 test and verify the platform software for launch; pick it and nail it down
 arrange for appropriate cage/cabinet/power within the datacenter; 150 amps of DC power--need to negotiate with facilities to get space/power
 order all needed parts: patch panels, patch cabling, new cross connects

1-3 months prior:
 receive production hardware, install, test in production space: hw components, ports, optics, etc.
 finalize the maintenance date (avoid conference dates!); pick a good time, 10pm to 4am--not optimal for Europe, but that's when the fabric's traffic low point is

0-1 month prior:
 finalize new pre-cabled patch panels, cross-connects, and platform configuration
 develop and document the maintenance procedure; may take 6-7 iterations of that process
 notify customers and field questions
 train the migration team; 10-12 people, from product managers down to rack-and-stack folks
 double-recheck all the elements one more time

Migration time!
 baseline snapshot--MAC table, ARP table, pingsweep, BGP state summary (see the sketch at the end of this section)
 interconnect new and old platforms with redundant trunks for continuity
 move connections port by port to the new platform, confirming each is up/up and works before moving on:
  CAM entry
  ICMP reachability
  Equinix peering session
  traffic resumes at expected levels

Patrick notes they have route collectors; the IP is .250 or .125, AS 65517.
The route collector is the only way they can troubleshoot routing issues that come up.

Old topology slide.
Migrated using a double ring to the new boxes.

Finishing the migration--final checks when the migration is complete:
 overall platform health: jitter node at each site
 traffic levels back to normal
 threshold monitoring for frame errors
 route collector peers are all up
 disable trunks to the old platform
 close the maintenance window--do a round robin with everyone involved, and notify customers
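(Aside: the "baseline snapshot" step above is just capturing fabric state before touching anything; a rough sketch of the kind of commands involved, in IOS-style syntax purely for illustration--the notes don't say which platform or tooling Equinix actually uses:)

    ! illustrative only -- capture state on the fabric and the route collector
    show mac address-table     ! CAM/MAC table baseline, per switch
    show ip arp                ! ARP table baseline
    show ip bgp summary        ! BGP session state on the route collector (AS 65517)
    ! plus a pingsweep of the peering LAN from a management host, e.g.
    !   fping -g <peering-LAN-prefix>
    ! diff the same outputs after each port move to confirm CAM/ARP/BGP come back.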
They don't stock old optics in case new platforms don't work with customers; usually optics issues crop up between customer and switch during turnups.
XFP, X2/XFP-plus tend to be problematic; LR to ZR, for example; something may happen to work on the old platform, and the new platform may not interact the same way.

Problem solved--plenty of 10G port capacity to meet near and long term demand.
The IX platform has increased resilience and stability with improved failover.
Access to new tech: DWDM XFP optics.

Lessons learned:
 Lots of testing; you can't really retest too much.
 Vendor participation/cooperation is paramount. If a vendor won't work with you, pick someone else.
 Teamwork is crucial.
 Communication among the team is vital as well.
 Prepare, prepare, then prepare yet again.

Question--is it better to move customers as a batch to shorten up the maintenance window, even though that increases the chance of simultaneous issues? They prefer to be cautious.

For testing, do they run full traffic through the platform first?
They test cross connects, yes. For the whole platform, they pushed as much traffic as they could through the box for a few days.
Generally not as much throughput testing; mostly digital lightwave testing on ports and fibers.
They focus not so much on throughput as on software stability.
They leverage the vendor's tests on the box as much as possible.

What about the security side?
They do lots of MAC filtering tests: make sure it learns correctly, make sure MAC learning and MAC security aren't done in parallel so that traffic leaks through, for example.

Q: Have there been issues that have cropped up, and do you test for anomalies, like undervoltage situations, etc., to make sure the gear is solid?
As much as is possible.

Q: Do they negotiate finders-fees for bugs which they send back to the vendor?
No, but they wish they could.

Niels Bakker, AMS-IX -- cabling challenges for large switches
"Stupid Fibers" for 10GbE connections.

Before we start: AMS-IX uses photonic switches with redundant ethernet switches, effectively tripling the amount of fiber needed:
 patch panel to PXC
 PXC to first ethernet switch
 PXC to second ethernet switch
Great for redundancy (an ethernet switch is lots more complicated than a photonic switch).

Bad old days: with 10GigE they didn't figure everyone would want it.
Standard LC fiber patches everywhere.
Rather vulnerable; they have to be put in place pretty much individually to have them line up correctly; difficult to get rid of slack.
Before and after photos; lots of fiber extenders.
Photonic cross connect switch with LC connectors.

Solutions: breakout cables
 bundles eight strands into one cable
 easy to install
 specify up front precisely what you want
 interlock snap
 a "wartelplaat" (gland plate) to hold bundles in place

Picture showing breakout bundles; saves a huge amount of time; sturdy, can run a whole bunch in one go, pre-tested by the supplier.
Interlock snap on the end of the cable.
32 ports with breakout.
144 port photonic cross connect switch.
MLX-32, top half--they actually did use a forklift to put it in.

Solutions (2): MPO cables
 12 strands in one high-density connector, 8 used
 work with the latest generation of Glimmerglass photonic switches
 find a cable that is not ribbon (so it can bend in all directions)
 still LC on the other side (Tx/Rx)
 MPO on the switch side would be nice too!
 different LR vs ER blades

MPO cables out of the Glimmerglass go in groups of 8; one set is all Rx, the other goes to all Tx.
MPO cables on the left; even bigger breakout cables, and mRJ-21 RX-8; 24 strands at the other side; the fiber after the tap is very small.
Yellow cables go to the photonic switch, with different Rx and Tx.

Where do they get the cool cables? What are the failure rates?
A few arrived broken, possibly during transport or installation; they just send them back under warranty. No cases of bundled cables going bad thus far. With the dual-switch situation, the backup path could be bad and might not show up until a switch toggle happens.
Why did you run Rx and Tx in different multicores? Out of necessity; the optical switch puts 8 Rx on one port and 8 Tx on a different port.

Thanks to Louie for starting the BOF off while Mike was still at the break.

Greg Dendy back up to the mic: Switch Fabric Health and Stability (or "L2 Hygiene")

Why?
 Loops are bad, mmkay.
 Broadcast storms
 Proxy ARP (argh!)
 Next-hop foo
 IGP leaks
 Other multi/broadcast traffic (I'm looking at you, CDP!)

Customer connections:
 single-port
 remote-ethernet
 LAG ports
 redundant ports
Remote ethernet handoffs are sometimes the scary ones.

How to prevent loops: a firm policy against them, plus switch policies:
 MAC ACLs on each interface
 MAC learning limit of 1
 Static MACs
 LACP for aggregated links--please use LACP; the heartbeat is the heart of it: if it goes away, links drop.
 Quarantine VLAN for things like:
  OSPF hellos
  IGMP/PIM
  MOP (Cisco 6500/7600)--old DECnet, turn it off!
  CDP
 Manual scans also.

sFlow reporting:
 IDs illegitimate traffic
 tools for peers to use to understand where peering traffic originates

Other measures:
 Jitter statistics--measure metro IX latency, available to IX participants
 Full disclosure to all IX participants:
  communication during planned and unplanned events
  full RFO investigation and reporting for all outages

Jeff d'Ambly now talks about BFD experiments.
Bidirectional Forwarding Detection--anyone running it on sessions privately? Anyone doing it over an exchange switch?
The IETF describes BFD as a protocol intended to detect faults in the bidirectional path between two forwarding engines.

Why?
 Ensures connectivity backplane to backplane--makes sure you really have connectivity.
 Independent of media, routing protocol, and data layer protocol.
 Detection of one-way link failures.
 No changes to existing protocols.
 Faster BGP fault detection; BGP timers remain at default values.
 An IXP can easily support BFD--there's nothing the IXP needs to do.

Failover times--when configuring BFD, be aware of the different failure times on your network:
 MSTP: 0.8 to 4 seconds
 Ring protocol: 200 to 500ms
 VPLS: less than 50ms
 Static LAG: less than 1ms (rehash)
 LACP: 1 to 30 seconds

Example topology in the lab: a 4-node ring.

Sample Cisco config:
 neighbor 192.168.100.1 fall-over bfd
 bfd interval 999 min_rx 999 multiplier 10
JunOS config:
 bfd-liveness-detection detection-time FOO
(A slightly fuller sketch follows at the end of this section.)

What holds operators back from BFD deployment?
Is there anything else IXPs can do to raise awareness and best practices around BFD for eBGP sessions?
Really, they just need to raise awareness; it would be good for IXPs to publish their best current practices for what timers work best for their exchange point.
What is a good timer for it? 10 seconds seems to be a fair compromise; it depends largely on the application.
Prior to this, people used PNIs to get fast failover; this could help reduce the need for PNIs.
One point is made that BFD controls how fast the session is torn down; it does not change how fast it tries to come back up.
There are two camps with BFD: one says it should run on the route processor, so you know the brain at the other end is alive; the other group says it should run on the linecard so as not to load down the main CPU.
The RFC talks about maintaining state, but that hasn't been implemented yet; and the lack of dampening in BFD can end up causing a cascading failure throughout the rest of the network.
You can't afford to have a flap in one area churn CPU to a degree that it affects other parts of the network.
Don't turn it on without thinking about the implications of this!
The IX operator can't replicate all the bits of our networks either, so it's good to have an open exchange of information.
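(Aside: a slightly fuller version of the sample configs above, as a minimal sketch only--the interface name, AS numbers, group name, and 300ms timers are illustrative, not a recommendation for any particular exchange. Detection time is interval x multiplier, so 300ms x 3 = 900ms here.)

    ! Cisco IOS sketch: BFD timers on the IX-facing interface, tied to the eBGP session
    interface TenGigabitEthernet1/1
     bfd interval 300 min_rx 300 multiplier 3
    !
    router bgp 64500
     neighbor 192.168.100.1 remote-as 64511
     neighbor 192.168.100.1 fall-over bfd

    # JunOS sketch: equivalent per-neighbor BFD under BGP
    protocols {
        bgp {
            group ixp-peers {
                neighbor 192.168.100.1 {
                    peer-as 64511;
                    bfd-liveness-detection {
                        minimum-interval 300;
                        multiplier 3;
                    }
                }
            }
        }
    }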
Mike Hughes, sFlow portal work at LINX
Mike is a huge carbon footprint?--these are recycled slides from LINX last week.

Were you flying blind? A big L2 network is like a bucket with traffic sloshing around inside.
 The IXP gets traffic engineering data.
 IX customers get flow data.

Datagram flow:
 the exchange -> sfacctd -> mysql <-> php/xml <-> ajax/php
Switches export sFlow at 1 in 2000 samples per port, ingress on each port; the sFlow rides a VLAN between all switches--not handled by the management port, but actually carried through the fabric.
sfacctd slices and dices the sFlow data.
(A minimal collector config sketch is appended at the end of these notes.)

Database layout:
 1 minute samples
 5 minute samples
 15 minute averages
Hardware: HP servers, 2x dual-core CPUs, 16GB RAM, 820GB RAID6 array (8 x 146GB disks); about 50GB of data per month.

Select ports on the web page, list your traffic, sort by column headers, etc.
It can display graphs of your traffic, etc.--a proper web application.

Coming up:
 add in processing of the Extreme LAN data
 authenticated direct XML interface for members
 sFlow observation based peering matrix (with opt-out ability)
 improved engineering tools:
  top-talkers reports
  switch-to-switch 'traffic matrix'
  per-ethertype, MoU-violating traffic, etc.
 proactive notification agent:
  be able to configure various thresholds, receive alerts
  weekly "overview report"

Would love to hear what kind of features you want to see on an sFlow portal--let them know!

Beer and Gear time!
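(Aside: for anyone curious what the collector end of a portal like this can look like, here's a minimal sfacctd configuration sketch, assuming pmacct's sfacctd; the aggregation keys, database names, and credentials below are illustrative guesses, not LINX's actual schema.)

    ! listen for sFlow from the fabric and write periodic aggregates into MySQL
    daemonize: true
    sfacctd_ip: 0.0.0.0
    sfacctd_port: 6343
    ! scale counters back up by the reported sampling rate (1:2000 in LINX's case)
    sfacctd_renormalize: true
    plugins: mysql
    aggregate: src_mac, dst_mac, etype, vlan
    sql_host: localhost
    sql_db: sflow
    sql_table: acct
    sql_user: sfacct
    sql_passwd: secret
    ! keep 5-minute buckets, along the lines of the portal's mid-tier samples
    sql_history: 5m
    sql_history_roundoff: m
    sql_refresh_time: 300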
On 18-Feb-2008, at 20:49, Matthew Petach wrote:
Notes from the IX operator's BOF--last set before I get beer and gear. :)
I perhaps should be saying this as a private note, but I'm sure I'm not the only person who couldn't attend the meeting this time who finds these notes fabulous and excellent. Many thanks, Matt, for doing this.

Joe