2009.10.19 NANOG47 Monday notes, second half

19 Oct 2009

Here are my notes from the second half of NANOG today.

Now off to Beer and Gear.  :)

Matt

2009.10.19 NANOG 47 Monday notes, part 2

Mike Hughes starts things off after lunch at
1436 hours Pacific time.

Few bits of administrivia still.
If you want to submit a lightning talk, you can do
it up until 7pm today.

Please vote for the committee members!

PC nominations close this evening as well;
if you'd like to be on it, do that as well,
as much help as possible is needed.

3 lightning talks next up.

First up is Ernest McCracken
http://netlabs.cs.memphis.edu

NetViews: real time visualization of
Internet Path Dynamics for Network Management

Started doing this as part of his undergrad work.

Goal was to help researchers visualize network paths.

Topology mapping projects typically try to represent
internet architectures:
Scatter, skitter, Rocket Fuel, CAIDA.

why graph in realtime?
monitor realtime reachability
spot anomalous depeering
identify route hijacking and misconfigurations

developing next-gen routing monitor system.
BGPMon -- realtime lightweight BGP monitor
  with over 70 peers--allows for fast updates
NetViews - visualizes both control plane paths
  (via BGP updates) and forwarding paths (via
  active probing)

BGPMon is running, you connect to it, get the
routing updates; data broker sends BGP updates.
Prober probes target network from BGP peers to
get path updates.
GeoCoder and IP crawler get geographic info,
and traceable IPs for probing.

Slide showing data pathway

They probe during routing events; a timeline
showing BGP updates during the timeline.  They
keep probing until they see no additional updates.

Visualization filters to show networks based on the
number of ASes an AS connects to.
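
(A rough sketch of that kind of filter, not NetViews code: it
assumes AS adjacencies are inferred from observed AS paths, and the
function names and private-use AS numbers are made up for
illustration.)

```python
from collections import defaultdict

def as_degrees(as_paths):
    """Count the distinct neighbour ASes of each AS seen in the paths."""
    neighbours = defaultdict(set)
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            if left != right:  # ignore AS-path prepending
                neighbours[left].add(right)
                neighbours[right].add(left)
    return {asn: len(peers) for asn, peers in neighbours.items()}

def filter_by_degree(as_paths, min_degree):
    """Keep only the ASes that connect to at least min_degree other ASes."""
    return {asn for asn, d in as_degrees(as_paths).items() if d >= min_degree}

# Hypothetical AS paths (documentation/private AS numbers).
paths = [
    [64500, 64501, 64502],
    [64500, 64501, 64503],
    [64500, 64504, 64504, 64503],
]
print(sorted(filter_by_degree(paths, 2)))  # [64500, 64501, 64503, 64504]
```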

You can see the updates scroll in realtime on the
live map as the updates come into the system.

Blue is path additions/changes, Red for changes.

They can also visualize forwarding paths, but
there are challenges in inferring forwarding paths
based on traceroutes.

Future work:
correlate forwarding and routing dynamics to create
 a classification model for internet paths
add scalability by having clients run traceroute jobs
 in a P2P fashion
Give client users the ability to communicate with each
 other

Funded by NSF, and collaborating with UCLA, ColoState
and UofO on BGPMon system.

Any questions?

Q: Dave Meyer--can it be run internally?  What
infrastructure do you need?
A: Server portal runs in lab, clients can run on
any Java client.
Synch up with him afterwards if you have any
summer internships available.

Next talk
Jim Cowie from Renesys
The recession and the routing table
Reading the tea leaves

They dig into the routing tables to see what's happening.

Tough times, tough questions
We know that internet transit purchases are sensitive
to business conditions (2000 crash)
 is the 2008-2009 recession affecting growth in
  the global/regional routing tables?

Should be some sign of pullback in the routing tables
like in 2000.

3 years of North American routing--it's still going up,
there's no depression visible.

Why did the table keep growing?
Enterprises don't cut costs by leaving the internet,
they cut costs by reducing diversity
cheap transit getting cheaper acts like "easy money"
prospect of v4 runout may result in "use it or lose
it" addition of routes into table.

Half the table is just hanging out with 1 provider.

Number of prefixes with 4 or more providers is going
up.

The 1 provider networks either go to no-longer-advertised
or shift to 2 or 3 providers.

More go to the "no-longer-seen" pool; fewer upgrade
to the next category up.
People postpone getting to multihoming.

Triple-homing seems to be sweet spot.

4 or more provider pool is getting larger
and more stable over time; you don't tend
to decrease over time.

Global recession might give more of a break
before v4 exhaustion
Cheap transit killed that theory
some evidence of single- and dual- homed customers
putting off the move to higher order multihoming
 in 2007 and 2008
"obviously practicing for IPv6 transition, after which
 apparently multihoming becomes unnecessary"
Otherwise, growth continues apace
Bring on the post-IPv4 marketplace!

Q: Randy Bush--BGP is a great data hiding system;
it doesn't tell you much about the real topology
of the internet.  How do you determine that a prefix
has a single upstream?
A: ask him afterwards.

Q: is this transit AS?
A: Yes
Q: You have to have seen the AS through another AS,
 that's how you can count the upstreams.
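
(A minimal sketch of that counting idea, not the Renesys method:
it assumes each route is a (prefix, AS path) pair with the origin AS
last, and treats the AS just before the origin as the upstream.  All
names and numbers below are hypothetical.)

```python
from collections import defaultdict

def upstreams_per_prefix(routes):
    """Count distinct ASes seen directly upstream of each prefix's origin."""
    seen = defaultdict(set)
    for prefix, path in routes:
        upstreams = seen[prefix]  # make sure every prefix gets an entry
        # Collapse prepending so repeated origin ASes are not miscounted.
        collapsed = [a for i, a in enumerate(path) if i == 0 or a != path[i - 1]]
        if len(collapsed) >= 2:
            upstreams.add(collapsed[-2])
    return {prefix: len(ups) for prefix, ups in seen.items()}

# Hypothetical views of two prefixes from two monitors.
routes = [
    ("192.0.2.0/24", [64496, 64500, 64510]),
    ("192.0.2.0/24", [64497, 64501, 64510, 64510]),
    ("198.51.100.0/24", [64496, 64500, 64511]),
    ("198.51.100.0/24", [64497, 64500, 64511]),
]
print(upstreams_per_prefix(routes))
```

As the question points out, this only counts upstreams that are
visible from the monitors; a provider nobody sees a path through
is not counted.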

Joe Abley up to the front from ICANN
DNSSec  for the Root Zone
Matt Larson, from VeriSign.

Info update for those who care about DNSSec
collaboration between ICANN and VeriSign with DoC

ICANN is IANA functions operator

Manages the Key Signing Key
Accepts DS records from TLD operators
Verifies and processes request
Sends update requests to DoC for authorization
 and to VeriSign for implementation

DoC NTIA
Authorizes changes to the root zone
 DS records
 Root key sets
 DNSSec updates

VeriSign
manages the zone signing key

Proposed Approach to protect the KSK
CPS--certificate practice statement
DPS, DNSSEC policy and practice statement
basically, to assure people the practices are
adequate to protect it.

community trust
proposal that community representatives have an active
role in management of the KSK
 as crypto officers needed to activate the KSK
 as backup key share holders protecting shares of the
  symmetric keys in case of disaster recovery

Auditing and Transparency
Third-party auditors check that ICANN...
webcast of sessions

KSK is 2048 bit RSA key
rolled every 2-5 years
RFC5011 for automatic key rollovers
propose using signatures based on SHA-256
 but there's no shipping code based on this

Zone signing Key (held by verisign)
ZSK is 1024-bit RSA
 rolled once a quarter
SHA-256 signature

Signature validity
RRSIG validity 15 days
 resign every 10 days
Other RRSIG validity 7 days
 resign twice a day
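
(Worked arithmetic for the numbers above, as a sketch: it assumes a
signature's validity starts when it is generated, with no inception
offset or jitter, which the slides did not spell out.)

```python
from datetime import timedelta

def min_remaining_validity(validity, resign_interval):
    """Worst-case lifetime left on a signature at the moment it is replaced."""
    return validity - resign_interval

# 15-day validity, re-signed every 10 days.
first = min_remaining_validity(timedelta(days=15), timedelta(days=10))
# 7-day validity, re-signed twice a day (every 12 hours).
other = min_remaining_validity(timedelta(days=7), timedelta(hours=12))

print(first)  # 5 days, 0:00:00
print(other)  # 6 days, 12:00:00
```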

Key Ceremonies
Key Generation
 Generation of new KSK
 Every 2-5 years
Processing of ZSK signing request (KSR)
 signing ZSK for the next upcoming quarter

Root Trust Anchor
published on a web site by ICANN as
XML-wrapped and plain DS record
 to facilitate automatic processing
PKCS#10 certificate signing request

Roll Out
incremental roll out of the signed root
 groups of root server "letters" at a time
watch the query profile to all root servers
 as roll out progresses
Listen to community feedback for any issues

No validation
Real keys will be replaced by dummy keys
 while rolling out the signed root
signatures not valid during roll out
actual keys will be published at end of rollout

Timeline
December 1, 2009
 root zone signed
  initially signed zone stays internal to ICANN and Verisign
Jan-July 2010
 incremental roll out of signed root
July 1, 2010
 KSK rolled out
 root trust anchor

ISP Security BOF later today will talk about it.

Full architectural documents around the process will
be published in the next few weeks.

Next speaker is Paul Francis, talking about
Virtual Aggregation.

Reducing FIB Size with Virtual Aggregation (VA)
ISPs often want to extend the life of old routers
Routers that have inadequate FIB but otherwise are
 still useful

A common approach--use old routers as customer PE,
 default to core

Other FIB/RIB shrinking tips

Filter out more specific routes

For lower-tier ISPs, default to transit ISPs
 ie use 0/0 and load balance among transit ISPs
BUT
 leads to non-optimal routes
 lots of configuration (peer routes, "important" routes
  like Google)
 Can't be used by transit ISPs themselves

Mitigating non-optimal default routes
Use more-specific "semi-defaults"
AS3303 Swisscom IP-Plus
 point 62/8, 80/7, 21/7, etc. to EU transit ISP
 ARIN space to US transit
 class B 128/3, 160/5, 168/6 to US transit
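
(A sketch of how such semi-defaults behave, using ordinary
longest-prefix match.  Only three of the prefixes above are used,
and the 0/0 entry and the next-hop labels are placeholders, not
Swisscom's configuration.)

```python
import ipaddress

# Hypothetical semi-default table: prefix -> transit label.
SEMI_DEFAULTS = {
    "0.0.0.0/0": "any-transit",
    "62.0.0.0/8": "eu-transit",
    "80.0.0.0/7": "eu-transit",
    "128.0.0.0/3": "us-transit",
}

def next_hop(address, table=SEMI_DEFAULTS):
    """Return the label of the longest matching prefix for an address."""
    addr = ipaddress.ip_address(address)
    matches = [ipaddress.ip_network(p) for p in table
               if addr in ipaddress.ip_network(p)]
    best = max(matches, key=lambda net: net.prefixlen)
    return table[str(best)]

print(next_hop("81.2.3.4"))   # eu-transit (inside 80/7)
print(next_hop("130.1.1.1"))  # us-transit (inside 128/3)
print(next_hop("8.8.8.8"))    # any-transit (falls through to 0/0)
```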

IETF working on a more general solution: virtual agg
GROW working group
draft-ietf-grow-va-00
-va-gre-00
-va-mpls-00
-va-perf-00

VA is a way to control FIB size in routers
 DFZ FIB, not VPN tables
 does not shrink RIB size
Tight control of FIB size for any or all routers
 no coordination between ISPs
 works with legacy routers

Important today--possibly critical tomorrow?
looking forward, BGP RIB growth rate could increase
 substantially
 exhaustion of v4 erodes aggregation
 because of pressure to shrink default prefix size
 uptake of v6
VA can help ease these pressures

VA not perfect
Requires configuration of its own
Entails a traffic load/FIB size tradeoff
 which can be quite good
 academic study on large transit ISP
  10x fib reduction with negligible latency/load
   penalty
But in general we don't know how easy to achieve
 this--
  configuration...

Why this talk?
You can help us define VA
 certain protocols or configuration details
 alternative ways to deploy
 or tell us that VA is useless

encourage your vendor to implement VA
 current implementations from Huawei and ??

VA Basic Idea
Define "Virtual Prefixes" (VP)
 These are shorter (bigger) than real prefixes
 think of /6s, /7s, /8s
Assign different routers to be "responsible" for
 different virtual prefixes
  ie, they need to know how to route everything in the VP

FIB-suppression
BGP runs as normal
 all routers have full RIB
 important to not muck with BGP operation per se
suppress updates to FIB for more specifics of
 virtual prefixes

APR (aggregation point router) for 22/8
originate route to 22/8 with nexthop being itself
it FIB-installs all sub-prefixes within 22/8

other routers FIB-suppress all prefixes within 22/8

This just tunnel-maps from one router to another
out to the egress point.
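
(A toy model of the FIB-suppression step, assuming a single virtual
prefix and a single APR.  The router names, next-hop labels and the
"tunnel:" notation are invented for illustration and are not from
the drafts.)

```python
import ipaddress

def build_fib(rib, virtual_prefix, apr_name, router_name):
    """Build one router's FIB from the full RIB under Virtual Aggregation."""
    vp = ipaddress.ip_network(virtual_prefix)
    if router_name == apr_name:
        # The APR FIB-installs every sub-prefix of the virtual prefix.
        return dict(rib)
    # Other routers FIB-suppress everything inside the virtual prefix...
    fib = {p: nh for p, nh in rib.items()
           if not ipaddress.ip_network(p).subnet_of(vp)}
    # ...and keep only the virtual prefix itself, tunnelled to the APR.
    fib[str(vp)] = "tunnel:" + apr_name
    return fib

# Every router still holds the full RIB; only the FIB shrinks.
rib = {
    "22.1.0.0/16": "peerA",
    "22.2.3.0/24": "peerB",
    "62.0.0.0/8": "peerC",
}
print(build_fib(rib, "22.0.0.0/8", "APR1", "PE1"))
print(build_fib(rib, "22.0.0.0/8", "APR1", "APR1"))
```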

The only router with the need to know how to route
that packet was APR1 (well, that, and the ingress
router)

The packet takes a bit of a longer path to do
this with simple aggregates.

You can add "popular prefixes" to routers to point
them along "better" paths.

Types of tunnels defined
MPLS (using LDP)
GRE
...

A deployment example
Robert Raszuk at Cisco
shows a POP site with 4 PE customer agg routers,
2 Rs, 2 RRs;
core can use tunnels between them already.

Use RRs as APRs -- can optionally
FIB-install routes for which PE is egress

If you do FIB suppression at the RR layer
Then need to install popular prefixes at the PE
 layer--GROW looking to automate that part.

VA from our point of view
Figure out where you need FIB reduction
Based on this, design your deployment
 select VPs
 assign routers as APRs, configure
 configure "VIP-list"

New IETF GROW WG work item for FIB suppression

Q: Patrick, Akamai--this seems very complex; couldn't
we just take prefixes out of the FIB that are covered
by a shorter prefix with same next-hop; wouldn't that
be much easier to do, and save FIB space?  Could we
maybe ask vendors to look into doing that?
A: Lixia may have done some looking into that; she
says that two people on her team, they found out
that you can compress your FIB between 10 and 50%
by simply suppressing more specifics with same
next hops.
She was going to give a talk at GROW at the next
meeting that would do this.
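
(A sketch of the simpler idea from the question, not Lixia's
team's algorithm: drop a prefix from the FIB when its nearest
covering prefix has the same next hop.  The table contents are
made up.)

```python
import ipaddress

def compress_fib(fib):
    """Drop more-specifics whose nearest covering prefix has the same next hop."""
    nets = {ipaddress.ip_network(p): nh for p, nh in fib.items()}
    kept = {}
    for net, nh in nets.items():
        parent_nh = None
        # Walk up from the next-shorter length to find the nearest cover.
        for plen in range(net.prefixlen - 1, -1, -1):
            candidate = net.supernet(new_prefix=plen)
            if candidate in nets:
                parent_nh = nets[candidate]
                break
        if parent_nh != nh:
            kept[str(net)] = nh
    return kept

fib = {
    "10.0.0.0/8": "A",
    "10.1.0.0/16": "A",      # same next hop as 10/8: redundant
    "10.1.2.0/24": "B",      # differs from 10.1/16: must stay
    "10.1.2.128/25": "B",    # same next hop as 10.1.2/24: redundant
    "10.2.0.0/16": "C",
}
print(compress_fib(fib))
```

Forwarding is unchanged because any address that used to match a
dropped prefix now matches a shorter one with the same next hop.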

Q: Doni from PeakWeb, was asking vendors for this
around the 200,000 routes in the FIB; the vendors
were wanting to simply sell more hardware.
Which routers need the full FIB in the drawing?
A: None of them need full routes.  Generally got
about 10X saving in all of them.

Q: Owen deLong--if you already have all the
routers everywhere, it might make sense; if you
have just 2 routers in a POP, this looks like
a distributed CAM load, to have multiple routers
pretend to look like one router
A: yes, it's like that.

Q: RAS--remember the 8k Foundry boxes?  They had 8k
CAM table, and their solution was to either have
just default, or break it up into /12s; this is
similar, it just limits based on number of next-hops
they have.  Could we get benefits from doing more
simple aggregation like that?

Q: Igor notes you can probably just upgrade for
cheaper than transferring all sorts of routes
back and forth and paying for additional interconnect
ports.

Q: Anton Kapella, have they considered looking at
Auto-TE QoS stuff internally?

If packets are being redirected around internally,
it does mean something for link-loading; how will
this interact with QoS, since this will transport
packets along links not originally planned for it?
A: In what they saw, very few packets used the
non-optimal paths.

Next up.
We'll do coffee break at 1615, BOFs at 1645

BGP# - a system for dynamic route control
in data centers

tenants and landlords
one landlord
 owner and manager of the datacenter
many tenants
 internal users
  search, email, gaming
 external users
 utility computing customers
empower tenants to control routing decisions

Routing tensions
tenants have different goals

tenant goal--spread traffic
or migrate traffic from one server to another

current system, tenant submits tickets to get routing
changed.
whole ticket flow is shown

Tenants have limited control over routing

A better system
allow for automated route control
allow tenants independent and safe route control
ensure scalability
allow for maintenance changes

BGP#
simple speakers (multispeaker)
 peer with BGP routers
 send route announcements/withdrawals (ECMP capable)
Stateful controller (controller)
 controls and coordinates speakers
custom API ("applications")

Application runs on tenant box; speaks to
controller via API; controller speaks to
multispeaker which peers with router to
send the update

to spread traffic, similar thing;
application uses API to ask controller which
asks multispeaker; it has 2 sessions to router, with
2 next hops for prefix.

Automated route control
controller API allows for custom applications
Application can automatically manage routes

Independent and safe route control
only allow a tenant to change their own prefixes.
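
(A sketch of that ownership check, assuming the controller keeps a
per-tenant list of allocated blocks.  The tenant names come from the
slides, but the table, prefixes and function name are hypothetical
and not the BGP# API.)

```python
import ipaddress

# Hypothetical landlord-maintained allocations (documentation prefixes).
TENANT_ALLOCATIONS = {
    "search": [ipaddress.ip_network("198.51.100.0/24")],
    "email": [ipaddress.ip_network("203.0.113.0/24")],
}

def may_change(tenant, prefix, allocations=TENANT_ALLOCATIONS):
    """Allow a route change only for prefixes inside the tenant's own blocks."""
    net = ipaddress.ip_network(prefix)
    return any(
        net.version == block.version and net.subnet_of(block)
        for block in allocations.get(tenant, [])
    )

print(may_change("search", "198.51.100.128/25"))  # True
print(may_change("search", "203.0.113.0/24"))     # False
```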

Scalability
Multispeaker and controller not placed in machines
handling user traffic
 eliminates need for one policy controller per machine
 reduces peering sessions to router
eliminate per-ticket manual intervention

Resiliency
ensure system continues operating
instantiate multiple multispeakers
 single multispeaker failure doesn't affect other MS ability
 separate multispeaker and controller
prefix resiliency -- ensure prefix stays available
 announce same prefix from multiple multispeakers
  router retains prefix even if one MS fails

Automation service could deploy a new multispeaker
with same config if one dies.

No inconsistency with multiple Multispeakers
suppose some multispeakers become unresponsive
 BGP# listening tool detects the lack of router
 readvertisement
suppose multispeaker reboots and is in different state?
 get config and state from persistent store

Alternate approach
each tenant sets up its own BGP instances
 needs one session per machine
landlord may need to deal with many BGP peers

Conclusions:
Tenants have more power
Landlord retains responsibility for validation of routes.
system achieves stability and resiliency

Q: Francis asks if BGP isn't an awfully coarse-grained
tool to use for something like this--what about
using MPLS for setting up flows?
A: BGP finite state machine is much simpler to follow
and update.

We'll go into coffee break now; BOFs start at 1645
hours Eastern time.

SC elections, JUST DO IT!!
PC nominations open for 3 more hours!

1800 hours in Regency for Beer and Gear.

BOFs, Mobile Data Track, ISP Security BOF,
and DNS BOF will be upstairs in DeSoto room.

Tuesday we start at 0830 again with breakfast.

For now, I head over to the DNS thingie BoF

IPv6 and resolvers; how do we make it less
painful?
For most people, rolling out IPv6 can't break
IPv4, and separate hostnames isn't scalable
long term.

Per Google, 0.078% of users are impacted by
enabling quad A on machines.
Assuming a user base of 600M that's 470K users that
get broken, which isn't acceptable.
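
(The arithmetic behind that figure, using only the two numbers
quoted above; the function name is just for illustration.)

```python
def affected_users(percent_broken, user_base):
    """Users expected to break, given a breakage percentage and a user base."""
    return round(percent_broken / 100 * user_base)

print(affected_users(0.078, 600_000_000))  # 468000, i.e. roughly 470K
```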

Right now, in browsers, IPv4 fallback is on the
order of 21 seconds to 181 seconds, which is for
most SLA numbers considered "broken"

Options?
Don't roll out IPv6
prefer A over AAAA
accept the breakage
what about checking for working IPv6 connectivity
 before sending back AAAA record.

Only way to know if user has working v6 transport
is if the AAAA request came via IPv6 instead of IPv4

Recursive servers need to be set to return
AAAA *only* when the request came in via IPv6; otherwise,
return A record only.

Now, auth DNS server only has to worry about IPv6
reachability to the recursive server.

We've asked if ISC can write this; ISC has done
this, it'll be in BIND 9.7; it'll be in a second
beta coming out in early November; if you want to
check it out, if you're on user list or beta list,
you'll get notification; otherwise, check ISC
site in early november and it should be there.

Feature will be a knob you have to turn on.

There's an additional check put in; if DNSSEC
is set, it won't forge DNS answers unless there's
a knob set "BREAKDNSSEC" that you can turn on,
the knob is going to be very well documented.
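
(A sketch of the decision logic as described in the BOF, not BIND
code.  The parameter names are invented here and are not the real
option names, and records are modelled as plain (type, value)
tuples.)

```python
import socket

def filter_answer(records, query_family, dnssec_requested=False,
                  filter_enabled=True, break_dnssec=False):
    """Strip AAAA records from answers to queries that arrived over IPv4."""
    if not filter_enabled or query_family == socket.AF_INET6:
        return list(records)   # knob off, or client reached us over IPv6
    if dnssec_requested and not break_dnssec:
        return list(records)   # don't alter answers for DNSSEC clients
    return [r for r in records if r[0] != "AAAA"]

answer = [("A", "192.0.2.10"), ("AAAA", "2001:db8::10")]
print(filter_answer(answer, socket.AF_INET))    # A only
print(filter_answer(answer, socket.AF_INET6))   # A and AAAA
print(filter_answer(answer, socket.AF_INET, dnssec_requested=True))  # untouched
```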

But if you've gone through the work of setting
up DNSSEC, you should know how to troubleshoot
it yourself.

This should be set up for resolvers facing
customers, not for internal services that have
odd configurations.

What about having an ACL for controlling
behaviour for different subnets?
If they fit in a view statement, you already
have that capability.

Will this be available within a view?  Yes,
you can do it there.

But the ACL idea is interesting, and could be
better than pushing people towards views.

This really goes on the recursive side.
We need to try to convince ISPs to use these
options.

If a request comes from a 6to4 address, that is,
the source is a 6to4 address, do you respond with
AAAA or not?

How about ACLs with flexibility to see if it's
over v4 but from a 6to4 address to send different
responses.

A simple default policy is good, but the flexibility
is good for more experienced sites.

Simple on-or-off knob is in 9.7b2; more granular
control will be needed for later versions...

what about 6rd?  They would get no AAAA results
in that scenario.  We might need a DNSv6 option
for DHCPv4 which would be able to give back
the v6 DNS servers.

Should we put together an information draft
for IETF; we can draft one up, so the three
of them should talk;
Igor Gashinsky, Yahoo,
Larissa Shapiro, ISC,
Alain Durand, Comcast

OARC meeting, Beijing, Jason Fesler will be there
to talk about it.

ISOC meeting in Paris next week, we'll be there
to talk about it as well.

Internet2 joint techs talked about it as well.

General consensus is that this is a necessary
evil hack.
If we can get it working with 6rd, it'll be an
interesting working solution to a common problem.

What about shoving JavaScript code into web pages
to report DNS lookups back via REST infrastructure
to get an idea of the types of breakage?
OS, browser, IP, and which test cases break or not.
We could do a series of test queries, and see which
ones break or not.
Give A
Give AAAA
Result of the query comes into the beacon server, so
we can see if they saw the reply or not.
It would be better for the javascript to reply back
with what *it* saw, as well as see what the server
log sees.

There are some JavaScript coding challenges with
collecting this data.

If we can at least break it down to 3 buckets
6to4
teredo
native
it would really help pin down where the breakage is.
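
(A sketch of that three-way bucketing for client IPv6 addresses,
using the well-known 6to4 (2002::/16) and Teredo (2001::/32)
ranges via Python's ipaddress module.  "native" here just means
"neither of the other two", which is an assumption.)

```python
import ipaddress

def v6_bucket(address):
    """Classify an IPv6 address as 6to4, teredo, or (by elimination) native."""
    addr = ipaddress.IPv6Address(address)
    if addr.sixtofour is not None:   # inside 2002::/16
        return "6to4"
    if addr.teredo is not None:      # inside 2001::/32
        return "teredo"
    return "native"

print(v6_bucket("2002:c000:204::1"))                      # 6to4
print(v6_bucket("2001:0:4136:e378:8000:63bf:3fff:fdd2"))  # teredo
print(v6_bucket("2001:db8::1"))                           # native
```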

Do note that the percentage shown wasn't Yahoo
data, that was Google's data, so we don't have
that breakdown ourselves.

Would going to AS level be too specific for people?
We'd need to consider carefully privacy issues
and anonymizing the data as per our privacy
rules.

What about running an experiment in partnerships
for specific ASNs?

Point is, this is coming out, do share the data,
this will be going into mainline code release.
It's opt-in, defaulted off.

The *actual* names for the config options are...

This will apply to just the RR set, not the
glue set; if glue returns AAAA, it'll still
come back intact.

The tests are really testing recursive lookup
server to the last proxy device in front of
the client.

But what if the recursive server to auth server
breakage happens?

On the recursive server lookup side, can we
turn the knob on in the other direction?

This is an interesting challenge; we'll have to
see how much additional work this will need, and
how much additional funding will be needed to
cover for it.

ISC will...look into the feasibility of doing that.

The IETF draft cutoff is tonight for Paris, so
maybe it'll be done for the Anaheim one, at which
point we'll have working code out there, and a bit
more time for writing the draft.

We wrap up the BOF at 1724 hours Pacific time.

(what about a switch for auth servers that allows
for turning off "don't send AAAA records to ZZZZZ"?)

Matthew Petach
