Here's my notes from this morning's sessions. :)  Off to lunch now!  Matt

2009.10.20 NANOG day 2 notes, first half

Dave Meyer kicks things off at 0934 hours Eastern time.
Survey! Fill it out! http://tinyurl.com/nanog47

Cathy Aaronson will start off with a remembrance of Abha Ahuja.
She mentored, chaired working groups, and helped found the net-grrls group; she was always in motion, always writing software to help other people.
She always had a smile, always had lots to share with people.
If you buy a tee shirt, Cathy will match the donation.

John Curran is up next, chairman of ARIN.
Thanks to the NANOG SC and Merit for the joint meeting; add your operator perspective!
Vote today in the NRO number council election! You can vote with your NANOG registration email.
https://www.arin.net/app/election
Join us tonight for the open policy hour (this room) and happy hour (rotunda).
Participate in tomorrow's IPv6 panel discussion and the rest of the ARIN meeting.
You can also talk to the people at the election help desk.
During the open policy hour, they'll discuss the policies currently on the table. And please join in the IPv6 panel tomorrow!
If you can, stay for the ARIN meeting, running through Friday. Topics include:
  policy for allocation of ASN blocks to RIRs
  allocation of IPv4 blocks to RIRs
  open access to IPv6 (make barriers even lower)
  IPv6 multiple discrete networks (if you have non-connected network nodes)
  equitable IPv4 run-out (what happens as the free pool gets smaller and smaller!)
Tomorrow's joint NANOG panel:
  IPv6--emerging success stories
  Whois RESTful web service
  Lame DNS testing
  Use of ARIN templates--consultation process ongoing now; do we want to maintain email-based access for all template types?

Greg Hankins is up next with a 40GbE and 100GbE standards update--IEEE P802.3ba.
Lots of activity to finalize the new standards specs; many changes in 2006-2008 as the objectives were first developed.
After draft 1.0 there was less news to report, as the task force started comment resolution and began work towards the final standard.
Finished draft 2.2 in August; now dotting i's and crossing t's.
Working towards sponsor ballot and draft 3.0; on schedule for delivery in June 2010.
The copper interface moved from 10 meters to 7 meters.
100m on multimode; added 125m on OM4 fiber, a slightly better grade.
CFP is the module people are working towards as a standard.
Timeline slide shows the draft milestones that IEEE must meet.
It's actually hard to get hardware out the door based around standards definitions: if you do silicon development and jump in too fast, the standard can change under you; but if you wait too long, you won't be ready when the standard is fully ratified.
July 2009, draft 2 (2.2): no more technical changes, so MSAs have gotten together and started rolling out pre-standard cards into the market.
Draft 3.0 is the big next goal; it goes to ballot for approval onto the final standards track. After draft 3.0, you'll see people start ramping up for volume production.
Draft 2.x will be technically complete for WG ballot: tech spec finalized, first-gen pre-standard components have hit the market, technology demonstrations and forums.
New media modules:
  QSFP modules created for high-density short-reach interfaces (came from InfiniBand); used for 40GBASE-CR4 and 40GBASE-SR4.
  CXP modules proposed for InfiniBand and 100GbE; 12 channels, of which 100GbE uses 10; used for the 10-lane 100GBASE interfaces.
  CFP modules for long-reach apps; big package, used for SR4, LR4, SR10, ER4; about twice the size of a XENPAK; 100G and 40G options for it.
MPO/MTP cable: multi-fiber push-on, the high-density fiber option.
  40GBASE-SR4: 12-fiber MPO, uses 8 fibers
  100GBASE-SR10: 24-fiber MPO cable, uses 20 fibers
  this will make cross-connects a challenge
Switches and routers: several vendors are working on pre-standard cards; you saw some at beer and gear last night (Alcatel, Juniper).
First-gen tech will be somewhat expensive and low density, geared for those who can afford it initially and really need it. Nx10G LAG may be more cost effective.
Higher speed interfaces will make 10GbE denser and cheaper.
Density improves as vendors develop higher-capacity systems to use these cards; 4x100GbE ports require > 400Gbps/slot.
Cost will decrease as new technology becomes feasible.
Future meetings:
  September 2009: draft 2.2 comment resolution
  November 2009 plenary, Nov 15-20, Atlanta: draft 3.0 and sponsor ballot
  http://grouper.ieee.org/groups/802/3/ba/index.html
  You have to go to a meeting to get the password for the draft, unfortunately.
Look at your roadmap for the next few years; get timelines from your vendors: optical gear, switches, routers, server vendors, transport and IP transit providers, IXs, others?
Figure out what is missing and ask for it: will it work with your optical systems? what about your cabling infrastructure? 40km 40GbE? Ethernet OAM? Jumbo frames?
There's no 40km 40GbE offering now; if you need it, start asking for it!
Demand for other interfaces: the standard defines a flexible architecture that enables many implementations as technology changes.
Expect more MSAs as tech develops and becomes cost effective: a serial signalling spec, a duplex MMF spec, 25Gbps signalling for 100GbE backplane and copper apps.
Incorporation of Energy Efficient Ethernet (P802.3az) to reduce energy consumption during idle times.
Traffic will continue to increase; the need for Terabit Ethernet is already being discussed by network operators.
Ethernet will continue to evolve as network requirements change.
Question from the floor about interesting references.

Dani Roisman, PeakWeb: RSTP to MST spanning tree migration in a live datacenter.
Had to migrate from per-VLAN RSTP to MST on a highly utilized network, with minimal impact to a live production network, and define best practices for MST deployment that will yield maximal stability and future flexibility.
Had minimal reference material to base this on; the focus here is on real-world migration details--read white papers and vendor docs for the specs of each protocol type.
The environment: a managed hosting facility; needed the flexibility of any VLAN to any server, any rack; each customer has its own subnet and own VLAN; dual uplinks from top-of-rack switches to the core.
High number of STP logical port instances using Rapid PVST on the core: VLAN count times trunk interface count = logical port instances (see the quick arithmetic sketch after this section).
Too many spanning tree instances for the layer 3 core switch; concerns around CPU utilization, memory, and other resource exhaustion at the core.
Vendor support for per-VLAN STP:
  Cisco: per-VLAN is the default config, cannot switch to single-instance STP
  Foundry/Brocade: offers a per-VLAN mode to interoperate with Cisco
  Juniper MX and EX: offers VSTP to interoperate
  Force10 FTOS
Are we too spoiled with per-VLAN spanning tree? They don't need per-VLAN spanning tree, and don't want to utilize the alternate path during steady state, since they want to guarantee 100% capacity during a failure scenario.
Options: collapse from per-VLAN to single-instance STP, or migrate to standards-based 802.1s MSTP (multiple spanning tree--but really going to fewer spanning trees!).
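The logical-port-instance blowup is just multiplication; here's a toy Python sketch of it (the VLAN, port, and limit numbers are made up for illustration, not PeakWeb's real figures):

```python
# Toy comparison of STP logical port instances: per-VLAN RSTP vs. MST.
# All numbers are hypothetical, for illustration only.

vlans = 500            # customer VLANs trunked through the core
trunk_ports = 48       # trunk interfaces on the core carrying those VLANs
mst_instances = 2      # e.g., MST1/MST2, plus the always-present MST0

# Per-VLAN RSTP: every VLAN runs its own spanning tree on every trunk port.
pvst_logical_ports = vlans * trunk_ports

# MST: spanning tree state is kept per *instance*, not per VLAN.
mst_logical_ports = (mst_instances + 1) * trunk_ports  # +1 for MST0

print(f"Rapid PVST logical ports: {pvst_logical_ports}")   # 24000
print(f"MST logical ports:        {mst_logical_ports}")    # 144

# Many platforms document a maximum logical-port count (often in the low
# tens of thousands); per-VLAN STP walks right into that limit as VLAN and
# trunk counts grow, which is the resource-exhaustion concern in the talk.
```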
MST introduces new configuration complexity: all switches within a region must have the same VLAN-to-MST-instance mapping, which means any VLAN or MST change must be made universally, across all devices in the site.
That raises issues with change control: all rack switches must be touched when making a single change.
Do they do one MST instance that covers all VLANs? Do they pre-create instances? Do all vendors support large instance numbers? No, some only support instances 1-16.
Had to do the migration with zero downtime if possible.
Used a lab environment with some L3 and L2 gear; found a way to get it down to one STP cycle of about 45 seconds.
Know your roots! Set the cores to the "highest" STP priority (lowest value); set rack switches to lower-than-default priority to ensure they never become root.
Start from the roots, then work your way down. MSTP runs RSTP for backwards compatibility.
Choose VLAN groups carefully. Instance numbering: some platforms only support a small number, 1-16.
Starting point: all devices running 802.1w; core 1 is root at priority 8192, core 2 at 16384.
You can pre-configure all the devices with the spanning-tree mapping, but it doesn't go live until the final command is entered (a sketch of scripting that consistent config is below, after the Q&A).
Don't use VLAN 1! Set the MST priority on your cores and rack switches, and don't forget MST 0--VLAN 1 hangs out in MST 0!
First network hit: when you change core 1 to spanning-tree mode MST.
Step 2: core 2 moves to MST mode; a brief blocking moment.
Step 3: rack switches, one at a time, go through a brief blocking cycle.
Ongoing maintenance: all new devices must be pre-configured with identical MST parameters; any VLAN-to-instance mapping changes go to core 1 first.
There's no protocol for MST config propagation--a VTP follow-on, maybe?
MST adds config complexity, but allows for great multi-vendor interoperability in a layer 2 datacenter.
It's only been deployed a few times--more feedback would be good.

Q: Leo Bicknell, ISC: he's done several of these; he points half the rack switches at one core and the other half at the other core. That way, in a core failure, only half the traffic sloshes over; and with traffic running on both sides, failed links show up much more quickly. Also, "any device in any rack has to support any VLAN" is a scaling problem; most sites end up going to Layer 3 on the rack switches, which scales much better.
A: Running hot on both sides, 50/50, is good for making sure both paths are working; active/standby allows for hidden failures. But since they set up and then leave, they needed to make sure what they leave behind is simple for the customer to operate. The Layer 3 move is harder for managed hosting--you don't know how many servers will want to be in a given rack switch.
Q: Someone else comes to the mic; they ran into the same type of issue. They set up their network to have no loops by design; each switch had 4x1G uplinks, but when they had flapping, it tended to melt the CPU. The vendor pushed them towards Layer 3, but they needed flexibility for any-to-any. They did pruning of VLANs on trunk ports, but ended up with little "islands" of MST where VLANs weren't trunked up; they left those as odd separate root islands rather than trying to fix them.
A: So many services are built around broadcast and multicast style topologies that it's hard to move to Layer 3, especially as virtualization takes off; the ability to move instances around the datacenter is really crucial for those virtualized sites.
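Because every switch in the MST region has to carry an identical name/revision/VLAN-to-instance mapping, it's natural to generate that stanza from one source of truth rather than hand-typing it per device. A minimal sketch (the region name, VLAN map, priorities, and IOS-like syntax are illustrative assumptions, not the exact commands from the talk):

```python
# Generate an identical MST region configuration block for every switch,
# plus a per-role priority line, from a single VLAN-to-instance mapping.
# Command syntax is approximate Cisco-IOS style; adjust per vendor.

REGION_NAME = "DC1"           # hypothetical region name
REVISION = 1                  # must match on every switch in the region
VLAN_MAP = {                  # instance -> VLANs (hypothetical)
    1: [100, 101, 102],
    2: [200, 201, 202],
}
PRIORITIES = {"core1": 8192, "core2": 16384, "rack": 61440}

def mst_config(role: str) -> str:
    lines = [
        "spanning-tree mode mst",
        "spanning-tree mst configuration",
        f" name {REGION_NAME}",
        f" revision {REVISION}",
    ]
    for instance, vlans in sorted(VLAN_MAP.items()):
        lines.append(f" instance {instance} vlan {','.join(map(str, vlans))}")
    # Don't forget MST 0 -- set priority for instance 0 as well as the others.
    instances = "0-" + str(max(VLAN_MAP))
    lines.append(f"spanning-tree mst {instances} priority {PRIORITIES[role]}")
    return "\n".join(lines)

for role in ("core1", "core2", "rack"):
    print(f"! ---- {role} ----")
    print(mst_config(role))
```

Since the mapping lives in one place, a VLAN add touches one dictionary and regenerates the same stanza for every device, which matches the "pre-configure everywhere, cut over with the final command" workflow described above.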
David Maltz, Microsoft Research: Datacenter challenges--building networks for agility.
A brief characterization of "mega" cloud datacenters based on industry studies: costs, pain points, and traffic pattern characteristics; then VL2--virtual layer 2 network virtualization with uniform high capacity.
Cloud service datacenters: 50k-200k servers; scale-out is paramount; some services have tens of servers, others tens of thousands. Servers are divided up among hundreds of services.
Cost of servers dominates datacenter cost: servers 45%, power infrastructure 25%; maximize useful work per dollar spent.
Ugly secret: 10-30% CPU utilization is considered "good" in datacenters.
Servers sit idle because servers are purchased rarely (quarterly), reassigning servers is hard, and every tenant hoards servers.
Solution: more agility--any server, any service.
Network diagram showing the conventional L3/L2 datacenter model: higher in the datacenter you find more expensive gear, designed for 1+1 redundancy on a scale-up model; gear higher in the model handles higher traffic levels, and a failure higher in the model is more impactful.
1G at the rack level, 10G off the rack.
Generally about 4,000 servers per L2 domain.
The network pod model keeps us from dynamically growing/shrinking capacity.
VLANs are used to isolate properties from each other; IP addresses are topologically determined by the access routers.
Reconfiguration of IPs and VLAN trunks is painful, error-prone, and takes time.
No performance isolation (a VLAN is reachability isolation only): one service sending/receiving too much stomps on other services.
Less and less capacity is available per server as you go to higher levels of the network: 80:1 to 240:1 oversubscription.
Two types of apps: inward facing (HPC-like) and outward facing; 80% of traffic is internal--data mining, ad relevance, indexing, etc.
Dynamic reassignment of servers and map/reduce-style computations mean explicit traffic engineering is almost impossible.
They did a detailed study of 1,500 servers on 79 ToR switches, looking at every 5-tuple for every connection.
Most flows are 100 to 1000 bytes--lots of bursty, small traffic--but most bytes are part of flows that are 100MB or larger. A huge dichotomy not seen on the internet at large.
Median of 10 flows per server to other servers.
How volatile is traffic? Cluster the traffic matrices together: if you use 40-60 clusters, you cover a day's worth of traffic; more clusters give a better fit. (A toy sketch of this kind of clustering follows below.)
Traffic patterns change nearly constantly: the 80th percentile of pattern lifetime is about 100 seconds; the 99th percentile about 800 seconds.
Server-to-server traffic matrix: most of the traffic is on the diagonal--servers that need to communicate tend to be grouped on the same top-of-rack switch--but off-rack communication slows down the whole set of server communications.
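A generic illustration of what "cluster the traffic matrices" can look like, not the actual VL2 measurement pipeline; the data here is synthetic and the ToR/sample counts are scaled down for the toy:

```python
# Toy: cluster 5-minute traffic matrices to see how well k representative
# patterns cover a day. Synthetic data; not the paper's real methodology.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_tors, samples_per_day = 20, 288          # toy size (the study used 79 ToRs)

# Each sample is a ToR-to-ToR traffic matrix, flattened into a vector.
traffic_matrices = rng.lognormal(mean=0.0, sigma=2.0,
                                 size=(samples_per_day, n_tors * n_tors))

for k in (10, 40, 60, 100):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(traffic_matrices)
    # Lower inertia = the k "representative" matrices fit the day better.
    print(f"{k:>3} clusters -> within-cluster error {km.inertia_:.3e}")
```

The point of the 40-60 number above is that a modest library of representative matrices describes a whole day, even though the instantaneous pattern shifts every few minutes.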
Faults in the datacenter: you need high reliability near the top of the tree, which is hard to accomplish--maintenance windows, an unpaired router failing.
0.3% of failure events knocked out all members of a network redundancy group, typically at lower layers of the network, but not always.
Objectives: developers want network virtualization--a model where all their servers, and only their servers, are plugged into one ethernet switch: uniform high capacity, performance isolation, and layer 2 semantics (flat addressing, any server can use any IP address, broadcast transmissions).
VL2's distinguishing design principles: randomize to cope with volatility, separate names from locations, leverage the strengths of end systems, and build on proven network technology.
What enables a new solution now?
Programmable switches with high port density: fast, cheap, flexible (Broadcom, Fulcrum); a 20-port 10G switch is one big chip with 240G, list price around $10k; small buffers (2MB or 4MB packet buffers), small forwarding tables (10k FIB entries), and a flexible environment--a general-purpose network processor you can control.
Centralized coordination: scale-out datacenters are not like enterprise networks; centralized services already control/monitor the health and role of each server (Autopilot), so centralized control of traffic is reasonable.
Clos network: ToRs connect to aggs, aggs connect to intermediate node switches; no direct cross-connects. The bisection bandwidth between each layer is the same, so there's no need for oversubscription.
You only lose a 1/n chunk of bandwidth for a single box, so you can have automated reboot of a device to try to bring it back if it wigs out.
Valiant load balancing: every flow is bounced off a random intermediate switch; provably hotspot-free for any admissible traffic matrix, and it works well in practice.
Use encapsulation on cheap dumb devices: two headers; the outer header is for the intermediate switch, which pops it, and the inner header directs the packet to the destination rack switch. MAC-in-MAC works well.
Leverage the strength of end systems: a shim driver at the NDIS layer traps the ARP, bounces it to the VL2 agent, which looks up the central directory system and caches the lookup; all further communication to that destination no longer pays the lookup penalty. (A rough sketch of this lookup-and-encapsulate flow is below.) You add extra kernel drivers to the network stack when you build the VM anyhow, so it's not that crazy.
Applications work with application addresses (AAs), which are flat names; infrastructure addresses are invisible to apps.
How do you implement VLB while avoiding the need to update state on every host for every topology change? Many switches are optimized for uplink passthrough, so it seems better to bounce *all* traffic through intermediate switches rather than trying to short-circuit locally. The intermediate switches all have the same IP address, so senders all send to that one intermediate IP and anycast+ECMP picks a switch--you get fast failover and good Valiant load balancing.
They've been growing this, and found nearly perfect load balancing: an all-to-all shuffle of 500MB among 75 servers gets within 94% of perfect balancing, and that accounts for the overhead of the extra headers. NICs aren't entirely full duplex--about 1.8Gb rather than 2Gb bidirectional.
It provides good performance isolation as well: as one service starts up, it has no impact on a service already running in steady state.
VLB does about as well as adaptive routing (TE using an oracle) on datacenter traffic: the worst link is 20% busier with VLB, the median is the same--and that's assuming perfect knowledge of future traffic flows.
Related work: OpenFlow. Wow, that went fast!
The key to datacenter economics is agility: any server, any service; the network is the largest blocker; the right network model to create is a virtual layer 2 per service. VL2 uses randomization, name-location separation, and end systems.
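A rough conceptual sketch of the end-system agent behavior described above--directory lookup, caching, and bouncing each flow off a random intermediate--with hypothetical names and addresses; this is not VL2's actual code or API:

```python
# Conceptual sketch of a VL2-style agent: resolve an application address (AA)
# to a destination ToR via a directory service, cache it, and encapsulate each
# flow via a random intermediate switch (Valiant load balancing).
# All names, addresses, and the directory contents are made up.
import random
from dataclasses import dataclass

INTERMEDIATE_SWITCHES = ["int-1", "int-2", "int-3", "int-4"]  # one anycast IP in the real system

# Stand-in for the centralized directory system.
DIRECTORY = {"10.0.5.7": "tor-23", "10.0.9.2": "tor-41"}      # AA -> destination ToR

@dataclass
class EncapsulatedPacket:
    outer_dst: str    # intermediate switch (popped there)
    inner_dst: str    # destination ToR switch
    payload: bytes

class VL2Agent:
    def __init__(self):
        self.cache = {}   # AA -> ToR, filled on first lookup (the trapped ARP)

    def resolve(self, aa: str) -> str:
        if aa not in self.cache:
            self.cache[aa] = DIRECTORY[aa]     # one directory round-trip per new destination
        return self.cache[aa]

    def send(self, aa: str, payload: bytes) -> EncapsulatedPacket:
        tor = self.resolve(aa)
        intermediate = random.choice(INTERMEDIATE_SWITCHES)   # Valiant load balancing
        return EncapsulatedPacket(outer_dst=intermediate, inner_dst=tor, payload=payload)

agent = VL2Agent()
for _ in range(3):
    pkt = agent.send("10.0.5.7", b"hello")
    print(pkt.outer_dst, "->", pkt.inner_dst)
```

In the real fabric the intermediates share one IP and ECMP does the random pick; the sketch just makes the two-header indirection and the cached directory lookup explicit.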
Q: Joe Provo--the shim is only applied to intra-datacenter traffic; external traffic is *not* encapsulated? A: Yes!
Q: This looks familiar to 802.1aq in IEEE; when you did the test case, how many servers did you look at moving across virtualized domains? A: Because they punt to a centralized name system, there is no limit to how often servers are switched, or how many servers you use; you can have 10 servers or 100,000 servers, and they can move resources at 10ms granularity. Scalability is how many servers can go into a VL2 "vlan" and update the information. In terms of the number of virtual layer 2 environments, it's looking like hundreds to thousands.
IEEE is looking at MAC-in-MAC for silicon-based benefits; VLANs won't scale, so they use the 802.1ah header, which gives them 16M possibilities, and use IS-IS to replace spanning tree.
Did they look at moving entire topologies, or just servers? They don't want to move a whole topology, just movement at the leaves.
Q: Todd Underwood, Google: separate tenants that all work for the same company, but they all have different stacks and no coordination among them--this sounds like a competing federation within the same company; why does Microsoft need this? A: If you can handle this chaos, you can handle anything! And in addition to hosting their own services, they also host outsourced services like Exchange and SharePoint. Microsoft has hundreds of internal properties, essentially.
Q: This punts on making the software side work together, right? It makes the network handle it at the many-to-many layer.
Q: Dani, PeakWeb--how often does the shim lookup happen, is it at the start of every flow? A: Yes, the start of every flow; that works out well. You could aggregate and keep a routing table, but doing it per destination flow works well.
Q: Is it all L3, or is there any spanning tree involved? A: No spanning tree; the network is all L3.
Q: Did you look at Woven at all? A: Their solution works to about 4,000 servers, but it doesn't scale beyond that.

Break for 25 minutes now; 11:40 restart. We'll pop in a few more lightning talks.
Somebody left glasses at beer and gear; the reg desk has them. :)
Break now! Vote for SC members!!

Next up, Mirjam Kuhne, RIPE NCC: RIPE Labs, a new initiative of the RIPE NCC.
First there was RIPE, the equivalent of NANOG; then the NCC came into existence to handle the meeting coordination, the registry, the mailing lists, etc.
RIPE Labs is a website, a platform, and a tool for the community: you can test and evaluate new tools and prototypes, and contribute new ideas.
Why RIPE Labs? A faster, tighter innovation cycle; provide useful prototypes to you earlier; adapt to the changing environment more quickly; closer involvement of the community; openness; make feedback and suggestions faster and more effective.
http://labs.ripe.net/
Many of the talks here are perfect candidates for material to post on Labs, to get feedback from your colleagues, share research results, post new findings.
How can it benefit you? Get involved, share information, discover others working on similar issues, get more exposure.
A few rules: free and civil discussion between individuals; anyone can read content; register before contributing; no service guarantee--content can disappear based on community feedback, legal or abuse issues, or too few resources.
What's on RIPE Labs? A DNS lameness measurement tool, REX (the resource explorer), an intro to the internet number resource database, IP address usage movies, 16-bit ASN exhaustion data, the NetSent next-gen information service.
Please take a look and participate! mir at ripe.net or labs at ripe.net
Q: Cathy Aaronson notes that the ISP security BOF is looking for a place to disseminate information; they should probably get in touch with you!

Kevin Oberman is up next, from ESnet: DNSSEC basics--don't fear the signer!
Why you should sign your data sooner rather than later: this is your one shot to experiment with signing when you can screw up and nobody will care! Later, if you screw up, you disappear from the net.
DNSSEC uses public-key crypto, similar to SSH. DNSSEC uses an anchored trust system, NOT a PKI! No certs! It starts at the root and traces down.
The root key is well known; the root knows net's key, net knows es's key, and the es key signs *.es.net.
This is the perfect time to test and experiment without fear. Once you publish keys and people validate, you don't want to experiment and fail--you will disappear!
Signing your information has no impact; only when you publish your keys does it have impact.
It is REALLY getting closer!
The root will be signed in 2010; .org and .gov are signed now; .com and .net should be signed by 2011.
Multiple ccTLDs are signed; .se led the way and has lots of experience. Only once did they disappear, and that was due to a missing dot in a config file--not at all DNSSEC related.
Registration issues are still being worked on; transfers are of particular concern--an unhappy losing registrar could hurt you!
Implementation: until your parent is ready, develop signing policies and procedures, and test, test, and test some more: key re-signing, key rolls, management tools. Find out how to transfer the initial key to your parent (when the parent decides); this is a trust issue--are you really "big-bank.com"?
If you're brave you can test validation (very few are doing it--test on an internal server first!!). If this breaks, your users will hurt (but not the outside world).
You can give your public keys to the DLV (or ITARs); this can hurt even more! (DLV is automated and works with BIND out of the box; it's simpler, but you can choose which way to go.)
What to sign? The forward zone is the big win; the reverse zone has less value. You may not want to sign some or all of your reverse or forward zones.
Signing involves two types of keys, ZSK and KSK: the zone-signing key signs the zone data, and the key-signing key is the one you send up to your parent.
Keys need to be rolled regularly; if all keys and signatures expire, you lose all access, period.
Use two active keys, with data re-signed by the two newest keys; sign at short intervals compared to the expiration time, to allow time to fix things. (A toy re-signing/rollover sketch follows these notes.)
New keys require the parent to be notified. KSKs are "safe", kept off the network (rotate annually).
Wait for BIND 9.7, it'll make your life much easier. There are commercial shipping products out there.
Make sure there are at least two people who can run it, in case one gets hit by a bus.
Read NIST SP800-81; SP800-81r1 is out for comment now. Read the BIND administrator reference manual.
Once in a lifetime opportunity!!
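A toy model of the "two active ZSKs, re-sign well before expiration" advice; the intervals and key names are made up for illustration, and real policy belongs in your signing tooling (e.g., BIND 9.7's automation), not a script like this:

```python
# Toy DNSSEC re-sign/rollover scheduler. All intervals and key names are
# hypothetical; the point is only the relationship between signature
# validity, re-sign interval, and key lifetime.
from datetime import datetime, timedelta

SIG_VALIDITY = timedelta(days=30)     # how long RRSIGs are good for
RESIGN_INTERVAL = timedelta(days=7)   # re-sign weekly: ~3 weeks of slack to fix problems
ZSK_LIFETIME = timedelta(days=90)     # retire zone-signing keys quarterly

def needs_resign(last_signed: datetime, now: datetime) -> bool:
    """Re-sign long before signatures expire, not at the last minute."""
    return now - last_signed >= RESIGN_INTERVAL

def active_zsks(zsks, now: datetime):
    """Sign with the two newest keys that are still within their lifetime."""
    live = [(name, created) for name, created in zsks if now - created < ZSK_LIFETIME]
    live.sort(key=lambda kc: kc[1], reverse=True)
    return [name for name, _ in live[:2]]

now = datetime(2009, 10, 20)
zsks = [("Kexample.net.+008+11111", datetime(2009, 5, 1)),
        ("Kexample.net.+008+22222", datetime(2009, 8, 1)),
        ("Kexample.net.+008+33333", datetime(2009, 10, 1))]

print("resign due:", needs_resign(datetime(2009, 10, 10), now))
print("sign with: ", active_zsks(zsks, now))
# The key point: signatures should never get anywhere near SIG_VALIDITY old,
# because an expired zone simply disappears for validating resolvers.
```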
Arien Vijin, AMS-IX: an MPLS/VPLS based internet exchange.
The exchange started off as a coax cable between routers, then became a Cisco 5500 switch (AMS-IX version 1); in 2001 they went to Foundry switches at gig speeds (version 2); version 3 added optical switching.
AMS-IX version 3 vs version 4: as of June 2009, version 3 had six sites, two of them with core switches in the middle of two star networks. E, FE, GE, and N*GE connections land on BI-15K or RX-8 switches; N*10GE connections are resiliently connected on the switching platform (MLX-16 or MLX-32).
Two separate networks, with one active at any moment in time; selection of the active network is done by VSRP, and the inactive network's switch blocks ports to prevent loops. A photonic switch basically flips customers from one network to the other.
The network had some scaling problems at the end. Until now, they could always just buy bigger boxes in the core to handle traffic; in the summer of 2009 they realized there was no sign of a bigger switch on the horizon to replace the core.
The core switches were fully utilized with 10GE ports, which limits ISL upgrades, and there were no other switches on the market.
Platform failover introduces a short link flap on all 10GE customer ports, which leads to BGP flaps; with more 10G customers this becomes more of an issue.
AMS-IX version 4 requirements: scale to 2x the port count; keep resilience in the platform, but reduce the impact of failover (photonic switch layer); increase the number of 10G customer ports on access switches; more local switching; migrate to a single-architecture platform to reduce management overhead; use a future-proof platform that supports 40GE and 100GE, fully standardized in 2010/2011.
They moved to 4 central core switches, all meshed together; every edge switch has 4 links, one to each core. A photonic switch for 10G members provides redundancy for customers.
MPLS/VPLS-based peering platform: scaling of the core by adding extra switches in parallel.
4 LSPs between each pair of access switches, with primary and secondary (backup) paths defined.
OSPF with BFD for fast detection of link failures; RSVP-TE signalled LSPs over predefined paths (primary/secondary).
A VPLS instance per VLAN, with statically defined VPLS peers (LDP signalled), load balanced over the parallel LSPs across all core routers.
Layer 2 ACLs instead of port security; manual adjustment for now (people have to call with new MAC addresses).
So now they're P/PE routers, not core and access switches. ^_^;
Resilience is handled by LSP switchover from the primary to the secondary path, totally transparent to the access router. If a whole switch breaks down, the photonic switch is used to flip all customers to the secondary switch. So they can only run switches at 50% to allow for photonic failover of traffic.
How to migrate the platform without customer impact?
Build a new version of the photonic switch control daemon (PSCD): no more VSRP traps; it now watches LSP state in the MPLS cloud.
Develop configuration automation: describe the network in XML and generate the configuration from that (a small sketch of this kind of generation follows the migration notes below).
Move non-MPLS-capable access switches behind the MPLS routers and the PXC, as if they were a 10GE customer connection.
Upgrade all non-MPLS-capable 10GE access switches to Brocade MLX hardware.
Define a migration scenario with no customer impact: 2 colocation sites only, for simplicity; a doubled L2 network with VSRP for master/slave selection and loop protection; move GE access behind the PXC; migrate one half to the MPLS/VPLS network; use the PXC to move traffic onto the MPLS/VPLS network and test for several weeks. After six weeks, they did the second half of the network; at that point there were two separate MPLS/VPLS networks. They waited for traffic on all backbone links to drop below 50%, then split the uplinks to hit all the core P devices; at that point traffic began using paths through all 4 P router cores.
Migration conclusions: traffic load balancing over multiple core switches solves the scaling issues in the core; increased stability of the platform; backbone failures are handled in the MPLS cloud and not seen at the access level; access switch failures are handled by the PXC, affecting only a single pair of switches.
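A minimal sketch of "describe the network in XML, generate configuration from it"; the XML schema, device names, and pseudo-config syntax here are invented for illustration, and AMS-IX's real tooling and router syntax will differ:

```python
# Toy XML-driven config generation: emit one LSP per core router between
# every pair of access (PE) routers, with a backup path, so traffic to each
# peer PE is spread over all P routers. Syntax is illustrative only.
import itertools
import xml.etree.ElementTree as ET

TOPOLOGY = """
<network>
  <pe name="pe-site1-1"/>
  <pe name="pe-site2-1"/>
  <pe name="pe-site3-1"/>
  <core name="p-1"/><core name="p-2"/><core name="p-3"/><core name="p-4"/>
</network>
"""

root = ET.fromstring(TOPOLOGY)
pes = [pe.get("name") for pe in root.findall("pe")]
cores = [p.get("name") for p in root.findall("core")]

for a, b in itertools.combinations(pes, 2):
    for i, core in enumerate(cores, start=1):
        print(f"lsp {a}--{b}-via-{core}")
        print(f"  from {a} to {b}")
        print(f"  primary strict-path {core}")
        print(f"  secondary strict-path {cores[i % len(cores)]}")
```

With dozens of PEs and four parallel LSPs (plus backups) per pair, the config volume grows combinatorially, which is why the notes below call generation an absolute necessity.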
Operational experience:
BFD instability: high line-card CPU load caused BFD timeouts; resolved by increasing the timers.
Bug: ghost tunnels--a double "up" event for an LSP path results in unequal load balancing; should be fixed in the next patch release.
Multicast replication: replication is done on the ingress PE, not in the core, and only uses the first link of the aggregate of the first LSP; with PIM-SM snooping, traffic is balanced over multiple links, but that has serious bugs. Bugfixes and load-balancing fixes are scheduled for future code releases.
RIPE TTM boxes are used to measure delay through the fabric, with GPS timestamps.
They saw enormous amounts of jitter in the fabric, with delays up to 40ms.
The TTM tests send 2 packets per minute with some entropy (source port changes); the VPLS CAM ages out after 60 seconds, so on 24-port aggregates the test traffic often hits a port whose MAC isn't programmed and gets CPU-learned, hence the high measured delay. It does not affect real-world traffic; hopefully they will look at changing the CAM timing. (The packet is claustrophobic? customer stack issue.)
Increased stability: backbone failures are handled by MPLS (not seen by customers); access switch failures are handled for a single pair of switches only; and debugging of customer ports is easier now--you can swap a customer to a different port using the Glimmerglass.
Config generation is an absolute necessity due to the large size of the MPLS/VPLS configs.
Scalability (future options): a bigger core, more ports.
Some issues were found, but nothing that materially impacted customer traffic. Traffic load-sharing over multiple links is good.
Q: Did anything change for GigE access customers, or are they still homed to one switch? A: Nothing changed for GigE customers; the Glimmerglass is single-mode optical only, and they're too expensive for cheap GigE ports. There's no growth in 1G ports, and no more FE ports; it's really moving to a 10G-only fabric.

RAS and Avi are up next: the future of Internet Exchange Points.
Brief recap of the history of exchange points.
0th gen: throw a cable over the wall; PSI and Sprint conspire to bypass ANS; a third network wanted in, and MAE-East was born.
1st commercial gen: FDDI, ethernet; multi-access, with head-of-line blocking issues.
2nd gen: ATM exchange points, from AADS/PBNAP to the MAEs; peermaker.
3rd gen: GigE exchange points, mostly nonblocking internal switches; PAIX, the rise of Equinix, LINX, AMS-IX.
4th gen: 10G exchange points; upgrades and scale-out of existing IXes through 2 or 3 revs of hardware.
Modern exchange points are almost exclusively ethernet based: cheap, no ATM headaches; 10GE and Nx10GE have been the primary growth for years.
Primarily a flat L2 VLAN: the IX has an IP block (a /24 or so), each member router gets 1 IP, and any member can talk to any other via L2; some broadcast (ARP) traffic is needed, and it's well policed.
Large IX topology (LINX): running 8x10G or 16x10G trunks between locations.
What's the problem? L2 networks are easy to disrupt: forwarding loops are easy to create; broadcast storms are easy to create, and with no TTL they take down not only the exchange point but overwhelm peering router control planes as well.
Today we work around these issues by locking each port down to a single MAC (hard coded, or learn a single MAC only), allowing only a single directly connected router port, and careful monitoring of member traffic with sniffers--good IXes have well-trained staff for rapid responses.
Accountability: most routers have poor L2 stat tracking. Options in use: NetFlow from the member router (no MAC-layer info, can't do inbound traffic, and some platforms can't do NetFlow well at all); sFlow from member routers or from the IX operator (still sampled, off by 5% or more); MAC accounting from the member router (not available on the vast majority of platforms today). None integrate well with provider 95th-percentile billing systems (a quick sketch of that calculation is below).
IXes are a poor choice for delivering billed services: if you can't bill, you can't sell services over the platform.
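For reference, this is the standard 95th-percentile calculation the billing comment refers to, plus a look at how a few percent of measurement error (the sFlow sampling concern) lands on the billed number; the traffic data here is synthetic:

```python
# 95th-percentile billing from 5-minute byte counters, with a +/-5%
# measurement error applied to the same synthetic traffic.
import random

random.seed(1)
SAMPLES_PER_MONTH = 12 * 24 * 30              # 5-minute samples for 30 days

def mbps_from_bytes(byte_count: int) -> float:
    return byte_count * 8 / 300 / 1e6          # bytes over 5 minutes -> Mbps

def p95(samples_mbps) -> float:
    ordered = sorted(samples_mbps)
    return ordered[int(len(ordered) * 0.95) - 1]   # drop the top 5% of samples

true_bytes = [random.randint(10_000_000_000, 60_000_000_000)
              for _ in range(SAMPLES_PER_MONTH)]
true_rates = [mbps_from_bytes(b) for b in true_bytes]

# The same traffic as seen through a measurement that is off by up to +/-5%.
sampled_rates = [r * random.uniform(0.95, 1.05) for r in true_rates]

print(f"true 95th percentile:    {p95(true_rates):8.1f} Mbps")
print(f"sampled 95th percentile: {p95(sampled_rates):8.1f} Mbps")
# A few percent of error lands directly on a committed-rate invoice, which is
# why sampled stats are awkward for selling billed services over an IX fabric.
```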
Security: anyone can talk to anyone else; the platform is vulnerable to traffic injection, and poor accounting options make this hard to detect; when it is detected, it's easy to excuse. That means less security is available for selling paid transit.
Vulnerable to denial-of-service attacks, which can even be delivered from the outside world if the IX IP block is announced (as is frequently the case).
Vulnerable to traffic interception via ARP/CAM manipulation.
Scalability: it's difficult to scale and debug large layer 2 networks; redundancy is provided through spanning tree or similar backup-path protocols, so large portions of the network sit in blocking mode just to provide redundancy.
Manageability: poor controls over traffic rates and/or QoS; it's difficult to manage multi-router redundancy--multiple routers see the same IX /24 in multiple places, which creates an "anycast" effect on the peer next-hops and can result in blackholing if there is an IX segmentation or an outage which doesn't drop link state.
Other issues: inter-network jumbo-frame support is difficult--there's no ability to negotiate a per-peer MTU, and it's almost impossible to find a common acceptable MTU for everyone. The service is constrained to IP only, between two routers; you can't use it for an L2 transport handoff.
Avi talks about the shared broadcast domain architecture on exchange points today. The alternative is to use point-to-point virtual circuits, like the ATM exchanges did: it adds overhead to the setup process, but adds security and accountability advantages.
Under ethernet, you can do virtual circuits using 802.1q VLANs, handing off multiple virtual-circuit VLANs on one port.
The biggest issue is the limited VLAN ID space: only 4096 possible IDs (a 12-bit ID space). VLAN stacking can scale this in transport, but the VLANs here are global across the system. That means a 65-member exchange would completely fill up the VLAN ID space with a full mesh (rough numbers in the sketch further down). Traditional VLAN rewrites don't help either. And now the exchange also has to be the arbiter of all the VLANs used on the exchange; many customers use layer 3 switch/routers, so the VLAN ID may be global across their whole device.
To get away from the broadcast domain without using strict VLANs, we need to look at something else: MPLS as transport rather than plain ethernet.
MPLS solves the VLAN scaling problem: MPLS pseudowire IDs are 32 bits--4 billion VCs. The VLAN ID is not carried with the packet, it's used only on the handoffs, so VLAN IDs are no longer a shared resource.
That solves the VLAN ID conflict problem: members choose a VLAN ID per VC handoff, and there's no requirement for the VLAN IDs to match on each end.
It also solves network scaling problems: MPLS TE is far more flexible than L2 protocols, and allows the IX to build more complex topologies, interconnect more locations, and more efficiently utilize resources.
The idea is to move the exchange from L2 to L3 to scale better, give more flexibility, and allow better debugging. You can get better stats, you can do parallel traffic handling for scaling and redundancy, and you see link errors when they happen--they aren't masked by blocked ports.
Security: each virtual circuit would be isolated and secure; there's no mechanism for a third party to inject or sniff traffic, and there's significantly reduced DoS potential.
Accountability: most platforms provide SNMP measurement for VLAN subinterfaces, so members can accurately measure traffic on each VC without "guesstimation"; it's capable of integrating with most billing systems. Now you can start thinking about selling transport over exchange points, for example. It takes the exchange point out of the middle of the traffic accounting process.
Services: with more accountability and security, you can offer paid services; support for "bandwidth on demand" becomes possible; you're no longer constrained to IP-only or one-router-only, so it can be used to connect transport circuits, SANs, etc.
Jumbo-frame negotiation is possible, since the MTU is per interconnect.
Could interconnect with existing metro transport: use Q-in-Q VLAN stacking to extend the network onto third-party infrastructure--imagine a single IX platform servicing thousands of buildings!
Could auto-negotiate VC setup using a web portal.
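Back-of-envelope numbers for the ID-space argument above; exactly how many VLAN IDs a virtual circuit burns depends on how the exchange allocates them, so both countings are shown, and the member counts are just examples:

```python
# Full-mesh virtual circuits vs. the 12-bit 802.1q VLAN space vs. 32-bit
# MPLS pseudowire IDs. Illustrative only.
VLAN_IDS = 4096          # 12-bit 802.1q space
PSEUDOWIRE_IDS = 2**32   # MPLS pseudowire VC ID space

for members in (65, 100, 200, 400):
    vcs = members * (members - 1) // 2       # one VC per member pair
    handoff_tags = members * (members - 1)   # if each member's handoff needs its own global ID
    print(f"{members:>4} members: {vcs:>6} VCs, {handoff_tags:>6} handoff tags "
          f"(VLAN space: {VLAN_IDS}, pseudowire space: {PSEUDOWIRE_IDS})")
```

Either way you count, a flat 802.1q fabric runs out of globally unique IDs at a size today's larger exchanges already exceed, while the 32-bit pseudowire space is effectively unlimited; with MPLS the VLAN tag only has local meaning on each member's handoff.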
Current exchanges mostly work, with careful engineering to protect the L2 core, with limited locations and chassis, with significant redundancy overhead, and for IP services only. A new kind of exchange point would be better: it could transform a "peering only" platform into a new "ecosystem" to buy and sell services on.
Q: Arien from AMS-IX asks about MTU--why does it matter? A: It's for the peer ports on both sides.
Q: They offer private interconnects at AMS-IX, but nobody wants to do that; members don't want to move to a tagged port. They like having a single VLAN and a single IP to talk to everyone. A: The reason RAS doesn't do it today is that it's limited in scale; you have to negotiate the VLAN IDs with each side, and there's a slow provisioning cycle for it--it needs to have the same level of speed as what we're used to on IRC. You'd also need to eliminate the fees associated with the VLAN setup to make it more attractive. It'll burn IPs as well (though for v6 that's not so much of an issue). Having people peer with the route server is also useful for people who don't speak the language, who use the route servers to pass routes back and forth. The question of going outside Amsterdam came up, but the members forbade it, so that it wouldn't compete with other transit and transport providers. But within a metro location, it could open more locations to participate on the exchange point. The challenge of provisioning to many locations is something there's a business model for within the metro region.
Anything else, fling your questions at lunch; return at 1430 hours!
LUNCH!! And Vote! And fill out your survey!!