(this was one of the coolest talks from the three days, actually, and has gotten me *really* jazzed about some cool stuff we can do internally. Huge props to Matt, Barrett, and Todd for putting this together!! --Matt)

2006.06.07
TCP anycast, Matt Levine and Barrett Lyon, with thanks to Todd Underwood
"TCP anycast: don't believe the FUD"
Todd Underwood is in Chicago; Barrett Lyon starts off.
[slides may eventually be at: http://www.nanog.org/mtg-0606/pdf/tcp-anycast.pdf]

IPv4 anycast, from a network perspective, is nothing special: just another route with multiple next-hops.
services exist on each next-hop and respond from the anycast IP address.
it's the packets, stupid.
perceived problem: TCP and anycast don't play together for long-lived flows, e.g. high-def porn downloads. [do porn streams need to last more than 2 minutes?] (a sketch of this failure mode is below, after the data overview)
some claim it exists and works... yes, it's been in production for years now.

Anycast at CacheFly
deployed in 2002
prefix announced on 3 continents; 3 POPs in the US
5 common carriers (transit) plus peering
be sensible about who you peer with
effective BGP communities from upstreams are key to keeping traffic where you want it.

Proxy Anycast
proxy traffic is easy to anycast! move HTTP traffic through proxy servers.
customers are isolated on a VIP/virtual address, which happens to exist in every datacenter.
the virtual address lives over common carriers, allowing even distribution of traffic.
state is handled with custom hardware that keeps state information synchronized across proxies.

Node geography
anycast nodes that do not keep state must be geographically separated.
coasts and countries work really well for keeping route instability largely isolated.
nodes that are nearby could require state between them if local routes are unstable.

IP utilization
"Anycast is wasteful": people use /24s as their service blocks, using 1 /32 out of a whole /24.
really? how much IP space do you need to advertise from 4 sites via unicast?

Carriers and peering
for content players, having even peering and carriers is key.
you may cause EU eyeballs to go to CA if you're not careful about where you peer with people.
having an EU-centric transit provider in the US, without having the same routes in the EU, could cause EU traffic to home in the US.
use quality global providers to keep traffic balanced.
when peering, keep in mind a peer may isolate traffic to a specific anycast node.
try to peer with networks where it makes sense; don't advertise your anycast to them where they don't have eyeballs!
try to make sure your peers and transit providers know your communities and what you're trying to do, and make sure you understand their communities as well!

Benefits of anycast (for content players)
moving traffic without major impact or DNS lag.
provides buffers for major failures.
allows for simple traffic management, with a major (potential) performance upside.
it's BGP you don't control, though, so there's not much you can do to adjust inbound.
wins: HTTP has a significant cost to using DNS to try to shift traffic around -- six or more DNS lookups to acquire content; anycast trims those DNS lookups down significantly!
ability to interface tools to traffic management.
no TTL issues!

Data, May 9, 2006
Renesys: monitored changes in the atomic-aggregator attribute for a CacheFly anycast prefix -- AS path changes and POP changes.
Keynote: monitored availability/performance of a 30k file.
Revision3: monitored behavior of "long-lived" downloads of the DiggNation videocast -- over 7TB transferred.
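Before the numbers, a minimal sketch (not from the talk) of the problem the data is testing: the anycast prefix is just one route with multiple next-hops, and each client's BGP best-path choice decides which POP its packets reach; if that choice flips mid-download, a stateless POP has no entry for the flow and the TCP session breaks. The POP names, path lengths, and shortest-AS-path tie-break below are all hypothetical.

```python
# Hypothetical illustration: the same prefix is originated from several POPs,
# and each client's BGP best-path selection (crudely modeled here as shortest
# AS path) decides which POP its packets reach. If that choice changes while
# a TCP session is open, a node with no state for the flow will reset it.

PREFIX = "192.0.2.0/24"  # documentation prefix standing in for the anycast block

# POP -> AS-path length as seen from one particular client network (made up)
paths_before = {"SJC": 2, "CHI": 3, "LGA": 4}
paths_after  = {"SJC": 4, "CHI": 3, "LGA": 4}   # SJC's path got worse (e.g. a flap)

def best_pop(paths):
    """Pick the POP with the shortest AS path -- a stand-in for BGP best-path."""
    return min(paths, key=paths.get)

pop_a = best_pop(paths_before)
pop_b = best_pop(paths_after)

print(f"{PREFIX} is reached via {pop_a}")
if pop_a != pop_b:
    # Without shared state between POPs, any TCP session open during the shift
    # lands on a node with no connection entry and gets reset.
    print(f"route change: {pop_a} -> {pop_b}; stateless nodes would break open flows")
```

The state-synchronization mechanism described above exists precisely to catch flows that land on a different node after such a shift.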
Renesys data
130 BGP updates for May 9th; a low-volume day, stable prefixes.
34 distinct POP changes based on the atomic-aggregator property on the prefixes.
130 updates is considered a stable prefix.
SJC issue: a thirty-five minute window, 0700 to 0735 UTC, saw 98 updates and 20 actual POP changes based on atomic-aggregator changes, all from one San Jose provider -- failing from SJC to CHI and back to SJC.
unable to correlate these shifts with any traffic changes; most likely we don't have a big enough sample size. possibly just not a lot of people using those routes.
BGP seems stable -- what about TCP flows?
average time between SJC and CHI and back again was about 20 seconds; very quick on the trigger to go back to SJC; this would break all TCP sessions in flight at the time.
for the most part, TCP seems stable.

Keynote data
30k download from 31 locations every 5 minutes, or on average 1 poll per 9.6 seconds.
compared against 'Keynote Business 40' data collected on May 9, 2006.
represents short-lived TCP flows, though.
orange line is the Keynote Business 40; the anycast prefix pegged 100% availability, and load time was lower than the Business 40 (0.2s vs 0.7s for the Business 40).

Revision3 data
monitored IPTV downloads for 24 hours (thanks, Jay!)
span port; analyzed packet captures.
look for new TCP sessions not beginning with SYN; compare that against the global active connection table. looked for sessions that appeared out of nowhere. (a rough sketch of this check follows these notes, before the Q&A)
long-lived data: 683,204 TCP sessions; anything less than 10 minutes thrown out; 23,795 sessions lasted longer than 10 minutes.
average file size for the day was 300MB.
4 TCP sessions moved between POPs: a 0.0006% total POP-switch 'failure' rate, 0.017% for long-lived (more than 10 minutes) sessions.
only looking for sessions that start on node A and move to node B; they dropped 0 of them, due to their state preservation mechanism. without the state preservation mechanism you would have dropped 4 out of 23,795 connections.

Anycast gotchas
large-scale changes in provider policies can impact your traffic; it's up to you to figure out what happened.
"things that are bad" become much, much worse, notably per-packet load balancing across provider or topological boundaries.
e.g. a customer with 2 T1s between the Anaheim and Dallas Sprint POPs, per-packet load balancing across the two. this was before the anycast nodes shared state information between nodes; the customer was probably still seeing performance issues.
conclusion: stateful anycast is not inherently unstable, and failure/disconnect rates are in line with offering unicast services.
this is counter-intuitive relative to some conclusions drawn from previously published data.
some other company did work on this; is there other failure-rate data available? Verisign said not to do TCP anycast, but they didn't publish failure rates for TCP.
need to see where TCP really, really breaks. "trust us, it works."
widespread failures would cause havoc; however, the internet doesn't go crazy *that* often.
transitioning to IPv6: there *is* a plan for the move to IPv6; the plan is to hope they're dead by the time customers actually demand v6.

What you can do
stop telling people TCP anycast doesn't work if you haven't tried it yourself! it just makes them mad to hear it.
if your application doesn't handle TCP/IP failures gracefully, don't run anycast -- in fact, don't run it on the internet at all.
SMTP and HTTP work well; browsers support resets and retries, for example.
experiment with other applications and share your experiences -- they want to know if their results are anomalous, if they're crazy, or if this really does just work.
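A rough sketch of the kind of capture analysis the Revision3 section describes: walk a span-port pcap and flag TCP flows whose first observed packet is not a SYN, i.e. sessions that "appeared out of nowhere" on this node. This is not their actual tool -- the scapy usage, filename, and flow keying are assumptions for illustration; the arithmetic at the end just reproduces the failure percentages quoted above.

```python
# Hypothetical reconstruction of the "sessions not beginning with SYN" check;
# capture.pcap and the flow-keying scheme are made up for illustration.
from scapy.all import rdpcap, IP, TCP

SYN = 0x02
seen_flows = {}  # (src, sport, dst, dport) -> True if the first packet seen was a SYN

for pkt in rdpcap("capture.pcap"):
    if IP in pkt and TCP in pkt:
        key = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        if key not in seen_flows:
            # Record whether this flow's first packet on our span port was a SYN.
            seen_flows[key] = bool(int(pkt[TCP].flags) & SYN)

orphans = [k for k, started_with_syn in seen_flows.items() if not started_with_syn]
print(f"{len(seen_flows)} flows observed, {len(orphans)} appeared mid-stream (no SYN seen)")

# The failure rates quoted in the talk follow directly from the counts given:
print(f"{4 / 683_204:.6%} of all sessions")        # ~0.0006%
print(f"{4 / 23_795:.4%} of >10-minute sessions")  # ~0.017%
```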
Q: Mark Kosters, Verisign: did one of the earlier presentations. It depends on your client base; their client base was very far-flung, and many problems in the far reaches of the internet, including per-packet load balancing, showed up in their data. How big a clientele spread do you want to reach -- do you hit the core, or the far edge?
A: they try to reach customers where they get money for the material being served; not necessarily geared for global connectivity. Yes, unicast in outlying cases will be more robust, but it doesn't scale as well.

Q: Randy Bush, IIJ: roughly the 10th anniversary of much of the anycast UDP deployment. The response of "it's good if you engineer it" applies to most things. They wanted to narrow peering and topology; Randy says there are cases where that is exactly NOT true. In 1996, they needed to shed traffic off their backbone and had to support streaming; they did TCP from all anycast nodes to all peers -- the exact opposite of what the speakers were calling for.
A: this talk is specific to content they deal with directly, and was engineered for their customers' content and needs. The main point is that it's not as bad as people claim, and the benefits can be substantial. And yes, you will need to engineer it for your particular case.

Q: Bill Woodcock: over the past 10-12 years, many have seen long-lived TCP connections with 0.01% failure rates; Mark's results with j-root were surprisingly different. The methodologies were sound, so why might we be seeing these bimodal results?
A: different results being aimed for -- streaming video vs. universal reachability for DNS. Perhaps worth looking at the specific needs being targeted? Dalnet runs with very long-lived TCP connections, some lasting for months; they're not picky about who they get transit from, and they're anycasting their IRC servers.

Q: Danny McPherson: 10 years ago people were using this to try to move content faster, to make Keynote numbers look better. Announcing prefixes from multiple places with the same origin AS may trigger security/prefix-hijacking alerts; the state sharing they do may have helped considerably.
A: not as scary as people think it is. Some open-source state synchronization tools would be nice; if someone can do some work in that area, they'd love to support it.

Q: Michael ?, UC Berkeley: thanks, good to see studies like this! Not all applications/protocols can handle this; it's hard to then generalize this to "TCP" in general like that.
A: they meant application more than protocol.

Q: TCP has methods of dealing with out-of-order packets, and not many people use per-packet load balancing; you don't know what might change in the future, though.

Q: Matt Peterson: can you give a description of the size of the files being moved around? Also, the state mechanisms -- you mentioned open source -- is that something you might consider releasing?
A: for the content, they don't remember exactly; look at Revision3 and see their show. Average file size was 350MB, ranging from 200-650MB. For their state, they hacked something together themselves; they'd be happier to support something community-based than release their own code.

Any more questions, ask them offline -- we HAVE to keep moving to get out of the convention center on time!!