2008.02.18 NANOG 42 Keynote talk--Amazon, and taming complex systems
Wow. I just gotta say again--wow! Kudos to Josh for pulling in a very key pair of speakers on a very important topic for all of us. I very quickly stopped jotting down notes from the slides and focused on what they were saying, because the lessons learned are so crucial for any network attempting to scale. Apologies in advance for typos, etc. that leapt in while I was typing. And definitely read through the presentation--I thought we were doing well at rolling out datacenters in 5 weeks, but these guys completely pwned us, rolling out datacenters in two weeks!!

URL for the talk is at http://www.nanog.org/mtg-0802/keynote.html

Matt

2008.02.18 Amazon Keynote Talk

Josh Snowhorn, the program committee member who got Amazon to present, introduces the keynote speakers:
Tom Killalea, VP of technology, and Dan Cohn, principal engineer.

Earth's most consumer-centric company for the past 13 years:
consumers, sellers, developers.

Consumers and sellers:
Provide a place where people can find and discover anything they may want to buy online.
800,000 sq foot warehouse in Nevada; about 13 buildings spread around doing fulfillment.
Sortation devices make sure the right people get the right products.

Software developers:
Web-scale computing services... they want to give people access to their resources; free developers from doing the heavy lifting of launching a web service so they can focus on the interesting bits.
Don't deal with the muck, focus on APIs. Let developers focus on delivering solutions.
Developers want a few key items: storage, computing, queues, queries.

Bandwidth utilized (Amazon web services and website): the orange line is the historical website traffic, the blue line is web services; the web services business is now a larger bandwidth consumer than the internal websites.
They've been doing rapid growth, but have also increased the ratio of network devices to engineers.

Jim Gray: a planetary-scale distributed system operated by a single part-time operator.
Can we provide the infrastructure "muck" so the network engineers don't have to worry about it? The goal is to abstract it as much as possible.

How should we trade off consistency, availability, and network partition tolerance (CAP)?
Eric Brewer claims you can have at most 2 of these invariants for any shared-data system.
This is a challenge of tradeoffs; hard consistency is nearly impossible in very large systems, so you deal with versioning.

Real-time dynamic dependency discovery? The goal is to not have it be a static system.
Recovery-oriented computing? Can you protect yourself from downstream damage?
Communications infrastructure that scales infinitely.
Answers involve taking a holistic view.

MAYA, Machine Anomaly Analyzer: maps each server to the remote services being called, showing the latency and health of the remote service.
All of the content is scheduled by people, so the dependency tree is different over time; you can't keep track of it over time, but you want to see what's happening right at the moment.
They show a call to the main Amazon page, and all the tendrils are remote calls that have to happen before the main page gets rendered.
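MAYA itself wasn't shown in code; a rough sketch of the idea--take a snapshot of who is calling whom right now and walk the tendrils out from the page renderer--might look like this in Python. The record format and service names here are made up for illustration, not anything Amazon showed.

# Rough sketch of the MAYA idea: from a snapshot of "who is calling whom
# right now", walk outward from the page renderer to see every remote
# service that has to answer before the page can render.
# The record format and service names are hypothetical.
from collections import defaultdict, deque

# (caller, callee, observed latency in ms) -- hypothetical snapshot data
calls = [
    ("www-render", "aggregator-1", 12.0),
    ("aggregator-1", "dynamo-cart", 4.5),
    ("aggregator-1", "order-db", 9.1),
    ("www-render", "recs-service", 30.2),
    ("recs-service", "s3-frontend", 7.8),
]

graph = defaultdict(list)
for caller, callee, latency_ms in calls:
    graph[caller].append((callee, latency_ms))

def tendrils(root):
    """Breadth-first walk of everything downstream of `root` right now."""
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for callee, latency_ms in graph[node]:
            print(f"{'  ' * (depth + 1)}{node} -> {callee} ({latency_ms} ms)")
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))

tendrils("www-render")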
Simplicity and auto-configurability across the whole infrastructure stack. Needs a different approach to engineering; put the cycle together, share objectives across the different engineering disciplines involved in building and designing the system.

Other constraints: "Software above the level of a single device" (Tim O'Reilly). Client requests come in to page-rendering components, then request routing to aggregator services, which in turn route requests to services such as Dynamo, S3, the order database, etc.
Applications with fault zones wider than a single rack (melted servers shown); fault zones wider than a single datacenter.

Latency graphs--a significant issue. Orange reads, red writes; average latency, and 99.9% latencies. Very acute attention paid to both; look at the bad as well as the average. Aim for the 99.9% latency, to see how they do with convergence time and fast restoration.

Change management: maintenance windows (none), latency considerations, availability considerations. With no maintenance windows, they have to be very sensitive to latency and availability, and they're sensitive to even minor perturbations.

Hire good people, but even good people are only good to 3 sigma. Even good people are not so good with complex systems that scale large and fast.

So, it's all about automation: self-configuring, self-verifying, n-scale automation for network elements. They want to make sure they don't have humans repeating work; blocks build on each other. APIs keep humans out of repetitive, error-prone processes. Configuration of services that would typically involve engineers is automated and called via APIs.

Think differently about policy:
If it isn't built into process,
if you have to search for it,
if it isn't auto-enforced,
...it may as well not exist.
Systems verify and enforce policy.

Make scaling simple: simplicity isn't achievable as a passive goal, it is a force that must be actively applied.
Network simplicity at scale: always anticipate the next order of magnitude of growth, even if it's a challenge. From a shell to a full production datacenter in 2 weeks--this is their challenge and goal.

Make room to tackle the big questions: eliminate mundane work. What is mundane work? Everything that can be easily defined by heuristics or templating: deployment of network devices, and operational problems too. Basically, automate everything that is normally limited to human speed: troubleshooting, mitigation, problem identification, detailed investigation, alarming, ticketing, workarounds, repair.

Cool, but realistic? Take what's implicit to network engineers and make it explicit in policy and code.

How to start? No need to throw out existing networks or designs! The network is the only authoritative resource that exists. They take small parts, define the bits of policy, then write systems to check, and iterate. Reconciliation of policy vs. reality creates a self-enforcing cycle: you find variances, and you either update the policy or enforce it. Iterate on high-impact/high-frequency problems and continue the cycle; it's a forcing function to get people to understand the impact of policy decisions and how to define them network-wide. This results in normalized, consistent, repeatable deployments. It can take multiple iterations to get to simple.

Case study--multicast at Amazon. They've been a very heavy user of multicast internally, for about 7 years now. A simple multicast-based approach made publishing and subscribing to topics of interest really easy; perhaps too easy--"we quickly had more (s,g) state than any other network in the world," by over 2x. Once they made subscription to data resources easy, adoption took off like wildfire, and growth surpassed what they anticipated. It was challenging for the forwarding and state tracking of the vendor hardware.
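A rough back-of-the-envelope on why that (s,g) state hurts, and why the move to BIDIR described next helps: PIM-SM keeps an (S,G) entry per source per group, while bidirectional PIM only needs a (*,G) entry per group. The numbers below are made up; the talk didn't give actual counts.

# Back-of-the-envelope comparison of multicast state, with made-up numbers;
# the talk did not give actual counts.
groups = 5000                 # hypothetical number of active groups (topics)
avg_sources_per_group = 40    # hypothetical publishers per topic

pim_sm_state = groups * avg_sources_per_group   # one (S,G) entry per source per group
pim_bidir_state = groups                        # one (*,G) entry per group

print(f"PIM-SM    (S,G) entries: {pim_sm_state}")    # 200000
print(f"PIM-BIDIR (*,G) entries: {pim_bidir_state}") # 5000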
Iterative design process for the multicast infrastructure:
PIM-SM (BSR)
PIM-SM (no SPT)
PIM-SM (static anycast RP)
PIM-SM (static anycast RP + MSDP)
...but all of these still require tracking (s,g) state, which is too big, so...
PIM-BIDIR
Each transition was done on the live network, with no scheduled outages; very challenging. One year later, they got it to 'simple' again on BIDIR; they've been using BIDIR in the core for two or three years now.

Systems architecture as a part of holistic simplicity: loosely homogeneous rows of largely identical racks, built as needed, accessed through APIs; a property makes a request when it needs a resource. On-demand assignment, on-demand release. A GenericCapacityRequest is placed pseudo-randomly into slots in the datacenter, with programmatic increases and decreases as required. This requires a host-level application deployment architecture. Thousands of applications doing this leads to very interesting network flows if unconstrained, and leads to complexity again if unconstrained.

DiscoveryService, NetworkLocalityService (what's the closest/best available resource of that type?); the answer is informed by the definition of the topology.

NetworkLocalityService: in a large-scale SOA with fragmented capacity allocation, just like any other component, you need to provide a programmatic answer to the question of network locality: NetworkLocality(srcIP, candidateDestIP1, candidateDestIP2, ...).

Applications meet the network: the application and network teams work together to figure out programmatically whether it's the network or the application. Network engineers get to care about application design; they sit down with developers, figure out the interaction model, the constraints, and how the design of the application can work with the network, and likewise how the network can grow to meet the application's needs. They work with application developers, whether they work for Amazon or not.

Set Amazon's servers on fire, not yours! www.acadweb.wwu.edu/dbrunner/
New York Times: a single person working there on putting all their historical articles online couldn't build out an entire datacenter, so he got his boss's credit card, loaded newspapers from 1851 to 1922 into S3, set up a Hadoop processing cluster on EC2, and churned through all 11 million articles in just under 24 hours using 100 EC2 instances. He added 1.5TB of publicly usable content in 24 hours and spent $100 with Amazon to do it.

http://www.amazon.com/jobs/

Q: Randy Bush asks about the outage last Friday.
A: Well, dependency mapping is a hard case. You need to do edge-case mapping, backoff and recovery; edge cases can cause other systems to tumble down. It's challenging in an SOA, where newer applications mash together different APIs. A single failure had expanding effects at higher levels; how do you make sure an edge case doesn't affect the common platform?

Q: Anne Johnson, interested in the concept of locality: how do you decide what the 'best' server is for a request? There has to be more to it than just IP addresses, right? How do you define closest?
A: Best isn't necessarily the closest; half a millisecond doesn't matter as much. It's basically an ordered list of available capacity with latency tiers. They're trying to figure out how to expose it to external developers. One thing that does look interesting is fault zones: which pieces of infrastructure can fail together. Some level of fate-sharing analysis; what pieces share fate, and which ones can be isolated.
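They didn't show the NetworkLocality API itself; a minimal sketch of the "ordered list of candidates by latency tier" idea might look like this. The location map, tier scheme, function names, and addresses are all hypothetical.

# Hypothetical sketch of a NetworkLocality-style call: rank candidate
# destinations for a source by coarse latency tier rather than exact latency.
# The location map and IP addresses below are made up for illustration.

# Map each host to a coarse location label (datacenter, row, rack).
LOCATION = {
    "10.1.1.5":  ("dc-east", "row-3", "rack-17"),
    "10.1.1.9":  ("dc-east", "row-3", "rack-17"),
    "10.1.4.22": ("dc-east", "row-7", "rack-2"),
    "10.9.8.40": ("dc-west", "row-1", "rack-5"),
}

def latency_tier(src_ip, dst_ip):
    """Smaller is 'closer': 0 same rack, 1 same row, 2 same DC, 3 otherwise."""
    src, dst = LOCATION[src_ip], LOCATION[dst_ip]
    shared = 0
    for a, b in zip(src, dst):   # compare dc, then row, then rack; stop at first mismatch
        if a != b:
            break
        shared += 1
    return 3 - shared

def network_locality(src_ip, *candidate_ips):
    """Return the candidate destinations ordered best-first by latency tier."""
    return sorted(candidate_ips, key=lambda ip: latency_tier(src_ip, ip))

print(network_locality("10.1.1.5", "10.9.8.40", "10.1.4.22", "10.1.1.9"))
# ['10.1.1.9', '10.1.4.22', '10.9.8.40']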
Q: Todd Underwood, Renesys: Amazon transitioned from a company that sells items to a company that provides services; is this something that was planned over time, or was it a recent decision?
A: For many people, the network is just a cost center that spends too much. This wasn't part of the business plan, and it wasn't amazing prescience; it was that they'd built amazing technology for their e-commerce platform that only they were using. They started sharing useful primitives with outside entities, and are very happy with how small companies are using them and shaping the nature of the development. Many of the early decisions on policy and design were made under the assumption they were always going to be internal; since then, they've changed to make sure tools can be externalized, with that thought going into the initial design.

Q: Dan Blair, Cisco: managing complex information flows in hopefully simple ways from the user's perspective; do they care about data-flow symmetry or asymmetry?
A: It depends on what level; symmetry is important for some pieces, for anycast deployments for example. But symmetry is mainly about network design rather than flow levels. Perturbations happen on both small and large scales; they tend to be localized, and mostly symmetric.

Q: Dino, Cisco: do you see your edge cases increasing or decreasing, and are they in the network, the application, etc.?
A: The BIDIR example is a radical example of simplification; it's been stable for three years now. Simplicity is the first design principle they consider when people get in the room. Are the edge cases increasing? They try to do rapid prototyping, providing minimal functionality early; the goal is to iterate from there. They don't release full-featured services initially, and customers influence them over time.

Q: If you wanted to add new functionality, would you add a new protocol, or bolt it onto an existing protocol?
A: That would be evaluated on a case-by-case basis. More protocols are not desirable and increase opex, in general. As they look at larger and larger systems, they often do find a need for some additional protocols.

Q: Centralization vs. distribution--which way do they lean?
A: They go for distributed systems, using clean, hardened APIs that are as self-describing as possible.

Thanks to everyone for the time, and thanks to Josh for twisting their arms hard enough to come here.

30 minute break now. Back here in 30 minutes. NOT lunch! PGP key signing now; Joel will handle it, follow him to get your keys signed!