2008.02.19 NANOG 42 100G forwarding challenges
Wow. It's good to be back on IPv4. Apologies for the multiple copies of the
previous set of notes; gmail got very, very confused about going through the
NAT-PT gateway, it seems. Next time I'll try sending them from a different
mail provider. ^_^;; As pointed out, the i-triple-e has more than 2 e's in
it... thanks for pointing out my goof, I'll make sure I count my e's more
carefully in the future. :)

Matt

2008.02.19 100G forwarding challenges panel
Ted Seely / Igor Gashinsky, moderators
Joel Goergen -- Force10
Dave Tsiang -- Cisco
Aris Wong -- Foundry

Why are we here?
We needed 100G a year ago, the standard won't be ready until 2010, and
there are still challenges to building the hardware.
Scale: IX operators need it for scaling the exchange points;
16x10G LAGs are already being planned.
Slide showing high-performance computing clusters that need
8x2x10G connectivity.

Joel from Force10 goes first; VP of technology, chief scientist.
100G forwarding architectural challenges.
OIF meeting in Brussels, 2005, with Alcatel, Force10, Marconi, and a few
others; he was unable to convince people that by 2007 users would require
speeds greater than 10G LAG groups.
It's been an uphill struggle, and without the participation of end users
helping the systems companies, his peers at Cisco, Foundry, and Force10
would never have been able to convince component manufacturers that
high-speed optics and SERDES would be needed.
Looking at where we are today vs July 2005, there's full agreement that the
project needs to happen, but no agreement yet on the commonality for moving
forward. Please give your input to your vendors so we can get 802.3ba right.

Chassis design even comes into play when planning for 100G speeds:
  lower system BER
  connectors
  N+1 switch fabric
  reduced EMI
  clean power
  routing architecture
  thermal and cooling
  cable management
Must meet regulatory standards. Those of us here need to really speak up
now to make sure the requirements get put in place first.

Backplane/channel signalling: most backplanes now handle differential
signalling up to 6.25G; that's great, up to a point.
No longer A+B fabric, now N+1 fabrics: you can't adequately cool a simple
A+B design with dozens of 6.25G channels into a single card. So now it's
distributed switch fabrics, each carrying a portion of the traffic,
spreading the signalling from multiple linecards across multiple fabrics.
In 2 years there have been no advances in backplane signalling to move us
beyond the 6.25G point.
Vendors are packing more and more ports onto each card to bring the price
per port down to where the market will bear it. A single-port 100G blade
can be done now, but nobody buys single-port gigE cards; there's already
40G OC768, but that's a single port of SONET. Costs have to be amortized,
as on the gig side, to hit the price point the market wants.
Need 25G optical/electrical signalling on the backplane to hit the
densities people want. He and his teams, and his competitors, are all
working on 25G backplane signalling; you'll see specs being proposed in
the next 4 weeks; do read them and ask questions now!

System BER: he and Cisco have been struggling with component vendors;
10^-12 isn't a reasonable target to build into a chassis; need 10^-15
built in, and 10^-17 under test conditions. Right now 802.3ba still lists
10^-12 as the target rate; at 100G that's a significant amount of packet
drops, which isn't acceptable; they're pushing internally to hit 10^-17.
The math is out there for testing those BERs.
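To see why the panel keeps hammering on BER targets, here's a quick
back-of-the-envelope I did myself (not from the slides) on how often a
100G link takes a bit error at each target:

```python
# Mean time between bit errors on a 100G link at the BER targets
# discussed: 10^-12 in the current draft vs the 10^-15 / 10^-17 the
# panelists are pushing for.  My own illustrative numbers.

LINE_RATE = 100e9  # bits per second

for ber in (1e-12, 1e-15, 1e-17):
    errors_per_sec = ber * LINE_RATE
    seconds_between = 1.0 / errors_per_sec
    print(f"BER {ber:g}: one bit error every {seconds_between:,.0f} s")

# BER 1e-12: one bit error every 10 s        (a corrupted frame every ~10s)
# BER 1e-15: one bit error every 10,000 s    (~2.8 hours)
# BER 1e-17: one bit error every 1,000,000 s (~11.6 days)
```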
Only one connector out there works for 25G right now; hopefully the
technology will be finished and ready in the 2009/2010 timeframe.
People need to be concerned about these key items when talking to their
vendors!
Reduce backplane thickness to reduce signal issues and get better signal
integrity: with 3 or 6G signalling the backplane will be 1/2" thick; at 12
or 25G it will be .25" thick. Cost drops almost 200% at that point, which
lets you shift costs to memory, for example.
Reduced EMI: not sure how 100G will perform through agency approvals. It
will be SERDES, clock, optics, and even feature related (more features
means more things happening on the microprocessor, and more emissions).
Power: in 2003/2004, a systems vendor could have significant noise on the
3.3V, 2.5V, and 1.5V supplies and things would generally work. Today you
need very tight tolerances or you'll blow the BER.
Thermal issues: we're putting 25% more power *in* to try to do more, so
energy-efficient components will be crucial.
Cable management will be crucial; we'll be looking at 400+ gig per blade;
how do we handle the cabling for that? Current copper 10G PHYs take 10-15W
already; you can't get the same density as with a fiber PHY that takes 3W.
High speed will force narrow interfaces to get higher signalling speeds
through the SERDES paths. You'll probably see 10x10 before 4x25, but 4x25
is what will be needed to hit the densities needed on the linecards; by
2012 we'll probably hit 4x25 channels to reach real density levels.
We all need to stay involved and keep giving our feedback!
Density has to move to 400G-500G per blade for the economics to work out.
Memory and memory technologies will need to speed up to touch packets at
those line rates; a huge amount of memory will be needed to handle packets
going through the box, and cut-through and store-and-forward will have to
merge in the future.

Packet forwarding challenges at 100G
David Tsiang, Cisco; linecard perspective.
It's 2.5x our last jump (40G).
More scalability:
  global tables growing larger
  potential for prefix explosion as v4 runs out;
    address reselling -> fragmentation
  IPv6
  growth of VPN usage
More flexibility as the Internet deals with evolution (e.g. v4-v6
transition, LISP, pt-mpt MPLS). We'll need additional complexity as we go
through these transition periods; people will also want future-proofing on
the boxes they buy.
How many pps can you really handle?
Why programmability? Simple forwarding isn't so simple anymore.
Trade-off of performance vs scalability: allow DRAM to be populated with
more routes to continue to scale, and get faster convergence when routing
changes. Lots of indirection, allowing pointer changes when updates
happen, gives faster convergence but slower lookups. You end up with high
bandwidth, highly flexible, but also highly complex.
So from 40G to 100G we need 2.5x the MIPS, memory, TPS, memory BW, and
FF-MHz, while aiming to keep the same power profile: no forklift upgrades
to the datacenter.
Mitigations:
  silicon advances (110nm -> 90nm -> 65nm)
    lower voltages, capacitance, leakage current
  more efficient memory technologies (DRAM, SRAM, TCAM)
  more efficient design (terminations, power supply design)
His focus is "can it fit in the power envelope?"
ASIC technology: 110nm at 1.2V vs 65nm at 1.0V
  P(d) = V^2 * C * f
  capacitance goes down along with voltage:
    clock input capacitance 26% less
    data input capacitance 50% less
  Overall you get a 0.45 multiplier; so at the 2.5x target it's 113%,
  a bit over.
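A worked version of that dynamic power arithmetic, as I understood it: the
0.45 multiplier and the 2.5x bandwidth factor are from the slides, but the
blended capacitance number is my own back-fill to make the figures line up.

```python
# Dynamic power scaling, P_dynamic = V^2 * C * f, for the quoted
# 110nm/1.2V -> 65nm/1.0V process move.

V_110NM, V_65NM = 1.2, 1.0
voltage_factor = (V_65NM / V_110NM) ** 2      # V^2 term: ~0.69

# Quoted reductions: clock input capacitance ~26% less, data input ~50%
# less.  Blending them to match the overall 0.45 figure implies roughly a
# 35% overall capacitance reduction (my interpolation, not a slide number).
capacitance_factor = 0.45 / voltage_factor    # ~0.65

per_gbps_multiplier = voltage_factor * capacitance_factor   # == 0.45
bandwidth_ratio = 100 / 40                                   # 2.5x

print(f"ASIC dynamic power at 100G vs 40G: "
      f"{per_gbps_multiplier * bandwidth_ratio:.1%}")
# -> 112.5%, i.e. the ~113% "a bit over" figure from the talk.
```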
Leakage current issues (man, these slides are dense; download them and read
them on your own).
Fewer gates per Gbps gives the same dynamic power, but less static
(leakage) power per Gbps.
Increasing the ASIC frequency from 250MHz to 400MHz gets you up by 126% on
the ASIC.
Memory technologies: 40G used FCRAM at 332Mbps/pin; 100G uses RLDRAM-II at
800Mbps/pin. DRAM power is 2.5x * 0.55 = 138%.
TCAM: 11.25W at 40G-equivalent lookups vs 7.5W at 100G-equivalent lookups;
73% reduction in TCAM power (better TCAM cell design).
Other power savers:
  more efficient SI design: internal terminations vs external Thevenin
    terminations on memory lines
  integrated SERDES on ASICs
  replacement of some SRAMs with SDD
  efficient power supply design: discrete designs optimized per load zone
    decrease power loss through the DC-DC converters
  integration of the service processor function (10W)
End result is that they're able to roughly maintain the same power profile
for the new ports, mostly due to the TCAM power reductions; ASIC power went
up, and they had to play tricks to offset it.
Silicon advances get us 90% of the way to the 100G challenge, but silicon
is lagging behind the bandwidth growth curve. The remaining 10% comes from
more efficient designs, which have limited reusability and repeatability;
over time it's going to suck more power and put out more heat.

John Burger and Aris Wong, Foundry Networks
100G challenges: QoS and policer issues
Agenda: key components, challenges, potential solutions.
Key components required for 100G:
  optics
  PHY/MAC
  packet processor
  traffic manager, switch fabric interface
  system backplane
Everybody needs high-speed memory and a high-speed fabric; speed-ups along
those pathways are unavoidable.
CMOS ASIC process technology: supporting chips all need to run faster to
support 100G; going from 65nm down to 45nm will allow higher-speed chips to
do faster signalling with less power.
Various decisions need to be made after a packet arrives: header lookups
are launched and rewrites are done; the traffic manager does scheduling,
shaping, and buffer management.
The policer and QoS system needs to function much faster in a 100G
environment; QoS at 100G needs to manage flows faster, and has to be able
to prioritize and assign drop precedence as the packets go past; the
scheduler has to be able to keep up with the forwarding.
The policer is typically a dual leaky bucket algorithm; decisions are to
forward or drop the packet, and to mark it green, yellow, or red.
Slide showing an example of the dual leaky bucket model.
(A rough policer sketch follows at the end of this section.)
Challenges at 100GE:
  the packet rate can be up to 150Mpps per port
  the dual leaky bucket algorithm can be compute-intensive, and the timing
    is critical to implement; scheduling is harder too
  a technology node at 65nm or below is needed to ease the timing
    challenges
  divide-and-conquer approach: multiple instantiations of policers, but
    that brings coherency challenges; how about pipelining instructions?
Customers have been speaking up, pointing out that they need 100G to keep
up with network growth; the challenges in forwarding architectures are
higher packet rates and higher bandwidth.
Thank you!
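For readers who haven't seen one, here's a minimal software sketch of the
kind of dual-bucket, three-color policer Foundry is describing, plus the
~150Mpps arithmetic. The token-bucket formulation, parameters, and names
are my own illustration (roughly RFC 2698-style trTCM), not Foundry's
implementation; real boxes do this in hardware, per packet.

```python
# Minimal dual-bucket, three-color policer sketch.
import time

class DualBucketPolicer:
    def __init__(self, cir_bps, cbs_bytes, pir_bps, pbs_bytes):
        self.cir = cir_bps / 8.0   # committed rate, bytes/sec
        self.pir = pir_bps / 8.0   # peak rate, bytes/sec
        self.cbs = cbs_bytes       # committed burst size
        self.pbs = pbs_bytes       # peak burst size
        self.tc = cbs_bytes        # committed-bucket tokens
        self.tp = pbs_bytes        # peak-bucket tokens
        self.last = time.monotonic()

    def color(self, pkt_len):
        """Return 'green', 'yellow', or 'red' for a packet of pkt_len bytes."""
        now = time.monotonic()
        dt, self.last = now - self.last, now
        # Refill both buckets, capped at their burst sizes.
        self.tc = min(self.cbs, self.tc + self.cir * dt)
        self.tp = min(self.pbs, self.tp + self.pir * dt)
        if pkt_len > self.tp:          # exceeds peak rate -> red (drop)
            return "red"
        if pkt_len > self.tc:          # within peak, over committed -> yellow
            self.tp -= pkt_len
            return "yellow"
        self.tc -= pkt_len             # within committed rate -> green
        self.tp -= pkt_len
        return "green"

# The per-port packet-rate figure from the slides: minimum-size Ethernet
# frames (64B + 20B preamble/IFG) at 100Gb/s come out to roughly 148.8Mpps.
print(100e9 / ((64 + 20) * 8) / 1e6, "Mpps")   # ~148.8
```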
Questions between panelists

Q: On system bit error rates, and the 100GE PHY as well: you want to push
vendors to better raw error rates, but what about error correction,
retransmission, or other ways to get to a better system error rate?
A: Going from 10^-12 to 10^-14, the issue is more for the component vendors
than the system vendors; it adds time and cost to delivery. It comes down
to time and cost; we may need to back off the 10^-17 goal for now.
Feed-forward error correction is possible, but FEC adds 8-20% to the
bandwidth requirement, which means more heat, more power, etc.
DFE, multi-tap transmitters: currently DFE requires 6 taps, which means
latency; for a store-and-forward switch that adds more and more delay into
the unit. So between BER, FEC, and DFE there are tradeoffs; for DFE it
would take 16+ taps, which adds far more delay; a BER of 10^-15 really is
the way to go.

Q: Estimate for 25G signalling?
A: He's putting 25% of his resources into getting 25G signalling; the
stepping stone is first to agree on the objective, then to agree on the
steps to take. If we can agree this year, we can get the specs nailed down
and start getting components rolled by 2010, which will allow linecards out
in the 2012 timeframe.

Igor asks how far away we are from boxes shipping that won't cost $10M
each.
A: Draft 1.0 is currently Nov 2008, and may slip into 2009. Once the draft
is done, they can start working on how to meet the spec on the hardware
side. The challenge is to get those first-generation linecards out without
breaking the bank or making them cost a bundle. Get 40G out there,
reasonably priced without a lot of features; then focus on expanding
features and density out to 2014. May do a pre-standard version to get
their feet wet first.

Q: What do you have in mind after 100G--are we at the end of what silicon
can handle?
A: They're looking at 45nm and 35nm; they should be able to go to 250G
before running out of silicon steam--that's on the forwarding side. They'll
need to figure out how to cool it, and how to cool the box around it, which
goes back to the datacenter side.

Q: Anton Kapella, 5 Nines: as rates scale higher and higher, binary
encoding frequency is getting more challenging; can we do multilevel
encoding to get better bit rates, or what about going straight to optics?
A: Force10 labs' work shows they'll be able to extend SERDES to 75G per
differential pair through coding and signalling methods, without OFDM,
using multilevel coding: PAM4, PAM8, QAM as it's been used... PAM4 has
issues; the required bandwidth should scale as 1/log2(n) with the number of
levels, but PAM4 isn't scaling to the math at the moment, so it's not
attractive for the next-gen coding scheme. (See the bits-per-symbol sketch
at the end of these notes.)
Lucent/Alcatel are doing duobinary: full pulse transmission, with a 25-tap
DFE on the tx and rx sides. A few weeks from now, a new encoding will be
published with bandwidth of 1/n, similar to NRZ. It could be implemented on
current backplane technologies, which is important; it's very expensive to
replace backplanes, etc.
On the optical side: Bell Labs did optical backplanes in 1998 and 2000, but
their manufacturability is really tough--deterministic jitter, optical
power, etc; you still need to do optical-to-electrical conversion even if
you do it right on the die; he's hoping to hold off on needing that until
2016. In multichassis systems, using optics to distribute the switch
fabrics out, they're expensive and hard to manufacture; if you can do it
electrically, stick with that for now.

Thanks to all the panelists, and it's break time now!
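The bits-per-symbol sketch referenced in the multilevel coding answer
above: ideal-case numbers of my own, just to show where the 1/log2(n)
scaling comes from (the talk's point is that real PAM4 links don't hit this
ideal today).

```python
# Ideal scaling of required symbol (baud) rate with multilevel coding:
# n signal levels carry log2(n) bits per symbol, so the baud rate needed
# for a given bit rate scales as 1/log2(n).  Illustrative numbers only.
from math import log2

BIT_RATE_GBPS = 25.0   # one 25G backplane lane, as discussed above

for name, levels in (("NRZ (PAM2)", 2), ("PAM4", 4), ("PAM8", 8)):
    bits_per_symbol = log2(levels)
    print(f"{name}: {bits_per_symbol:.0f} bits/symbol -> "
          f"{BIT_RATE_GBPS / bits_per_symbol:.2f} Gbaud")

# NRZ (PAM2): 1 bit/symbol  -> 25.00 Gbaud
# PAM4:       2 bits/symbol -> 12.50 Gbaud
# PAM8:       3 bits/symbol ->  8.33 Gbaud
```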