It's perfectly safe to have the L2 networks in the middle support the largest MTU values possible (other than maybe triggering an obscure Force10 bug or something :P), so they could roll that out today and you probably wouldn't notice. The real issue is with the L3 networks on either end of the exchange: if the L3 routers that are trying to talk to each other don't agree on their MTU values precisely, packets are blackholed. There are no real standards for jumbo frames out there; every vendor (and in many cases every particular type/revision of hardware made by that vendor) supports a slightly different size. There is also no negotiation protocol of any kind, so the only way to make these two numbers match precisely is to have the humans on both sides talk to each other and come up with a commonly supported value.
There are two things that make this practically impossible to support at scale, even ignoring all of the grief that comes from trying to find a clueful human to talk to on the other end of your connection to a third party (which is a huge problem in and of itself):
#1. There is currently no mechanism on any major router to set multiple MTU values PER NEXTHOP on a multi-point exchange, so to do jumbo frames over an exchange you would have to pick a single common value that EVERYONE can support. This also means you can't mix and match jumbo and non-jumbo participants over the same exchange; you essentially have to set up an entirely new exchange point (or vlan within the same exchange) dedicated to jumbo frame support, and you still have to get a common value that everyone can support. Ironically, many routers (many kinds of Cisco and Juniper routers at any rate) actually DO support per-nexthop MTUs in hardware, there is just no mechanism exposed to the end user to configure those values, let alone auto-negotiate them.

That is not a new problem. That is also true today with "last mile" links (e.g. dialup) that support <1500 byte MTU. What is different today is RFC 4821 PMTU discovery, which deals with the "black holes". RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
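A toy sketch of the RFC 4821 idea, for anyone who hasn't run into it: instead of depending on ICMP "fragmentation needed" messages, the sending TCP probes the path with different packet sizes and converges on the largest size that actually gets acknowledged. The path_delivers callback below is an invented stand-in for "did a probe of this size get ACKed"; real stacks do this inside TCP, so treat this as a model of the search, not an implementation.

    def probe_path_mtu(path_delivers, low=1024, high=9000):
        # Binary-search the largest packet size the path delivers, without ICMP.
        while low < high:
            mid = (low + high + 1) // 2
            if path_delivers(mid):
                low = mid        # probe was acknowledged: raise the known-good floor
            else:
                high = mid - 1   # probe was lost: lower the ceiling
        return low

    # Example: a path limited to 1500-byte packets somewhere in the middle.
    print(probe_path_mtu(lambda size: size <= 1500))   # -> 1500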
Is there any gear connected to a major IX that does NOT support large frames? I am not aware of any manufactured today. Even cheap D-Link gear supports them. I believe you would be hard-pressed to locate gear that doesn't support it at any major IX. Granted, it might require changing a global config value and a reboot to take effect on some vendors' gear. http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html
#2. The major vendors can't even agree on how they represent MTU sizes, so entering the same # into routers from two different vendors can easily result in incompatible MTUs. For example, on Juniper when you type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the opposite is true. So to make a Cisco talk to a Juniper that is configured 9192, you would have to configure mtu 9178. Except it's not even that simple, because as soon as you start adding vlan tagging the L2 header grows. If you configure vlan tagging on the interface, you've got to make the Cisco side 9174 to match the Juniper's 9192. And if you configure flexible-vlan-tagging so you can support q-in-q, you've now got to configure the Cisco side for 9170.
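To put numbers on that mismatch, here is a small back-of-the-envelope calculator (a sketch, not vendor documentation; the helper name is made up) that reproduces the figures above: Juniper-style interface MTUs count the 14-byte Ethernet header, Cisco-style MTUs don't, and each layer of vlan tagging adds another 4 bytes.

    ETHERNET_HEADER = 14   # dst MAC + src MAC + ethertype
    VLAN_TAG = 4           # one 802.1Q tag

    def cisco_equivalent(juniper_mtu, vlan_tags=0):
        # MTU to configure on the Cisco side so both ends agree on the same max L3 packet.
        return juniper_mtu - ETHERNET_HEADER - vlan_tags * VLAN_TAG

    for tags in (0, 1, 2):     # untagged, 802.1Q, q-in-q
        print(tags, cisco_equivalent(9192, tags))
    # -> 0 9178, 1 9174, 2 9170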
Again, the size of the MTU on the IX port doesn't change the size of the packets flowing through that gear. A packet sent from an end point with an MTU of 1500 will pass through unchanged by the router change. A flow to an end point with <1500 MTU will also be adjusted down by PMTU Discovery, just as it is now when communicating with a dialup end point that might have <600 MTU. The only thing that is going to change from the perspective of the routers is the communications originated by the routers themselves, which is basically just the BGP session. When the TCP session is established for BGP, the side with the smaller of the two MTUs will report an MSS value which is the largest segment it can accept, and the other unit will not send a packet larger than this even if it has a larger MTU. Just because the MTU is 9000 doesn't mean a router is going to aggregate 1500 byte packets flowing through it into 9000 byte packets; it is going to pass them through unchanged.

As for the configuration differences between units, how does that change from the way things are now? A person configuring a Juniper for 1500 byte packets already must know the difference, as that quirk of including the headers is just as true at 1500 bytes as it is at 9000 bytes. Does the operator suddenly become less competent with their gear when they use a different value? Also, a 9000 byte MTU would be a happy value that practically everyone supports these days, including ethernet adaptors on host machines.
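A simplified model of that MSS exchange (real stacks also account for TCP options and any PMTUD updates, so take the numbers as illustrative): each side advertises its MTU minus the 20-byte IP and 20-byte TCP headers, and the sender clamps to the smaller of the two.

    IPV4_HEADER = 20
    TCP_HEADER = 20

    def advertised_mss(local_mtu):
        return local_mtu - IPV4_HEADER - TCP_HEADER

    def max_segment(local_mtu, peer_mtu):
        # Each end limits what it sends to the peer's advertised MSS.
        return min(advertised_mss(local_mtu), advertised_mss(peer_mtu))

    print(advertised_mss(9000))      # 8960
    print(advertised_mss(1500))      # 1460
    print(max_segment(9000, 1500))   # 1460 -- the jumbo side never overruns the 1500-byte peer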
As an operator who DOES fully support 9k+ jumbos on every internal link in my network, and on as many external links as I can find clueful people to talk to on the other end to negotiate the correct values, let me just tell you this is a GIANT PAIN IN THE ASS. And we're not even talking about making sure things actually work right for the end user. Your IGP may not come up at all if the MTUs are misconfigured, but EBGP certainly will, even if the two sides are actually off by a few bytes. The maximum size of a BGP message is 4096 octets, and there is no mechanism to pad a message and try to detect MTU incompatibility, so what will actually happen in real life is that the end user will try to send a big jumbo frame through and find that some of their packets are randomly and silently blackholed. This would be an utter nightmare to support and diagnose.
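The failure window being described is easy to picture with a toy check (illustrative numbers only, borrowing the vlan-tag example from earlier): anything bigger than the receiver accepts but small enough for the sender to emit simply disappears, while small packets such as <=4096-byte BGP messages are never affected.

    def blackholed(packet_size, sender_mtu, receiver_mtu):
        # Dropped silently: the sender will emit it, the receiver won't accept it.
        return receiver_mtu < packet_size <= sender_mtu

    # e.g. one side takes 9178-byte packets, the other only 9174 because of a vlan tag:
    print(blackholed(4096, 9178, 9174))   # False -- BGP stays well under the gap
    print(blackholed(9176, 9178, 9174))   # True  -- a jumbo user packet vanishes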
So the router doesn't honor the MSS value of the TCP stream? That would seem like a bug to me. I am not suggesting we set everything to the maximum that it will support, because that is different for practically every vendor. I am suggesting that we pick a different "standard" value of 9000 bytes for the "middle" of the internet, which practically everything made these days supports. Yes, having everyone set theirs to different values can make for different issues, but those go away if we just pick one value, 9000, that everyone supports for the interfaces between networks (you can use a larger MTU internally if you are doing things like tunneling, which adds additional overhead, and you want to maintain the original 9000 byte frame end to end).
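As a hypothetical example of that tunneling parenthetical: a plain GRE-over-IPv4 tunnel adds a 20-byte outer IP header plus a 4-byte GRE header, so carrying a full 9000-byte packet end to end through it needs an internal MTU of at least 9024.

    GRE_OVERHEAD = 20 + 4        # outer IPv4 header + basic GRE header (no key/checksum options)
    print(9000 + GRE_OVERHEAD)   # 9024 -- internal MTU needed to keep 9000-byte packets intact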
Realistically I don't think you'll ever see even a serious attempt at jumbo frame support implemented at any kind of scale until there is a negotiation protocol and some real standards for the MTU size that must be supported, which is something no standards body (IEEE, IETF, etc) has seemed inclined to deal with so far. Of course all of this is based on the assumption that path MTU discovery will work correctly once the MTU values ARE correctly configured on the L3 routers, which is a pretty huge assumption, given all the people who stupidly filter ICMP. Oh, and even if you solved all of those problems, I could trivially DoS your router with some packets that would overload its ability to generate ICMP Unreach Needfrag messages for PMTUD, and then all your jumbo frame end users going through that router would be blackholed as well.
The ICMP filtration issue goes away with modern PMTUD, which is now supported in Windows, Solaris, Linux, MacOS, and BSD. That is no longer a problem for the end points. And I would highly recommend anyone operating Linux systems in production run at least 2.6.32 with /proc/sys/net/ipv4/tcp_mtu_probing set to either 1 (blackhole recovery) or 2 (active PMTU discovery probes) in order to avoid the PMTUD problems we already have on the Internet. The MTU issue between routers is only a problem for the traffic originated and terminated between those routers. The MSS might not be accurate if there is a tunnel someplace between the two routers that reduces the effective MTU between them, but that is a matter of getting router vendors to also support RFC 4821 themselves to detect and correct that problem. The tools are all there. We have already been operating for quite some time with mixed MTU and effective MTU sizes due to tunneling and various "last mile" issues. This adds nothing new to the mix, and it offers greatly improved performance both in the transactions across the network and from the gear itself, in reduced CPU consumption to move a given amount of traffic. See, I told you it was a hornets' nest :)
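For anyone who wants to check that knob before touching it, here is a minimal sketch (assuming a Linux host, and root for the write; the persistent equivalent is net.ipv4.tcp_mtu_probing in sysctl.conf):

    TCP_MTU_PROBING = "/proc/sys/net/ipv4/tcp_mtu_probing"

    def get_mtu_probing():
        # 0 = off, 1 = probe only after a blackhole is detected, 2 = always probe
        with open(TCP_MTU_PROBING) as f:
            return int(f.read().strip())

    def set_mtu_probing(value):
        with open(TCP_MTU_PROBING, "w") as f:
            f.write(str(value))

    print("current tcp_mtu_probing:", get_mtu_probing())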