On Sat, Nov 06, 2010 at 02:21:51PM -0700, George Bonser wrote:
That is not a new problem. That is also true today with "last mile" links (e.g. dialup) that support an MTU below 1500 bytes. What is different today is RFC 4821 PMTU discovery, which deals with the "black holes".
RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
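
To make the probing idea concrete, here is a minimal sketch of the kind of search an endpoint can run when it cannot rely on ICMP. It is not the RFC 4821 algorithm itself (the RFC leaves the search strategy to the implementation); it is a plain binary search, and `probe`, `make_path`, and the 1280/9000 bounds are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Tuple


def search_path_mtu(probe: Callable[[int], bool],
                    low: int = 1280,
                    high: int = 9000) -> Tuple[int, int]:
    """Binary-search the largest packet size that `probe` reports as delivered.

    Assumes a packet of `low` bytes is already known to get through.
    Returns (discovered size, number of probes sent).
    """
    probes = 0
    while low < high:
        mid = (low + high + 1) // 2  # round up so the range always shrinks
        probes += 1
        if probe(mid):
            low = mid        # mid got through, so the answer is >= mid
        else:
            high = mid - 1   # mid was lost, so the answer is < mid
    return low, probes


def make_path(true_mtu: int) -> Callable[[int], bool]:
    """Hypothetical stand-in for a real path: delivers anything <= true_mtu."""
    return lambda size: size <= true_mtu


mtu, probes = search_path_mtu(make_path(8000))
print(mtu, probes)
```

On a real path every failed probe looks like ordinary packet loss, so each one costs a timeout before the sender can try the next size, and the search is repeated per flow.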
The only thing this adds is a trial-and-error probing mechanism per flow, to try to recover from the infinite blackholing that would occur if your ICMP is blocked in classic PMTUD. If this actually happened at any scale, it would create a performance and overhead penalty that is far worse than the original problem you're trying to solve.
Say you have two routers talking to each other over an L2 switched infrastructure (i.e. an exchange point). In order for PMTUD to function quickly and effectively, the two routers on each end MUST agree on the MTU value of the link between them. If router A thinks it is 9000, and router B thinks it is 8000, when router A comes along and tries to send an 8001-byte packet it will be silently discarded, and the only way to recover from this is with trial-and-error probing by the endpoints after they detect what they believe to be MTU blackholing. This is little more than a desperate ghetto hack designed to save the connection from complete disaster.
The point where a protocol is needed is between router A and router B, so they can determine the MTU of the link without needing to involve the humans in a manual negotiation process. Ideally this would support multi-point LANs over Ethernet as well, so .1 could have an MTU of 9000, .2 could have an MTU of 8000, etc. And of course you have to make sure that you can actually PASS the MTU across the wire (if the switch in the middle can't handle it, the packet will also be silently dropped), so you can't just rely on the other side to tell you what size it THINKS it can support. You don't have a shot in hell of having MTUs negotiated correctly or PMTUD working well until this is done.
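
The router A / router B case can be written down as a small model. This is only a sketch under simplifying assumptions: each device is reduced to a single configured MTU, the function names and the 9216-byte switch limit are hypothetical, and a real link-MTU protocol would have to test the wire instead of trusting configured values.

```python
def usable_link_mtu(sender_mtu: int, receiver_mtu: int, switch_mtu: int) -> int:
    """Largest packet that every device on the L2 hop will actually carry."""
    return min(sender_mtu, receiver_mtu, switch_mtu)


def silently_dropped(size: int, sender_mtu: int, receiver_mtu: int,
                     switch_mtu: int) -> bool:
    """True when the sender believes `size` fits but the hop discards it.

    Nothing on the L2 hop generates an ICMP "too big" for the sender,
    so this is the blackhole case described above.
    """
    fits_sender = size <= sender_mtu
    fits_hop = size <= usable_link_mtu(sender_mtu, receiver_mtu, switch_mtu)
    return fits_sender and not fits_hop


# Router A thinks the link is 9000, router B thinks it is 8000.
print(silently_dropped(8001, sender_mtu=9000, receiver_mtu=8000, switch_mtu=9216))
print(silently_dropped(8000, sender_mtu=9000, receiver_mtu=8000, switch_mtu=9216))

# Multi-point LAN: the usable MTU is a per-pair value, not a per-LAN value.
peers = {".1": 9000, ".2": 8000}
pair_mtu = usable_link_mtu(peers[".1"], peers[".2"], switch_mtu=9216)
print(pair_mtu)
```

The same check also covers the switch in the middle: if `switch_mtu` is the smallest of the three values, both routers can agree with each other and still lose packets.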
Is there any gear connected to a major IX that does NOT support large frames? I am not aware of any manufactured today. Even cheap D-Link gear supports them. I believe you would be hard-pressed to locate gear that doesn't support it at any major IX. Granted, it might require the change of a global config value and a reboot for it to take effect in some vendors.
If that doesn't prove my point about every vendor having their own definition of what number is and isn't supported, I don't know what does. Also, I don't know what exchanges YOU connect to, but I very clearly see a giant pile of gear on that list that is still in use today. :)
As for the configuration differences between units, how does that change from the way things are now? A person configuring a Juniper for 1500 byte packets already must know the difference as that quirk of including the headers is just as true at 1500 bytes as it is at 9000 bytes. Does the operator suddenly become less competent with their gear when they use a different value? Also, a 9000 byte MTU would be a happy value that practically everyone supports these days, including ethernet adaptors on host machines.
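
The "including the headers" quirk comes down to simple arithmetic. The sketch below assumes a 14-byte Ethernet header (two MAC addresses plus EtherType), 4 bytes per 802.1Q tag, and an MTU convention that does not count the FCS. Which platform uses which convention is not stated here and should be checked against the vendor documentation; the function names are hypothetical.

```python
ETHERNET_HEADER = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
VLAN_TAG = 4          # one 802.1Q tag


def l2_inclusive_mtu(ip_mtu: int, vlan_tags: int = 0) -> int:
    """Value to configure on a platform whose MTU counts the L2 header."""
    return ip_mtu + ETHERNET_HEADER + VLAN_TAG * vlan_tags


def ip_mtu_from_l2(l2_mtu: int, vlan_tags: int = 0) -> int:
    """IP payload size implied by an L2-inclusive MTU setting."""
    return l2_mtu - ETHERNET_HEADER - VLAN_TAG * vlan_tags


print(l2_inclusive_mtu(1500))               # 1514
print(l2_inclusive_mtu(9000))               # 9014
print(l2_inclusive_mtu(9000, vlan_tags=1))  # 9018
```

The offset is the same at 1500 and at 9000; what changes is that at 1500 the defaults already line up, while at 9000 somebody has to type the right number on both sides.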
Everything defaults to 1500 today, so nobody has to do anything. Again, I'm actually doing this with people today on a very large network with lots of peers all over the world, so I have a little bit of experience with exactly what goes wrong. Nearly everyone who tries to figure out the correct MTU between vendors and with a third-party network gets it wrong, at least some significant percentage of the time.
And honestly I can't even find an interesting number of people willing to turn on BFD, something with VERY clear benefits for improving failure detection time over an IX (for the next time Equinix decides to do one of their 10PM maintenances that causes hours of unreachability until hold timers expire :P). If the IX operators saw any significant demand they would have turned it on already.
-- 
Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)