The only thing this adds is a trial-and-error probing mechanism per flow, to try and recover from the indefinite blackholing that would occur in classic PMTUD if your ICMP is blocked. If this actually happened at any scale, it would create a performance and overhead penalty far worse than the original problem you're trying to solve.
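For reference, on Linux that per-flow trial-and-error probing is controlled by the tcp_mtu_probing sysctl; a minimal sketch of the knob, assuming a reasonably current kernel:

    # 1 = probe only after a suspected blackhole; 2 = probe on every connection
    sysctl -w net.ipv4.tcp_mtu_probing=1

    # 0 (the default on many kernels) disables the probing entirely
    sysctl -w net.ipv4.tcp_mtu_probing=0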
I ran into this very problem not long ago when attempting to reach a server on a very large network. Our Solaris hosts had no problem transacting with the server. Our Linux machines did have a problem, and the behavior looked like a typical PMTU black hole. It turned out that the "very large network" tunneled the connection inside their network, reducing the effective MTU of the encapsulated packets, and blocked ICMP from inside their net to the outside.

Changing the advertised MSS of the connection to that server to 1380 allowed it to work (ip route add <ip address> via <gateway> dev <device> advmss 1380), and that verified that the problem was an MTU black hole. A little reading revealed why Solaris wasn't having the problem but Linux was. Setting the Linux ip_no_pmtu_disc sysctl to 1 made the Linux behavior match the Solaris behavior.
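Spelled out, the two workarounds above look like this; a sketch using made-up RFC 5737 addresses (192.0.2.10 for the server, 198.51.100.1 for the gateway) and eth0 standing in for the real interface:

    # Clamp the MSS advertised to this one destination so the resulting
    # 1420-byte packets (1380 + 40 bytes of headers) fit inside their tunnel
    ip route add 192.0.2.10 via 198.51.100.1 dev eth0 advmss 1380

    # Or make Linux behave like Solaris did here: skip PMTU discovery and
    # stop setting the DF bit, so routers along the way can fragment
    sysctl -w net.ipv4.ip_no_pmtu_disc=1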
Say you have two routers talking to each other over an L2 switched infrastructure (i.e. an exchange point). In order for PMTUD to function quickly and effectively, the two routers on each end MUST agree on the MTU value of the link between them. If router A thinks it is 9000 and router B thinks it is 8000, when router A comes along and tries to send an 8001-byte packet it will be silently discarded, and the only way to recover from this is trial-and-error probing by the endpoints after they detect what they believe to be MTU blackholing. This is little more than a desperate ghetto hack designed to save the connection from complete disaster.
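There is no clean way for the routers themselves to detect that mismatch, but it is easy to demonstrate by hand with DF-marked pings of the suspect sizes; a sketch, with 192.0.2.2 standing in for router B:

    # 7972 bytes of payload + 28 bytes of ICMP/IP headers = an 8000-byte
    # packet, which both ends agree on; this one gets answered
    ping -M do -s 7972 -c 3 192.0.2.2

    # 7973 + 28 = 8001 bytes: router A sends it happily, the 8000-byte side
    # drops it on the floor, and no ICMP ever comes back to say why
    ping -M do -s 7973 -c 3 192.0.2.2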
Correct. Devices on the same VLAN will need to use the same MTU. And why is that a problem? That was just as true then as it is today. Nothing changes. All you are doing is changing from everyone using 1500 to everyone using 9000 on that VLAN. Nothing else changes. Why is that any kind of issue?
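And the change itself is a one-line interface setting on each participant; a sketch for a Linux box, with eth0.100 as a stand-in for the peering VLAN interface:

    # Everyone on the peering VLAN moves to the same value
    ip link set dev eth0.100 mtu 9000

    # Confirm the interface picked it up
    ip link show dev eth0.100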
The point where a protocol is needed is between router A and router B, so they can determine the MTU of the link, without needing to involve the humans in a manual negotiation process.
When the TCP/IP connection for a routing session is opened between the routers, they each send the other an MSS value that says how large a packet they can accept. You already have that information available; TCP provides that negotiation for directly connected machines. Again, nothing changes from the current method of operating. If I showed up at a peering switch and wanted to use a 1000-byte MTU, I would probably have some problems.

The point I am making is that 1500 is a relic value that hamstrings Internet performance, and there is no good reason not to use a 9000-byte MTU at peering points (by all participants), since (A) it introduces no new problems and (B) I can't find a vendor of modern gear at a peering point that doesn't support it, though there may be some ancient gear in use by some of the peers at some peering points. I cannot think of a single problem that changing from 1500 to 9000 as the standard at peering points would introduce. It would also speed up the loading of BGP routes between routers at the peering points.

If Joe Blow at home on a dialup connection with an MTU of 576 is talking to a server at Y! with an MTU of 10 billion, changing a peering path somewhere in the middle from 1500 to 9000 bytes is not going to change that PMTU discovery one iota. It introduces no problem whatsoever. It changes nothing.
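To put numbers on the MSS point: with IPv4 and no options, the advertised MSS is just the interface MTU minus 40 bytes of IP and TCP headers, and you can watch the routers exchange it on the BGP session itself. A sketch, with eth0.100 again standing in for the peering interface:

    # 1500-byte MTU  ->  MSS 1460 advertised in the SYN
    # 9000-byte MTU  ->  MSS 8960 advertised in the SYN
    # Each side then sends segments no larger than the *other* side's MSS.

    # Watch the MSS option go by when a BGP session (TCP/179) is established
    tcpdump -i eth0.100 -nn -v 'tcp port 179 and (tcp[tcpflags] & tcp-syn != 0)'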
If that doesn't prove my point about every vendor having their own definition of what number is and isn't supported, I don't know what does. Also, I don't know what exchanges YOU connect to, but I very clearly see a giant pile of gear on that list that is still in use today. :)
That is a list of 9000-byte-clean gear. The very bottom is the stuff that doesn't support it. Of the stuff that doesn't support it, how much is connected directly to a peering point? THAT is the bottleneck I am talking about right now. One step at a time. Removing the bottleneck at the peering points is all I am talking about. That will not change PMTU issues elsewhere, and those will stand exactly as they do today, without any change. In fact it will ensure that there are *fewer* PMTU discovery issues, by being able to support a larger range of packet sizes without having to fragment them. We *already* have SONET MTUs of >4000, and that hasn't broken anything since the invention of SONET.