Ah, large MTUs. Like many other "academic" backbones, we implemented large (9192-byte) MTUs on our backbone and 9000 bytes on some hosts. See [1] for an illustration. Here are *my* current thoughts on increasing the Internet MTU beyond its current value, 1500. (On the topic, see also [2] - a wiki page which is actually served from a 9000-byte-MTU server :-)

Benefits of >1500-byte MTUs:

Several benefits of moving to larger MTUs, say in the 9000-byte range, were cited. I don't find them too convincing anymore.

1. Fewer packets reduce work for routers and hosts.

Routers: Most backbones seem to size their routers to sustain (near-) line-rate traffic even with small (64-byte) packets. That's a good thing, because if networks were dimensioned to just work at average packet sizes, they would be pretty easy to DoS by sending floods of small packets. So I don't see how raising the MTU helps much unless you also raise the minimum packet size - which might be interesting, but I haven't heard anybody suggest that. This should be true for routers and middleboxes in general, although there are certainly many places (especially firewalls) where pps limitations ARE an issue. But again, raising the MTU doesn't help if you're worried about the worst case, and I would like to see examples where it would help significantly even in the normal case. In our network it certainly doesn't - we have Mpps to spare.

Hosts: For hosts, filling high-speed links at a 1500-byte MTU has repeatedly been difficult (with Fast Ethernet in the nineties, GigE 4-5 years ago, 10GE today), due to the high rate of interrupts/context switches and internal bus crossings. Fortunately, tricks like polling instead of interrupts (Saku Ytti mentioned this), Interrupt Coalescence and Large-Send Offload have become commonplace these days. These give most of the end-system performance benefits of large packets without requiring any support from the network.

2. Fewer bytes (saved header overhead) free up bandwidth.

TCP over Ethernet with a 1500-byte MTU is "only" 94.2% efficient, while with a 9000-byte MTU it would be roughly 99% efficient. (A back-of-the-envelope calculation follows after this list.) While an improvement would certainly be nice, 94% already seems "good enough" to me. (I'm ignoring the byte savings due to fewer ACKs. On the other hand, not all packets will be able to grow sixfold - some transfers are small.)

3. TCP runs faster.

This boils down to two aspects (besides the effects of (1) and (2)):

a) TCP reaches its "cruising speed" faster. Especially with LFNs (Long Fat Networks, i.e. paths with a large bandwidth*RTT product), it can take quite a long time until TCP slow start has increased the window so that the maximum achievable rate is reached. Since the window increase happens in units of MSS (~MTU), TCPs with larger packets reach this point proportionally faster. This is significant, but there are alternative proposals to solve this issue of slow ramp-up, for example HighSpeed TCP [3]. (The second sketch after this list gives a feel for the numbers.)

b) You get a larger share of a congested link. I think this is true when a TCP-with-large-packets shares a congested link with TCPs-with-small-packets, and the packet loss probability isn't proportional to the size of the packet. In fact the large-packet connection can get a MUCH larger share (sixfold for 9K vs. 1500) if the loss probability is the same for everybody (which it often will be, approximately). Some people consider this a fairness issue, others think it's a good incentive for people to upgrade their MTUs. (The third sketch after this list illustrates this with the standard TCP throughput approximation.)
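To make the overhead figures in point 2 concrete, here is a back-of-the-envelope sketch. It assumes IPv4, TCP with the timestamp option in every segment, and full Ethernet framing including preamble/SFD and inter-frame gap; other accountings (no timestamps, VLAN tags, IPv6) shift the percentages slightly, so treat the exact numbers as approximate.

# Rough TCP-over-Ethernet efficiency for a given MTU. Assumptions: IPv4,
# TCP timestamp option in every segment, Ethernet framing counted including
# preamble/SFD and inter-frame gap. ACK traffic is ignored, as in the text.
ETH_OVERHEAD = 8 + 14 + 4 + 12        # preamble+SFD, header, FCS, inter-frame gap
IP_TCP_OVERHEAD = 20 + 20 + 12        # IPv4 header, TCP header, timestamp option

def efficiency(mtu):
    payload = mtu - IP_TCP_OVERHEAD   # TCP payload bytes carried per packet
    wire = mtu + ETH_OVERHEAD         # bytes occupying the wire per packet
    return payload / wire

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {100 * efficiency(mtu):.1f}% efficient")
# -> about 94.1% for MTU 1500 and 99.0% for MTU 9000,
#    in the same ballpark as the figures quoted above.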
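Point 3a can be made a bit more tangible with a toy model: slow start roughly doubles the congestion window every RTT, while congestion avoidance (e.g. after a single loss) adds only about one MSS per RTT, which is where the segment size really enters. The numbers below use an illustrative 1 Gbit/s path with 100 ms RTT and a three-segment initial window; real stacks (larger initial windows, byte counting, different congestion controls) behave differently, so this only shows the shape of the effect.

# How many round trips does TCP need to reach a given rate? Toy model:
# slow start doubles cwnd every RTT (from a 3-segment initial window),
# congestion avoidance adds about one MSS per RTT. Real stacks differ.
import math

def rtts_to_rate(rate_bps, rtt_s, mss_bytes, initial_segments=3):
    target_segments = rate_bps * rtt_s / 8 / mss_bytes   # segments per window (BDP)
    slow_start = math.ceil(math.log2(target_segments / initial_segments))
    # After a single loss, cwnd is halved and grows by ~1 MSS per RTT:
    recovery = math.ceil(target_segments / 2)
    return slow_start, recovery

for mss in (1460, 8960):    # typical MSS for 1500- and 9000-byte MTUs
    ss, rec = rtts_to_rate(1e9, 0.1, mss)    # 1 Gbit/s path, 100 ms RTT
    print(f"MSS {mss}: ~{ss} RTTs of slow start, ~{rec} RTTs to recover from one loss")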
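The "sixfold" claim in point 3b is what the well-known steady-state approximation rate ~ C * MSS / (RTT * sqrt(p)) (Mathis et al.) predicts when two flows see the same RTT and the same loss probability p: their rates then differ by the ratio of their segment sizes. A minimal check with illustrative numbers (100 ms RTT, loss probability 1e-4):

# Steady-state TCP throughput approximation: rate ~ C * MSS / (RTT * sqrt(p)),
# with C about sqrt(3/2). Same RTT and same loss rate assumed for both flows.
from math import sqrt

def mathis_rate(mss_bytes, rtt_s, loss_prob, c=sqrt(1.5)):
    return c * mss_bytes * 8 / (rtt_s * sqrt(loss_prob))   # bits per second

small = mathis_rate(1460, 0.1, 1e-4)
large = mathis_rate(8960, 0.1, 1e-4)
print(f"1500-byte MTU flow: {small/1e6:.1f} Mbit/s")
print(f"9000-byte MTU flow: {large/1e6:.1f} Mbit/s  (ratio {large/small:.1f}x)")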
About the issues:

* Current Path MTU Discovery doesn't work reliably.

Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP messages to discover when a smaller MTU has to be used. When these ICMP messages fail to arrive (or to be sent), the sender will happily continue to send too-large packets into the blackhole. This problem is very real. As an experiment, try configuring an MTU < 1500 on a backbone link which has Ethernet-connected customers behind it. I bet that you'll receive LOUD complaints before long.

Some other people mention that Path MTU Discovery has been refined with "blackhole detection" methods in some systems. This is widely implemented, but typically not enabled (although it probably could be with a "Service Pack"). Note that a new Path MTU Discovery proposal was just published as RFC 4821 [4]. This is also supposed to solve the problem of relying on ICMP messages (a rough sketch of the probing idea follows below). Please, let's wait for these more robust PMTUD mechanisms to be universally deployed before trying to increase the Internet MTU.
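To illustrate why RFC 4821-style probing is more robust than classic PMTUD, here is a very rough sketch of probing at the packetization layer over UDP: it trusts only end-to-end feedback (an echoed datagram), never ICMP. Everything here is illustrative - the echo host/port are placeholders, the socket option values are the Linux ones, and a real implementation (or RFC 4821 itself) also has to deal with retransmission, caching and interaction with the transport's loss recovery, which this sketch ignores.

# Very rough sketch of packetization-layer path MTU probing over UDP
# (the RFC 4821 idea: trust end-to-end feedback, not ICMP).
# Assumes a cooperating UDP echo service at (HOST, PORT); both are placeholders.
import socket

IP_MTU_DISCOVER = 10     # Linux-specific option/values (linux/in.h)
IP_PMTUDISC_PROBE = 3    # set DF, ignore any ICMP-learned path MTU

HOST, PORT = "echo.example.net", 7    # hypothetical echo responder
IP_UDP_OVERHEAD = 20 + 8              # IPv4 header + UDP header

def probe_path_mtu(host, port, lo=1280, hi=9000, tries=3, timeout=1.0):
    """Binary-search the largest IP packet size that is echoed end-to-end."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_PROBE)
    s.settimeout(timeout)
    s.connect((host, port))
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        payload = b"x" * (mid - IP_UDP_OVERHEAD)
        ok = False
        for _ in range(tries):
            try:
                s.send(payload)
                if s.recv(65535):       # any echo counts as success
                    ok = True
                    break
            except OSError:             # EMSGSIZE, timeout, ICMP errors, ...
                pass
        if ok:
            best, lo = mid, mid + 1     # this size made it; try larger
        else:
            hi = mid - 1                # silence is treated as a black hole
    return best

if __name__ == "__main__":
    print("largest packet size that survives the path:", probe_path_mtu(HOST, PORT))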
* IP assumes a consistent MTU within a logical subnet.

This seems to be a pretty fundamental assumption, and Iljitsch's original mail suggests that we "fix" this. Umm, ok, I hope we don't miss anything important that makes use of this assumption. Seriously, I think it's illusory to try to change this for general networks, in particular large LANs. It might work for exchange points or other controlled cases where the set of protocols is fairly well defined, but then exchange points have other options such as separate "jumbo" VLANs.

For campus/datacenter networks, I agree that the consistent-MTU requirement is a big problem for deploying larger MTUs. This is true within my organization - most servers that could use larger MTUs (NNTP servers for example) live on the same subnet as servers that will never bother to be upgraded. The obvious solution is to build smaller subnets - for our test servers I usually configure a separate point-to-point subnet for each of their Ethernet interfaces (I don't trust this bridging magic anyway :-).

* Most edges will not upgrade anyway.

On the slow edges of the network (residual modem users, exotic places, cellular data users etc.), people will NOT upgrade their MTU to 9000 bytes, because a single such packet would totally kill the VoIP experience. For medium-fast networks, large MTUs don't cause problems, but they don't help either. So only a few super-fast edges have an incentive to do this at all.

For the core networks that support large MTUs (like we do), this is frustrating because all our routers now probably carve their internal buffers for 9000-byte packets that never arrive. Maybe we're wasting lots of expensive linecard memory this way?

* Chicken/egg

As long as only a small minority of hosts supports >1500-byte MTUs, there is no incentive for anyone important to start supporting them. A public server supporting 9000-byte MTUs will be frustrated when it tries to use them - the overhead (from attempted large packets that don't make it) and the potential trouble will just not be worth it. This is a little similar to IPv6.

So I don't see large MTUs coming to the Internet at large soon. They probably make sense in special cases, maybe for "land-speed records" and dumb high-speed video equipment, or for server-to-server stuff such as USENET news. (And if anybody out there manages to access [2] or http://ndt.switch.ch/ with 9000-byte MTUs, I'd like to hear about it :-)
-- 
Simon.

[1] Here are a few tracepaths (more or less traceroute with integrated PMTU discovery) from a host on our network in Switzerland. 9000-byte packets make it across our national backbone (SWITCH), the European academic backbone (GEANT2), Abilene and CENIC in the US, as well as through AARnet in Australia (even over IPv6). But the link from the last wide-area backbone to the receiving site inevitably has a 1500-byte MTU ("pmtu 1500").

: leinen@mamp1[leinen]; tracepath www.caida.org
 1:  mamp1-eth2.switch.ch (130.59.35.78)                     0.110ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                    1.029ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                   1.141ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)                4.127ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)                4.726ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)               4.901ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)       asymm  7   4.429ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)     asymm  8  12.551ms
 8:  abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18) asymm  9 105.099ms
 9:  64.57.28.12 (64.57.28.12)                         asymm 10 121.619ms
10:  kscyng-iplsng.abilene.ucaid.edu (198.32.8.81)     asymm 11 153.796ms
11:  dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13)     asymm 12 158.520ms
12:  snvang-dnvrng.abilene.ucaid.edu (198.32.8.1)      asymm 13 180.784ms
13:  losang-snvang.abilene.ucaid.edu (198.32.8.94)     asymm 14 177.487ms
14:  hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2) asymm 20 179.106ms
15:  riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5)    asymm 21 185.183ms
16:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 186.368ms
17:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54) asymm 18 185.861ms pmtu 1500
18:  cider.caida.org (192.172.226.123)                 asymm 19 186.264ms reached
     Resume: pmtu 1500 hops 18 back 19

: leinen@mamp1[leinen]; tracepath www.aarnet.edu.au
 1:  mamp1-eth2.switch.ch (130.59.35.78)                     0.095ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                    1.024ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                   1.115ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)                3.989ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)                4.731ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)               4.771ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)       asymm  7   4.424ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)     asymm  8  12.536ms
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249) asymm  9  13.207ms
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145) asymm 10 217.846ms
10:  so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129) asymm 11 275.651ms
11:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)   asymm 12 293.854ms
12:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)   297.989ms pmtu 1500
13:  tiny-teddy.aarnet.edu.au (203.21.37.30)            asymm 12 297.462ms reached
     Resume: pmtu 1500 hops 13 back 12

: leinen@mamp1[leinen]; tracepath6 www.aarnet.edu.au
 1?: [LOCALHOST]                                        pmtu 9000
 1:  swiMA1-G2-6.switch.ch                              1.328ms
 2:  swiMA2-G2-5.switch.ch                              1.703ms
 3:  swiEL2-10GE-1-4.switch.ch                          4.529ms
 4:  swiCE3-10GE-1-3.switch.ch                          5.278ms
 5:  swiCE2-10GE-1-4.switch.ch                          5.493ms
 6:  switch.rt1.gen.ch.geant2.net             asymm  7   5. 99ms
 7:  so-7-2-0.rt1.fra.de.geant2.net           asymm  8  13.239ms
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au         asymm  9  13.970ms
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au         asymm 10 218.718ms
10:  so-3-3-0.bb1.a.per.aarnet.net.au         asymm 11 267.225ms
11:  so-0-1-0.bb1.a.adl.aarnet.net.au         asymm 12 299. 78ms
12:  so-0-1-0.bb1.a.adl.aarnet.net.au                  298.473ms pmtu 1500
12:  www.ipv6.aarnet.edu.au                             292.893ms reached
     Resume: pmtu 1500 hops 12 back 12

[2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU

[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd, December 2003

[4] RFC 4821, Packetization Layer Path MTU Discovery, M. Mathis, J. Heffner, March 2007