Thoughts on increasing MTUs on the internet
Dear NANOGers, It irks me that today, the effective MTU of the internet is 1500 bytes, while more and more equipment can handle bigger packets. What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that: 1. It's no longer necessary to limit the subnet MTU to that of the least capable system 2. It's no longer necessary to manage 1500 byte+ MTUs manually Any additional issues that such a mechanism would have to address?
:-> "Iljitsch" == Iljitsch van Beijnum <iljitsch@muada.com> writes: > Dear NANOGers, > It irks me that today, the effective MTU of the internet is 1500 > bytes, while more and more equipment can handle bigger packets. > What do you guys think about a mechanism that allows hosts and > routers on a subnet to automatically discover the MTU they can use > towards other systems on the same subnet, so that: > 1. It's no longer necessary to limit the subnet MTU to that of the > least capable system > 2. It's no longer necessary to manage 1500 byte+ MTUs manually > Any additional issues that such a mechanism would have to address? wouldn't that work only if the switch in the middle of your neat office lan is a real switch (i.e. not flooding oversize packets to hosts that can't handle them, possibly crashing their NIC drivers) and it's itself capable of larger MTUs? Pf -- ------------------------------------------------------------------------------- Pierfrancesco Caci | Network & System Administrator - INOC-DBA: 6762*PFC p.caci@seabone.net | Telecom Italia Sparkle - http://etabeta.noc.seabone.net/ Linux clarabella 2.6.12-10-686-smp #1 SMP Fri Sep 15 16:47:57 UTC 2006 i686 GNU/Linux
On 12-apr-2007, at 12:02, Pierfrancesco Caci wrote:
wouldn't that work only if the switch in the middle of your neat office lan is a real switch (i.e. not flooding oversize packets to hosts that can't handle them, possibly crashing their NIC drivers) and it's itself capable of larger MTUs?
Well, yes, being compatible with stuff that doesn't support larger packets pretty much goes without saying. I don't think there is any need to worry about crashing drivers; packets that are longer than they should be are a common error condition that drivers are supposed to handle without incident. (They often keep a "giant" count.) A more common problem would be two hosts that support jumbo frames with a switch in the middle that doesn't. So it's necessary to test for this and avoid sending excessive numbers of large packets when something in the middle doesn't support them.
On Thu, Apr 12, 2007 at 01:03:45PM +0200, Iljitsch van Beijnum wrote:
On 12-apr-2007, at 12:02, Pierfrancesco Caci wrote:
wouldn't that work only if the switch in the middle of your neat office lan is a real switch (i.e. not flooding oversize packets to hosts that can't handle them, possibly crashing their NIC drivers) and it's itself capable of larger MTUs?
Well, yes, being compatible with stuff that doesn't support larger packets pretty much goes without saying. I don't think there is any need to worry about crashing drivers; packets that are longer than they should be are a common error condition that drivers are supposed to handle without incident. (They often keep a "giant" count.)
A more common problem would be two hosts that support jumbo frames with a switch in the middle that doesn't. So it's necessary to test for this and avoid sending excessive numbers of large packets when something in the middle doesn't support them.
The internet is broken.. too many firewalls dropping ICMP, too many hard-coded systems that work for 'default' but don't actually allow for alternative parameters that should work according to the RFCs. If you can fix all that then it might work. Alternatively, if you can redesign path MTU discovery that might work too.. Martin Levy suggested this to me only two weeks ago; he had an idea of sending two packets initially - one 'default' and one at the higher MTU. If the higher one gets dropped somewhere you can quickly spot it and revert to 'default' behaviour. I think his explanation was more complicated but it was an interesting idea. Steve
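The dual-probe idea described above can be sketched as a small simulation (a sketch only: the path model and function names here are hypothetical, not any real implementation, and a real probe would have to handle reordering and loss unrelated to MTU):

```python
# Sketch of the dual-probe idea: send a default-sized probe and a larger
# probe up front; if the large probe is lost, revert to the default MTU.

DEFAULT_MTU = 1500

def path_delivers(packet_size, path_mtu):
    """Hypothetical path model: a DF-marked packet is silently dropped
    (no ICMP comes back) when it exceeds the path MTU."""
    return packet_size <= path_mtu

def negotiate_mtu(path_mtu, candidate=9000):
    """Try the candidate size first, then the default; use the largest
    size whose probe survived the path."""
    for size in (candidate, DEFAULT_MTU):
        if path_delivers(size, path_mtu):
            return size
    return DEFAULT_MTU  # last resort: assume the default works

print(negotiate_mtu(path_mtu=9180))  # jumbo-clean path
print(negotiate_mtu(path_mtu=1500))  # legacy path
```

The point of sending both probes at once, rather than probing sizes sequentially, is that the sender learns the answer in one round trip instead of waiting for a timeout on the large probe before falling back.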
On (2007-04-12 11:20 +0200), Iljitsch van Beijnum wrote:
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that: 1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
To me this sounds like adding complexity for a rather small pay-off. And then we'd have to ask the IXP people: would they enable this feature if it was available? If so, why don't they offer a high-MTU VLAN today? And in the end, the pay-off of a larger MTU is quite small; perhaps some interrupts are saved, but I'm not sure how relevant that is in poll()-based NIC drivers. Of course a bigger pay-off would be that users could use tunneling and still offer 1500 to the LAN. IXP peeps, why are you not offering a high MTU VLAN option?
From my point of view, this is the biggest reason why we today generally don't have a higher end-to-end MTU. I know that some IXPs do, e.g. Netnod, but generally it's not offered even though many users would opt to use it.
Thanks, -- ++ytti
On Thu, 12 Apr 2007, Saku Ytti wrote:
IXP peeps, why are you not offering high MTU VLAN option?
Netnod in Sweden offers an MTU 4470 option. OTOH it's not so easy operationally since, for instance, Juniper and Cisco calculate MTU differently. But I don't really see it as beneficial to try to raise the end-system MTU above the standard ethernet MTU; if you think it's operationally troublesome with PMTUD now, imagine when everybody is running a different MTU. The biggest benefit would be if the transport networks that people run PPPoE and other tunneled traffic over would allow for whatever MTU is needed to carry unfragmented 1500-byte tunneled packets, so we could ensure that all hosts on the internet actually have a 1500 IP MTU transparently. -- Mikael Abrahamsson email: swmike@swm.pp.se
* swmike@swm.pp.se (Mikael Abrahamsson) [Thu 12 Apr 2007, 14:07 CEST]:
On Thu, 12 Apr 2007, Saku Ytti wrote:
IXP peeps, why are you not offering high MTU VLAN option? Biggest benefit would be if the transport network people run PPPoE and other tunneled traffic over, would allow for whatever MTU needed to carry unfragmented 1500 byte tunneled packets, so we could assure that all hosts on the internet actually have 1500 IP MTU transparently.
How much traffic from DSLAM to service provider is currently being exchanged across IXPs? (My money's on "as close to 0 to not really matter") -- Niels.
Niels Bakker wrote:
* swmike@swm.pp.se (Mikael Abrahamsson) [Thu 12 Apr 2007, 14:07 CEST]:
On Thu, 12 Apr 2007, Saku Ytti wrote:
IXP peeps, why are you not offering high MTU VLAN option? Biggest benefit would be if the transport network people run PPPoE and other tunneled traffic over, would allow for whatever MTU needed to carry unfragmented 1500 byte tunneled packets, so we could assure that all hosts on the internet actually have 1500 IP MTU transparently.
How much traffic from DSLAM to service provider is currently being exchanged across IXPs?
How much l2 vpn traffic is being exchanged across the public internet? (my money is on a lot)
(My money's on "as close to 0 to not really matter")
-- Niels.
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP). One could argue a decreased pps impact on intermediate systems, but when factoring in the existing packet size distribution on the Internet and the perceived adjustment seen by a migration to 4470 MTU support, the gains remain small. Development costs and the OpEx costs of implementation and support will, likely, always outweigh the gains. Gian Anthony Constantine On Apr 12, 2007, at 7:50 AM, Saku Ytti wrote:
On (2007-04-12 11:20 +0200), Iljitsch van Beijnum wrote:
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that: 1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
To me this sounds like adding complexity for a rather small pay-off. And then we'd have to ask the IXP people: would they enable this feature if it was available? If so, why don't they offer a high-MTU VLAN today? And in the end, the pay-off of a larger MTU is quite small; perhaps some interrupts are saved, but I'm not sure how relevant that is in poll()-based NIC drivers. Of course a bigger pay-off would be that users could use tunneling and still offer 1500 to the LAN.
IXP peeps, why are you not offering a high MTU VLAN option? From my point of view, this is the biggest reason why we today generally don't have a higher end-to-end MTU. I know that some IXPs do, e.g. Netnod, but generally it's not offered even though many users would opt to use it.
Thanks, -- ++ytti
On 12-apr-2007, at 16:04, Gian Constantine wrote:
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
6% including ethernet overhead and assuming the very common TCP timestamp option.
One could argue a decreased pps impact on intermediate systems, but when factoring in the existing packet size distribution on the Internet and the perceived adjustment seen by a migration to 4470 MTU support, the gains remain small.
Average packet size on the internet has been fairly constant at around 500 bytes for the past 10 years or so from my vantage point. You only need to make 7% of all packets 9000 bytes to double that. This means that you can have twice the amount of data transferred for the same amount of per-packet work. If you're at 100% of your CPU or TCAM capacity today, that is a huge win. On the other hand, if you need to buy equipment that can do line rate at 64 bytes per packet, it doesn't matter much. There are other benefits too, though. For instance, TCP can go much faster with bigger packets. Additional tunnel/VPN overhead isn't as bad.
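The arithmetic behind that doubling claim is worth making explicit (a quick sketch; the 500-byte average and the 7% fraction are the figures from the paragraph above):

```python
# If a fraction f of packets become 9000-byte jumbos while the rest keep
# the current ~500-byte average, the new average packet size is a simple
# weighted mean -- and doubling it halves the packets (and thus the
# per-packet work) needed to move the same number of bytes.

avg = 500      # current average packet size, bytes
jumbo = 9000   # jumbo frame payload, bytes
f = 0.07       # fraction of packets made jumbo-sized

new_avg = (1 - f) * avg + f * jumbo
print(new_avg)  # 1095.0 -- roughly double the original 500
```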
Development costs and the OpEx costs of implementation and support will, likely, always outweigh the gains.
Gains will go up as networks get faster and faster, implementation costs should approach zero over time, and support shouldn't be an issue if it works fully automatically. Others mentioned ICMP filtering and PMTUD problems. Filtering shouldn't be an issue for a mechanism that is local to a subnet, and even if it is, there's still no problem if the mechanism takes the opposite approach of PMTUD. With PMTUD, the assumption is that large packets work, and extra messages result in a smaller packet size. By exchanging large messages that indicate the capability to exchange large messages, form and function align: if an indication that large messages are possible isn't received, the larger size isn't used and there are no problems.
On (2007-04-12 16:28 +0200), Iljitsch van Beijnum wrote:
On 12-apr-2007, at 16:04, Gian Constantine wrote:
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
6% including ethernet overhead and assuming the very common TCP timestamp option.
Out of curiosity, how is this calculated?

[ytti@ytti.fi ~]% echo "1450/(1+7+6+6+2+1500+4+12)*100"|bc -l
94.27828348504551365400
[ytti@ytti.fi ~]% echo "8950/(1+7+6+6+2+9000+4+12)*100"|bc -l
99.02633325957070148200
[ytti@ytti.fi ~]%

I calculated less than 5% from 1500 to 9000, with ethernet and adding TCP timestamp. What did I miss? Or, compared without TCP timestamp, 1500 to 4470:

[ytti@ytti.fi ~]% echo "1460/(1+7+6+6+2+1500+4+12)*100"|bc -l
94.92847854356306892000
[ytti@ytti.fi ~]% echo "4410/(1+7+6+6+2+4470+4+12)*100"|bc -l
97.82608695652173913000

Less than 3%. However, I don't think it's relevant whether it's 1% or 10%; the bigger benefit would be to give 1500 end-to-end, even with e.g. IPsec to the office. -- ++ytti
Or compared without tcp timestamp and 1500 to 4470. [ytti@ytti.fi ~]% echo "1460/(1+7+6+6+2+1500+4+12)*100"|bc -l 94.92847854356306892000 [ytti@ytti.fi ~]% echo "4410/(1+7+6+6+2+4470+4+12)*100"|bc -l 97.82608695652173913000
Apparently 70-40 is too hard for me.

[ytti@ytti.fi ~]% echo "4430/(1+7+6+6+2+4470+4+12)*100"|bc -l
98.26974267968056787900

so ~3.3% -- ++ytti
I did a rough, top-of-the-head calculation with ~60 bytes of header (ETH, IP, TCP) against 1500 and 4470 (a mistake, on my part, not to use 9216). I still think the cost outweighs the gain, though there are some reasonable arguments for the increase. Gian Anthony Constantine On Apr 12, 2007, at 12:07 PM, Saku Ytti wrote:
On (2007-04-12 16:28 +0200), Iljitsch van Beijnum wrote:
On 12-apr-2007, at 16:04, Gian Constantine wrote:
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
6% including ethernet overhead and assuming the very common TCP timestamp option.
Out of curiosity how is this calculated? [ytti@ytti.fi ~]% echo "1450/(1+7+6+6+2+1500+4+12)*100"|bc -l 94.27828348504551365400 [ytti@ytti.fi ~]% echo "8950/(1+7+6+6+2+9000+4+12)*100"|bc -l 99.02633325957070148200 [ytti@ytti.fi ~]%
I calculated less than 5% from 1500 to 9000, with ethernet and adding TCP timestamp. What did I miss?
Or compared without tcp timestamp and 1500 to 4470. [ytti@ytti.fi ~]% echo "1460/(1+7+6+6+2+1500+4+12)*100"|bc -l 94.92847854356306892000 [ytti@ytti.fi ~]% echo "4410/(1+7+6+6+2+4470+4+12)*100"|bc -l 97.82608695652173913000
Less than 3%.
However, I don't think it's relevant if it's 1% or 10%, bigger benefit would be to give 1500 end-to-end, even with eg. ipsec to the office.
-- ++ytti
On 12-apr-2007, at 18:07, Saku Ytti wrote:
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
6% including ethernet overhead and assuming the very common TCP timestamp option.
Out of curiosity how is this calculated?
8 bytes preamble
14 bytes ethernet II header
20 bytes IP header
20 bytes TCP header
12 bytes timestamp option
4 bytes FCS/CRC
12 bytes equivalent inter-frame gap

90 bytes total overhead: 52 deducted from the ethernet payload, 38 added to it.

90 / (1500 - 52 = 1448) * 100 = 6.21
90 / (9000 - 52 = 8948) * 100 = 1.01

Also note that the real overhead is bigger still because an ACK is sent for every two full-size TCP packets, which adds 90 bytes per 2 data packets, increasing the overhead to 9% / 1.5%.
On (2007-04-12 19:51 +0200), Iljitsch van Beijnum wrote:
8 bytes preamble 14 bytes ethernet II header 20 bytes IP header 20 bytes TCP header 12 bytes timestamp option 4 bytes FCS/CRC 12 bytes equivalent inter frame gap
90 bytes total overhead, 52 deducted from the ethernet payload, 38 added to it.
90 / (1500 - 52 = 1448) * 100 = 6.21
90 / (9000 - 52 = 8948) * 100 = 1
Also note that the real overhead is much bigger because for every two full size TCP packets an ACK is sent so that adds 90 bytes per 2 data packets, or increases the overhead to 9% / 1.5%.
Aren't you double-penalizing? Shouldn't it be:

[ytti@nekrotuska ~]% echo "90 / (1500+38) * 100"|bc -l
5.85175552665799739900

Or, to put it another way:

[ytti@nekrotuska ~]% echo "100-(1448/(1+7+6+6+2+1500+4+12)*100)"|bc -l
5.85175552665799740000

-- ++ytti
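The two accountings in this sub-thread differ only in the denominator: overhead as a fraction of the application goodput versus as a fraction of the total bytes on the wire. A quick script (per-packet sizes taken from the breakdown quoted above) shows both side by side:

```python
# Per-packet overhead for a full-sized TCP segment with timestamps over
# Ethernet II: 8 preamble + 14 header + 4 FCS + 12 interframe gap go
# around the frame, while 20 IP + 20 TCP + 12 timestamp option come out
# of the payload.
WIRE = 8 + 14 + 4 + 12   # 38 bytes added around the Ethernet payload
HDRS = 20 + 20 + 12      # 52 bytes deducted from the Ethernet payload
OVERHEAD = WIRE + HDRS   # 90 bytes total per packet

for mtu in (1500, 9000):
    goodput = mtu - HDRS      # application bytes carried per packet
    wire_size = mtu + WIRE    # bytes actually occupying the wire
    pct_of_goodput = OVERHEAD / goodput * 100
    pct_of_wire = OVERHEAD / wire_size * 100
    print(f"MTU {mtu}: {pct_of_goodput:.2f}% of goodput, "
          f"{pct_of_wire:.2f}% of wire time")
```

Neither figure is wrong; they just answer different questions ("how much extra do I send per useful byte?" versus "what share of the link is overhead?"), which is why the thread's 6.2% and 5.85% both come out of the same byte counts.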
On Apr 12, 2007, at 10:04 AM, Gian Constantine wrote:
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
One of the "benefits" of larger MTU is that, during the additive increase phase, or after recovering from congestion, you reach full speed sooner -- it does also mean that if you do reach congestion, you throw away more data, and, because of the length of flows, are probably more likely to cause congestion...
One could argue a decreased pps impact on intermediate systems, but when factoring in the existing packet size distribution on the Internet and the perceived adjustment seen by a migration to 4470 MTU support, the gains remain small.
Development costs and the OpEx costs of implementation and support will, likely, always outweigh the gains.
Gian Anthony Constantine
On Apr 12, 2007, at 7:50 AM, Saku Ytti wrote:
On (2007-04-12 11:20 +0200), Iljitsch van Beijnum wrote:
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that: 1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
To me this sounds like adding complexity for a rather small pay-off. And then we'd have to ask the IXP people: would they enable this feature if it was available? If so, why don't they offer a high-MTU VLAN today? And in the end, the pay-off of a larger MTU is quite small; perhaps some interrupts are saved, but I'm not sure how relevant that is in poll()-based NIC drivers. Of course a bigger pay-off would be that users could use tunneling and still offer 1500 to the LAN.
IXP peeps, why are you not offering a high MTU VLAN option? From my point of view, this is the biggest reason why we today generally don't have a higher end-to-end MTU. I know that some IXPs do, e.g. Netnod, but generally it's not offered even though many users would opt to use it.
Thanks, -- ++ytti
-- Some people are like Slinkies......Not really good for anything but they still bring a smile to your face when you push them down the stairs.
Saku Ytti wrote:
IXP peeps, why are you not offering a high MTU VLAN option? From my point of view, this is the biggest reason why we today generally don't have a higher end-to-end MTU. I know that some IXPs do, e.g. Netnod, but generally it's not offered even though many users would opt to use it.
At LONAP a jumbo frames peering vlan is on the 'to investigate' list. I am not sure if there is that much interest though. Another vlan, another SVI, another peering session... The fabric itself is enabled to 9216 bytes; we have several members exchanging L2TP DSL traffic at higher MTUs but this is currently done over private (i.e. member addressed) vlans. There are some other possible IX applications... MPLS springs to mind as another network technology which requires at least baby giants; what would providers use this for? Handoff of multiprovider l2/l3 VPNs? The other technology which sees people deploying jumbos out there is storage. Selling storage as well as transit over the IX? It could happen :-) -- Will Hargrave will@lonap.net Technical Director LONAP Ltd
On (2007-04-13 00:17 +0100), Will Hargrave wrote:
At LONAP a jumbo frames peering vlan is on the 'to investigate' list. I am not sure if there is that much interest though. Another vlan, another SVI, another peering session...
Why another? For neighbours that are willing to peer over e.g. a 9k MTU VLAN, peer with them only over that VLAN; I don't see much point in peering over both VLANs. What I remember discussing with unnamed european IXP staff was that they were worried about losing 'frame too big' counters, since the switch environment would then accept bigger frames even on the 1500 MTU VLAN. And if a member misconfigures the small-MTU VLAN and calls the IXP complaining that the IXP is dropping their frames (due to their sending over 1500 bytes), IXP staff can't quickly diagnose the problem from interface counters. I argued that this is mostly irrelevant, since IXP staff can ping, from the IXP's 'small MTU VLAN', the member they suspect of sending too-large frames, and confirm it if the router replies to a ping over 1500 bytes. But then again, I have zero operational experience running an IXP and it's easy for me to oversimplify the issue.
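That diagnostic ping has to be sized carefully: ping's size option takes the ICMP payload, not the packet size. A sketch that prints the right invocations (Linux iputils syntax assumed; `<peer-address>` is a placeholder):

```python
# To test whether a peer handles a given MTU, send a ping with DF set and
# a payload that fills the whole IP packet. Subtract the 20-byte IPv4
# header and 8-byte ICMP header from the target MTU to get the -s value;
# "-M do" forbids fragmentation so an undersized hop drops the probe
# instead of silently fragmenting it.
IP_ICMP_HEADERS = 20 + 8

for mtu in (1500, 4470, 9000):
    print(f"ping -M do -s {mtu - IP_ICMP_HEADERS} -c 3 <peer-address>")
```

So a reply to `-s 1472` confirms a clean 1500-byte path, and a reply to `-s 8972` confirms 9000 end to end across the fabric.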
The fabric itself is enabled to 9216 bytes; we have several members exchanging L2TP DSL traffic at higher MTUs but this is currently done over private (i.e. member addressed) vlans.
This I believe to be the biggest gain: tunneling, e.g. the ability to run IPsec site-to-site while providing a full 1500 bytes to the LAN.
There are some other possible IX applications... MPLS springs to mind as another network technology which requires at least baby giants; what would providers use this for? Handoff of multiprovider l2/l3 VPNs?
The other technology which sees people deploying jumbos out there is storage. Selling storage as well as transit over the IX? It could happen :-)
-- Will Hargrave will@lonap.net Technical Director LONAP Ltd
-- ++ytti
On Thu, 12 Apr 2007 11:20:18 +0200 Iljitsch van Beijnum <iljitsch@muada.com> wrote:
Dear NANOGers,
It irks me that today, the effective MTU of the internet is 1500 bytes, while more and more equipment can handle bigger packets.
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
Any additional issues that such a mechanism would have to address?
Last I heard, the IEEE won't go along, and they're the ones who standardize 802.3. A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility. Perhaps that has changed (and I certainly don't remember who sent that note). --Steve Bellovin, http://www.cs.columbia.edu/~smb
* Steven M. Bellovin:
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
Gigabit ethernet has already broken backwards compatibility and is essentially point-to-point, so the old compatibility concerns no longer apply. Jumbo frame opt-in could even be controlled with a protocol above layer 2.
On Thu, 12 Apr 2007 16:12:43 +0200 Florian Weimer <fw@deneb.enyo.de> wrote:
* Steven M. Bellovin:
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
Gigabit ethernet has already broken backwards compatibility and is essentially point-to-point, so the old compatibility concerns no longer apply. Jumbo frame opt-in could even be controlled with a protocol above layer 2.
I'm neither attacking nor defending the idea; I'm merely reporting. I'll also note that the IETF is very unlikely to challenge IEEE on this. There's an informal agreement on who owns which standards. The IETF resents attempts at modifications to its standards by other standards bodies; by the same token, it tries to avoid doing that to others. --Steve Bellovin, http://www.cs.columbia.edu/~smb
* Steven M. Bellovin:
On Thu, 12 Apr 2007 16:12:43 +0200 Florian Weimer <fw@deneb.enyo.de> wrote:
* Steven M. Bellovin:
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
Gigabit ethernet has already broken backwards compatibility and is essentially point-to-point, so the old compatibility concerns no longer apply. Jumbo frame opt-in could even be controlled with a protocol above layer 2.
I'm neither attacking nor defending the idea; I'm merely reporting.
I just wanted to point out that the main reason why this couldn't be done without breaking backwards compatibility is gone (shared physical medium with unknown and unforeseeable receiver capabilities).
I'll also note that the IETF is very unlikely to challenge IEEE on this.
It's certainly unwise to do so before PMTUD works without ICMP support. 8-)
On 12-apr-2007, at 15:26, Steven M. Bellovin wrote:
Last I heard, the IEEE won't go along, and they're the ones who standardize 802.3.
I knew there was a reason we use ethernet II rather than IEEE 802.3 for IP. :-)
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
Obviously keeping the same maximum packet size when moving from 10 to 100 to 1000 to 10000 Mbps is suboptimal. However, if the newer standards were to mandate a larger maximum packet size, a station connected to a 10/100/1000 switch at 1000 Mbps would be able to send packets that a 10 Mbps station wouldn't be able to receive. (And the 802.3 length field starts clashing with ethernet II type codes.) However, to a large degree this ship has sailed because many vendors implement jumboframes. If we can fix the interoperability issue at layer 3 for IP that the IEEE can't fix at layer 2 for 802.3, then I don't see how anyone could have a problem with that. Also, such a mechanism would obviously be layer 2 agnostic, so in theory, it doesn't step on the IEEE's turf at all.
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
worse. they felt that the ether checksum is good at 1500 and not so good at 4k etc. they *really* did not want to do jumbo. i worked that doc. randy
On 12-apr-2007, at 20:15, Randy Bush wrote:
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
worse. they felt that the ether checksum is good at 1500 and not so good at 4k etc. they *really* did not want to do jumbo. i worked that doc.
It looks to me like the checksum issue is highly exaggerated or even completely wrong (as in the 1500 / 4k claim above). From http://www.aarnet.edu.au/engineering/networkdesign/mtu/size.html : --- The ethernet packet also contains a Frame Check Sequence, which is a 32-bit CRC of the frame. The weakening of this frame check with greater frame sizes is explored in R. Jain's "Error Characteristics of Fiber Distributed Data Interface (FDDI)", which appeared in IEEE Transactions on Communications, August 1990. Table VII shows a table of Hamming Distance versus frame size. Unfortunately, the CRC for frames greater than 11445 bytes only has a minimum Hamming Distance of 3. The implication being that the CRC will only detect one-bit and two-bit errors (and not non-burst 3-bit or 4-bit errors). The CRC for between 375 and 11543 bytes has a minimum Hamming Distance of 4, implying that all 1-bit, 2-bit and 3-bit errors are detected and most non-burst 4-bit errors are detected. The paper has two implications. Firstly, the power of ethernet's Frame Check Sequence is the major limitation on increasing the ethernet MTU beyond 11444 bytes. Secondly, frame sizes under 11445 bytes are as well protected by ethernet's Frame Check Sequence as frame sizes under 1518 bytes. --- Is the FCS supposed to provide guaranteed protection against a certain number of bit errors per packet? I don't believe that's the case. With random bit errors, there's still only a risk of not detecting an error on the order of 1 : 2^32, regardless of the length of the packet. But even if *any* effective weakening of the FCS caused by an increased packet size is considered unacceptable, it's still possible to do 11543-byte packets without changing the FCS algorithm. Also, I don't see a fundamental problem in changing the FCS for a new 802.3 standard, as switches can strip off a 64-bit FCS and add a 32-bit one as required.
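The 1 : 2^32 point for random (non-adversarial) corruption is easy to illustrate empirically. A sketch using zlib's CRC-32, which uses the same generator polynomial as Ethernet's FCS (the frame sizes, bit-flip count, and trial count here are arbitrary choices for illustration):

```python
import random
import zlib

# For random corruption, the chance that a CRC-32 recomputation matches by
# accident is about 2^-32 per corrupted frame, essentially independent of
# frame length -- so every corrupted frame below should be caught.
random.seed(1)

def corrupt(frame, nbits):
    """Return a copy of the frame with nbits randomly chosen bits flipped."""
    out = bytearray(frame)
    for pos in random.sample(range(len(out) * 8), nbits):
        out[pos // 8] ^= 1 << (pos % 8)
    return bytes(out)

detected = 0
trials = 200
for size in (1500, 9000):           # a standard frame and a jumbo frame
    frame = random.randbytes(size)
    fcs = zlib.crc32(frame)
    for _ in range(trials):
        if zlib.crc32(corrupt(frame, nbits=5)) != fcs:
            detected += 1

print(detected, "of", 2 * trials, "corrupted frames detected")
```

What the Hamming-distance argument adds on top of this is a guarantee: below the HD threshold, *every* error of that weight is caught, not merely all but a 2^-32 fraction. The dispute in the thread is over how much that guarantee, as opposed to the probabilistic protection, actually matters.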
It looks to me that the checksum issue is highly exaggerated or even completely wrong (as in the 1500 / 4k claim above). From http://www.aarnet.edu.au/engineering/networkdesign/mtu/size.html :
glad you have an opinion. take it to the ieee. randy
Last I heard, the IEEE won't go along, and they're the ones who standardize 802.3.
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
As I remember it, the IEEE did not say "no" (that is not the style of such standards bodies). Instead, they said something along the lines of "We will consider any proposal that does not break (existing) standards/implementations". And, to the best of my knowledge, the smart people of the world have not yet made a proposal that meets the requirements (and I believe more than a few have tried to think the issues through). There is absolutely nothing to prevent one from implementing "jumbos" (if you can even agree how large that should be). It just seems that whatever one implements will likely not be an IEEE standard (unless one is smarter than the last set of smart people). Gary
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
As I remember it, the IEEE did not say "no"
i was in the middle of this one. they said "no." the checksum becomes much weaker at 4k and 9k. and ether does have errors. randy
mark does not have posting privs and has asked me to post the following for him: --- To: Gian Constantine <constantinegi@corp.earthlink.net> From: Mark Allman <mallman@icir.org> cc: NANOG list <nanog@merit.edu> Subject: Re: Thoughts on increasing MTUs on the internet Date: Thu, 12 Apr 2007 11:47:35 -0400 Folks-
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
This does not begin to reflect the gain. Check out the model of TCP performance given in: M. Mathis, J. Semke, J. Mahdavi, T. Ott, "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm", Computer Communication Review, volume 27, number3, July 1997. (number 35 at http://www.psc.edu/~mathis/papers/index.html) The key point is that performance is directly proportional to packet size. So, an increase in the packet size is much more than a simple lowering of the overhead. In addition, the newly published RFC 4821 offers a different way to do PMTUD without relying on ICMP feedback (essentially by trying different packet sizes and trying to infer things from whether they get dropped). A good general reference to the subject of bigger MTUs is Matt Mathis' page on the subject: http://www.psc.edu/~mathis/MTU/ allman -- Mark Allman -- ICIR/ICSI -- http://www.icir.org/mallman/
Steven M. Bellovin wrote:
On Thu, 12 Apr 2007 11:20:18 +0200 Iljitsch van Beijnum <iljitsch@muada.com> wrote:
Dear NANOGers,
It irks me that today, the effective MTU of the internet is 1500 bytes, while more and more equipment can handle bigger packets.
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
Any additional issues that such a mechanism would have to address?
Last I heard, the IEEE won't go along, and they're the ones who standardize 802.3.
A few years ago, the IETF was considering various jumbogram options. As best I recall, that was the official response from the relevant IEEE folks: "no". They're concerned with backward compatibility.
Perhaps that has changed (and I certainly don't remember who sent that note).
No, I doubt it will change. The CRC algorithm used in Ethernet is already strained by the 1500-byte-plus payload size. 802.3 won't extend to any larger size without running a significant risk of the CRC algorithm failing. From a practical side, the cost of developing, qualifying, and selling new chipsets to handle jumbo packets would jack up the cost of inside equipment. What is the payback? How much money do you save going to jumbo packets? Show me the numbers.
On (2007-04-12 20:00 -0700), Stephen Satchell wrote:
From a practical side, the cost of developing, qualifying, and selling new chipsets to handle jumbo packets would jack up the cost of inside equipment. What is the payback? How much money do you save going to jumbo packets?
It's rather hard to find ethernet gear operators could imagine using in peering or core that do not support +9k MTU's. -- ++ytti
On Fri, 13 Apr 2007 08:22:49 +0300, Saku Ytti said:
On (2007-04-12 20:00 -0700), Stephen Satchell wrote:
From a practical side, the cost of developing, qualifying, and selling new chipsets to handle jumbo packets would jack up the cost of inside equipment. What is the payback? How much money do you save going to jumbo packets?
It's rather hard to find ethernet gear operators could imagine using in peering or core that do not support +9k MTU's.
Note that the number of routers in the "core" is probably vastly outweighed by the number of border and edge routers. There's a *lot* of old eBay routers out there - and until you get a clean path all the way back to the source system, you won't *see* any 9K packets. What's the business case for upgrading an older edge router to support 9K MTU, when the only source of packets coming in is a network of Windows boxes (both servers and end systems in offices) run by somebody who wouldn't believe an Ethernet has anything other than a 1500 MTU if you stapled the spec sheet to their forehead? For that matter, what releases of Windows support setting a 9K MTU? That's probably the *real* uptake limiter.
I don't think it matters that everything can use jumbograms or that every single device on the Internet supports them. Heck, I still know networks with kit that does not support VLSM! What would be good is if, when a jumbogram-capable path on the Internet exists, jumbograms can be used. This way it does not matter that some box somewhere does not support anything greater than a 1500 byte MTU; anything with such a box in the path will simply not support a jumbogram. How do you find out? Just send a jumbogram across the path and see what happens.. ;-) -- Leigh Porter UK Broadband
On Fri, 13 Apr 2007, Leigh Porter wrote:
What would be good is if when a jumbogram capable path on the Internet exists, jumbograms can be used.
Yes, and it would be good if PMTUD worked, and ECN, oh and large UDP-packets for DNS, and BCP38, and... and... and. The internet is a very diverse and complicated beast and if end systems can properly detect PMTU by doing discovery of this, it might work. Requiring the core and distribution to change isn't going to happen overnight, so end systems first. Make sure they can properly detect PMTU by use of nothing more than "is this packet size getting thru" (ie no ICMP-NEED-TO-FRAG) or alike, then we might see partial adoption of larger MTU in some parts and if this becomes a major customer requirement then it might spread. -- Mikael Abrahamsson email: swmike@swm.pp.se
Thus spake "Mikael Abrahamsson" <swmike@swm.pp.se>
The internet is a very diverse and complicated beast and if end systems can properly detect PMTU by doing discovery of this, it might work. ... Make sure they can properly detect PMTU by use of nothing more than "is this packet size getting thru" (ie no ICMP-NEED-TO-FRAG) or alike, then we might see partial adoption of larger MTU in some parts and if this becomes a major customer requirement then it might spread.
PMTU Black Hole Detection works well in my experience, but unfortunately MS doesn't turn it on by default, which is where all of the L2VPN with <1500 MTU issues come from; turn BHD on and the problems just go away... (And, as others have noted, there's better PMTUD algorithms that are designed to work _with_ black holes, but IME they're not really needed) Still, we have a (mostly) working solution for wide-area use; what's missing is the critical step in getting varying MTUs working on a single subnet. All the solutions so far have required setting a higher, but still fixed, MTU for every device and that isn't realistic on the edge except in tightly controlled environments like HPC or internal datacenters. Perry Lorier's solution is rather clever; perhaps we don't even need a protocol sanctioned by the IEEE or IETF? S Stephen Sprunk "Those people who think they know everything CCIE #3723 are a great annoyance to those of us who do." K5SSS --Isaac Asimov
-----Original Message----- From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of Stephen Sprunk Sent: Friday, April 13, 2007 10:32 AM To: Mikael Abrahamsson Cc: North American Noise and Off-topic Gripes Subject: Re: Thoughts on increasing MTUs on the internet
PMTU Black Hole Detection works well in my experience, but unfortunately MS doesn't turn it on by default, which is where all of the L2VPN with <1500 MTU issues come from; turn BHD on and the problems just go away... (And, as others have noted, there's better PMTUD algorithms that are designed to work _with_ black holes, but IME they're not really needed)
I wish I'd had your experience. PMTU _can_ work well, but on the internet as a whole, far too many ignorant paranoid admins block PMTU, mostly by accident, causing all sorts of unpleasantness. Clearing DF only takes you so far. Unless both ends are aware, and respond appropriately to the squeeze in the middle, you're back to square one. Unless there were some other method of MTU Discovery implemented, depending on something like PMTU discovery may fail just as dramatically on larger packets as it does on 1500 byte now.
Thus spake "Lasher, Donn" <DLasher@newedgenetworks.com>
PMTU Black Hole Detection works well in my experience, but unfortunately MS doesn't turn it on by default, which is where all of the L2VPN with <1500 MTU issues come from; turn BHD on and the problems just go away... (And, as others have noted, there's better PMTUD algorithms that are designed to work _with_ black holes, but IME they're not really needed)
I wish I'd had your experience. PMTU _can_ work well, but on the internet as a whole, far too many ignorant paranoid admins block PMTU, mostly by accident, causing all sorts of unpleasantness.
You can't block PMTUD per se, just the ICMP messages that dumber implementations rely on. And, as I noted, MS's implementation is dumb by default, which leads to the problems we're all familiar with. "PMTU Black Hole Detection" is appropriately named; one registry change* and a reboot is all you need to solve the problem. Of course, that's non-trivial to implement when there's hundreds of millions of boxes with the wrong setting...
Clearing DF only takes you so far. Unless both ends are aware, and respond apppropriately to the squeeze in the middle, you're back to square one.
Smarter implementations still set DF. The difference is that when they get neither an ACK nor an ICMP, they try progressively smaller sizes until they do get a response of some kind. They make a note of what works and continue on with that, with the occasional larger probe in case the problem was transient. In fact, one could consider Lorier's "mtud" to be roughly the same idea; it's only needed because the stack's own PMTUD code is typically bypassed for on-subnet destinations and/or not as smart as it should be. S * HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\ Parameters\EnablePMTUBHDetect=1 Stephen Sprunk "Those people who think they know everything CCIE #3723 are a great annoyance to those of us who do." K5SSS --Isaac Asimov
On Fri, Apr 13, 2007, Steve Meuse wrote:
On 4/13/07, Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> wrote:
For that matter, what releases of Windows support setting a 9K MTU? That's probably the *real* uptake limiter.
Most, if not all. I have an XP box that has a GigE with 9k MTU.
Lucky you. The definition of "large frames" varies depending entirely upon driver. I came up against this when a client nicely asked about jumbo frames on his shiny new Cisco 3560 switch - and none of his computers could agree on anything greater than 4k. And, to make things worse - a few of the drivers wanted to enforce certain values rather than any value between 1500 and an upper limit - making the whole feat impossible. Yay for non-clear specifications. The skeptic in me says "ain't going to happen." The believer in me says "Ah, that'd be cool, wouldn't it?" The realist in me says "probably best to mandate that kind of stuff with the next revision of the ipv6-internet with the first few bits set to 010 instead of 001. :) The real uptake limiter is the disagreement on implementation. Some of you have to remember how this whole internet thing started and grew (I've only read about the collaboration in books.) Adrian
No, I doubt it will change. The CRC algorithm used in Ethernet is already strained by the 1500-byte-plus payload size. 802.3 won't extend to any larger size without running a significant risk of the CRC algorithm failing.
I believe this has already been debunked.
From a practical side, the cost of developing, qualifying, and selling new chipsets to handle jumbo packets would jack up the cost of inside equipment. What is the payback? How much money do you save going to jumbo packets?
I believe that the change is intended to apply to routers and the ethernet switches that interconnect them in PoPs and NAPs and exchange points. Therefore the cost of a small chipset modification is likely to be negligible in the grand scheme of things. As for numbers, it is not dollar figures that I want to see. I would like the people who have jumbo packets inside their end-user networks to run some MTU discovery and publish a full MTU matrix on all paths on the Internet. That way we can all see where there is end-to-end support for large MTUs and people who want to make buying decisions on this basis will have something other than vendor assurances to show that a network supports jumbograms. --Michael Dillon
I think it's a great idea operationally, less work for the routers and more efficient use of bandwidth. It would also be useful to devise some way to at least partially reassemble fragmented frames at links capable of large MTU's. Since most PC's are on a subnet with a MTU of 1500 (or 1519) packets would still be limited to 1500B or fragmented before they reach the higher speed links. The problem with bringing this to fruition in the internet is going to be cost and effort. The ATT's and Verizons of the world are going to see this as a major upgrade without much benefit or profit. The Cisco's and Junipers are going to say the same thing when they have to write this into their code plus interoperability with other vendors implementations of it.
On Thu, Apr 12, 2007 at 11:34:43AM -0400, Keegan.Holley@sungard.com wrote:
I think it's a great idea operationally, less work for the routers and more efficient use of bandwidth. It would also be useful to devise some way to at least partially reassemble fragmented frames at links capable of large MTU's.
I think you underestimate the memory and CPU required on large links to be able to buffer the data that would allow reassembly by an intermediate router.
Since most PC's are on a subnet with a MTU of 1500 (or 1519) packets would still be limited to 1500B or fragmented before they reach the higher speed links. The problem with bringing this to fruition in the internet is going to be cost and effort. The ATT's and Verizons of the world are going to see this as a major upgrade without much benefit or profit. The Cisco's and Junipers are going to say the same thing when they have to write this into their code plus interoperability with other vendors implementations of it.
I don't think any of the above will throw out any particular objection. I think your problem is in figuring out a way to implement this globally without breaking stuff which relies so heavily upon 1500 bytes, much of which does not even cater for the possibility that another MTU might be possible. Steve
Large MTUs enable significant throughput performance enhancements for large data transfers over long round-trip times (RTTs). The original question had to do with local subnet to local subnet, where the difference would not be noticeable. But for users transferring large data sets over long distances (e.g. LHC experimental data from CERN in France to universities in the US) large MTUs can make a big difference. For an excellent and detailed (though becoming dated) examination of this see: "Raising the Internet MTU" Matt Mathis, et al. http://www.psc.edu/~mathis/MTU/ Joe
On Thu, 12 Apr 2007, Joe Loiacono wrote:
Large MTUs enable significant throughput performance enhancements for large data transfers over long round-trip times (RTTs.) The original
This is solved by increasing TCP window size, it doesn't depend very much on MTU. Larger MTU is better for devices that for instance do per-packet interrupting, like most endsystems probably do. It doesn't increase long-RTT transfer performance per se (unless you have high packetloss because you'll slow-start more efficiently). -- Mikael Abrahamsson email: swmike@swm.pp.se
owner-nanog@merit.edu wrote on 04/12/2007 04:05:43 PM:
On Thu, 12 Apr 2007, Joe Loiacono wrote:
Large MTUs enable significant throughput performance enhancements for large data transfers over long round-trip times (RTTs.) The original
This is solved by increasing TCP window size, it doesn't depend very much on MTU.
Window size is of course critical, but it turns out that MTU also impacts rates (as much as 33%, see below):

    Rate = (MSS / RTT) * (0.7 / sqrt(P))

    MSS = Maximum Segment Size
    RTT = Round Trip Time
    P   = packet loss

Mathis, et al. have 'verified the model through both simulation and live Internet measurements.' Also (http://www.aarnet.edu.au/engineering/networkdesign/mtu/why.html): "This is shown to be the case in Anand and Hartner's "TCP/IP Network Stack Performance in Linux Kernel 2.4 and 2.5" in Proceedings of the Ottawa Linux Symposium, 2002. Their experience was that a machine using a 1500 byte MTU could only reach 750Mbps whereas the same machine configured with 9000 byte MTUs handsomely reached 1Gbps." AARnet - Australia's Academic and Research Network
Larger MTU is better for devices that for instance do per-packet interrupting, like most endsystems probably do. It doesn't increase long-RTT transfer performance per se (unless you have high packetloss because you'll slow-start more efficiently).
-- Mikael Abrahamsson email: swmike@swm.pp.se
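Putting numbers into the Mathis et al. model quoted above: holding RTT and loss fixed, the loss-limited ceiling scales linearly with MSS. A back-of-the-envelope sketch (the 0.7 constant is from the formula above; the 100 ms RTT and 10^-5 loss rate are arbitrary example values, and 1460/8960 are typical MSS values for 1500- and 9000-byte MTUs):

```python
import math

def mathis_rate_bps(mss_bytes, rtt_s, loss):
    """Loss-limited TCP throughput per the Mathis et al. model:
    Rate = (MSS / RTT) * (0.7 / sqrt(P)), returned here in bits/s."""
    return (mss_bytes * 8 / rtt_s) * (0.7 / math.sqrt(loss))

rtt, p = 0.100, 1e-5          # example: 100 ms path, 0.001% loss
for mss in (1460, 8960):      # typical MSS at 1500- and 9000-byte MTU
    mbps = mathis_rate_bps(mss, rtt, p) / 1e6
    print(f"MSS {mss}: ~{mbps:.0f} Mbit/s")
```

With these example numbers the 8960-byte MSS ceiling comes out roughly six times higher, which is the "much more than a simple lowering of the overhead" point made earlier in the thread.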
On Thu, 12 Apr 2007, Joe Loiacono wrote:
Window size is of course critical, but it turns out that MTU also impacts rates (as much as 33%, see below):
    Rate = (MSS / RTT) * (0.7 / sqrt(P))

    MSS = Maximum Segment Size
    RTT = Round Trip Time
    P   = packet loss
So am I to understand that with 0 packet loss I get infinite rate? And TCP window size doesn't affect the rate? I am quite confused by this statement. Yes, under congestion larger MSS is better, but without congestion I don't see where it would differ apart from the interrupt load I mentioned earlier? -- Mikael Abrahamsson email: swmike@swm.pp.se
I believe the formula applies when the TCP window size is held constant (and maybe as large as is necessary for the bandwidth-delay product). Obviously P going to zero is a problem; there are practical limitations. But bit error rate is usually not zero over long distances. The formula is not mine, it's not new, and there is empirical evidence to support it. Check out the links for more (and better :-) info. Joe owner-nanog@merit.edu wrote on 04/12/2007 04:48:09 PM:
On Thu, 12 Apr 2007, Joe Loiacono wrote:
Window size is of course critical, but it turns out that MTU also impacts rates (as much as 33%, see below):

    Rate = (MSS / RTT) * (0.7 / sqrt(P))

    MSS = Maximum Segment Size
    RTT = Round Trip Time
    P   = packet loss
So am I to understand that with 0 packetloss I get infinite rate? And TCP window size doesn't affect the rate?
I am quite confused by this statement. Yes, under congestion larger MSS is better, but without congestion I don't see where it would differ apart from the interrupt load I mentioned earlier?
-- Mikael Abrahamsson email: swmike@swm.pp.se
Hopefully I'll be forgiven for geeking out over DHCP on nanog-l twice in the same week. On Thu, Apr 12, 2007 at 11:20:18AM +0200, Iljitsch van Beijnum wrote:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
I dunno for that.
2. It's no longer necessary to manage 1500 byte+ MTUs manually
But for this, there has been (for a long time now) a DHCPv4 option to give a client its MTU for the interface being configured (#26, RFC2132). The thing is, not very many (if any) clients actually request it. Possibly because of problem #1 (if you change your MTU, and no one else does, you're hosed). So, if you solve for the first problem in isolation, you can easily just use DHCP to solve the second with virtually no work and probably "only" (heh) client software updates. I could also note that your first problem plagues DHCP software today...it's further complicated...let's just say it sucks, and bad. If one were to solve that problem for DHCP speakers, you could probably put a siphon somewhere in the process. But it's an even harder problem to solve. -- David W. Hankins "If you don't do it right the first time, Software Engineer you'll just have to do it again." Internet Systems Consortium, Inc. -- Jack T. Hankins
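For concreteness, handing out option 26 from ISC dhcpd looks roughly like this (a sketch with example RFC 5737 addresses; as noted above, the hard part is that few clients ever request or apply interface-mtu):

```conf
# dhcpd.conf fragment (sketch): offer a 9000-byte MTU via DHCP
# option 26 ("interface-mtu", RFC 2132) on one subnet.
# Caveat from the discussion above: most clients never ask for
# this option, and a client that applies it unilaterally on a
# mixed-MTU subnet is hosed.
subnet 192.0.2.0 netmask 255.255.255.0 {
  range 192.0.2.100 192.0.2.200;
  option interface-mtu 9000;
}
```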
At 05:28 PM 4/12/2007, David W. Hankins wrote:
Hopefully I'll be forgiven for geeking out over DHCP on nanog-l twice in the same week.
On Thu, Apr 12, 2007 at 11:20:18AM +0200, Iljitsch van Beijnum wrote:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
I dunno for that.
Indeed. I do hope the vocal advocates for general use of larger MTU sizes on Ethernet have had in their careers the opportunity to enjoy the fun that ensues with LAN technologies where multiple MTUs are supported, namely token ring and FDDI. Debugging networks where MTU and MRU mismatches occur can be interesting, to say the least. It's not just a matter of receiving stations noticing there's packets coming in that are too big. Depending on the design of the interface chips, the packet may not be received at all, and no indication sent to the driver. The result can be endless re-sending of information, doomed to failure. OSPF has a way to negotiate MTU over LAN segments to deal with exactly this situation. I uncovered the problem debugging a largish OSPF network that would run for weeks or months, then fail to converge. Multi-access media benefit from predictable MTU/MRU sizes. Ethernet was well served by the fixed size. I have no issue with allowing for a larger MTU size, but disagree with attempts to reduce everyone on the link to the lowest common denominator UNLESS that negotiation is repeated periodically (with MTU sizes able to both increase and decrease). If systems negotiate a particular size among all players on a LAN, and a new station is introduced, the decision process for what to do must be understood. An alternative is to limit everyone to 1500 byte MTUs unless or until adjacent stations negotiate a larger window size. At the LAN level, this could be handled in ARP or similar, but the real desire would be to find a way to negotiate endpoint-to-endpoint at the IP layer. Don't even get into IP multicast...
2. It's no longer necessary to manage 1500 byte+ MTUs manually
But for this, there has been (for a long time now) a DHCPv4 option to give a client its MTU for the interface being configured (#26, RFC2132).
The thing is, not very many (if any) clients actually request it. Possibly because of problem #1 (if you change your MTU, and no one else does, you're hosed).
Trying to do this via DHCP is, IMO, doomed to failure. The systems most likely to be in need of larger MTUs are likely servers, and probably not on DHCP-assigned addresses.
So, if you solve for the first problem in isolation, you can easily just use DHCP to solve the second with virtually no work and probably "only" (heh) client software updates.
I could also note that your first problem plagues DHCP software today...it's further complicated...let's just say it sucks, and bad.
If one were to solve that problem for DHCP speakers, you could probably put a siphon somewhere in the process.
But it's an even harder problem to solve.
DHCP has enough issues and problems today, I think we're in agreement that heaping more on it might not be prudent. Dan
On Thu, Apr 12, 2007 at 05:58:07PM -0400, Daniel Senie wrote:
2. It's no longer necessary to manage 1500 byte+ MTUs manually
But for this, there has been (for a long time now) a DHCPv4 option to give a client its MTU for the interface being configured (#26, RFC2132).
Trying to do this via DHCP is, IMO, doomed to failure. The systems most likely to be in need of larger MTUs are likely servers, and probably not on DHCP-assigned addresses.
If you're bothering to statically configure a system with a fixed address (such as with a server), why can you not also statically configure it with an MTU? -- David W. Hankins "If you don't do it right the first time, Software Engineer you'll just have to do it again." Internet Systems Consortium, Inc. -- Jack T. Hankins
At 06:09 PM 4/12/2007, David W. Hankins wrote:
On Thu, Apr 12, 2007 at 05:58:07PM -0400, Daniel Senie wrote:
2. It's no longer necessary to manage 1500 byte+ MTUs manually
But for this, there has been (for a long time now) a DHCPv4 option to give a client its MTU for the interface being configured (#26, RFC2132).
Trying to do this via DHCP is, IMO, doomed to failure. The systems most likely to be in need of larger MTUs are likely servers, and probably not on DHCP-assigned addresses.
If you're bothering to statically configure a system with a fixed address (such as with a server), why can you not also statically configure it with an MTU?
Neither addresses interoperability on a multi-access medium where a new station could be introduced, and can result in the same MTU/MRU mismatch problems that were seen on token ring and FDDI. The problem is you might open a conversation (whatever the protocol), then get into trouble when larger data packets follow smaller initial conversation opening packets. Or you can work with the same assumptions people use today: all stations on a particular network segment must use the same MTU size, whether that's the standard Ethernet size, or a larger size, and a warning sign hanging from the switch, saying "use MTU size of xxxx or suffer the consequences."
On Thu, Apr 12, 2007 at 06:18:56PM -0400, Daniel Senie wrote:
Neither addresses interoperability on a multi-access medium where a new station could be introduced, and can result in the same MTU/MRU mismatch problems that were seen on token ring and FDDI.
Solving Ilijitsch's "#1" is a separate problem, and you can solve them in isolation. If you chose to do so, "#2" is already solved for all hosts where dynamic configuration is desirable. -- David W. Hankins "If you don't do it right the first time, Software Engineer you'll just have to do it again." Internet Systems Consortium, Inc. -- Jack T. Hankins
Iljitsch van Beijnum wrote:
Dear NANOGers,
It irks me that today, the effective MTU of the internet is 1500 bytes, while more and more equipment can handle bigger packets.
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
Any additional issues that such a mechanism would have to address?
I have a half-completed prototype "mtud" that runs under Linux. It sets the interface to 9k, but sets the route for the subnet down to 1500. It then watches the arp table for new arp entries. As a new MAC is added, it sends a 9k UDP datagram to that host and listens for an ICMP port unreachable reply (like traceroute does). If the error arrives, it assumes that host can receive packets that large, and adds a host route with the larger MTU to that host. It steps up the MTUs from 1500 to 16k, trying to rapidly increase the MTU without having to wait for annoying timeouts. If anything goes wrong somewhere along the way (a host is firewalled or whatever) then it won't receive the ICMP reply, and won't raise the MTU. The idea is that you can run this on routers/servers on a network that has 9k MTUs but where not all the hosts are assured to be 9k capable, and it will correctly detect the available MTU between servers or routers, but still be able to correctly talk to machines that are still stuck with 1500 byte MTUs etc. In other interesting data points in this field, for some reason a while ago we had reason to do some throughput tests under Linux with varying the MTU using e1000's and ended up with this pretty graph: http://wand.net.nz/~perry/mtu.png we never had the time to investigate exactly what was going on, but interestingly at 8k MTUs (which is presumably what NFS would use), performance is exceptionally poor compared to 9k and 1500 byte MTUs. Our (untested) hypothesis is that the Linux kernel driver isn't smart about how it allocates its buffers.
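The probe described above can even be approximated from user space: on Linux, a connected UDP socket surfaces a returning ICMP port-unreachable as ECONNREFUSED, so receiving that error is evidence the neighbour accepted the oversized datagram. A rough sketch under those assumptions (the port number is a traceroute-style guess at a closed port; a real mtud also raises the interface MTU and installs per-host routes):

```python
import errno, socket

def neighbour_accepts(host, size, port=33434, timeout=1.0):
    """Probe whether `host` on the local subnet can receive a
    `size`-byte UDP datagram, aimed at a port we assume is closed.
    If the host received the oversized packet it replies with ICMP
    port-unreachable, which Linux delivers on a *connected* UDP
    socket as ECONNREFUSED.  Silence means the packet (probably)
    never made it -- or the host is firewalled, which this sketch
    cannot tell apart."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        s.send(b"\x00" * max(size - 28, 1))  # 28 bytes of IP+UDP headers
        s.recv(1)          # no real data expected; this surfaces the error
    except ConnectionRefusedError:
        return True        # ICMP port-unreachable came back: host got it
    except OSError as e:
        if e.errno == errno.EMSGSIZE:
            return False   # our own interface MTU refused the send
        return False       # timeout or other error: assume not capable
    finally:
        s.close()
    return False           # unexpected data; treat as inconclusive
```

The same caveat from the message above applies: a firewalled host looks identical to an incapable one, so the safe failure mode is to stay at 1500.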
Ah, large MTUs. Like many other "academic" backbones, we implemented large (9192-byte) MTUs on our backbone and 9000 bytes on some hosts. See [1] for an illustration. Here are *my* current thoughts on increasing the Internet MTU beyond its current value, 1500. (On the topic, see also [2] - a wiki page which is actually served from a 9000-byte MTU server :-)

Benefits of >1500-byte MTUs:

Several benefits of moving to larger MTUs, say in the 9000-byte range, were cited. I don't find them too convincing anymore.

1. Fewer packets reduce work for routers and hosts.

Routers: Most backbones seem to size their routers to sustain (near-) line-rate traffic even with small (64-byte) packets. That's a good thing, because if networks were dimensioned to just work at average packet sizes, they would be pretty easy to DoS by sending floods of small packets. So I don't see how raising the MTU helps much unless you also raise the minimum packet size - which might be interesting, but I haven't heard anybody suggest that.

This should be true for routers and middleboxes in general, although there are certainly many places (especially firewalls) where pps limitations ARE an issue. But again, raising the MTU doesn't help if you're worried about the worst case. And I would like to see examples where it would help significantly even in the normal case. In our network it certainly doesn't - we have Mpps to spare.

Hosts: For hosts, filling high-speed links at 1500-byte MTU has often been difficult at certain times (with Fast Ethernet in the nineties, GigE 4-5 years ago, 10GE today), due to the high rate of interrupts/context switches and internal bus crossings. Fortunately, tricks like polling instead of interrupts (Saku Ytti mentioned this), interrupt coalescence and large-send offload have become commonplace these days. These give most of the end-system performance benefits of large packets without requiring any support from the network.

2. Fewer bytes (saved header overhead) free up bandwidth.

TCP segments over Ethernet with a 1500-byte MTU are "only" 94.2% efficient, while with a 9000-byte MTU they would be roughly 99% efficient. While an improvement would certainly be nice, 94% already seems "good enough" to me. (I'm ignoring the byte savings due to fewer ACKs. On the other hand, not all packets will be able to grow sixfold - some transfers are small.)

3. TCP runs faster.

This boils down to two aspects (besides the effects of (1) and (2)):

a) TCP reaches its "cruising speed" faster.

Especially with LFNs (Long Fat Networks, i.e. paths with a large bandwidth*RTT product), it can take quite a long time until TCP slow start has increased the window to the point where the maximum achievable rate is reached. Since the window increase happens in units of MSS (~MTU), TCPs with larger packets reach this point proportionally faster. This is significant, but there are alternative proposals to solve this slow ramp-up, for example HighSpeed TCP [3].

b) You get a larger share of a congested link.

I think this is true when a TCP with large packets shares a congested link with TCPs with small packets, and the packet loss probability isn't proportional to the size of the packet. In fact the large-packet connection can get a MUCH larger share (sixfold for 9K vs. 1500) if the loss probability is the same for everybody (which it often will be, approximately). Some people consider this a fairness issue, others think it's a good incentive for people to upgrade their MTUs.

About the issues:

* Current Path MTU Discovery doesn't work reliably.

Path MTU Discovery as specified in RFC 1191/1981 relies on ICMP messages to discover when a smaller MTU has to be used. When these ICMP messages fail to arrive (or be sent), the sender will happily continue to send too-large packets into the blackhole. This problem is very real. As an experiment, try configuring an MTU < 1500 on a backbone link which has Ethernet-connected customers behind it.
I bet that you'll receive LOUD complaints before long.

Some people mention that Path MTU Discovery has been refined with "blackhole detection" methods in some systems. This is widely implemented, but not enabled by default (although it probably could be via a "Service Pack").

Note that a new Path MTU Discovery proposal was just published as RFC 4821 [4]. It, too, is supposed to solve the problem of relying on ICMP messages.

Please, let's wait for these more robust PMTUD mechanisms to be universally deployed before trying to increase the Internet MTU.

* IP assumes a consistent MTU within a logical subnet.

This seems to be a pretty fundamental assumption, and Iljitsch's original mail suggests that we "fix" this. Umm, OK, I hope we don't miss anything important that makes use of this assumption.

Seriously, I think it's illusory to try to change this for general networks, in particular large LANs. It might work for exchange points or other controlled cases where the set of protocols is fairly well defined, but then exchange points have other options, such as separate "jumbo" VLANs.

For campus/datacenter networks, I agree that the consistent-MTU requirement is a big problem for deploying larger MTUs. This is true within my organization - most servers that could use larger MTUs (NNTP servers for example) live on the same subnet as servers that will never bother to be upgraded. The obvious solution is to build smaller subnets - for our test servers I usually configure a separate point-to-point subnet for each of their Ethernet interfaces (I don't trust this bridging magic anyway :-).

* Most edges will not upgrade anyway.

On the slow edges of the network (residual modem users, exotic places, cellular data users etc.), people will NOT upgrade their MTU to 9000 bytes, because a single such packet would totally kill the VoIP experience. For medium-fast networks, large MTUs don't cause problems, but they don't help either. So only a few super-fast edges have an incentive to do this at all.

For the core networks that support large MTUs (like we do), this is frustrating because all our routers now probably carve their internal buffers for 9000-byte packets that never arrive. Maybe we're wasting lots of expensive linecard memory this way?

* Chicken/egg

As long as only a small minority of hosts supports >1500-byte MTUs, there is no incentive for anyone important to start supporting them. A public server supporting 9000-byte MTUs will be frustrated when it tries to use them. The overhead (from attempted large packets that don't make it) and potential trouble will just not be worth it. This is a little similar to IPv6.

So I don't see large MTUs coming to the Internet at large soon. They probably make sense in special cases, maybe for "land-speed records" and dumb high-speed video equipment, or for server-to-server stuff such as USENET news.

(And if anybody out there manages to access [2] or http://ndt.switch.ch/ with 9000-byte MTUs, I'd like to hear about it :-)
-- 
Simon.

[1] Here are a few tracepaths (more or less traceroute with integrated PMTU discovery) from a host on our network in Switzerland. 9000-byte packets make it across our national backbone (SWITCH), the European academic backbone (GEANT2), Abilene and CENIC in the US, as well as through AARnet in Australia (even over IPv6). But the link from the last wide-area backbone to the receiving site inevitably has a 1500-byte MTU ("pmtu 1500").
: leinen@mamp1[leinen]; tracepath www.caida.org
 1:  mamp1-eth2.switch.ch (130.59.35.78)                     0.110ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                    1.029ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                   1.141ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)                4.127ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)                4.726ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)               4.901ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)    asymm  7   4.429ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)  asymm  8  12.551ms
 8:  abilene-wash-gw.rt1.fra.de.geant2.net (62.40.125.18)  asymm  9 105.099ms
 9:  64.57.28.12 (64.57.28.12)                      asymm 10 121.619ms
10:  kscyng-iplsng.abilene.ucaid.edu (198.32.8.81)  asymm 11 153.796ms
11:  dnvrng-kscyng.abilene.ucaid.edu (198.32.8.13)  asymm 12 158.520ms
12:  snvang-dnvrng.abilene.ucaid.edu (198.32.8.1)   asymm 13 180.784ms
13:  losang-snvang.abilene.ucaid.edu (198.32.8.94)  asymm 14 177.487ms
14:  hpr-lax-gsr1--abilene-LA-10ge.cenic.net (137.164.25.2)  asymm 20 179.106ms
15:  riv-hpr--lax-hpr-10ge.cenic.net (137.164.25.5)          asymm 21 185.183ms
16:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54)    asymm 18 186.368ms
17:  hpr-sdsc-sdsc2--riv-hpr-ge.cenic.net (137.164.27.54)    asymm 18 185.861ms pmtu 1500
18:  cider.caida.org (192.172.226.123)                       asymm 19 186.264ms reached
     Resume: pmtu 1500 hops 18 back 19

: leinen@mamp1[leinen]; tracepath www.aarnet.edu.au
 1:  mamp1-eth2.switch.ch (130.59.35.78)                     0.095ms pmtu 9000
 1:  swiMA1-G2-6.switch.ch (130.59.35.77)                    1.024ms
 2:  swiMA2-G2-5.switch.ch (130.59.36.194)                   1.115ms
 3:  swiEL2-10GE-1-4.switch.ch (130.59.37.77)                3.989ms
 4:  swiCE3-10GE-1-3.switch.ch (130.59.37.65)                4.731ms
 5:  swiCE2-10GE-1-4.switch.ch (130.59.36.209)               4.771ms
 6:  switch.rt1.gen.ch.geant2.net (62.40.124.21)       asymm  7   4.424ms
 7:  so-7-2-0.rt1.fra.de.geant2.net (62.40.112.22)     asymm  8  12.536ms
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au (202.158.204.249)  asymm  9  13.207ms
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au (202.158.194.145)  asymm 10 217.846ms
10:  so-3-3-0.bb1.a.per.aarnet.net.au (202.158.194.129)  asymm 11 275.651ms
11:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)    asymm 12 293.854ms
12:  so-0-1-0.bb1.a.adl.aarnet.net.au (202.158.194.6)    297.989ms pmtu 1500
13:  tiny-teddy.aarnet.edu.au (203.21.37.30)             asymm 12 297.462ms reached
     Resume: pmtu 1500 hops 13 back 12

: leinen@mamp1[leinen]; tracepath6 www.aarnet.edu.au
 1?: [LOCALHOST]                           pmtu 9000
 1:  swiMA1-G2-6.switch.ch                 1.328ms
 2:  swiMA2-G2-5.switch.ch                 1.703ms
 3:  swiEL2-10GE-1-4.switch.ch             4.529ms
 4:  swiCE3-10GE-1-3.switch.ch             5.278ms
 5:  swiCE2-10GE-1-4.switch.ch             5.493ms
 6:  switch.rt1.gen.ch.geant2.net          asymm  7   5.99ms
 7:  so-7-2-0.rt1.fra.de.geant2.net        asymm  8  13.239ms
 8:  ge-3-3-0.bb1.a.fra.aarnet.net.au      asymm  9  13.970ms
 9:  so-0-1-0.bb1.a.sin.aarnet.net.au      asymm 10 218.718ms
10:  so-3-3-0.bb1.a.per.aarnet.net.au      asymm 11 267.225ms
11:  so-0-1-0.bb1.a.adl.aarnet.net.au      asymm 12 299.78ms
12:  so-0-1-0.bb1.a.adl.aarnet.net.au      298.473ms pmtu 1500
12:  www.ipv6.aarnet.edu.au                292.893ms reached
     Resume: pmtu 1500 hops 12 back 12

[2] PERT Knowledgebase article: http://kb.pert.geant2.net/PERTKB/JumboMTU

[3] RFC 3649, HighSpeed TCP for Large Congestion Windows, S. Floyd, December 2003

[4] RFC 4821, Packetization Layer Path MTU Discovery, M. Mathis, J. Heffner, March 2007
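Simon's header-overhead figures are easy to reproduce. Here is a back-of-envelope sketch (my own arithmetic, not from the thread), assuming 38 bytes of on-the-wire Ethernet framing (preamble 8, header 14, FCS 4, inter-frame gap 12) and 52 bytes of IP + TCP headers including the timestamp option; counting slightly different overheads moves the result by a few tenths of a percent:

```python
def tcp_over_ethernet_efficiency(mtu, l3l4=52, framing=38):
    """Payload fraction of a full-size TCP segment on Ethernet.

    l3l4:    IP (20) + TCP (20) + timestamp option (12) bytes (an assumption).
    framing: preamble 8 + Ethernet header 14 + FCS 4 + inter-frame gap 12.
    """
    return (mtu - l3l4) / (mtu + framing)

print(f"{tcp_over_ethernet_efficiency(1500):.1%}")  # 94.1%
print(f"{tcp_over_ethernet_efficiency(9000):.1%}")  # 99.0%
```

So the sixfold MTU increase buys roughly five percentage points of goodput, which is the "nice but not compelling" improvement Simon describes.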
I agree with many of your thoughts. This is essentially the same discussion we had upgrading from the 576-byte common MTU of the ARPANET to the 1500-byte MTU of Ethernet-based networks. Larger MTUs are a good thing, but are not a panacea.

The biggest value in real practice is IMHO that the end systems deal with a lower interrupt rate when moving the same amount of data. That said, some who are asking about larger MTUs are asking for values so large that CRC schemes lose their value in error detection, and they find themselves looking at higher-layer FEC technologies to make up for the issue. Given that there is an equipment cost related to larger MTUs, I believe that there is such a thing as an MTU that is impractical.

1500-byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs.
Fred Baker wrote:
... 1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs. ...
Well, with almost everybody using PPPoE in Germany and at least half of Europe, our MTU is somewhere around 1480. Many routers are brain-dead (ICMP-lobotomized). When you hit somebody on an ip2ip link or IPv6 tunnel, your MTU goes down to even smaller packets and things like FTP or SSH simply break. I have seen many gamers on MTU = 1024 and smaller.

Kind regards
Peter and Karin Dambier

-- 
Peter and Karin Dambier
Cesidian Root - Radice Cesidiana
Rimbacher Strasse 16
D-69509 Moerlenbach-Bonsweiher
+49(6209)795-816 (Telekom)
+49(6252)750-308 (VoIP: sipgate.de)
mail: peter@peter-dambier.de
mail: peter@echnaton.arl.pirates
http://iason.site.voila.fr/
https://sourceforge.net/projects/iason/
http://www.cesidianroot.com/
Hello; On Apr 14, 2007, at 3:38 AM, Peter Dambier wrote:
Fred Baker wrote:
... 1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs. ...
Well, with almost everybody using PPPoE in Germany and at least half of Europe, our MTU is somewhere around 1480. Many routers are brain-dead (ICMP-lobotomized).
When you hit somebody on an ip2ip link or IPv6 tunnel, your MTU goes down to even smaller packets and things like FTP or SSH simply break. I have seen many gamers on MTU = 1024 and smaller.
I advise people doing streaming to not use MTU's larger than ~1450 for these sorts of reasons.
Regards Marshall
Marshall Eubanks wrote:
I advise people doing streaming to not use MTU's larger than ~1450 for these sorts of reasons.
The unfortunate side-effect of that is that most prominent streaming apps (don't know about YouTube, though) then send fragmented UDP packets, which leads to reassembly overhead and, in case of lost packets, significantly more data lost than necessary.

Pete
On Apr 15, 2007, at 1:49 AM, Petri Helenius wrote:
Marshall Eubanks wrote:
I advise people doing streaming to not use MTU's larger than ~1450 for these sorts of reasons.
The unfortunate side-effect of that is that most prominent streaming apps (don't know about YouTube, though) then send fragmented UDP packets, which leads to reassembly overhead and, in case of lost packets, significantly more data lost than necessary.
Pete
Dear Pete;

The streaming servers that I have dealt with (such as Darwin Streaming Server) do the fragmentation at the application layer. They thus send out lots of packets at or near (in this case) 1450 bytes, but they are not UDP fragments. That's the whole point - many networks will not deliver fragments at all, much less with the increased risk of loss when they do. (I use Cox Cable at home, and this network apparently does not forward fragments and also has an apparent MTU of 1480 bytes.)

Just looked at a YouTube dump, btw, and almost all of the packets are 1448 bytes.

Regards
Marshall
Marshall Eubanks wrote:
Dear Pete;
The streaming servers that I have dealt with (such as Darwin Streaming Server) do the fragmentation at the application layer. They thus send out lots of packets at or near (in this case) 1450 bytes, but they are not UDP fragments. That's the whole point - many networks will not deliver fragments at all, much less the increased risk of loss when they do. (I use Cox Cable at home, and this network apparently does not forward fragments and also has an apparent MTU of 1480 bytes.)
Just looked at a YouTube dump, btw, and almost all of the packets are 1448 bytes.
I'm referring to Windows Media and Real. They are the worst offenders, though not too many use them without HTTP. Pete
On Apr 13, 2007, at 4:55 PM, Fred Baker wrote:
The biggest value in real practice is IMHO that the end systems deal with a lower interrupt rate when moving the same amount of data. That said, some who are asking about larger MTUs are asking for values so large that CRC schemes lose their value in error detection, and they find themselves looking at higher layer FEC technologies to make up for the issue. Given that there is an equipment cost related to larger MTUs, I believe that there is such a thing as an MTU that is impractical.
1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs.
Keep in mind that a 9KB MTU still reduces the Ethernet CRC's effectiveness by a fair amount. The adoption of CRC32c by SCTP and iSCSI provides a larger Hamming distance, restoring detection rates for jumbo packets.

-Doug
On 14-apr-2007, at 19:22, Douglas Otis wrote:
1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs.
Keep in mind that a 9KB MTU still reduces the Ethernet CRC effectiveness by a fair amount.
In the article "Error Characteristics of FDDI" by Raj Jain (see http://citeseer.ist.psu.edu/341988.html ), table VII gives the Hamming distance of the FCS polynomial:

  Hamming weight   Max frame size (octets)
        3                 11454
        4                   375
        5                    37

Of course a 9000-byte packet has 6 times the number of bits in it, so the chance of having a number of bit errors in the packet that exceeds the Hamming distance is ~6 times greater.

I can't find bit error rate specs for various types of Ethernet real quick, but if you assume 10^-9, that means that ~1 in 10000 11454-byte packets has one bit error, so around 1 in 10^12 has three bit errors and has a _chance_ to defeat the CRC32. The naive assumption that only 1 in 2^32 of those packets with 3 flipped bits will have a valid CRC32 is probably incorrect, but the CRC should still catch most of those packets, for a fairly large value of "most".

For 1500-byte packets the fraction of packets with three bits flipped would be around 1 : 10^15; correcting for the larger number of packets per given amount of data, that's a difference of about 1 : 100.

That seems like a lot, but getting better quality fiber easily compensates for this. Expressed differently, the average amount of data transmitted where you see one packet with three flipped bits is around 10 petabytes for 11454-byte packets and some 1.3 exabytes for 1500-byte packets. For the large packets that would be one packet in three years at 1 Gbps, for the small ones one packet in 380 years.
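Iljitsch's back-of-envelope numbers can be checked with a binomial model. This is my own sketch, not from the thread, and it assumes independent bit errors at a flat 10^-9 BER, which real links don't strictly obey:

```python
from math import comb

def p_exactly_k_bit_errors(frame_bytes, k, ber=1e-9):
    """Binomial probability of exactly k bit errors in one frame,
    assuming independent bit errors at rate `ber` (a simplification)."""
    n = frame_bytes * 8
    return comb(n, k) * ber**k * (1.0 - ber)**(n - k)

# ~9e-5: roughly 1 in 11,000 max-size FDDI frames sees a single bit error
print(p_exactly_k_bit_errors(11454, 1))
# ~1.3e-13: frames with three bit errors, the minimum that can slip past
# a CRC whose Hamming distance is 3 at this frame size
print(p_exactly_k_bit_errors(11454, 3))
# ~3e-16: the same three-bit-error fraction for 1500-byte frames
print(p_exactly_k_bit_errors(1500, 3))
```

The exact binomial count comes out nearer 1 in 10^13 than 1 in 10^12 (the 3! orderings of the error positions divide the rough cube-of-the-single-error-rate estimate), but the ~1 : 100 ratio between 11454-byte and 1500-byte frames holds either way.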
One of my customers comments that he doesn't care about jumbograms of 9K or 4K - what he really wants is to be sure the networks support MTUs of at least 1600-1700 bytes, so that various combinations of IPSEC, UDP-padding, PPPoE, etc. don't break the real 1500-byte packets underneath.
One of my customers comments that he doesn't care about jumbograms of 9K or 4K - what he really wants is to be sure the networks support MTUs of at least 1600-1700 bytes, so that various combinations of IPSEC, UDP-padding, PPPoE, etc. don't break the real 1500-byte packets underneath.
nice to have smart customers!
Thus spake "Bill Stewart" <nonobvious@gmail.com>
One of my customers comments that he doesn't care about jumbograms of 9K or 4K - what he really wants is to be sure the networks support MTUs of at least 1600-1700 bytes, so that various combinations of IPSEC, UDP-padding, PPPoE, etc. don't break the real 1500-byte packets underneath.
This is a more realistic case, and support for "baby jumbos" of 2 kB to 3 kB is almost universal, even on mid-range networking gear. However, the problems of getting it deployed are mostly the same, except one can take the end nodes out of the picture in the simplest case.

OTOH, if we had a viable solution to the variable-MTU mess in the first place, you could just upgrade every network to the largest MTU possible and hosts would figure out what the PMTU was, and nobody would be sending 1500-byte packets; they'd be either something like 1400 bytes or 9000 bytes, depending on whether the path included segments that hadn't been upgraded yet...

S

Stephen Sprunk        "Those people who think they know everything
CCIE #3723            are a great annoyance to those of us who do."
K5SSS                                          --Isaac Asimov
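The "hosts would figure out what the PMTU was" part is essentially what RFC 4821 proposes: probe with real packets instead of trusting ICMP. A minimal sketch of just the search logic (my own illustration; `send_probe` is a hypothetical callback that returns True when a probe of the given size is acknowledged end-to-end):

```python
def plpmtud(send_probe, low=1280, high=9000):
    """Binary-search the path MTU with in-band probes, RFC 4821-style.

    send_probe(size) -> bool is a hypothetical callback: True when a
    probe packet of `size` bytes was acknowledged end-to-end.  `low` is
    assumed deliverable (1280 is the IPv6 minimum link MTU).  No ICMP
    feedback is consulted at any point.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if send_probe(mid):
            low = mid        # probe got through: raise the floor
        else:
            high = mid - 1   # probe (or its ack) was lost: lower the ceiling
    return low
```

Against a path that silently drops anything over, say, 4352 bytes (`lambda size: size <= 4352`), the search converges on 4352 in about a dozen probes without ever seeing an ICMP message. Real implementations fold this into the transport itself, using retransmission as the loss signal.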
On Apr 14, 2007, at 1:10 PM, Iljitsch van Beijnum wrote:
On 14-apr-2007, at 19:22, Douglas Otis wrote:
1500 byte MTUs in fact work. I'm all for 9K MTUs, and would recommend them. I don't see the point of 65K MTUs.
Keep in mind that a 9KB MTU still reduces the Ethernet CRC effectiveness by a fair amount.
I can't find bit error rate specs for various types of Ethernet real quick, but if you assume 10^-9, that means that ~1 in 10000 11454-byte packets has one bit error, so around 1 in 10^12 has three bit errors and has a _chance_ to defeat the CRC32. The naive assumption that only 1 in 2^32 of those packets with 3 flipped bits will have a valid CRC32 is probably incorrect, but the CRC should still catch most of those packets, for a fairly large value of "most".
http://www.ietf.org/rfc/rfc3385.txt http://citeseer.ist.psu.edu/koopman02bit.html
For 1500 byte packets the fraction of packets with three bits flipped would be around 1 : 10^15, correcting for the larger number of packets per given amount of data, that's a difference of about 1 : 100.
Quoting from "When The CRC and TCP Checksum Disagree" by Jonathan Stone and Craig Partridge:

http://citeseer.ist.psu.edu/cache/papers/cs/21401/http:zSzzSzsigcomm.it.uu.sezSzconfzSzpaperzSzsigcomm2000-9-1.pdf/stone00when.pdf

"Traces of Internet packets from the past two years show that between 1 packet in 1,100 and 1 packet in 32,000 fails the TCP checksum, even on links where link-level CRCs should catch all but 1 in 4 billion errors. For certain situations, the rate of checksum failures can be even higher: in one hour-long test we observed a checksum failure of 1 packet in 400. We investigate why so many errors are observed, when link-level CRCs should catch nearly all of them. We have collected nearly 500,000 packets which failed the TCP or UDP or IP checksum. This dataset shows the Internet has a wide variety of error sources which can not be detected by link-level checks. We describe analysis tools that have identified nearly 100 different error patterns. Categorizing packet errors, we can infer likely causes which explain roughly half the observed errors. The causes span the entire spectrum of a network stack, from memory errors to bugs in TCP. After an analysis we conclude that the checksum will fail to detect errors for roughly 1 in 16 million to 10 billion packets. From our analysis of the cause of errors, we propose simple changes to several protocols which will decrease the rate of undetected error. Even so, the highly non-random distribution of errors strongly suggests some applications should employ application-level checksums or equivalents."

Hardware weaknesses within DSLAMs or various memory arrays, such as a weak driver on some internal interface, can generate high levels of multi-bit errors not detected by TCP checksums. When affecting the same bit within an interface, more than 1 out of 100 may go undetected.
That seems like a lot, but better-quality fiber easily compensates for it. Expressed differently: the average amount of data transmitted before you see one packet with three flipped bits is around 10 petabytes for 11454-byte packets and some 1.3 exabytes for 1500-byte packets. For the large packets that would be one such packet every three years at 1 Gbps; for the small ones, one every 380 years.
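The back-of-the-envelope arithmetic above can be sketched as follows, assuming independent bit errors; the BER value here is an illustrative assumption of mine, not a figure from the thread, so only the ratio between the two MTUs is meaningful:

```python
# Sketch: bytes transmitted, on average, per packet carrying exactly
# three flipped bits, under a memoryless bit-error model.  The BER is
# an assumed value for illustration.
from math import comb

def data_per_three_bit_error(mtu_bytes: int, ber: float) -> float:
    """Average bytes sent per packet with exactly three flipped bits."""
    bits = mtu_bytes * 8
    p3 = comb(bits, 3) * ber**3 * (1 - ber) ** (bits - 3)
    return mtu_bytes / p3

for mtu in (1500, 11454):
    print(mtu, f"{data_per_three_bit_error(mtu, 1e-10):.3e} bytes")
```

Because the three-flip probability grows roughly with the cube of the packet size while each packet carries only linearly more data, jumbo frames see an undetectable-pattern candidate after about (11454/1500)^2 ≈ 58 times less data — the "1 : 100" order of magnitude quoted earlier.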
Consider that the CRC is not always carried with the packet between interfaces.

-Doug
As long as only a small minority of hosts supports >1500-byte MTUs, there is no incentive for anyone important to start supporting them. A public server supporting 9000-byte MTUs will be frustrated when it tries to use them. The overhead (from attempted large packets that don't make it) and potential trouble will just not be worth it. This is a little similar to IPv6.
So I don't see large MTUs coming to the Internet at large soon. They probably make sense in special cases, maybe for "land-speed records" and dumb high-speed video equipment, or for server-to-server stuff such as USENET news.
It is *certainly* helpful for USENET news.

So perhaps it is time to chuck the whole thing out and start over. There seem to be enough projects out there (cleanslate.stanford.edu, etc.) that are looking at just that topic... maybe it is time for a new network design with IPv6, flexible MTUs, etc.

The existing MTU 1500 situation made sense on ten-megabit ethernet, of course. At the time, the overall design of the Internet and the capabilities of the underlying network hardware were such that it wasn't reasonable or practical to consider making it negotiable. There is no valid technical reason for that situation with modern hardware; the arguments against larger MTUs all appear to come down to hysterical raisins.

1500 was okay at 10 megabits. That could imply 15000 for 100 megabits, and 150000 for 1 gigabit. There probably isn't a huge number of applications for such large MTUs, and universal support is certainly not likely to happen, but we have to realize that the speeds of networks will continue to increase, and in five years we'll probably be running terabit networks everywhere. I could picture 150K MTUs being useful at those speeds.

The goal shouldn't really be to simply allow for some fixed higher MTU. If any of these "redesign the Internet" programs succeed, we should be very certain that MTU flexibility is a core feature.

... JG
--
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I won't contact you again." - Direct Marketing Ass'n position on e-mail spam (CNN)
With 24 million small businesses in the US alone, that's way too many apples.
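Joe's scaling rule amounts to holding per-packet serialization delay constant as link speed grows; a quick check of the arithmetic:

```python
# Sketch: a 1500-byte frame at 10 Mb/s occupies the wire exactly as
# long as a 15,000-byte frame at 100 Mb/s or a 150,000-byte frame at
# 1 Gb/s -- scaling the MTU with the link rate keeps serialization
# delay constant.

def serialization_us(frame_bytes: int, link_bps: float) -> float:
    """Microseconds a frame of the given size occupies the link."""
    return frame_bytes * 8 * 1e6 / link_bps

print(serialization_us(1500, 10e6))    # 1200.0 us at 10 Mb/s
print(serialization_us(15000, 100e6))  # 1200.0 us at 100 Mb/s
print(serialization_us(150000, 1e9))   # 1200.0 us at 1 Gb/s
```

The same 1.2 ms of wire occupancy per packet at every speed is why 150K MTUs are at least arguably proportionate on gigabit-and-up links.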
Simon Leinen wrote:
* Current Path MTU Discovery doesn't work reliably.
Please, let's wait for these more robust PMTUD mechanisms to be universally deployed before trying to increase the Internet MTU.
I think this is the proper summary of where we are at: trying to restore one of the original design goals of IPv4 -- reliable internetworking of networks with different MTUs. But the waiting game doesn't work: act locally and think globally.
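The more robust PMTUD mechanisms alluded to above are packetization-layer probing in the style of RFC 4821, which searches for a size that actually gets through end-to-end instead of trusting ICMP Fragmentation Needed messages that firewalls often drop. A minimal sketch of that search, with `probe` as a hypothetical stand-in for sending a padded segment and waiting for an acknowledgment:

```python
# Sketch of an RFC 4821-style packetization-layer PMTU search: binary
# search between a size known to work and the local MTU, driven only
# by whether a probe of a given size is acknowledged.  `probe` is a
# hypothetical callback; a real stack would send a padded segment and
# watch for the ACK.

def search_pmtu(probe, low: int = 1280, high: int = 9000) -> int:
    """Largest size in [low, high] for which probe(size) succeeds."""
    if not probe(low):
        raise RuntimeError("even the minimum probe size failed")
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid got through; search upward
        else:
            high = mid - 1   # mid was black-holed; search downward
    return low

# Example: a path that silently drops anything over 4000 bytes.
print(search_pmtu(lambda size: size <= 4000))  # 4000
```

The point of the design is that no ICMP ever needs to arrive: a lost probe is itself the signal, which is exactly the robustness property classic PMTUD lacks.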
* IP assumes a consistent MTU within a logical subnet.
This seems to be a pretty fundamental assumption, and Iljitsch's original mail suggests that we "fix" this.
This is an implementation detail, since local IP nodes have no conception of a remote IP node's subnet details.
participants (36)
- Adrian Chadd
- Bill Stewart
- Buhrmaster, Gary
- Daniel Senie
- David W. Hankins
- Douglas Otis
- Florian Weimer
- Fred Baker
- Gian Constantine
- Iljitsch van Beijnum
- Joe Greco
- Joe Loiacono
- Joe Maimon
- Joel Jaeggli
- Keegan.Holley@sungard.com
- Lasher, Donn
- Leigh Porter
- Marshall Eubanks
- michael.dillon@bt.com
- Mikael Abrahamsson
- Niels Bakker
- Perry Lorier
- Peter Dambier
- Petri Helenius
- Pierfrancesco Caci
- Randy Bush
- Saku Ytti
- Simon Leinen
- Stephen Satchell
- Stephen Sprunk
- Stephen Wilcox
- Steve Meuse
- Steven M. Bellovin
- Valdis.Kletnieks@vt.edu
- Warren Kumari
- Will Hargrave