RINA - scott whaps at the nanog hornets nest :-)
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)

http://www.ionary.com/PSOC-MovingBeyondTCP.pdf

--------------------------------------------------------------
"NAT is your friend"

"IP doesn’t handle addressing or multi-homing well at all"

"The IETF’s proposed solution to the multihoming problem is called LISP, for Locator/Identifier Separation Protocol. This is already running into scaling problems, and even when it works, it has a failover time on the order of thirty seconds."

"TCP and IP were split the wrong way"

"IP lacks an addressing architecture"

"Packet switching was designed to complement, not replace, the telephone network. IP was not optimized to support streaming media, such as voice, audio broadcasting, and video; it was designed to not be the telephone network."
--------------------------------------------------------------

And so, "...the first principle of our proposed new network architecture: Layers are recursive."

I can hear the angry hornets buzzing already. :-)

scott
On Fri, 5 Nov 2010 15:32:30 -0700 "Scott Weeks" <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
Whoever wrote that doesn't know what they're talking about. LISP is not the IETF's proposed solution (the IETF don't have one, the IRTF do), and streaming media was seen to be one of the early applications of the Internet - these types of applications are why TCP was split out of IP, why UDP was invented, and why UDP has a significantly different protocol number to TCP.
On Nov 5, 2010, at 7:26 PM, Mark Smith wrote:
On Fri, 5 Nov 2010 15:32:30 -0700 "Scott Weeks" <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
Whoever wrote that doesn't know what they're talking about. LISP is not the IETF's proposed solution (the IETF don't have one, the IRTF do),
Um, I would not agree. The IRTF RRG considered and is documenting a lot of things, but did not come to any consensus as to which one should be a "proposed solution."

Regards
Marshall
On Fri, 5 Nov 2010 21:40:30 -0400 Marshall Eubanks <tme@americafree.tv> wrote:
On Nov 5, 2010, at 7:26 PM, Mark Smith wrote:
On Fri, 5 Nov 2010 15:32:30 -0700 "Scott Weeks" <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
Whoever wrote that doesn't know what they're talking about. LISP is not the IETF's proposed solution (the IETF don't have one, the IRTF do),
Um, I would not agree. The IRTF RRG considered and is documenting a lot of things, but did not come to any consensus as to which one should be a "proposed solution."
I probably got a bit keen. I've been reading through the IRTF RRG "Recommendation for a Routing Architecture" draft which, IIRC, recommends pursuing the Identifier-Locator Network Protocol (ILNP) rather than LISP.

Regards,
Mark.
On Nov 6, 2010, at 10:38 AM, Mark Smith wrote:
On Fri, 5 Nov 2010 21:40:30 -0400 Marshall Eubanks <tme@americafree.tv> wrote:
On Nov 5, 2010, at 7:26 PM, Mark Smith wrote:
On Fri, 5 Nov 2010 15:32:30 -0700 "Scott Weeks" <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
Whoever wrote that doesn't know what they're talking about. LISP is not the IETF's proposed solution (the IETF don't have one, the IRTF do),
Um, I would not agree. The IRTF RRG considered and is documenting a lot of things, but did not come to any consensus as to which one should be a "proposed solution."
I probably got a bit keen, I've been reading through the IRTF RRG "Recommendation for a Routing Architecture" draft which, IIRC, makes a recommendation to pursue Identifier/Locator Network Protocol rather than LISP.
That is not a consensus document - as it says:

   To this end, this document surveys many of the proposals that were
   brought forward for discussion in this activity, as well as some of
   the subsequent analysis and the architectural recommendation of the
   chairs.

and (Section 17):

   Unfortunately, the group did not reach rough consensus on a single
   best approach.

The Chairs suggested that work continue on ILNP, but it is a stretch to characterize that as the RRG's solution, much less the IRTF's. (LISP is an IETF WG now, but with an experimental focus on its charter - "The LISP WG is NOT chartered to develop the final or standard solution for solving the routing scalability problem.")

Regards
Marshall
On Fri, Nov 05, 2010 at 03:32:30PM -0700, Scott Weeks wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
Arguments about locator/identifier splits aside (which I happen to agree with), this thing goes off the deep end on page 7 when it starts talking about peering infrastructure. In fact, pretty much every sentence on that page is blatantly wrong. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
Subject: RINA - scott whaps at the nanog hornets nest :-) Date: Fri, Nov 05, 2010 at 03:32:30PM -0700 Quoting Scott Weeks (surfer@mauigateway.com):
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
This tired bumblebee concludes that another instance of "Two bypassed computer scientists who are angry that ISO OSI didn't catch on gripe about this, and call IP, esp. IPv6, names in an effort to taint it." isn't enough to warrant anything but a yawn.

More troubling might be http://www.iec62379.org/ and what they (I think they are ATM advocates of the most bellheaded form) are trying to push into an ISO standard. Including gems like:

"Research during the decade leading up to 2010 shows that the connectionless packet switching paradigm that is inherent in Internet Protocol is unsuitable for an increasing proportion of the traffic on the Internet."

Sic! Now that is something to bite into.

--
Måns Nilsson        primary/secondary/besserwisser/machina
MN-1334-RIPE                        +46 705 989668
Do I have a lifestyle yet?
On 11/5/2010 5:32 PM, Scott Weeks wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens...>;-)
SCTP is a great protocol. It has already been implemented in a number of stacks. With these benefits over that theory, it still hasn't become mainstream yet. People are against change. They don't want to leave v4. They don't want to leave tcp/udp. Technology advances, but people will only change when they have to.

Jack
(lost brain cells actually reading that pdf)
Sent: Saturday, November 06, 2010 9:45 AM To: nanog@nanog.org Subject: Re: RINA - scott whaps at the nanog hornets nest :-)
On 11/5/2010 5:32 PM, Scott Weeks wrote:
It's really quiet in here. So, for some Friday fun let me whap at
the hornets nest and see what happens...>;-)
SCTP is a great protocol. It has already been implemented in a number of stacks. With these benefits over that theory, it still hasn't become mainstream yet. People are against change. They don't want to leave v4. They don't want to leave tcp/udp. Technology advances, but people will only change when they have to.
Jack (lost brain cells actually reading that pdf)
I believe SCTP will become more widely used in the mobile device world. You can have several different streams so you can still get an IM, for example, while you are streaming a movie. Eliminating the "head of line" blockage on thin connections is really valuable.

It would be particularly useful where you have different types of traffic from a single destination. File transfer, for example, might be a good application where one might wish to issue interactive commands to move around the directory structure while a large file transfer is taking place.

If you really want to shake a hornet's nest, try getting people to get rid of this idiotic 1500 byte MTU in the "middle of the internet" and try to get everyone to adopt 9000 byte frames as the standard. That change right there would provide a huge performance increase, load reduction on networks and servers, and with a greater number of native ethernet end to end connections, there is no reason to use 1500 byte MTUs. This is particularly true with modern PMTUD methods (such as with modern Linux kernels ... /proc/sys/net/ipv4/tcp_mtu_probing set to either 1 or 2).

While the end points should just be what they are, there is no reason for the "middle" portion, the long haul transport part, to be MTU 1500.

http://staff.psc.edu/mathis/MTU/
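For anyone who wants to poke at this, here is a rough sketch of the knobs involved on a Linux host; the interface name is illustrative and defaults vary by distribution:

  # raise the link MTU (needs a jumbo-capable NIC and switch path)
  ip link set dev eth0 mtu 9000

  # enable RFC 4821 packetization-layer PMTUD:
  #   1 = probe only after a blackhole is detected, 2 = always probe
  sysctl -w net.ipv4.tcp_mtu_probing=1

  # same knob via the /proc path mentioned above
  echo 1 > /proc/sys/net/ipv4/tcp_mtu_probing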
On Saturday, 6 November 2010 at 12:15 -0700, George Bonser wrote:
Sent: Saturday, November 06, 2010 9:45 AM To: nanog@nanog.org Subject: Re: RINA - scott whaps at the nanog hornets nest :-)
On 11/5/2010 5:32 PM, Scott Weeks wrote:
It's really quiet in here. So, for some Friday fun let me whap at
the hornets nest and see what happens...>;-)
SCTP is a great protocol. It has already been implemented in a number of stacks. With these benefits over that theory, it still hasn't become mainstream yet. People are against change. They don't want to leave v4. They don't want to leave tcp/udp. Technology advances, but people will only change when they have to.
Jack (lost brain cells actually reading that pdf)
I believe SCTP will become more widely used in the mobile device world. You can have several different streams so you can still get an IM, for example, while you are streaming a movie. Eliminating the "head of line" blockage on thin connections is really valuable.
It would be particularly useful where you have different types of traffic from a single destination. File transfer, for example, might be a good application where one might wish to issue interactive commands to move around the directory structure while a large file transfer is taking place.
If you really want to shake a hornet's nest, try getting people to get rid of this idiotic 1500 byte MTU in the "middle of the internet"
I doubt that 1500 is (still) widely used in our Internet... Might be, though, that most of us don't go all the way to 9k. mh
On Sat, Nov 6, 2010 at 12:32 PM, George Bonser <gbonser@seven.com> wrote:
I doubt that 1500 is (still) widely used in our Internet... Might be, though, that most of us don't go all the way to 9k.
mh
Last week I asked the operator of fairly major public peering points if they supported anything larger than 1500 MTU. The answer was "no".
There's still a metric buttload of SONET interfaces in the core that won't go above 4470. So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade. Matt
There's still a metric buttload of SONET interfaces in the core that won't go above 4470.
So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade.
Matt
Agreed. But even 4470 is better than 1500. 1500 was fine for 10 Mbit ethernet; it is actually pretty silly for GigE and better. This survey that Dykstra did back in 1999 points out exactly what you mentioned: http://sd.wareonearth.com/~phil/jumbo.html And that was over a decade ago.

There is no reason, in my opinion, for the various peering points to be a 1500 byte bottleneck in a path that might otherwise be larger. Increasing that from 1500 to even 3000 or 4500 gives a measurable performance boost over high latency connections such as from Europe to APAC or the Western US. This is not to mention a reduction in the number of ACK packets flying back and forth across the Internet and a general reduction in the number of packets that must be processed for a given transaction.
Last week I asked the operator of fairly major public peering points
if they supported anything larger than 1500 MTU. The answer was "no".
There's still a metric buttload of SONET interfaces in the core that won't go above 4470.
So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade.
Matt
There is no reason why we are still using 1500 byte MTUs at exchange points.
From Dykstra's paper (note that this was written in 1999 before wide deployment of GigE):
(quote)
Does GigE have a place in a NAP? Not if it reduces the available MTU!

Network Access Points (NAPs) are at the very "core" of the internet. They are where multiple wide area networks come together. A great deal of internet paths traverse at least one NAP. If NAPs put a limitation on MTU, then all WANs, LANs, and end systems that traverse that NAP are subject to that limitation. There is nothing the end systems could do to lift the performance limit imposed by the NAP's MTU. Because of their critically important place in the internet, NAPs should be doing everything they can to remove performance bottlenecks. They should be among the most permissive nodes in the network as far as the parameter space they make available to network applications.

The economic and bandwidth arguments for GigE NAPs however are compelling. Several NAPs today are based on switched FDDI (100 Mbps, 4 KB MTU) and are running out of steam. An upgrade to OC3 ATM (155 Mbps, 9 KB MTU) is hard to justify since it only provides a 50% increase in bandwidth. And trying to install a switch that could support 50+ ports of OC12 ATM is prohibitively expensive! A 64 port GigE switch however can be had for about $100k and delivers 50% more bandwidth per port at about 1/3 the cost of OC12 ATM. The problem however is 1500 byte frames, but GigE with jumbo frames would permit full FDDI MTU's and only slightly reduce a full Classical IP over ATM MTU (9180 bytes).

A recent example comes from the Pacific Northwest Gigapop in Seattle which is based on a collection of Foundry gigabit ethernet switches. At Supercomputing '99, Microsoft and NCSA demonstrated HDTV over TCP at over 1.2 Gbps from Redmond to Portland. In order to achieve that performance they used 9000 byte packets and thus had to bypass the switches at the NAP! Let's hope that in the future NAPs don't place 1500 byte packet limitations on applications.
(end quote)

Having the exchange point of ethernet connections at >1500 MTU will not in any way adversely impact the traffic on the path. If the end points are already at 1500, this change is completely transparent to them. If the end points are capable of >1500 already, then it would allow the flow to increase its packet sizes and reduce the number of packets flowing through the network and give a huge gain in performance, even in the face of packet loss.
On Sat, Nov 6, 2010 at 1:22 PM, George Bonser <gbonser@seven.com> wrote:
Last week I asked the operator of fairly major public peering points
if they supported anything larger than 1500 MTU. The answer was "no".
There's still a metric buttload of SONET interfaces in the core that won't go above 4470.
So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade.
Matt
There is no reason why we are still using 1500 byte MTUs at exchange points.
Completely agree with you on that point. I'd love to see Equinix, AMSIX, LINX, DECIX, and the rest of the large exchange points put out statements indicating their ability to transparently support jumbo frames through their fabrics, or at least indicate a roadmap and a timeline to when they think they'll be able to support jumbo frames throughout the switch fabrics. Matt
Completely agree with you on that point. I'd love to see Equinix, AMSIX, LINX, DECIX, and the rest of the large exchange points put out statements indicating their ability to transparently support jumbo frames through their fabrics, or at least indicate a roadmap and a timeline to when they think they'll be able to support jumbo frames throughout the switch fabrics.
Matt
Yes, in moving from SONET to Ethernet exchange points, we have actually reduced the potential performance of applications across the network for no good reason, in many cases.
On Saturday, 6 November 2010 at 13:29 -0700, Matthew Petach wrote:
On Sat, Nov 6, 2010 at 1:22 PM, George Bonser <gbonser@seven.com> wrote:
Last week I asked the operator of fairly major public peering points
if they supported anything larger than 1500 MTU. The answer was "no".
There's still a metric buttload of SONET interfaces in the core that won't go above 4470.
So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade.
Matt
There is no reason why we are still using 1500 byte MTUs at exchange points.
Completely agree with you on that point. I'd love to see Equinix, AMSIX, LINX, DECIX, and the rest of the large exchange points put out statements indicating their ability to transparently support jumbo frames through their fabrics, or at least indicate a roadmap and a timeline to when they think they'll be able to support jumbo frames throughout the switch fabrics.
Agree. Some people do: Netnod. ;) (1500 in one option, 4470 in another, part of a single interconnection deal -- unless I'm mistaken about the contractual side of things). mh
Completely agree with you on that point. I'd love to see Equinix, AMSIX, LINX, DECIX, and the rest of the large exchange points put out statements indicating their ability to transparently support jumbo frames through their fabrics, or at least indicate a roadmap and a timeline to when they think they'll be able to support jumbo frames throughout the switch fabrics.
The Netnod IX in Sweden has offered 4470 MTU for many years. From http://www.netnod.se/technical_information.shtml

"One VLAN handles standard sized Ethernet frames (MTU <1500 bytes) and one handles Ethernet Jumbo frames with MTU-size 4470 bytes."

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
On 6 Nov 2010, at 20:29, Matthew Petach wrote:
There is no reason why we are still using 1500 byte MTUs at exchange points. Completely agree with you on that point. I'd love to see Equinix, AMSIX, LINX, DECIX, and the rest of the large exchange points put out statements indicating their ability to transparently support jumbo frames through their fabrics, or at least indicate a roadmap and a timeline to when they think they'll be able to support jumbo frames throughout the switch fabrics.
At LONAP we've been able to support jumbo frames (at 9000+ depending on how you count it) for some years. We have been running large MTU p2p vlans for members for some time - L2TP handoff and so on.

What we don't do is support >1500 byte MTU on the shared peering vlan, and I don't see this changing anytime soon. There isn't demand; multiple vlans split your critical mass even if you are able to decide on a lowest common denominator above 1500. I imagine the situation is similar for other exchanges (apart from Netnod as already mentioned).

I won't bother to further reiterate the contents of <20101106203616.GH1902@gerbil.cluepon.net>; others can just read Ras's post for a concise description. :-)

--
Will Hargrave
Technical Director, LONAP Ltd
On Saturday, 6 November 2010 at 13:01 -0700, Matthew Petach wrote:
On Sat, Nov 6, 2010 at 12:32 PM, George Bonser <gbonser@seven.com> wrote:
I doubt that 1500 is (still) widely used in our Internet... Might be, though, that most of us don't go all the way to 9k.
mh
Last week I asked the operator of fairly major public peering points if they supported anything larger than 1500 MTU. The answer was "no".
There's still a metric buttload of SONET interfaces in the core that won't go above 4470.
So, you might conceivably get 4k MTU at some point in the future, but it's really, *really* unlikely you'll get to 9k MTU any time in the next decade.
Right, though I'm unsure of "decade" since we're moving off SDH/Sonet quite aggressively.

mh
On Sat, Nov 06, 2010 at 12:32:55PM -0700, George Bonser wrote:
I doubt that 1500 is (still) widely used in our Internet... Might be, though, that most of us don't go all the way to 9k.
Last week I asked the operator of fairly major public peering points if they supported anything larger than 1500 MTU. The answer was "no".
It would be absolutely trivial for them to enable jumbo frames, there is just no demand for them to do so, as supporting Internet wide jumbo frames (particularly over exchange points) is highly non-scalable in practice.

It's perfectly safe to have the L2 networks in the middle support the largest MTU values possible (other than maybe triggering an obscure Force10 bug or something :P), so they could roll that out today and you probably wouldn't notice. The real issue is with the L3 networks on either end of the exchange, since if the L3 routers that are trying to talk to each other don't agree about their MTU values precisely, packets are blackholed. There are no real standards for jumbo frames out there, every vendor (and in many cases particular type/revision of hardware made by that vendor) supports a slightly different size. There is also no negotiation protocol of any kind, so the only way to make these two numbers match precisely is to have the humans on both sides talk to each other and come up with a commonly supported value.

There are two things that make this practically impossible to support at scale, even ignoring all of the grief that comes from trying to find a clueful human to talk to on the other end of your connection to a third party (which is a huge problem in and of itself):

#1. There is currently no mechanism on any major router to set multiple MTU values PER NEXTHOP on a multi-point exchange, so to do jumbo frames over an exchange you would have to pick a single common value that EVERYONE can support. This also means you can't mix and match jumbo and non-jumbo participants over the same exchange, you essentially have to set up an entirely new exchange point (or vlan within the same exchange) dedicated to the jumbo frame support, and you still have to get a common value that everyone can support. Ironically many routers (many kinds of Cisco and Juniper routers at any rate) actually DO support per-nexthop MTUs in hardware, there is just no mechanism exposed to the end user to configure those values, let alone auto-negotiate them.

#2. The major vendors can't even agree on how they represent MTU sizes, so entering the same # into routers from two different vendors can easily result in incompatible MTUs. For example, on Juniper when you type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the opposite is true. So to make a Cisco talk to a Juniper that is configured 9192, you would have to configure mtu 9178. Except it's not even that simple, because now if you start adding vlan tagging the L2 header size is growing. If you now configure vlan tagging on the interface, you've got to make the Cisco side 9174 to match the Juniper's 9192. And if you configure flexible-vlan-tagging so you can support q-in-q, you've now got to configure the Cisco side for 9170.

As an operator who DOES fully support 9k+ jumbos on every internal link in my network, and as many external links as I can find clueful people to talk to on the other end to negotiate the correct values, let me just tell you this is a GIANT PAIN IN THE ASS. And we're not even talking about making sure things actually work right for the end user. Your IGP may not come up at all if the MTUs are misconfigured, but EBGP certainly will, even if the two sides are actually off by a few bytes. The maximum size of a BGP message is 4096 octets, and there is no mechanism to pad a message and try to detect MTU incompatibility, so what will actually happen in real life is the end user will try to send a big jumbo frame through and find that some of their packets are randomly and silently blackholed. This would be an utter nightmare to support and diagnose.

Realistically I don't think you'll ever see even a serious attempt at jumbo frame support implemented in any kind of scale until there is a negotiation protocol and some real standards for the mtu size that must be supported, which is something that no standards body (IEEE, IETF, etc) has seemed inclined to deal with so far. Of course all of this is based on the assumption that path mtu discovery will work correctly once the MTU values ARE correctly configured on the L3 routers, which is a pretty huge assumption, given all the people who stupidly filter ICMP. Oh and even if you solved all of those problems, I could trivially DoS your router with some packets that would overload your ability to generate ICMP Unreach Needfrag messages for PMTUD, and then all your jumbo frame end users going through that router would be blackholed as well.

Great idea in theory, epic disaster in practice, at least given the mechanisms currently at our disposal. :)

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
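To put numbers on the vendor mismatch above (assuming a plain 14 byte Ethernet header and 4 bytes per VLAN tag):

  Juniper "mtu 9192" includes the L2 header, so the IP payload is 9192 - 14 = 9178
  Cisco "mtu 9178" excludes the L2 header, so it matches a Juniper set to 9192
  with one VLAN tag the L2 header grows by 4 bytes, so the Cisco side drops to 9174
  with q-in-q (two tags) it drops again to 9170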
On 11/6/2010 3:36 PM, Richard A Steenbergen wrote:
#2. The major vendors can't even agree on how they represent MTU sizes, so entering the same # into routers from two different vendors can easily result in incompatible MTUs. For example, on Juniper when you type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the opposite is true. So to make a Cisco talk to a Juniper that is configured 9192, you would have to configure mtu 9178. Except it's not even that simple, because now if you start adding vlan tagging the L2 header size is growing. If you now configure vlan tagging on the interface, you've got to make the Cisco side 9174 to match the Juniper's 9192. And if you configure flexible-vlan-tagging so you can support q-in-q, you've now got to configure to Cisco side for 9170.
I agree with the rest, but actually, I've found that Juniper has a manual physical mtu with a separate logical mtu available, while Cisco sets a logical mtu and autocalculates the physical mtu (or perhaps the physical is just hard set to maximum). It depends on the equipment in Cisco, though. L3 and L2 interfaces treat mtu differently, especially noticeable when doing q-in-q on default switches without adjusting the mtu. Also noticeable in mtu setting methods on a c7600 (l2 vs l3 methods).

In practice, I think you can actually pop the physical mtu on the Juniper much higher than necessary, so long as you set the family based logical mtu's at the appropriate value.

Jack
On 06/11/10 15:56 -0500, Jack Bates wrote:
On 11/6/2010 3:36 PM, Richard A Steenbergen wrote:
#2. The major vendors can't even agree on how they represent MTU sizes, so entering the same # into routers from two different vendors can easily result in incompatible MTUs. For example, on Juniper when you type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the opposite is true. So to make a Cisco talk to a Juniper that is configured 9192, you would have to configure mtu 9178. Except it's not even that simple, because now if you start adding vlan tagging the L2 header size is growing. If you now configure vlan tagging on the interface, you've got to make the Cisco side 9174 to match the Juniper's 9192. And if you configure flexible-vlan-tagging so you can support q-in-q, you've now got to configure to Cisco side for 9170.
I agree with the rest, but actually, I've found that juniper has a manual physical mtu with a separate logical mtu available, while cisco sets a logical mtu and autocalculates the physical mtu (or perhaps the physical is just hard set to maximum). It depends on the equipment in cisco, though. L3 and L2 interfaces treat mtu differently, especially noticeable when doing q-in-q on default switches without adjusting the mtu. Also noticeable in mtu setting methods on a c7600(l2 vs l3 methods)
In practice, i think you can actually pop the physical mtu on the juniper much higher than necessary, so long as you set the family based logical mtu's at the appropriate value.
Cisco calls this 'routing mtu' and 'jumbo mtu' on the platform we have, to distinguish between layer 3 mtu (where packets which exceed that size get fragmented) and layer 2 mtu (where frames that exceed that size get dropped on the floor as 'giants'). We always set the layer 2 mtu as high as we can on our switches (9000+), and strictly leave everything else (layer 3) at 1500 bytes.

In my experience, setting two hosts to differing layer 3 MTUs will lead to fragmentation at some point along the routing path or within one of the hosts. With Path MTU Discovery moved to the end hosts in v6, the concept of a standardized MTU should go away, and open up much larger MTUs. However, that may not happen until dual stacked v4/v6 goes away.

--
Dan White
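A rough host-side analogue of that split, purely for illustration (the interface, prefix, and gateway below are made up):

  # the link (L2) MTU can be as large as the hardware allows
  ip link set dev eth0 mtu 9000
  # while the L3 MTU used toward a given destination can stay at 1500
  ip route add 203.0.113.0/24 via 198.51.100.1 dev eth0 mtu 1500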
It's perfectly safe to have the L2 networks in the middle support the largest MTU values possible (other than maybe triggering an obscure Force10 bug or something :P), so they could roll that out today and you probably wouldn't notice. The real issue is with the L3 networks on either end of the exchange, since if the L3 routers that are trying to talk to each other don't agree about their MTU valus precisely, packets are blackholed. There are no real standards for jumbo frames out there, every vendor (and in many cases particular type/revision of hardware made by that vendor) supports a slightly different size. There is also no negotiation protocol of any kind, so the only way to make these two numbers match precisely is to have the humans on both sides talk to each other and come up with a commonly supported value.
That is not a new problem. That is also true today with "last mile" links (e.g. dialup) that support <1500 byte MTU. What is different today is RFC 4821 PMTU discovery which deals with the "black holes". RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!

There are two things that make this practically impossible to support at scale, even ignoring all of the grief that comes from trying to find a clueful human to talk to on the other end of your connection to a third party (which is a huge problem in and of itself):

#1. There is currently no mechanism on any major router to set multiple MTU values PER NEXTHOP on a multi-point exchange, so to do jumbo frames over an exchange you would have to pick a single common value that EVERYONE can support. This also means you can't mix and match jumbo and non-jumbo participants over the same exchange, you essentially have to set up an entirely new exchange point (or vlan within the same exchange) dedicated to the jumbo frame support, and you still have to get a common value that everyone can support. Ironically many routers (many kinds of Cisco and Juniper routers at any rate) actually DO support per-nexthop MTUs in hardware, there is just no mechanism exposed to the end user to configure those values, let alone auto-negotiate them.
Is there any gear connected to a major IX that does NOT support large frames? I am not aware of any manufactured today. Even cheap D-Link gear supports them. I believe you would be hard-pressed to locate gear that doesn't support it at any major IX. Granted, it might require the change of a global config value and a reboot for it to take effect in some vendors. http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html
#2. The major vendors can't even agree on how they represent MTU sizes, so entering the same # into routers from two different vendors can easily result in incompatible MTUs. For example, on Juniper when you type "mtu 9192", this is INCLUSIVE of the L2 header, but on Cisco the opposite is true. So to make a Cisco talk to a Juniper that is configured 9192, you would have to configure mtu 9178. Except it's not even that simple, because now if you start adding vlan tagging the L2 header size is growing. If you now configure vlan tagging on the interface, you've got to make the Cisco side 9174 to match the Juniper's 9192. And if you configure flexible-vlan-tagging so you can support q-in-q, you've now got to configure to Cisco side for 9170.
Again, the size of the MTU on the IX port doesn't change the size of the packets flowing through that gear. A packet sent from an end point with an MTU of 1500 will be unchanged by the router change. A flow to an end point with <1500 MTU will also be adjusted down by PMTU Discovery just as it is now when communicating with a dialup end point that might have <600 MTU. The only thing that is going to change from the perspective of the routers is the communications originated by the router, which will basically just be the BGP session. When the TCP session is established for BGP, the side with the smaller of the two MTUs will report an MSS value which is the largest packet size it can support. The other unit will not send a packet larger than this even if it has a larger MTU. Just because the MTU is 9000 doesn't mean it is going to aggregate 1500 byte packets flowing through it into 9000 byte packets, it is going to pass them through unchanged.

As for the configuration differences between units, how does that change from the way things are now? A person configuring a Juniper for 1500 byte packets already must know the difference, as that quirk of including the headers is just as true at 1500 bytes as it is at 9000 bytes. Does the operator suddenly become less competent with their gear when they use a different value? Also, a 9000 byte MTU would be a happy value that practically everyone supports these days, including ethernet adaptors on host machines.

As an operator who DOES fully support 9k+ jumbos on every internal link in my network, and as many external links as I can find clueful people to talk to on the other end to negotiate the correct values, let me just tell you this is a GIANT PAIN IN THE ASS. And we're not even talking about making sure things actually work right for the end user. Your IGP may not come up at all if the MTUs are misconfigured, but EBGP certainly will, even if the two sides are actually off by a few bytes. The maximum size of a BGP message is 4096 octets, and there is no mechanism to pad a message and try to detect MTU incompatibility, so what will actually happen in real life is the end user will try to send a big jumbo frame through and find that some of their packets are randomly and silently blackholed. This would be an utter nightmare to support and diagnose.
So the router doesn't honor the MSS value of the TCP stream? That would seem like a bug to me. I am not suggesting we set everything to the maximum that it will support, because that is different for practically every vendor. I am suggesting that we pick a different "standard" value of 9000 bytes for the "middle" of the internet, which practically everything made these days supports. Yes, having everyone set theirs to different values can make for different issues, but those go away if we just picked one value for the interfaces between networks, 9000, that everyone supports (you can use a larger MTU internally if you are doing things like tunneling, which adds additional overhead, if you want to maintain the original 9000 byte frame end to end).
Realistically I don't think you'll ever see even a serious attempt at jumbo frame support implemented in any kind of scale until there is a negotiation protocol and some real standards for the mtu size that must be supported, which is something that no standards body (IEEE, IETF, etc) has seemed inclined to deal with so far. Of course all of this is based on the assumption that path mtu discovery will work correctly once the MTU values ARE correctly configured on the L3 routers, which is a pretty huge assumption, given all the people who stupidly filter ICMP. Oh and even if you solved all of those problems, I could trivially DoS your router with some packets that would overload your ability to generate ICMP Unreach Needfrag messages for PMTUD, and then all your jumbo frame end users going through that router would be blackholed as well.
The ICMP filtration issue goes away with modern PMTUD that is now supported in Windows, Solaris, Linux, MacOS, and BSD. That is no longer a problem for the end points. And I would highly recommend anyone operating Linux systems in production to run at least 2.6.32 with /proc/sys/net/ipv4/tcp_mtu_probing set to either 1 (blackhole recovery) or 2 (active PMTU discovery probes) in order to avoid the PMTUD problems we already have on the Internet.

The MTU issue between routers is only a problem for the traffic originated and terminated between those routers. The MSS might not be accurate if there is a tunnel someplace between the two routers that reduces the effective MTU between them, but that is a matter of getting router vendors to also support RFC 4821 themselves to detect and correct that problem.

The tools are all there. We have already been operating for quite some time with mixed MTU and effective MTU sizes with tunneling and various "last mile" issues. This adds nothing new to the mix and offers greatly improved performance in both the transactions across the network and from the gear itself in reduced CPU consumption to move a given amount of traffic.

See, I told you it was a hornets' nest :)
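If you want to see what a given path actually supports, a couple of quick checks from a Linux host (the target address is illustrative):

  # tracepath probes the path and reports the PMTU it finds
  tracepath -n 203.0.113.10

  # or force DF and a specific size with ping:
  # 8972 bytes of payload + 8 ICMP + 20 IP = a 9000 byte packet
  ping -M do -s 8972 203.0.113.10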
On Sat, Nov 6, 2010 at 2:21 PM, George Bonser <gbonser@seven.com> wrote:
...
As for the configuration differences between units, how does that change from the way things are now? A person configuring a Juniper for 1500 byte packets already must know the difference as that quirk of including the headers is just as true at 1500 bytes as it is at 9000 bytes. Does the operator suddenly become less competent with their gear when they use a different value? Also, a 9000 byte MTU would be a happy value that practically everyone supports these days, including ethernet adaptors on host machines.
While I think 9k for exchange points is an excellent target, I'll reiterate that there's a *lot* of SONET interfaces out there that won't be going away any time soon, so practically speaking, you won't really get more than 4400 end-to-end, even if you set your hosts to 9k as well.

And yes, I agree with ras; having routers able to adjust on a per-session basis would be crucial; otherwise, we'd have to ask the peeringdb folks to add a field that lists each participant's interface MTU at each exchange, and part of peermaker would be a check that could warn you, "sorry, you can't peer with network X, your MTU is too small." ;-P

(though that would make for an interesting depeering notice..."sorry, we will be unable to peer with networks who cannot support large MTUs at exchange point X after this date.")

Matt
While I think 9k for exchange points is an excellent target, I'll reiterate that there's a *lot* of SONET interfaces out there that won't be going away any time soon, so practically speaking, you won't really get more than 4400 end-to-end, even if you set your hosts to 9k as well.
Agreed. But in the meantime, removing the 1500 bottlenecks at the ethernet peering ports would at least provide the potential for the connection to scale up to the 4400 available by the SONET links. Right now, nothing is possible above 1500 for most flows that traverse an ethernet peering point. My point is that 1500 is a relic.

Put another way, how come PoS at 4400 in the path doesn't break anything currently between endpoints while any suggestion that ethernet be made larger than 1500 in the path causes all this reaction? We already HAVE MTUs larger than 1500 in the "middle" part of the path. This really doesn't change much of anything from that perspective. For example, simply taking Ethernet to 3000 would still be smaller than SONET and even that would provide measurable benefit.

There is a certain "but that is the way it has always been done" inertia that I believe needs to be overcome. Increasing the path MTU has the potential to greatly improve performance at practically no cost to anyone involved. We are throttling performance of the Internet for no sound technical reason, in my opinion. Now I could see where someone selling "jumbo" paths at a premium might be reluctant to see the Internet generally go that path as it would decrease their "value add", but that is a different story.
RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
For some value of "works". There are way too many places filtering ICMP for PMTUD to work consistently. PMTUD is *not* the solution, unfortunately. Steinar Haug, Nethelp consulting, sthaug@nethelp.no
-----Original Message----- From: sthaug@nethelp.no [mailto:sthaug@nethelp.no] Sent: Saturday, November 06, 2010 2:40 PM To: George Bonser Cc: ras@e-gerbil.net; nanog@nanog.org Subject: Re: RINA - scott whaps at the nanog hornets nest :-)
RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
For some value of "works". There are way too many places filtering ICMP for PMTUD to work consistently. PMTUD is *not* the solution, unfortunately.
Steinar Haug, Nethelp consulting, sthaug@nethelp.no
I guess you missed the part about RFC 4821 PMTUD not relying on ICMP. Modern PMTUD does not rely on ICMP and works even where it is filtered.
RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
For some value of "works". There are way too many places filtering ICMP for PMTUD to work consistently. PMTUD is *not* the solution, unfortunately.
I guess you missed the part about RFC 4821 PMTUD not relying on ICMP.
Modern PMTUD does not rely on ICMP and works even where it is filtered.
As long as the implementations are few and far between:

https://www.psc.edu/~mathis/MTU/
http://www.ietf.org/mail-archive/web/rrg/current/msg05816.html

the traditional ICMP-based PMTUD is what most of us face today.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
As long as the implementations are few and far between:
https://www.psc.edu/~mathis/MTU/ http://www.ietf.org/mail-archive/web/rrg/current/msg05816.html
the traditional ICMP-based PMTUD is what most of us face today.
Steinar Haug, Nethelp consulting, sthaug@nethelp.no
It is already the standard with currently shipping Solaris and on by default. It ships in Linux 2.6.32 but is off by default (the sysctl I referred to earlier). It ships with Microsoft Windows as "Blackhole Router Detection" and has been on by default since Windows 2003 SP2. The notion that it isn't widely deployed is not the case. It has been much more widely deployed now than it was 12 months ago.

And again, deploying 9000 byte MTU in the MIDDLE of the network is not going to change PMTUD one iota unless the rest of the path between both end points is 9000 bytes, since the end points are probably already 1500 anyway. Changing the MTU on a router in the path is not going to cause the packets flowing through it to change in size. It will not introduce any additional PMTU issues, as those are end-to-end problems anyway. If anything it should REDUCE them by making the path 9000 byte clean in the middle; there shouldn't BE any PMTU problems in the middle of the network, and things like reduced effective MTU from tunnels in the middle of networks disappear.

For example, if some network is using MTU 1500 and tunnels something over GRE and doesn't enlarge the MTU of the interfaces handling that tunnel, and if they block ICMP from inside their net, then they have introduced a PMTU issue by reducing the effective MTU of the encapsulated packets. I deal with that very problem all the time. Increasing the MTU on those paths to 9000 would enable 1500 byte packets to travel unmolested and eliminate that PMTU problem. In fact, many networks already get around that problem by increasing the MTU on tunnels just so they can avoid fragmenting the encapsulated packet. Increasing to 9000 would REDUCE problems across the network for end points using an MTU smaller than 9000.
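As a concrete illustration of that GRE case (all addresses and numbers are illustrative): GRE adds 24 bytes of overhead, so over a 1500 byte underlying path the tunnel can only carry 1476 byte packets, while over a 9000 byte path it can carry full 1500 byte frames with room to spare.

  # tunnel over a 1500-byte path: inner MTU drops below 1500
  ip tunnel add gre1 mode gre local 192.0.2.1 remote 192.0.2.2 ttl 255
  ip link set gre1 up
  ip link set gre1 mtu 1476     # 1500 - 20 (outer IP) - 4 (GRE)

  # the same tunnel over a 9000-byte path keeps 1500-byte traffic intact
  # ip link set gre1 mtu 8976   # 9000 - 24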
On 11/6/2010 3:14 PM, George Bonser wrote:
It ships with Microsoft Windows as "Blackhole Router Detection" and is on by default since Windows 2003 SP2.
The first item returned on a blekko search is the following article, which indicates that it is on by default in Windows 2008/Vista/2003/XP/2000. The article seems to predate Win7.

hth,
Doug

--
Nothin' ever doesn't change, but nothin' changes much. -- OK Go
Breadth of IT experience, and depth of knowledge in the DNS. Yours for the right price. :) http://SupersetSolutions.com/
On 11/6/2010 4:40 PM, sthaug@nethelp.no wrote:
For some value of "works". There are way too many places filtering ICMP for PMTUD to work consistently. PMTUD is *not* the solution, unfortunately.
He was referring to the updated RFC 4821.

"In the absence of ICMP messages, the proper MTU is determined by starting with small packets and probing with successively larger packets. The bulk of the algorithm is implemented above IP, in the transport layer (e.g., TCP) or other "Packetization Protocol" that is responsible for determining packet boundaries."

It is designed to support working without ICMP. Its drawback is the ramp time, which makes it useless for small transactions, but it can be argued that small transactions don't need larger MTUs.

Jack
He was referring to the updated RFC 4821.
" In the absence of ICMP messages, the proper MTU is determined by starting with small packets and probing with successively larger packets. The bulk of the algorithm is implemented above IP, in the transport layer (e.g., TCP) or other "Packetization Protocol" that is responsible for determining packet boundaries."
It is designed to support working without ICMP. Its drawback is the ramp time, which makes it useless for small transactions, but it can be argued that small transactions don't need larger MTUs.
Jack
That is also somewhat mitigated in that it operates in two modes. The first mode is what I would call "passive" mode and only comes into play once a black hole is detected. It does not change the operation of TCP until a packet disappears. The second method is the "active" mode where it actively probes with increasing packet sizes until it hits a black hole or gets an ICMP response.
On 11/6/2010 4:52 PM, George Bonser wrote:
That is also somewhat mitigated in that it operates in two modes. The first mode is what I would call "passive" mode and only comes into play once a black hole is detected. It does not change the operation of TCP until a packet disappears. The second method is the "active" mode where it actively probes with increasing packet sizes until it hits a black hole or gets an ICMP response.
While it reads well, what implementations are actually in use? As with most protocols, it is useless if it doesn't have a high penetration. Jack
While it reads well, what implementations are actually in use? As with most protocols, it is useless if it doesn't have a high penetration.
Jack
Solaris 10: in use and on by default.

Available on Windows for a very long time as "blackhole router detection"; it was off by default originally, but has been on by default since Windows 2003 SP2 and Windows XP SP3, and is on by default in Win7.

It is available on Linux but not yet on by default. I expect that will change once it gets enough use. I am not sure of the default deployment in MacOS and BSD but know it is available.
On Sat, Nov 06, 2010 at 02:21:51PM -0700, George Bonser wrote:
That is not a new problem. That is also true to today with "last mile" links (e.g. dialup) that support <1500 byte MTU. What is different today is RFC 4821 PMTU discovery which deals with the "black holes".
RFC 4821 PMTUD is that "negotiation" that is "lacking". It is there. It is deployed. It actually works. No more relying on someone sending the ICMP packets through in order for PMTUD to work!
The only thing this adds is a trial-and-error probing mechanism per flow, to try and recover from the infinite blackholing that would occur if your ICMP is blocked in classic PMTUD. If this actually happened at any scale, it would create a performance and overhead penalty that is far worse than the original problem you're trying to solve.

Say you have two routers talking to each other over a L2 switched infrastructure (i.e. an exchange point). In order for PMTUD to function quickly and effectively, the two routers on each end MUST agree on the MTU value of the link between them. If router A thinks it is 9000, and router B thinks it is 8000, when router A comes along and tries to send an 8001 byte packet it will be silently discarded, and the only way to recover from this is with trial-and-error probing by the endpoints after they detect what they believe to be MTU blackholing. This is little more than a desperate ghetto hack designed to save the connection from complete disaster.

The point where a protocol is needed is between router A and router B, so they can determine the MTU of the link, without needing to involve the humans in a manual negotiation process. Ideally this would support multi-point LANs over ethernet as well, so .1 could have an MTU of 9000, .2 could have an MTU of 8000, etc. And of course you have to make sure that you can actually PASS the MTU across the wire (if the switch in the middle can't handle it, the packet will also be silently dropped), so you can't just rely on the other side to tell you what size it THINKS it can support. You don't have a shot in hell of having MTUs negotiated correctly or PMTUD work well until this is done.
Is there any gear connected to a major IX that does NOT support large frames? I am not aware of any manufactured today. Even cheap D-Link gear supports them. I believe you would be hard-pressed to locate gear that doesn't support it at any major IX. Granted, it might require the change of a global config value and a reboot for it to take effect in some vendors.
If that doesn't prove my point about every vendor having their own definition of what # is and isn't supported, I don't know what does. Also, I don't know what exchanges YOU connect to, but I very clearly see a giant pile of gear on that list that is still in use today. :)
As for the configuration differences between units, how does that change from the way things are now? A person configuring a Juniper for 1500 byte packets already must know the difference as that quirk of including the headers is just as true at 1500 bytes as it is at 9000 bytes. Does the operator suddenly become less competent with their gear when they use a different value? Also, a 9000 byte MTU would be a happy value that practically everyone supports these days, including ethernet adaptors on host machines.
Everything defaults to 1500 today, so nobody has to do anything. Again, I'm actually doing this with people today on a very large network with lots of peers all over the world, so I have a little bit of experience with exactly what goes wrong. Nearly everyone who tries to figure out the correct MTU between vendors and with a third party network gets it wrong, at least some significant percentage of the time.

And honestly I can't even find an interesting number of people willing to turn on BFD, something with VERY clear benefits for improving failure detection time over an IX (for the next time Equinix decides to do one of their 10PM maintenances that causes hours of unreachability until hold timers expire :P). If the IX operators saw any significant demand they would have already turned it on.

--
Richard A Steenbergen <ras@e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
The only thing this adds is a trial-and-error probing mechanism per flow, to try and recover from the infinite blackholing that would occur if your ICMP is blocked in classic PMTUD. If this actually happened at any scale, it would create a performance and overhead penalty that is far worse than the original problem you're trying to solve.
I ran into this very problem not long ago when attempting to reach a server at a very large network. Our Solaris hosts had no problem transacting with the server. Our Linux machines did have a problem, and the behavior looked like a typical PMTU black hole. It turned out that "very large network" tunneled the connection inside their network, reducing the effective MTU of the encapsulated packets, and blocked ICMP from inside their net to the outside. Changing the advertised MSS of the connection to that server to 1380 allowed it to work (ip route add <ip address> via <gateway> dev <device> advmss 1380), which verified that the problem was an MTU black hole. A little reading revealed why Solaris wasn't having the problem but Linux was. Setting the Linux ip_no_pmtu_disc sysctl to 1 made the Linux behavior match the Solaris behavior.
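For anyone who wants the same clamp per application rather than per route, the rough per-socket equivalent on Linux is to cap the MSS before connecting, so the smaller value goes out in the SYN; setting net.ipv4.tcp_mtu_probing to 1 is the system-wide alternative, which probes its way around the black hole instead. A minimal sketch with a made-up server address; the kernel may round the value down further:

    import socket

    SERVER = ("192.0.2.10", 443)   # placeholder for the server behind the tunnel
    CLAMPED_MSS = 1380             # 1500 minus IP/TCP headers minus assumed tunnel overhead

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before connect() so the clamped MSS is advertised in our SYN,
    # which is what makes the far end send segments that fit through the tunnel.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, CLAMPED_MSS)
    s.connect(SERVER)
    print("effective MSS:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))
    s.close()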
Say you have two routers talking to each other over an L2 switched infrastructure (i.e. an exchange point). In order for PMTUD to function quickly and effectively, the two routers on each end MUST agree on the MTU value of the link between them. If router A thinks it is 9000, and router B thinks it is 8000, when router A comes along and tries to send an 8001-byte packet it will be silently discarded, and the only way to recover from this is with trial-and-error probing by the endpoints after they detect what they believe to be MTU blackholing. This is little more than a desperate ghetto hack designed to save the connection from complete disaster.
Correct. Devices on the same vlan will need to use the same MTU. And why is that a problem? That is just as true then as it is today. Nothing changes. All you are doing is changing from everyone using 1500 to everyone using 9000 on that vlan. Nothing else changes. Why is that any kind of issue?
The point where a protocol is needed is between router A and router B, so they can determine the MTU of the link, without needing to involve the humans in a manual negotiation process.
When the TCP/IP connection is opened between the routers for a routing session, they should each send the other an MSS value that says how large a packet they can accept. You already have that information available. TCP provides that negotiation for directly connected machines. Again, nothing changes from the current method of operating. If I showed up at a peering switch and wanted to use a 1000-byte MTU, I would probably have some problems.

The point I am making is that 1500 is a relic value that hamstrings Internet performance, and there is no good reason not to use a 9000-byte MTU at peering points (by all participants), since (a) it introduces no new problems and (b) I can't find a vendor of modern gear at a peering point that doesn't support it, though there may be some ancient gear at some peering points in use by some of the peers. I cannot think of a single problem that changing from 1500 to 9000 as the standard at peering points would introduce. It would also speed up the loading of the BGP routes between routers at the peering points.

If Joe Blow at home with a dialup connection with an MTU of 576 is talking to a server at Y! with an MTU of 10 billion, changing a peering path from 1500 to 9000 bytes somewhere in the path is not going to change that PMTU discovery one iota. It introduces no problem whatsoever. It changes nothing.
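That MSS exchange is just a TCP option (kind 2) carried in each side's SYN, so the information really is already on the wire. A small sketch of how the option is encoded and read back, with a hypothetical 8960-byte MSS (a 9000-byte MTU minus 40 bytes of IPv4 and TCP headers); real stacks of course do this for you.

    import struct

    def encode_mss_option(mss):
        # TCP option: kind=2 (MSS), length=4, 16-bit MSS value in network byte order
        return struct.pack("!BBH", 2, 4, mss)

    def decode_mss_option(option_bytes):
        kind, length, mss = struct.unpack("!BBH", option_bytes[:4])
        assert kind == 2 and length == 4, "not an MSS option"
        return mss

    syn_mss = encode_mss_option(9000 - 40)   # a router on a jumbo-enabled port
    print("advertised MSS:", decode_mss_option(syn_mss))   # -> 8960

As noted above, though, this only tells each router what the other end thinks it can accept; it says nothing about whether the switch in the middle will actually pass a frame that large.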
If that doesn't prove my point about every vendor having their own definition of what number is and isn't supported, I don't know what does. Also, I don't know what exchanges YOU connect to, but I very clearly see a giant pile of gear on that list that is still in use today. :)
That is a list of 9000 byte clean gear. The very bottom is the stuff that doesn't support it. Of the stuff that doesn't support it, how much is connected directly to a peering point? THAT is the bottleneck I am talking about right now. One step at a time. Removing the bottleneck at the peering points is all I am talking about. That will not change PMTU issues elsewhere and those will stand just exactly as they are today without any change. In fact it will ensure that there are *fewer* PMTU discovery issues by being able to support a larger range of packets without having to fragment them. We *already* have SONET MTU of >4000 and this hasn't broken anything since the invention of SONET.
Re: large MTU

One place where this has the potential to greatly improve performance is in transfers of large amounts of data such as vendors supporting the downloading of movies, cloud storage vendors, and movement of other large content and streaming. The *first* step in being able to realize those gains is in removing the "low hanging fruit" of bottlenecks in that path. The lowest hanging fruit is the peering points. Changing those should introduce no "new" problems as the peering points aren't currently the source of MTU path discovery problems and increasing the MTU removes a discovery issue point, only reducing the MTU would create one.

In transitioning from SONET to Ethernet, we are actually reducing potential performance by reducing the effective MTU from >4000 to <2000. So even increasing bandwidth is of no use if you are potentially reducing performance end to end by reducing the effective maximum MTU of the path.

In that diagram on Phil Dykstra's page linked to earlier, even though the packets on that OC3 backbone were mostly (by a large margin) 1500 bytes or smaller, the majority of the TRAFFIC was carried by packets of 1500 bytes or larger. http://sd.wareonearth.com/~phil/pktsize_hist.gif

"The above graph is from a study[1] of traffic on the InternetMCI backbone in 1998. It shows the distribution of packet sizes flowing over a particular backbone OC3 link. There is clearly a wall at 1500 bytes (the ethernet limit), but there is also traffic up to the 4000 byte FDDI MTU. But here is a more surprising fact: while the number of packets larger than 1500 bytes appears small, more than 50% of the bytes were carried by such packets because of their larger size."

[1] the nature of the beast: recent traffic measurements from an Internet backbone http://www.caida.org/outreach/papers/1998/Inet98/
* gbonser@seven.com (George Bonser) [Sun 07 Nov 2010, 00:30 CET]:
Re: large MTU
One place where this has the potential to greatly improve performance is in transfers of large amounts of data such as vendors supporting the downloading of movies, cloud storage vendors, and movement of other large content and streaming. The *first* step in being able to realize those gains is in removing the "low hanging fruit" of bottlenecks in that path. The lowest hanging fruit is the peering points. Changing those should introduce no "new" problems as the peering points aren't currently the source of MTU path discovery problems and increasing the MTU removes a discovery issue point, only reducing the MTU would create one.
On the contrary. You're proposing to fuck around with the one place on the whole Internet that has pretty clear and well adhered-to rules and expectations about MTU size supported by participants, and basically re-live the problems that MAE-East and other shared Ethernet/FDDI platforms with mismatched MTU sizes brought us during their existence.
In transitioning from SONET to Ethernet, we are actually reducing potential performance by reducing the effective MTU from >4000 to <2000. So even increasing bandwidth is of no use if you are potentially reducing performance end to end by reducing the effective maximum MTU of the path.
These performance gains are minimal at best, and probably completely offset by the delays introduced by the packet loss that the probing will cause for any connection that doesn't live close to forever. I'm not even going to bother commenting on your research link from production traffic in *1998*. -- Niels. -- "It's amazing what people will do to get their name on the internet, which is odd, because all you really need is a Blogspot account." -- roy edroso, alicublog.blogspot.com
On the contrary. You're proposing to fuck around with the one place on the whole Internet that has pretty clear and well adhered-to rules and expectations about MTU size supported by participants, and basically re-live the problems that MAE-East and other shared Ethernet/FDDI platforms with mismatched MTU sizes brought us during their existence.
Ok, there is another alternative. Peering points could offer a 1500 byte vlan and a 9000 byte vlan on existing peering points, and all new ones could be 9000 from the start. Then there is no "fucking around" with anything. You show up to the new peering point, your MTU is 9000, you are done. No messing with anything. Only SHORTENING MTUs in the middle causes PMTU problems; increasing them does not. And someone attempting to send frames larger than 1500 right now would see only a decrease in PMTU issues from such an increase in MTU at the peering points, not an increase.
These performance gains are minimal at best, and probably completely offset by the delays introduced by the packet loss that the probing will cause for any connection that doesn't live close to forever.
Huh? You don't need to do probing. You can simply operate in passive mode. Also, even if using active probing mode, the probing stops once the MTU is discovered. In passive mode there is no probing at all unless you hit a black hole. And the performance improvements are "minimal", I suppose, if you consider going from a maximum of 6.5Meg/sec for a transfer from LA to NY to 40Meg/sec for the same transfer to be "minimal".
From one of the earlier linked documents:
(quote) Let's take an example: New York to Los Angeles. Round Trip Time (rtt) is about 40 msec, and let's say packet loss is 0.1% (0.001). With an MTU of 1500 bytes (MSS of 1460), TCP throughput will have an upper bound of about 6.5 Mbps! And no, that is not a window size limitation, but rather one based on TCP's ability to detect and recover from congestion (loss). With 9000 byte frames, TCP throughput could reach about 40 Mbps. Or let's look at that example in terms of packet loss rates. Same round trip time, but let's say we want to achieve a throughput of 500 Mbps (half a "gigabit"). To do that with 9000 byte frames, we would need a packet loss rate of no more than 1x10^-5. With 1500 byte frames, the required packet loss rate is down to 2.8x10^-7! While the jumbo frame is only 6 times larger, it allows us the same throughput in the face of 36 times more packet loss. (end quote)

So if you consider a >5x performance boost to be "minimal", yeah, I guess. Or being able to operate at today's transfer rates in the face of 36x more packet loss to be a "minimal" improvement, I suppose.
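The quoted numbers are easy to reproduce with the loss-bounded throughput approximation from the same page (also quoted later in this thread): rate is roughly 0.7 * MSS / (RTT * sqrt(loss)), with MSS in bytes and the result in bytes per second. A quick Python sanity check using only the RTT and loss figures from the quote:

    from math import sqrt

    def tcp_rate_mbps(mss_bytes, rtt_s, loss):
        # loss-bounded TCP throughput approximation, result in Mbit/s
        return 0.7 * mss_bytes / (rtt_s * sqrt(loss)) * 8 / 1e6

    rtt, loss = 0.040, 0.001   # 40 ms RTT, 0.1% packet loss
    print("1500-byte MTU (MSS 1460):", round(tcp_rate_mbps(1460, rtt, loss), 1), "Mbit/s")   # ~6.5
    print("9000-byte MTU (MSS 8960):", round(tcp_rate_mbps(8960, rtt, loss), 1), "Mbit/s")   # ~39.7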
So if you consider a >5x performance boost to be "minimal", yeah, I guess. Or being able to operate at today's transfer rates in the face of 36x more packet loss to be a "minimal" improvement, I suppose.
And those improvements in performance get larger the longer the latency of the connection. For transit from US to APAC or Europe, the improvement would be even greater.
On 11/6/2010 7:21 PM, George Bonser wrote:
(quote) Let's take an example: New York to Los Angeles. Round Trip Time (rtt) is about 40 msec, and let's say packet loss is 0.1% (0.001). With an MTU of 1500 bytes (MSS of 1460), TCP throughput will have an upper bound of about 6.5 Mbps! And no, that is not a window size limitation, but rather one based on TCP's ability to detect and recover from congestion (loss). With 9000 byte frames, TCP throughput could reach about 40 Mbps.
I prefer much less packet loss in a majority of my transmissions, which in turn brings those numbers closer together. Jack
I prefer much less packet loss in a majority of my transmissions, which in turn brings those numbers closer together.
Jack
True, though the fact that it greatly reduces the number of packets in flight for a given amount of data gives a lot of benefit, particularly over high latency connections. Considering throughput <= ~0.7 * MSS / (rtt * sqrt(packet_loss)) (from http://sd.wareonearth.com/~phil/jumbo.html), and that packet loss to places such as China is often greater than zero, the benefit of an increased path MTU becomes obvious. Increase the latency from 20ms to 200ms and the benefit of the larger MSS grows even further.

The only real argument here against changing existing peering points is "all peers must have the same MTU". So far I haven't heard any real argument against it for a new peering point which is starting from a green field. It isn't going to change how anyone's network behaves internally, and increasing MTU doesn't produce PMTU issues for transit traffic. It just seems a shame that two servers with FDDI interfaces using SONET long haul are going to perform much better on a coast to coast transfer than a pair with GigE over an ethernet long haul, simply because of the MTU issue. Increasing the bandwidth of a path to GigE shouldn't result in reduced performance, but in this case it would. At least one peering point provider has offered to create a jumbo VLAN for experimentation.
On 11/6/2010 10:31 PM, Niels Bakker wrote:
* gbonser@seven.com (George Bonser) [Sun 07 Nov 2010, 04:27 CET]:
It just seems a shame that two servers with FDDI interfaces using SONET
Earth to George Bonser: IT IS NOT 1998 ANYMORE.
We don't fly SR-71s or use bigger MTU interfaces. Get with the times! :) Jack
-----Original Message----- From: Niels Bakker [mailto:niels=nanog@bakker.net] Sent: Saturday, November 06, 2010 8:32 PM To: nanog@nanog.org Subject: Re: RINA - scott whaps at the nanog hornets nest :-)
* gbonser@seven.com (George Bonser) [Sun 07 Nov 2010, 04:27 CET]:
It just seems a shame that two servers with FDDI interfaces using SONET
Earth to George Bonser: IT IS NOT 1998 ANYMORE.
Exactly my point. Why should we adopt newer technology while using configuration parameters that degrade performance? 1500 was designed for thick net. It is absolutely stupid to use it for GigE or higher speeds and I do mean absolutely idiotic. It is going backwards in performance. No wonder there is still so much transport using SONET. Using Ethernet reduces your effective performance over long distance paths.
* gbonser@seven.com (George Bonser) [Sun 07 Nov 2010, 04:27 CET]:
It just seems a shame that two servers with FDDI interfaces using SONET
Earth to George Bonser: IT IS NOT 1998 ANYMORE.
Exactly my point. Why should we adopt newer technology while using configuration parameters that degrade performance?
1500 was designed for thick net. It is absolutely stupid to use it for GigE or higher speeds and I do mean absolutely idiotic. It is going backwards in performance. No wonder there is still so much transport using SONET. Using Ethernet reduces your effective performance over long distance paths.
And by that I mean using 1500 MTU is what degrades the performance, not the ethernet physical transport. Using MTU 9000 would give you better performance than SONET. That is why Internet2 pushes so hard for people to use the largest possible MTU and the suggested MINIMUM is 9000.
On Sat, 6 Nov 2010, George Bonser wrote:
And by that I mean using 1500 MTU is what degrades the performance, not the ethernet physical transport. Using MTU 9000 would give you better performance than SONET. That is why Internet2 pushes so hard for people to use the largest possible MTU and the suggested MINIMUM is 9000.
I tried to get IEEE to go for a higher MTU on 100GE. Judging by the responses, this is never going to change. Also, if we're going to go for bigger MTUs, going from 1500 to 9000 is basically worthless, if we really want to do something, we should go for 64k or even bigger.

About 1500 MTU degrading performance, that's a TCP implementation issue, not really a network issue. Interrupt performance in end systems for high-speed transfers isn't really a general problem, and not until you reach speeds of several gigabit/s. Routers handle PPS just fine, this was "solved" long ago after we stopped using regular CPUs in them.

Increasing MTU on the Internet is not something driven by the end-users, so it's not going to happen in the near future. They are just fine with 1500 MTU. Higher MTU is a nice to have, not something that is seriously hindering performance on the Internet as it is today or in the next few tens of years. -- Mikael Abrahamsson email: swmike@swm.pp.se
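The interrupt/packet-rate point is easy to put numbers on: the per-packet work a host does scales with packets per second, not bits per second. A rough Python sketch, ignoring headers and assuming the link is saturated with maximum-size packets (the best case for the larger MTUs):

    RATES = {"1 Gbit/s": 1e9, "10 Gbit/s": 10e9}
    MTUS = (1500, 9000, 65536)   # bytes; 65536 is the "64k or bigger" case

    for name, bps in RATES.items():
        for mtu in MTUS:
            pps = bps / 8 / mtu          # packets per second at full line rate
            print(f"{name}, {mtu}-byte packets: ~{pps/1e3:,.0f} kpps")

A 9000-byte MTU cuts the per-packet load by roughly a factor of six and 64k by a factor of more than forty, which is why the gain only starts to matter once hosts are actually pushing multiple gigabits per second.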
Also, if we're going to go for bigger MTUs, going from 1500 to 9000 is basically worthless, if we really want to do something, we should go for 64k or even bigger.
I agree but we need to work with what we have. Practically everything currently appearing at a peering point will support 9000. Getting equipment that would support 64000 would be more difficult.
About 1500 MTU degrading performance, that's a TCP implementation issue, not really a network issue.
True, but TCP is what we are stuck with for right now. Different protocols could be developed to handle the small packets better.
Interrupt performance in end systems for high-speed transfers isn't really a general problem, and not until you reach speeds of several gigabit/s. Routers handle PPS just fine, this was "solved" long ago after we stopped using regular CPUs in them.
We are starting to move to 10Gig + peering connections. I have two 10G peering ports currently on order. "several gigabits/sec" is here today.
Increasing MTU on the Internet is not something driven by the end-users, so it's not going to happen in the near future.
It depends on what those end users are doing. If they are loading a web page, you are probably correct. If they are an enterprise user transferring log files from Hong Kong to New York, it makes a huge difference, particularly the moment a packet gets lost somewhere. At some point it becomes faster to put data on disks and fly them across the ocean than to try to transmit it by TCP with 1500 byte MTU.

Trying to explain to someone that they are not going to get any better performance on that daily data transfer from the far East by upgrading from 100 Mb to GigE is hard for them to understand, as it is a bit counter-intuitive. They believe 1G is "faster" than 100Meg, when it isn't. If you tell them they could get a faster file transfer rate by using an OC-3 with MTU 4000 than they would get by upgrading from 100Mb ethernet to GigE with MTU 1500, they just don't get it. In fact, telling them that they won't get one iota of improvement going from 100Mb to GigE doesn't make sense to them, because they believe GigE is "faster". It isn't faster, it is fatter. And yes, TCP is the limiting factor, but we are stuck with it for now. I can't change what protocol is being used, but I can change the MTU of the existing protocol.

I believe the demand for such high-bandwidth streams is going to greatly increase. This is particularly true as people move out of academic environments where they are used to working on Abilene (Inet2) and move into industry, and the programs they built won't work because it takes two days to send one day's worth of data.

There are end users and there are end users. It depends on the sort of end user you are talking about and what they are doing. If they are watching TV, they might want a higher MTU. If they are on Twitter, they don't care. Industry end users will have different requirements from residential end users.
They are just fine with 1500 MTU. Higher MTU is a nice to have, not something that is seriously hindering performance on the Internet as it is today or in the next few tens of years.
I disagree with that statement because I believe that the next few years will see an increased demand for high-bandwidth traffic that needs to be delivered quickly (HDTV from Tokyo to London, for example). One of the reasons people aren't interested is because they don't know. They are ignorant in most cases. They just know "bandwidth". They believe that if they get a fatter pipe, it will improve their viewing of that Australian porn. Then they pay for the upgrade and it doesn't change a thing. It doesn't change the data transfer rate at all. They go from a 10Meg to a 100Meg pipe and that file *still* transfers at 3Meg/sec. If they could increase the MTU to 9000, they might get 15Meg/sec.

But you are correct, going to an even higher MTU is what is really needed, but going for what is attainable is the first step. Everyone can physically do 9000 at the peering points (or at least as far as I am aware they can), and the only thing that is preventing that is just not wanting to, because they don't fully appreciate the benefit and believe it might break something. Increasing MTU never breaks PMTUD. PMTUD is only needed because something in the path has a *smaller* MTU than the end points. The end points don't care if the path in between has a larger MTU.
So, a question I don't want to forget between now and when I wake up (since it's late in my neck of the woods)... Has any work been done with >1500 MTU on 802.11 links? Is it feasible, or even possible? I'm in the middle of rolling out a WISP in an area, and it dawned on me I never even considered this aspect of the MTU issue. -- Brielle Bruns http://www.sosdg.org / http://www.ahbl.org -----Original Message----- From: "George Bonser" <gbonser@seven.com> Date: Sun, 7 Nov 2010 00:19:03 To: <nanog@nanog.org> Subject: RE: RINA - scott whaps at the nanog hornets nest :-)
On Sun, 7 Nov 2010, George Bonser wrote:
True, but TCP is what we are stuck with for right now. Different protocols could be developed to handle the small packets better.
We're not "stuck" with TCP, TCP is being developed all the time. http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm
We are starting to move to 10Gig + peering connections. I have two 10G peering ports currently on order. "several gigabits/sec" is here today.
I was talking about end users, not network.
It depends on what those end users are doing. If they are loading a web page, you are probably correct. If they are an enterprise user transferring log files from Hong Kong to New York, it makes a huge difference, particularly the moment a packet gets lost somewhere. At some point it becomes faster to put data on disks and fly them across the ocean than to try to transmit it by TCP with 1500 byte MTU. Trying
Oh, come on. Get real. The world TCP speed record is 10GE right now, it'll go higher as soon as there are higher interface speeds to be had. I can easily get 100 megabit/s long-distance between two linux boxes without tweaking the settings much.
I disagree with that statement because I believe that the next few years will see an increased demand for high-bandwidth traffic that needs to be delivered quickly (HDTV from Tokyo to London, for example).
MTU and "quickly" have very little to do with each other.
3Meg/sec. If they could increase the MTU to 9000, they might get 15Meg/sec.
Or they might tweak some other TCP settings and get 30 meg/s with existing 1500 MTU. It's WAY easier to tweak existing TCP than trying to get the whole network to go to a higher MTU. We do 4470 internally and on peering links where the other end agrees, but getting it to work all the way to the end customer isn't really easy. As with IPv6, doing the core is easy, doing the access is much harder.
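Those "other TCP settings" are mostly window and buffer sizes: to keep a long path full, the socket buffers have to cover the bandwidth-delay product. A small sketch with purely illustrative rates and RTTs:

    def bdp_bytes(rate_bps, rtt_s):
        # bandwidth-delay product: bytes that must be "in flight" to fill the pipe
        return rate_bps / 8 * rtt_s

    for rate_bps, rtt_s in [(100e6, 0.040), (1e9, 0.200)]:
        mib = bdp_bytes(rate_bps, rtt_s) / 2**20
        print(f"{rate_bps/1e6:.0f} Mbit/s at {rtt_s*1000:.0f} ms RTT needs ~{mib:.1f} MiB of window")

Modern stacks autotune these buffers, which is a large part of why "tweak the settings" can beat waiting for the whole path to grow a bigger MTU.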
it might break something. Increasing MTU never breaks PMTUD. PMTUD is only needed because something in the path has a *smaller* MTU than the end points. The end points don't care if the path in between has a larger MTU.
But in a transition some end systems will have 9000 MTU and some parts of the network will have smaller, so then you get problems. -- Mikael Abrahamsson email: swmike@swm.pp.se
Oh, come on. Get real. The world TCP speed record is 10GE right now, it'll go higher as soon as there are higher interface speeds to be had.
You can buy 100G right now. I believe some 40G is available, too. Also, check this: http://media.caltech.edu/press_releases/13216 That was in 2008.
I can easily get 100 megabit/s long-distance between two linux boxes without tweaking the settings much.
Until you drop a packet. I can get 100 Megabits/sec with UDP without tweaking it at all. Getting 100Meg/sec San Francisco to London is a challenge over a typical Internet path (i.e. not a dedicated leased path).
Or they might tweak some other TCP settings and get 30 meg/s with existing 1500 MTU. It's WAY easier to tweak existing TCP than trying to get the whole network to go to a higher MTU. We do 4470 internally and on peering links where the other end agrees, but getting it to work all the way to the end customer isn't really easy.
I guess you didn't read the links earlier. It has nothing to do with stack tweaks. The moment you lose a single packet, you are toast. And there is a limit to how much you can buffer because at some point it becomes difficult to locate a packet to resend. *If* you have a perfect path, sure, but that is generally not available, particularly to APAC.
But in a transition some end systems will have 9000 MTU and some parts of the network will have smaller, so then you get problems.
Which is no different than end systems that have 9000 today. A lot of networks run jumbo frames internally now, maybe a lot more than you realize. When you are using NFS and iSCSI and other things like database queries that return large output, large MTUs save you a lot of packets. NFS reads in 8K chunks, which easily fit in a 9000 byte packet. It is more common in enterprise and academic networks than you might be aware.
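The NFS arithmetic is simple enough to show: an 8 KiB read payload plus some RPC/TCP/IP overhead fits in one jumbo segment but has to be split across several 1500-byte ones. A rough sketch; the 100-byte RPC overhead figure is just an assumed round number:

    from math import ceil

    READ_CHUNK = 8 * 1024     # the 8K NFS read size mentioned above
    RPC_OVERHEAD = 100        # assumed ONC RPC + NFS header overhead, roughly
    payload = READ_CHUNK + RPC_OVERHEAD

    for mtu in (1500, 9000):
        mss = mtu - 40        # IPv4 + TCP headers
        print(f"MTU {mtu}: {ceil(payload / mss)} segment(s) per 8K read")
    # MTU 1500: 6 segments per read; MTU 9000: 1 segment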
On Sun, 7 Nov 2010, George Bonser wrote:
I guess you didn't read the links earlier. It has nothing to do with stack tweaks. The moment you lose a single packet, you are toast. And
TCP SACK. I'm too tired to correct your other statements that lack basis in reality (or at least in my reality). -- Mikael Abrahamsson email: swmike@swm.pp.se
On Sun, 7 Nov 2010, George Bonser wrote:
I guess you didn't read the links earlier. It has nothing to do with stack tweaks. The moment you lose a single packet, you are toast. And
TCP SACK.
I'm too tired to correct your other statements that lack basis in reality (or at least in my reality).
But the point being: why should everyone be forced into making multiple tweaks to their stacks to accommodate a worst case, when a single change (and possibly a change in total buffer size) is all that is needed to get improved performance globally? With modern PMTUD nearly universally supported at this point, it just isn't as big of an issue as it was, say, 5 years ago, and it does seem to be a very inexpensive change that offers a large benefit. It will happen on its own as more and more networks configure internally for larger frames and as more people migrate out of academia where 9000 is the norm these days into industry.
On 7 Nov 2010, at 08:24, George Bonser wrote:
It will happen on its own as more and more networks configure internally for larger frames and as more people migrate out of academia where 9000 is the norm these days into industry.
I used to run a large academic network; there was a vanishingly small incidence of edge ports supporting >1500byte MTU. It's possibly even more tricky than the IX situation to support in an environment where you commonly have mixed devices at different speeds (most 100mbit devices will not support >1500) on a single L2, often under different administrative control.
I used to run a large academic network; there was a vanishingly small incidence of edge ports supporting >1500byte MTU. It's possibly even more tricky than the IX situation to support in an environment where you commonly have mixed devices at different speeds (most 100mbit devices will not support >1500) on a single L2, often under different administrative control.
At the edge, sure. There are all sorts of problems there. The two major ones: first, much of America still uses dialup or some form of PPPoE that has a <1500 MTU anyway; second, at the other end, the large content providers are generally behind load balancers that often don't support jumbo frames. So if you are talking to an ISP that serves residential customers or "eyeballs" that are viewing content from the major portals, it makes no sense. But if you are talking about data between corporate data centers, or from one company to another where they are ethernet end to end, the picture changes.

Dykstra's note of that 1998 study showed that while the majority of the *packets* were <1500, the majority of the *data bytes* were in packets >1500. So considerably more than 50% of the packets were moving <50% of the data. But the networks that are now running >1500 internally can't talk to each other with those packet sizes across the general Internet until the longer haul path supports it, and again you are talking about a small number of end points sending large amounts of data. It will work itself out, but it will probably be consumer demand for higher performance data streams that finally does it. (Only awake to make sure nothing goes bonkers during the time change.)
On 11/7/2010 3:45 AM, Will Hargrave wrote:
I used to run a large academic network; there was a vanishingly small incidence of edge ports supporting >1500byte MTU.
I run a moderately sized academic network, and know some details of our "other" campus infrastructure (some larger, some smaller). We have two chassis that could do L3 >1500, perhaps 10 with some upgrades. And perhaps a quarter of our switches could do L2 >1500 (we have a lot of older cheap gear at the access layer). The only "demand" for >1500 is iSCSI or FCoE, though I can see a "need" for >1500 backup traffic off the server farms. We have >1500 enabled in those areas, but it's rather localized and not on the consumer side of the network. There are also the computing clusters in the mix, but again that is localized. There are enough headaches getting the marginal 1500s over the various encapsulations, tagging, tunneling, VPNs, etc. I would have to agree on the small edge population even capable of >1500. Jeff
I guess you didn't read the links earlier. It has nothing to do with stack tweaks. The moment you lose a single packet, you are toast. And
TCP SACK.
Certainly helps but still has limitations. If you have too many packets in flight, it can take too long to locate the SACKed packet in some implementations, and this can cause a TCP timeout and reset the window to 1. It varies from one implementation to another. The above was for some implementations of Linux. The larger the window (high speed, high latency paths) the worse this problem is. In other words, sure, you can get great performance but when you hit a lost packet, depending on which packet is lost, you can also take a huge performance hit depending on who is doing the talking or what they are talking to.

Common advice on stack tuning: "for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK." Even if you don't have such a system, you might be talking to one.

But anyway, I still think 1500 is a really dumb MTU value for modern interfaces and unnecessarily retards performance over long distances.
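For reference, the knobs the quoted advice is talking about live under net.ipv4 on Linux: the third field of tcp_rmem and tcp_wmem is the autotuning ceiling that the "restrict the buffer to about 12 MB" suggestion caps, and tcp_sack turns SACK on or off. A small read-only sketch; the 12 MB figure is the one from the quote, not a recommendation:

    CAP = 12 * 1024 * 1024      # the ~12 MB ceiling mentioned in the quoted advice

    for knob in ("tcp_rmem", "tcp_wmem", "tcp_sack"):
        with open(f"/proc/sys/net/ipv4/{knob}") as f:
            print(f"net.ipv4.{knob} = {f.read().strip()}")

    # Applying the cap (as root) would mean writing "<min> <default> 12582912"
    # back into tcp_rmem and tcp_wmem, e.g.:
    # with open("/proc/sys/net/ipv4/tcp_rmem", "w") as f:
    #     f.write("4096 87380 12582912\n")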
On Sun, Nov 07, 2010 at 01:42:33AM -0700, George Bonser wrote:
I guess you didn't read the links earlier. It has nothing to do with stack tweaks. The moment you lose a single packet, you are toast. And
TCP SACK.
Certainly helps but still has limitations. If you have too many packets in flight, it can take too long to locate the SACKed packet in some implementations, and this can cause a TCP timeout and reset the window to 1. It varies from one implementation to another. The above was for some implementations of Linux. The larger the window (high speed, high latency paths) the worse this problem is. In other words, sure, you can get great performance but when you hit a lost packet, depending on which packet is lost, you can also take a huge performance hit depending on who is doing the talking or what they are talking to.
Common advice on stack tuning: "for very large BDP paths where the TCP window is > 20 MB, you are likely to hit the Linux SACK implementation problem. If Linux has too many packets in flight when it gets a SACK event, it takes too long to locate the SACKed packet, and you get a TCP timeout and CWND goes back to 1 packet. Restricting the TCP buffer size to about 12 MB seems to avoid this problem, but clearly limits your total throughput. Another solution is to disable SACK." Even if you don't have such a system, you might be talking to one.
Do you know if any work is being done on resolving this problem? It seems that work in that area might be more fruitful than banging your head against increasing the MTU.
But anyway, I still think 1500 is a really dumb MTU value for modern interfaces and unnecessarily retards performance over long distances.
Subject: RE: RINA - scott whaps at the nanog hornets nest :-) Date: Sat, Nov 06, 2010 at 08:38:33PM -0700 Quoting George Bonser (gbonser@seven.com):
No wonder there is still so much transport using SONET. Using Ethernet reduces your effective performance over long distance paths.
The only reason to use (10)GE for transmission in WAN is the completely baroque price difference in interface pricing. With todays line rates, the components and complexity of a line card are pretty much equal between SDH and GE. There is no reason to overcharge for the better interface except because they (all vendors do this) can. We've just ordered a new WAN to be built, and we're going with GE over (mostly) WDM because the interface prices are like six times higher per megabit for SDH (which would have cost roughly the same per line, given that it is quite OK to run SDH without the SDH equipment, just using WDM). Oh, s/SDH/SONET/ on the above, but I'm in Europe, so.. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE +46 705 989668 If I am elected, the concrete barriers around the WHITE HOUSE will be replaced by tasteful foam replicas of ANN MARGARET!
The only reason to use (10)GE for transmission in WAN is the completely baroque price difference in interface pricing. With todays line rates, the components and complexity of a line card are pretty much equal between SDH and GE. There is no reason to overcharge for the better interface except because they (all vendors do this) can.
Yes, I really don't understand that either. You would think that the investment in developing and deploying all that SONET infrastructure has been paid back by now and they can lower the prices dramatically. One would think the vendors would be practically giving it away, particularly if people understood the potential improvement in performance, though the difference between 1500 and 4000 is probably not all that much except on long distance ( >2000km ) paths.
On Sun, Nov 07, 2010 at 12:34:56AM -0700, George Bonser wrote:
Yes, I really don't understand that either. You would think that the investment in developing and deploying all that SONET infrastructure has been paid back by now and they can lower the prices dramatically. One would think the vendors would be practically giving it away, particularly if people understood the potential improvement in performance, though the difference between 1500 and 4000 is probably not all that much except on long distance ( >2000km ) paths.
Careful, you're rapidly working your way up to nanog kook status with these absurd claims based on no logic whatsoever. -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
Yes, I really don't understand that either. You would think that the investment in developing and deploying all that SONET infrastructure has been paid back by now and they can lower the prices dramatically. One would think the vendors would be practically giving it away, particularly if people understood the potential improvement in performance, though the difference between 1500 and 4000 is probably not all that much except on long distance ( >2000km ) paths.
Careful, you're rapidly working your way up to nanog kook status with these absurd claims based on no logic whatsoever.
My apologies. It just seemed to me that the investment in SONET, particularly the lower data rates, should be pretty much paid back by now. How long has OC-12 been around? I can understand a certain amount of premium for something that doesn't sell as much but the difference in prices can be quite amazing in some markets. Some differential might be justified but why so much? An OC-12 SFP optic costs nearly $3,000 from one vendor, list. Their list price for a GigE SFP optical module is about 30% of that. What is it about the optic module that would cause it to be 3 times as expensive for an interface with half the bandwidth? A 4-port OC-12 module is $37,500 list. A 4-port 10G module is $10,000 less for 10x the bandwidth. In other words, what is the differential in the manufacturing costs of those? I don't believe it is as much as the differential in the selling price.
On Sun, 7 Nov 2010 01:07:17 -0700 "George Bonser" <gbonser@seven.com> wrote:
Yes, I really don't understand that either. You would think that the investment in developing and deploying all that SONET infrastructure has been paid back by now and they can lower the prices dramatically. One would think the vendors would be practically giving it away, particularly if people understood the potential improvement in performance, though the difference between 1500 and 4000 is probably not all that much except on long distance ( >2000km ) paths.
Careful, you're rapidly working your way up to nanog kook status with these absurd claims based on no logic whatsoever.
My apologies. It just seemed to me that the investment in SONET, particularly the lower data rates, should be pretty much paid back by now. How long has OC-12 been around? I can understand a certain amount of premium for something that doesn't sell as much but the difference in prices can be quite amazing in some markets. Some differential might be justified but why so much?
An OC-12 SFP optic costs nearly $3,000 from one vendor, list. Their list price for a GigE SFP optical module is about 30% of that. What is it about the optic module that would cause it to be 3 times as expensive for an interface with half the bandwidth? A 4-port OC-12 module is $37,500 list. A 4-port 10G module is $10,000 less for 10x the bandwidth.
In other words, what is the differential in the manufacturing costs of those? I don't believe it is as much as the differential in the selling price.
Once the base manufacturing cost is covered, supply and demand dictate the price, a.k.a. "you charge what the market will bear." As long as at least one person/organisation continues to pay SONET/SDH pricing, that's what will be charged.
Subject: RE: RINA - scott whaps at the nanog hornets nest :-) Date: Sun, Nov 07, 2010 at 12:34:56AM -0700 Quoting George Bonser (gbonser@seven.com):
Yes, I really don't understand that either. You would think that the investment in developing and deploying all that SONET infrastructure has been paid back by now and they can lower the prices dramatically. One would think the vendors would be practically giving it away, particularly if people understood the potential improvement in performance, though the difference between 1500 and 4000 is probably not all that much except on long distance ( >2000km ) paths.
Even if larger MTUen are interesting (but most of the time not worth the work) the sole reason I like SDH as my WAN technology is the presence of signalling -- so that both ends of a link are aware of its status near-instantly (via protocol parts like RDI etc). In GE it is legal to not receive any packets, which means that "oblivious" is a possible state for such a connection. With associated routing implications. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE +46 705 989668 Is this the line for the latest whimsical YUGOSLAVIAN drama which also makes you want to CRY and reconsider the VIETNAM WAR?
Even if larger MTUen are interesting (but most of the time not worth the work) the sole reason I like SDH as my WAN technology is the presence of signalling -- so that both ends of a link are aware of its status near-instantly (via protocol parts like RDI etc). In GE it is legal to not receive any packets, which means that "oblivious" is a possible state for such a connection. With associated routing implications.
I wasn't talking about changing anything at any of the edges. The idea was just to get the "middle" portion of the internet, the peering points to a place that would support frames larger than 1500. It is practically impossible for anyone to send such a packet off-net until that happens. There was nothing that said everyone should change to a higher MTU. I was saying that there are cases where it can be useful for certain types of transfers, but the state of today's internet is that you can't do it even if you want to, except by special arrangement.

Considering the state of today's modern hardware, there isn't a technical reason why those points can't be set to handle larger packets should one come along. That's all. I wasn't suggesting everyone set their home system for a larger MTU, I was suggesting that the peering points be able to handle them should one pass through. Now I agree, on an existing exchange, having a "flag day" for everyone to change might not be worthwhile, but on a new exchange where you have a green field, there is no reason to limit the MTU at that point to 1500. Having a larger MTU in the middle of the path does not introduce PMTUD issues. PMTUD issues are introduced by having a smaller MTU somewhere in the middle of the path.

The conversation was quickly dragged into areas other than what the suggestion was about. What was interesting was the email I got from people who need to move a lot of science and engineering data on a daily basis, who said their networking people didn't "get it" either and it is causing them problems. Not everyone is going to need to use large frames. But people who do need them can't use them, and there really isn't a technical reason for that.

That specific portion of the Internet, the peering points between networks, carries traffic from all sorts of users, not just people at home with their twitter app open. Enabling the passage of larger packets doesn't mean advocating that everyone use them or changing anyone's customer edge configuration. It wouldn't change anyone's routing, and wouldn't impact anyone's PMTUD problems. I don't believe that is "kooky". A lot of other people have been calling for the same thing for quite some time. But making a network "jumbo clean" doesn't do a lot of good if the peering points are the bottleneck. That's all. Removing that bottleneck is all that the suggestion was about.
Subject: RE: RINA - scott whaps at the nanog hornets nest :-) Date: Mon, Nov 08, 2010 at 08:53:47AM -0800 Quoting George Bonser (gbonser@seven.com):
Even if larger MTUen are interesting (but most of the time not worth the work) the sole reason I like SDH as my WAN technology is the presence of signalling -- so that both ends of a link are aware of its status near-instantly (via protocol parts like RDI etc). In GE it is legal to not receive any packets, which means that "oblivious" is a possible state for such a connection. With associated routing implications.
I wasn't talking about changing anything at any of the edges. The idea was just to get the "middle" portion of the internet, the peering points to a place that would support frames larger than 1500. It is practically impossible for anyone to send such a packet off-net until that happens.
Know what? We have not one, but five or so Internet Exchange points in Sweden, where there are 802.1q VLANs set up for higher MTU (4470 for hysterical raisins). My impression is that people use them, but I'm also being informed by statistics that there is a _very_ steep drop in packet count vs size once 1500 is reached. It is set up, but the edge is where packets are made, not the core. Thus, no one can send large packets.

Anyway. I'd concur that links where routers exchange very large routing tables benefit from PMTUD (most) and larger MTU (to some degree), but I'd argue that most IXPen see few prefixes per peering, up to a few thousand max. The large tables run via PNI and paid transit, as well as iBGP. There, I've seen drastic improvements in convergence time once PMTUD was introduced and arcane MSS defaults dealt with. MTU mattered not much.

Given this empirical data, clearly pointing to the fact that It Does Not Matter, I think we can stop this nonsense now. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE +46 705 989668 I was making donuts and now I'm on a bus!
On 11/8/2010 12:36 PM, Mans Nilsson wrote:
I'd concur that links where routers exchange very large routing tables benefit from PMTUD (most) and larger MTU (to some degree), but I'd argue that most IXPen see few prefixes per peering, up to a few thousand max. The large tables run via PNI and paid transit, as well as iBGP. There, I've seen drastic improvements in convergence time once PMTUD was introduced and arcane MSS defaults dealt with. MTU mattered not much.
Given this empirical data, clearly pointing to the fact that It Does Not Matter, I think we can stop this nonsense now.
His point wasn't to benefit the BGP routers at the IX, but to support those who need to transmit >1500 size packets and have the ability to create them on the edge. In particular, the impact of running long distances (high latency) with higher packet drop probability. In such a scenario, it does matter. The fact that you don't see that many >1500 byte packets doesn't imply that it doesn't matter. I have v6 peerings and see very little traffic on them compared to v4. Should I then state that v6 doesn't matter? If people have an expectation of not making it through core networks at >1500, they won't bother trying to send >1500. If the IX doesn't support >1500, why would people connecting to the IX care if their backbones support >1500?
Jack
On Mon, 08 Nov 2010 19:36:49 +0100, Mans Nilsson said:
Given this empirical data, clearly pointing to the fact that It Does Not Matter, I think we can stop this nonsense now.
That's right up there with the sites that blackhole their abuse@ address, and then claim they never actually see any complaints. Or forcing NAT at the edge, and saying "The fact we get no complaints means It Does Not Matter", ignoring SCTP and similar use cases where it *does* matter. If in fact It Does Not Matter, why did the Internet2 folks make any effort to support 9000 end-to-end? http://proj.sunet.se/LSR2/index.html says they used an MTU of 4470, and then adds "and we used only about half the MTU size (which generates heavier CPU-load on the end-hosts)", which pretty much implies the previous record was at 9000 or so. So there's empirical data that It Does Indeed Matter (at least to some people).
On 08/11/2010 21:51, Valdis.Kletnieks@vt.edu wrote:
So there's empirical data that It Does Indeed Matter (at least to some people).
It certainly does. However, there is lots more empirical data to suggest that It Does Not Matter to most service providers. We tried introducing it to INEX several years ago. Of 40-something connected parties, only one was really interested enough to do something about it. Another indicated interest but then pulled back when they realised a) the amount of work it would take to support it across their network, b) the scope for painful breakage if they accidentally got something wrong somewhere and c) the benefit they would get from it.

Probably the most interesting aspect was the cost / benefit analysis. On the one hand, there was little to no benefit for end-users and hosted services on the commercial ISP that showed interest. However, the NREN which was interested could have made real use out of it, in terms of dealing with very high-speed single-stream data transfers.

Anyway, all of the arguments for it, both pro and con, have been rehashed on this thread. The bottom line is that for most companies, it simply isn't worth the effort, but that for some NRENs, it is. Let's move on now. Nick
On 11/8/2010 4:08 PM, Nick Hilliard wrote:
Anyway, all of the arguments for it, both pro and con, have been rehashed on this thread. The bottom line is that for most companies, it simply isn't worth the effort, but that for some NRENs, it is.
I think a lot of that is misinformation and confusion. A company looks at it and thinks of the issues deploying it to end users, and misses the benefits of deploying it at the core only, handling special requests. This is especially true for hosting companies, where a majority of connections to servers need to stay at a low MTU to keep things streamlined, but in specific cases the MTU could be increased for things such as cross-country backups. Many servers can handle these dual MTU setups. A larger MTU is beneficial when someone controls the two endpoints and has a use for it. They can request the larger MTU connection from their providers/datacenters, but if the core systems aren't supporting it, they'll die miserably. Jack
Subject: Re: RINA - scott whaps at the nanog hornets nest :-) Date: Mon, Nov 08, 2010 at 10:08:53PM +0000 Quoting Nick Hilliard (nick@foobar.org):
On 08/11/2010 21:51, Valdis.Kletnieks@vt.edu wrote:
So there's empirical data that It Does Indeed Matter (at least to some people).
Anyway, all of the arguments for it, both pro and con, have been rehashed on this thread. The bottom line is that for most companies, it simply isn't worth the effort, but that for some NRENs, it is.
And NREN-NREN traffic typically does not traverse commercial IXen. (Even though ISTR Sunet and Nordunet having peerings configured on Netnod.) Instead, for empire-building reasons or job security or "No research project is complete unless the professor gets a new laptop and a wavelength to CERN (or in the USA, pick a DoE site) from the project money", NRENs build their own... I am convinced that some applications actually benefit from this, though.
Let's move on now.
Indeed. -- Måns Nilsson primary/secondary/besserwisser/machina MN-1334-RIPE +46 705 989668 Yow! I want my nose in lights!
Once upon a time, Valdis.Kletnieks@vt.edu <Valdis.Kletnieks@vt.edu> said:
That's right up there with the sites that blackhole their abuse@ address, and then claim they never actually see any complaints.
What about telcos that disable error counters and then say "we don't see any errors"? -- Chris Adams <cmadams@hiwaay.net> Systems and Network Administrator - HiWAAY Internet Services I don't speak for anybody but myself - that's enough trouble.
* gbonser@seven.com (George Bonser) [Mon 08 Nov 2010, 17:54 CET]:
I wasn't talking about changing anything at any of the edges. The idea was just to get the "middle" portion of the internet, the peering points to a place that would support frames larger than 1500. It is practically impossible for anyone to send such a packet off-net until that happens.
If you think peering points are the "middle" portion of the internet that all packets have to traverse, then this thread is beyond hope. -- Niels.
On Sun, Nov 07, 2010 at 08:02:28AM +0100, Mans Nilsson wrote:
The only reason to use (10)GE for transmission in WAN is the completely baroque price difference in interface pricing. With todays line rates, the components and complexity of a line card are pretty much equal between SDH and GE. There is no reason to overcharge for the better interface except because they (all vendors do this) can.
To be fair, there are SOME legitimate reasons for a cost difference. For example, ethernet has very high overhead on small packets and tops out at 14.8Mpps over 10GE, whereas SONET can do 7 bytes of overhead for your PPP/HDLC and FCS etc and easily end up doing well over 40Mpps of IP packets. The cost of the lookup ASIC that only has to support the Ethernet link is going to be a lot cheaper, or let you handle a lot more links on the same chip. At this point it's only half price gouging of the silly telco customers with money to blow. There really are significant cost savings for the vendors in using the more popular and commoditized technology, even though it may be technically inferior. Think of it like the old IDE vs SCSI wars, when enough people get onboard with the cheaper inferior technology, eventually they start shoehorning on all the features and functionality that you wanted from the other one in the first place. :) -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
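For anyone wondering where the 14.8Mpps figure comes from, it is just minimum-size framing arithmetic: at 64-byte frames, Ethernet also spends 8 bytes of preamble/SFD and 12 bytes of inter-frame gap per packet, while POS spends only a handful of bytes on PPP/HDLC framing. A rough Python sketch; the POS figure ignores SONET path overhead, so it is optimistic:

    LINK_BPS = 10e9                      # 10 Gbit/s of raw line rate

    eth_min_bits = (64 + 8 + 12) * 8     # min frame + preamble/SFD + inter-frame gap
    pos_min_bits = (40 + 7) * 8          # 40-byte TCP/IP packet + ~7 bytes PPP/HDLC/FCS

    print(f"10GE, 64-byte frames : {LINK_BPS / eth_min_bits / 1e6:.1f} Mpps")   # ~14.9
    print(f"POS, 40-byte packets : {LINK_BPS / pos_min_bits / 1e6:.1f} Mpps")   # ~26.6

So a lookup engine behind a 10GE port never has to cope with much more than roughly 15Mpps, which is part of the cost saving being described.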
On Sun, 7 Nov 2010 01:49:20 -0600 Richard A Steenbergen <ras@e-gerbil.net> wrote:
On Sun, Nov 07, 2010 at 08:02:28AM +0100, Mans Nilsson wrote:
The only reason to use (10)GE for transmission in the WAN is the completely baroque difference in interface pricing. With today's line rates, the components and complexity of a line card are pretty much equal between SDH and GE. There is no reason to overcharge for the better interface other than because they (all vendors do this) can.
To be fair, there are SOME legitimate reasons for a cost difference. For example, ethernet has very high overhead on small packets and tops out at 14.8Mpps over 10GE, whereas SONET can do 7 bytes of overhead for your PPP/HDLC, FCS, etc. and easily end up doing well over 40Mpps of IP packets. The lookup ASIC that only has to support the Ethernet link is going to be a lot cheaper, or will let you handle a lot more links on the same chip.
At this point it's only half price gouging of the silly telco customers with money to blow. There really are significant cost savings for the vendors in using the more popular and commoditized technology, even though it may be technically inferior. Think of it like the old IDE vs SCSI wars: when enough people get on board with the cheaper, inferior technology, eventually they start shoehorning on all the features and functionality that you wanted from the other one in the first place. :)
That sounds a lot like the "Worse is Better" argument - "The Rise of ``Worse is Better''", http://www.jwz.org/doc/worse-is-better.html. This quote would be quite applicable to Ethernet - "The lesson to be learned from this is that it is often undesirable to go for the right thing first. It is better to get half of the right thing available so that it spreads like a virus. Once people are hooked on it, take the time to improve it to 90% of the right thing." I think ethernet gaining OAM would be an example of improving to 90% of the right thing (15 or so years after being invented and deployed), while those technologies that tried to be right from the outset (token ring, ATM etc.) have disappeared or are disappearing.
On Sat, Nov 6, 2010 at 5:21 PM, George Bonser <gbonser@seven.com> wrote: ...
(quote) Let's take an example: New York to Los Angeles. Round Trip Time (rtt) is about 40 msec, and let's say packet loss is 0.1% (0.001). With an MTU of 1500 bytes (MSS of 1460), TCP throughput will have an upper bound of about 6.5 Mbps! And no, that is not a window size limitation, but rather one based on TCP's ability to detect and recover from congestion (loss). With 9000 byte frames, TCP throughput could reach about 40 Mbps.
I'd like to order a dozen of those 40ms RTT LA to NYC wavelengths, please. If you could just arrange a suitable demonstration of packet-level delivery time of 40ms from Los Angeles to New York and back, I'm sure there would be a *long* line of people behind me, checks in hand. ^_^ Matt
I'd like to order a dozen of those 40ms RTT LA to NYC wavelengths, please.
If you could just arrange a suitable demonstration of packet-level delivery time of 40ms from Los Angeles to New York and back, I'm sure there would be a *long* line of people behind me, checks in hand. ^_^
Matt
Yeah, he must have goofed on that. The 40ms must be the one-way time, not the RTT. I get a pretty consistent 80ms to NY from California.
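A quick sanity check on the quoted figures: they are consistent with the well-known Mathis et al. loss-based approximation, throughput <= C * MSS / (RTT * sqrt(loss)). A minimal sketch, assuming C = 0.7 (roughly Reno with delayed ACKs); that constant is an assumption on my part, not something the quoted text states:

from math import sqrt

def mathis_bps(mss_bytes, rtt_s, loss, c=0.7):
    """Rough upper bound on single-flow TCP throughput, in bits/second."""
    return c * mss_bytes * 8 / (rtt_s * sqrt(loss))

for mss, rtt in [(1460, 0.040), (8960, 0.040), (1460, 0.080)]:
    print("MSS %5d B, RTT %3d ms: %5.1f Mbps"
          % (mss, rtt * 1000, mathis_bps(mss, rtt, 0.001) / 1e6))

# MSS  1460 B, RTT  40 ms:   6.5 Mbps  (the quoted 1500-byte-MTU case)
# MSS  8960 B, RTT  40 ms:  39.7 Mbps  (the quoted 9000-byte-MTU case)
# MSS  1460 B, RTT  80 ms:   3.2 Mbps  (with the more realistic coast-to-coast RTT)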
On Sat, Nov 06, 2010 at 03:49:19PM -0700, George Bonser wrote:
When the TCP/IP connection is opened between the routers for a routing session, they should each send the other an MSS value that says how large a packet they can accept. You already have that information available. TCP provides that negotiation for directly connected machines.
You're proposing that routers should dynamically alter the interface MTU based on the TCP MSS value they receive from an EBGP neighbor? I barely know where to begin, but first off MSS is not MTU, it is only loosely related to MTU. MSS is affected by TCP options (window scale, sack, MD5 authentication, etc), and MSS between routers can be set to any value a user chooses. There is absolutely no guarantee that MSS is going to lead to a correct guess at the MTU. Also, many routers still default to having PMTUD turned off; would you suggest that they should set the physical interface MTU to 576 based on that? :) And alas, it's one hell of a layer violation too. A negotiation protocol is needed, but you could argue about where it should be for days. Maybe at the physical layer as part of auto-negotiation, maybe at the L3<->L2 layer (i.e. negotiate it per IP as part of arp or neighbor discovery), hell maybe even in BGP, but keyed off MSS is way over the top. :)
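To make the MSS-vs-MTU point concrete: the MSS a host advertises is normally its MTU minus the fixed IP and TCP headers, the payload it actually gets per segment shrinks further once options are in play, and the value can be clamped to anything by configuration, so you cannot walk backwards from an observed MSS to a neighbor's interface MTU. A minimal sketch using the standard fixed header sizes (the option byte counts are illustrative):

def advertised_mss(mtu, ipv6=False):
    """MSS a host would normally advertise: MTU minus fixed IP and TCP headers."""
    return mtu - (40 if ipv6 else 20) - 20

def payload_per_segment(mss, option_bytes):
    """Usable data per segment once TCP options (timestamps, SACK blocks,
    MD5 signatures, ...) take their share."""
    return mss - option_bytes

print(advertised_mss(1500))                # 1460  (IPv4, 1500 MTU)
print(advertised_mss(1500, ipv6=True))     # 1440  (IPv6, 1500 MTU)
print(advertised_mss(9000))                # 8960  (IPv4, 9000 MTU)
print(payload_per_segment(1460, 12))       # 1448  (e.g. timestamps enabled)
# Many different MTU/option/clamping combinations can produce the same
# number on the wire, which is why MSS is a poor stand-in for MTU.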
Again, nothing changes from the current method of operating. If I showed up at a peering switch and wanted to use a 1000-byte MTU, I would probably have some problems. The point I am making is that 1500 is a relic value that hamstrings Internet performance, and there is no good reason not to use a 9000-byte MTU at peering points (by all participants), since it (A) introduces no new problems and (B) I can't find a vendor of modern gear at a peering point that doesn't support it, though there may be some ancient gear in use at some peering points by some of the peers.
Have you ever tried showing up to the Internet with a 1000 byte MTU? The only time that works correctly today is when you're rewriting TCP MSS values as the packet goes through the constrained link, which may be fine for the GRE tunnel to a Linux box at your house, but clearly can't work on the real Internet.
I cannot think of a problem that changing from 1500 to 9000 as the standard at peering points introduces. It would also speed up the
This suggests a serious lack of imagination on your part. :)
loading of the BGP routes between routers at the peering points. If
It's a very very modest increase at best.
Joe Blow at home with a dialup connection with an MTU of 576 is talking to a server at Y! with an MTU of 10 billion, changing a peering path from 1500 to 9000 bytes somewhere in the path is not going to change that PMTU discovery one iota. It introduces no problem whatsoever. It changes nothing.
You know one very good reason for the people on a dialup connection to have low MTUs is serialization delay. As link speeds have gotten faster but MTUs have stayed the same, one tangible benefit is the lack of a need for fair queueing to keep big packets from significantly increasing the latency of small packets. Overall I agree with the theory of larger MTUs... Improved efficiency, being able to do page-flipping with your payload, not having to worry about screwing things up if you DO need to use a tunnel or turn on IPsec, it's all well and good... But from a practical standpoint there are still a lot of very serious issues that have not been addressed, and anyone who actually tries to do this at scale is in for a world of hurt. I for one would love to see the situation improved, but trying to gloss over it and pretend the problems don't exist just delays the day when it actually CAN be supported.
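Serialization delay is easy to put numbers on, which is why the dialup point stands; a minimal sketch (the link speeds are just illustrative examples):

def serialization_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8.0 / link_bps * 1000

for size in (576, 1500, 9000):
    print("%5d B: %8.1f ms @ 56k   %6.3f ms @ 1GE"
          % (size, serialization_ms(size, 56e3), serialization_ms(size, 1e9)))

#   576 B:     82.3 ms @ 56k    0.005 ms @ 1GE
#  1500 B:    214.3 ms @ 56k    0.012 ms @ 1GE
#  9000 B:   1285.7 ms @ 56k    0.072 ms @ 1GE
# A 9000-byte frame ahead of you on a 56k modem adds more than a second
# of latency; on gigabit links it is down in the noise, which is why
# bigger MTUs only became tolerable as links got faster.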
That is a list of 9000 byte clean gear. The very bottom is the stuff that doesn't support it. Of the stuff that doesn't support it, how much is connected directly to a peering point? THAT is the bottleneck
This argument is completely destroyed at the line that says 7206VXR w/PA-GE, you don't need to read any further.
I am talking about right now. One step at a time. Removing the bottleneck at the peering points is all I am talking about. That will not change PMTU issues elsewhere and those will stand just exactly as they are today without any change. In fact it will ensure that there are *fewer* PMTU discovery issues by being able to support a larger range of packets without having to fragment them.
The issues I listed are precisely why it doesn't work at peering points. I know this because I do a lot of peering, and I spend a lot of time dealing with getting people to peer at larger MTU values (correctly). If it was easier to do without breaking stuff, I'd be a lot more successful at it. :)
We *already* have SONET MTU of >4000 and this hasn't broken anything since the invention of SONET.
SONET MTU works because it's on by default, it's the same size everywhere, and every piece of gear supports it. It also doesn't accomplish anything, as almost no packets flowing through your SONET links are > 1500 bytes, and if you actually tried to show up to the Internet with a PC and a 4474 byte MTU you'd have a bad time. At any rate, I'm going to stop arguing this one, as I think we've beaten this dead horse enough for one day. Please read what I said carefully, I promise you this isn't as easy as you think it is. :) -- Richard A Steenbergen <ras@e-gerbil.net> http://www.e-gerbil.net/ras GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
On 11/6/2010 2:15 PM, George Bonser wrote:
I believe SCTP will become more widely used in the mobile device world. You can have several different streams, so you can still get an IM, for example, while you are streaming a movie. Eliminating "head of line" blocking on thin connections is really valuable.
I agree, but it is definitely a slow start. I personally like the fact that SCTP actually fixes host addressing issues and brings other benefits, which can be especially useful with v6 and mobile IP. Jack
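For what it's worth, on a stock Linux box the transport is already reachable through the ordinary sockets API; a minimal sketch (assumes Linux with SCTP support in the kernel, and the address/port are placeholders). Per-stream scheduling, which is what actually avoids head-of-line blocking, needs SCTP-specific socket options or a binding such as pysctp that the standard library doesn't wrap, so this only shows that the transport itself is there:

import socket

HOST, PORT = "127.0.0.1", 9999   # placeholder test endpoint

# One-to-one style SCTP association via the plain sockets API.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_SCTP)
s.connect((HOST, PORT))
s.sendall(b"hello over SCTP\n")
s.close()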
On Sat, 06 Nov 2010 11:45:01 -0500 Jack Bates <jbates@brightok.net> wrote:
On 11/5/2010 5:32 PM, Scott Weeks wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens...>;-)
SCTP is a great protocol. It has already been implemented in a number of stacks. Even with these benefits over that theory, it still hasn't become mainstream. People are against change. They don't want to leave v4. They don't want to leave tcp/udp. Technology advances, but people will only change when they have to.
Lack of SCTP uptake has nothing to do with people's avoidance of change - it's likely that Linux kernels deployed in the last 3 to 5 years already have it compiled in. IPv4 NAT is what has prevented it from being deployed, because NATs don't understand it and therefore can't NAT the addresses carried within it. This is one of the reasons why NAT is bad for the Internet - it has prevented deployment and/or utilisation of new transport protocols, such as SCTP or DCCP, that provide benefits over UDP or TCP.
Jack (lost brain cells actually reading that pdf)
Glad I haven't then, just the quotes from it hurt. Regards, Mark.
On Fri, Nov 5, 2010 at 6:32 PM, Scott Weeks <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
And so, "...the first principle of our proposed new network architecture: Layers are recursive."
Hi Scott, Anyone who has bridged an ethernet via a TCP based IPSec tunnel understands that layers are recursive.
John Day has been chasing this notion long enough to write three network stacks. If it works and isn't obviously inferior in its operational resource consumption, where's the proof-of-concept code? The last time this was discussed in the Routing Research Group, none of the proponents were able to adequately describe how to build a translation/forwarding table in the routers or whatever passes for routers in this design. Regards, Bill Herrin -- William D. Herrin ................ herrin@dirtside.com bill@herrin.us 3005 Crane Dr. ...................... Web: <http://bill.herrin.us/> Falls Church, VA 22042-3004
On 11/08/2010 07:57 GMT+08:00, William Herrin wrote:
On Fri, Nov 5, 2010 at 6:32 PM, Scott Weeks <surfer@mauigateway.com> wrote:
It's really quiet in here. So, for some Friday fun let me whap at the hornets nest and see what happens... >;-)
And so, "...the first principle of our proposed new network architecture: Layers are recursive."
Hi Scott,
Anyone who has bridged an ethernet via a TCP based IPSec tunnel understands that layers are recursive.
See also G.805 et seq.
On Sun, 7 Nov 2010, William Herrin wrote:
The last time this was discussed in the Routing Research Group, none of the proponents were able to adequately describe how to build a translation/forwarding table in the routers or whatever passes for routers in this design.
I note that he doesn't actually describe how to implement a large-scale addressing and routing architecture. It's all handwaving. And he seems to think that core routers can cope with per-flow state. The only bits he's at all concrete about are the transport protocol, which isn't really where the unsolved problems are. Tony. -- f.anthony.n.finch <dot@dotat.at> http://dotat.at/ HUMBER THAMES DOVER WIGHT PORTLAND: NORTH BACKING WEST OR NORTHWEST, 5 TO 7, DECREASING 4 OR 5, OCCASIONALLY 6 LATER IN HUMBER AND THAMES. MODERATE OR ROUGH. RAIN THEN FAIR. GOOD.
On Mon, Nov 08, 2010 at 03:56:17PM +0000, Tony Finch wrote:
I note that he doesn't actually describe how to implement a large-scale addressing and routing architecture. It's all handwaving.
I'm probably vying for nanog-kook status as well, but in high-dimensional spaces blocking is arbitrarily improbable. Think of higher-dimensional analogs of 3d-Bresenham (which is local-knowledge only), then blow away most of the links. It still works. You have to wire the network appropriately to loosely follow geography (which is of course currently a show-stopper) and label the nodes appropriately -- as a bonus, you can derive node ID by mutual iterative refinement, pretty much like relativistic time-of-flight mutual "triangulation". Another issue is purely photonic cut-through at very high data rates: there's not that much time to make a routing decision even if your packet is stuck in molasses as slow light or circulating in a fiber-loop FIFO. So not only are photonic gates expensive (and conversion to electronics and back is right out), but you also can't stack too many individual gate delays on top of each other. Networks are still much too smart; what you need is the barest decoration upon the raw physics of this universe.
And he seems to think that core routers can cope with per-flow state.
The only bits he's at all concrete about are the transport protocol, which isn't really where the unsolved problems are.
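As an aside, the "local knowledge only" property being gestured at above can be illustrated with a toy greedy stepper on a 3-D integer grid: every hop is chosen purely from the current node's coordinates and the destination's, with no global table. This is a simplified stand-in for the Bresenham analogy, not the actual algorithm and certainly not a routing proposal:

def greedy_path(src, dst):
    """Step from src to dst using only local knowledge: at each hop, move
    one unit along the axis with the largest remaining error."""
    cur, path = list(src), [tuple(src)]
    while tuple(cur) != tuple(dst):
        axis = max(range(3), key=lambda i: abs(dst[i] - cur[i]))
        cur[axis] += 1 if dst[axis] > cur[axis] else -1
        path.append(tuple(cur))
    return path

print(greedy_path((0, 0, 0), (3, 1, -2)))
# [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, -1), (3, 0, -1), (3, 1, -1), (3, 1, -2)]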
On 11/8/2010 9:56 AM, Tony Finch wrote:
I note that he doesn't actually describe how to implement a large-scale addressing and routing architecture. It's all handwaving.
That's an extremely hard problem to address. While there are many proposals, they usually do away with features which we utilize. I'm looking at a graph on the NOC screen right now which shows how grotesque natural load balancing can be between 3 AS interconnects. I have enough free overhead to allow this, but eventually I will have to start applying policies to balance better. This implies that I'll eventually have to advertise sub-aggregate v6 prefixes to balance as well (perhaps some /31 or /32 announcements overlaying the /27). The problem with most of the other methods is that they ignore policies and the desired route to reach a network, and instead rely on any way to get there. But let's be honest, the current problems tend to be memory problems, not performance problems. It annoys me that vendors did this last increment at such a small scale, guaranteeing we'll be buying new hardware again soon. Jack
participants (26)
- Brielle Bruns
- Chris Adams
- Dan White
- Doug Barton
- Eugen Leitl
- George Bonser
- Jack Bates
- Jeff Kell
- Mans Nilsson
- Mark Smith
- Marshall Eubanks
- Matthew Petach
- Michael Hallgren
- Mikael Abrahamsson
- Nathan Eisenberg
- Nick Hilliard
- Niels Bakker
- Richard A Steenbergen
- Scott Brim
- Scott Weeks
- Simon Horman
- sthaug@nethelp.no
- Tony Finch
- Valdis.Kletnieks@vt.edu
- Will Hargrave
- William Herrin