IPv6 day and tunnels
Well, IPv6 day isn't here yet, and my first casualty is the browser on the wife's machine; Firefox is now configured not to query AAAA. Now www.facebook.com loads again.

It looks like a tunnel MTU issue. I have not yet traced the definitive culprit: who is (not) sending ICMP Too Big, who is (not) receiving them, etc. www.arin.net works and has worked for years. www.facebook.com stopped June 1.

So IPv6 fixes the fragmentation and MTU issues of IPv4 by how, exactly? Or was the fix incorporating the breakage into the basic design?

In IPv4 I can make tunneling just work nearly all of the time. Sure, I have to munge a TCP MSS header, or clear a DF bit, or fragment the encapsulated packet when all else fails, but at least the tools are there. And on the host, there is /proc/sys/net.

In IPv6, it seems my options are a total throwback, with the best one being turning the sucker off. Nobody (on that station) needs it anyway.

Joe
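For reference, a rough Python sketch of the arithmetic behind the MSS munging mentioned above. The header sizes assume no IP options or extension headers, and the 6in4-tunnel-over-1500-byte-Ethernet case is just an illustrative example, not a description of this particular setup:

    # Rough sketch of the arithmetic behind "munge a tcp mss header": the MSS
    # advertised through a tunnel must leave room for the encapsulation plus
    # the inner IP and TCP headers. Header sizes assume no options/extensions.

    IPV4_HDR = 20
    IPV6_HDR = 40
    TCP_HDR = 20

    def clamped_mss(link_mtu, encaps_overhead, inner_ip_hdr):
        tunnel_mtu = link_mtu - encaps_overhead      # what fits inside the tunnel
        return tunnel_mtu - inner_ip_hdr - TCP_HDR   # TCP payload per segment

    # IPv6-in-IPv4 (6in4) tunnel over a 1500-byte Ethernet link:
    print(clamped_mss(1500, IPV4_HDR, IPV6_HDR))   # -> 1420
    # IPv4 traffic through the same tunnel:
    print(clamped_mss(1500, IPV4_HDR, IPV4_HDR))   # -> 1440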
On Sun, Jun 3, 2012 at 6:38 PM, Joe Maimon <jmaimon@ttec.com> wrote:
Well, IPv6 day isn't here yet, and my first casualty is the browser on the wife's machine; Firefox is now configured not to query AAAA.
Now www.facebook.com loads again.
Looks like a tunnel MTU issue. I have not yet traced the definitive culprit: who is (not) sending ICMP Too Big, who is (not) receiving them, etc.
www.arin.net works and has worked for years. www.facebook.com stopped June 1.
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Or was the fix incorporating the breakage into the basic design?
In IPv4 I can make tunneling just work nearly all of the time. Sure, I have to munge a TCP MSS header, or clear a DF bit, or fragment the encapsulated packet when all else fails, but at least the tools are there. And on the host, there is /proc/sys/net.
In IPv6, it seems my options are a total throwback, with the best one being turning the sucker off. Nobody (on that station) needs it anyway.
Joe
#1 don't tunnel unless you really need to.

#2 see #1

#3 use happy eyeballs, http://tools.ietf.org/html/rfc6555, Chrome has a good implementation, but this does not solve MTU issues.

#4 MSS hacks work at the TCP layer and still work regardless of IPv4 or IPv6.

#5 According to the IETF, MSS hacks do not exist and neither do MTU issues http://www.ietf.org/mail-archive/web/v6ops/current/msg12933.html

PSA time: Please use http://test-ipv6.com/ and pass this good advice around to the people you know.

Thanks,

Cameron
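On #3, a very rough Python sketch of the RFC 6555 idea: race IPv6 and IPv4 connection attempts, give IPv6 a short head start, and use whichever TCP connection completes first. The 300 ms head start and the thread-based racing are illustrative choices only; real implementations such as Chrome's handle caching, cancellation, and socket cleanup far more carefully:

    # Sketch of "Happy Eyeballs". Losing sockets are simply abandoned here;
    # a production implementation caches results and closes the losers.
    import socket
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def _connect(family, sockaddr, delay):
        time.sleep(delay)                  # head-start handicap for IPv4
        s = socket.socket(family, socket.SOCK_STREAM)
        s.settimeout(5.0)
        s.connect(sockaddr)
        return s

    def happy_eyeballs(host, port, v4_delay=0.3):
        candidates = []
        for family, _t, _p, _c, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            delay = 0.0 if family == socket.AF_INET6 else v4_delay
            candidates.append((family, sockaddr, delay))
        pool = ThreadPoolExecutor(max_workers=max(len(candidates), 1))
        futures = [pool.submit(_connect, *c) for c in candidates]
        last_error = None
        for fut in as_completed(futures):
            try:
                winner = fut.result()      # first successful connect wins
                pool.shutdown(wait=False)
                return winner
            except OSError as exc:
                last_error = exc
        pool.shutdown(wait=False)
        raise last_error or OSError("no usable addresses for " + host)

    # sock = happy_eyeballs("www.facebook.com", 80)

As noted above, this only hides connectivity failures at connection time; it does nothing for an MTU black hole that appears once large packets start flowing.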
Cameron Byrne wrote:
#1 don't tunnel unless you really need to.
Tunnels are IPv4 only now?
#2 see #1
#3 use happy eyeballs, http://tools.ietf.org/html/rfc6555, Chrome has a good implementation, but this does not solve MTU issues.
Because the initial connections are made just fine. PMTUD with probing should work, but does not seem to. Probably a (lack of) deployment issue.
#4 MSS hacks work at the TCP layer and still work regardless of IPv4 or IPv6.
But the equipment needs to support it. Again IPv6 lags.
#5 According to the IETF, MSS hacks do not exist and neither do MTU issues http://www.ietf.org/mail-archive/web/v6ops/current/msg12933.html
Thanks for that. I expect soon tunnels won't either.
PSA time: Please use http://test-ipv6.com/ and pass this good advice around to the people you know.
Excellent site/tool.
Thanks,
Cameron
Thank you. Joe
On 6/3/12, Cameron Byrne <cb.list6@gmail.com> wrote:
On Sun, Jun 3, 2012 at 6:38 PM, Joe Maimon <jmaimon@ttec.com> wrote: [snip] #5 According to the IETF, MSS hacks do not exist and neither do MTU issues http://www.ietf.org/mail-archive/web/v6ops/current/msg12933.html
They couldn't be more wrong. MTU issues still exist, and not just with tunneling; tunneling should be an expected scenario for IP. The IPv6 protocol still handles it very poorly: by still requiring external ICMP messages through the unreliable PMTUD scheme, matters are as bad as, if not worse than, with IPv4. It's just so unfortunate that IPv6 couldn't provide a good solution to one of IP's more troublesome deficiencies.
Cameron -- -JH
On 3 Jun 2012, at 20:40, Jimmy Hess <mysidia@gmail.com> wrote:
On 6/3/12, Cameron Byrne <cb.list6@gmail.com> wrote:
On Sun, Jun 3, 2012 at 6:38 PM, Joe Maimon <jmaimon@ttec.com> wrote: [snip] #5 According to the IETF, MSS hacks do not exist and neither do MTU issues http://www.ietf.org/mail-archive/web/v6ops/current/msg12933.html
They couldn't be more wrong. MTU issues still exist, and not just with tunneling; tunneling should be an expected scenario for IP.
The IPv6 protocol still handles it very poorly: by still requiring external ICMP messages through the unreliable PMTUD scheme, matters are as bad as, if not worse than, with IPv4.
As ICMPv6 is an integral part of IPv6, how exactly is ICMP "external"? You do realize what the function of ICMP is, I hope? If one is so stupid as to just block ICMP, then one should also accept that one loses functionality.

If the people in the IETF had decided to inline the headers that are ICMPv6 into the IPv6 header, then there would surely have been people who would have blocked the equivalent of Packet Too Big in there too. As long as people can block stuff, they will block stuff that they should not have blocked; there is nothing the IETF can do about that. Stupidity exists behind the keyboard.

That said, PMTU discovery has worked awesomely in the 10+ years that I have been actively using IPv6. If it does not work for you, find the issue and resolve it. (tracepath is a great tool for this, btw)
It's just so unfortunate that IPv6 couldn't provide a good solution to one of IP's more troublesome deficiencies.
Did you ever bother to comment about your supposed issue in the IETF? Greets, Jeroen
On 6/3/12, Jeroen Massar <jeroen@unfix.org> wrote:
If one is so stupid as to just block ICMP then one should also accept that one loses functionality.
ICMP tends to get blocked by firewalls by default; there are legitimate reasons to block ICMP, especially with v6. Security device manufacturers tend to indicate all the "lost functionality" is optional functionality not required for a working device.
If the people in the IETF would have decided to inline the headers that are ICMPv6 into the IPv6 header then there for sure would have been people who would have blocked the equivalent of PacketTooBig in there too. As long as
Over-reliance on "Packet Too Big" is a source of the problem: the idea that too-large packets should be blindly generated under ordinary circumstances, carried many hops, and dropped with an error returned a potentially long distance, which the sender in each direction is expected to see and act upon, at the expense of high latency for both peers during initial connection establishment.

Routers don't always know when a packet is too big to reach their next hop, especially in the case of broadcast traffic, so they don't know to return a Packet Too Big error. Especially in the case of L2 tunneling such as PPPoE, there may be an L2 bridge on the network in between routers with a lower MRU than either of the routers' immediate links, e.g. because PPP, 802.1p/q, MPLS labels, or other overhead is affixed to Ethernet frames somewhere on the switched path between routers.

The problem is not that "tunneling is bad"; the problem is that the IP protocol has issues. The protocol should be designed so that there will not be issues with tunneling or with Ethernet links of different MRUs.

The real solution is for reverse path MTU (MRU) information to be discovered between L3 neighbors by L2 probing, with the discovered MRU exchanged using NDP, so routers know the lowest MRU on each directly connected interface; then the worst-case reduction in reverse path MTU would be included in the routing information passed via L3 routing protocols, both IGPs and EGPs, to the next hop.

That is, no router should be allowed to enter a route into its forwarding table until the worst-case reverse MTU to reach that network is discovered, with the exception that a device may be configured with a default route and some directly connected networks. The need for "Too Big" messages is then restricted to nodes connected to terminal networks. And there should be no such thing as packet fragmentation.

-- -JH
On 3 Jun 2012, at 23:20, Jimmy Hess <mysidia@gmail.com> wrote:
On 6/3/12, Jeroen Massar <jeroen@unfix.org> wrote:
If one is so stupid as to just block ICMP then one should also accept that one loses functionality. ICMP tends to get blocked by firewalls by default
Which firewall product does that?
; There are legitimate reasons to block ICMP, esp w V6.
The moment one decides to block ICMPv6 one is likely breaking features of IPv6, so choose wisely. There are several RFCs pointing out what one could block and what one must never block. Packet Too Big is a very well-known one that one should not block. If you decide to block it anyway then, well, it is your problem that your network breaks.
Security device manufacturers tend to indicate all the "lost functionality" is optional functionality not required for a working device.
I suggest that you vote with your money and choose a different vendor if they shove that down your throat. Upgrading braincells is another option though ;)
If the people in the IETF would have decided to inline the headers that are ICMPv6 into the IPv6 header then there for sure would have been people who would have blocked the equivalent of PacketTooBig in there too. As long as
Over reliance on "PacketTooBig" is a source of the problem; the idea that too large packets should be blindly generated under ordinary circumstances, carried many hops, and dropped with an error returned a potentially long distance that the sender in each direction is expected to see and act upon, at the expense of high latency for both peers, during initial connection establishment.
High latency? You do realize that it is only one roundtrip max that might happen and that there is no shorter way to inform your side of this situation?
Routers don't always know when a packet is too big to reach their next hop, especially in case of Broadcast traffic,
You do realize that IPv6 does not have the concept of broadcast, do you?! ;) There is only unicast, multicast and anycast (and anycast is just unicast, as it is a routing trick).
so they don't know to return a PacketTooBig error, especially in the case of L2 tunneling PPPoE for example, there may be a L2 bridge on the network in between routers with a lower MRU than either of the router's immediate links, eg because PPP, 802.1p,q + MPLS labels, or other overhead are affixed to Ethernet frames, somewhere on the switched path between routers.
If you have a broken L2 network there is nothing that an L3 protocol can do about it. Please configure it properly; stuff tends to work better that way.
The problem is not that "Tunneling is bad"; the problem is the IP protocol has issues. The protocol should be designed so that there will not be issues with tunnelling or different MRU Ethernet links.
There is no issue as long as you properly respond with PtB and process them when received. If your medium is <1280 then your medium has to solve the fragging of packets.
The real solution is for reverse path MTU (MRU) information to be discovered between L3 neighbors by L2 probing, and discovered MRU exchanged using NDP, so routers know the lowest MRU on each directly connected interface, then for the worst case reduction in reverse path MTU to be included in the routing information passed via L3 routing protocols both IGPs and EGPs to the next hop.
You do realize that NDP only works on the local link and not further?! ;) Also, carrying MTU and full routing info to end hosts is definitely not something a lot of operators would like to do, let alone see, in their networks. Similar to you not wanting ICMP in your network even though that is the agreed-upon standard.
That is, no router should be allowed to enter a route into its forwarding table, until the worst case reverse MTU is discovered, to reach that network, with the exception, that a device may be configured with a default route, and some directly connected networks.
If you want this in your network just configure it everywhere to 1280 and then process and answer PtBs on the edge. Your network, your problem that you will never use jumbo frames.
The need for "Too Big" messages is then restricted to nodes connected to terminal networks. And there should be no such thing as packet fragmentation.
The fun thing is though that this Internet thing is quite a bit larger than your imaginary network... Greets, Jeroen
On Jun 3, 2012, at 11:20 PM, Jimmy Hess wrote:
On 6/3/12, Jeroen Massar <jeroen@unfix.org> wrote:
If one is so stupid as to just block ICMP then one should also accept that one loses functionality.
ICMP tends to get blocked by firewalls by default; there are legitimate reasons to block ICMP, especially with v6. Security device manufacturers tend to indicate all the "lost functionality" is optional functionality not required for a working device.
If you feel the need to block ICMP (I'm not convinced this is an actual need), then you should do so very selectively in IPv6. Blocking Packet Too Big messages, especially, is definitely harmful in IPv6, and PMTU-D is _NOT_ optional functionality. Any firewall/security device manufacturer that says it is will not get any business from me (or anyone else who considers their requirements properly before purchasing).
If the people in the IETF would have decided to inline the headers that are ICMPv6 into the IPv6 header then there for sure would have been people who would have blocked the equivalent of PacketTooBig in there too. As long as
Over reliance on "PacketTooBig" is a source of the problem; the idea that too large packets should be blindly generated under ordinary circumstances, carried many hops, and dropped with an error returned a potentially long distance that the sender in each direction is expected to see and act upon, at the expense of high latency for both peers, during initial connection establishment.
Actually, this generally will NOT affect initial connection establishment and, due to slow start, usually adds a very small amount of latency about 3-5kB into the conversation.
Routers don't always know when a packet is too big to reach their next hop, especially in case of Broadcast traffic, so they don't know to return a PacketTooBig error, especially in the case of L2 tunneling PPPoE for example, there may be a L2 bridge on the network in between routers with a lower MRU than either of the router's immediate links, eg because PPP, 802.1p,q + MPLS labels, or other overhead are affixed to Ethernet frames, somewhere on the switched path between routers.
That is a misconfiguration of the routers. Any routers in such a circumstance need their interface configured for the lower MTU or things are going to break with or without ICMP Packet Too Big messages because even if you didn't have the DF bit, the router has no way to know to fragment the packet. An L2 device should not be fragmenting L3 packets.
The problem is not that "Tunneling is bad"; the problem is the IP protocol has issues. The protocol should be designed so that there will not be issues with tunnelling or different MRU Ethernet links.
And there are not issues so long as things are configured correctly. Misconfiguration will cause issues no matter how well the protocol is designed. The problem you are describing so far is not a problem with the protocol, it is a problem with misconfigured devices.
The real solution is for reverse path MTU (MRU) information to be discovered between L3 neighbors by L2 probing, and discovered MRU exchanged using NDP, so routers know the lowest MRU on each directly connected interface, then for the worst case reduction in reverse path MTU to be included in the routing information passed via L3 routing protocols both IGPs and EGPs to the next hop.
This could compensate for some amount of misconfiguration, but you're adding a lot of overhead and a whole bunch of layering violations in order to do it. I think it would be much easier to just fix the configuration errors.
That is, no router should be allowed to enter a route into its forwarding table, until the worst case reverse MTU is discovered, to reach that network, with the exception, that a device may be configured with a default route, and some directly connected networks.
I don't see how this would not cause more problems than you claim it will solve.
The need for "Too Big" messages is then restricted to nodes connected to terminal networks. And there should be no such thing as packet fragmentation.
There should be no such thing as packet fragmentation in the current protocol. What is needed is for people to simply configure things correctly and allow PTB messages to pass as designed. Owen
An L2 device should not be fragmenting L3 packets.
Layer 2 fragmentation used (20+ years ago) to be a common thing with bridged topologies like token-ring to Ethernet source-routing. Obviously, not so much anymore (at least I hope not), but it can and does happen.

I think part of the problem is that ISPs, CDNs, hosting companies, etc. have assumed IPv6 is just IPv4 with longer addresses and haven't spent the time learning the differences - like what was pointed out here, that ICMPv6 is a required protocol for IPv6 to work correctly. MTU issues are an annoyance with IPv4 but are a brokenness with IPv6. Knowledge will come, but it may take a bit of beating over the head for a while.
On Jun 4, 2012, at 1:01 AM, Owen DeLong <owen@delong.com> wrote:
Any firewall/security device manufacturer that says it is will not get any business from me (or anyone else who considers their requirements properly before purchasing).
Unfortunately many technology people seem to have the idea, "If I don't understand it, it's a hacker" when it comes to network traffic. And often they don't understand ICMP (or at least PMTU). So anything not understood gets blocked.

Then there is the Law of HTTP... The Law of HTTP is pretty simple: anything that isn't required for *ALL* HTTP connections on day one of protocol implementation will never be able to be used universally. This includes, sadly, PMTU. If reaching all possible endpoints is important to your application, you had better do it via HTTP and had better not require PMTU. It's also why protocols typically can't be extended today at any layer other than the "HTTP" layer.

As for the IETF trying to not have people reset DF... good luck with that one... besides, I think there is more broken ICMP handling than there are paths that would allow a segment to bounce around for 120 seconds...
Hi,

There was quite a bit of discussion on IPv6 PMTUD on the v6ops list within the past couple of weeks. Studies have shown that PTB messages can be dropped due to filtering even for ICMPv6. There was also concern for the one (or more) RTTs required for PMTUD to work, and for dealing with bogus PTB messages. The concerns were explicitly linked to IPv6 tunnels, so I drafted a proposed solution:

https://datatracker.ietf.org/doc/draft-generic-v6ops-tunmtu/

In this proposal the tunnel ingress performs the following treatment of packets of various sizes:

1) For IPv6 packets no larger than 1280, admit the packet into the tunnel w/o fragmentation. Assumption is that all IPv6 links have to support a 1280 MinMTU, so the packet will get through.

2) For IPv6 packets larger than 1500, admit the packet into the tunnel w/o fragmentation. Assumption is that the sender would only send a 1501+ packet if it has some way of policing the PMTU on its own, e.g. through the use of RFC4821.

3) For IPv6 packets between 1281-1500, break the packet into two (roughly) equal-sized pieces and admit each piece into the tunnel. (In other words, intentionally violate the IPv6 deprecation of router fragmentation.) Assumption is that the final destination can reassemble at least 1500, and that the 32-bit Identification value inserted by the tunnel provides sufficient assurance against reassembly mis-associations.

I presume no one here would object to clauses 1) and 2). Clause 3) is obviously a bit more controversial - but what harm would it cause from an operational standpoint?

Thanks - Fred
fred.l.templin@boeing.com
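A minimal Python sketch of the three-clause ingress treatment described above. The encapsulate() and send() helpers are hypothetical placeholders, and the byte-level split stands in for emitting two real fragments carrying the tunnel's 32-bit Identification; this is illustrative only, not the draft's normative algorithm:

    IPV6_MIN_MTU = 1280
    ETHERNET_MTU = 1500

    def ingress_admit(packet, encapsulate, send):
        size = len(packet)
        if size <= IPV6_MIN_MTU:
            # Clause 1: every IPv6 link must support a 1280 MinMTU, so the
            # encapsulated packet is expected to get through.
            send(encapsulate(packet))
        elif size > ETHERNET_MTU:
            # Clause 2: a sender emitting 1501+ packets is assumed to police
            # the path MTU itself (e.g. RFC 4821 probing); admit as-is.
            send(encapsulate(packet))
        else:
            # Clause 3: 1281-1500 bytes; split into two roughly equal pieces
            # and let the final destination reassemble (it must handle 1500).
            cut = size // 2
            for piece in (packet[:cut], packet[cut:]):
                send(encapsulate(piece))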
On Mon, Jun 04, 2012 at 07:39:58AM -0700, Templin, Fred L wrote:
https://datatracker.ietf.org/doc/draft-generic-v6ops-tunmtu/
3) For IPv6 packets between 1281-1500, break the packet into two (roughly) equal-sized pieces and admit each piece into the tunnel. (In other words, intentionally violate the IPv6 deprecation of router fragmentation.) Assumption is that the final destination can reassemble at least 1500, and that the 32-bit Identification value inserted by the tunnel provides sufficient assurance against reassembly mis-associations.
Fragmenting the outer packet, rather than the inner packet, gets around the problem of router fragmentation of packets. The outer packet is a new packet and there's nothing wrong with the originator of that packet fragmenting it. Of course, that forces reassembly on the other tunnel endpoint, rather than on the ultimate end system, which might be problematic with some endpoints and traffic volumes.

(With IPv4 in IPv4 tunnels, this is what I've always done: 1500 byte MTU on the tunnel, fragment the outer packet, let the other end of the tunnel do the reassembly. Not providing 1500 bytes end-to-end (at least within the network I control) for IPv4 has proven to consume lots of troubleshooting time; fragmenting the inner packet doesn't work unless you ignore the DF bit that is typically set by TCP endpoints who want to do PMTU discovery.)
I presume no one here would object to clauses 1) and 2). Clause 3) is obviously a bit more controversial - but, what harm would it cause from an operational standpoint?
-- Brett
Hi Brett,
-----Original Message----- From: Brett Frankenberger [mailto:rbf+nanog@panix.com] Sent: Monday, June 04, 2012 9:35 AM To: Templin, Fred L Cc: nanog@nanog.org Subject: Re: IPv6 day and tunnels
On Mon, Jun 04, 2012 at 07:39:58AM -0700, Templin, Fred L wrote:
https://datatracker.ietf.org/doc/draft-generic-v6ops-tunmtu/
3) For IPv6 packets between 1281-1500, break the packet into two (roughly) equal-sized pieces and admit each piece into the tunnel. (In other words, intentionally violate the IPv6 deprecation of router fragmentation.) Assumption is that the final destination can reassemble at least 1500, and that the 32-bit Identification value inserted by the tunnel provides sufficient assurance against reassembly mis-associations.
Fragmenting the outer packet, rather than the inner packet, gets around the problem of router fragmentation of packets. The outer packet is a new packet and there's nothing wrong with the originator of that packet fragmenting it.
Of course, that forces reassembly on the other tunnel endpoint, rather than on the ultimate end system, which might be problematic with some endpoints and traffic volumes.
There are a number of issues with fragmenting the outer packet. First, as you say, fragmenting the outer packet requires the tunnel egress to perform reassembly. This may be difficult for tunnel egresses that are configured on core routers. Also, when IPv4 is used as the outer encapsulation layer, the 16-bit ID field can result in reassembly errors at high data rates [RFC4963]. Additionally, encapsulating a 1500 inner packet in an outer IP header results in a 1500+ outer packet - and the ingress has no way of knowing whether the egress is capable of reassembling larger than 1500.
(With IPv4 in IPv4 tunnels, this is what I've always done. 1500 byte MTU on the tunnel, fragment the outer packet, let the other end of the tunnel do the reassembly. Not providing 1500 byte end-to-end (at least with in the network I control) for IPv4 has proven to consume lots of troubleshooting time; fragmenting the inner packet doesn't work unless you ignore the DF bit that is typically set by TCP endpoints who want to do PMTU discovery.)
Ignoring the (explicit) DF bit for IPv4 and ignoring the (implicit) DF bit for IPv6 is what I am suggesting. Thanks - Fred fred.l.templin@boeing.com
I presume no one here would object to clauses 1) and 2). Clause 3) is obviously a bit more controversial - but, what harm would it cause from an operational standpoint?
-- Brett
Templin, Fred L wrote:
Also, when IPv4 is used as the outer encapsulation layer, the 16-bit ID field can result in reassembly errors at high data rates [RFC4963].
As your proposal, too, gives up on having unique IDs, does that matter? Note that, with your draft, a route change between two tunnels with the same C may cause block corruption.
Additionally, encapsulating a 1500 inner packet in an outer IP header results in a 1500+ outer packet - and the ingress has no way of knowing whether the egress is capable of reassembling larger than 1500.
Operators are responsible for having tunnel end points with sufficient capabilities.
(With IPv4 in IPv4 tunnels, this is what I've always done. 1500 byte MTU on the tunnel, fragment the outer packet, let the other end of the tunnel do the reassembly. Not providing 1500 byte end-to-end (at least with in the network I control) for IPv4 has proven to consume lots of troubleshooting time; fragmenting the inner packet doesn't work unless you ignore the DF bit that is typically set by TCP endpoints who want to do PMTU discovery.)
Ignoring the (explicit) DF bit for IPv4 and ignoring the (implicit) DF bit for IPv6 is what I am suggesting.
Thanks - Fred fred.l.templin@boeing.com
I presume no one here would object to clauses 1) and 2). Clause 3) is obviously a bit more controversial - but, what harm would it cause from an operational standpoint?
-- Brett
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Monday, June 04, 2012 12:06 PM To: nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
Also, when IPv4 is used as the outer encapsulation layer, the 16-bit ID field can result in reassembly errors at high data rates [RFC4963].
As your proposal, too, gives up on having unique IDs, does that matter?
This is taken care of by rate limiting at the tunnel ingress. For IPv4-in-(foo) tunnels, the rate limit is 11Mbps, which may be a bit limiting for some applications. For IPv6-in-(foo) tunnels, the rate limit is 733Gbps, which should be acceptable for most applications.
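Those two figures can be reproduced with a back-of-the-envelope calculation, assuming the limit comes from not wrapping the fragment Identification space within a reassembly timeout, with 1280-byte fragmented packets and a 60-second timeout (both of which are assumptions made here to show the arithmetic, not values taken from the draft):

    def max_rate_bps(id_bits, packet_bytes=1280, reassembly_timeout_s=60):
        # IDs available, times bits per fragmented packet, spent per timeout.
        return (2 ** id_bits) * packet_bytes * 8 / reassembly_timeout_s

    print(max_rate_bps(16) / 1e6)   # ~11.2  -> Mbps for a 16-bit IPv4 ID
    print(max_rate_bps(32) / 1e9)   # ~733   -> Gbps for a 32-bit IPv6 ID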
Note that, with your draft, a route change between two tunnels with same C may cause block corruption.
There are several built-in mitigations for this. First, the tunnel ingress does not assign Identification values sequentially but rather "skips around" to avoid synchronizing with some other node that is sending fragments to the same (src,dst) pair. Secondly, the ingress chooses random fragment sizes for the A and B portions of the packet so that the A portion of packet 1 does not match up properly with the B portion of packet 2 and hence will be dropped. Finally, even if the A portion of packet 1 somehow matches up with the B portion of packet 2 the Internet checksum provides an additional line of defense.
Additionally, encapsulating a 1500 inner packet in an outer IP header results in a 1500+ outer packet - and the ingress has no way of knowing whether the egress is capable of reassembling larger than 1500.
Operators are responsible to have tunnel end points with sufficient capabilities.
It is recommended that IPv4 nodes be able to reassemble as much as their connected interface MTUs. In the vast majority of cases that means that the nodes should be able to reassemble 1500. But, there is no assurance of anything more! Thanks - Fred fred.l.templin@boeing.com
(With IPv4 in IPv4 tunnels, this is what I've always done. 1500 byte MTU on the tunnel, fragment the outer packet, let the other end of the tunnel do the reassembly. Not providing 1500 byte end-to-end (at least with in the network I control) for IPv4 has proven to consume lots of troubleshooting time; fragmenting the inner packet doesn't work unless you ignore the DF bit that is typically set by TCP endpoints who want to do PMTU discovery.)
Ignoring the (explicit) DF bit for IPv4 and ignoring the (implicit) DF bit for IPv6 is what I am suggesting.
Thanks - Fred fred.l.templin@boeing.com
I presume no one here would object to clauses 1) and 2). Clause 3) is obviously a bit more controversial - but, what harm would it cause from an operational standpoint?
-- Brett
Templin, Fred L wrote:
As your proposal, too, gives up on having unique IDs, does that matter?
This is taken care of by rate limiting at the tunnel
No, I'm talking about: Note that a possible conflict exists when IP fragmentation has already been performed by a source host before the fragments arrive at the tunnel ingress.
Note that, with your draft, a route change between two tunnels with same C may cause block corruption.
There are several built-in mitigations for this. First, the tunnel ingress does not assign Identification values sequentially but rather "skips around" to avoid synchronizing with some other node that is sending fragments to the same
I'm talking about two tunnels with the same "skip" value.
Secondly, the ingress chooses random fragment sizes for the A and B portions of the packet so that the A portion of packet 1 does not match up properly with the B portion of packet 2 and hence will be dropped.
You can do so with outer fragmentation, too. Moreover, it does not have to be random but regular, which effectively extends the ID length.
Finally, even if the A portion of packet 1 somehow matches up with the B portion of packet 2 the Internet checksum provides an additional line of defense.
Thus, don't insist on having unique IDs so much.
It is recommended that IPv4 nodes be able to reassemble as much as their connected interface MTUs. In the vast majority of cases that means that the nodes should be able to reassemble 1500. But, there is no assurance of anything more!
I'm talking about not protocol recommendation but proper operation. Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Monday, June 04, 2012 1:08 PM To: Templin, Fred L; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
As your proposal, too, gives up on having unique IDs, does that matter?
This is taken care of by rate limiting at the tunnel
No, I'm talking about:
Note that a possible conflict exists when IP fragmentation has already been performed by a source host before the fragments arrive at the tunnel ingress.
Note that, with your draft, a route change between two tunnels with same C may cause block corruption.
There are several built-in mitigations for this. First, the tunnel ingress does not assign Identification values sequentially but rather "skips around" to avoid synchronizing with some other node that is sending fragments to the same
I'm talking about two tunnels with same "skip" value.
There are several factors to consider. First, each tunnel ingress chooses its initial Identification value (or values) randomly and independently of all other tunnel ingresses. Secondly, the packet arrival rates at the various tunnel ingresses are completely independent and in no way correlated. So, while an occasional reassembly collision is possible, the 32-bit Identification value would make it extremely rare. And the variability of packet arrivals between the tunnel endpoints would make it such that a string of consecutive collisions would never happen. So, I'm not sure that a randomly-chosen "skip" value is even necessary.
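As a rough illustration of "extremely rare", a birthday-style estimate under the simplifying assumption that Identification values are independent and uniformly random over the 32-bit space; "n" is an assumed count of fragmented packets outstanding toward the same (src,dst) pair within one reassembly timeout, not a figure from the draft:

    from math import comb

    def expected_collisions(n, id_bits=32):
        # Expected number of ID collisions among n concurrently outstanding
        # fragmented packets, if IDs are uniform random and independent.
        return comb(n, 2) / 2 ** id_bits

    for n in (100, 10_000, 1_000_000):
        print(n, expected_collisions(n))
    # ~1e-6 collisions at n=100, ~0.01 at n=10,000, ~116 at n=1,000,000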
Secondly, the ingress chooses random fragment sizes for the A and B portions of the packet so that the A portion of packet 1 does not match up properly with the B portion of packet 2 and hence will be dropped.
You can do so with outer fragment, too. Moreover, it does not have to be random but regular, which effectively extend ID length.
Outer fragmentation cooks the tunnel egresses at high data rates. End systems are expected and required to reassemble on their own behalf.
Finally, even if the A portion of packet 1 somehow matches up with the B portion of packet 2 the Internet checksum provides an additional line of defense.
Thus, don't insist on having unique IDs so much.
Overlapping fragments are disallowed for IPv6, but I think they are still allowed for IPv4. So, IPv4 still needs the unique IDs by virtue of rate limiting.
It is recommended that IPv4 nodes be able to reassemble as much as their connected interface MTUs. In the vast majority of cases that means that the nodes should be able to reassemble 1500. But, there is no assurance of anything more!
I'm talking about not protocol recommendation but proper operation.
I don't see any operational guidance recommending the tunnel ingress to configure an MRU of 1520 or larger. Thanks - Fred fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
I'm not sure that a randomly-chosen "skip" value is even necessary.
It is not necessary because, for ID uniqueness fundamentalists, a single event is bad enough, and for most operators, a slight possibility is acceptable.
Outer fragmentation cooks the tunnel egresses at high data rates.
Have egresses with proper performance. That's the proper operation.
End systems are expected and required to reassemble on their own behalf.
That is not a proper operation of tunnels.
Thus, don't insist on having unique IDs so much.
Overlapping fragments are disallowed for IPv6, but I think they are still allowed for IPv4. So, IPv4 still needs the unique IDs by virtue of rate limiting.
Even though there is no well defined value of MSL?
I'm talking about not protocol recommendation but proper operation.
I don't see any operational guidance recommending the tunnel ingress to configure an MRU of 1520 or larger.
I'm talking about not operation guidance but proper operation. Proper operators can, without any guidance, perform proper operation. Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Monday, June 04, 2012 4:40 PM To: Templin, Fred L; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
I'm not sure that a randomly-chosen "skip" value is even necessary.
It is not necessary, because, for ID uniqueness fundamentalists, single event is bad enough and for most operators, slight possibility is acceptable.
Outer fragmentation cooks the tunnel egresses at high data rates.
Have egresses with proper performance. That's the proper operation.
How many core routers would be happy to reassemble at line rates without a forklift upgrade and/or strong administrative tuning?
End systems are expected and required to reassemble on their own behalf.
That is not a proper operation of tunnels.
Why not?
Thus, don't insist on having unique IDs so much.
Overlapping fragments are disallowed for IPv6, but I think they are still allowed for IPv4. So, IPv4 still needs the unique IDs by virtue of rate limiting.
Even though there is no well defined value of MSL?
MSL is well defined. For TCP, it is defined in RFC793. For IPv4 reassembly, it is defined in RFC1122. For IPv6 reassembly, it is defined in RFC2460.
I'm talking about not protocol recommendation but proper operation.
I don't see any operational guidance recommending the tunnel ingress to configure an MRU of 1520 or larger.
I'm talking about not operation guidance but proper operation.
The tunnel ingress cannot count on administrative tuning on the egress - all it can count on is reassembly of 1500 or smaller and it can't count on good performance even at those levels.
Proper operators can, without any guidance, perform proper operation.
No amount of proper operation can fix a platform that does not have adequate performance. And, there is no way for the tunnel ingress to tell what, if any, mitigations have been applied at the egress. Thanks - Fred fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
Have egresses with proper performance. That's the proper operation.
How many core routers would be happy to reassemble at line rates without a forklift upgrade and/or strong administrative tuning?
You don't have to do it with core routers.
End systems are expected and required to reassemble on their own behalf.
That is not a proper operation of tunnels.
Why not?
Lack of transparency.
Even though there is no well defined value of MSL?
MSL is well defined. For TCP, it is defined in RFC793. For IPv4 reassembly, it is defined in RFC1122. For IPv6 reassembly, it is defined in RFC2460.
As you can see, they are different values.
I'm talking about not operation guidance but proper operation.
The tunnel ingress cannot count on administrative tuning on the egress
I'm afraid you don't understand tunnel operation at all.
No amount of proper operation can fix a platform that does not have adequate performance.
Choosing a proper platform is a part of proper operation. Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 05, 2012 9:37 AM To: Templin, Fred L Cc: nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
Have egresses with proper performance. That's the proper operation.
How many core routers would be happy to reassemble at line rates without a forklift upgrade and/or strong administrative tuning?
You don't have to do it with core routers.
Tunnel endpoints can be located either nearer the edges or nearer the middle. Tunnel endpoints that are located nearer the edges might be able to do reassembly at nominal data rates, but there is no assurance of a maximum MRU greater than 1500 (which is too small to reassemble a 1500+20 packet). Tunnel endpoints that are located nearer the middle can be swamped trying to keep up with reassembly at high data rates - again, with no MRU assurances.
End systems are expected and required to reassemble on their own behalf.
That is not a proper operation of tunnels.
Why not?
Lack of transparency.
Huh?
Even though there is no well defined value of MSL?
MSL is well defined. For TCP, it is defined in RFC793. For IPv4 reassembly, it is defined in RFC1122. For IPv6 reassembly, it is defined in RFC2460.
As you can see, they are different values.
RFC793 sets MSL to 120 seconds. RFC1122 uses MSL as the upper bound for reassembly buffer timeouts. IPv6 doesn't reference MSL but sets reassembly buffer timeouts to 60 seconds. Personally, I can't imagine a reassembly that takes even as long as 30 seconds being of any use to the final destination even if it were to finally complete. But, if we set 60 sec as the reassembly timeout value for IPv*, I'm sure we'd be OK.
I'm talking about not operation guidance but proper operation.
The tunnel ingress cannot count on administrative tuning on the egress
I'm afraid you don't understand tunnel operation at all.
I don't? Are you sure?
No amount of proper operation can fix a platform that does not have adequate performance.
Choosing a proper platform is a part of proper operation.
This is getting tiresome. Fred fred.l.temlin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
You don't have to do it with core routers.
Tunnel endpoints can be located either nearer the edges or nearer the middle. Tunnel endpoints that are located nearer the edges might be able to do reassembly at nominal data rates, but there is no assurance of a maximum MRU greater than 1500 (which is too small to reassemble a 1500+20 packet). Tunnel endpoints that are located nearer the middle can be swamped trying to keep up with reassembly at high data rates - again, with no MRU assurances.
As operators know outer fragmentation is used to carry inner 1500B packets, the proper operation is to have equipment with a large enough MRU. As core routers may be good at fragmentation but not particularly good at reassembly, operators do not have to insist on using core routers.
I'm afraid you don't understand tunnel operation at all.
I don't? Are you sure?
See above. Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 05, 2012 11:36 AM To: Templin, Fred L; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
You don't have to do it with core routers.
Tunnel endpoints can be located either nearer the edges or nearer the middle. Tunnel endpoints that are located nearer the edges might be able to do reassembly at nominal data rates, but there is no assurance of a maximum MRU greater than 1500 (which is too small to reassemble a 1500+20 packet). Tunnel endpoints that are located nearer the middle can be swamped trying to keep up with reassembly at high data rates - again, with no MRU assurances.
As operators know outer fragmentation is used to carry inner 1500B packets, the proper operation is to have equipments with large enough MRU.
As core routers may be good at fragmentation but not particularly good at reassembly, operators do not have to insist on using core routers.
I am making a general statement that applies to all tunnels everywhere. For those, specs say that all that is required for MRU is 1500 and not 1500+20. *Unless there is some explicit pre-arrangement between the tunnel endpoints*, the ingress has no way of knowing whether the egress can do better than a 1500 outer packet (meaning a 1480 inner packet). That is certainly the case for point-to-multipoint "automatic" tunnels, which many of these IPv6 transition technologies are.

Fred
fred.l.templin@boeing.com
I'm afraid you don't understand tunnel operation at all.
I don't? Are you sure?
See above.
Masataka Ohta
Templin, Fred L wrote:
I am making a general statement that applies to all tunnels everywhere.
General statement? Even though you assume tunnel MTU 1500B and tunnel overhead 20B?
For those, specs say that all that is required for MRU is 1500 and not 1500+20.
That is a requirement for hosts with an Ethernet interface, which is by no means general and has nothing to do with tunnels.

For the general argument on tunnels, see, for example, RFC2473 "Generic Packet Tunneling in IPv6", where there is no requirement of 1500. Note that the RFC uses outer fragmentation:

   (b) if the original IPv6 packet is equal or smaller than the IPv6
       minimum link MTU, the tunnel entry-point node encapsulates the
       original packet, and subsequently fragments the resulting IPv6
       tunnel packet into IPv6 fragments that do not exceed the Path
       MTU to the tunnel exit-point.

Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 05, 2012 12:42 PM To: Templin, Fred L Cc: nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
I am making a general statement that applies to all tunnels everywhere.
General statement?
General statement for IPv6-in-IPv4 tunneling, yes. But inner fragmentation applies equally for *-in-* tunneling.
Even though you assume tunnel MTU 1500B
What I am after is a tunnel MTU of infinity. 1500 is the minimum packet size that MUST get through. 1501+ packets are admitted into the tunnel unconditionally in hopes that they MIGHT get through.
and tunnel overhead 20B?
The size "20" represents the size of the IPv4 encaps header. The size "40" would represent the size of an IPv6 encaps header. The size "foo" would represent the size of some other encapsulation overhead, e.g., for IPsec tunnels, IP/UDP tunnels, etc. So, let the size of the encaps header(s) be "X", substitute X for 20 everywhere and you will see that the approach is fully generally applicable.
For those, specs say that all that is required for MRU is 1500 and not 1500+20.
That is a requirement for hosts with Ethernet interface, which is, by no means, general and has nothing to do with tunnels.
RFC2460 says the MinMRU for IPv6 nodes is 1500. RFC1122 says that IPv4 hosts should reassemble as much as their connected interfaces (1500 for Ethernet). RFC1812 says the MinMRU for IPv4 routers is 576.
For the general argument on tunnels, see, for example, RFC2473 "Generic Packet Tunneling in IPv6", where there is no requirement of 1500.
Note that the RFC uses outer fragmentation:
(b) if the original IPv6 packet is equal or smaller than the IPv6 minimum link MTU, the tunnel entry-point node encapsulates the original packet, and subsequently fragments the resulting IPv6 tunnel packet into IPv6 fragments that do not exceed the Path MTU to the tunnel exit-point.
Wow - that is an interesting quote out of context. The text you quoted is describing the limiting condition to make sure that 1280 and smaller get through even if the path MTU is deficient. In that case alone, outer fragmentation is needed.

My document also allows for outer fragmentation on the inner fragments. But, like the RFC4213-derived IPv6 transition mechanisms, it treats outer fragmentation as an anomalous condition to be avoided if possible - not a steady-state operational approach. See Section 3.2 of RFC4213.

Thanks - Fred
fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
General statement for IPv6-in-IPv4 tunneling, yes. But inner fragmentation applies equally for *-in-* tunneling.
Even though you assume tunnel MTU 1500B
What I am after is a tunnel MTU of infinity. 1500 is the minimum packet size that MUST get through. 1501+ packets are admitted into the tunnel unconditionally in hopes that they MIGHT get through.
Infinity? You can't carry 65516B in an IPv4 packet.
My document also allows for outer fragmentation on the inner fragments. But, like the RFC4213-derived IPv6 transition mechanisms, it treats outer fragmentation as an anomalous condition to be avoided if possible - not a steady-state operational approach. See Section 3.2 of RFC4213.
Instead, see the last two lines in second last slide of: http://meetings.apnic.net/__data/assets/file/0018/38214/pathMTU.pdf It is a common condition. Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 05, 2012 2:44 PM To: Templin, Fred L; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
General statement for IPv6-in-IPv4 tunneling, yes. But inner fragmentation applies equally for *-in-* tunneling.
Even though you assume tunnel MTU 1500B
What I am after is a tunnel MTU of infinity. 1500 is the minimum packet size that MUST get through. 1501+ packets are admitted into the tunnel unconditionally in hopes that they MIGHT get through.
Infinity? You can't carry 65516B in an IPv4 packet.
I should qualify that by saying:

1) For tunnels over IPv4, let infinity equal (2^16 - 1) minus the length of the encapsulation headers

2) For tunnels over IPv6, let infinity equal (2^32 - 1) minus the length of the encapsulation headers
My document also allows for outer fragmentation on the inner fragments. But, like the RFC4213-derived IPv6 transition mechanisms, it treats outer fragmentation as an anomalous condition to be avoided if possible - not a steady-state operational approach. See Section 3.2 of RFC4213.
Instead, see the last two lines in second last slide of:
http://meetings.apnic.net/__data/assets/file/0018/38214/pathMTU.pdf
It is a common condition.
Are you interested in only supporting tinygrams? IMHO, go big or go home! Fred fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
Infinity? You can't carry 65516B in an IPv4 packet.
2) For tunnels over IPv6, let infinity equal (2^32 - 1)
You can't carry a 65516B IPv6 packet in an IPv4 packet.
Instead, see the last two lines in second last slide of:
http://meetings.apnic.net/__data/assets/file/0018/38214/pathMTU.pdf
It is a common condition.
Are you interested in only supporting tinygrams? IMHO, go big or go home!
Bigger packets make it more like circuit switching than packet switching. The way to lose. Faster is the way to go.

Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 05, 2012 3:41 PM To: Templin, Fred L Cc: nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
Infinity? You can't carry 65516B in an IPv4 packet.
2) For tunnels over IPv6, let infinity equal (2^32 - 1)
You can't carry a 65516B IPv6 packet in an IPv4 packet.
No, but you can carry a ((2^32 - 1) - X) IPv6 packet in an IPv6 packet. Just insert a jumbogram extension header.
Instead, see the last two lines in second last slide of:
http://meetings.apnic.net/__data/assets/file/0018/38214/pathMTU.pdf
It is a common condition.
Are you interested in only supporting tinygrams? IMHO, go big or go home!
Bigger packets make it more like circuit switching than packet switching. The way to lose.
Faster is the way to go.
Why only fast when you can have both big *and* fast? See Matt's pages on raising the Internet MTU: http://staff.psc.edu/mathis/MTU/ Time on the wire is what matters, and on a 100Gbps wire you can push 6MB in 480usec. That seems more like packet switching latency rather than circuit switching latency. Fred fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
You can't carry a 65516B IPv6 packet in an IPv4 packet.
No, but you can carry a ((2^32 - 1) - X) IPv6 packet in an IPv6 packet.
I'm afraid you wrote:
General statement for IPv6-in-IPv4 tunneling, yes. But
and
What I am after is a tunnel MTU of infinity.
in a single mail.
Bigger packets make it more like circuit switching than packet switching. The way to lose.
Faster is the way to go.
Why only fast when you can have both big *and* fast?
Because bigger packets make it more like circuit switching than packet switching, which is the way to lose.
See Matt's pages on raising the Internet MTU:
A page with too narrow a perspective.
Time on the wire is what matters,
In senses you have never imagined, yes.
and on a 100Gbps wire you can push 6MB in 480usec. That seems more like packet switching latency rather than circuit switching latency.
100Gbps is boringly slow. Are you interested in only supporting slowgrams? IMHO, go fast or go home!

In a 1Tbps optical packet-switched network, there is no practical buffer other than fiber delay lines. If the MTU is 1500B, the delay for a packet is 12ns, which requires 2.5m of fiber. For a practical packet loss probability, delay for tens of packets is necessary, which is not a problem. 9000B may still be acceptable. But 6MB means too lengthy a fiber. That's how time matters.

Worse, at a 10Mbps edge of a network with a 1Tbps backbone, 6MB packets mean 4.8 seconds of blocking of other packets, which is why it is like circuit switching. Or, at a 1Tbps link in a supercomputer, 48usec is too much blocking. That's another way time matters.

Are you interested in only supporting circuitgrams? IMHO, go packet or go ITU!

Masataka Ohta
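A quick check of the figures above, assuming signal propagation in fiber at roughly 2.0e8 m/s (about two-thirds of c); that propagation speed is an assumption, not a number taken from the post:

    PROP_SPEED_FIBER = 2.0e8   # m/s, approximate

    def wire_time_s(size_bytes, rate_bps):
        return size_bytes * 8 / rate_bps

    t = wire_time_s(1500, 1e12)
    print(t * 1e9, "ns;", t * PROP_SPEED_FIBER, "m of delay-line fiber")  # ~12 ns, ~2.4 m
    print(wire_time_s(6e6, 10e6), "s")         # ~4.8 s  (6MB at a 10 Mbps edge)
    print(wire_time_s(6e6, 1e12) * 1e6, "us")  # ~48 us  (6MB at 1 Tbps)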
Bigger packets make it more like circuit switching than packet switching. The way to lose.
Faster is the way to go.
Why only fast when you can have both big *and* fast?
Because bigger packets make it more like circuit switching than packet switching, which is the way to lose.
Er... No. It's attitudes like this that killed ATM. (The argument about whether the ATM cell payload should be 64 or 128 octets led to a mathematical compromise decision that was completely unworkable and vastly inferior to either choice. Unfortunately, neither the US telcos (128) nor the EU telcos (64) would give ground and accept the other standard.)

Larger packets for sustained flows of large amounts of data do not make it circuit switched; they make packet switching more efficient by reducing overhead, especially on higher bandwidth links.

Admittedly, if you go to too large an MTU for your bps, you can create HOL blocking issues which have the same loss characteristics as circuit switching. However, let's say that anything >10ms HOL blocking is our definition of bad. At 10Gbps, that's 100,000 bits or 12,500 octets. At 100Gbps, that's 125,000 octets. Given the combination of Moore's law and the deployment lifecycle, designs we do today in this regard can be expected to last ~12 years or more, so they should be prepared for at least 16x. At 1,600 Gbps, that puts our target maximum MTU up around 200M octets.
See Matt's pages on raising the Internet MTU:
A page with too narrow a perspective.
Time on the wire is what matters,
In senses you have never imagined, yes.
? Time on the wire is what matters. It is the primary distinction between packet and circuit switching.
and on a 100Gbps wire you can push 6MB in 480usec. That seems more like packet switching latency rather than circuit switching latency.
100Gbps is boringly slow.
His case only gets better when you go faster. Seriously, Mas, try to keep up.
Are you interested in only supporting slowgrams? IMHO, go fast or go home!
At 1Tbps optical packet switched network, there is no practical buffer other than fiber delay lines.
Which argues for an even larger MTU.
If the MTU is 1500B, the delay for a packet is 12ns, which requires 2.5m of fiber. For a practical packet loss probability, delay for tens of packets is necessary, which is not a problem.
9000B may still be acceptable.
But, 6MB means too lengthy fiber.
6MB at 1Tb/sec is 48 microseconds, which is 120 km of fiber. Modern single-mode spools 10 times that size can actually be built within reason.
That's how time matters.
Worse, at a 10Mbps edge of a network with 1Tbps backbone, 6MB packets means 4.8 seconds of blocking of other packets, which is why it is like circuit switching.
But you wouldn't carry a 6MB MTU out to a 10 Mbps edge. You'd drop the edge MTU to 1500. That's why Path MTU Discovery is useful.
Or, at a 1Tbps link in a super computer, 48usec is too much blocking.
In which case you would want to use a smaller MTU. However, I doubt that anyone is likely to run a tunnel in a situation where 48microseconds is too much latency and we are talking about tunnel MTUs here.
That's another way how time matters.
Are you interested in only supporting circuitgrams? IMHO, go packet or go ITU!
Sigh... I now realize that the other end of this conversation is a human with too narrow a view. Owen
Owen DeLong wrote:
Given the combination of Moore's law and the deployment lifecycle, designs we do today in this regard can be expected to last ~12 years or more, so they should be prepared for at least 16x. At 1,600 Gbps, that puts our target maximum MTU up around 200M octets.
If ICMP is the only interop method we have, we will still be using 1500, with 1280 the next most popular value. L2 uniformity requirements are a contributing factor to this, as well as to the various ways ICMP does not serve L3 well.

Consider NBMA networks or, commonly found today, NHRP tunnels. All endpoints must use the same MTU. Perhaps some devices are capable of much higher. Perhaps some tunnel egresses are capable of reassembly.

Static configuration of MTU is not ideal. It does not allow for dynamic changes in circumstances, such as tunnels/encapsulation rerouting between different MTU links. Between MPLS and Ethernet services, encapsulation and re-encapsulation is more and more prevalent, making MTU an issue again and again.

Blocking is not the only way ICMP messages may fail to arrive. Configured MTU values can very easily be incorrect and hard to detect - this will get worse as L2 networks are glued together through various providers and segments. Devices in the path may have rate limiting or may simply be unable to route back to the source.

Devices should be able to dynamically detect MTU on L3 links, and yes, maybe even for L2 adjacencies as well. L3 protocols should not have to rely solely on ICMP exception handling to work properly.

Joe
Owen DeLong wrote:
Because bigger packets make it more like circuit switching than packet switching, which is the way to lose.
Er... No. It's attitudes like this that killed ATM.
ATM committed suicide because its slow target speed (64Kbps voice) and inappropriate QoS theory required a small cell of 32B.
(argument about whether the ATM cell payload should be 64 or 128 octets
It was between 32B and 64B.
lead to a mathematical compromise
It is not mathematical. Instead, 48B is simply the arithmetic mean of 32B and 64B.
Larger packets for sustained flows of large amounts of data do not make it circuit switched,
As I already gave blocking examples with problematic blocking times, it is your problem that you lack understanding of why circuit switching is bad.
Admittedly, if you go to too large an MTU for your bps, you can create HOL blocking issues which have the same loss characteristics as circuit switching. However, let's say that anything>10ms HOL blocking is our definition of bad. At 10Gbps, that's 100,000 bits or 12,500 octets. At 100Gbps, that's 125,000 octets.
People, like you, who think 10ms blocking is fine care only about voice communications and are people for circuit switching. We, people working on the Internet, which began as a computer network, know 48usec can be significantly lengthy for many computations today.

When experimental Ethernet was 3Mbps, 1500B meant 4ms of blocking, which was tolerable because computers were slow to compute and most computation was done inside a single computer. But Moore's law changed everything.

Go home to the ITU.

Masataka Ohta
Here is Matt's full table and descriptive text:

"Note that there is no specific reason to require any particular MTU at any particular rate. As a general principle, we prefer declining packet times (and declining worst case jitter) as you go to higher rates.

              Actual          Vision          Alternate 1     Alternate 2
  Rate        MTU    Time     MTU    Time     MTU    Time     MTU    Time
  10 Mb/s     1.5kB  1200uS
  100 Mb/s    1.5kB  120uS    12kB   960uS    9kB    720uS    4.3kB  433uS
  1 Gb/s      1.5kB  12uS     96kB   768uS    64kB   512uS    9kB    72uS
  10 Gb/s     1.5kB  1.2uS    750kB  600uS    150kB  120uS    64kB   51.2uS
  100 Gb/s                    6MB    480uS    1.5MB  120uS    64kB   5.12uS
  1 Tb/s                      50MB   400uS    15MB   120uS    64kB   0.512uS

The above numbers are very speculative about what MTUs might make sense in the market. We keep updating them as we learn more about how MTU affects the balance between switching costs and end-system costs vs. end-to-end performance."

If you wish, you can also consider Alternate 3 for 9kB: 72us@1Gbps, 7.2us@10Gbps, .72us@100Gbps, .072us@1Tbps.

Fred
fred.l.templin@boeing.com
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 12, 2012 4:47 AM To: Templin, Fred L Cc: Owen DeLong; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
If you wish, you can also consider Alternate 3 for 9kB: 72us@1Gbps, 7.2us@10Gbps, .72us@100Gbps, .072us@1Tbps.
So?
Have you learned enough about Moore's law that, at 10Gbps era, 72us of delay is often significant?
I frankly haven't thought about it any further. You say 1280+ belongs in ITU, and I say 1280- belongs in ATM. Larger packets mean fewer interrupts and fewer packets in flight, which is good, Moore's law or no. Accommodation of MTU diversity is what matters.

Fred
fred.l.templin@boeing.com
Masataka Ohta
Templin, Fred L wrote:
Have you learned enough about Moore's law that, at 10Gbps era, 72us of delay is often significant?
I frankly haven't thought about it any further.
That's your problem.
You say 1280+ belongs in ITU, and I say 1280- belongs in ATM.
As I already said, 9KB is fine for me. The small cell size of ATM (32~48B, not 1280-) derives from its slow speed (64Kbps voice) and short delay requirement (0.1s) with fair queuing, and has no relevance to today's network.
Larger packets means fewer interrupts and fewer packets in flight, which is good Moore's law or no.
That is a basic misunderstanding of those who thought jumbograms were good. They (or you) thought supercomputers are vector computers, very slow to react to interrupts, and have no IO processors to take care of packet handling.

The reality with Moore's law, however, is that NIC cards can take care of even TCP, which makes jumbograms totally unnecessary. Moreover, the huge number of scalar processors in modern supercomputers means communication granularity is (depending on the computation algorithm) often tiny, which means networks in supercomputers must be able to handle small packets efficiently.

Larger packets mean, in addition to longer HOL blocking, more delay to pack more data into the packets, even though processors often want to receive data with less delay.

Thus, as with other features of IPv6, jumbograms are not useful but harmful.

Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 12, 2012 2:12 PM To: Templin, Fred L Cc: Owen DeLong; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
As I already said, 9KB is fine for me.
Then you will agree that accommodation of MTU diversity is a MUST (my point).
Not necessarily, as IPv4 can take care of itself and IPv6 is hopeless.
IPv4 can take care of it how - with broken PMTUD or with broken fragmentation/reassembly?

And you won't get any argument from me that IPv6 has been stuck for years for good reasons - but MTU failures can soon be taken off the list.

Fred
fred.l.templin@boeing.com
Templin, Fred L wrote:
Not necessarily, as IPv4 can take care of itself and IPv6 is hopeless.
IPv4 can take care of it how - with broken PMTUD or
As you know, RFC1191 style PMTUD is broken both for IPv4 and IPv6.
with broken fragmentation/reassembly?
Fragmentation is fine, especially with RFC4821-style PMTUD - even though RFC4821 tries to make people believe fragmentation is broken - because accidental ID matches are negligibly rare even with IPv4.
And, you won't get any argument from me that IPv6 has been stuck for years for good reasons - but MTU failures can soon be taken off the list.
Now, it's time for you to return to v6ops to defend your draft from Joe Touch.

Note that there is no point in IPv6 forbidding fragmentation by intermediate routers.

Masataka Ohta
-----Original Message----- From: Masataka Ohta [mailto:mohta@necom830.hpcl.titech.ac.jp] Sent: Tuesday, June 19, 2012 6:10 AM To: Templin, Fred L Cc: Owen DeLong; nanog@nanog.org Subject: Re: IPv6 day and tunnels
Templin, Fred L wrote:
Not necessarily, as IPv4 can take care of itself and IPv6 is hopeless.
IPv4 can take care of it how - with broken PMTUD or
As you know, RFC1191 style PMTUD is broken both for IPv4 and IPv6.
Unfortunately, there is evidence that this is the case.
with broken fragmentation/reassembly?
Fragmentation is fine, especially with RFC4821 style PMTUD, even though RFC4821 tries to make people believe it is broken, because accidental ID match is negligibly rare even with IPv4.
The 16-bit IP ID, plus the 120sec MSL, limits the rate for fragmentable packets to 6.4Mbps for a 1500 MTU. Exceeding this rate leads to the possibility of fragment misassociations (RFC4963). This would not be a problem if there were some stronger integrity check than just the Internet checksum, but with the current system we don't have that.
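For reference, the quoted rate limit falls straight out of the size of the ID space and the MSL; here is a quick check of the arithmetic in Python (the exact result depends on whether you count the full 1500-byte packet or only its payload, so it lands near, not exactly on, 6.4Mbps):

# Rough check of the rate limit implied by the 16-bit IPv4 ID field:
# a flow must not wrap the ID space within one maximum segment lifetime,
# or fragments from different packets can be mis-associated.

ID_SPACE = 2 ** 16   # distinct Identification values
MSL = 120            # seconds, as quoted above
MTU = 1500           # octets per fragmentable packet

max_rate_bps = ID_SPACE * MTU * 8 / MSL
print(f"~{max_rate_bps / 1e6:.2f} Mbps")   # ~6.55 Mbps, i.e. the ~6.4 Mbps ballpark above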
And, you won't get any argument from me that IPv6 has been stuck for years for good reasons - but MTU failures can soon be taken off the list.
Now, it's time for you to return to v6ops to defend your draft from Joe Touch.
Note that there is no point in IPv6 forbidding fragmentation by intermediate routers.
I wasn't there when the decision was made, but based on my findings I don't disagree. Fred fred.l.templin@boeing.com
Masataka Ohta
Owen DeLong wrote:
There should be no such thing as packet fragmentation in the current protocol. What is needed is for people to simply configure things correctly and allow PTB messages to pass as designed.
Owen
You are absolutely correct. Are you talking about IPv4 or IPv6? Joe
On Sun, Jun 3, 2012 at 11:20 PM, Jimmy Hess <mysidia@gmail.com> wrote:
On 6/3/12, Jeroen Massar <jeroen@unfix.org> wrote:
If one is so stupid to just block ICMP then one should also accept that one loses functionality.
ICMP tends to get blocked by firewalls by default; there are legitimate reasons to block ICMP, esp w V6. Security device manufacturers tend to indicate all the "lost functionality" is optional functionality not required for a working device.
In case security policy folks need a reference on what ICMPv6 functionality is required for IPv6 to work correctly, please reference http://www.ietf.org/rfc/rfc4890.txt CB
Joe Maimon wrote:
Looks like a tunnel mtu issue. I have not as of yet traced the definitive culprit, who is (not) sending ICMP too big, who is (not) receiving them, etc.
The culprit is the v6 tunnel, which wanders into v4 ipsec/gre tunnels, which means the best fix is ipv6 mtu 1280 on the tunnels, and possibly on the hosts. PMTUD works fine, just comes up with the wrong answer. 1280, the new 1500.
Joe
On Sun, Jun 03, 2012 at 10:05:40PM -0400, Joe Maimon wrote:
Joe Maimon wrote:
Looks like a tunnel mtu issue. I have not as of yet traced the definitive culprit, who is (not) sending ICMP too big, who is (not) receiving them, etc.
The culprit is the v6 tunnel, which wanders into v4 ipsec/gre tunnels, which means the best fix is ipv6 mtu 1280 on the tunnels, and possibly on the hosts. PMTUD works fine, just comes up with the wrong answer.
1280, the new 1500.
Joe
actually, to be safe, 1220. /bill
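For anyone wondering where numbers like 1280 and 1220 come from in a setup like Joe's, the arithmetic is just the physical MTU minus each layer of encapsulation. A rough Python sketch follows; the per-layer overheads are assumptions for the example (the ESP figure in particular varies by mode and cipher), not measured values:

# Illustrative tunnel-overhead arithmetic for a setup like the one above:
# IPv6 carried in a 6in4 tunnel that itself rides over IPv4 GRE + IPsec.

LINK_MTU = 1500

overheads = [
    ("IPv4 header of the 6in4 tunnel", 20),
    ("GRE header plus its IPv4 delivery header", 24),
    ("IPsec ESP header, IV, padding and ICV (rough figure)", 57),
]

mtu = LINK_MTU
for name, size in overheads:
    mtu -= size
    print(f"- {name}: -{size} bytes, {mtu} bytes remain")

print(f"Largest IPv6 packet that avoids fragmentation here: {mtu} bytes")
# Since you rarely control every layer on the path, clamping the tunnel to
# the protocol minimum ("ipv6 mtu 1280", or 1220 to be extra safe as bill
# suggests) sidesteps the guesswork entirely.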
On Jun 3, 2012, at 7:38 PM, Joe Maimon <jmaimon@ttec.com> wrote:
www.arin.net works and worked for years. www.facebook.com stopped June 1.
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
It doesn't fix the fragmentation issues. It assumes working PMTU. For what it's worth, I also use a tunnel without issue to reach www.facebook.com via IPv6, with an MTU of 1476 (since it's running over a 1492 byte IPv4 PPPoE tunnel...).
In message <4FCC11B2.2090405@ttec.com>, Joe Maimon writes:
Well, IPv6 day isnt here yet, and my first casualty is the browser on the wife's machine, firefox now configured to not query AAAA.
Now www.facebook.com loads again.
Looks like a tunnel mtu issue. I have not as of yet traced the definitive culprit, who is (not) sending ICMP too big, who is (not) receiving them, etc.
www.arin.net works and worked for years. www.facebook.com stopped June 1.
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Or was the fix incorporating the breakage into the basic design?
In IPv4 I can make tunneling just work nearly all of the time. So I have to munge a tcp mss header, or clear a df-bit, or fragment the encapsulated packet when all else fails, but at least the tools are there. And on the host, /proc/sys/net
In IPv6, it seems my options are a total throwback, with the best one turning the sucker off. Nobody (on that station) needs it anyways.
Joe
If facebook isn't working for you over a tunnel, and other sites are, complain to the site. If they don't let through ICMPv6 PTB then the site needs to add "route change -inet6 change -mtu 1280" or equivalent to every box. This isn't rocket science. If you choose to break PMTU discovery then you can take the necessary steps to avoid requiring that PMTU Discovery works.

This is practical for IPv6. For IPv4 it is impractical to do the same. The IPv6 Advanced Socket API even has controls so that you can make the PMTUD choice on a per socket basis.

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
Joe Maimon wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Or was the fix incorporating the breakage into the basic design?
Yes.

Because IPv6 requires ICMP packet too big to be generated against multicast, it is designed to cause ICMP implosions, which means ISPs must filter ICMP packet too big at least against multicast packets and, as distinguishing them from unicast ones is not very easy, often against unicast ones too.

For further details, see my presentation at APNIC32:

http://meetings.apnic.net/32/program/apops
"How Path MTU Discovery Doesn't Work"

Masataka Ohta
In IPv4 I can make tunneling just work nearly all of the time. So I have to munge a tcp mss header, or clear a df-bit, or fragment the encapsulated packet when all else fails, but at least the tools are there. And on the host, /proc/sys/net
FYI, the IETF is trying to explicitly prohibit clearing the DF bit with draft-ietf-intarea-ipv4-id-update-05.txt:

>> IPv4 datagram transit devices MUST NOT clear the DF bit.

which is now under last call.

Masataka Ohta
On 3 Jun 2012, at 22:41, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Joe Maimon wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Got a better solution? ;)
Or was the fix incorporating the breakage into the basic design?
Yes.
Because IPv6 requires ICMP packet too big generated against multicast, it is designed to cause ICMP implosions, which means ISPs must filter ICMP packet too big at least against multicast packets and, as distinguishing them from unicast ones is not very easy, often against unicast ones.
I do not see the problem that you are seeing. To address the two issues in your slides:

- for multicast, just set your max packet size to 1280; no need for PMTUD and thus no need for this "implosion" you think might happen. The sender controls the packet size anyway, and one does not want to frag packets for multicast, thus 1280 solves all of it.

- when doing IPv6 inside IPv6, the outer path has to be 1280+tunnel overhead; if it is not, then you need to use a tunneling protocol that knows how to frag and reassemble, as it is acting as a medium with an MTU less than the minimum of 1280.

Greets,
Jeroen
Jeroen Massar wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Got a better solution? ;)
IPv4 without PMTUD, of course.
Because IPv6 requires ICMP packet too big generated against multicast, it is designed to cause ICMP implosions, which means ISPs must filter ICMP packet too big at least against multicast packets and, as distinguishing them from unicast ones is not very easy, often against unicast ones.
I do not see the problem that you are seeing, to adress the two issues in your slides: - for multicast just set your max packetsize to 1280, no need for pmtu and thus this "implosion"
It is the sender of a multicast packet, not you as some ISP, who sets the max packet size to 1280B or 1500B. You can do nothing against a sender who consciously (not necessarily maliciously) sets it to 1500B.

The only protection is not to generate packet too big and to block packet too big at least against multicast packets. If you don't want to inspect packets so deeply (beyond the first 64B, for example), packet too big against unicast packets gets blocked as well.

That you don't enable multicast in your network does not mean you have nothing to do with packet too big against multicast, because you may be on the path of the returning ICMPs. That is, you should still block them.
You think might happen. The sender controls the packetsize anyway and one does not want to frag packets for multicast thus 1280 solves all of it.
That's what I said in the IETF IPv6 WG more than 10 years ago, but all the other WG members insisted on having multicast PMTUD, ignoring the so-obvious problem of packet implosions. Thus, RFC2463 requires:

   Sending a Packet Too Big Message makes an exception to one of the
   rules of when to send an ICMPv6 error message, in that unlike other
   messages, it is sent *in response to a packet received with an IPv6
   multicast destination address*, or a link-layer multicast or
   link-layer broadcast address.

They have not yet obsoleted the feature. So, you should assume some, if not all, of them still insist on using multicast PMTUD to make multicast packet sizes larger than 1280B. In addition, there will be malicious guys.
- when doing IPv6 inside IPv6 the outer path has to be 1280+tunneloverhead, if it is not then
Because PMTUD is not expected to work, you must assume the MTU of the outer path is 1280B, as specified in RFC2460: "simply restrict itself to sending packets no larger than 1280 octets".
you need to use a tunneling protocol that knows how to frag and reassemble as is acting as a medium with an mtu less than the minimum of 1280
That's my point in my second-to-last slide.

Considering that many inner packets will be just 1280B long, many packets will be fragmented as a result of the stupid attempt to make multicast PMTUD work, unless you violate RFC2460 and blindly send packets a little larger than 1280B.

Masataka Ohta
On 4 Jun 2012, at 06:36, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Jeroen Massar wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Got a better solution? ;)
IPv4 without PMTUD, of course.
We are (afaik) discussing IPv6 in this thread, I assume you typo'd here ;)
Because IPv6 requires ICMP packet too big generated against multicast, it is designed to cause ICMP implosions, which means ISPs must filter ICMP packet too big at least against multicast packets and, as distinguishing them from unicast ones is not very easy, often against unicast ones.
I do not see the problem that you are seeing, to adress the two issues in your slides: - for multicast just set your max packetsize to 1280, no need for pmtu and thus this "implosion"
It is a sender of a multicast packet, not you as some ISP, who set max packet size to 1280B or 1500B.
If a customer already miraculously has the rare capability of sending multicast packets, in the rare case that a network is multicast enabled, then they will also have been told to use a max packet size of 1280 to avoid any issues when it is expected that some endpoint might have that minimum MTU.

I really cannot see the problem with this, as multicast networks tend to be rare and very much closed. Heck, for that matter, the m6bone has been pretty much in a dead state for quite a while already.... :(
You can do nothing against a sender who consciously (not necessarily maliciously) set it to 1500B.
Of course you can: the first hop into your network can generate a single PtB and, presto, the issue becomes the sender's problem. As the sender's intention is likely to reach folks, they will adhere to that advice instead of just sending packets which get rejected at the first hop.
The only protection is not to generate packet too big and to block packet too big at least against multicast packets.
No need, as above, reject and send PtB and all is fine.
If you don't want to inspect packets so deeply (beyond first 64B, for example), packet too big against unicast packets are also blocked.
Routing (forwarding packets) is in no way "inspection".
That you don't enable multicast in your network does not mean you have nothing to do with packet too big against multicast, because you may be on a path of returning ICMPs. That is, you should still block them.
Blocking returning ICMPv6 PtB where you are looking at the original packet echoed inside the data of the ICMPv6 packet would indeed require one to look quite deep, but if one is so determined to firewall them, well, then you would have to indeed. I do not see a reason to do so though.

Please note that the src/dst of the PtB packet itself is unicast even if the PtB is for a multicast packet.

I guess one should not be so scared of ICMP; there are easier ways to overload a network. Proper BCP38 goes a long way.
You think might happen. The sender controls the packetsize anyway and one does not want to frag packets for multicast thus 1280 solves all of it.
That's what I said in IETF IPv6 WG more than 10 years ago, but all the other WG members insisted on having multicast PMTUD, ignoring the so obvious problem of packet implosions.
They did not ignore you; they realized that not everybody has the same requirements. With the current spec you can go your way and break PMTUD, requiring manual 1280 settings, while other networks can use PMTUD in their networks. Everybody wins.
So, you should assume some, if not all, of them still insist on using multicast PMTUD to make multicast packet size larger than 1280B.
As networks become more and more jumbo frame enabled, what exactly is the problem with this?
In addition, there should be malicious guys.
- when doing IPv6 inside IPv6 the outer path has to be 1280+tunneloverhead, if it is not then
Because PMTUD is not expected to work,
You assume it does not work, but as long as per the spec people do not filter it, it works.
you must assume MTU of outer path is 1280B, as is specified "simply restrict itself to sending packets no larger than 1280 octets" in RFC2460.
While for multicast enabled networks that might hit the minimum MTU this might be true-ish, it does not make it universally true.
you need to use a tunneling protocol that knows how to frag and reassemble as is acting as a medium with an mtu less than the minimum of 1280
That's my point in my second last slide.
Then you worded it wrongly. It is not IPv6's problem that you chose to layer it inside so many stacks that the underlying medium cannot transport packets bigger than 1280; that medium has to take care of it.
Considering that many inner packet will be just 1280B long, many packets will be fragmented, as a result of stupid attempt to make multicast PMTUD work, unless you violate RFC2460 to blindly send packets a little larger than 1280B.
Your statement only works when:

- you chose a medium unable to send packets with a minimum of 1280, which thus makes the medium IPv6 incapable and makes it the medium's issue to frag
- someone filters ICMP PtB even though one should not
- in the rare case with the above, someone actually uses interdomain multicast

I hope you see how much of a non-issue this thus is. Please fix your network instead, kthx.

Greets,
Jeroen
On Jun 4, 2012, at 10:07 AM, Jeroen Massar wrote:
On 4 Jun 2012, at 06:36, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Jeroen Massar wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Got a better solution? ;)
IPv4 without PMTUD, of course.
We are (afaik) discussing IPv6 in this thread, I assume you typo'd here ;)
He is comparing & contrasting the behavior of IPv4 vs IPv6.

If your PMTU is broken for v4 because people do wholesale blocks of ICMP, there is a chance they will have the same problem with wholesale blocks of ICMPv6 packets.

The interesting thing about IPv6 is it's "just close enough" to IPv4 in many ways that people don't realize all the technical details. People are still getting it wrong with IPv4 today; they will repeat their same mistakes in IPv6 as well.

I've observed that if you avoid providers that rely upon tunnels, you can sometimes observe significant performance improvements in IPv6 bitrates. Those that are tunneling are likely to take a software path at one end, whereas native (or native-like/6PE) tends to not see this behavior. Those doing native tend to have more experience debugging it as well, as they have already committed business resources to it.

- Jared
On 2012-06-04 07:31, Jared Mauch wrote:
On Jun 4, 2012, at 10:07 AM, Jeroen Massar wrote:
On 4 Jun 2012, at 06:36, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Jeroen Massar wrote:
So IPv6 fixes the fragmentation and MTU issues of IPv4 by how exactly?
Completely wrongly.
Got a better solution? ;)
IPv4 without PMTUD, of course.
We are (afaik) discussing IPv6 in this thread, I assume you typo'd here ;)
He is comparing & contrasting with the behavior of IPv4 v IPv6.
If your PMTU is broken for v4 because people do wholesale blocks of ICMP, there is a chance they will have the same problem with wholesale blocks of ICMPv6 packets.
Yep, people who act stupid will remain stupid...
The interesting thing about IPv6 is it's "just close enough" to IPv4 in many ways that people don't realize all the technical details. People are still getting it wrong with IPv4 today, they will repeat their same mistakes in IPv6 as well.
IMHO they should not need to know about the technical details. But if one is configuring firewalls, one should know what one is blocking and that things might break. If one does block PtB, you should realize that you are breaking connectivity in some cases and that that is your problem to resolve, not other people's.

There are various 'secure firewall' examples for people who are unable to think for themselves and figure out what kind of firewalling is appropriate for their environment.
I've observed that if you avoid providers that rely upon tunnels, you can sometimes observe significant performance improvements in IPv6 bitrates. Those that are tunneling are likely to take a software path at one end, whereas native (or native-like/6PE) tends to not see this behavior. Those doing native tend to have more experience debugging it as well, as they have already committed business resources to it.
Tunnels therefore should only exist at the edge, where native IPv6 cannot be made possible without significant investments in hardware and/or other resources. Of course every tunnel should at some point be replaced by native where possible; thus hopefully the folks planning expenses and hardware upgrades have finally realized that they cannot get around it any more and have put this "ipv6" feature on the list for the next round of upgrades.

Note that software-based tunnels can be extremely quick nowadays too, especially given the fact that hardware can be so abundant. During tests for sixxsd v4 I've been able to stuff 10GE through it with ease. The trick there is primarily that we do not need to do an "expensive" full IPv6 address lookup: as we know how the addresses are structured, instead of having to do a 128-bit lookup we can restrict it to a 12-bit lookup for those tunnels, which is just a direct jump table - much cheaper than generic silicon that needs to do it for 128 bits. Then again, that same trick would of course be so much faster in hardware specifically built to apply it.

The trick is also much faster than the software tunnels you would normally find in e.g. a Linux or BSD kernel, because those look up tunnels based on the IPv4 address, thus the full 32-bit address space, instead of using the knowledge that the 128-bit one can be reduced to the 12 bits that we use. The advantage of knowing one's field and being less generic ;)

Greets,
Jeroen
Jeroen Massar wrote:
Tunnels therefore should only exist at the edge, where native IPv6 cannot be made possible without significant investments in hardware and/or other resources. Of course every tunnel should at some point be replaced by native where possible; thus hopefully the folks planning expenses and hardware upgrades have finally realized that they cannot get around it any more and have put this "ipv6" feature on the list for the next round of upgrades.
IPv4 is pretty mature. Are there more or fewer tunnels on it? Why do you think a maturing IPv6 means fewer tunnels as opposed to more? Does IPv6 contain elegant solutions to all the issues for which one would resort to tunnels in IPv4? Does successful IPv6 deployment require obsoleting tunneling?

Fail. Today, most people can't even get IPv6 without tunnels.

And tunnels are far from the only cause of an MTU lower than what has become the only valid MTU of 1500, thanks in no small part to people who refuse to acknowledge operational reality and are quite satisfied with the state of things once they find a "them" to blame it on.

I just want to know if we can expect IPv6 to devolve into a 1280 standard MTU, and at what gigabit rates.

Joe
On 2012-06-04 14:26, Joe Maimon wrote:
Jeroen Massar wrote:
Tunnels therefore should only exist at the edge, where native IPv6 cannot be made possible without significant investments in hardware and/or other resources. Of course every tunnel should at some point be replaced by native where possible; thus hopefully the folks planning expenses and hardware upgrades have finally realized that they cannot get around it any more and have put this "ipv6" feature on the list for the next round of upgrades.
IPv4 is pretty mature. Are there more or less tunnels on it?
I would hazard to state that there are more IPv4 tunnels than IPv6 tunnels, as "tunneling" is what most people simply call a VPN, and there are large swaths of those.
Why do you think a maturing IPv6 means less tunnels as opposed to more?
More native instead of tunneling IPv6 over IPv4. Note that tunneling in this context is used for connecting locations that do not have IPv6 but have IPv4, not for connecting, a la VPN, networks where you need to gain access to a secured/secluded network. If people want to use a tunnel for the purpose of a VPN, then they will, be that IPv4 or IPv6 or both inside that tunnel.
Does IPv6 contain elegant solutions to all the issues one would resort to tunnels with IPv4?
Instead of having a custom VPN protocol one can do IPSEC properly now as there is no NAT that one has to get around. Microsoft's Direct Access does this btw and is an excellent example of doing it correctly.
Does successful IPv6 deployment require obsoleting tunneling?
No, why should it? But note that "IPv6 tunnels" (not VPNs) are a transition technique from IPv4 to IPv6 and thus should not remain around forever; the transition will end somewhere, sometime - likely far away in the future at the speed that IPv6 is being deployed ;)
Fail.
Today, most people cant even get IPv6 without tunnels.
In time that will change, that is simply transitional.
And tunnels are far from the only cause of MTU lower than what has become the only valid MTU of 1500, thanks in no small part to people who refuse to acknowledge operational reality and are quite satisfied with the state of things once they find a "them" to blame it on.
I just want to know if we can expect IPv6 to devolve into 1280 standard mtu and at what gigabit rates.
1280 is the minimum IPv6 MTU. If people allow PMTUD to work, aka accept and process ICMPv6 Packet-Too-Big messages, everything will just work.

This whole thread is about people who cannot be bothered to know what they are filtering and who might just randomly block PtB as they are doing with IPv4 today. Yes, in that case their network breaks if the packets are suddenly larger than a link somewhere else; that is the same as in IPv4 ;)

Greets,
Jeroen
I just want to know if we can expect IPv6 to devolve into 1280 standard mtu and at what gigabit rates.
1280 is the minimum IPv6 MTU. If people allow pMTU to work, aka accept and process ICMPv6 Packet-Too-Big messages everything will just work.
This whole thread is about people who cannot be bothered to know what they are filtering and that they might just randomly block PtB as they are doing with IPv4 today. Yes, in that case their network breaks if the packets are suddenly larger than a link somewhere else, that is the same as in IPv4 ;)
But it is not necessarily the person who filters the PTBs who suffers the breakage. It is the original source, which may be many IP hops further down the line and would have no way of knowing that the filtering is even happening.

Thanks - Fred
fred.l.templin@boeing.com
Greets, Jeroen
On 2012-06-04 14:55, Templin, Fred L wrote:
I just want to know if we can expect IPv6 to devolve into 1280 standard mtu and at what gigabit rates.
1280 is the minimum IPv6 MTU. If people allow pMTU to work, aka accept and process ICMPv6 Packet-Too-Big messages everything will just work.
This whole thread is about people who cannot be bothered to know what they are filtering and that they might just randomly block PtB as they are doing with IPv4 today. Yes, in that case their network breaks if the packets are suddenly larger than a link somewhere else, that is the same as in IPv4 ;)
But, it is not necessarily the person that filters the PTBs that suffers the breakage. It is the original source that may be many IP hops further down the line, who would have no way of knowing that the filtering is even happening.
It is not too tricky to figure that out actually:

$ tracepath6 www.nanog.org
 1?: [LOCALHOST]                        0.078ms pmtu 1500
 1:  2620:0:6b0:a::1                    0.540ms
 1:  2620:0:6b0:a::1                    1.124ms
 2:  ge-4-35.car2.Chicago2.Level3.net  56.557ms asymm 13
 3:  vl-52.edge4.Chicago2.Level3.net   57.501ms asymm 13
 4:  2001:1890:1fff:310:192:205:37:149 61.910ms asymm 10
 5:  cgcil21crs.ipv6.att.net           92.067ms asymm 12
 6:  sffca21crs.ipv6.att.net           94.720ms asymm 12
 7:  cr81.sj2ca.ip.att.net             90.068ms asymm 12
 8:  sj2ca401me3.ipv6.att.net          90.605ms asymm 11
 9:  2001:1890:c00:3a00::11fb:8591     89.888ms asymm 12
10:  no reply
11:  no reply
12:  no reply

and you'll at least have a good guess where it happens. Not something for non-techy users, but good enough hopefully for people working in the various NOCs. Now the tricky part is where to complain to get that fixed though ;)

Greets,
Jeroen

(tracepath6 is available on your favourite Linux, e.g. in the iputils-tracepath package for Debian; for the various *BSDs one can use scamper from: http://www.wand.net.nz/scamper/ )
Jeroen Massar wrote:
If people want to use a tunnel for the purpose of a VPN, then they will, be that IPv4 or IPv6 or both inside that tunnel.
Instead of having a custom VPN protocol one can do IPSEC properly now as there is no NAT that one has to get around. Microsoft's Direct Access does this btw and is an excellent example of doing it correctly.
Microsoft has had this capability since win2k. I didn't see any enterprises use it, even those who used their globally unique and routed IPv4 /16 internally. NAT was not why they did not use it. They did not use it externally; they did not use it internally. In fact, most of them were involved in projects to switch to NAT internally.

Enterprises also happen not to be thrilled with the absence of NAT in IPv6. Don't expect huge uptake there.
No why should it? But note that "IPv6 tunnels" (not VPNs) are a transition technique from IPv4 to IPv6 and thus should not remain around forever, the transition will end somewhere, sometime, likely far away in the future with the speed that IPv6 is being deployed ;)
So VPN is the _only_ acceptable use of sub 1500 encapsulation?
Today, most people cant even get IPv6 without tunnels.
In time that will change, that is simply transitional.
If turning it on with a tunnel breaks things, it won't make the native transition happen sooner.
1280 is the minimum IPv6 MTU. If people allow pMTU to work, aka accept and process ICMPv6 Packet-Too-Big messages everything will just work.
If things break with MTUs higher than 1280 but less than 1500, there really is no reason at all not to just use 1280; the efficiency difference is trivial. And on the IPv4 internet, we generally cannot control what most of the rest of the people on it do. It looks like we are not going to be doing any better on the IPv6 internet.
This whole thread is about people who cannot be bothered to know what they are filtering and that they might just randomly block PtB as they are doing with IPv4 today. Yes, in that case their network breaks if the packets are suddenly larger than a link somewhere else, that is the same as in IPv4 ;)
Greets, Jeroen
This whole thread is all about how IPv6 has not improved any of the issues that are well known with IPv4 and in many cases makes them worse - simply because it is designed with a "this time they will do what we want" mentality.

Joe
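Joe's earlier aside that the 1280-versus-1500 efficiency difference is trivial is easy to put a number on. Assuming a plain 40-byte IPv6 header plus a 20-byte TCP header and no options or extension headers, the payload efficiency gap is well under one percentage point:

# Payload efficiency of 1280 vs 1500 byte MTUs for a plain TCP-over-IPv6
# flow (40-byte IPv6 header + 20-byte TCP header, no options assumed).

HEADERS = 40 + 20

for mtu in (1280, 1500):
    efficiency = (mtu - HEADERS) / mtu
    print(f"MTU {mtu}: {efficiency:.1%} of each packet is payload")
# MTU 1280: 95.3%   MTU 1500: 96.0%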
On 2012-06-04 15:27, Joe Maimon wrote:
Jeroen Massar wrote:
If people want to use a tunnel for the purpose of a VPN, then they will, be that IPv4 or IPv6 or both inside that tunnel.
Instead of having a custom VPN protocol one can do IPSEC properly now as there is no NAT that one has to get around. Microsoft's Direct Access does this btw and is an excellent example of doing it correctly.
Microsoft has had this capability since win2k. I didnt see any enterprises use it, even those who used their globally unique and routed ipv4 /16 internally. NAT was not why they did not use it.
They did not use it externally, they did not use it internally.
In fact, most of them were involved in projects to switch to NAT internally.
Enterprises also happen not to be thrilled with the absence of NAT in IPv6.
What I read you as saying is that you know a lot of folks who do not understand the concept of end-to-end reachability and who think that NAT is a security feature and that ICMP is evil. That indeed matches most of the corporate world quite well. That they are heavily misinformed does not make it the correct answer though.
Dont expect huge uptake there.
Every problem has its own solution depending on the situation. DirectAccess is just another possible solution to a problem. If NATs had not existed and the IPsec key infrastructure were better integrated into operating systems, the uptake of IPsec-based VPNs would likely have been quite a bit higher by now. But that is all guesswork.
No why should it? But note that "IPv6 tunnels" (not VPNs) are a transition technique from IPv4 to IPv6 and thus should not remain around forever, the transition will end somewhere, sometime, likely far away in the future with the speed that IPv6 is being deployed ;)
So VPN is the _only_ acceptable use of sub 1500 encapsulation?
Why would anything be 'acceptable'? If you have a medium that can only carry X bytes per packet, then that is the way it is; you'll just have to be able to frag IPv6 packets on that medium if you want to support IPv6. And the good thing is that if you can support jumbo frames, just turn them on and let PMTUD do its work. Happy 9000s ;)
Today, most people cant even get IPv6 without tunnels.
In time that will change, that is simply transitional.
If turning it on with a tunnel breaks things, it wont make native transition happen sooner.
Using tunnels does not break things. Filtering PTB's (which can happen anywhere in the network, thus also remotely) can break things though. Or better said: mis-configuring systems break things.
This whole thread is all about how IPv6 has not improved any of the issues that are well known with IPv4 and in many cases makes them worse.
You cannot unteach stupid people to do stupid things. Protocol changes will not suddenly make people understand that what they want to do is wrong and breaks said protocol. Greets, Jeroen
What we need is World 1280 MTU day where *all* peering links are set to 1280 bytes for IPv4 and IPv6 and are NOT raised for 24 hours regardless of the complaints. This needs to be done annually.

--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
I kind of like the idea. I suspect that $DAYJOB would be less enthusiastic. Owen On Jun 4, 2012, at 5:13 PM, Mark Andrews wrote:
What we need is World 1280 MTU day where *all* peering links are set to 1280 bytes for IPv4 and IPv6 and are NOT raised for 24 hours regardless of the complaints. This needs to be done annually.
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
Jeroen Massar wrote:
That indeed matches most of the corporate world quite well. That they are heavily misinformed does not make it the correct answer though.
Either you are correct and they are all wrong, or they have a perspective that you don't or won't see. Either way, I don't see them changing their minds anytime soon. So how about we both accept that they exist and start designing the network to welcome rather than ostracize them - unless that is your intent.
And the good thing is that if you can support jumbo frames, just turn it on and let pMTU do it's work. Happy 9000's ;)
pMTU has been broken in IPv4 since the early days. It is still broken. It is also broken in IPv6. It will likely still be broken for the foreseeable future. This is:

a) a problem that should not be ignored
b) a failure of imagination when designing the protocol
c) a missed opportunity to correct a systemic issue with IPv4
Or better said: mis-configuring systems break things.
Why do switches auto-mdix these days? Because insisting that things will work properly if you just configure them correctly turns out to be inferior to designing a system that requires less configuration to achieve the same goal. Automate.
This whole thread is all about how IPv6 has not improved any of the issues that are well known with IPv4 and in many cases makes them worse.
You cannot unteach stupid people to do stupid things.
Protocol changes will not suddenly make people understand that what they want to do is wrong and breaks said protocol.
Greets, Jeroen
You also cannot teach protocol people that there is protocol and then there is reality. Relying on ICMP exception messages was always wrong for normal network operation. Best, Joe
Joe Maimon wrote:
pMTU has been broken in IPv4 since the early days.
It is still broken. It is also broken in IPv6. It will likely still be broken for the foreseeable future. This is
Relying on ICMP exception messages was always wrong for normal network operation.
Agreed. The proper solution is to have a field in the IPv7 header to measure PMTU. It can be an 8-bit field, if the fragment granularity is 256B.

Masataka Ohta
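As an illustration of how such a field might behave (this is a sketch of the idea being described, not any specified protocol): every forwarding hop clamps the field to its outgoing link MTU expressed in 256-byte units, so the receiver learns the path minimum in-band, with no ICMP round trip. The link MTUs below are made-up example values.

# Sketch of an in-band PMTU field with 256-byte granularity, as an
# 8-bit value clamped hop by hop. Purely illustrative - no such field
# exists in IPv6.

GRANULARITY = 256

def clamp_pmtu_field(field, link_mtu):
    """Each forwarding hop lowers the field to its link MTU, in 256B units."""
    return min(field, link_mtu // GRANULARITY)

path_link_mtus = [9000, 1500, 4464, 1280]   # hypothetical links along the path

field = 255   # sender starts at the maximum (255 * 256 = 65280 bytes)
for mtu in path_link_mtus:
    field = clamp_pmtu_field(field, mtu)

print(f"receiver sees field={field}, i.e. path MTU <= {field * GRANULARITY} bytes")
# -> receiver sees field=5, i.e. path MTU <= 1280 bytes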
On Jun 4, 2012, at 5:33 PM, Masataka Ohta wrote:
Joe Maimon wrote:
pMTU has been broken in IPv4 since the early days.
It is still broken. It is also broken in IPv6. It will likely still be broken for the foreseeable future. This is
Relying on ICMP exception messages was always wrong for normal network operation.
Agreed.
The proper solution is to have a field in the IPv7 header to measure PMTU. It can be an 8-bit field, if the fragment granularity is 256B.
Masataka Ohta
If you're going to redesign the header, I'd be much more interested in having 32 bits for the destination ASN so that IDR can ignore IP prefixes altogether. Owen
On 2012-06-04 17:57, Owen DeLong wrote: [..]
If you're going to redesign the header, I'd be much more interested in having 32 bits for the destination ASN so that IDR can ignore IP prefixes altogether.
One can already do that: route your IPv6 over IPv4.... IPv4 has 32bit destination addresses remember? :) It is also why it is fun if somebody uses a 32-bit ASN to route IPv4, as one is not making the problem smaller that way. ASNs are more used as identifiers to avoid routing loops than as actual routing parameters. Greets, Jeroen
On Jun 4, 2012, at 6:11 PM, Jeroen Massar wrote:
On 2012-06-04 17:57, Owen DeLong wrote: [..]
If you're going to redesign the header, I'd be much more interested in having 32 bits for the destination ASN so that IDR can ignore IP prefixes altogether.
One can already do that: route your IPv6 over IPv4.... IPv4 has 32bit destination addresses remember? :)
It is also why it is fun if somebody uses a 32-bit ASN to route IPv4, as one is not making the problem smaller that way. ASNs are more used as identifiers to avoid routing loops than as actual routing parameters.
Greets, Jeroen
While this is true today (to some extent), it doesn't have to always be true.

If we provided a reliable scalable mechanism to distribute and cache prefix->ASN mappings and could reliably populate a DEST-AS field in the packet header, stub networks would no longer need separate ASNs to multihome, IDR routing could be based solely on the best path to the applicable DEST-AS, and we wouldn't even need to carry prefixes beyond the local AS border.

While I don't think DNS is up to the task of reliable distribution and caching (though something somewhat similar to DNS could do the job rather well), DNS-style resource records could be used. For example, instead of using my own AS1734 as I do today, my multi-homed household could be placed in the database with pointers to my two upstream ASNs as follows:

2620:0:930::/48    AS 10 6939    AS 10 8121
192.124.40.0/23    AS 10 6939    AS 10 8121
192.159.10.0/24    AS 10 6939    AS 10 8121

Or, if I wanted to do some traffic engineering, I could tweak the preferences to be non-equal values. The router doing the DEST-AS insertion into the packet would grab the most preferred AS to which it has a valid feasible successor.

I believe that the number of transit autonomous systems on the planet is much smaller than the minimum number of prefixes needed to represent all multi-homed organizations with independent routing policies. As such, I believe this could produce much more scalable routing with relatively little additional overhead.

Owen
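A rough sketch of the lookup such a border router might perform, using the example records above; the data structure and the "lowest preference value wins" rule are assumptions made purely for the illustration, not part of any proposal text:

# Illustrative prefix -> (preference, ASN) lookup for the DEST-AS idea.
# Longest-prefix match picks the record set; the lowest preference value
# wins, with ties broken here by lowest ASN (an assumption for the sketch).

import ipaddress

DEST_AS_DB = {
    "2620:0:930::/48": [(10, 6939), (10, 8121)],
    "192.124.40.0/23": [(10, 6939), (10, 8121)],
    "192.159.10.0/24": [(10, 6939), (10, 8121)],
}

def lookup_dest_as(destination):
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, entries in DEST_AS_DB.items():
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, entries)
    if best is None:
        return None
    return min(best[1])[1]   # most preferred entry's ASN

print(lookup_dest_as("2620:0:930::25"))   # -> 6939
print(lookup_dest_as("192.159.10.7"))     # -> 6939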
On 2012-06-04 23:06, Owen DeLong wrote:
On Jun 4, 2012, at 6:11 PM, Jeroen Massar wrote:
On 2012-06-04 17:57, Owen DeLong wrote: [..]
If you're going to redesign the header, I'd be much more interested in having 32 bits for the destination ASN so that IDR can ignore IP prefixes altogether.
One can already do that: route your IPv6 over IPv4.... IPv4 has 32bit destination addresses remember? :)
It is also why it is fun if somebody uses a 32-bit ASN to route IPv4, as one is not making the problem smaller that way. ASNs are more used as identifiers to avoid routing loops than as actual routing parameters.
Greets, Jeroen
While this is true today (to some extent), it doesn't have to always be true.
If we provided a reliable scaleable mechanism to distribute and cache prefix->ASN mappings and could reliably populate a DEST-AS field in the packet header, stub networks would no longer need separate ASNs to multihome and IDR routing could be based solely on best path to the applicable DEST-AS and we wouldn't even need to carry prefixes beyond the local AS border.
The problem here does not lie with the fact that various of these systems (LISP comes to mind amongst others) have been well researched and implemented already, but with the fact that the general operator community will not change to such a new system as it is not what they are used to. Greets, Jeroen
On Jun 5, 2012, at 7:44 AM, Jeroen Massar wrote:
On 2012-06-04 23:06, Owen DeLong wrote:
On Jun 4, 2012, at 6:11 PM, Jeroen Massar wrote:
On 2012-06-04 17:57, Owen DeLong wrote: [..]
If you're going to redesign the header, I'd be much more interested in having 32 bits for the destination ASN so that IDR can ignore IP prefixes altogether.
One can already do that: route your IPv6 over IPv4.... IPv4 has 32bit destination addresses remember? :)
It is also why it is fun if somebody uses a 32-bit ASN to route IPv4, as one is not making the problem smaller that way. ASNs are more used as identifiers to avoid routing loops than as actual routing parameters.
Greets, Jeroen
While this is true today (to some extent), it doesn't have to always be true.
If we provided a reliable scaleable mechanism to distribute and cache prefix->ASN mappings and could reliably populate a DEST-AS field in the packet header, stub networks would no longer need separate ASNs to multihome and IDR routing could be based solely on best path to the applicable DEST-AS and we wouldn't even need to carry prefixes beyond the local AS border.
The problem here does not lie with the fact that various of these systems (LISP comes to mind amongst others) have been well researched and implemented already, but with the fact that the general operator community will not change to such a new system as it is not what they are used to.
Greets, Jeroen
LISP et al. requires a rather complicated deployment and would be even more complex to troubleshoot when it fails.

What I am proposing could, literally, be deployed with the existing system still running as it does. The difference would be that for packets containing a dest-as field, we would (initially) have the option of routing to the destination based on that field and ignoring the prefix. Later, the distribution of prefixes in BGP could be deprecated, but that would be several years off. What I am proposing is much, much simpler to implement and much closer to what operators are used to than the map/encap solutions proposed to date.

What I am proposing, however, requires us to add fields to the packet header (at the source), so it will take a long time to get implemented even if it ever is. Deploying it would require code updates to every router and end host and would require a new IP version number. However, the code changes at the host level would be pretty minor: add some bits to the header and set them all to zero. The first router with "full" routing information and access to a populated cache, or the ability to resolve dest-as for the destination prefix, would populate the dest-as field. From that point until it arrived at the actual destination AS, the packet would be routed based on the best path to the AS in question.

True, this would mean operators would have to revert to using an ASN to represent a common routing policy, but at most that would require some larger operators to deploy a few more ASNs. ASNs would no longer be required for what are now stub ASes.

It's truly unfortunate that the IETF chose to drop the ball on this when IPv6 was being developed.

Owen
On 2012-06-05 11:44, Owen DeLong wrote: [..]
LISP et. al requires a rather complicated deployment and would be even more complex to troubleshoot when it fails.
What I am proposing could, literally, be deployed with the existing system still running as it does. The difference would be that for packets containing a dest-as field, we would (initially) have the option of routing to destination based on that field and ignoring the prefix.
I would love to see a more formal specification, a la an IETF draft, and/or a short preso-style thing, along with a comparison to existing proposals and how this is different/better.
What I am proposing, however, requires us to add fields to the packet header (at the source)
Well, we have IPv6 extension headers and the flow-label is still undefined too ;) Greets, Jeroen
The proper solution is to have a field in the IPv7 header to measure PMTU. It can be an 8-bit field, if the fragment granularity is 256B.
We tried that for IPv4 and it didn't work very well [RFC1063]. You are welcome to try again in IPv7 when we have a green field.

Fred
fred.l.templin@boeing.com
Templin, Fred L wrote:
The proper solution is to have a field in the IPv7 header to measure PMTU. It can be an 8-bit field, if the fragment granularity is 256B.
We tried that for IPv4 and it didn't work very well [RFC1063].
IP options are a bad idea, which is partly why IPv6 sucks.

Masataka Ohta
On Jun 4, 2012, at 5:21 PM, Joe Maimon wrote:
Jeroen Massar wrote:
That indeed matches most of the corporate world quite well. That they are heavily misinformed does not make it the correct answer though.
Either you are correct and they are all wrong, or they have a perspective that you dont or wont see.
He is correct. I have seen their perspective and it is, in fact, misinformed and based largely on superstition.
Either way I dont see them changing their mind anytime soon.
Very likely true, unfortunately. Zealots are rarely persuaded by facts, science, or anything based in reality, choosing instead to maintain their bubble of belief, even to the point of historically killing those who could not accept their misguided viewpoint.

Nonetheless, over time, even humans eventually figured out that Galileo was right and the world is, indeed, round, does, in fact, orbit the sun (and not the other way around) and is not, in fact, at the center of the universe. Given that we were able to overcome the Catholic Church with those facts eventually, I suspect that overcoming corporate IT mythology will be somewhat quicker and easier. It might even take less than 100 years instead of several hundred.
So how about we both accept that they exist and start designing the network to welcome rather than ostracize them, unless that is your intent.
I would rather educate them and let them experience the error of their ways until they learn than damage the network in the pursuit of inclusion in this case. If you reward bad behavior with adaptation and accommodation, you get more bad behavior. This was proven by the appeasement of Hitler in the 40s (hey, someone had to feed Godwin's law, right?) and has been confirmed by the recent corporate bail-outs, bank bail-outs and the mortgage crisis.

One could even argue that the existing corporate attitudes about NAT are a reflection of this behavior being rewarded with ALGs and other code constructs aimed at accommodating that bad behavior.
And the good thing is that if you can support jumbo frames, just turn it on and let pMTU do it's work. Happy 9000's ;)
pMTU has been broken in IPv4 since the early days.
PMTUD is broken in IPv4 since the early days because it didn't exist in the early days. PMTUD is a relatively recent feature for IPv4. PMTUD has been getting progressively less broken in IPv4 since it was introduced.
It is still broken. It is also broken in IPv6. It will likely still be broken for the forseeable future. This is
PMTU-D itself is not broken in IPv6, but some networks do break PMTU-D.
a) a problem that should not be ignored
True. Ignoring ignorance is no better than accommodating it. The correct answer to ignorance is education.
b) a failure in imagination when designing the protocol
Not really. In reality, it is a failure of implementers to follow the published standards. The protocol, as designed, works as expected if deployed in accordance with the specifications.
c) a missed opportunity to correct a systemic issue with IPv4
There are many of those (the most glaring being the failure to address scalability of the routing system). However, since, as near as I can tell, PMTU-D was a new feature for IPv6 which was subsequently back-ported to IPv4, I am not sure that statement really applies in this case. Many of the features we take for granted in IPv4 today were actually introduced as part of IPv6 development, including IPSEC, PMTU-D, CIDR notation for prefix length, and more.
Or better said: mis-configuring systems break things.
Why do switches auto-mdix these days?
Because it makes correct configuration easier. You can turn this off on most switches, in fact, and if you do, you can still misconfigure them.

Any device with a non-buggy IPv6 implementation does not, by default, block ICMPv6 PTB messages. If you subsequently misconfigure it to block them, then you have taken deliberate action to misconfigure your network.
Because insisting that things will work properly if you just configure them correctly turns out to be inferior to designing a system that requires less configuration to achieve the same goal.
Breaking PMTU-D in IPv6 requires configuration unless the implementation is buggy. Don't get me started on how bad a buggy Auto MDI/X implementation can make your life. Believe me, it is far worse than PMTU-D blocking.
Automate.
Already done. Most PMTU-D blocks in IPv6 are the result of operators taking deliberate configuration action to block packets that should not be blocked. That is equivalent to turning off Auto-MDI/X or Autonegotiation on the port.
This whole thread is all about how IPv6 has not improved any of the issues that are well known with IPv4 and in many cases makes them worse.
You cannot unteach stupid people to do stupid things.
I disagree. People can be educated. It may take more effort than working around them, but, it can be done.
Protocol changes will not suddenly make people understand that what they want to do is wrong and breaks said protocol.
Nope... This requires education.
Greets, Jeroen
You also cannot teach protocol people that there is protocol and then there is reality.
Huh? This seems nonsensical to me, so I am unsure what you mean.
Relying on ICMP exception messages was always wrong for normal network operation.
Here we must disagree. What else can you rely on? In order to characterize the path to a given destination, you must either get an exception back for packets that are too large, or you must get confirmation back that your packet arrived. The absence of arrival confirmation does not tell you anything about why the packet did not arrive, so assuming that it is due to size requires a rather lengthy conversation searching for the largest size of packet that will pass through the path.

For example, consider that you are on an ethernet segment with jumbo frames. PMTU-D is relatively efficient. You send a 9000 octet datagram and you get back an ICMP message telling you the largest size datagram that will pass. If there are several points where the PMTU is reduced along the way, you will have 1 round trip of this type for each of those points. Notice there are no waits for timeouts involved here.

Probing as you have proposed requires you to essentially do a binary search to arrive at some number n where 1280≤n≤9000, so, you end up doing something like this:

Send 5140 octet datagram, wait for reply (how long?)
Send 3210 octet datagram, wait for reply (how long?)
Send 2245 octet datagram, wait for reply (how long?)
Send 1762 octet datagram, wait for reply (how long?)
Send 1521 octet datagram, wait for reply (how long?)
Send 1400 octet datagram, wait for reply (how long?)
Send 1340 octet datagram, wait for reply (how long?)
Send 1310 octet datagram, wait for reply (how long?)
Send 1296 octet datagram, wait for reply (how long?)
Send 1288 octet datagram, wait for reply (how long?)
Send 1284 octet datagram, wait for reply (how long?)
Send 1282 octet datagram, wait for reply (how long?)
Send 1281 octet datagram, wait for reply (how long?)
Settle on 1280 MTU...

So, you waited for 13 timeouts before you actually passed useful traffic? Or, perhaps you putter along at the lowest possible MTU until you find some higher value you know works, so you're sending lots of extra traffic? That's fantastic for modern short-lived flows. You send more traffic for PMTU discovery than you receive in the entire life of the flow in some cases.

Owen
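To put rough numbers on the sequential-probing complaint above, here is a tiny simulation of a binary search between 1280 and 9000 against a path whose real MTU is 1280. The 100ms RTT and 1s probe timeout are illustrative assumptions, and the exact probe sizes differ slightly from the list above depending on how the midpoint is rounded.

# A minimal simulation of sequential binary-search probing, assuming each
# failed probe costs one timeout and each successful probe costs one RTT.
# PATH_MTU, RTT and TIMEOUT are assumed values, not measurements.

def binary_search_probes(low, high, path_mtu):
    """Return the probe sizes a sequential binary search sends while
    narrowing the PMTU within [low, high]."""
    probes = []
    while low < high:
        mid = (low + high + 1) // 2      # try the midpoint, rounding up
        probes.append(mid)
        if mid <= path_mtu:              # probe fits: raise the floor
            low = mid
        else:                            # probe dropped: lower the ceiling
            high = mid - 1
    return probes

if __name__ == "__main__":
    RTT = 0.1          # assumed 100 ms round trip
    TIMEOUT = 1.0      # assumed 1 s probe timeout
    probes = binary_search_probes(1280, 9000, path_mtu=1280)
    lost = sum(1 for p in probes if p > 1280)
    wait = lost * TIMEOUT + (len(probes) - lost) * RTT
    print(f"{len(probes)} probes, {lost} timeouts, ~{wait:.1f}s before useful data")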
On 6/4/12, Owen DeLong <owen@delong.com> wrote:
[snip]
Probing as you have proposed requires you to essentially do a binary search to arrive at some number n where 1280≤n≤9000, so, you end up doing something like this:
[snip]
So, you waited for 13 timeouts before you actually passed useful traffic? Or, perhaps you putter along at the lowest possible MTU until you
[snip]

Instead of waiting for 13 timeouts, start with 4 initial probes in parallel, and react rapidly to the responses you receive; say 9000, 2200, 1500, 830. Don't wait for any timeouts until the possible MTUs are narrowed.

FindLocalMTU(B,T)
  Let B := Minimum_MTU
  Let T := Maximum_MTU
  Let D := Max(1, Floor( ( (T - 1) - (B+1) ) / 4 ))
  Let R := T
  Let Attempted_Probes := []

  While ( ( (B + D) < T ) or Attempted_Probes is not Empty ) do
      If R is not a member of Attempted_Probes or Retries < 1 then
          AsynchronouslySendProbeOfSize (R)
          Append (R,Tries) to list of Attempted_Probes if not exists
            or if (R,Tries) already in list then increment Retries.
      else
          T := R - 1
          Delete from Attempted_Probes (R)
      end

      if ( (B + D) < T )   AsynchronouslySendProbeOfSize (B + D)
      if ( (B + 2*D) < T ) AsynchronouslySendProbeOfSize (B + 2*D)
      if ( (B + 3*D) < T ) AsynchronouslySendProbeOfSize (B + 3*D)
      if ( (B + 4*D) < T ) AsynchronouslySendProbeOfSize (B + 4*D)

      Wait_For_Next_Probe_Response_To_Arrive()
      Wait_For_Additional_Probe_Response_Or_Short_Subsecond_Delay()
      Add_Probe_Responses_To_Queue(Q)

      R := Get_Largest_Received_Probe_Size(Q)
      If ( R > T ) then
          T := R
      end
      If ( R > B ) then
          B := R
          D := Max(1, Floor( ( (R - 1) - (B+1) ) / 4 ))
      end
  done
  Result := B

#
If you receive the response at n=830 first, then wait 1ms and send the next 4 probes 997 1164 1331 1498, and resend the n=1500 probe. If 1280 is what the probe needs to detect, you'll receive a response for 1164, so wait 1ms then retry n=1498; the next 4 probes are 1247 1330 1413 1496. If 1280 is what the probe needs to detect, you'll receive a response for 1247, so wait 1ms and resend n=1496; the next 4 probes are 1267 1307 1327 1347. If 1280 is what you need to detect, you'll receive a response for 1267, so retry n=1347 and wait 1ms; the next 4 probes are 1276 1285 1294 1303; the next 4 probes are 1277 1278 1279 1280; the next 2 parallel probes are 1281 1282.

You hit after 22 probes, but you only needed to wait for n=1281, n=1282 and their retry to time out.

--
-JH
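For comparison, here is a small, runnable Python sketch of the same 4-way parallel probing idea. It is a simplified rendering, not the pseudocode above verbatim (the retry bookkeeping questioned in the replies below is deliberately left out), with probe_fits() standing in for the network and an assumed path MTU of 1280.

# A simplified, runnable rendering of the 4-way parallel probe idea.
# PATH_MTU is an assumed value for the simulation only.

PATH_MTU = 1280

def probe_fits(size):
    """Pretend to send a probe; True means a response came back."""
    return size <= PATH_MTU

def parallel_probe_mtu(low=1280, high=9000, fanout=4):
    """Narrow the path MTU by sending `fanout`+1 probes per round.
    Returns (discovered_mtu, probes_sent, rounds_waited)."""
    probes_sent = 0
    rounds = 0
    while low < high:
        step = max(1, (high - low) // (fanout + 1))
        sizes = sorted({min(low + i * step, high) for i in range(1, fanout + 1)} | {high})
        probes_sent += len(sizes)
        rounds += 1
        answered = [s for s in sizes if probe_fits(s)]
        if answered:
            low = max(answered)           # raise the floor to the largest success
            failed = [s for s in sizes if s > low]
            if failed:
                high = min(failed) - 1    # anything above the smallest failure is out
        else:
            high = min(sizes) - 1         # all probes lost: shrink the ceiling
    return low, probes_sent, rounds

if __name__ == "__main__":
    mtu, probes, rounds = parallel_probe_mtu()
    print(f"discovered MTU {mtu} after {probes} probes in {rounds} rounds")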
On Jun 4, 2012, at 7:47 PM, Jimmy Hess wrote:
On 6/4/12, Owen DeLong <owen@delong.com> wrote: [snip]
Probing as you have proposed requires you to essentially do a binary search to arrive at some number n where 1280≤n≤9000, so, you end up doing something like this: [snip]
So, you waited for 13 timeouts before you actually passed useful traffic? Or, perhaps you putter along at the lowest possible MTU until you [snip]

Instead of waiting for 13 timeouts, start with 4 initial probes in parallel, and react rapidly to the responses you receive; say 9000, 2200, 1500, 830.
What's the point of an 830 probe when the minimum valid MTU is 1280?
Don't wait for any timeouts until the possible MTUs are narrowed.
FindLocalMTU(B,T)
    Let B := Minimum_MTU
    Let T := Maximum_MTU
    Let D := Max(1, Floor( ( (T - 1) - (B+1) ) / 4 ))
    Let R := T
    Let Attempted_Probes := []
    While ( ( (B + D) < T ) or Attempted_Probes is not Empty ) do
        If R is not a member of Attempted_Probes or Retries < 1 then
            AsynchronouslySendProbeOfSize (R)
            Append (R,Tries) to list of Attempted_Probes if not exists
              or if (R,Tries) already in list then increment Retries.
Did I miss the definition of Tries and/or Retries somewhere? ;-)
        else
            T := R - 1
            Delete from Attempted_Probes (R)
        end
        if ( (B + D)   < T ) AsynchronouslySendProbeOfSize (B + D)
        if ( (B + 2*D) < T ) AsynchronouslySendProbeOfSize (B + 2*D)
        if ( (B + 3*D) < T ) AsynchronouslySendProbeOfSize (B + 3*D)
        if ( (B + 4*D) < T ) AsynchronouslySendProbeOfSize (B + 4*D)
Shouldn't all of those be <= T?
Wait_For_Next_Probe_Response_To_Arrive()
        Wait_For_Additional_Probe_Response_Or_Short_Subsecond_Delay()
        Add_Probe_Responses_To_Queue(Q)
Not really a Queue, more of a list. In fact, no real need to maintain a list at all, you could simply keep a variable Q and let Q=max(Q,Probe_response)
R := Get_Largest_Received_Probe_Size(Q)
Which would allow you to eliminate this line altogether and replace R below with Q.
If ( R > T ) then T := R end
        If ( R > B ) then
            B := R
            D := Max(1, Floor( ( (R - 1) - (B+1) ) / 4 ))
        end
    done
Result := B
#
If you receive the response at n=830 first, then wait 1ms and send the next 4 probes 997 1164 1331 1498, and resend the n=1500 probe.
If 1280 is what the probe needs to detect, you'll receive a response for 1164, so wait 1ms then retry n=1498; the next 4 probes are 1247 1330 1413 1496.
If 1280 is what the probe needs to detect, you'll receive a response for 1247, so wait 1ms and resend n=1496; the next 4 probes are 1267 1307 1327 1347.
If 1280 is what you need to detect, you'll receive a response for 1267, so retry n=1347 and wait 1ms; the next 4 probes are: 1276 1285 1294 1303.
The next 4 probes are: 1277 1278 1279 1280.
The next 2 parallel probes are: 1281 1282.
You hit after 22 probes, but you only needed to wait for n=1281, n=1282, and their retry to time out.
But that's a whole lot more packets than working PMTU-D to get there and you're also waiting for all those round trips, not just the 4 timeouts. The round trips add up if you're dealing with a 100ms+ RTT. 22 RTTs at 100ms is 2.2 seconds. That's a long time to go without the first data packet passed.

Owen
Owen DeLong wrote:
But that's a whole lot more packets than working PMTU-D to get there and you're also waiting for all those round trips, not just the 4 timeouts.
The round trips add up if you're dealing with a 100ms+ RTT. 22 RTTs at 100ms is 2.2 seconds. That's a long time to go without the first data packet passed.
Owen
Yes, it is quite nice when ICMP helpfully informs you what your MTU is.

However, we have known for quite some time that it is simply not reliable on the IPv4 internet, for a multitude of reasons, with intentional ICMP blocking just one of them.

I have no reason to expect it to work better in IPv6.

This is why more reliable methods are a good idea, even if they work slower or add more overhead, because as I see it they are intended to be used concurrently with ICMP.

Also, as I understand the probing, getting data through happens much faster than arriving at the optimal mtu size does.

Perhaps short flows should just be sticking to the min-mtu anyways.

Joe
On Jun 5, 2012, at 5:21 AM, Joe Maimon wrote:
Owen DeLong wrote:
But that's a whole lot more packets than working PMTU-D to get there and you're also waiting for all those round trips, not just the 4 timeouts.
The round trips add up if you're dealing with a 100ms+ RTT. 22 RTTs at 100ms is 2.2 seconds. That's a long time to go without the first data packet passed.
Owen
Yes, it is quite nice when ICMP helpfully informs you what your MTU is.
However, we have known for quite some time that it is simply not reliable on the IPv4 internet, for a multitude of reasons, with intentional ICMP blocking just one of them.
You keep saying this, yet you have offered no other explanations.
I have no reason to expect it to work better in IPv6.
That's where we differ. In IPv4, since PMTU-D was a new thing, it had to be optional and we had to work around places where it was broken to avoid flat-out breaking the internet. In IPv6, we have the opportunity to push the issue and use education to resolve the problem correctly.
This is why more reliable methods are a good idea, even if they work slower or add more overhead, because as I see it they are intended to be used concurrently with ICMP.
ICMP can be a reliable method if we just stop breaking it. Do you have some reason to believe people won't break the other methods, too? I don't. PMTU-D can't be fire-and-forget because paths aren't static.
Also, as I understand the probing, getting data through happens much faster than arriving at the optimal mtu size does.
At the expense of sending a lot of unnecessary additional datagrams.
Perhaps short flows should just be sticking to the min-mtu anyways.
So you want to turn a short flow (say, retrieving a 20k PNG) from being a 3-packet transaction on a path with jumbo frame support at 9000 octets into a 16+ packet exchange? (Note I'm only counting the payload packets in one direction, not setup, teardown, additional acks, etc.) Still seems like a bad idea to me.

Owen
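As a rough check on those packet counts, a small back-of-the-envelope sketch (it assumes a 40-byte IPv6 header plus a 20-byte TCP header and no options, which is a simplification):

import math

def payload_packets(transfer_bytes, mtu, headers=40 + 20):
    """Data packets needed in one direction for a transfer of the given size,
    assuming IPv6 (40B) + TCP (20B) headers and no options."""
    return math.ceil(transfer_bytes / (mtu - headers))

print(payload_packets(20 * 1024, 9000))   # 3 packets on a jumbo-frame path
print(payload_packets(20 * 1024, 1500))   # 15 packets at the common 1500 MTU
print(payload_packets(20 * 1024, 1280))   # 17 packets at the IPv6 minimum MTU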
A quick comment on probes. Making the tunnel ingress probe is tempting but fraught with difficulties; believe me, I have tried. So, having the tunnel ingress fragment when necessary in conjunction with the original source probing is the way forward, and we should advocate both approaches. RFC4821 specifies how the original source can probe with or without tunnels in the path. It does not have any RTT delays, because it starts small and then tries for larger sizes in parallel with getting the valuable data through without loss. Thanks - Fred fred.l.templin@boeing.com
In message <E1829B60731D1740BB7A0626B4FAF0A65D374A86CB@XCH-NW-01V.nw.nos.boeing .com>, "Templin, Fred L" writes:
A quick comment on probes. Making the tunnel ingress probe is tempting but fraught with difficulties; believe me, I have tried. So, having the tunnel ingress fragment when necessary in conjunction with the original source probing is the way forward, and we should advocate both approaches.
RFC4821 specifies how the original source can probe with or without tunnels in the path. It does not have any RTT delays, because it starts small and then tries for larger sizes in parallel with getting the valuable data through without loss.
It's useful for TCP but it is not a general solution. PTB should not be being blocked and for some applications one should just force minimum mtu use.
Thanks - Fred fred.l.templin@boeing.com -- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
-----Original Message----- From: Mark Andrews [mailto:marka@isc.org] Sent: Tuesday, June 05, 2012 7:55 AM To: Templin, Fred L Cc: Owen DeLong; Jimmy Hess; nanog@nanog.org Subject: Re: IPv6 day and tunnels
In message <E1829B60731D1740BB7A0626B4FAF0A65D374A86CB@XCH-NW- 01V.nw.nos.boeing .com>, "Templin, Fred L" writes:
A quick comment on probes. Making the tunnel ingress probe is tempting but fraught with difficulties; believe me, I have tried. So, having the tunnel ingress fragment when necessary in conjunction with the original source probing is the way forward, and we should advocate both approaches.
RFC4821 specifies how the original source can probe with or without tunnels in the path. It does not have any RTT delays, because it starts small and then tries for larger sizes in parallel with getting the valuable data through without loss.
It's useful for TCP but it is not a general solution. PTB should not be being blocked and for some applications one should just force minimum mtu use.
Any packetization layer that is capable of getting duplicate ACKs from the peer can do it. Plus, support for just TCP is all that is needed for the vast majority of end systems at the current time. Thanks - Fred fred.l.templin@boeing.com
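For Linux hosts, RFC 4821-style probing for TCP is controlled by the net.ipv4.tcp_mtu_probing sysctl (despite the name it governs the shared TCP stack, so TCP over IPv6 as well; 0 = off, 1 = probe after a suspected black hole, 2 = always probe). A small sketch of checking and enabling it, assuming that sysctl path:

from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4/tcp_mtu_probing")

def get_mtu_probing():
    """Current mode: 0 = disabled, 1 = on after a suspected black hole, 2 = always."""
    return int(SYSCTL.read_text())

def enable_mtu_probing(mode=1):
    """Turn on packetization-layer PMTU probing for the host's TCP stack (requires root)."""
    SYSCTL.write_text(str(mode))

if __name__ == "__main__":
    print("tcp_mtu_probing =", get_mtu_probing())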
On 6/5/12, Owen DeLong <owen@delong.com> wrote: [snip]
But that's a whole lot more packets than working PMTU-D to get there and you're also waiting for all those round trips, not just the 4 timeouts. The round trips add up if you're dealing with a 100ms+ RTT. 22 RTTs at 100ms is 2.2 seconds. That's a long time to go without first data packet
I'm only suggesting probing to discover the MTU between neighboring endpoints directly connected to the same subnet -- a Layer 2 interconnect -- not for IP end-to-end MTU discovery. PMTUD doesn't work there, because devices know the MTU of _their link_, but not necessarily the MTU of every intervening bridge or L2 tunnel.

The "too big" packet forwarded is just discarded by the L2 bridge; there's no ICMP packet that can result from that, and the L2 bridge might not even have an IP address to source one from, so the PMTU method, which relies on ICMP alone, cannot possibly work.

The router, after discovering the local MTU constraints to its neighbors, would then be responsible for sending TooBig messages as needed or passing on the MTU constraint.

You've got an issue if there are 100ms between two peers on your LAN. You're right, you don't need to probe for possible MTUs below 1280.
--
-JH
On Jun 5, 2012, at 10:15 AM, Jimmy Hess wrote:
On 6/5/12, Owen DeLong <owen@delong.com> wrote: [snip]
But that's a whole lot more packets than working PMTU-D to get there and you're also waiting for all those round trips, not just the 4 timeouts. The round trips add up if you're dealing with a 100ms+ RTT. 22 RTTs at 100ms is 2.2 seconds. That's a long time to go without first data packet
I'm only suggesting probing to discover the MTU between neighboring endpoints directly connected to the same subnet -- a Layer 2 interconnect -- not for IP end-to-end MTU discovery. PMTUD doesn't work there, because devices know the MTU of _their link_, but not necessarily the MTU of every intervening bridge or L2 tunnel.
This is a horrible misconfiguration of the devices on that link. If your MTU setting on your interface is larger than the smallest MTU of any L2 forwarder on the link, then, you have badly misconfigured your system(s). Adding probing to compensate for this misconfiguration merely serves to perpetuate such errant configurations.
The "too big packet" forwarded is just discarded by the L2 bridge, there's no ICMP packet that can result from that, the L2 bridge might not even have an IP address to source one from, so the PMTU method which relies on ICMP alone cannot possibly work.
Sure... PMTU-D isn't designed to compensate for misconfigured links, it's designed to detect the path MTU based on the smallest correctly configured L3 MTU in the path.
The router after discovering the local MTU constraints to its neighbors would then be responsible for sending TooBig messages as needed or passing on the MTU constraint.
So you want to add an MTU setting for each L2 destination to the L2 adjacency table? This seems like a really bad idea. The better solution would be to correctly configure your link MTUs in the first place.
You've got an issue if there are 100ms between two peers on your LAN. You're right, you don't need to probe for possible MTUs below 1280.
LAN, sure. However, consider that there are intercontinental L2 links. Owen
On 6/5/12, Owen DeLong <owen@delong.com> wrote:
This is a horrible misconfiguration of the devices on that link. If your MTU setting on your interface is larger than the smallest MTU of any L2 forwarder on the link, then, you have badly misconfigured
Not really; The network layer and L2 protocols should both be designed to handle this, it is a design error in the protocol that it doesn't. You say it's "misconfiguration", but if IP handled the situation reasonably, it shouldn't be necessary to configure anything in the first place. Whether the neighbors are LAN or cross-tunnel, the issues are similar.

It's only a misconfiguration because of flaws in the protocol.

Just like you expect to plug devices in a typical LAN and it's not a configuration error to fail to manually find every switch in the LAN and enter MAC addresses into a forwarding table by hand; likewise, you shouldn't expect to key a MTU into every device by hand.

IP should be designed so that devices on the link that _can_ handle the large transmission unit, which provides efficiency gains, should be allowed to fully utilize those capabilities, without breakage of connectivity to devices on the same link that have more limited capabilities and can only receive the Minimum required frame size (smaller MTU), and without separating the subnet or installing dividing Proxy ARP servers to send ICMP TooBig packets.
Adding probing to compensate for this misconfiguration merely serves to perpetuate such errant configurations.
Just like adding MAC address learning to Ethernet switches to compensate for the misconfiguration of failing to manually enter hardware addresses into your switches, serves to perpetuate such errant configurations, where the state of the forwarding tables is unreliably left in a non-deterministic state.
You've got an issue if there are 100ms between two peers on your LAN. You're right, you don't need to probe for possible MTUs below 1280. LAN, sure. However, consider that there are intercontinental L2 links.
Intercontinental multi-access L2 links, perhaps, are a horrible misconfiguration.
Owen -- -JH
On Jun 5, 2012, at 6:02 PM, Jimmy Hess wrote:
On 6/5/12, Owen DeLong <owen@delong.com> wrote:
This is a horrible misconfiguration of the devices on that link. If your MTU setting on your interface is larger than the smallest MTU of any L2 forwarder on the link, then, you have badly misconfigured
Not really; The network layer and L2 protocols should both be designed to handle this, it is a design error in the protocol that it doesn't. You say it's "misconfiguration", but if IP handled the situation reasonably, it shouldn't be necessary to configure anything in the first place. Whether the neighbors are LAN or cross-tunnel, the issues are similar.
Really, no. The L3 MTU on an interface should be configured to the lowest MTU reachable via that link without crossing a router. It's just that simple. Anything else _IS_ a misconfiguration.

First, your idea of handling the situation reasonably is a layering violation.

Second, you are correct. All L2 bridges for a given media type should support the largest configurable MTU for that media type, so, it is arguably a design flaw in the bridges. However, in an environment where you have broken L2 devices (design flaw), you have to configure appropriately for that.
It's only a misconfiguration because of flaws in the protocol.
No, it's a misconfiguration because of the limitations of the hardware due to its design defects. L3 should not need to test the end-to-end L2 capabilities. It should be able to depend on what the OS tells it.
Just like you expect to plug devices in a typical LAN and it's not a configuration error to fail to manually find every switch in the LAN and enter MAC addresses into a forwarding table by hand; likewise, you shouldn't expect to key a MTU into every device by hand.
You don't expect to ever care about the MAC addresses of any of the switches in the LAN let alone enter them into any form of forwarding table at all. You do expect to need to know about the MAC addresses of adjacent systems you are trying to reach, and, you use either ND or ARP to map L3 addresses onto their corresponding L2 addresses as needed.

I will note that this depends on sending a packet out to an address that reaches all of the candidate hosts (In the case of ND, this is a multicast to all hosts which have the same last 24 bits in their IP suffix. In the case of ARP, this is a broadcast packet) and expects them (at L3) to answer "That's ME!". Of course you can enter them by hand in situations where ARP or ND don't work for whatever reason.

You expect ARP or ND to work and a bridge that didn't forward ARP would be just as broken as a bridge which doesn't support the full interface MTU. I would expect to have to enter MAC adjacencies manually if I had a bridge that didn't pass ARP/ND traffic, just as I expect to have to enter the MTU manually if I have a bridge that doesn't support the correct full MTU of the network.
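For reference, the "same last 24 bits" multicast mentioned above is the IPv6 solicited-node group used by ND. A small Python sketch computing it (using the standard ipaddress module; the example address is made up):

import ipaddress

def solicited_node(addr):
    """ff02::1:ff00:0/104 plus the low 24 bits of the unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(int(ipaddress.IPv6Address("ff02::1:ff00:0")) | low24)

print(solicited_node("2001:db8::1234:5678"))   # ff02::1:ff34:5678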
IP should be designed so that devices on the link that _can_ handle the large transmission unit, which provides efficiency gains, should be allowed to fully utilize those capabilities, without breakage of connectivity to devices on the same link that have more limited capabilities and can only receive the Minimum required frame size (smaller MTU), and without separating the subnet or installing dividing Proxy ARP servers to send ICMP TooBig packets.
No, it really shouldn't. Doing this is a serious layering violation for one, and it can't be achieved efficiently, number two. It adds lots of overhead and is very error prone. There's no signaling mechanism for L3 to be informed when the L2 topology changes, for example, which might necessitate a recalculation of the MTU.

A given link should have a single MTU, period. I don't know of ANY L3 protocol which supports anything else. Not IP, not IPX, not DECNET, not AppleTalk, not Banyan Vines, not XNS; none of them support the idea of MTU per adjacency. If you can only have one MTU per link, then it must be the lowest common denominator of all participants and forwarders on that link.
Adding probing to compensate for this misconfiguration merely serves to perpetuate such errant configurations.
Just like adding MAC address learning to Ethernet switches to compensate for the misconfiguration of failing to manually enter hardware addresses into your switches, serves to perpetuate such errant configurations, where the state of the forwarding tables is unreliably left in a non-deterministic state.
Apples and oranges. See above. In fact, MAC address learning on the switches is utterly unrelated to the MAC adjacency table maintained by ARP/ND. One is an L2 forwarding tree never learned by anything at L3 (the MAC forwarding table learned on the switches) and the other is a MAC adjacency table for a given link used by the L2 software on the host to populate the L2 packet header based on the L3 information.
You've got an issue if there are 100ms between two peers on your LAN. You're right, you don't need to probe for possible MTUs below 1280. LAN, sure. However, consider that there are intercontinental L2 links.
Intercontinental multi-access L2 links, perhaps, are a horrible misconfiguration.
No, they are not. They may be a horribly bad idea in many cases, but, there are actually legitimate applications for them and they conform to the existing documented standards. Owen
On Tue, 05 Jun 2012 21:44:59 -0700, Owen DeLong said:
Second, you are correct. All L2 bridges for a given media type should support the largest configurable MTU for that media type, so, it is arguably a design flaw in the bridges. However, in an environment where you have broken L2 devices (design flaw), you have to configure appropriately for that.
Don't waste your time configuring for that case. Find a shotgun, some ammo, and a bring-your-own-target range... You'll feel better, live longer, and the world will be a better place for it.
Owen DeLong wrote:
Really, no. The L3 MTU on an interface should be configured to the lowest MTU reachable via that link without crossing a router. It's just that simple. Anything else _IS_ a misconfiguration.
Perhaps this should be thought of as a limitation, rather than a feature.

Joe
A few more words on MTU. What we are after is accommodation of MTU diversity - not any one specific size. The practical limit is (2^32 - 1) octets for IPv6, but we expect smaller sizes for the near term. Operators know how to configure MTUs appropriate for their links. 1280 is too small, and turns the IPv6 Internet into ATM.

In order to support MTU diversity, PMTUD must be made to work. This means working to eliminate all network blockage of ICMPv6 PTBs, while at the same time provisioning hosts and tunnels with mechanisms that work even if no PTBs are delivered. For hosts, that requires RFC4821. For tunnels, that requires fragmentation.
From an earlier message:
9000B may still be acceptable.
True, but what we need is not any one fixed Internet "cell size" but rather full support for MTU diversity. Fred fred.l.templin@boeing.com
On Jun 4, 2012, at 2:26 PM, Joe Maimon wrote:
Jeroen Massar wrote:
Tunnels therefore should only exist at the edge, where native IPv6 cannot be made possible without significant investments in hardware and/or other resources. Of course every tunnel should at one point in time be replaced by native where possible, thus hopefully the folks planning expenses and hardware upgrades have finally realized that they cannot get around it any more and have put this "ipv6" feature on the list for the next round of upgrades.
IPv4 is pretty mature. Are there more or less tunnels on it?
There are dramatically fewer IPv4 tunnels than IPv6 tunnels to the best of my knowledge.
Why do you think a maturing IPv6 means less tunnels as opposed to more?
Because a maturing IPv6 eliminates much of the present-day need for IPv6 tunnels, which is to span IPv4-only areas of the network when connecting IPv6 end points.
Does IPv6 contain elegant solutions to all the issues one would resort to tunnels with IPv4?
Many of the issues I would resort to tunnels for involve working around NAT, so, yes, IPv6 provides a much more elegant solution -- End-to-end addressing. However, for some of the other cases, no, tunnels will remain valuable in IPv6. However, as IPv6 end-to-end native connectivity becomes more prevalent, much of the current need for IPv6 over IPv4 tunnels will become deprecated.
Does successful IPv6 deployment require obsoleting tunneling?
No, it does not, but, it will naturally obsolete many of the tunnels which exist today.
Fail.
What, exactly are you saying is a failure? The single word here even in context is very ambiguous.
Today, most people can't even get IPv6 without tunnels.
Anyone can get IPv6 without a tunnel if they are willing to bring a circuit to the right place. As IPv6 gets more ubiquitously deployed, the number of right places will grow and the cost of getting a circuit to one of them will thus decrease.
And tunnels are far from the only cause of MTU lower than what has become the only valid MTU of 1500, thanks in no small part to people who refuse to acknowledge operational reality and are quite satisfied with the state of things once they find a "them" to blame it on.
Meh... Sour grapes really don't add anything useful to the discussion.

Breaking PMTU-D is bad. People should stop doing so.

Blocking PTB messages is bad in IPv4 and worse in IPv6. This has been well known for many years.

If you're breaking PMTU-D, then stop that. If not, then you're not part of them.

If you have a useful alternative solution to propose, put it forth and let's discuss the merits.
I just want to know if we can expect IPv6 to devolve into 1280 standard mtu and at what gigabit rates.
I hope not. I hope that IPv6 will cause people to actually re-evaluate their behavior WRT PMTU-D and correct the actual problem. Working PMTU-D allows not only 1500, but also 1280, and 9000 and >9000 octet datagrams to be possible, and segments that support <1500 work almost as well as segments that support jumbo frames. Where jumbo frames offer an end-to-end advantage, that advantage can be realized. Where there is a segment with a 1280 MTU, that can also work with a relatively small performance penalty.

Where PMTU-D is broken, nothing works unless the MTU end-to-end happens to coincide with the smallest MTU.

For links that carry tunnels and clear traffic, life gets interesting if one of them is the one with the smallest MTU regardless of the MTU value chosen.

Owen
Joe
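To put a number on the "relatively small performance penalty" mentioned above, a quick Python sketch comparing payload efficiency at the MTUs discussed (assuming 40-byte IPv6 + 20-byte TCP headers and no options):

def payload_efficiency(mtu, headers=40 + 20):
    """Fraction of each packet that is payload for IPv6 + TCP with no options."""
    return (mtu - headers) / mtu

for mtu in (1280, 1500, 9000):
    print(mtu, f"{payload_efficiency(mtu):.1%}")
# 1280 -> 95.3%, 1500 -> 96.0%, 9000 -> 99.3%: the per-byte penalty at 1280 is
# small, though per-packet overhead (packet count) still drops a lot at 9000.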
Owen DeLong wrote:
Fail.
What, exactly are you saying is a failure? The single word here even in context is very ambiguous.
The failure is that even now, when tunnels are critical to transition, a proper solution that improves on the IPv4 problems does not exist.

And if tunnels do become less prevalent there will be even less impetus than now to make things work better.
Today, most people can't even get IPv6 without tunnels.
Anyone can get IPv6 without a tunnel if they are willing to bring a circuit to the right place.
Today most people can't even get IPv6 without tunnels, or without paying excessively more for their internet connection, or without having their pool of vendors shrink dramatically, sometimes to the point of none.
Breaking PMTU-D is bad. People should stop doing so.
Blocking PTB messages is bad in IPv4 and worse in IPv6.
It has always been bad and people have not stopped doing it. And intentional blocking is not the sole cause of pmtud breaking.
If you have a useful alternative solution to propose, put it forth and let's discuss the merits.
PMTU-D probing, as recently standardized, seems a more likely solution. Having CPE capable of TCP MSS adjustment on v6 is another one. Being able to fragment when you want to is another good one as well.
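For reference, the MSS such a CPE would clamp to is just the path or tunnel MTU minus the fixed IPv6 and TCP header sizes; a small sketch (assuming a 40-byte IPv6 header, a 20-byte TCP header, and no options; the 1480 figure assumes a typical 6in4 tunnel, i.e. 1500 minus a 20-byte IPv4 header):

def clamped_mss(path_mtu, ip_header=40, tcp_header=20):
    """TCP MSS a CPE would advertise or rewrite for an IPv6 path with this MTU."""
    return path_mtu - ip_header - tcp_header

print(clamped_mss(1500))   # 1440 on a clean 1500-byte path
print(clamped_mss(1480))   # 1420 behind a typical 6in4 tunnel
print(clamped_mss(1280))   # 1220 at the IPv6 minimum MTU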
I hope not. I hope that IPv6 will cause people to actually re-evaluate their behavior WRT PMTU-D and correct the actual problem. Working PMTU-D allows not only 1500, but also 1280, and 9000 and>9000 octet datagrams to be possible and segments that support<1500 work almost as well as segments that support jumbo frames. Where jumbo frames offer an end-to-end advantage, that advantage can be realized. Where there is a segment with a 1280 MTU, that can also work with a relatively small performance penalty.
Where PMTU-D is broken, nothing works unless the MTU end-to-end happens to coincide with the smallest MTU.
For links that carry tunnels and clear traffic, life gets interesting if one of them is the one with the smallest MTU regardless of the MTU value chosen.
Owen
I don't share your optimism that it will go any better this time around than last. If it goes at all.

Joe
PMTU-D probing, as recently standardized, seems a more likely solution. Having CPE capable of TCP MSS adjustment on v6 is another one. Being able to fragment when you want to is another good one as well.
I'll take a) and c), but don't care so much for b). About fragmenting, any tunnel ingress (VPNs included) can do inner fragmentation today independently of all other ingresses and with no changes necessary on the egress. It's just that they need to take precautions to avoid messing up the final destination's reassembly buffers. Fred fred.l.templin@boeing.com
On Jun 4, 2012, at 3:34 PM, Joe Maimon wrote:
Owen DeLong wrote:
Fail.
What, exactly are you saying is a failure? The single word here even in context is very ambiguous.
The failure is that even now, when tunnels are critical to transition, a proper solution that improves on the IPv4 problems does not exist
A proper solution does exist... Stop blocking PTB messages. That's the proper solution. It was the proper solution in IPv4 and it is the proper solution in IPv6.
And if tunnels do become less prevalent there will be even less impetus than now to make things work better.
True, perhaps, but, I don't buy that tunnels are the only sub-1500 octet MTU out there, so, I think your premise here is somewhat flawed.
Today, most people can't even get IPv6 without tunnels.
Anyone can get IPv6 without a tunnel if they are willing to bring a circuit to the right place.
Today most people can't even get IPv6 without tunnels, or without paying excessively more for their internet connection, or without having their pool of vendors shrink dramatically, sometimes to the point of none.
It never shrinks to none, but, yes, the cost can go up dramatically. You can, generally, get a circuit to somewhere that HE has presence from almost anywhere in the world if you are willing to pay for it. Any excessive costs would be what the circuit vendor charges. HE sells transit pretty cheap and everywhere we sell, it's dual-stack native.

Sure, we wish we could magically have POPs everywhere and serve every customer with a short local loop. Unfortunately, that's not economically viable at this time, so, we build out where we can when there is sufficient demand to cover our costs. Pretty much like any other provider, I would imagine.

Difference is, we've been building everything native dual stack for years. IPv6 is what we do. We're also pretty good at IPv4, so we deliver legacy connectivity to those that want it as well.
Breaking PMTU-D is bad. People should stop doing so.
Blocking PTB messages is bad in IPv4 and worse in IPv6.
It has always been bad and people have not stopped doing it. And intentional blocking is not the sole cause of pmtud breaking.
I guess that depends on how you define the term intentional. I don't care whether it was the administrators intent, or a default intentionally placed there by the firewall vendor or what, it was someone's intent, therefore, yes, it is intentional. If you can cite an actual case of accidental dropping of PTB messages that was not the result of SOMEONE's intent, then, OK. However, at least on IPv6, I believe that intentional blocking (regardless of whose intent) is, in fact, the only source of PMTUD breakage at this point. In IPv4, there is some breakage in older software that didn't do PMTUD right even if it received the correct packets, but, that's not relevant to IPv6.
If you have a useful alternative solution to propose, put it forth and let's discuss the merits.
PMTU-D probing, as recently standardized, seems a more likely solution. Having CPE capable of TCP MSS adjustment on v6 is another one. Being able to fragment when you want to is another good one as well.
Fragments are horrible from a security perspective and worse from a network processing perspective. Having a way to signal path MTU is much better. Probing is fine, but, it's not a complete solution and doesn't completely compensate for the lack of PTB message transparency.
I hope not. I hope that IPv6 will cause people to actually re-evaluate their behavior WRT PMTU-D and correct the actual problem. Working PMTU-D allows not only 1500, but also 1280, and 9000 and>9000 octet datagrams to be possible and segments that support<1500 work almost as well as segments that support jumbo frames. Where jumbo frames offer an end-to-end advantage, that advantage can be realized. Where there is a segment with a 1280 MTU, that can also work with a relatively small performance penalty.
Where PMTU-D is broken, nothing works unless the MTU end-to-end happens to coincide with the smallest MTU.
For links that carry tunnels and clear traffic, life gets interesting if one of them is the one with the smallest MTU regardless of the MTU value chosen.
Owen
I don't share your optimism that it will go any better this time around than last. If it goes at all.
It is clearly going, so the "if it goes at all" question is already answered. We're already seeing a huge ramp in IPv6 traffic leading up to ISOC's big celebration of my birthday (aka World IPv6 Launch) since early last week. I have no reason to expect that that traffic won't remain at the new higher levels after June 6. There are too many ISPs, Mobile operators, Web site operators and others committed at this point for it not to actually go. Also, since there's no viable alternative if it doesn't go, that pretty well ensures that it will go one way or another.

As to my optimism, please don't mistake my statement of hope for any form of expectation. I _KNOW_ how bad it is. I live behind tunnels for IPv4 and IPv6 and have these issues on a regular basis.

Usually I'm able to work around them. Sometimes I'm even able to get people to actually fix their firewalls.

The good news is...

+ If we can get people to stop deploying bad filters
+ And we keep fixing the existing bad filters

Eventually, bad filters disappear.

Yes, that's two big ifs, but, it's worth a try.

Owen
Joe
Hi Owen, I am 100% with you on wanting to see an end to filtering of ICMPv6 PTBs. But, tunnels can take matters into their own hands today to make sure that 1500 and smaller gets through no matter if PTBs are delivered or not. There doesn't really even need to be a spec as long as each tunnel takes the necessary precautions to avoid messing up the final destination. The next thing is to convince the hosts to implement RFC4821... Thanks - Fred fred.l.templin@boeing.com
Jeroen Massar wrote:
IPv4 without PMTUD, of course.
We are (afaik) discussing IPv6 in this thread,
That's your problem of insisting on very narrow solution space, which is why you can find no solution and are trying to ignore the problem.
It is a sender of a multicast packet, not you as some ISP, who set max packet size to 1280B or 1500B.
If a customer already miraculously has the rare capability of sending multicast packets in the rare case that a network is multicast enabled
That is the case IPv6 WG insisted on.
then they will also have been told to use a max packet size of 1280 to avoid any issues when it is expected that some endpoint might have that max MTU.
Those who insisted on the case won't tell so nor do so.
I really cannot see the problem with this
because you insist on IPv6.
You can do nothing against a sender who consciously (not necessarily maliciously) set it to 1500B.
Of course you can; the first hop into your network can generate a single PtB.
I can, but I can't expect others will do so. I, instead, know those who insisted on the case won't.
No need, as above, reject and send PtB and all is fine.
As I wrote:
That you don't enable multicast in your network does not mean you have nothing to do with packet too big against multicast, because you may be on a path of returning ICMPs. That is, you should still block them.
you are wrong.
If you don't want to inspect packets so deeply (beyond the first 64B, for example), packet too big messages against unicast packets are also blocked.
Routing (forwarding packets) is in no way "expection".
What?
Blocking returning ICMPv6 PtB where you are looking at the original packet which is echod inside the data of the ICMPv6 packet would indeed require one to look quite deep, but if one is so determined to firewall them well, then you would have to indeed.
As I already filter packets required by RFC2463, why, do you think, do I have to bother only to reduce performance?
I do not see a reason to do so though. Please note that the src/dst of the packet itself is unicast even if the PtB will be for a multicast packet.
How can you ignore the implosion of unicast ICMP?
They did not ignore you, they realized that not everybody has the same requirements. With the current spec you can go your way and break pMTU, requiring manual 1280 settings, while other networks can use pMTU in their networks. Everybody wins.
What? Their networks? The Internet is interconnected.
So, you should assume some, if not all, of them still insist on using multicast PMTUD to make multicast packet size larger than 1280B.
As networks become more and more jumbo frame enabled, what exactly is the problem with this?
That makes things worse. It will promote people trying to multicast with jumbo frames.
Because PMTUD is not expected to work,
You assume it does not work, but as long as per the spec people do not filter it, it works.
Such operation leaves the network vulnerable and should be corrected.
you must assume MTU of outer path is 1280B, as is specified "simply restrict itself to sending packets no larger than 1280 octets" in RFC2460.
While for multicast enabled networks that might hit the minimum MTU this might be true-ish, it does not make it universally true.
The Internet is interconnected.
you need to use a tunneling protocol that knows how to frag and reassemble as is acting as a medium with an mtu less than the minimum of 1280
That's my point in my second last slide.
Then you word it wrongly. It is not the problem of IPv6
You should read RFC2473, an example in my slide.
Please fix your network instead, kthx.
It is a problem of RFC2463 and networks of people who insist on the current RFC2463 for multicast PMTUD. If you want the problem to disappear, change RFC2463.

Masataka Ohta
participants (14)

- bmanning@vacation.karoshi.com
- Brett Frankenberger
- Cameron Byrne
- Jared Mauch
- Jeroen Massar
- Jimmy Hess
- Joe Maimon
- Joel Maslak
- Mark Andrews
- Masataka Ohta
- Matthew Huff
- Owen DeLong
- Templin, Fred L
- valdis.kletnieks@vt.edu