PMTUD for IPv4 Multicast - How?
I recently discovered that my routers weren't generating ICMP Type 3 Code 4 (unreachable, DF-bit) messages in response to too-big IPv4 multicast packets with DF=1. At first, I thought this was a bug, but then learned that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.

RFC1981 (PMTUD for IPv6), on the other hand, is explicit that PMTUD works for multicast flows, that the path MTU for a multicast flow is the smallest MTU available anywhere in the distribution tree, and that a single multicast packet may provoke many ICMP unreachables from routers along the tree.

Further complicating matters, the default Linux behavior (ip_no_pmtu_disc = 0) sets the DF bit on all packets unless the application is explicit (setsockopt()) that DF be cleared. This behavior strikes me as a troublesome assumption (that the application will interpret unreachables) in the case of unicast UDP sockets, and downright broken (because traffic will be dropped silently) in the case of multicast UDP sockets.

I'm struggling to grok the rationale behind not sending unreachables in response to multicast packets. It seems to me that our networks put IPv4 multicast speakers in a position where it's impossible for them to do the right thing.

Does anybody understand why PMTUD for IPv4 multicast flows is disabled in routers? Is there a secret lever to enable it in Cisco IOS? What should a responsible IPv4 multicast application do when receivers are flung far and wide with un-knowable MTUs in the transit path?

Thanks,
/chris
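For readers who want to see the setsockopt() call mentioned above, here is a minimal sketch, assuming a Linux host. The numeric fallbacks (10 and 0) are the Linux values for IP_MTU_DISCOVER and IP_PMTUDISC_DONT and are only used if the Python build does not expose the names. The function name open_multicast_sender and the TTL default are hypothetical.

```python
import socket

# Linux-specific option numbers from <linux/in.h>. Some Python builds do not
# expose these names, so fall back to the Linux numeric values (assumption:
# this sketch is only meaningful on Linux).
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DONT = getattr(socket, "IP_PMTUDISC_DONT", 0)


def open_multicast_sender(ttl: int = 16) -> socket.socket:
    """Return a UDP socket whose outgoing datagrams are sent with DF cleared."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Turn off path MTU discovery for this socket only. The kernel then stops
    # setting DF, so an oversized datagram can be fragmented in transit
    # instead of being dropped without any error reaching the sender.
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DONT)
    # Multicast datagrams default to TTL 1; raise it so they can leave the LAN.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock
```

A sender would then call something like sock.sendto(payload, (group, port)) with its own group address and port. The per-socket option leaves the system-wide ip_no_pmtu_disc setting untouched.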
On Mon, 31 Aug 2015 12:12:16 -0400, Chris Marget said:
At first, I thought this was a bug, but then learned that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.
I'm struggling to grok the rationale behind not sending unreachables in response to multicast packets. It seems to me that our networks put IPv4 multicast speakers in a position where it's impossible for them to do the right thing.
For the exact same reason that replying to an ICMP Echo Request sent to your broadcast address is generally considered a Bad Idea. The obvious solution is "Doctor, it hurts when I do that" "Don't do that anymore". Don't send multicast packets with DF set.
On Mon, Aug 31, 2015 at 12:37 PM, <Valdis.Kletnieks@vt.edu> wrote:
On Mon, 31 Aug 2015 12:12:16 -0400, Chris Marget said:
At first, I thought this was a bug, but then learned that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.
I'm struggling to grok the rationale behind not sending unreachables in response to multicast packets. It seems to me that our networks put IPv4 multicast speakers in a position where it's impossible for them to do the right thing.
For the exact same reason that replying to an ICMP Echo Request sent to your broadcast address is generally considered a Bad Idea.
The obvious solution is "Doctor, it hurts when I do that" "Don't do that anymore".
It's not as obvious to me as it is to you. I mean, v6 *requires* exactly this behavior, so it can't be all that bad, can it?
Don't send multicast packets with DF set.
Are you asserting that the default behavior of the Linux kernel (setting DF on multicast packets) is wrong then? I'll probably come around, but I've not yet concluded that "screw it, fragment my traffic, I don't care" is the stance that a conscientious application should be taking. /chris
At first, I thought this was a bug, but then learned that RFCs 1112, 1122 and 1812 all specify that ICMP unreachables not be sent in response to multicast packets.
I'm struggling to grok the rationale behind not sending unreachables in response to multicast packets. It seems to me that our networks put IPv4 multicast speakers in a position where it's impossible for them to do the right thing.
For the exact same reason that replying to an ICMP Echo Request sent to your broadcast address is generally considered a Bad Idea.
The obvious solution is "Doctor, it hurts when I do that" "Don't do that anymore".
It's not as obvious to me as it is to you. I mean, v6 *requires* exactly this behavior, so it can't be all that bad, can it?
ICMP replies to multicast packets can cause ICMP "implosion". This is not a new discussion - see for instance
http://mailman.nanog.org/pipermail/nanog/2012-June/048685.html

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
On Mon, Aug 31, 2015 at 3:49 PM, <sthaug@nethelp.no> wrote:
ICMP replies to multicast packets can cause ICMP "implosion". This is not a new discussion - see for instance
http://mailman.nanog.org/pipermail/nanog/2012-June/048685.html
It's a shame we handle path MTU as a layer 3 problem that gets an ICMP response from a middlebox. It'd make more sense to truncate the packet, set a flag, and then let layer 4 at the recipient deal with negotiating a new size with the sender. You know, end to end principle and all.

That'd eliminate the problems with firewall-blocked protocols and routers using private IP addresses, the usual culprits for pmtud breakage. It'd also let multicast protocols make reasonable choices for that particular protocol without being stuck with the stack's default.

-Bill

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Owner, Dirtside Systems ......... Web: <http://www.dirtside.com/>
William Herrin wrote:
It'd make more sense to truncate the packet, set a flag, and then let layer 4 at the recipient deal with negotiating a new size with the sender.
For routers, truncating the packet and setting a flag is as burdensome as fragmentation or ICMP generation. Moreover, with plain IPv4 packets that permit fragmentation (DF clear), layer 4 can deal with the problem in the same way.
You know, end to end principle and all.
PMTUD requires "knowledge and help" (quote from the end to end argument) of all the intermediate routers. That is, you apply the end to end argument completely wrongly.
That'd eliminate the problems with firewall-blocked protocols and routers using private IP addresses, the usual culprits for pmtud breakage.
With your approach, you will find firewalls dropping truncated packets. Masataka Ohta
It's not as obvious to me as it is to you. I mean, v6 *requires* exactly this behavior, so it can't be all that bad, can it?
ICMP replies to multicast packets can cause ICMP "implosion". This is not a new discussion - see for instance
http://mailman.nanog.org/pipermail/nanog/2012-June/048685.html
Thanks very much for the pointer to that discussion. "ICMP implosion" has been a helpful search term. The position taken there appears to boil down to:

- The IPv6 requirement to generate "too big" messages *really is a problem*
- RFC2463 should not have made the exception which allows sending these messages
- Multicast PMTUD should not be a thing
- Multicast speakers should send un-fragmentable minimum-sized packets

I remain fuzzy on exactly the nature of the implosion problem. Is the concern that I might DDoS myself by sending un-fragmentable traffic? It's hard for me to recognize this as a problem, but I'm working on it.

It seems to me that as a multicast speaker, the influx of ICMP errors is both desirable (I set DF because I intend to react) and under my control. It certainly beats sending minimum-sized packets, which appears to be the recommendation in the linked discussion.

If somebody would be so kind as to detail the disastrous nature of the implosion, that would be helpful.
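To make the "implosion" question concrete, here is a toy model of a distribution tree. It is only a sketch under stated assumptions: the Hop class, the tree shape and the MTU values are all invented for illustration, and each branch is assumed to report exactly one error at the first link that is too small for the packet.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hop:
    """One router in a hypothetical multicast distribution tree."""
    name: str
    link_mtu: int                          # MTU of the link leading to this hop
    children: list[Hop] = field(default_factory=list)


def tree_path_mtu(hop: Hop) -> int:
    """Smallest link MTU anywhere in the subtree rooted at hop."""
    return min([hop.link_mtu] + [tree_path_mtu(c) for c in hop.children])


def ptb_count(hop: Hop, packet_size: int) -> int:
    """Errors triggered by ONE DF packet if every branch reports back."""
    if packet_size > hop.link_mtu:
        return 1      # the packet is dropped before this link; one error is sent
    return sum(ptb_count(c, packet_size) for c in hop.children)


# Made-up tree: three branches below the source, two of them constrained.
root = Hop("src", 1500, [
    Hop("a", 1500, [Hop("a1", 1400), Hop("a2", 1400)]),
    Hop("b", 1280),
    Hop("c", 1500),
])
```

In this model a single oversized packet draws one error per constrained branch, so the count scales with the width of the tree rather than with anything the sender controls. With a spoofed source address, those errors land on a third party.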
Chris Marget wrote:
For the exact same reason that replying to an ICMP Echo Request sent to your broadcast address is generally considered a Bad Idea.
The obvious solution is "Doctor, it hurts when I do that" "Don't do that anymore".
And, it implies that some ISPs will filter all the ICMPv6 PTB messages, including those generated against unicast packets, which means PMTUDv6 won't work. Filtering ICMPv6 PTB generated against multicast packets but not against unicast ones is not very easy.
It's not as obvious to me as it is to you. I mean, v6 *requires* exactly this behavior, so it can't be all that bad, can it?
Yes, of course. See https://en.wikipedia.org/wiki/Design_by_committee which is why we should avoid IPv6 entirely, especially because NAT, with its 48bit effective address space, is fair enough and, for theoretical purity, NAT can be modified to have full end to end transparency (https://tools.ietf.org/html/draft-ohta-e2e-nat-00), or, in practice, UPnP capable NATs already have that transparency.
I'll probably come around, but I've not yet concluded that "screw it, fragment my traffic, I don't care" is the stance that a conscientious application should be taking.
Don't you care, for routers, generating ICMP PTB is as burdensome as generating fragments?

Masataka Ohta

PS Pages 87-101 of ftp://chacha.hpcl.titech.ac.jp/2014/infra5.ppt is my presentation at APNIC32 on the problem.
I'll probably come around, but I've not yet concluded that "screw it, fragment my traffic, I don't care" is the stance that a conscientious application should be taking.
Don't you care, for routers, generating ICMP PTB is as burdensome as generating fragments?
I don't think so. If PMTUD is working (big IF, I know), the ICMP PTB generation is a one-time thing (or once per 10 minutes or whatever) and can be rate limited with little impact. Fragmenting transit traffic, on the other hand, needs to be done for every transit packet.
Chris Marget wrote:
I'll probably come around, but I've not yet concluded that "screw it, fragment my traffic, I don't care" is the stance that a conscientious application should be taking.
Don't you care, for routers, generating ICMP PTB is as burdensome as generating fragments?
I don't think so. If PMTUD is working (big IF, I know),
Yup.
the ICMP PTB generation is a one-time thing (or once per 10 minutes or whatever)
A meaningful interval of retry is not 10 minutes but RTT measured at layer 4 or above.
Is the concern that I might DDoS myself
Or, with spoofed source addresses, someone else. Masataka Ohta
On Mon, Aug 31, 2015 at 8:55 PM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Chris Marget wrote:
I'll probably come around, but I've not yet concluded that "screw it, fragment my traffic, I don't care" is the stance that a conscientious application should be taking.
Don't you care, for routers, generating ICMP PTB is as burdensome as generating fragments?
I don't think so. If PMTUD is working (big IF, I know),
Yup.
the ICMP PTB generation is a one-time thing (or once per 10 minutes or whatever)
A meaningful interval of retry is not 10 minutes but RTT measured at layer 4 or above.
I took the 10 minute value from RFC1191's recommendation about when it's appropriate to try larger MTU sizes. One ICMP message should hold the sender off for 10 minutes (or whatever), just like it does with unicast traffic. Would you explain why a router might need to generate ICMP PTB at a rate corresponding to intervals of RTT? I don't see why the error rate would be correlated with path length. Anyway, RTT isn't something that necessarily exists with multicast applications.
Is the concern that I might DDoS myself
Or, with spoofed source addresses, someone else.
The latter concern isn't unique to this case and applies to many (most? all?) types of reflection attacks. Indeed, many other protocols have disabled potentially useful features in order to thwart reflection attacks which rely on spoofed source addresses. At least in the case of those protocols, we have the choice to enable monlist, ip_respond_to_echo_broadcast or whatever as appropriate for the environment.

I'm still not sure where this leaves the application which wants to do the right thing.

- Send 1500 byte frames and expect fragmentation?
- Guess at the least of all likely path MTUs?
- Send 576 byte frames?
- Build a feedback mechanism into the application?
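As an illustration of the "send 576 byte frames" option from the list above, the following sketch splits an application message into datagrams that fit a chosen MTU. It assumes a 20-byte IPv4 header with no options and an 8-byte UDP header. The 8-byte sequence header (message id, index, total) is a hypothetical framing invented here so a receiver could reassemble; it is not part of any protocol named in this thread.

```python
import struct

IPV4_HEADER = 20                       # bytes, assuming no IP options
UDP_HEADER = 8                         # bytes
SEQ_HEADER = struct.Struct("!IHH")     # hypothetical: message id, index, total


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented packet of size mtu."""
    return mtu - IPV4_HEADER - UDP_HEADER


def chunk_message(msg_id: int, data: bytes, mtu: int = 576) -> list[bytes]:
    """Split data into datagram payloads that each fit within mtu."""
    room = max_udp_payload(mtu) - SEQ_HEADER.size
    if room <= 0:
        raise ValueError("MTU too small for the headers")
    pieces = [data[i:i + room] for i in range(0, len(data), room)] or [b""]
    if len(pieces) > 0xFFFF:
        raise ValueError("message needs more chunks than the header can count")
    return [
        SEQ_HEADER.pack(msg_id, index, len(pieces)) + piece
        for index, piece in enumerate(pieces)
    ]
```

The trade-off is visible in the numbers: at an MTU of 576 each datagram carries at most 548 bytes of UDP payload, against 1472 at an MTU of 1500, so the conservative choice costs more packets and more header overhead per message.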
On Mon, Aug 31, 2015 at 5:17 PM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
for routers, generating ICMP PTB is as burdensome as generating fragments?
No, it isn't. When a router fragments a packet, it has to fragment the next and the next and the next. Maybe tens or hundreds of thousands of packets before the end of that one user's session.

When a router generates a PTB, there is no next. PTB is a soft failure. The origin must correct the error (by reducing packet size) before communication can succeed.

There are potentially several orders of magnitude of difference in the burden on the router.

Regards,
Bill Herrin

--
William Herrin ................ herrin@dirtside.com bill@herrin.us
Owner, Dirtside Systems ......... Web: <http://www.dirtside.com/>
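The arithmetic behind the per-packet argument above can be sketched as follows. This is an illustration, not a measurement: it assumes 20-byte IPv4 headers with no options, and the function names and traffic figures are made up. The only protocol rule it relies on is that every fragment except the last carries a payload that is a multiple of 8 bytes.

```python
def fragment_sizes(packet_len: int, mtu: int, header_len: int = 20) -> list[int]:
    """Total length of each IPv4 fragment emitted for one packet."""
    if packet_len <= mtu:
        return [packet_len]                    # fits, nothing to do
    # Fragment offsets are counted in 8-byte units, so round the room down.
    per_fragment = (mtu - header_len) // 8 * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any payload")
    payload = packet_len - header_len
    sizes = []
    while payload > 0:
        take = min(per_fragment, payload)
        sizes.append(header_len + take)        # each fragment repeats the header
        payload -= take
    return sizes


def fragments_emitted(n_packets: int, packet_len: int, mtu: int) -> int:
    """Packets the router must build when it fragments every one of n_packets."""
    return n_packets * len(fragment_sizes(packet_len, mtu))
```

For a flow of 1500-byte packets crossing a 1400-byte link, every packet becomes two fragments for as long as the flow lasts, whereas the PTB approach costs the router one error message per notification interval, provided the sender actually reduces its packet size.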
William Herrin wrote:
for routers, generating ICMP PTB is as burdensome as generating fragments?
No, it isn't.
Yes, it is. Generating an ICMP PTB is as burdensome as fragmenting a packet.
When a router fragments a packet, it has to fragment the next and the next and the next. Maybe tens or hundreds of thousands of packets before the end of that one user's session.
Not necessarily, because transport layer can react against fragmented packets.
When a router generates a PTB, there is no next. PTB is a soft failure. The origin must correct the error (by reducing packet size)
What if, the origin does not reduce packet size? Masataka Ohta
In message <55E4F62B.6060300@necom830.hpcl.titech.ac.jp>, Masataka Ohta writes:
William Herrin wrote:
for routers, generating ICMP PTB is as burdensome as generating fragments?
No, it isn't.
Yes, it is. Generating an ICMP PTB is as burdensome as fragmenting a packet.
Well it could be done at wire speed. It just requires more complicated hardware. Routers usually punt it to the cpu but there is no real reason that they have to do that. There is no theoretical reason why it has to be more burdensome than forwarding a packet. It's an implementation choice.
When a router fragments a packet, it has to fragment the next and the next and the next. Maybe tens or hundreds of thousands of packets before the end of that one user's session.
Not necessarily, because transport layer can react against fragmented packets.
When a router generates a PTB, there is no next. PTB is a soft failure. The origin must correct the error (by reducing packet size)
What if, the origin does not reduce packet size?
The communication fails. Additionally routers normally rate limit PTB generation thereby reducing cpu loads to an acceptable level which is the whole point of moving the fragmentation to the originating node.
Masataka Ohta
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
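The rate limiting mentioned in the message above is commonly implemented as a token bucket. The sketch below is a generic version, not any router vendor's implementation; the class name, the rate and the burst size are assumptions chosen for illustration. Time is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Generic token bucket: at most `burst` events at once, `rate` per second."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)     # start full
        self.last = 0.0                # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if an error message may be generated at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                   # over the limit: drop silently


# Hypothetical limit: 10 messages per second with a burst of 5.
ptb_limiter = TokenBucket(rate_per_sec=10, burst=5)
```

With a limiter like this, the control-plane cost is bounded by the configured rate no matter how many oversized packets arrive. The same mechanism could in principle be placed in front of any CPU-punted operation.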
Mark Andrews wrote:
Yes, it is. Generating an ICMP PTB is as burdensome as fragmenting a packet.
Well it could be done at wire speed.
Both of them could be.
There is no theoretical reason why it has to be more burdensome than forwarding a packet.
That's not my point.
The communication fails.
It depends on layer 4 and above.
Additionally routers normally rate limit PTB generation thereby reducing cpu loads to an acceptable level which is the whole point of moving the fragmentation to the originating node.
Routers can rate limit fragment generation, too. Masataka Ohta
participants (6)
- Chris Marget
- Mark Andrews
- Masataka Ohta
- sthaug@nethelp.no
- Valdis.Kletnieks@vt.edu
- William Herrin