The problem is described pretty clearly at http://www.cisco.com/warp/public/105/56.html. The issue I have experienced is that fragmentation can lead to performance impacts that are unacceptable.
I wish we could start a clue campaign informing people why ICMP should not be summarily dumped at the firewall.
Chris Proctor EPIK Communications
-----Original Message-----
From: Valdis.Kletnieks@vt.edu [mailto:Valdis.Kletnieks@vt.edu]
Sent: Wednesday, December 03, 2003 11:39 AM
To: jgraun@comcast.net
Cc: nanog@merit.edu
Subject: Re: MTU path discovery and IPSec
On Wed, 03 Dec 2003 16:05:39 GMT, jgraun@comcast.net said:
1) I assume MTU path discovery has to be enabled on each router in the path in order for it to work correctly?!
Actually, no. All that's required is that:
a) The router handle the case of a too-large packet with the DF bit set by sending back an ICMP 'Dest Unreachable - Frag Needed' packet. I've never actually encountered a router that didn't get this part right. (Has anybody ever seen a router botch this, *other* than a config error covered in (b) below?)
b) said ICMP makes it back to the originating machine. This is where all the operational breakage I've ever seen on PMTU Discovery comes from. And in almost all cases, one of two things is at fault. Either some bonehead firewall admin is "blocking all ICMP for security" (fixable by reconfiguring the firewall to let ICMP Frag Needed error messages through), or some bonehead network provider numbered their point-to-points from 1918 space and the ICMP gets ingress/egress filtered (this one is usually not fixable except with a baseball bat).
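Point (b) can be seen from the sender's side with a minimal sketch, assuming the sender keeps one path MTU estimate per destination. The function name and constants are hypothetical; the 68-octet floor comes from RFC 1191. If a firewall eats the ICMP, this function is simply never called and the estimate stays too high.

```python
ICMP_DEST_UNREACHABLE = 3   # ICMP type
ICMP_FRAG_NEEDED = 4        # ICMP code under type 3
MIN_PMTU = 68               # RFC 1191: never reduce the estimate below 68 octets


def update_path_mtu(current_pmtu, icmp_type, icmp_code, next_hop_mtu):
    """Return the sender's new path MTU estimate after an ICMP error arrives."""
    if (icmp_type, icmp_code) != (ICMP_DEST_UNREACHABLE, ICMP_FRAG_NEEDED):
        return current_pmtu          # not a 'Frag Needed' message
    if next_hop_mtu <= 0 or next_hop_mtu >= current_pmtu:
        return current_pmtu          # no usable hint; this sketch just ignores it
    return max(next_hop_mtu, MIN_PMTU)


# Usage: a router on the path reports a 1400-octet next hop.
print(update_path_mtu(1500, 3, 4, 1400))
```

Clamping at 68 also limits how far a forged 'Frag Needed' can push the estimate down.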
Do not just blame random companies' firewalls for dumping ICMP. There are some very well known hosting groups that filter ICMP at the edge of their networks, in their routers. It gets even worse when their server admins decide to leave PMTU discovery on. Sort of defeats the purpose...
Given the nastiness of ICMP DDoS attacks of late, it might be better to hit the server and client admins with the clue bat about not using PMTU discovery (which also extends to the writers of the apps and OSes).
Frag. is in the fast path of just about every current version of brand C code, so giving the tunneling folks the OK to frag the packet might be preferred to forcing them to mess about with alternate options.
David
On 12/3/03 8:59 AM, "cproctor@epik.net" <cproctor@epik.net> wrote:
The problem is described pretty clearly at http://www.cisco.com/warp/public/105/56.html. The issue I have experienced is that fragmentation can lead to performance impacts that are unacceptable.
I wish we could start a clue campaign informing people why ICMP should not be summarily dumped at the firewall.
Chris Proctor EPIK Communications
You could drop ICMP packets at your firewall if the firewalls properly implemented stateful inspection of ICMP packets. The problem is that few firewalls include ICMP responses in their stateful analysis. So you are left with two bad choices: permit "all" ICMP packets or deny "all" ICMP packets.
Actually, any halfway decent firewall allows you to permit certain ICMP type codes while rejecting others. Not a perfect solution, but, for the most part, there aren't a lot of fragmentation-needed exploits running around. (In fact, I'm hard pressed to imagine how a Frag needed packet for an invalid session could do much of anything).
Owen
--On Wednesday, December 3, 2003 5:12 PM -0500 Sean Donelan <sean@donelan.com> wrote:
You could drop ICMP packets at your firewall if the firewalls properly implemented stateful inspection of ICMP packets. The problem is that few firewalls include ICMP responses in their stateful analysis. So you are left with two bad choices: permit "all" ICMP packets or deny "all" ICMP packets.
-- If it wasn't crypto-signed, it probably didn't come from me.
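The "permit certain ICMP type codes" approach can be sketched as a stateless allow-list, as a middle ground between permit "all" and deny "all". This is a minimal illustration, not any vendor's rule syntax; the names ALLOWED_ICMP and permit_icmp are hypothetical, and the only entry is the Frag Needed message discussed in this thread (type 3, code 4).

```python
# Allow-list of (ICMP type, ICMP code) pairs let through the firewall.
# (3, 4) is Destination Unreachable / Fragmentation Needed and DF Set.
ALLOWED_ICMP = {(3, 4)}


def permit_icmp(icmp_type, icmp_code, allowed=ALLOWED_ICMP):
    """Return True if this ICMP message should pass, False to reject it."""
    return (icmp_type, icmp_code) in allowed


# Usage: Frag Needed passes, an echo request (type 8, code 0) does not.
print(permit_icmp(3, 4), permit_icmp(8, 0))
```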
On Wed, 03 Dec 2003 15:57:37 PST, Owen DeLong <owen@delong.com> said:
around. (In fact, I'm hard pressed to imagine how a Frag needed packet for an invalid session could do much of anything).
You can use a forged 'frag needed' to stomp an existing connection of the victim's down to 64 byte MTU or similar silliness, but other than sheer "it's a packet" DDoS effects, I can't think of a malicious use for one for an invalid session either....
--On Wednesday, December 3, 2003 10:53 PM -0500 Valdis.Kletnieks@vt.edu wrote:
You can use a forged 'frag needed' to stomp an existing connection of the victim's down to 64 byte MTU or similar silliness, but other than sheer "it's a packet" DDoS effects, I can't think of a malicious use for one for an invalid session either....
Agreed. However, the former pretty much requires knowledge, a lot of packets, or a really lucky set of guesses.
Owen
-- If it wasn't crypto-signed, it probably didn't come from me.
There are expert modes where you can apply the name, source, destination, protocol, time, comments, rank, state, action, and track for more stabilized dedicated connections. I am certain there are more, depending on the vendor.
-Henry
Sean Donelan <sean@donelan.com> wrote:
You could drop ICMP packets at your firewall if the firewalls properly implemented stateful inspection of ICMP packets. The problem is that few firewalls include ICMP responses in their stateful analysis. So you are left with two bad choices: permit "all" ICMP packets or deny "all" ICMP packets.
On Wednesday, 2003-12-03 at 09:38 PST, David Sinn <dsinn@dsinn.com> wrote:
Given the nastiness of ICMP DDoS attacks of late, it might be better to hit the server and client admins with the clue bat about not using PMTU discovery (which also extends to the writers of the apps and OSes).
The idea that, because some protocol has been used for some form of attack, we should now and evermore block that protocol leads clearly to a network with all protocols blocked. No, I don't buy the argument that ICMP (at least most forms of it) should be blocked.
Frag. is in the fast path of just about every current version of brand C code, so giving the tunneling folks the OK to frag the packet might be preferred to forcing them to mess about with alternate options.
Fragmentation should be an OK eventuality for some traffic, but there are a couple of points that make it more painful than it might seem:
1. Encapsulated traffic (such as most VPNs - GRE, IPSEC, etc.) often results in packets that subsequently need to be fragmented. That typically yields lots of 1500 byte packets followed by 80 byte packets.
2. I really don't know how NAPT routers deal with fragments. These guys depend on the port information in a packet to reliably determine the target of inbound traffic. But there is no port information in anything other than fragment 1. When they receive a frag other than 1 they don't know for certain who to deliver it to. They have to either guess or drop the packet. Ugh, in both cases.
(And note that frag 1 often is not the first fragment to arrive at downstream nodes. In my example in (1), frequently frag 2 will reach places before frag 1 does (if any router along the path reorders its transmit queue based on packet size).)
Tony Rall
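Point 2 can be made concrete with a small sketch of one way a NAPT box might handle fragments: remember which inside host the first fragment (offset 0, the only one with a port) mapped to, and look later fragments up by their IP identification. This is an assumption for illustration, not a description of any real product; the class and parameter names are hypothetical.

```python
class NaptFragmentTable:
    """Toy inbound NAPT lookup that tracks fragments by IP identification."""

    def __init__(self, port_map):
        self.port_map = port_map   # external port -> inside host
        self.seen = {}             # (src, dst, proto, ip_id) -> inside host

    def route(self, src, dst, proto, ip_id, frag_offset, dst_port=None):
        """Return the inside host for a packet, or None if it cannot be known."""
        key = (src, dst, proto, ip_id)
        if frag_offset == 0:
            # Only the first fragment carries the transport header and port.
            host = self.port_map.get(dst_port)
            if host is not None:
                self.seen[key] = host
            return host
        # Later fragment: usable only if the first fragment was already seen.
        return self.seen.get(key)


napt = NaptFragmentTable({5000: "10.0.0.2"})
# Second fragment arrives first (reordered by packet size): target unknown.
print(napt.route("192.0.2.1", "198.51.100.1", 17, 42, frag_offset=182))
# First fragment arrives with the port, after which later fragments resolve.
print(napt.route("192.0.2.1", "198.51.100.1", 17, 42, 0, dst_port=5000))
print(napt.route("192.0.2.1", "198.51.100.1", 17, 42, frag_offset=182))
```

When the lookup returns None, the box is left with exactly the choices described above: guess, drop, or hold the fragment until fragment 1 shows up.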
Tony Rall wrote:
On Wednesday, 2003-12-03 at 09:38 PST, David Sinn <dsinn@dsinn.com> wrote:
<snipped>
(And note that frag 1 often is not the first fragment to arrive at downstream nodes. In my example in (1), frequently frag 2 will reach places before frag 1 does (if any router along the path reorders its transmit queue based on packet size).)
I agree with all I have snipped. I was wondering, would it not be wiser for fraggers to frag in half instead of just the overflow?
For instance, suppose a router has to fragment a 1500 byte packet to go over a 1476 byte GRE tunnel. Instead of having a big packet/little fragment, why not just divide in half? This would give them more equal buffer treatment, but an even bigger potential win is to avoid perhaps a second (maybe IPsec?) fragmenting later on down the pipe.
Once you are going to do it, do it right. It is not as if you're decreasing header overhead by producing small fragment packets. And I am assuming the whole packet is already in buffer when it comes time to fragment it.
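The arithmetic behind the 1500-over-1476 example can be sketched as follows, assuming a 20-octet IPv4 header with no options and the rule that every non-final fragment carries a multiple of 8 payload octets. The function names are hypothetical, and the 1400-octet downstream MTU is an invented stand-in for the "second fragmenting later on down the pipe".

```python
IP_HEADER = 20  # assumption: IPv4 header without options


def split_at_overflow(total_len, mtu, header=IP_HEADER):
    """Fragment sizes (header included) when cutting right at the MTU."""
    if total_len <= mtu:
        return [total_len]
    payload = total_len - header
    chunk = (mtu - header) // 8 * 8   # largest payload that is a multiple of 8
    sizes = []
    while payload > chunk:
        sizes.append(chunk + header)
        payload -= chunk
    sizes.append(payload + header)
    return sizes


def split_in_half(total_len, mtu, header=IP_HEADER):
    """Two roughly equal fragments; only valid when two fragments suffice."""
    if total_len <= mtu:
        return [total_len]
    payload = total_len - header
    first = (payload // 2 + 7) // 8 * 8   # half, rounded up to an 8-octet boundary
    if first + header > mtu:
        raise ValueError("packet needs more than two fragments")
    return [first + header, payload - first + header]


def refragment(sizes, mtu):
    """Apply a second, smaller MTU further down the path."""
    return [s for size in sizes for s in split_at_overflow(size, mtu)]


print(split_at_overflow(1500, 1476))                    # big packet, little fragment
print(split_in_half(1500, 1476))                        # two similar fragments
print(refragment(split_at_overflow(1500, 1476), 1400))  # big one is cut again
print(refragment(split_in_half(1500, 1476), 1400))      # halves pass untouched
```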
On Thu, 04 Dec 2003 16:40:45 EST, Joe Maimon <jmaimon@ttec.com> said:
I agree with all I have snipped. I was wondering would it not be wiser for fraggers to frag in half instead of just the overflow?
There's 2 cases here:
1) This is the final frag on the path - if PMTUD is in use, we want to frag right at the overflow so the connection can use the max (so if we're fragging from 1500 down to 1410, they end up with 1410 rather than 750).
2) There's an even more restrictive frag further downstream. We frag from 1500 to 1460, and somebody else frags from 1460 down to 1410. If you frag at overflow, you end up with a PMTU of 1410. If you fragged it in half, you avoid the second frag but end up with a PMTU of 750.
After several dozen packets, the difference between 750 and 1410 will start to become noticeable.....
On Thu, Dec 04, 2003 at 05:54:42PM -0500, Valdis.Kletnieks@vt.edu wrote:
There's 2 cases here:
1) This is the final frag on the path - if PMTUD is in use, we want to frag right at the overflow so the connection can use the max (so if we're fragging from 1500 down to 1410, they end up with 1410 rather than 750).
2) There's an even more restrictive frag further downstream. We frag from 1500 to 1460, and somebody else frags from 1460 down to 1410. If you frag at overflow, you end up with a PMTU of 1410. If you fragged it in half, you avoid the second frag but end up with a PMTU of 750.
After several dozen packets, the difference between 750 and 1410 will start to become noticeable.....
That's not how PMTUD works. If DF is set, you discard the packet and report back with ICMP. If DF is not set, you frag the packet - but that's not PMTUD, because no report ever goes back to the sender.
-- Barney Wolff http://www.databus.com/bwresume.pdf
I'm available by contract or FT, in the NYC metro area or via the 'Net.
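The distinction drawn here can be sketched as the per-packet decision a router makes when the next hop is too small. This is a minimal sketch; the function name and the returned tuple shape are hypothetical, and only the DF branch ever tells the sender anything.

```python
def handle_packet(total_len, df_set, next_hop_mtu):
    """Return (action, icmp) for a packet headed to a link with next_hop_mtu."""
    if total_len <= next_hop_mtu:
        return ("forward", None)
    if df_set:
        # PMTUD case: discard and report back (ICMP type 3, code 4).
        icmp = {"type": 3, "code": 4, "next_hop_mtu": next_hop_mtu}
        return ("drop", icmp)
    # DF clear: fragment silently; no report ever goes back to the sender.
    return ("fragment", None)


print(handle_packet(1500, True, 1476))
print(handle_packet(1500, False, 1476))
```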
Barney Wolff wrote:
That's not how PMTUD works. If DF is set, you discard the packet and report back with ICMP. If DF is not set, you frag the packet - but that's not PMTUD, because no report ever goes back to the sender.
Probably better to say that in this day and age PMTUD doesn't work, and the best interoperability feature of IP, fragmenting, is becoming useless as the Internet remains pegged to a 1500 MTU everywhere, with evil hacks everywhere else to keep things working (MSS adjustment/clamping, DF bit clearing).
Is there any discussion on better alternatives to PMTUD, such as leaving off DF and adding a new, rate-limited ICMP subtype to inform senders that they've been fragged and at what size (call it reverse PMTUD)?
Or how about a new TCP option (call it MSSr/s, maximum segment size sent/received) for the receiver to tell the sender if packet sizes are less than expected/fragged (again with DF off)?
Does IPv6 really do away with fragmenting? Is there any current discussion on all this? I see I have to go do some research.
On Thu, 04 Dec 2003 18:03:38 EST, Barney Wolff said:
That's not how PMTUD works. If DF is set, you discard the packet and report back with ICMP. If DF is not set, you frag the packet - but that's not PMTUD, because no report ever goes back to the sender.
Oh, so we compute ONE number if DF is set, saying what number we think they should use - but if DF *isn't* set, we use a different number. Sounds like more complicated code that's just there so it can sink its teeth into the rump of the first banana-eating NOC dweller that has to figure out what's wrong.... Unless of course there's a *reason* we want it different? Though it escapes me what it might be....
Valdis.Kletnieks@vt.edu wrote:
Oh, so we compute ONE number if DF is set, saying what number we think they should use - but if DF *isn't* set, we use a different number. Sounds like more complicated code that's just there so it can sink its teeth into the rump of the first banana-eating NOC dweller that has to figure out what's wrong....
Unless of course there's a *reason* we want it different? Though it escapes me what it might be....
As I have said previously, some reasons are:
A) You're fragmenting the packet anyway, thus there will be extra header overhead. Splitting that overhead into 1 big and 1 small packet does not seem to be a performance win**.
B) Fragmenting into equal sizes may mean that equipment can treat them more equally and may reduce out-of-order fragments, which is easier on state-keeping devices.
C) Equal buffer treatment may mean easier handling of switching and reassembly. I haven't thought this through.
D) And the best part: avoid adding insult to injury by lessening the chance that further fragmentation will occur on the packet. Picture a packet coming in from ATM to Ethernet to PPPoE through IPsec. How many fragments is that? How much overhead?
As far as code goes, how is that a problem? One assumes the length of the packet is there already. So all we have to do is divide it in half and use that number instead of the value of next_hop_mtu.
And we use different numbers because when DF is set our only option is telling the sender to lower. Lower to what? Well, to what we know is good. How do we know the next hop isn't even lower? Well, we should know if it's in the same AS; otherwise we just do our best.
And besides, PMTUD is a performance-oriented feature. One would like to avoid compromising the performance gains. The precise maximum path MTU is exactly what the sender wants to find out. So give it.
But IP without DF is best-attempt delivery. So do whatever will be the best compromise. And we are fragmenting anyway... (GOTO START)
**But one case where this could be undesired is by causing buffer fragmentation.
Joe Maimon wrote:
I agree with all I have snipped. I was wondering would it not be wiser for fraggers to frag in half instead of just the overflow?
For instance, suppose a router has to fragment a 1500 byte packet to go over a 1476 byte GRE tunnel. Instead of having a big packet/little fragment, why not just divide in half? This would give them more equal buffer treatment, but an even bigger potential win is to avoid perhaps a second (maybe IPsec?) fragmenting later on down the pipe.
Once you are going to do it, do it right. It is not as if you're decreasing header overhead by producing small fragment packets. And I am assuming the whole packet is already in buffer when it comes time to fragment it.
Programmers are lazy.
Exercise for the reader:
Devise an algorithm that will take an arbitrarily sized packet, 20-65535 octets, and an arbitrarily sized MTU, > 576 octets, and split the packet into the minimum number of "n" fragments where each fragment is (1) less than the MTU, (2) no two fragments differ by more than 8 octets, and the fragments obey the IP fragmentation rules, (3) data payload must end on an 8-octet boundary for all but the last fragment and (4) each fragment has an exact copy of the original header except for differences in the fragmentation fields and checksum.
Compare to the algorithm of cutting the data into "m" (mtu - ip_hl)-sized chunks and putting the leftovers into the final fragment.
-- Crist J. Clark crist.clark@globalstar.com Globalstar Communications (408) 933-4387
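One possible answer to the exercise, as a rough sketch rather than anyone's posted solution: count the payload in 8-octet units, take the minimum fragment count, and spread the units as evenly as possible, giving the extra units to the trailing fragments so the (possibly short) last fragment never falls more than 8 octets behind. It assumes a fixed 20-octet header, treats "less than the MTU" as "no larger than the MTU", and leaves header copying and checksums (condition 4) out. The function name is hypothetical.

```python
def balanced_fragments(total_len, mtu, header=20):
    """Return [(offset_in_8_octet_units, payload_len, more_fragments), ...]."""
    payload = total_len - header
    if total_len <= mtu:
        return [(0, payload, False)]
    max_units = (mtu - header) // 8          # 8-octet units that fit per fragment
    units = -(-payload // 8)                 # ceil(payload / 8); last unit may be partial
    n = -(-units // max_units)               # minimum number of fragments
    base, extra = divmod(units, n)
    frags, offset = [], 0
    for i in range(n):
        # The last `extra` fragments carry one more unit than the others.
        u = base + (1 if i >= n - extra else 0)
        length = min(u * 8, payload - offset * 8)
        frags.append((offset, length, i < n - 1))
        offset += u
    return frags


# Usage: the 1500-octet packet over a 1476-octet MTU from earlier in the thread.
print(balanced_fragments(1500, 1476))
```

The compare-to algorithm is just the overflow loop: fill each fragment to the largest multiple of 8 that fits and put the leftovers last. That is shorter, which is rather the point about lazy programmers.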
Crist Clark wrote:
Programmers are lazy.
Exercise for the reader:
Devise an algorithm that will take an arbitrarily sized packet, 20-65535 octets, and an arbitrarily sized MTU, > 576 octets, and split the packet into the minimum number of "n" fragments where each fragment is (1) less than the MTU, (2) no two fragments differ by more than 8 octets, and the fragments obey the IP fragmentation rules, (3) data payload must end on an 8-octet boundary for all but the last fragment and (4) each fragment has an exact copy of the original header except for differences in the fragmentation fields and checksum.
Compare to the algorithm of cutting the data into "m" (mtu - ip_hl)-sized chunks and putting the leftovers into the final fragment.
How about only going to the bother if 'n' would only be 2 in either algorithm?
That should keep things nice and simple for all the lazy programmers. And we wonder why there are so many security holes.
As for the rest, I do not see the real difference. And now I will shut up about implementation until (if ever) I can write some.
Crist Clark wrote:
Programmers are lazy.
Exercise for the reader:
Devise an algorithm that will take an arbitrarily sized packet, 20-65535 octets, and an arbitrarily sized MTU, > 576 octets, and split the packet into the minimum number of "n" fragments where each fragment is (1) less than the MTU, (2) no two fragments differ by more than 8 octets, and the fragments obey the IP fragmentation rules, (3) data payload must end on an 8-octet boundary for all but the last fragment and (4) each fragment has an exact copy of the original header except for differences in the fragmentation fields and checksum.
Compare to the algorithm of cutting the data into "m" (mtu - ip_hl)-sized chunks and putting the leftovers into the final fragment.
I've got to jump in and display my considerable ignorance here.
Are there not machines in service now that start blatting bits out (when able) before the whole packet has been received?
Given that to be correct, it would seem to be Really Hard to whack up a packet into equal-sized chunks (given that it is otherwise a Good Thing To Do) on-the-fly. Lazy programmers (which I have long taught are the Best Kind) will blat out bits until the buffer is full, start a new buffer, rinse, lather, repeat until the input buffer is exhausted.
Where did I go into the ditch?
Laurence F. Sheldon, Jr. wrote:
I've got to jump in and display my considerable ignorance here.
Are there not machines in service now that start blatting bits out (when able) before the whole packet has been received?
Given that to be correct, it would seem to be Really Hard to whack up a packet into equal-sized chunks (given that it is otherwise a Good Thing To Do) on-the-fly. Lazy programmers (which I have long taught are the Best Kind) will blat out bits until the buffer is full, start a new buffer, rinse, lather, repeat until the input buffer is exhausted.
Where did I go into the ditch?
Maybe because IP tells the length of the packet up front?
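To illustrate "up front": in IPv4 the Total Length field sits in octets 2-3 of the header, so a forwarder has the packet size after the first four octets arrive, well before the payload. A minimal sketch, with a hypothetical function name:

```python
import struct


def ipv4_total_length(header_bytes):
    """Read the 16-bit big-endian Total Length field at offset 2 of an IPv4 header."""
    (length,) = struct.unpack_from("!H", header_bytes, 2)
    return length


# Usage: version/IHL 0x45, TOS 0, Total Length 0x05DC (1500); rest of header zeroed.
header = bytes([0x45, 0x00, 0x05, 0xDC]) + bytes(16)
print(ipv4_total_length(header))
```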
On Thu, 04 Dec 2003 17:22:23 PST, Crist Clark said:
Exercise for the reader:
Devise an algorithm that will take an arbitrarily sized packet, 20-65535 octets, and an arbitrarily sized MTU, > 576 octets, and split the packet into the minimum number of "n" fragments where each fragment is (1) less than the MTU, (2) no two fragments differ by more than 8 octets, and the fragments obey the IP fragmentation rules, (3) data payload must end on an 8-octet boundary for all but the last fragment and (4) each fragment has an exact copy of the original header except for differences in the fragmentation fields and checksum.
Compare to the algorithm of cutting the data into "m" (mtu - ip_hl)-sized chunks and putting the leftovers into the final fragment.
Then re-read rfc792:
The ICMP messages typically report errors in the processing of datagrams. To avoid the infinite regress of messages about messages etc., no ICMP messages are sent about ICMP messages. Also ICMP messages are only sent about errors in handling fragment zero of fragemented datagrams. (Fragment zero has the fragment offeset equal zero).
and think which is more likely to silently fail - fragmenting a 1500 byte packet 750/750, or 1430/70, if the *second* frag is in possible danger of being lost due to buffer congestion or similar?
Joe Maimon wrote:
Tony Rall wrote:
On Wednesday, 2003-12-03 at 09:38 PST, David Sinn <dsinn@dsinn.com> wrote:
<snipped>
<snipped> I was wondering would it not be wiser for fraggers to frag in half instead of just the overflow? <snip>
I noticed this URL today: http://www.cisco.com/en/US/products/sw/iosswrel/ps1839/products_feature_guid...
Interesting part down on the page:
Uniform Fragmentation
Packets are fragmented into equally sized units to prevent further downstream fragmentation.
On Wed, Dec 10, 2003 at 03:43:59PM -0500, Joe Maimon wrote:
Packets are fragmented into equally sized units to prevent further downstream fragmentation.
For amusement's sake, in response to a challenge from Crist Clark, here's code to do it right. Pretty simple, although I have no idea how to do it in an ASIC.
-- Barney Wolff http://www.databus.com/bwresume.pdf
I'm available by contract or FT, in the NYC metro area or via the 'Net.
participants (11)
- Barney Wolff
- cproctor@epik.net
- Crist Clark
- David Sinn
- Henry Linneweh
- Joe Maimon
- Laurence F. Sheldon, Jr.
- Owen DeLong
- Sean Donelan
- Tony Rall
- Valdis.Kletnieks@vt.edu