On Thu, Mar 20, 2003 at 03:26:35PM -0500, bdragon@gweep.net wrote:
If someone can identify what you are actually seeing, I'll check into it. If you are experiencing drops or slow traces only through the core, there is an issue with excessive de-prioritization of ICMP control messages on a particular router type (vendor) in the core. End-to-end data flow does not seem to be affected, but trace and ping latencies through the core are looking very weird. I've been asking customers to use trace only for path detail and to use end-to-end ping for any performance data.
Yes, the core is MPLS enabled. Diffserv is acted on only at the edges, though.
Michelle
It could certainly be customers who have broken themselves. I've heard lots of stories about people who do PMTUD but simultaneously filter ICMP Can't Frag messages.
As soon as the path MTU drops below whatever their local box uses (usually 1500), they "break", though it is due to their own screwed-up config.
Since MPLS adds additional overhead, dropping the MTU, I'd seriously consider this as a possible reason.
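To put rough numbers on that failure mode, here is a minimal sketch in Python. It assumes each MPLS label costs 4 bytes and that the link MTU has not been raised to compensate; the function names are made up for illustration and do not describe any particular router's behavior.

```python
MPLS_LABEL_BYTES = 4  # one MPLS label stack entry is 4 bytes


def effective_ip_mtu(link_mtu, labels):
    """Largest IP packet that still fits once the label stack is added."""
    return link_mtu - labels * MPLS_LABEL_BYTES


def pmtud_outcome(packet_size, link_mtu, labels, icmp_filtered):
    """Describe what happens to a DF-set packet crossing a labeled link."""
    limit = effective_ip_mtu(link_mtu, labels)
    if packet_size <= limit:
        return "delivered"
    # The router drops the packet and sends ICMP type 3 code 4
    # ("fragmentation needed and DF set") back toward the sender.
    if icmp_filtered:
        # The sender never hears about the smaller MTU and keeps retrying.
        return "black hole"
    return f"sender lowers path MTU to {limit}"


if __name__ == "__main__":
    print(pmtud_outcome(1500, 1500, 2, icmp_filtered=True))
    print(pmtud_outcome(1500, 1500, 2, icmp_filtered=False))
```

With two labels on a 1500-byte link, a full-size packet no longer fits, and only the customer who filters the ICMP message ends up stuck.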
Speaking very generally and not about any one specific network, this is likely to not be the issue. MPLS leads to problems on Ethernet, but I've seen no problems on anything other than Eth/FE. GigE and POS haven't had the same issue; for one, default POS MTU is ~4k, which is more than enough to hold packets from hosts that assume 576 or 1500, and PMTUD over an MPLS network takes the MPLS label stack size into account when doing discovery.
Also, some implementations have framers that can accept a packet that's actually MTU+(N*4), where N is typically no more than 4, and more likely 2. And I think I can say without breaking any confidentiality agreements that AT&T's backbone Probably Isn't (nudge nudge wink wink) made up of scads and scads of 10/100Mb links everywhere. :)
The biggest problem you can have with MPLS is if you have customers who are connected at 4k or 9k or what have you, and who don't do PMTUD; I've not seen this come up as a real operational issue.
.02
eric
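The MTU+(N*4) point can be checked with a few lines of Python. This is only a sketch of the arithmetic: the helper name and the 4000-byte stand-in for "~4k" are assumptions for illustration, not figures from any specific platform.

```python
LABEL_BYTES = 4  # size of one MPLS label stack entry


def labeled_packet_fits(ip_packet, labels, mtu, framer_slack_labels=0):
    """True if the IP packet plus its label stack fits on the link.

    framer_slack_labels is the N in "MTU+(N*4)": how many extra labels
    the framer will accept beyond the configured MTU.
    """
    on_wire = ip_packet + labels * LABEL_BYTES
    return on_wire <= mtu + framer_slack_labels * LABEL_BYTES


if __name__ == "__main__":
    # 1500-byte packet, two labels, plain 1500-byte link with no slack
    print(labeled_packet_fits(1500, 2, 1500))
    # same link, framer tolerates two extra labels
    print(labeled_packet_fits(1500, 2, 1500, framer_slack_labels=2))
    # link with a ~4k MTU (4000 used as a stand-in)
    print(labeled_packet_fits(1500, 2, 4000))
```

Only the first case fails, which matches the observation that the trouble shows up on Eth/FE links without framer slack rather than on larger-MTU links.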
The major problems are:
1) identifying broken customers
2) convincing customers that they are broken when they "haven't changed anything"
3) getting them to actually change
Some folks just put off the problem until later by moving to MTUs > 1500. The only benefit to this is that hopefully when the customer next breaks it is as a direct result of them having "changed something" which gets you over the hurdle of convincing some person that their filtering of all ICMP isn't just stupid, but is also broken.