It's a well-known problem, and not one specific to Cisco (any tunnelling over a connectionless network decreases the MTU). The trouble is with those MS-based applications which do not know how to deal with fragmentation AND use big (1500-byte) packets. What has to be fixed here is the applications, not the routers, though it is possible to imagine ways to work around this in the router's software.

The application MUST:
- not set the DF bit; OR
- never use long (> 1024 bytes) packets, AND understand ICMP messages about MTU size and the MTU discovery protocol.

Any application that does not conform to this is not guaranteed to work on the Internet.

On Mon, 6 Jul 1998, Jens Schweikhardt wrote:
Date: Mon, 6 Jul 1998 13:39:04 +0200 (MET DST)
From: Jens Schweikhardt <schweikh@noc.dfn.de>
To: bridge@ip-plus.net, horke@regio.net, nanog@merit.edu
Cc: DFN NOC <noc@noc.dfn.de>
Subject: MTU problems with GRE tunnels (fwd)
To whom it may concern,
here is some email I received in the last months, followed by some of my observations which might be related to the problems discussed. I have posted my observations to comp.sys.dcom.cisco and opened a trouble ticket with cisco's technical assistance center.
# Forwarded message:
# > From merit.edu!errors-nohumans Fri Jun 5 23:44:49 1998
# > Message-Id: <3.0.3.32.19980605095358.006ebd4c@mailhost.ip-plus.net>
# > X-Sender: bridge@mailhost.ip-plus.net
# > X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.3 (32)
# > Date: Fri, 05 Jun 1998 09:53:58 +0100
# > To: nanog@merit.edu
# > From: philip bridge <bridge@ip-plus.net>
# > Subject: MTU problems with GRE tunnels
# > Mime-Version: 1.0
# > Content-Type: text/plain; charset="us-ascii"
# > Sender: owner-nanog@merit.edu
# > Content-Length: 1881
# >
# > I'm experiencing problems with fragmentation due to Cisco GRE tunnel
# > overhead: the way I understand it, the MTU of a GRE tunnel will always be
# > less than the MTU of the underlying IP cloud (in our case 1500 bytes) due
# > to the IP encapsulation overhead. So 1500 byte packets attempting to
# > traverse the tunnel will be fragmented, or dropped if the DF bit is set, in
# > which case an ICMP message is sent back to the originating host
# >
# > We're trying to use GRE tunnels extensively in some fancy added-value
# > Internet services, and it seems that there is a small but significant
# > amount of application traffic out there that has problems when traversing a
# > GRE tunnel with MTU < 1500. We've seen two problems:
# >
# > - 1500 byte packets with DF set. This is either application traffic, or MTU
# >   path discovery is broken, because the same packets get sent repeatedly
# > - 1500 byte packets get fragmented, but the destination host cannot cope
# >   with the fragmentation (firewall issues?)
# >
# > We see this on a variety of platforms (from 2500, 7507) and a variety of
# > IOS releases (11.1(18)CC, 11.1(2), 11.2(5)). Talking to another provider
# > indicates that the same problem exists with other vendors, and is having
# > the same severe impact.
# >
# > Thinking about it, this is a problem that is to be expected with IP tunnels of
# > all types, but I am surprised at the extent of its influence on our
# > customers' applications (such as large emails). I do not want to overstate
# > the proportion of traffic we see with this problem - but it does seem to be
# > enough to render GRE tunnels very problematic - to say the least. But I
# > know lots of people are using GRE for this or similar applications...so
# > what am I missing here?
# >
# > thanks in advance for help/tips
# >
# > Phil
# >
# > ______________________________________________________________
# > Philip Bridge
# > ++41 31 688 8262 bridge@ip-plus.net www.ip-plus.ch
# > PGP: DE78 06B7 ACDB CB56 CE88 6165 A73F B703
# >
#
# --
# Bernhard Kroenung, Bahnhofstr 8, 36157 Ebersburg/Rhoen, Germany +49 6656 910101
# @work : bernhard@kroenung.de Work: +49 661 9011777
# @home : horke@Rhoen.De @school : Bernhard.Kroenung@Informatik.FH-Fulda.De
# hello, world\n
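The overhead described in the forwarded message can be put into numbers. The sketch below is not from the thread: the subroutine names are hypothetical, and it assumes the usual header sizes (a 20-octet delivery IP header without options, a 4-octet basic GRE header, and 4 more octets when a tunnel key is configured).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed header sizes in octets (not taken from the thread).
my $OUTER_IP_HDR = 20;   # delivery IP header without options
my $GRE_BASE_HDR = 4;    # flags/version plus protocol type
my $GRE_KEY      = 4;    # optional key field

# Largest inner packet that still fits in one datagram on the underlying path.
sub tunnel_mtu {
    my ($path_mtu, $with_key) = @_;
    return $path_mtu - $OUTER_IP_HDR - $GRE_BASE_HDR - ($with_key ? $GRE_KEY : 0);
}

# What happens to an inner packet of $len octets at the tunnel entrance,
# following the behaviour described in the forwarded message.
sub packet_fate {
    my ($len, $df_set, $mtu) = @_;
    return 'forwarded'  if $len <= $mtu;
    return 'fragmented' unless $df_set;
    return 'dropped, ICMP message sent back to the originating host';
}

my $mtu = tunnel_mtu(1500, 0);
print "tunnel MTU without key: $mtu\n";
print "tunnel MTU with key: ", tunnel_mtu(1500, 1), "\n";
print "1500 octets, DF clear: ", packet_fate(1500, 0, $mtu), "\n";
print "1500 octets, DF set: ", packet_fate(1500, 1, $mtu), "\n";
```

Under these assumptions a 1500-octet path leaves 1476 octets for the inner packet, or 1472 with a key, so any full-size packet from an Ethernet host hits one of the two problem cases.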
Here's something very strange I observe with GRE tunnels (the default tunnel mode). It looks like cisco routers send IP datagrams violating RFC 791 [Internet Protocol] over GRE tunnels. In particular, the length field of the IP header is computed incorrectly to *not* include the size of the IP header. RFC 791 says about the length field:
<quote>
Total Length: 16 bits
Total Length is the length of the datagram, measured in octets, including internet header and data. This field allows the length of a datagram to be up to 65,535 octets. ...
</quote>
I have an application on my workstation that serves as one endpoint of a GRE tunnel. In fact, it's such a tiny perl program that I have appended it at the end of this mail.
Here's the tunnel config on my cisco, which runs IOS (tm) 4500 Software (C4500-P-M), Version 11.2(9), RELEASE SOFTWARE (fc1):
interface Tunnel2
 description GRE Test Tunnel
 ip address 10.0.0.1 255.255.255.252
 tunnel source 193.174.247.254 !another iface of this cisco
 tunnel destination 193.174.247.193 !my workstation's address
 tunnel key 42 !optional
Let's ping the other end of the tunnel:

io#ping 10.0.0.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.0.0.2, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
Here's what the perl tunnel endpoint outputs:

Length of received packet: 128 <<<<<<<<< Note this
version: 4
header len: 5
tos: 0
length: 108 <<<<<<<<< Note this
id: 1586
flags: 0
offset: 0
ttl: 255
protocol: 47
chksum: 16895
source: 193.174.247.254
destination: 193.174.247.193
 20 00 08 00 00 00 00 2a 45 00 00 64 01 39 00 00
 ff 01 a6 5d 0a 00 00 01 0a 00 00 02 08 00 51 68
 00 00 23 a5 00 00 00 01 9a 8b 6e b0 ab cd ab cd
 ab cd ab cd ab cd ab cd ab cd ab cd ab cd ab cd
 ab cd ab cd ab cd ab cd ab cd ab cd ab cd ab cd
 ab cd ab cd ab cd ab cd ab cd ab cd ab cd ab cd
 ab cd ab cd ab cd ab cd ab cd ab cd
Or let's try a telnet session:

io#telnet 10.0.0.2
Trying 10.0.0.2 ...

Length of received packet: 72 <<<<<<<<< Note this
version: 4
header len: 5
tos: 0
length: 52 <<<<<<<<< Note this
id: 1591
flags: 0
offset: 0
ttl: 255
protocol: 47
chksum: 16946
source: 193.174.247.254
destination: 193.174.247.193
 20 00 08 00 00 00 00 2a 45 00 00 2c 00 00 00 00
 ff 06 a7 c9 0a 00 00 01 0a 00 00 02 52 02 00 17
 52 c8 26 04 00 00 00 00 60 02 10 c0 a8 9a 00 00
 02 04 05 98
We note that the length as reported in the IP header is always 20 octets less than what we receive on the socket. This leads me to the question
Do you cisco guys read RFCs? :-)
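The figures in the ping dump can be re-derived from the hex octets themselves. The sketch below is an illustration, not part of the original script: `decode_gre` is a hypothetical helper, and it assumes the GRE layout of RFC 1701 (16 bits of flags/version with the key-present bit at 0x2000, a 16-bit protocol type, then the optional 32-bit key), ignoring checksum and sequence options.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# First 28 octets of the ping dump: 8 octets of GRE, 20 of inner IP header.
my $hex = '20 00 08 00 00 00 00 2a'
        . ' 45 00 00 64 01 39 00 00 ff 01 a6 5d 0a 00 00 01 0a 00 00 02';

# Hypothetical helper: pick the interesting fields out of a GRE payload.
sub decode_gre {
    my ($octets) = @_;
    my ($flags, $proto) = unpack('n n', $octets);
    my $has_key = ($flags & 0x2000) ? 1 : 0;       # K bit (assumed RFC 1701 layout)
    my $gre_len = 4 + ($has_key ? 4 : 0);          # checksum/sequence options ignored
    my $key     = $has_key ? unpack('x4 N', $octets) : undef;
    # Total Length sits in octets 2..3 of the inner IP header.
    my $inner_total = unpack('n', substr($octets, $gre_len + 2, 2));
    return ($proto, $key, $gre_len, $inner_total);
}

(my $packed = $hex) =~ s/\s+//g;
my ($proto, $key, $gre_len, $inner_total) = decode_gre(pack('H*', $packed));

printf "protocol type: 0x%04x\n", $proto;
print  "key: $key\n";
print  "GRE header: $gre_len octets, inner datagram: $inner_total octets\n";
print  "GRE header plus inner datagram: ", $gre_len + $inner_total, "\n";
print  "plus a 20-octet delivery header: ", 20 + $gre_len + $inner_total, "\n";
```

The key decodes to 42, as configured on the tunnel, and the inner datagram is the 100-octet echo the router announced. The two sums are 108 and 128, the values marked "Note this" above.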
Regards,
Jens Schweikhardt
--
## Network Operation Center, DFN-Verein Geschäftsstelle Stuttgart ##
## http://www.noc.dfn.de/ finger trouble@noc.dfn.de wartung@noc.dfn.de ##
## >>>>>> mailto: noc@noc.dfn.de <<<<<< ##
Here's my perl script:
#!/usr/local/bin/perl5 -w
#
# GRE Tunnel Endpoint; prints all GRE packets received.
#
# Author: Jens Schweikhardt <schweikh@noc.dfn.de>
#
# >>> You probably need root permission to open the raw socket. <<<

use Socket qw (SOCK_RAW PF_INET);
use strict;

my $gre = 47; # Generic Routing Encapsulation
my $rbits;    # bitmask with read file descriptors for select
my $out;      # writable copy of rbits for select to clobber
my $nready;   # return value from select

unless (socket (SOCKET, &PF_INET(), &SOCK_RAW(), $gre)) {
    print STDERR "gre socket: $!\n";
    exit 1;
}
$rbits = '';
vec ($rbits, fileno SOCKET, 1) = 1;
for (;;) {
    $nready = select ($out = $rbits, undef, undef, undef);
    last unless defined $nready;    # Should not happen...
    &receive_packet () if $nready;  # A packet is waiting
}
close SOCKET;
exit 0;

sub receive_packet {
    my $from_msg = '';
    my $from_saddr = recv (SOCKET, $from_msg, 1500, 0);
    unless (defined $from_saddr) {
        print STDERR "recv: $!\n";
        return 0;
    }
    print "\nLength of received packet: ", length ($from_msg), "\n";
    my ($delivery_ip_version, $delivery_ip_ihl, $delivery_ip_tos,
        $delivery_ip_length, $delivery_ip_id, $delivery_ip_flags,
        $delivery_ip_offset, $delivery_ip_ttl, $delivery_ip_proto,
        $delivery_ip_chksum, $delivery_ip_src, $delivery_ip_dst,
        $delivery_ip_options, $delivery_ip_data
    ) = &ip_unpack ($from_msg);

    print "version: $delivery_ip_version\n";
    print "header len: $delivery_ip_ihl\n";
    print "tos: $delivery_ip_tos\n";
    print "length: $delivery_ip_length\n";
    print "id: $delivery_ip_id\n";
    print "flags: $delivery_ip_flags\n";
    print "offset: $delivery_ip_offset\n";
    print "ttl: $delivery_ip_ttl\n";
    print "protocol: $delivery_ip_proto\n";
    print "chksum: $delivery_ip_chksum\n";
    # Repack with 'N' (network order) so the octets come out in the right
    # order on little-endian hosts too; the original used the native 'L'.
    printf "source: %u.%u.%u.%u\n", unpack ('C4', pack ('N', $delivery_ip_src));
    printf "destination: %u.%u.%u.%u\n", unpack ('C4', pack ('N', $delivery_ip_dst));
    &dump ($delivery_ip_data);
}

# Print the argument as hex octets, 16 per line.
sub dump {
    my $len = length ($_[0]);
    if ($len > 0) {
        my @octet = split //, $_[0];
        my $i;
        for ($i = 1; $i <= $len; ++$i) {
            printf " %02x", unpack ('C', $octet[$i-1]);
            print "\n" unless $i % 16;
        }
        print "\n" if $i % 16;
    } else {
        print " [NO DATA]\n";
    }
}

# Format of an IP packet, RFC 791.
#
sub ip_unpack {
    my $packet = shift;
    if (length ($packet) < 20) {
        print STDERR "ip packet too short: ", length ($packet), " bytes\n";
        exit 1;
    }
    my ( $version, $tos, $length, $id, $flags,
         $ttl, $proto, $chksum, $src, $dst
    ) = unpack ('CCnnnCCnNN', $packet);
    my $ihl = $version & 017;   # low nibble: header length in 32-bit words
    $version >>= 4;
    if ($version != 4) {
        print STDERR "ip version mismatch, expected 4, got $version\n";
        exit 1;
    }
    my $offset = $flags & 017777;   # low 13 bits: fragment offset
    $flags >>= 13;
    my $options = substr ($packet, 20, $ihl * 4 - 20);
    my $data = substr ($packet, $ihl * 4);
    return ( $version, $ihl, $tos, $length, $id, $flags, $offset,
             $ttl, $proto, $chksum, $src, $dst, $options, $data );
}
Aleksei Roudnev, Network Operations Center, Relcom, Moscow (+7 095) 194-19-95 (Network Operations Center Hot Line),(+7 095) 239-10-10, N 13729 (pager) (+7 095) 196-72-12 (Support), (+7 095) 194-33-28 (Fax)
# > I'm experiencing problems with fragmentation due to Cisco GRE tunnel
# > overhead: the way I understand it, the MTU of a GRE tunnel will always be
# > less than the MTU of the underlying IP cloud (in our case 1500 bytes) due
We can confirm that IP in IP tunneling is broken too...

To quote a specific case, Sparc/Solaris networking over a tunnel with IP in IP will break: the largest ping you can do is "ping -s xyz 1452". So you have to reduce the MTU; we lowered ours to 1400 (an arbitrary reduction) and it makes the problem "go away". The router in question was running 11.1, but from what's been said, it looks like it's generic across the range.

I think the Perl program posted was a great idea, now if only Cisco would add a Perl interpreter to IOS :-)

Paul
----
P Mansfield, Senior SysAdmin PSINet, +44-1223-577577x2611/577611 fax:577600
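The figure of 1452 is consistent with plain header arithmetic. The sketch below rests on two assumptions that the message does not state: that 1452 is the ICMP data size handed to ping, and that IP in IP adds exactly one 20-octet IP header. The subroutine name is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sizes in octets; none of these constants appear in the thread.
my $IP_HDR   = 20;   # IP header without options
my $ICMP_HDR = 8;    # ICMP echo header

# Largest ICMP echo payload that crosses the tunnel without fragmentation.
sub max_ping_payload {
    my ($path_mtu, $tunnel_overhead) = @_;
    return $path_mtu - $tunnel_overhead - $IP_HDR - $ICMP_HDR;
}

# IP in IP: one extra IP header in front of the original datagram.
print "IP in IP: ", max_ping_payload(1500, $IP_HDR), "\n";
# GRE without key: extra IP header plus a 4-octet GRE header.
print "GRE, no key: ", max_ping_payload(1500, $IP_HDR + 4), "\n";
```

With a 1500-octet path the IP in IP case comes out at 1452, the limit quoted above. Lowering the tunnel MTU to 1400 simply leaves additional headroom.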
> Here's something very strange I observe with GRE tunnels (the default tunnel mode). It looks like cisco routers send IP datagrams violating RFC 791 [Internet Protocol] over GRE tunnels. In particular, the length field of the IP header is computed incorrectly to *not* include the size of the IP header. RFC 791 says about the length field:
> [...]
> I have an application on my workstation that serves as one endpoint of a GRE tunnel. In fact, it's such a tiny perl program that I have appended it at the end of this mail.
> [...]
> We note that the length as reported in the IP header is always 20 octets less than what we receive on the socket. This leads me to the question
>
> Do you cisco guys read RFCs? :-)
I can tell you for sure that the Cisco routers do send the packets (GRE or IP protocol 4) with a length which includes the IP header, just like the RFC. If you look I think you'll find that it is your kernel which is subtracting out the IP header length before it hands the packet to you on the raw socket.

Dennis Ferguson
Dennis Ferguson
> I can tell you for sure that the Cisco routers do send the packets (GRE
> or IP protocol 4) with a length which includes the IP header, just like
> the RFC. If you look I think you'll find that it is your kernel which is
> subtracting out the IP header length before it hands the packet to you on
> the raw socket.
BSD Unix converts the length to host order and subtracts the IP header length.
Linux leaves the length in net order (don't know about subtracting).
Windows leaves the length in net order and does not subtract the IP header length.

-Dave
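Code that reads the IP header from a raw socket therefore has to know which convention its kernel follows. A minimal sketch, taking the behaviours exactly as listed in this message without independent verification; the subroutine name and the style labels are hypothetical. Linux is left out because the message does not say whether it subtracts the header length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Recover the on-the-wire Total Length from a packet read on a raw socket.
# $style is a hypothetical label for the conventions listed above:
#   'bsd'     : length in host byte order, IP header length already subtracted
#   'windows' : length in network byte order, nothing subtracted
sub wire_total_length {
    my ($packet, $style) = @_;
    my $ihl_octets = (unpack('C', $packet) & 0x0f) * 4;
    if ($style eq 'bsd') {
        return unpack('x2 S', $packet) + $ihl_octets;   # 'S' = native 16-bit
    }
    return unpack('x2 n', $packet);                      # 'n' = network order
}

# The same 128-octet datagram as each kind of kernel would present it.
my $bsd_view  = pack('C C S', 0x45, 0, 108) . ("\0" x 16);
my $wire_view = pack('C C n', 0x45, 0, 128) . ("\0" x 16);

print wire_total_length($bsd_view,  'bsd'),     "\n";
print wire_total_length($wire_view, 'windows'), "\n";
```

Both calls yield 128, the size the ping example showed on the socket, so the 108 printed by the endpoint script is what the 'bsd' convention would produce for a correctly formed datagram.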
Dave & Dennis,

thanks a lot for your valuable insights. And of course, apologies for making the cisco guys look like they didn't read the RFC. Thank God this was written as a question with a smiley...

# Dennis Ferguson
# > I can tell you for sure that the Cisco routers do send the packets (GRE
# > or IP protocol 4) with a length which includes the IP header, just like
# > the RFC. If you look I think you'll find that it is your kernel which is
# > subtracting out the IP header length before it hands the packet to you on
# > the raw socket.
#
# BSD Unix converts the length to host order and subtracts the IP header
# length.

Ugh! The platform in question is Solaris 2.5.1. So we have one more datapoint. I'd consider this a bug unless it's documented somewhere outside of the kernel sources. What do you think? Should I harass the Solaris developers?

# Linux leaves the length in net order (don't know about subtracting).
# Windows leaves the length in net order and does not subtract the IP
# header length.
#
# -Dave

Regards,

--
Jens Schweikhardt http://www.shuttle.de/schweikh/
SIGSIG -- signature too long (core dumped)
participants (5)
- Alex P. Rudnev
- Dave Thaler
- Dennis Ferguson
- Jens Schweikhardt
- Paul Mansfield