Fwd: [v6ops] IPv6 MTU Flow-label.... (related to draft-v6ops-pmtud-ecmp-problem-01)

10 Nov 2014

      Forwarding this so that everybody can comment on this nasty proposal ;)

Forcing replies to v6ops@ietf.org, where they likely should be taking
place, as that is where the mentioned draft was recently accepted as a
WG item.

Greets,
 Jeroen

-------- Forwarded Message --------
Subject: [v6ops] IPv6 MTU Flow-label.... (related to
draft-v6ops-pmtud-ecmp-problem-01)
Date: Mon, 10 Nov 2014 11:31:52 +0100
From: Jeroen Massar <jeroen@massar.ch>
Organization: Massar
To: ipv6@ietf.org, v6ops@ietf.org

Hola folks (and folks in BCC ;),

With the recent Google and Akamai outages (latter still ongoing afaik),
it came to light that the cause is likely the model and problem
described here:

 https://tools.ietf.org/html/draft-v6ops-pmtud-ecmp-problem-01
which previously was:
 https://tools.ietf.org/html/draft-v6ops-jaeggli-pmtud-ecmp-problem-01

Or, briefly: the same IP address is terminated at several hosts, and the
balancer box does not know to which of them it should deliver the ICMP
PTBs that get sent for large packets.

One of the suggestions there is to lower the MSS for every connection by
forcing it (either on the loadbalancer or on the final host) to a value
that "works everywhere": the one for an MTU of 1280.

MSS only applies to TCP, and people like Google are coming out with QUIC
and other schemes.

As we really do not want an Internet at an MTU of 1280, why don't we
indicate in the packet what the MTU is when it deviates from the norm?

What if we instead let a router that receives a packet from a link with
an MTU < 1500, or is going to transmit it over such a link, indicate in
that packet that it came from / is going to a link with an MTU < 1500?

We can't use an additional extension header, as adding anything to the
packet might push it over the MTU, and then we have other issues.

As our least-used field is the FlowLabel field (20 bits), we could abuse
that and have enough bits there to stuff our data.

What if we define that when the first 4 bits are set to 0xF (all ones)
the remaining 16 bits define the MTU of the link (MTU 0 - 65535)?
(We could even use a 'base of 1280' and thus 0xf0000 = 1280 MTU, but
possibly it is better to state "a value < 0xf0500 is invalid".)

Thus, when the first 4 bits are not all ones, the flowlabel field
remains a "normal flowlabel" field a la RFC6437. We could even state
"Only set this MTU option when the FlowLabel field == 0" to avoid
incompatibility (though I do not expect any, as I rarely see packets
with the field non-0...)
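
As a rough illustration of that encoding (a sketch only; the function
and constant names are made up here and appear in no draft), the 20-bit
field could be packed and unpacked like this in Python:

```python
# Hypothetical helpers for the proposed "MTU in FlowLabel" encoding.
# Top 4 bits of the 20-bit field == 0xF marks the low 16 bits as an MTU.
MTU_MARKER = 0xF
MIN_IPV6_MTU = 1280  # values below this are treated as invalid


def encode_mtu(mtu: int) -> int:
    """Return the 20-bit FlowLabel value carrying the given MTU."""
    if not MIN_IPV6_MTU <= mtu <= 0xFFFF:
        raise ValueError(f"MTU {mtu} does not fit the encoding")
    return (MTU_MARKER << 16) | mtu


def decode_mtu(flow_label: int):
    """Return the MTU, or None for a normal (RFC6437) or invalid label."""
    flow_label &= 0xFFFFF  # the field is only 20 bits wide
    if flow_label >> 16 != MTU_MARKER:
        return None
    mtu = flow_label & 0xFFFF
    return mtu if mtu >= MIN_IPV6_MTU else None
```

With this, encode_mtu(1480) gives 0xf05c8 and encode_mtu(1280) gives
0xf0500, while decode_mtu(0) returns None (no MTU tagged).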

Thus given a network like:

  [H1]
2001:db8:1500::1/64
   | mtu = 1500
2001:db8:1500::a/64
  [RA]
2001:db8:1501::a/64
   | mtu = 1500
2001:db8:1501::b/64
  [RB]
2001:db8:1480::b/64
   | mtu = 1480
2001:db8:1480::c/64
  [RC]
2001:db8:1280::c/64
   | mtu = 1280
2001:db8:1280::d/64
  [RD]
2001:db8:9000::d/64
   | mtu = 9000
2001:db8:9000::2/64
  [H2]

RA receives packet, src+dst interface are MTU=1500, thus does nothing
RB receives packet, src = 1500, dst = 1480, thus sets FL = 0xf05c8
RC receives packet, src = 1480, dst = 1280, thus sets FL = 0xf0500
RD receives packet, src = 1280, dst = 9000, thus sets FL = 0xf0500
                             (again, just set is quicker than checking)
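
The hop-by-hop behaviour above can be replayed in a few lines of Python.
This is a sketch under one assumption that is only implied by the trace:
a router writes the smaller of its ingress and egress link MTUs whenever
that value is below 1500. The name forward() is invented for the example.

```python
NORM_MTU = 1500  # the "norm"; only smaller MTUs get signalled


def forward(flow_label: int, ingress_mtu: int, egress_mtu: int) -> int:
    """Return the FlowLabel a router would send on (assumed rule)."""
    lowest = min(ingress_mtu, egress_mtu)
    if lowest < NORM_MTU:
        # Just set it; no check of what an earlier hop wrote.
        return 0xF0000 | lowest
    return flow_label


# (router, ingress link MTU, egress link MTU) for the example network
path = [("RA", 1500, 1500), ("RB", 1500, 1480),
        ("RC", 1480, 1280), ("RD", 1280, 9000)]

flow_label = 0  # H1 sends with an unset FlowLabel
trace = []
for name, ingress, egress in path:
    flow_label = forward(flow_label, ingress, egress)
    trace.append((name, hex(flow_label)))

for name, value in trace:
    print(name, value)
```

Because each router sets the field blindly, a later hop between two
larger links (say 1480 and 1500) would overwrite an earlier 1280 with
1480 under this rule.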

Now even if H2 is a loadbalancer, if the flow is just forwarded (without
TTL change btw...) the destination receives it correctly.

The disadvantage is of course that you lose the ability to balance based
on the FlowLabel, but if we go with "only set when the field is 0" then
there was no flow label to balance on anyway. Also, you still have
src+dst, which is 256 bits and should be pretty good already, and
optionally next-header + the contents of that header if you want that.

Note that as we have no header checksum in IPv6, there is little
overhead in doing this while forwarding: HopLimit already needs
updating, and this is just another field to update.
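
To show how small that update is, here is a sketch in Python that
rewrites the FlowLabel and decrements HopLimit in a raw 40-byte IPv6
header. The offsets follow the fixed IPv6 header layout (version, traffic
class and flow label in the first 32 bits, hop limit at byte 7); the
function names are hypothetical.

```python
import struct

IPV6_HEADER_LEN = 40
HOP_LIMIT_OFFSET = 7


def set_flow_label(header: bytes, flow_label: int) -> bytes:
    """Replace the low 20 bits of the first 32-bit word."""
    if len(header) < IPV6_HEADER_LEN:
        raise ValueError("truncated IPv6 header")
    (word0,) = struct.unpack_from("!I", header, 0)
    # Keep version (4 bits) and traffic class (8 bits), swap the label.
    word0 = (word0 & 0xFFF00000) | (flow_label & 0xFFFFF)
    out = bytearray(header)
    struct.pack_into("!I", out, 0, word0)
    return bytes(out)


def forward_header(header: bytes, flow_label: int) -> bytes:
    """Per-hop update: new FlowLabel plus the usual HopLimit decrement."""
    out = bytearray(set_flow_label(header, flow_label))
    if out[HOP_LIMIT_OFFSET] <= 1:
        raise ValueError("hop limit exceeded")
    out[HOP_LIMIT_OFFSET] -= 1
    # No checksum to recompute: the IPv6 header carries none.
    return bytes(out)


# Version 6, payload length 0, next header 59 (none), hop limit 64,
# all-zero addresses: just enough of a header to exercise the code.
example = struct.pack("!IHBB", 0x60000000, 0, 59, 64) + bytes(32)
updated = forward_header(example, 0xF05C8)
```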

In a variant of the above, we could even let every hop set the lowest
MTU known so far.

In that case, H1 would set 0xf05dc in the packet, and then it gets
lowered automatically. Which would also mean that a pure 9000 path would
nicely work suddenly as everybody knows that 9000 will fit :)
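
A sketch of this variant in Python, assuming the sender tags the packet
with its own link MTU and every router then lowers the value to its
egress link MTU if that is smaller (lower_mtu() and path_mtu() are names
invented for the example):

```python
def lower_mtu(flow_label: int, egress_mtu: int) -> int:
    """Lower an MTU-tagged FlowLabel to the egress MTU if smaller."""
    if flow_label >> 16 != 0xF:
        return flow_label  # normal flow label, leave it alone
    current = flow_label & 0xFFFF
    return 0xF0000 | min(current, egress_mtu)


def path_mtu(link_mtus):
    """Walk a list of link MTUs; return the MTU seen by the receiver."""
    flow_label = 0xF0000 | link_mtus[0]  # set by the sending host
    for mtu in link_mtus[1:]:
        flow_label = lower_mtu(flow_label, mtu)
    return flow_label & 0xFFFF


print(path_mtu([1500, 1500, 1480, 1280, 9000]))  # the network above
print(path_mtu([9000, 9000, 9000]))              # a pure 9000 path
```

Unlike the blind "just set" rule, the value here can only go down along
the path, so the receiver always sees the smallest link MTU.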

Greets,
 Jeroen
