Mechanisms for a multi-homed host to pick the best router
(My apologies, in advance, for the fact that this question is very long winded.)

I have a server which is multi-homed to N routers as shown below:

         +---+
    R1---|   |
         |   |
    R2---|   |
    ...  | S |
         |   |
    Rn---|   |
         +---+

This server is a host; it is not a router in the sense that it will never forward any packets (but it might run routing protocols as discussed below).

Also, for the sake of simplicity in this discussion, let's say this server only receives inbound TCP connections; it never initiates outbound TCP connections.

Finally, this server has a loopback address L. All traffic destined to the server uses address L as the destination address. All N routers have a static route to L and inject that route into their routing protocol (possibly as part of an aggregate).

Now, imagine the server receives an inbound connection from another host whose address is A. Thus, the TCP SYN packet which S receives has source address A and destination address L.

When the server sends TCP traffic for that same connection back to host A, it needs to pick one of the N routers; in other words, it needs to pick an outbound interface from its N interfaces.

Traditionally, this is done by doing a best-match lookup for address A in the forwarding table of the server.

One could install an ECMP default route which points to all N routers. In this case, the downstream router would essentially be picked at random (for each connection, assuming 5-tuple hashing).

The problem is that some routers are "better" than other routers in the sense that they are closer to the final destination address A. (For example, each router could be connected to a different ISP.)

One way for the server to pick the "optimal" downstream router is to run "stub BGP" between the server and each of the routers. By "stub BGP" I mean that the server uses the BGP session only to learn routes.
It advertises its own loopback L, but it never advertises any other routes, and it never propagates any routes from one BGP session to another BGP session.

The server would have N BGP sessions and learn the full default-free BGP route table over each of those sessions. In other words, the server would end up with approximately N x 250,000 routes in its RIB and 250,000 routes in its FIB.

While this approach would certainly allow the server to pick the optimal downstream router in all cases, I would prefer not to run routing protocols on this server for a number of reasons:

- I don't want to spend the memory and CPU on such large RIBs and FIBs.

- I'm afraid that other routers will attempt to forward traffic through the server (due to accidental misconfigurations) once it starts participating in the routing protocols.

- Since there might be many of these servers (many more than the number of routers) I might end up stretching the routers beyond their scaling limits (number of BGP sessions, link state database size, etc.) and destabilizing the network.

- I know there are good open source implementations of routing protocols, but still, I'm nervous that any instability or bugs on the servers could end up screwing up the routers (e.g. persistent BGP flaps).

- One possible variation is that the server is a client of some route-reflectors which are not in the forwarding path (i.e. next-hop-self is not enabled). In that case, I might end up needing to do BGP next-hop resolution for a very large number of BGP next-hops. This, in turn, implies that the server might need to also run OSPF in a very large flat area 0.

For all these reasons, I don't want to run BGP on the server.

Someone suggested an idea to me which seems almost too simple to work, but I cannot find any good reason why it would not work.
The idea is "the server simply sends all outbound traffic for the TCP connection out over the same interface over which the most recent TCP traffic for that connection was received".

So, for example, if the server receives the SYN from router R3, it would send the SYN ACK and all subsequent packets for the TCP connection over that same interface R3. If the inbound packets for that same TCP connection start arriving from a different router (e.g. because of link failure), say R4, then the server also switches the outbound packets to that same router R4.

I am aware that routing is not always symmetrical. In other words, I am aware that the best route from A to Z might be A->B->C->Z while the best route from Z to A might be Z->D->A. However, since the IP routing tables form an inverted tree, it seems to me that in realistic scenarios the traffic should still arrive at A (maybe over a non-optimal path in rare cases) if Z sends the reverse traffic to C instead of D. It seems unlikely (impossible?) that this would cause a routing loop.

I can think of the following problems with this approach:

(Problem 1) It only works for inbound TCP connections and not for outbound TCP connections. For outbound TCP connections we would not know which router to send the first SYN packet to.

(Problem 2) If there is a topology change after the TCP connection has been established, the traffic might follow a sub-optimal path. In extreme cases, after a topology change the TCP connection might be lost despite the availability of an alternative feasible path. For example, if server S is using router R1 to reach host A, and router R1 can no longer reach host A, and host A is not sending *any* traffic to server S, then server S is not going to find out about the topology change (since S doesn't receive any traffic from A), and the connection is going to be dropped (because S's retransmissions will never reach A).
Many application layer protocols have a keep-alive message at the application layer; in that case the problem would not occur because host A would always be sending some traffic to server S, which would allow S to discover the topology change by virtue of A's traffic arriving from a different router.

My questions for the NANOG community are these:

(Question 1) Can you think of any additional problems with this approach? Specifically, I am interested in persistent failures in the absence of topology changes.

(Question 2) Is there another mechanism for the server (a multi-homed host) to pick a best router, short of running stub BGP? Are there any standards for this?

(Question 3) If the answer to question 2 is "no", is there any interest in standardizing a protocol for a multi-homed host to pick a best next-hop router? One could think of this as a host-to-router routing protocol. One might call the existing routing protocols router-to-router protocols (because I think we are abusing them by running them on hosts). This is somewhat analogous to the multicast routing world where we use different protocols for router-to-router multicast (PIM) versus host-to-router (IGMP).

-- Cayle
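A minimal sketch of the "reply on the interface the most recent packet arrived on" idea from the post above, in Python. It assumes the receive path can report which interface each inbound packet arrived on. The names FlowKey and ReturnPathTable, the addresses, and the interface labels are made up for illustration and are not taken from any implementation mentioned in this thread.

```python
from typing import Dict, NamedTuple, Optional


class FlowKey(NamedTuple):
    """Identifies one TCP connection as seen from the server (hypothetical type)."""
    remote_addr: str
    remote_port: int
    local_addr: str
    local_port: int


class ReturnPathTable:
    """Remembers, per connection, the interface the latest inbound packet used."""

    def __init__(self) -> None:
        self._last_ingress: Dict[FlowKey, str] = {}

    def on_receive(self, flow: FlowKey, ingress_if: str) -> None:
        # Every inbound packet overwrites the entry, so a change of upstream
        # path is noticed as soon as the peer sends anything again.
        self._last_ingress[flow] = ingress_if

    def egress_for(self, flow: FlowKey) -> Optional[str]:
        # None means no inbound packet has been seen for this connection,
        # which is the situation for a connection the server initiates.
        return self._last_ingress.get(flow)

    def on_close(self, flow: FlowKey) -> None:
        # Forget the connection once it is torn down.
        self._last_ingress.pop(flow, None)


table = ReturnPathTable()
flow = FlowKey("198.51.100.7", 40000, "192.0.2.1", 80)

table.on_receive(flow, "R3")      # SYN arrives via R3
first = table.egress_for(flow)    # "R3": the SYN ACK goes back via R3

table.on_receive(flow, "R4")      # later packets arrive via R4
second = table.egress_for(flow)   # "R4": outbound traffic follows
```

The second on_receive call models the link-failure case in the post. The entry only changes when the peer sends something, so an idle peer leaves a stale entry in place, which is exactly Problem 2.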
Cayle Spandon wrote:
I have a server which is multi-homed to N routers as shown below:
         +---+
    R1---|   |
         |   |
    R2---|   |
    ...  | S |
         |   |
    Rn---|   |
         +---+
This server is a host; it is not a router in the sense that it will never forward any packets (but it might run routing protocols as discussed below).
This is going to be the stupid question of the day, but unless you have a route policy (in which case, what was the question again?) why would you not send the reply out the same spigot you got the request on?
Hi Laurence,

RE> why would you not send the reply out the same spigot you got the request on?

Yes, that is exactly what I was trying to ask in the e-mail (in a much more verbose way than you :-).

The problems I could think of are:

- It only works for inbound TCP connections.

- The TCP connections might be dropped after a topology change despite the existence of an alternative path.

I was wondering if anyone else knows of any additional problems.

-- Cayle
cayle.spandon@gmail.com ("Cayle Spandon") writes:
(My apologies, in advance, for the fact that this question is very long winded.)
np.
I have a server which is multi-homed to N routers as shown below:
         +---+
    R1---|   |
         |   |
    R2---|   |
    ...  | S |
         |   |
    Rn---|   |
         +---+
This server is a host; it is not a router in the sense that it will never forward any packets (but it might run routing protocols as discussed below).
Also, for the sake of simplicity in this discussion, let's say this server only receives inbound TCP connections; it never initiates outbound TCP connections.
Finally, this server has a loopback address L. All traffic destined to the server uses address L as the destination address. All N routers have a static route to L and inject that route into their routing protocol (possibly as part of an aggregate).
Now, imagine the server receives an inbound connection from another host whose address is A. Thus, the TCP SYN packet which S receives has source address A and destination address L. ... For all these reasons, I don't want to run BGP on the server.
"too many moving parts."
Someone suggested an idea to me which seems almost too simple to work, but I cannot find any good reason why it would not work.
The idea is "the server simply sends all outbound traffic for the TCP connection out over the same interface over which the most recent TCP traffic for that connection was received".
So, for example, if the server receives the SYN from router R3, it would send the SYN ACK and all subsequent packets for the TCP connection over that same interface R3. ...
right idea. works great. see the following:

http://www.academ.com/nanog/feb1997/multihoming.html
http://www.irbs.net/internet/nanog/9706/0232.html
http://gatekeeper.dec.com/pub/misc/vixie/ifdefault/
... I can think of the following problems with this approach:
(Problem 1) It only works for inbound TCP connections and not for outbound TCP connections. For outbound TCP connections we would not know which router to send the first SYN packet to.
you said above you only needed inbound. for outbound and udp: round robin.
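One way to read the round-robin suggestion, as a hedged Python sketch: when there is no inbound packet to copy the interface from (an outbound connection or a UDP datagram), hand out the routers in turn. RoundRobinPicker and the router labels are hypothetical names used only for this illustration.

```python
from itertools import cycle
from typing import Iterator, Sequence


class RoundRobinPicker:
    """Hands out next-hop routers in strict rotation (illustrative only)."""

    def __init__(self, routers: Sequence[str]) -> None:
        if not routers:
            raise ValueError("need at least one router")
        # cycle() repeats the list forever in its original order.
        self._rotation: Iterator[str] = cycle(list(routers))

    def pick(self) -> str:
        # Each call returns the next router, wrapping around at the end.
        return next(self._rotation)


picker = RoundRobinPicker(["R1", "R2", "R3"])
choices = [picker.pick() for _ in range(4)]  # ["R1", "R2", "R3", "R1"]
```

For TCP the pick would be made once per connection rather than once per packet, so that a connection's packets stay on one path.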
... My questions for the NANOG community are these:
(Question 1) Can you think of any additional problems with this approach? Specifically, I am interested in persistent failures in the absence of topology changes.
topology change frequency is in a different order of magnitude than the usual tcp session startup frequency, so unless you've got long running tcp sessions which won't restart on a connection reset, you've got no problem, and if you do have that kind of tcp session, you've already got problems.
(Question 2) Is there another mechanism for the server (a multi-homed host) to pick a best router, short of running stub BGP? Are there any standards for this?
there are a bazillion patented and/or ubersecret ways to do this. no one has ever demonstrated anything that works better than an undercommitted network with undercommitted connections to other undercommitted first-hop networks.
(Question 3) If the answer to question 2 is "no", is there any interest in standardizing a protocol for a multi-homed host to pick a best next-hop router? One could think of this as a host-to-router routing protocol. One might call the existing routing protocols router-to-router protocols (because I think we are abusing them by running them on hosts). This is somewhat analogous to the multicast routing world where we use different protocols for router-to-router multicast (PIM) versus host-to-router (IGMP).
sadly, this has been tried, but it always runs into least-cost routing issues whereby not only the predicted connection quality but also contract details like whether this is over or under the per-period minima and how many quatloos per kilosegment it will cost all have to get exchanged at high speed with low latency and good accuracy. been there, did that, got no useful t-shirt even.

-- Paul Vixie
Hi Paul,

Thank you very much for the confirmation that the idea is sane and for the pointers to the additional information.

-- Cayle

On Wed, Sep 17, 2008 at 11:49 PM, Paul Vixie <vixie@isc.org> wrote:
...
So, for example, if the server receives the SYN from router R3, it would send the SYN ACK and all subsequent packets for the TCP connection over that same interface R3. ...
right idea. works great. see the following:
http://www.academ.com/nanog/feb1997/multihoming.html
http://www.irbs.net/internet/nanog/9706/0232.html
http://gatekeeper.dec.com/pub/misc/vixie/ifdefault/

This approach is particularly useful for hosts with multiple IPv6 tunnels. Some tunnel providers implement strict RPF, some don't. Where this is the case, having multiple tunnels (cf multiple address ranges) is problematic.

Of course these days perhaps the IPv4 variant could be done with a stateful NAT. Maybe a case could be made for IPv6 NAT (and site-local addresses?) in this scenario...

-w
--
William Waites <ww@styx.org>
http://www.irl.styx.org/ +49 30 8894 9942
CD70 0498 8AE4 36EA 1CD7 281C 427A 3F36 2130 E9F5
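A small Python sketch of why strict RPF makes multiple tunnels awkward, under simplifying assumptions: each tunnel provider is modelled as accepting only source addresses inside the prefix it delegated to that tunnel. The tunnel names and the prefixes (taken from the 2001:db8::/32 documentation range) are invented for the example.

```python
from ipaddress import ip_address, ip_network
from typing import Optional

# Hypothetical tunnels and the prefix each provider delegated to the host.
TUNNEL_PREFIXES = {
    "tun0": ip_network("2001:db8:a::/48"),
    "tun1": ip_network("2001:db8:b::/48"),
}


def passes_strict_rpf(src: str, tunnel: str) -> bool:
    """Simplified provider-side check: the source must lie in that tunnel's prefix."""
    return ip_address(src) in TUNNEL_PREFIXES[tunnel]


def tunnel_for_source(src: str) -> Optional[str]:
    """Pick the tunnel whose delegated prefix contains the packet's source address."""
    addr = ip_address(src)
    for name, prefix in TUNNEL_PREFIXES.items():
        if addr in prefix:
            return name
    return None  # source address does not belong to any tunnel


ok = passes_strict_rpf("2001:db8:a::1", "tun0")       # True
dropped = passes_strict_rpf("2001:db8:b::1", "tun0")  # False: wrong tunnel
```

A request addressed to the tun1 range arrives over tun1, so sending the reply back out the interface the request came in on keeps the source address and the tunnel matched without any extra lookup.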
On Wed, 17 Sep 2008 22:32:29 EDT, Cayle Spandon said:
(Problem 2) If there is a topology change after the TCP connection has been established, the traffic might follow a sub-optimal path.
Another possibility is that the connection was originally established *during* a link outage, so the initial part of the connection was done over a sub-optimal path, and that the topology change puts it back to the normal better path...

A possible *bigger* issue is that "toss the reply packet back where the original came from" makes traffic-engineering your outbound packets a lot more challenging: you end up having to play announcement games upstream of your N routers to engineer your *inbound* traffic so your outbound packets do what you want.
participants (5)
- Cayle Spandon
- Laurence F. Sheldon, Jr.
- Paul Vixie
- Valdis.Kletnieks@vt.edu
- William Waites