Re: [c-nsp] Devil's Advocate - Segment Routing, Why?
On 19/Jun/20 09:32, Saku Ytti wrote:
We need to decide if we are discussing a specific market situation or fundamentals. Ideally we'd drive the market to what is fundamentally most efficient, so that we pay the least amount for the kit that we use. If we drive SRv6, we drive cost up; if we drive MPLS, we drive cost down.
Even today, in many cases, you can take a cheap L2 chip that has no clue about IPv4 or IPv6 and make it an MPLS switch, because it supports VLAN swap!
We need a new toy.

MPLS has been around far too long, and if you post web site content still talking about it or show up at conferences still talking about it, you fear that you can't sell more boxes and line cards on the back of "just" broadening carriage pipes.

So we need to invent a new toy around which we can wrap a story about "adding value to your network" to "drive new business" and "reduce operating costs", to entice money to leave wallets, when all that's really still happening is the selling of more boxes and line cards, so that we can continue to broaden carriage pipes.

There are very few things that have been designed well from scratch, and stand the test of time regardless of how much wind is thrown at them. MPLS is one of those things, IMHO. Nearly 20 years to the day since inception, and I still can't find another packet forwarding technology that remains as relevant and versatile as it is simple.

Mark.
On Fri, 19 Jun 2020 at 11:03, Mark Tinka <mark.tinka@seacom.mu> wrote:
MPLS has been around far too long, and if you post web site content still talking about it or show up at conferences still talking about it, you fear that you can't sell more boxes and line cards on the back of "just" broadening carriage pipes.
So we need to invent a new toy around which we can wrap a story about "adding value to your network" to "drive new business" and "reduce operating costs", to entice money to leave wallets, when all that's really still happening is the selling of more boxes and line cards, so that we can continue to broaden carriage pipes.
I need to give a little bit of credit to DC people. If your world is compute and you are looking out to networks, MPLS _is hard_; it's _harder_ to generate MPLS packets in Linux than an arbitrarily complex IP stack. Now instead of fixing that on the OS stack, to have a great ecosystem of software to deal with MPLS, which is easy to fix, we are fixing it in silicon, which is hard and expensive to fix. So instead of making it easy for software to generate MPLS packets, we are making it easy for hardware to generate complex IP packets (see the sketch below). Bizarre, but somewhat rational if you start from compute looking out to networks, instead of starting from networks.
There are very few things that have been designed well from scratch, and stand the test of time regardless of how much wind is thrown at them. MPLS is one of those things, IMHO. Nearly 20 years to the day since inception, and I still can't find another packet forwarding technology that remains as relevant and versatile as it is simple.
Mark.
-- ++ytti
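On Saku's point about generating MPLS packets from Linux: it is doable in a few lines of userspace code, though admittedly not the path of least resistance. Below is a minimal sketch using Scapy's contrib MPLS layer; the labels, addresses, MAC and interface name are invented for illustration. The kernel's own lightweight-tunnel support (iproute2's "encap mpls") is another option, if memory serves, but needs the MPLS modules loaded and configured first.

#!/usr/bin/env python3
# Minimal sketch: build and send one MPLS-labelled IPv4/UDP packet from a
# Linux host with Scapy. Labels, addresses, MAC and interface are invented.
from scapy.all import Ether, IP, UDP, Raw, sendp
from scapy.contrib.mpls import MPLS

frame = (
    Ether(dst="00:11:22:33:44:55", type=0x8847)  # 0x8847 = MPLS unicast
    / MPLS(label=100, s=0, ttl=64)               # transport label
    / MPLS(label=200, s=1, ttl=64)               # service label, bottom of stack
    / IP(src="192.0.2.1", dst="198.51.100.1")
    / UDP(sport=12345, dport=5000)
    / Raw(b"payload")
)

sendp(frame, iface="eth0", verbose=False)        # needs root / CAP_NET_RAW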
On 19/Jun/20 10:18, Saku Ytti wrote:
I need to give a little bit of credit to DC people. If your world is compute and you are looking out to networks, MPLS _is hard_; it's _harder_ to generate MPLS packets in Linux than an arbitrarily complex IP stack. Now instead of fixing that on the OS stack, to have a great ecosystem of software to deal with MPLS, which is easy to fix, we are fixing it in silicon, which is hard and expensive to fix.
So instead of making it easy for software to generate MPLS packets, we are making it easy for hardware to generate complex IP packets. Bizarre, but somewhat rational if you start from compute looking out to networks, instead of starting from networks.
Which I totally appreciate and, fundamentally, have nothing against.

My concern is when we, service providers, start to get affected because equipment manufacturers need to follow the data centre money hard, often at our expense. This is not only in the IP world, but also in the Transport world, where service providers are having to buy DWDM gear formatted for DCI. Yes, it does work, but it's not without its eccentricities.

Cycling, over the past decade, between TRILL, OTV, SPB, FabricPath, VXLAN, NV-GRE, ACI... and perhaps even EVPN, there is probably a lesson to be learned.

Mark.
On Fri, 19 Jun 2020 at 11:35, Mark Tinka <mark.tinka@seacom.mu> wrote:
So instead of making it easy for software to generate MPLS packets, we are making it easy for hardware to generate complex IP packets. Bizarre, but somewhat rational if you start from compute looking out to networks, instead of starting from networks.
Which I totally appreciate and, fundamentally, have nothing against.
My concern is when we, service providers, start to get affected because equipment manufacturers need to follow the data centre money hard, often at our expense. This is not only in the IP world, but also in the Transport world, where service providers are having to buy DWDM gear formatted for DCI. Yes, it does work, but it's not without its eccentricities.
Cycling, over the past decade, between TRILL, OTV, SPB, FabricPath, VXLAN, NV-GRE, ACI... and perhaps even EVPN, there is probably a lesson to be learned.
Maybe this is fundamental and unavoidable, maybe some systematic bias in human thinking drives us towards simple software and complex hardware.

Is there an alternative future, where we went with Itanium? Where we have simple hardware and an increasingly complex compiler and increasingly complex runtime making sure the program runs fast on that simple hardware? Instead we have two guys in tel aviv waking up in night terror every night over confusion why does x86 run any code at all, how come it works. And I'm at home 'hehe new intc make program go fast:)))'

Now that we have comparatively simple compilers and often no runtime at all, the hardware has to optimise the shitty program for us, but as we don't get to see how the sausage is made, we think it's probably something that is well done, robust and correct. If we'd do this in software, we'd all have to suffer how fragile the compiler and runtime are and how unapproachable they are.

-- ++ytti
On 19/Jun/20 10:40, Saku Ytti wrote:
Maybe this is fundamental and unavoidable, maybe some systematic bias in human thinking drives us towards simple software and complex hardware.
Is there an alternative future, where we went with Itanium? Where we have simple hardware and an increasingly complex compiler and increasingly complex runtime making sure the program runs fast on that simple hardware? Instead we have two guys in tel aviv waking up in night terror every night over confusion why does x86 run any code at all, how come it works. And I'm at home 'hehe new intc make program go fast:)))'
Now that we have comparatively simple compilers and often no runtime at all, the hardware has to optimise the shitty program for us, but as we don't get to see how the sausage is made, we think it's probably something that is well done, robust and correct. If we'd do this in software, we'd all have to suffer how fragile the compiler and runtime are and how unapproachable they are.
So this brings back a discussion you and I had last year about a scenario where the market shifts toward open vendor in-house silicon, sold as a PCI card one can stick in a server. Trio, ExpressPlus, Lightspeed, Silicon One, Cylon, QFP, e.t.c., with open specs so that folk can code for them and see what happens.

At the moment, everyone is coding for x86 as an NPU, and we know that path is not the cheapest or most efficient for packet forwarding.

Vendors may feel a little skittish about "giving away" their IP, but I don't think it's an issue because:

* The target market is folk currently coding for x86 CPU's to run as NPU's.

* No one is about to run 100Gbps backbones on a PCI card. But hey, the world does surprise :-).

* Writing code for forwarding traffic as well as control plane protocols is not easy. Buying a finished product from an equipment vendor will be the low-hanging fruit for most of the landscape.

It potentially also has the positive side effect of getting Broadcom to raise their game, which would make them a more viable option for operators with significant high-touch requirements.

As we used to say in Vladivostok, "It could be a win win" :-).

Mark.
Maybe this is fundamental and unavoidable, maybe some systematic bias in human thinking drives us towards simple software and complex hardware.
how has software progressed in the last 50-70 years as compared to hardware? my watch has how many orders of magnitude of compute vs the computer which took a large room when i was but a youth? and we have barely avoided writing stacks in cobol; but we're close, assembler++. randy
Mark Tinka wrote:
MPLS has been around far too long, and if you post web site content still talking about it or show up at conferences still talking about it, you fear that you can't sell more boxes and line cards on the back of "just" broadening carriage pipes.
The problem of MPLS, or label switching in general, is that, though it was advertised as topology driven to scale better than flow driven, it is actually flow driven, with poor scalability. Thus, it is impossible to deploy any technology scalably over MPLS.

MPLS was considered to scale because it supports nested labels corresponding to a hierarchical, thus scalable, routing table.

However, to assign nested labels at the source, the source must know the hierarchical routing table at the destination, even though the source only knows the hierarchical routing table at the source itself.

So, the routing table must be flat, which does not scale, or the source must detect flows to somehow request the hierarchical destination routing table on demand, which means MPLS is flow driven.

People, including some data center people, avoiding MPLS know network scalability better than those deploying MPLS.

It is true that some performance improvement is possible with label switching in flow driven ways, if flows are manually detected. But it means extra label-switching-capable equipment and administrative effort to detect flows, neither of which scales, and both cost a lot.

It costs a lot less to have more plain IP routers than to insist on having slightly fewer MPLS routers.

Masataka Ohta
On 19/Jun/20 16:45, Masataka Ohta wrote:
The problem of MPLS, or label switching in general, is that, though it was advertised to be topology driven to scale better than flow driven, it is actually flow driven with poor scalability.
Thus, it is impossible to deploy any technology scalably over MPLS.
MPLS was considered to scale, because it supports nested labels corresponding to hierarchical, thus, scalable, routing table.
However, to assign nested labels at the source, the source must know hierarchical routing table at the destination, even though the source only knows hierarchical routing table at the source itself.
So, the routing table must be flat, which does not scale, or the source must detect flows to somehow request the hierarchical destination routing table on demand, which means MPLS is flow driven.
People, including some data center people, avoiding MPLS, know network scalability better than those deploying MPLS.
It is true that some performance improvement is possible with label switching in flow driven ways, if flows are manually detected. But it means extra label-switching-capable equipment and administrative effort to detect flows, neither of which scales, and both cost a lot.
It costs a lot less to have more plain IP routers than to insist on having slightly fewer MPLS routers.
I wouldn't agree.

MPLS is a purely forwarding paradigm, as is hop-by-hop IP. Even with hop-by-hop IP, you need the edge to be routing-aware. I wasn't at the table when the MPLS spec. was being dreamed up, but I'd find it very hard to accept that someone drafting the idea advertised it as being a replacement or alternative for end-to-end IP routing and forwarding.

Whether you run MPLS or not, you will always have routing table scaling concerns. So I'm not quite sure how that is MPLS's problem. If you can tell me how NOT running MPLS affords you a "hierarchical, scalable" routing table, I'm all ears.

Whether you forward in IP or in MPLS, scaling routing is an ever clear & present concern. Where MPLS can directly mitigate that particular concern is in the core, where you can remove BGP. But you still need routing in the edge, whether you forward in IP or MPLS.

Mark.
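To put Mark's "remove BGP from the core" point in concrete terms, here is a toy model, with all labels, interface and router names invented: the P router holds only a small label table keyed on PE loopbacks learned via the IGP plus label distribution, while the BGP routes live exclusively on the PEs.

# Toy model of a BGP-free core: the P router's forwarding state is just a
# label table for PE loopbacks, while only the PEs hold BGP routes.
# All values are invented for illustration.

# P router: label forwarding table (in-label -> (out-label, out-interface))
p_lfib = {
    16001: (16001, "to-P2"),   # swap towards PE1's loopback
    16002: (16002, "to-PE2"),  # a real network might pop here instead
}

# Ingress PE: BGP routes resolve onto a label switched path to the egress PE
pe1_bgp = {
    "203.0.113.0/24": {"egress_pe": "PE2", "lsp_label": 16002},
}

def p_forward(in_label: int):
    """What the core does: one exact-match label lookup, no IP/BGP state."""
    out_label, out_if = p_lfib[in_label]
    return out_label, out_if

def pe_forward(prefix: str):
    """What the edge does: BGP lookup, then push the transport label for the LSP."""
    route = pe1_bgp[prefix]
    return ["push", route["lsp_label"], "towards", route["egress_pe"]]

print(pe_forward("203.0.113.0/24"))  # ['push', 16002, 'towards', 'PE2']
print(p_forward(16002))              # (16002, 'to-PE2')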
Hi Mark,

As someone who was actually at that table you are referring to, I must say that MPLS was never proposed as a replacement for IP. MPLS was, since day one, proposed as an enabler for services, originally L3VPNs and RSVP-TE. Then a bunch of others jumped on the same encapsulation train.

If, at that very time, the GSR had been able to do proper GRE encapsulation at line rate in all of its engines, MPLS for transport would never have taken off. As service demux - sure, but that is completely separate.

But since at that time shipping hardware could not do the right encapsulation, and since SPs were looking for more revenue and a new way to move ATM and FR customers to IP backbones, L3VPN was proposed, which really required hiding the service addresses from the core. So some form of encapsulation was a MUST. Hence tag switching, then MPLS switching, was rolled out.

So I think Ohta-san's point is about scalability of services, not flat underlay RIB and FIB sizes. Many years ago we had requests to support 5M L3VPN routes while the underlay was just 500K IPv4.

Last, when I originally discussed plain MPLS with customers, with the single application of hierarchical routing (no BGP in the core), frankly no one was interested. Till L3VPN arrived, which was a game changer and a run for new revenue streams ...

Best,
R.

On Fri, Jun 19, 2020 at 5:00 PM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 19/Jun/20 16:45, Masataka Ohta wrote:
The problem of MPLS, or label switching in general, is that, though it was advertised to be topology driven to scale better than flow driven, it is actually flow driven with poor scalability.
Thus, it is impossible to deploy any technology scalably over MPLS.
MPLS was considered to scale, because it supports nested labels corresponding to hierarchical, thus, scalable, routing table.
However, to assign nested labels at the source, the source must know hierarchical routing table at the destination, even though the source only knows hierarchical routing table at the source itself.
So, the routing table must be flat, which does not scale, or the source must detect flows to somehow request the hierarchical destination routing table on demand, which means MPLS is flow driven.
People, including some data center people, avoiding MPLS, know network scalability better than those deploying MPLS.
It is true that some performance improvement is possible with label switching in flow driven ways, if flows are manually detected. But it means extra label-switching-capable equipment and administrative effort to detect flows, neither of which scales, and both cost a lot.
It costs a lot less to have more plain IP routers than to insist on having slightly fewer MPLS routers.
I wouldn't agree.
MPLS is a purely forwarding paradigm, as is hop-by-hop IP. Even with hop-by-hop IP, you need the edge to be routing-aware.
I wasn't at the table when the MPLS spec. was being dreamed up, but I'd find it very hard to accept that someone drafting the idea advertised it as being a replacement or alternative for end-to-end IP routing and forwarding.
Whether you run MPLS or not, you will always have routing table scaling concerns. So I'm not quite sure how that is MPLS's problem. If you can tell me how NOT running MPLS affords you a "hierarchical, scalable" routing table, I'm all ears.
Whether you forward in IP or in MPLS, scaling routing is an ever clear & present concern. Where MPLS can directly mitigate that particular concern is in the core, where you can remove BGP. But you still need routing in the edge, whether you forward in IP or MPLS.
Mark.
On Jun 19, 2020, at 11:34 AM, Randy Bush <randy@psg.com> wrote:
MPLS was since day one proposed as enabler for services originally L3VPNs and RSVP-TE.
MPLS day one was mike o'dell wanting to move his city/city traffic matrix from ATM to tag switching and open cascade's hold on tags.
And IIRC, Tag switching day one was Cisco overreacting to Ipsilon. -dorian
< ranting of a curmudgeonly old privileged white male >
MPLS was since day one proposed as enabler for services originally L3VPNs and RSVP-TE. MPLS day one was mike o'dell wanting to move his city/city traffic matrix from ATM to tag switching and open cascade's hold on tags. And IIRC, Tag switching day one was Cisco overreacting to Ipsilon.
i had not thought of it as overreacting; more embrace and devour. mo and yakov, aided and abetted by sob and other ietf illuminati, helped cisco take the ball away from Ipsilon, Force10, ...

but that is water over the damn, and my head is hurting a bit from thinking on too many levels at once.

there is saku's point of distributing labels in IGP TLVs/LSAs. i suspect he is correct, but good luck getting that anywhere in the internet vendor task force. and that tells us a lot about whether we can actually effect useful simplification and change.

is a significant part of the perception that there is a forwarding problem the result of the vendors, 25 years later, still not designing for v4/v6 parity?

there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.

did we really learn so little from IP routing that we need to recreate analogous complexity and fragility in the MPLS control plane? ( sound of steam emanating from saku's ears :)

and then there is buffering; which seems more serious than simple forwarding rate. get it there faster so it can wait in a queue? my principal impression of the Stanford/Google workshops was the parable of the blind men and the elephant. though maybe Matt had the main point: given scaling 4x, Moore's law can not save us and it will all become paced protocols. will we now have a decade+ of BBR evolution and tuning? if so, how do we engineer our networks for that?

and up 10,000m, we watch vendor software engineers hand crafting in an assembler language with if/then/case/for, and running a chain of checking software to look for horrors in their assembler programs. it's the bleeping 21st century. why are the protocol specs not formal and verified, and the code formally generated and verified? and don't give me too slow given that the hardware folk seem to be able to do 10x in the time it takes to run valgrind a few dozen times.

we're extracting ore with hammers and chisels, and then hammering it into shiny objects rather than safe and securable network design and construction tools.

apologies. i hope you did not read this far.

randy
Randy Bush wrote:
MPLS was since day one proposed as enabler for services originally L3VPNs and RSVP-TE. MPLS day one was mike o'dell wanting to move his city/city traffic matrix from ATM to tag switching and open cascade's hold on tags. And IIRC, Tag switching day one was Cisco overreacting to Ipsilon.
i had not thought of it as overreacting; more embrace and devour. mo and yakov, aided and abetted by sob and other ietf illuminati, helped cisco take the ball away from Ipsilon, Force10, ...
Ipsilon was hopeless because, as Yakov correctly pointed out, its flow driven approach to automatically detect flows does not scale.

The problem of MPLS, however, is that it must also be flow driven, because detailed route information at the destination is necessary to prepare nested labels at the source, which costs a lot and should be attempted only for detected flows.
there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.
The routing table at an IPv4 backbone today needs at most 16M entries, which can be looked up by simple SRAM as fast as an MPLS lookup; that is one of the reasons why we should obsolete IPv6.

Though resource reserved flows need their own routing table entries, they should be charged in proportion to the duration of the reservation, which can scale to afford the cost of having the entries.

Masataka Ohta
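Reading Ohta-san's 16M figure as 2^24, i.e. a flat array indexed by the top 24 bits of the destination address (on the assumption that nothing longer than a /24 is carried), the arithmetic looks roughly like the sketch below; the 16-bit next-hop index width is also an assumption, purely for illustration.

# Back-of-envelope for a flat /24-indexed IPv4 lookup table in SRAM.
# Assumes nothing longer than /24 is carried and a 16-bit next-hop index;
# both are illustrative assumptions, not figures from the thread.

entries = 2 ** 24            # one slot per /24 -> 16,777,216 entries
nexthop_index_bits = 16      # up to 65,536 distinct next hops / adjacencies
table_bytes = entries * nexthop_index_bits // 8

print(f"{entries:,} entries, {table_bytes / 2**20:.0f} MiB of SRAM")
# -> 16,777,216 entries, 32 MiB of SRAM

def lookup(table, dst_ip: int) -> int:
    """Single SRAM read: index by the top 24 bits of the destination."""
    return table[dst_ip >> 8]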
The problem of MPLS, however, is that, it must also be flow driven, because detailed route information at the destination is necessary to prepare nested labels at the source, which costs a lot and should be attempted only for detected flows.
MPLS is not flow driven. I sent some mail about it but perhaps it bounced.

MPLS LDP or L3VPNs were NEVER flow driven. Since day one till today, it was and still is purely destination based.

Transport is using an LSP to the egress PE (dst IP). L3VPNs are using either per-dst-prefix, per-CE or per-VRF labels. No implementation does anything upon "flow detection" to prepare any nested labels. Even in FIBs, all information is preprogrammed in hierarchical fashion well before any flow packet arrives (a toy sketch of this preprogrammed hierarchy follows the quoted text below).

Thx,
R.
there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.
The routing table at an IPv4 backbone today needs at most 16M entries, which can be looked up by simple SRAM as fast as an MPLS lookup; that is one of the reasons why we should obsolete IPv6.
Though resource reserved flows need their own routing table entries, they should be charged in proportion to the duration of the reservation, which can scale to afford the cost of having the entries.
Masataka Ohta
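A toy version of the preprogrammed hierarchy Robert describes, with invented labels and prefixes: the (VPN label, transport label) pair is fully resolved when the route is installed, and every packet towards that prefix, whatever flow it belongs to, simply picks up the same prebuilt stack.

# Toy ingress-PE forwarding state for an L3VPN, illustrating that label
# imposition is destination-based and installed ahead of time; nothing
# is computed per flow. Labels, prefixes and PE names are invented.
from ipaddress import ip_network, ip_address

# Installed when the VPN route is learned via MP-BGP, long before traffic:
vrf_cust_a = {
    ip_network("10.1.0.0/16"): {
        "vpn_label": 30042,        # service label advertised by the egress PE
        "transport_label": 16002,  # LSP label towards that PE's loopback
        "next_hop": "PE2",
    },
}

def forward(vrf, dst: str):
    dst = ip_address(dst)
    for prefix, entry in vrf.items():          # real FIBs use an LPM trie, of course
        if dst in prefix:
            # push the service label first, the transport label on top
            return ["push", entry["vpn_label"], entry["transport_label"],
                    "via", entry["next_hop"]]
    return ["drop"]

# Different flows to hosts in 10.1/16 all hit the same preinstalled entry:
for host in ("10.1.2.3", "10.1.200.9"):
    print(host, forward(vrf_cust_a, host))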
On 20/Jun/20 17:12, Robert Raszuk wrote:
MPLS is not flow driven. I sent some mail about it but perhaps it bounced.
MPLS LDP or L3VPNs was NEVER flow driven.
Since day one till today it was and still is purely destination based.
Transport is using LSP to egress PE (dst IP).
L3VPNs are using either per dst prefix, or per CE or per VRF labels. No implementation does anything upon "flow detection" - to prepare any nested labels. Even in FIBs all information is preprogrammed in hierarchical fashion well before any flow packet arrives.
If you really don't like LDP or RSVP-TE, you can statically assign labels and manually configure FEC's across your entire backbone. If trading state for administration is your thing, of course :-). Mark.
Robert Raszuk wrote:
MPLS LDP or L3VPNs was NEVER flow driven.
Since day one till today it was and still is purely destination based.
If information to create labels at or near the sources for all the possible destinations is distributed in advance, maybe.

But that is effectively flat routing, or, in extreme cases, flat host routing.

Or, if information to create labels for all the active destinations is supplied on demand, it is flow driven.

On day one, Yakov said MPLS scaled because of nested labels corresponding to the routing hierarchy.

Masataka Ohta
On 21/Jun/20 13:11, Masataka Ohta wrote:
If information to create labels at or near the sources for all the possible destinations is distributed in advance, maybe.
But this is what happens today. Whether you do it manually or use a label distribution protocol, FEC's are pre-computed ahead of time. What am I missing?
But it is effectively flat routing, or, in extreme cases, flat host routing.
I still don't get it.
Or, if information to create labels to all the active destinations is supplied on demand, it is flow driven.
What would the benefit of this be? Ingress and egress nodes don't come and go. They are stuck in racks in data centres somewhere, and won't disappear until a human wants them to. So why create labels on-demand if a box to handle the traffic is already in place and actively working, day-in, day-out? Mark.
Mark Tinka wrote:
If information to create labels at or near sources to all the possible destinations is distributed in advance, may be.
But this is what happens today.
That is a tragedy.
Whether you do it manually or use a label distribution protocol, FEC's are pre-computed ahead of time.
What am I missing?
If all the link-wise (or, worse, host-wise) information of possible destinations is distributed in advance to all the possible sources, it is not hierarchical but flat (host) routing, which scales poorly. Right?
But it is effectively flat routing, or, in extreme cases, flat host routing.
I still don't get it.
Why, do you think, does flat routing not scale but hierarchical routing does? It is because detailed information to reach destinations below a certain level is advertised not globally but only to a small part of the network around the destinations.

That is, with hierarchical routing, detailed information around destinations is actively hidden from sources.

So, with hierarchical routing, routing protocols can carry only rough information about destinations, from which the source side cannot construct the detailed (often purposelessly nested) labels required for MPLS.
So why create labels on-demand if a box to handle the traffic is already in place and actively working, day-in, day-out?
According to your theory of ignoring routing traffic, we could be happy with a global *host* routing table with 4G entries for IPv4, and a lot lot lot more than that for IPv6. CIDR would then be an unnecessary complication to the Internet.

With nested labels, you don't need so many labels at a certain nesting level, which was the point of Yakov; that does not mean you don't need so much information to create the entire nested labels at or near the sources.

The problem is that we can't afford the traffic (and associated processing by all the related routers or things like those) and storage (at or near the source) for routing (or MPLS, SR* or whatever) with such detailed routing at the destinations.

Masataka Ohta
Let's clarify a few things ...

On Sun, Jun 21, 2020 at 2:39 PM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:

If all the link-wise (or, worse, host-wise) information of possible destinations is distributed in advance to all the possible sources, it is not hierarchical but flat (host) routing, which scales poorly.

Right?
Neither link-wise nor host-wise information is required to accomplish, say, L3VPN services. Imagine you have three sites which would like to interconnect, each with 1000s of users. So all you are exchanging as part of the VPN overlay is three subnets. Moreover, if you have 1000 PEs and those three sites are attached to only 6 of them, only those 6 PEs will need to learn those routes (Hint: RTC - RFC 4684).

It is because detailed information to reach destinations below a certain level is advertised not globally but only to a small part of the network around the destinations.
Same thing here.
That is, with hierarchical routing, detailed information around destinations is actively hidden from sources.
Same thing here. That is why, as described, we use a label stack. The top label is responsible for getting you to the egress PE. The service label sitting behind the top label is responsible for getting you through to the customer site (with or without an IP lookup at the egress PE).
So, with hierarchical routing, routing protocols can carry only rough information around destinations, from which, source side can not construct detailed (often purposelessly nested) labels required for MPLS.
Usually sources have no idea of MPLS. MPLS to the host never took off.
According to your theory to ignore routing traffic, we can be happy with global *host* routing table with 4G entries for IPv4 and a lot lot lot more than that for IPv6. CIDR should be unnecessary complication to the Internet
I do not think anyone is saying that here.
With nested labels, you don't need so much labels at certain nesting level, which was the point of Yakov, which does not mean you don't need so much information to create entire nested labels at or near the sources.
The label stack has been here from day one. Each layer of the stack has a completely different role. That is your hierarchy.

Kind regards,
R.
Robert Raszuk wrote:
Neither link wise nor host wise information is required to accomplish say L3VPN services. Imagine you have three sites which would like to interconnect each with 1000s of users.
For a single customer of an ISP with 1000s of end users, OK. But it should be noted that a single class B routing table entry often serves an organization with 10000s of users, which is at least our case here at titech.ac.jp.

It should also be noted that my concern is scalability on the ISP side.
Moreover if you have 1000 PEs and those three sites are attached only to 6 of them - only those 6 PEs will need to learn those routes (Hint: RTC - RFC4684)
If you have 1000 PEs, you should be serving somewhere around 1000 customers. And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP.

Then, it should be a lot better to let customer edges encapsulate L2 or L3 over IP, with which routing information within customers is exchanged by a customer-provided VPN, without requiring the extra overhead of maintaining customer-local routing information in the ISP.

If a customer wants a customer-specific SLA, it can be described as an SLA between customer edge routers, for which intra-ISP MPLS may or may not be used.

For the ISP, it can be as profitable as PE-based VRF solutions, because customers relying on ISPs this much will let the ISP provide and maintain the customer edges.

The only difference should be on profitability for router makers, which want to make the routing system as complex as possible, or even a lot more than that, to make backbone routers a more profitable product.
With nested labels, you don't need so much labels at certain nesting level, which was the point of Yakov, which does not mean you don't need so much information to create entire nested labels at or near the sources.
Label stack is here from day one.
The label stack was there because of Yakov's statement on day one, now recognized to be wrong, and I can see no reason to still keep it.

Masataka Ohta
On 22/Jun/20 14:49, Masataka Ohta wrote:
But, it should be noted that a single class B...
CIDR - let's not teach the kids old news :-).
If you have 1000 PEs, you should be serving for somewhere around 1000 customers.
It's not linear. We probably have 1 edge router serving several-thousand customers.
And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP.
Yes, best practice is in iBGP. Some operators may still be using an IGP for this. It would work, but scales poorly.
Then, it should be a lot better to let customer edges encapsulate L2 or L3 over IP, with which, routing information within customers is exchanged by customer provided VPN without requiring extra overhead of maintaining customer local routing information by the ISP.
You mean like IP-in-IP or GRE? That already happens today, without any intervention from the ISP.
If a customer want customer-specific SLA, it can be described as SLA between customer edge routers, for which, intra-ISP MPLS may or may not be used.
l2vpn's and l3vpn's attract a higher SLA because the services are mostly provisioned on-net. If an off-net component exists, it would be via a trusted NNI partner. Regular IP or GRE tunnels don't come with these kinds of SLA's because the ISP isn't involved, and the B-end would very likely be off-net with no SLA guarantees between the A-end customer's ISP and the remote ISP hosting the B-end.
For the ISP, it can be as profitable as PE-based VRF solutions, because customers so relying on ISPs will let the ISP provide and maintain customer edges.
There are few ISP's who would be able to terminate an IP or GRE tunnel on-net, end-to-end. And even then, they might be reluctant to offer any SLA's because those tunnels are built on the CPE, typically outside of their control.
The only difference should be on profitability for router makers, which want to make routing system as complex as possible or even a lot more than that to make backbone routers a lot profitable product.
If ISP's didn't make money from MPLS/VPN's, router vendors would not be as keen on adding the capability in their boxes.
Label stack was there, because of, now recognized to be wrong, statement of Yakov on day one and I can see no reason still to keep it.
Label stacking is fundamental to the "MP" part of MPLS. Whether your payload is IP, ATM, Ethernet, Frame Relay, PPP, HDLC, e.t.c., the ability to stack labels is what makes an MPLS network payload agnostic. There is value in that. Mark.
Mark Tinka wrote:
But, it should be noted that a single class B...
CIDR - let's not teach the kids old news :-).
Saying /16 is ambiguous, because what it means depends on the IP version.
And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP.
Yes, best practice is in iBGP.
Some operators may still be using an IGP for this. It would work, but scales poorly.
The amount of flooded traffic is not so different.
Then, it should be a lot better to let customer edges encapsulate L2 or L3 over IP, with which, routing information within customers is exchanged by customer provided VPN without requiring extra overhead of maintaining customer local routing information by the ISP.
You mean like IP-in-IP or GRE? That already happens today, without any intervention from the ISP.
I know, though I didn't know ISP's are not offering SLA for it.
There are few ISP's who would be able to terminate an IP or GRE tunnel on-net, end-to-end.
And even then, they might be reluctant to offer any SLA's because those tunnels are built on the CPE, typically outside of their control.
The condition to offer an SLA beyond an ISP's own network should not be a "trusted NNI" but policing by the ISP with the ISP's own equipment, which prevents too much traffic from entering the network.
If ISP's didn't make money from MPLS/VPN's, router vendors would not be as keen on adding the capability in their boxes.
It is like telcos making money from expensive telephone exchanges, only to be replaced by ISPs, I'm afraid.
Label stacking is fundamental to the "MP" part of MPLS. Whether your payload is IP, ATM, Ethernet, Frame Relay, PPP, HDLC, e.t.c., the ability to stack labels is what makes an MPLS network payload agnostic. There is value in that.
What? You are saying the "payload", not something carrying the "payload", is MP. Then, plain Ethernet is MP with EtherType, isn't it?

Masataka Ohta
On Jun 23, 2020, at 4:16 AM, Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Mark Tinka wrote:
But, it should be noted that a single class B... CIDR - let's not teach the kids old news :-).
Saying /16 is ambiguous depends on IP version.
Not really… A /16 in IPv6 is a lot more addresses, but it’s still using the first 16 bits to specify the prefix, same as IPv4. Owen
Owen DeLong wrote:
Saying /16 is ambiguous depends on IP version.
Not really… A /16 in IPv6 is a lot more addresses, but it's still using the first 16 bits to specify the prefix, same as IPv4.
As I wrote:

: But, it should be noted that a single class B routing table entry
: often serves for an organization with 10000s of users, which is
: at least our case here at titech.ac.jp.

the number of remaining bits, save the first 16, matters, which depends on the IP version.

Masataka Ohta
Masataka Ohta wrote on 22/06/2020 13:49:
But, it should be noted that a single class B routing table entry
"a single class B routing table entry"? Did 1993 just call and ask for its addressing back? :-)
But, it should be noted that a single class B routing table entry often serves for an organization with 10000s of users, which is at least our case here at titech.ac.jp.
It should also be noted that, my concern is scalability in ISP side.
This entire conversation is puzzling: we already have "hierarchical routing" to a large degree, to the extent that the public DFZ only sees aggregate routes exported by ASNs. Inside ASNs, there will be internal aggregation of individual routes (e.g. an ISP DHCP pool), and possibly multiple levels of aggregation, depending on how this is configured. Aggregation is usually continued right down to the end-host edge, e.g. a router might have a /26 assigned on an interface, but the hosts will be aggregated within this /26.
If you have 1000 PEs, you should be serving for somewhere around 1000 customers.
And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP.
Well, maybe. Or maybe not. This depends on lots of things.
Then, it should be a lot better to let customer edges encapsulate L2 or L3 over IP, with which, routing information within customers is exchanged by customer provided VPN without requiring extra overhead of maintaining customer local routing information by the ISP.
If you have 1000 or even 10000s of PEs, injecting simplistic non-aggregated routing information is unlikely to be an issue. If you have 1,000,000 PEs, you'll probably need to rethink that position.

If your proposition is that the nature of the internet be changed so that route disaggregation is prevented, or that addressing policy be changed so that organisations are exclusively handed out IP address space by their upstream providers, then this is a simple matter of misunderstanding how impractical the proposition is: that horse bolted from the barn 30 years ago; no organisation would accept exclusive connectivity provided by a single upstream; and today's world of dense interconnection would be impossible on the terms you suggest.

You may not like that there are lots of entries in the DFZ, and many operators view this as a bit of a drag, but on today's technology this can scale to significantly more than what we foresee in the medium-to-long-term future.

Nick
Masataka Ohta Sent: Monday, June 22, 2020 1:49 PM
Robert Raszuk wrote:
Moreover if you have 1000 PEs and those three sites are attached only to 6 of them - only those 6 PEs will need to learn those routes (Hint: RTC - RFC4684)
If you have 1000 PEs, you should be serving for somewhere around 1000 customers.
And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP.
Not quite. The routing information is flooded by default, but the receivers will cherry-pick what they need and drop the rest. And even if the default flooding of all and dropping of most is a concern, it can be addressed so that only the relevant subset of all the routing info is sent to each receiver.

The key takeaway, however, is that no single entity in an SP network, be it PE, RR, or ASBR, ever needs everything; you can always slice and dice indefinitely.

So to sum it up, you simply cannot run into any scaling ceiling with the MP-BGP architecture.

adam
On 22/Jun/20 16:30, adamv0025@netconsultings.com wrote:
Not quite. The routing information is flooded by default, but the receivers will cherry-pick what they need and drop the rest. And even if the default flooding of all and dropping of most is a concern, it can be addressed so that only the relevant subset of all the routing info is sent to each receiver. The key takeaway, however, is that no single entity in an SP network, be it PE, RR, or ASBR, ever needs everything; you can always slice and dice indefinitely. So to sum it up, you simply cannot run into any scaling ceiling with the MP-BGP architecture.
The only nodes in our network that have ALL the NLRI are our RR's. Depending on the function of the egress/ingress router, the RR sends it only what it needs for its function. This is how we get away with using communities in lieu of VRF's :-).

And as Adam points out, those RR's will swallow anything and everything, and still remain asleep.

Mark.
adamv0025@netconsultings.com wrote:
The key takeaway however is that no single entity in SP network, be it PE, or RR, or ASBR...., ever needs everything, you can always slice and dice indefinitely. So to sum it up you simply can not run into any scaling ceiling with MP-BGP architecture.
The flooding nature of BGP requires all the related entities to process everything, regardless of whether they need all of it or not.

Masataka Ohta
So to sum it up you simply can not run into any scaling ceiling with MP-BGP architecture.
The flooding nature of BGP requires all the related entities to process everything, regardless of whether they need all of it or not.
That is long gone, I am afraid ... Hint: RFC 4684, now applicable to more and more AFI/SAFIs.

Also, from day one of L3VPNs, PEs, even if receiving all routes, were dropping on inbound (a cheap operation) those routes which contained no locally intersecting RTs.

Thx,
R.
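For readers not living in VPN land, the filtering Robert and Adam describe is just set intersection on route-target communities; RFC 4684 (RT Constraint) moves the same test from the PE's inbound side to the RR's outbound side, so uninteresting routes are never sent at all. A toy illustration with invented RTs and prefixes:

# Toy route-target filtering: a PE keeps only VPN routes whose RT set
# intersects its locally configured import RTs; with RTC (RFC 4684) the RR
# applies the same test before sending. All values are invented.

vpn_routes = [
    {"prefix": "10.1.0.0/16", "rts": {"65000:100"}},
    {"prefix": "10.2.0.0/16", "rts": {"65000:200"}},
    {"prefix": "10.3.0.0/16", "rts": {"65000:100", "65000:300"}},
]

pe_import_rts = {"65000:100"}   # what this PE's attached VRFs actually need

def wanted(route, import_rts):
    return bool(route["rts"] & import_rts)

# Pre-RTC behaviour: the PE receives everything and drops on inbound (cheap):
kept = [r["prefix"] for r in vpn_routes if wanted(r, pe_import_rts)]

# With RTC: the PE advertises its import RTs to the RR, and the RR runs the
# same predicate on its outbound side, so only `kept` ever crosses the wire.
print(kept)   # ['10.1.0.0/16', '10.3.0.0/16']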
On 21/Jun/20 14:36, Masataka Ohta wrote:
That is a tragedy.
Well...
If all the link-wise (or, worse, host-wise) information of possible destinations is distributed in advance to all the possible sources, it is not hierarchical but flat (host) routing, which scales poorly.
Right?
Host NLRI is summarized in iBGP within the domain, and eBGP outside the domain. It's no longer novel to distribute end-user NLRI in the IGP. If folk are still doing that, I can't feel sympathy for the pain they may experience.
Why, do you think, flat routing does not but hierarchical routing does scale?
It is because detailed information to reach destinations below certain level is advertised not globally but only for small part of the network around the destinations.
That is, with hierarchical routing, detailed information around destinations is actively hidden from sources.
So, with hierarchical routing, routing protocols can carry only rough information around destinations, from which, source side can not construct detailed (often purposelessly nested) labels required for MPLS.
But hosts often point default to a clever router. That clever router could also either point default to the provider, or carry a full BGP table from the provider. Neither the host nor their first-hop gateway need to be MPLS-aware. There are use-cases where a customer CPE can be MPLS-aware, but I'd say that in nearly 99.999% of all cases, CPE are never MPLS-aware.
According to your theory to ignore routing traffic, we can be happy with global *host* routing table with 4G entries for IPv4 and a lot lot lot more than that for IPv6. CIDR should be unnecessary complication to the Internet
Not sure what Internet you're running, but I, generally, accept aggregate IPv4 and IPv6 BGP routes from other AS's. I don't need to know every /32 or /128 host that sits behind them.
With nested labels, you don't need so much labels at certain nesting level, which was the point of Yakov, which does not mean you don't need so much information to create entire nested labels at or near the sources.
I don't know what Yakov advertised back in the day, but looking at what I and a ton of others are running in practice, in the real world, today, I don't see what you're talking about. Again, if you can identify an actual scenario today, in a live, large scale (or even small scale) network, I'd like to know. I'm talking about what's in practice, not theory.
The problem is that we can't afford traffic (and associated processing by all the related routers or things like those) and storage (at or near source) for routing (or MPLS, SR* or whatever) with such detailed routing at the destinations.
Again, I disagree, as I mentioned earlier, because you won't be able to buy a router today that does only IP any cheaper than one that does both IP and MPLS. MPLS has become so mainstream that its economies of scale have made the choice between it and IP a non-starter.

Heck, you can even do it in Linux...

Mark.
Mark Tinka wrote:
So, with hierarchical routing, routing protocols can carry only rough information around destinations, from which, source side can not construct detailed (often purposelessly nested) labels required for MPLS.
But hosts often point default to a clever router.

The requirement from the E2E principle is that routers should be dumb and hosts should be clever, or the entire system does not scale reliably.
In this case, such a clever router can only exist near the destination, unless very detailed routing information is flooded all over the network to all the possible sources.

A router can't be clever about something unless it is provided with very detailed information on all the possible destinations, which needs a lot of routing traffic, making the entire system not scale.

Masataka Ohta
On 22/Jun/20 15:08, Masataka Ohta wrote:
The requirement from the E2E principle is that routers should be dumb and hosts should be clever, or the entire system does not scale reliably.
And yet in the PTT world, it was the other way around. Clever switching and dumb telephone boxes. How things have since evened out.

I can understand the concern about making the network smart. But even a smart network is not as smart as a host. My laptop can do a lot of things more cleverly than any of the routers in my network. It just can't do them at scale, consistently, for a bunch of users.

So the responsibility gets to be shared, with the number of users being served diminishing as you enter and exit the edge of the network. It's probably not yet an ideal networking paradigm, but it's the one we have now that is a reasonably fair compromise.
In this case, such clever router can ever exist only near the destination unless very detailed routing information is flooded all over the network to all the possible sources.
I will admit that bloating router code over recent years to become terribly smart (CGN, Acceleration, DoS mitigation, VPN's, SD-WAN, IDS, Video Monitoring, e.t.c.) can become a runaway problem. I've often joked that with all the things being thrown into BGP, we may just see it carrying DNS too, hehe.

Personally, the level of intelligence we have in routers now, beyond being just Layer 1, 2, 3 - and maybe 4 - crunching machines, is just as far as I'm willing to go. If, like me, you keep pushing back on vendors trying to make your routers also clean your dishes, they'll take the hint and stop bloating the code.

Are MPLS/VPN's overly clever? I think so. But considering the pay-off and how much worse it could get, I'm willing to accept that.
A router can't be clever on something, unless it is provided with very detailed information on all the possible destinations, which needs a lot of routing traffic making entire system not to scale.
Well, if you can propose a better way to locate hosts on a global network not owned by anyone, in a connectionless manner, I'm sure we'd all be interested. Mark.
The requirement from the E2E principle is that routers should be dumb and hosts should be clever or the entire system do not. scale reliably.
And yet in the PTT world, it was the other way around. Clever switching and dumb telephone boxes.
how did that work out for the ptts? :)
On Mon, Jun 22, 2020 at 7:18 PM Randy Bush <randy@psg.com> wrote:
how did that work out for the ptts? :)
Though its release slipped by three years, by 1995 ATM had started to replace IP as the protocol of choice. By 1999, IP was used only by a small number of academic networks.

Nah, I don't think there is anywhere in the multiverse where fat pipes and dumb switches don't win.

--
Fletcher Kittredge
GWI
207-602-1134
www.gwi.net
Mark Tinka wrote:
Personally, the level of intelligence we have in routers now beyond being just Layer 1, 2, 3 - and maybe 4 - crunching machines is just as far as I'm willing to go.
Once upon a time in Japan, NTT proudly announced that it had developed and actually deployed telephone exchanges able to offer a complex calculator service, including trigonometric/exponential/logarithmic functions, which was impossible with handheld calculators at that time.

It is my favorite example when I explain the E2E principle.

Masataka Ohta
Masataka Ohta Sent: Sunday, June 21, 2020 1:37 PM
Whether you do it manually or use a label distribution protocol, FEC's are pre-computed ahead of time.
What am I missing?
If all the link-wise (or, worse, host-wise) information of possible destinations is distributed in advance to all the possible sources, it is not hierarchical but flat (host) routing, which scales poorly.
Right?
On the Internet, yes; in controlled environments, no, as in those environments the set of possible destinations is well scoped.

Take an MPLS-enabled DC for instance: every VM only needs to talk to a small subset of all the VMs hosted in the DC. Hence each VM gets flow transport labels programmed via centralized end-to-end flow controllers on a need-to-know basis (not everything to everyone). (E.g. dear vm1, this is how you get your EF/BE flows via load-balancer and FW to backend VMs in your local pod, this is how you get via the local pod FW to the internet GW, etc..., done.)

Now that you have these neat "pipes" all over the place connecting VMs, it's easy for the switching fabric controller to shuffle elephant and mice flows around in order to avoid any link saturation.

And now imagine, a bit further, doing the same as above but with CPEs on a Service Provider network... yep, no PEs acting as chokepoints for the assignment of MPLS label switched paths to flows, needing massive FIBs and even bigger ones, just a dumb MPLS switch fabric; all the "hard work" is offloaded to centralized controllers (and to CPEs for label stack imposition), but only on a need-to-know basis (not everything to everyone).

Now in both cases you're free to choose to what extent the MPLS switch fabric should be involved with the end-to-end flows, by imposing hierarchies in the MPLS stack.

In light of the above, does it suck to have just 20 bits of MPLS label space? Absolutely.

Adam
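A very rough sketch of the controller model Adam describes: a central brain pushes a small, need-to-know table of prebuilt label stacks to each VM or CPE, and the fabric itself only switches labels. Every name, label and class below is hypothetical; this is only meant to show the shape of the state involved, not any real product API.

# Hypothetical sketch: a central controller computes, per source VM and per
# traffic class, a prebuilt label stack through the dumb MPLS fabric
# (e.g. via load-balancer and firewall). All names and labels are invented.

FABRIC_PATHS = {
    # (src, dst_group, traffic_class) -> label stack, outermost first
    ("vm1", "backend-pod1", "EF"): [1001, 2044, 30007],
    ("vm1", "backend-pod1", "BE"): [1002, 2044, 30007],
    ("vm1", "internet-gw",  "BE"): [1003, 2099],
}

def controller_push(vm: str) -> dict:
    """What the controller sends to one VM: only the entries it needs."""
    return {k: v for k, v in FABRIC_PATHS.items() if k[0] == vm}

def vm_encapsulate(table: dict, vm: str, dst_group: str, tclass: str):
    """On the host: pick the prebuilt stack; the fabric just swaps labels."""
    return table[(vm, dst_group, tclass)]

table_for_vm1 = controller_push("vm1")
print(vm_encapsulate(table_for_vm1, "vm1", "backend-pod1", "EF"))  # [1001, 2044, 30007]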
It is destination-based flat routing, distributed 100% before any data packet, within each layer - yes. But the layers are decoupled, so in a sense this is what defines a hierarchy overall.

So transport is using MPLS LSPs; most often host IGP routes are matched with LDP FECs and flooded everywhere, in spite of RFC 5283 at least allowing the IGP to be aggregated. Then, say, L2VPNs or L3VPNs, with their own choice of routing protocols, are in turn distributing reachability for the customer sites. Those are service routes, linked to transport by the BGP next hop(s).

Many thx,
R.

On Sun, Jun 21, 2020 at 1:11 PM Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> wrote:
Robert Raszuk wrote:
MPLS LDP or L3VPNs was NEVER flow driven.
Since day one till today it was and still is purely destination based.
If information to create labels at or near sources to all the possible destinations is distributed in advance, may be. But it is effectively flat routing, or, in extreme cases, flat host routing.
Or, if information to create labels to all the active destinations is supplied on demand, it is flow driven.
On day one, Yakov said MPLS had scaled because of nested labels corresponding to routing hierarchy.
Masataka Ohta
But MPLS can be made flow driven (it can be made whatever the policy dictates), for instance DSCP driven…

adam

From: NANOG <nanog-bounces+adamv0025=netconsultings.com@nanog.org> On Behalf Of Robert Raszuk
Sent: Saturday, June 20, 2020 4:13 PM
To: Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp>
Cc: North American Network Operators' Group <nanog@nanog.org>
Subject: Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

The problem of MPLS, however, is that, it must also be flow driven, because detailed route information at the destination is necessary to prepare nested labels at the source, which costs a lot and should be attempted only for detected flows.

MPLS is not flow driven. I sent some mail about it but perhaps it bounced.

MPLS LDP or L3VPNs were NEVER flow driven. Since day one till today, it was and still is purely destination based.

Transport is using an LSP to the egress PE (dst IP). L3VPNs are using either per-dst-prefix, per-CE or per-VRF labels. No implementation does anything upon "flow detection" to prepare any nested labels. Even in FIBs, all information is preprogrammed in hierarchical fashion well before any flow packet arrives.

Thx,
R.
there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.
The routing table at an IPv4 backbone today needs at most 16M entries, which can be looked up by simple SRAM as fast as an MPLS lookup; that is one of the reasons why we should obsolete IPv6.

Though resource reserved flows need their own routing table entries, they should be charged in proportion to the duration of the reservation, which can scale to afford the cost of having the entries.

Masataka Ohta
adamv0025@netconsultings.com wrote:
But MPLS can be made flow driven (it can be made whatever the policy dictates), for instance DSCP driven…
The point of Yakov on day one was that the flow driven approach of Ipsilon does not scale and is unacceptable.

Though I agree with Yakov here, we must also eliminate all the flow driven approaches, by MPLS or whatever.

Masataka Ohta
From: Masataka Ohta <mohta@necom830.hpcl.titech.ac.jp> Sent: Monday, June 22, 2020 2:17 PM
adamv0025@netconsultings.com wrote:
But MPLS can be made flow driven (it can be made whatever the policy dictates), for instance DSCP driven.
The point of Yakov on day one was that, flow driven approach of Ipsilon does not scale and is unacceptable.
Though I agree with Yakov here, we must also eliminate all the flow driven approaches by MPLS or whatever.
First, I'd need a definition of what "flow" means in this discussion: are we considering a 5-tuple, a 4-tuple, or just SRC-IP & DST-IP? Is DSCP marking part of it?

Second, although I agree that ~1M unique identifiers is not ideal, can you provide examples of MPLS applications where 1M is limiting? What particular aspect? Is it 1M interfaces per MPLS switching fabric box? Or 1M unique flows (or better, flow groups) originated by a given VM/Container/CPE? Or 1M destination entities (IPs, or apps on those IPs) that any particular VM/Container/CPE needs to talk to? Or 1M customer VPNs or 1M PE-CPE links, if the PE acts as a bottleneck?

adam
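As a quick sanity check on the 20-bit question: the label space is 2^20, roughly 1.05M values per label space, which is why per-VRF or per-CE allocation stretches much further than per-prefix allocation. The VRF and prefix counts below are invented purely to show where the ceiling starts to bite.

# Quick arithmetic on the 20-bit MPLS label space Adam mentions.
# The VPN/prefix counts below are illustrative, not from the thread.

label_space = 2 ** 20                  # 1,048,576 labels per label space
reserved = 16                          # labels 0-15 are special-purpose

vrfs_on_pe = 5_000
avg_prefixes_per_vrf = 300

per_vrf_labels = vrfs_on_pe                       # one service label per VRF
per_prefix_labels = vrfs_on_pe * avg_prefixes_per_vrf

print(label_space - reserved)   # -> 1048560 usable
print(per_vrf_labels)           # -> 5000, comfortable
print(per_prefix_labels)        # -> 1500000, already over the space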
On 22/Jun/20 15:17, Masataka Ohta wrote:
The point of Yakov on day one was that, flow driven approach of Ipsilon does not scale and is unacceptable.
Though I agree with Yakov here, we must also eliminate all the flow driven approaches by MPLS or whatever.
I still don't see them in practice, even though they may have been proposed. Mark.
Masataka Ohta wrote:
The point of Yakov on day one was that, flow driven approach of Ipsilon does not scale and is unacceptable.
Though I agree with Yakov here, we must also eliminate all the flow driven approaches by MPLS or whatever.
I still don't see them in practice, even though they may have been proposed.
I don't know, either, as it's Adam who said:
But MPLS can be made flow driven (it can be made whatever the policy dictates), for instance DSCP driven…
Masataka Ohta
On 20/Jun/20 15:39, Masataka Ohta wrote:
Ipsilon was hopeless because, as Yakov correctly pointed out, flow driven approach to automatically detect flows does not scale.
The problem of MPLS, however, is that, it must also be flow driven, because detailed route information at the destination is necessary to prepare nested labels at the source, which costs a lot and should be attempted only for detected flows.
Again, I think you are talking about what RSVP should have been. RSVP != MPLS.
Routing table at IPv4 backbone today needs at most 16M entries to be looked up by simple SRAM, which is as fast as MPLS look up, which is one of a reason why we should obsolete IPv6.
I'm not sure I should ask this for fear of taking this discussion way off on a tangent... aaah, what the heck: so if we can't assign hosts IPv4 anymore because it has run out, should we obsolete IPv6 in favour of CGN? I know this works.
Though resource reserved flows need their own routing table entries, they should be charged proportional to duration of the reservation, which can scale to afford the cost to have the entries.
RSVP failed to take off when it was designed. Outside of capturing Netflow data (or tracking firewall state), nobody really cares about handling flows at scale (no, I'm not talking about ECMP). Why would we want to do that in 2020 if we didn't in 2000? Mark.
there is saku's point of distributing labels in IGP TLVs/LSAs. i suspect he is correct, but good luck getting that anywhere in the internet vendor task force.
Perhaps I will surprise a few, but this is not only already in RFC format - it is also already shipping across vendors, and has been for some time now. SR-MPLS (as part of its spec) does exactly that. You do not need to use any SR if you do not want to; you can still encapsulate your packets with a transport label corresponding to your exit at any ingress, and forget about LDP for good.

But with that, let's not forget that aggregation here is still not spec-ed out well, and to the best of my knowledge it is also not shipping yet. I recently proposed an idea of how to aggregate SRGBs; one vendor is analyzing it. (A small sketch of the SRGB label arithmetic follows the quoted message below.)

Best,
R.

On Sat, Jun 20, 2020 at 1:33 AM Randy Bush <randy@psg.com> wrote:
< ranting of a curmudgeonly old privileged white male >
MPLS was since day one proposed as an enabler for services, originally L3VPNs and RSVP-TE. MPLS day one was mike o'dell wanting to move his city/city traffic matrix from ATM to tag switching and open cascade's hold on tags. And IIRC, Tag switching day one was Cisco overreacting to Ipsilon.
i had not thought of it as overreacting; more embrace and devour. mo and yakov, aided and abetted by sob and other ietf illuminati, helped cisco take the ball away from Ipsilon, Force10, ...
but that is water over the damn, and my head is hurting a bit from thinking on too many levels at once.
there is saku's point of distributing labels in IGP TLVs/LSAs. i suspect he is correct, but good luck getting that anywhere in the internet vendor task force. and that tells us a lot about whether we can actually effect useful simplification and change.
is a significant part of the perception that there is a forwarding problem the result of the vendors, 25 years later, still not designing for v4/v6 parity?
there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.
did we really learn so little from IP routing that we need to recreate analogous complexity and fragility in the MPLS control plane? ( sound of steam emanating from saku's ears :)
and then there is buffering; which seems more serious than simple forwarding rate. get it there faster so it can wait in a queue? my principal impression of the Stanford/Google workshops was the parable of the blind men and the elephant. though maybe Matt had the main point: given scaling 4x, Moore's law can not save us and it will all become paced protocols. will we now have a decade+ of BBR evolution and tuning? if so, how do we engineer our networks for that?
and up 10,000m, we watch vendor software engineers hand crafting in an assembler language with if/then/case/for, and running a chain of checking software to look for horrors in their assembler programs. it's the bleeping 21st century. why are the protocol specs not formal and verified, and the code formally generated and verified? and don't give me too slow given that the hardware folk seem to be able to do 10x in the time it takes to run valgrind a few dozen times.
we're extracting ore with hammers and chisels, and then hammering it into shiny objects rather than safe and securable network design and construction tools.
apologies. i hope you did not read this far.
randy
On 20/Jun/20 17:08, Robert Raszuk wrote:
But with that let's not forget that aggregation here is still not spec-ed out well and to the best of my knowledge it is also not shipping yet. I recently proposed an idea how to aggregate SRGBs .. one vendor is analyzing it.
Hence why I think SR still needs time to grow up. There are some things I can be maverick about. I don't think SR is it, today. Mark.
On 20/Jun/20 01:32, Randy Bush wrote:
there is saku's point of distributing labels in IGP TLVs/LSAs. i suspect he is correct, but good luck getting that anywhere in the internet vendor task force. and that tells us a lot about whether we can actually effect useful simplification and change.
This is shipping today with SR-MPLS. Besides it still being brand new and not yet fully field-tested by the community, my other concern is that unless you are running a Juniper and have the energy to pull a "Vijay Gill" and move your entire backbone to IS-IS, you'll get either no SR-ISISv6 support, no SR-OSPFv3 support, or both, with all the vendors. Which brings me back to the same piss-poor attention LDPv6 is getting, which is, really, poor attention to IPv6. Kind of hard for operators to take IPv6 seriously at this level if the vendors, themselves, aren't.
is a significant part of the perception that there is a forwarding problem the result of the vendors, 25 years later, still not designing for v4/v6 parity?
I think the forwarding is fine, if you're carrying the payload in MPLS. The problem is the control plane. It's not insurmountable; the vendors just want to do less work. The issue is IPv4 is gone, and trying to keep it around will only lead to the creation of more hacks, which will further complicate the control and data plane.
there is the argument that switching MPLS is faster than IP; when the pressure points i see are more at routing (BGP/LDP/RSVP/whatever), recovery, and convergence.
Either way, the MPLS or IP problem already has an existing solution. If you like IP, you can keep it. If you like MPLS, you can keep it. So I'd be spending less time on the forwarding (of course, if there are ways to improve that and someone has the time, why not), and as you say, work on fixing the control plane and the signaling for efficiency and scale.
did we really learn so little from IP routing that we need to recreate analogous complexity and fragility in the MPLS control plane? ( sound of steam emanating from saku's ears :)
SR-MPLS's approach of carrying its signaling inherently in the IGP is an optimal solution, one that even I have wanted since inception. But it's still too fresh, global deployment is terrible, and there is still much to be learned about how it behaves outside of the lab. For me, a graceful approach toward SR via LDPv6 makes sense. But, as always, YMMV.
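To illustrate what "signaling carried in the IGP" amounts to in SR-MPLS: each node floods an SRGB (label block) plus a SID index per prefix/node, and any ingress can compute the transport label locally. A rough sketch, with made-up SRGBs and indices:

    srgb = {                      # node -> (SRGB base, size), learned via the IGP
        "PE1": (16000, 8000),
        "P1":  (16000, 8000),
        "PE2": (900000, 8000),    # nothing forces a common base, hence the
    }                             # SRGB aggregation discussion elsewhere in-thread
    sid_index = {"PE1": 11, "P1": 21, "PE2": 31}

    def label_toward(neighbor, target):
        """Label to push when sending via 'neighbor' toward 'target'."""
        base, size = srgb[neighbor]
        index = sid_index[target]
        assert index < size
        return base + index

    print(label_toward("P1", "PE2"))   # -> 16031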
and then there is buffering; which seems more serious than simple forwarding rate. get it there faster so it can wait in a queue? my principal impression of the Stanford/Google workshops was the parable of the blind men and the elephant. though maybe Matt had the main point: given scaling 4x, Moore's law can not save us and it will all become paced protocols. will we now have a decade+ of BBR evolution and tuning? if so, how do we engineer our networks for that?
This deserves a lot more attention than it's receiving. The problem is it doesn't sound sexy enough to compile into a PPT that you can project to the suits you need to persuade to part with their cash. It doesn't have that 5G or SRv6 or Controller or IoT ring to it :-). It's been a while since the vendors that control a large portion of the market paid real attention to their geeky side. The buffer problem, for me, would fall into that category. Maybe a smaller, more agile, more geeky start-up can take the lead with this one.
and up 10,000m, we watch vendor software engineers hand crafting in an assembler language with if/then/case/for, and running a chain of checking software to look for horrors in their assembler programs. it's the bleeping 21st century. why are the protocol specs not formal and verified, and the code formally generated and verified? and don't give me too slow given that the hardware folk seem to be able to do 10x in the time it takes to run valgrind a few dozen times.
And for today's episode of Jeopardy: "What used to be the IETF?"
we're extracting ore with hammers and chisels, and then hammering it into shiny objects rather than safe and securable network design and construction tools.
Rush it out the factory, fast, even though it's not ready. Get all their money before they board the ship and sail for Mars. Mark.
Robert Raszuk wrote:
MPLS was since day one proposed as an enabler for services, originally L3VPNs and RSVP-TE. There seems to be serious confusion between label switching with explicit flows and MPLS, which was believed to scale without detecting/configuring flows.
At the time I proposed label switching, there already was RSVP but RSVP-TE was proposed long after MPLS was proposed. But, today, people seem to be using so-called MPLS with explicitly configured flows, the administration of which does not scale and is annoying. Remember that the original point of MPLS was that it should work scalably without a lot of configuration, which is not the reality recognized by people on this thread.
So I think Ohta-san's point is about scalability of services, not flat underlay RIB and FIB sizes. Many years ago we had requests to support 5M L3VPN routes while the underlay was just 500K IPv4.
That is certainly a problem. However, a worse problem is knowing the label values nested deeply in the MPLS label chain. Even worse, if the route near the destination expected to pop the label chain goes down, how can the source know that the router has gone down and choose an alternative router near the destination?
Last - when I originally discussed just plain MPLS with customers with single application of hierarchical routing (no BGP in the core) frankly no one was interested.
MPLS with hierarchical routing just does not scale. Masataka Ohta
But, today, people seem to be using so-called MPLS with explicitly configured flows, the administration of which does not scale and is annoying.
I am actually not sure what you are talking about here. The only per-flow action in any MPLS deployment I have seen was mapping flow groups to specific TE-LSPs. In all other TDP or LDP cases, flow == IP destination, so it is based purely on destination reachability. And such mapping is based on the LDP FEC to IGP (or BGP) match.
Even worse, if the route near the destination expected to pop the label chain goes down, how can the source know that the router has gone down and choose an alternative router near the destination?
In normal MPLS the src does not pick the transit paths. Transit is 100% driven by the IGP, and if you lose a node, local connectivity restoration techniques apply (FRR or IGP convergence). If the egress signalled implicit NULL, it would signal it to any IGP peer. That is possible with SR-MPLS too. No change ... no per-flow state at all beyond per-IP-destination routing. If you want to control your transit hops you can - but this is an option, not a requirement. MPLS with hierarchical routing just does not scale. While I am not defending MPLS here, and 100% agree that IP as transit is a much better option today and tomorrow, I would also like to make sure we communicate true points. So when you say it does not scale - it would be good to list what exactly does not scale, by providing a real network operational example. Many thx, R.
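A trivial sketch of that point - the only state is the per-destination FEC, and a 5-tuple never enters into it (labels below are invented):

    lfib = {                          # FEC (destination prefix) -> out-label
        "203.0.113.0/24": 24001,
        "192.0.2.0/24":   24002,
    }

    def label_for(dst_prefix, five_tuple=None):
        # the 5-tuple is deliberately ignored: state is per destination only
        return lfib[dst_prefix]

    print(label_for("203.0.113.0/24", ("10.0.0.1", "203.0.113.9", 6, 40001, 443)))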
On 19/Jun/20 18:00, Masataka Ohta wrote:
There seems to be serious confusions between label switching with explicit flows and MPLS, which was believed to scale without detecting/configuring flows.
At the time I proposed label switching, there already was RSVP but RSVP-TE was proposed long after MPLS was proposed.
RSVP failed to take off, for whatever reason (I can think of many). I'm not sure any network operator, today, would allow an end-host to make reservation requests in their core. Even in the Transport world, this was the whole point of GMPLS. After they saw how terrible that idea was, it shifted from customers to being an internal fight between the IP teams and the Transport teams. Ultimately, I don't think anybody really cared about routers automatically using GMPLS to reserve and direct the DWDM network. In our Transport network, we use GMPLS/ASON in the Transport network only. When the IP team needs capacity, it's a telephone job :-).
But, today, people seem to be using so-called MPLS with explicitly configured flows, the administration of which does not scale and is annoying.
Remember that the original point of MPLS was that it should work scalably without a lot of configuration, which is not the reality recognized by people on this thread.
Well, you get the choice of LDP (low-touch) or RSVP-TE (high-touch). Pick your poison. We don't use RSVP-TE because of the issues you describe above. We use LDP to avoid the issues you describe above. In the end, SR-MPLS is meant to solve this issue for TE requirements. So the signaling state-of-the-art improves with time.
That is certainly a problem. However, a worse problem is knowing the label values nested deeply in the MPLS label chain.
Why, how, is that a problem? For load balancing?
Even worse, if the route near the destination expected to pop the label chain goes down, how can the source know that the router has gone down and choose an alternative router near the destination?
If by source you mean end-host, if the edge router they are connected to only ran IP and they were single-homed, they'd still go down. If the end-host were multi-homed to two edge routers, one of them failing won't cause an outage for the host. Unless I misunderstand.
MPLS with hierarchical routing just does not scale.
With Internet in a VRF, I truly agree. But if you run a simple global BGP table and no VRF's, I don't see an issue. This is what we do, and our scaling concerns are exactly the same whether we run plain IP or IP/MPLS. Mark.
On Sat, Jun 20, 2020 at 11:08 AM Mark Tinka <mark.tinka@seacom.mu> wrote:
MPLS with hierarchical routing just does not scale.
With Internet in a VRF, I truly agree.
But if you run a simple global BGP table and no VRF's, I don't see an issue. This is what we do, and our scaling concerns are exactly the same whether we run plain IP or IP/MPLS.
Mark.
We run the Internet in a VRF to get watertight separation between management and the Internet. I do also have a CGN vrf but that one has very few routes in it (99% being subscriber management created, eg. one route per customer). Why would this create a scaling issue? If you collapse our three routing tables into one, you would have exactly the same number of routes. All we did was separate the routes into namespaces, to establish a firewall that prevents traffic from flowing where it shouldn't. Regards, Baldur
On 20/Jun/20 11:27, Baldur Norddahl wrote:
We run the Internet in a VRF to get watertight separation between management and the Internet. I do also have a CGN vrf but that one has very few routes in it (99% being subscriber management created, eg. one route per customer). Why would this create a scaling issue? If you collapse our three routing tables into one, you would have exactly the same number of routes. All we did was separate the routes into namespaces, to establish a firewall that prevents traffic to flow where it shouldn't.
It may be less of an issue in 2020 with the current control planes and how far the code has come, but in the early days of l3vpn's, the number of VRF's you could have was inversely proportional to the number of routes you had in each one. More VRF's, fewer routes for each. More routes per VRF, fewer VRF's in total. I don't know if that's still an issue today, as we don't run the Internet in a VRF. I'd defer to those with that experience, who knew about the scaling limitations of the past. Mark.
On Sat, Jun 20, 2020 at 12:38 PM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 20/Jun/20 11:27, Baldur Norddahl wrote:
We run the Internet in a VRF to get watertight separation between management and the Internet. I do also have a CGN vrf but that one has very few routes in it (99% being subscriber management created, eg. one route per customer). Why would this create a scaling issue? If you collapse our three routing tables into one, you would have exactly the same number of routes. All we did was separate the routes into namespaces, to establish a firewall that prevents traffic to flow where it shouldn't.
It may be less of an issue in 2020 with the current control planes and how far the code has come, but in the early days of l3vpn's, the number of VRF's you could have was inversely proportional to the number of routes you had in each one. More VRF's, fewer routes for each. More routes per VRF, fewer VRF's in total.
I don't know if that's still an issue today, as we don't run the Internet in a VRF. I'd defer to those with that experience, who knew about the scaling limitations of the past.
I can't speak for the year 2000 as I was not doing networking at this level at that time. But when I check the specs for the base mx204 it says something like 32 VRFs, 2 million routes in FIB and 6 million routes in RIB. Clearly those numbers are the total of routes across all VRFs otherwise you arrive at silly numbers (64 million FIB if you multiply, 128k FIB if you divide by 32). My conclusion is that scale wise you are ok as long as you do not try to have more than one VRF with a complete copy of the DFZ. More worrying is that 2 million routes will soon not be enough to install all routes with a backup route, invalidating BGP FRR. Regards, Baldur
On 20/Jun/20 22:00, Baldur Norddahl wrote:
I can't speak for the year 2000 as I was not doing networking at this level at that time. But when I check the specs for the base mx204 it says something like 32 VRFs, 2 million routes in FIB and 6 million routes in RIB. Clearly those numbers are the total of routes across all VRFs otherwise you arrive at silly numbers (64 million FIB if you multiply, 128k FIB if you divide by 32). My conclusion is that scale wise you are ok as long you do not try to have more than one VRF with a complete copy of the DFZ.
I recall a number of networks holding multiple VRF's, including at least 2x Internet VRF's, for numerous use-cases. I don't know if they still do that today, but one can get creative real quick :-).
More worrying is that 2 million routes will soon not be enough to install all routes with a backup route, invalidating BGP FRR.
I have a niggling feeling this will be solved before we get there. Now, whether we can afford it is a whole other matter. Mark.
On Sun, Jun 21, 2020 at 9:56 AM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 20/Jun/20 22:00, Baldur Norddahl wrote:
I can't speak for the year 2000 as I was not doing networking at this level at that time. But when I check the specs for the base mx204 it says something like 32 VRFs, 2 million routes in FIB and 6 million routes in RIB. Clearly those numbers are the total of routes across all VRFs otherwise you arrive at silly numbers (64 million FIB if you multiply, 128k FIB if you divide by 32). My conclusion is that scale wise you are ok as long you do not try to have more than one VRF with a complete copy of the DFZ.
I recall a number of networks holding multiple VRF's, including at least 2x Internet VRF's, for numerous use-cases. I don't know if they still do that today, but one can get creative real quick :-).
Yes I once made a plan to have one VRF per transit provider plus a peering VRF. That way our BGP customers could have a session with each of those VRFs to allow them full control of the route mix. I would of course also need an Internet VRF for our own needs. But the reality of that would be too many copies of the DFZ in the routing tables. Although not necessary in the FIB, as each of the transit VRFs could just have a default route installed. Regards, Baldur
On 21/Jun/20 12:45, Baldur Norddahl wrote:
Yes I once made a plan to have one VRF per transit provider plus a peering VRF. That way our BGP customers could have a session with each of those VRFs to allow them full control of the route mix. I would of course also need a Internet VRF for our own needs.
But the reality of that would be too many copies of the DFZ in the routing tables. Although not necessary in the FIB as each of the transit VRFs could just have a default route installed.
We just opted for BGP communities :-). Mark.
On Sun, Jun 21, 2020 at 1:30 PM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 21/Jun/20 12:45, Baldur Norddahl wrote:
Yes I once made a plan to have one VRF per transit provider plus a peering VRF. That way our BGP customers could have a session with each of those VRFs to allow them full control of the route mix. I would of course also need a Internet VRF for our own needs.
But the reality of that would be too many copies of the DFZ in the routing tables. Although not necessary in the FIB as each of the transit VRFs could just have a default route installed.
We just opted for BGP communities :-).
Not really the same. Let's say the best path is through transit 1 but the customer thinks transit 1 sucks balls and wants his egress traffic to go through your transit 2. Only the VRF approach lets every BGP customer, even single-homed ones, make his own choices about upstream traffic. You would be more like a transit broker than a traditional ISP with a routing mix. Your service is to buy in one place, but get the exact same product as you would have if you bought from the top X transits in your area. Delivered as X distinct BGP sessions to give you total freedom to send traffic via any of the transit providers. This is also the reason you do not actually need any routes in the FIB for each of those transit VRFs. Just a default route, because all traffic will unconditionally go to said transit provider. The customer routes would still be there of course. Regards, Baldur
On 21/Jun/20 14:58, Baldur Norddahl wrote:
Not really the same. Lets say the best path is through transit 1 but the customer thinks transit 1 sucks balls and wants his egress traffic to go through your transit 2. Only the VRF approach lets every BGP customer, even single homed ones, make his own choices about upstream traffic.
You would be more like a transit broker than a traditional ISP with a routing mix. Your service is to buy one place, but get the exact same product as you would have if you bought from top X transits in your area. Delivered as X distinct BGP sessions to give you total freedom to send traffic via any of the transit providers.
We received such requests years ago, and calculated the cost of complexity vs. BGP communities. In the end, if the customer wants to use a particular upstream on our side, we'd rather set up an EoMPLS circuit between them and they can have their own contract. Practically, 90% of our traffic is peering. We don't do that much with upstream providers.
This is also the reason you do not actually need any routes in the FIB for each of those transit VRFs. Just a default route because all traffic will unconditionally go to said transit provider. The customer routes would still be there of course.
Glad it works for you. We just found it too complex, not just for the problems it would solve, but also for the parity issues between VRF's and the global table. Mark.
Hi Baldur,
From memory mx204 FIB is 10M (v4/v6) and RIB 30M for each v4 and v6.
And remember the FIB is hierarchical - so it's the next-hops per prefix you are referring to with BGP FRR. And also going from memory of past scaling testing, if pfx1+NH1 == x, then pfx1+NH1+NH2 !== 2x, where x is the FIB space used. adam From: NANOG <nanog-bounces+adamv0025=netconsultings.com@nanog.org> On Behalf Of Baldur Norddahl Sent: Saturday, June 20, 2020 9:00 PM I can't speak for the year 2000 as I was not doing networking at this level at that time. But when I check the specs for the base mx204 it says something like 32 VRFs, 2 million routes in FIB and 6 million routes in RIB. Clearly those numbers are the total of routes across all VRFs otherwise you arrive at silly numbers (64 million FIB if you multiply, 128k FIB if you divide by 32). My conclusion is that scale wise you are ok as long as you do not try to have more than one VRF with a complete copy of the DFZ. More worrying is that 2 million routes will soon not be enough to install all routes with a backup route, invalidating BGP FRR.
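A toy model of the hierarchical-FIB point Adam makes above: prefixes hold a pointer to a shared next-hop group, so adding a backup next hop grows one small shared object rather than doubling the per-prefix cost. All numbers invented:

    nh_groups = {
        "NHG-primary":        ("NH1",),
        "NHG-primary-backup": ("NH1", "NH2"),   # BGP FRR / PIC style
    }

    # ~DFZ-sized table where every prefix just points at one shared group
    fib = {}
    for i in range(800_000):
        prefix = f"10.{(i >> 16) & 255}.{(i >> 8) & 255}.{i & 255}/32"
        fib[prefix] = "NHG-primary-backup"

    # Adding NH2 changed one small shared tuple, not 800k prefix entries,
    # hence pfx1+NH1+NH2 != 2 * (pfx1+NH1) in FIB space.
    print(len(fib), nh_groups["NHG-primary-backup"])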
Mark Tinka wrote:
At the time I proposed label switching, there already was RSVP but RSVP-TE was proposed long after MPLS was proposed.
RSVP failed to take off, for whatever reason (I can think of many).
There are many. So, our research group tried to improve RSVP. Practically, the most serious problem of RSVP is, like OSPF, using unreliable link multicast to reliably exchange signalling messages between routers, making specification and implementations very complicated. So, we developed SRSVP (Simple RSVP) replacing link multicast by, like BGP, link local TCP mesh (thanks to the CATENET model, unlike BGP, there is no scalability concern). Then, it was not so difficult to remove other problems. However, perhaps, most people think the show stopper for RSVP is the lack of scalability of weighted fair queueing, though it is not a problem specific to RSVP, and MPLS shares the same problem. Obviously, weighted fair queueing does not scale because it is based on the deterministic traffic model of the token bucket and, these days, people just use ad-hoc ways for BW guarantees, implicitly assuming a stochastic traffic model. I even developed a little formal theory on scalable queueing with a stochastic traffic model. So, we have a specification and working implementation of a hop-by-hop, scalable, stable unicast/multicast interdomain QoS routing protocol supporting routing hierarchy without crankback. See http://www.isoc.org/inet2000/cdproceedings/1c/1c_1.htm for rough description of design guideline.
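For readers who have not met it, the deterministic (r, b) token bucket that WFQ-style reservations are parameterised on is tiny to write down; the rate and burst below are invented:

    import time

    class TokenBucket:
        def __init__(self, rate_bps, burst_bytes):
            self.rate = rate_bps / 8.0          # refill rate in bytes/second
            self.burst = burst_bytes            # bucket depth
            self.tokens = burst_bytes
            self.last = time.monotonic()

        def conforms(self, packet_bytes):
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= packet_bytes:
                self.tokens -= packet_bytes
                return True                     # within the (r, b) envelope
            return False                        # out of profile: queue, drop or remark

    tb = TokenBucket(rate_bps=10_000_000, burst_bytes=15_000)
    print(tb.conforms(1500))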
I'm not sure any network operator, today, would allow an end-host to make reservation requests in their core.
I didn't attempt to standardize our result in IETF, partly because optical packet switching was a lot more interesting.
Even in the Transport world, this was the whole point of GMPLS. After they saw how terrible that idea was, it shifted from customers to being an internal fight between the IP teams and the Transport teams. Ultimately, I don't think anybody really cared about routers automatically using GMPLS to reserve and direct the DWDM network.
That should be a reasonable way of practical operation, though I'm not very interested in OCS (optical circuit switching) of GMPLS
In our Transport network, we use GMPLS/ASON in the Transport network only. When the IP team needs capacity, it's a telephone job :-).
For the IP layer, that should be enough. For ASON, something as complicated as GMPLS is actually overkill. When I was playing with ATM switches, I established a control plane network with VPI/VCI=0/0 and assigned control plane IP addresses to the ATM switches. To control other VCs, simple UDP packets were sent to the switches from controlling hosts. Similar technology should be applicable to ASON. Maintaining integrity between wavelength switches is the responsibility of the controllers.
Remember that the original point of MPLS was that it should work scalably without a lot of configuration, which is not the reality recognized by people on this thread.
Well, you get the choice of LDP (low-touch) or RSVP-TE (high-touch).
No, I just explained what was advertised to be MPLS by people around Cisco against Ipsilon. According to the advertisements, you should call what you are using LS or GLS, not MPLS or GMPLS.
We don't use RSVP-TE because of the issues you describe above.
We use LDP to avoid the issues you describe above.
In the end, SR-MPLS is meant to solve this issue for TE requirements. So the signaling state-of-the-art improves with time.
Good. Assuming a central controller (and its collocated or distributed back up controllers), we don't need complicated protocols in the network to maintain integrity of the entire network.
That is certainly a problem. However, a worse problem is knowing the label values nested deeply in the MPLS label chain.
Why, how, is that a problem? For load balancing? What if an inner label becomes invalidated around the destination, which is hidden, for route scalability, from the equipment around the source?
Even worse, if the route near the destination expected to pop the label chain goes down, how can the source know that the router has gone down and choose an alternative router near the destination?
If by source you mean end-host, if the edge router they are connected to only ran IP and they were single-homed, they'd still go down.
No, as "the destination expected to pop the label" is located somewhere around the final destination end-host. If, at the destination site, connectivity between a router to pop nested label and the fine destination end-host is lost, we are at a loss, unless source side changes inner label.
MPLS with hierarchical routing just does not scale.
With Internet in a VRF, I truly agree.
But if you run a simple global BGP table and no VRF's, I don't see an issue. This is what we do, and our scaling concerns are exactly the same whether we run plain IP or IP/MPLS.
If you are using intra-domain hierarchical routing for scalability within the domain, you still suffer from lack of scalability of MPLS. And, VRF is, in a sense, a form of intra-domain hierarchical routing with a lot of flexibility, which means a lot of unnecessary complications. Masataka Ohta
On 20/Jun/20 14:41, Masataka Ohta wrote:
There are many. So, our research group tried to improve RSVP.
I'm a lot younger than the Internet, but I read a fair bit about its history. I can't remember ever coming across an implementation of RSVP between a host and the network in a commercial setting. If I missed it, kindly share, as I'd be keen to see how that went.
Practically, the most serious problem of RSVP is, like OSPF, using unreliable link multicast to reliably exchange signalling messages between routers, making specification and implementations very complicated.
So, we developed SRSVP (Simple RSVP) replacing link multicast by, like BGP, link local TCP mesh (thanks to the CATENET model, unlike BGP, there is no scalability concern). Then, it was not so difficult to remove other problems.
Was "S-RSVP" ever implemented, and deployed?
However, perhaps, most people think the show stopper for RSVP is the lack of scalability of weighted fair queueing, though it is not a problem specific to RSVP, and MPLS shares the same problem.
QoS has nothing to do with MPLS. You can do QoS with or without MPLS. I should probably point out, also, that RSVP (or RSVP-TE) is not MPLS. They collaborate, yes, but we'd be doing the community a disservice by interchanging them for one another.
Obviously, weighted fair queueing does not scale because it is based on the deterministic traffic model of the token bucket and, these days, people just use ad-hoc ways for BW guarantees, implicitly assuming a stochastic traffic model. I even developed a little formal theory on scalable queueing with a stochastic traffic model.
Maybe so, but I still don't see the relation to MPLS. All MPLS can do is convey IPP or DSCP values as an EXP code point in the core. I'm not sure how that creates a scaling problem within MPLS itself. If you didn't have MPLS, you'd be encoding those values in IPP or DSCP. So what's the issue?
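For completeness, the mapping being talked about is tiny: the MPLS header only has a 3-bit EXP/TC field, so a 6-bit DSCP is commonly reduced to its class-selector bits at label imposition. A sketch (the exact mapping is a local policy choice):

    def dscp_to_exp(dscp):
        assert 0 <= dscp <= 63
        return dscp >> 3          # keep the top three (class selector) bits

    for name, dscp in (("BE", 0), ("AF41", 34), ("EF", 46)):
        print(name, dscp, "-> EXP", dscp_to_exp(dscp))   # 0, 4, 5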
So, we have a specification and working implementation of a hop-by-hop, scalable, stable unicast/multicast interdomain QoS routing protocol supporting routing hierarchy without crankback.
See
http://www.isoc.org/inet2000/cdproceedings/1c/1c_1.htm
for rough description of design guideline.
If I understand this correctly, would this be the IntServ QoS model?
I didn't attempt to standardize our result in IETF, partly because optical packet switching was a lot more interesting.
Still is, even today :-)?
That should be a reasonable way of practical operation, though I'm not very interested in OCS (optical circuit switching) of GMPLS
Design goals are often what they are, and then the real world hits you.
For the IP layer, that should be enough. For ASON, something as complicated as GMPLS is actually overkill.
When I was playing with ATM switches, I established a control plane network with VPI/VCI=0/0 and assigned control plane IP addresses to the ATM switches. To control other VCs, simple UDP packets were sent to the switches from controlling hosts.
Similar technology should be applicable to ASON. Maintaining integrity between wavelength switches is the responsibility of the controllers.
Well, GMPLS and ASON is basically skinny OSPF, IS-IS and RSVP running in a DWDM node's control plane.
No, I just explained what was advertised to be MPLS by people around Cisco against Ipsilon.
According to the advertisements, you should call what you are using LS or GLS, not MPLS or GMPLS.
It takes a while for new technology to be fully understood, which is why I'm not rushing on to the SR bandwagon :-). I can't blame the sales droids or the customers of the day. It probably sounded like dark magic.
Assuming a central controller (and its collocated or distributed back up controllers), we don't need complicated protocols in the network to maintain integrity of the entire network.
Well, that's a point of view, I suppose. I still can't walk into a shop and "buy a controller". I don't know what this controller thing is, 10 years on. IGP's, BGP and label distribution protocols have proven themselves, in the interim.
What if an inner label becomes invalidated around the destination, which is hidden, for route scalability, from the equipment around the source?
I can't say I've ever come across that scenario running MPLS since 2004. Do you have an example from a production network that you can share with us? I'd really like to understand this better.
No, as "the destination expected to pop the label" is located somewhere around the final destination end-host.
If, at the destination site, connectivity between the router that pops the nested label and the final destination end-host is lost, we are at a loss, unless the source side changes the inner label.
Maybe a diagram would help, as I still don't get this failure scenario. If a host lost connectivity with the service provider network, getting label switching to work is pretty low on the priority list. Again, unless I misunderstand.
If you are using intra-domain hierarchical routing for scalability within the domain, you still suffer from lack of scalability of MPLS.
And, VRF is, in a sense, a form of intra-domain hierarchical routing with a lot of flexibility, which means a lot of unnecessary complications.
I don't think stuffing your VRF's full of routes is an intrinsic problem of MPLS. MPLS works whether you run l3vpn's or not. That MPLS provides a forwarding paradigm for VRF's does not put it and the potentially poor scalability of VRF's in the same WhatsApp group. Mark.
Mark Tinka wrote:
There are many. So, our research group tried to improve RSVP.
I'm a lot younger than the Internet, but I read a fair bit about its history. I can't remember ever coming across an implementation of RSVP between a host and the network in a commercial setting.
No, of course, because, as we agreed, RSVP has a lot of problems.
Was "S-RSVP" ever implemented, and deployed?
It was implemented, and some of the technology was used by a commercial router from Furukawa (a Japanese vendor that now sells optical fiber and no longer sells routers).
However, perhaps, most people think the show stopper for RSVP is the lack of scalability of weighted fair queueing, though it is not a problem specific to RSVP, and MPLS shares the same problem.
QoS has nothing to do with MPLS. You can do QoS with or without MPLS.
GMPLS, which you are using, is a mechanism to guarantee QoS by reserving wavelength resources. It is impossible for GMPLS not to offer QoS. Moreover, as some people say they offer QoS with MPLS, they should be using some prioritized queueing mechanisms, perhaps not poor WFQ.
I should probably point out, also, that RSVP (or RSVP-TE) is not MPLS.
They are different, of course. But, GMPLS is to reserve bandwidth resource. MPLS, in general, is to reserve label values, at least.
All MPLS can do is convey IPP or DSCP values as an EXP code point in the core. I'm not sure how that creates a scaling problem within MPLS itself.
I didn't say the scaling problem was caused by QoS. But, as you are avoiding extensive use of MPLS, I think you are aware that extensive use of MPLS needs management of a lot of labels, which does not scale. Or do I misunderstand something?
If I understand this correctly, would this be the IntServ QoS model?
No. IntServ specifies format to carry QoS specification in RSVP packets without assuming any specific model of QoS.
I didn't attempt to standardize our result in IETF, partly because optical packet switching was a lot more interesting.
Still is, even today :-)?
No. As experimental switches were working years ago, and making them work at >10Tbps is not difficult (switching is easy; generating 10Tbps of packets needs a lot of parallel equipment), there is little remaining for research. https://www.osapublishing.org/abstract.cfm?URI=OFC-2010-OWM4
Assuming a central controller (and its collocated or distributed back up controllers), we don't need complicated protocols in the network to maintain integrity of the entire network.
Well, that's a point of view, I suppose.
I still can't walk into a shop and "buy a controller". I don't know what this controller thing is, 10 years on.
SDN, maybe. Though I'm not saying SDN scales, it should be no worse than MPLS.
I can't say I've ever come across that scenario running MPLS since 2004.
I did some retrospective research. https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching
History
1994: Toshiba presented Cell Switch Router (CSR) ideas to IETF BOF
1996: Ipsilon, Cisco and IBM announced label switching plans
1997: Formation of the IETF MPLS working group
1999: First MPLS VPN (L3VPN) and TE deployments
2000: MPLS traffic engineering
2001: First MPLS Request for Comments (RFCs) released
as I was a co-chair of 1994 BOF and my knowledge on MPLS is mostly on 1997 ID: https://tools.ietf.org/html/draft-ietf-mpls-arch-00 there seems to be a lot of terminology changes. I'm saying that, if some failure occurs and IGP changes, a lot of LSPs must be recomputed, which does not scale if # of LSPs is large, especially in a large network where IGP needs hierarchy (such as OSPF area).
On 21/Jun/20 12:10, Masataka Ohta wrote:
It was implemented, and some of the technology was used by a commercial router from Furukawa (a Japanese vendor that now sells optical fiber and no longer sells routers).
I won't lie, never heard of it.
GMPLS, you are using, is the mechanism to guarantee QoS by reserving wavelength resource. It is impossible for GMPLS not to offer QoS.
That is/was the idea. In practice (at least in our Transport network), deploying capacity as an offline exercise is significantly simpler. In such a case, we wouldn't use GMPLS for capacity reservation, just path re-computation in failure scenarios. Our Transport network isn't overly meshed. It's just stretchy. Perhaps if one was trying to build a DWDM backbone into, out of and through every city in the U.S., capacity reservation in GMPLS may be a use-case. But unless someone is willing to pipe up and confess to implementing it in this way, I've not heard of it.
Moreover, as some people says they offer QoS with MPLS, they should be using some prioritized queueing mechanisms, perhaps not poor WFQ.
It would be a combination - PQ and WFQ depending on the traffic type and how much customers want to pay. But carrying an MPLS EXP code point does not make MPLS unscalable. It's no different to carrying a DSCP or IPP code point in plain IP. Or even an 802.1p code point in Ethernet.
They are different, of course. But, GMPLS is to reserve bandwidth resource.
In theory. What are people doing in practice? I just told you our story.
MPLS, in general, is to reserve label values, at least.
MPLS is the forwarding paradigm. Label reservation/allocation can be done manually or with a label distribution protocol. MPLS doesn't care how labels are generated and learned. It will just push, swap and pop as it needs to.
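Those three operations really are the whole forwarding story; a minimal sketch (how the labels were learned, whether via LDP, RSVP-TE, SR or static config, is irrelevant here):

    def push(stack, label):          # ingress LER: impose a label
        return [label] + stack

    def swap(stack, new_label):      # transit LSR: rewrite the top label
        return [new_label] + stack[1:]

    def pop(stack):                  # PHP / egress: remove the top label
        return stack[1:]

    pkt = []                         # label stack sitting on top of the IP packet
    pkt = push(pkt, 24001)           # ingress PE
    pkt = swap(pkt, 24007)           # P router
    pkt = pop(pkt)                   # penultimate hop pops
    print(pkt)                       # -> [] ; plain IP again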
I didn't say scaling problem caused by QoS.
But, as you are avoiding to extensively use MPLS, I think you are aware that extensive use of MPLS needs management of a lot of labels, which does not scale.
Or, do I misunderstand something?
I'm not avoiding extensive use of MPLS. I want extensive use of MPLS. In IPv4, we forward in MPLS 100%. In IPv6, we forward in MPLS 80%. This is due to vendor nonsense. Trying to fix.
No. IntServ specifies format to carry QoS specification in RSVP packets without assuming any specific model of QoS.
Then I'm failing to understand your point, especially since it doesn't sound like any operator is deploying such a model, or if so, publicly suffering from it.
No. As experimental switches are working years ago and making it work >10Tbps is not difficult (switching is easy, generating 10Tbps packets needs a lot of parallel equipment), there is little remaining for research.
We'll get there. This doesn't worry me so much :-). Either horizontally or vertically. I can see a few models to scale IP/MPLS carriage.
SDN, maybe. Though I'm not saying SDN scale, it should be no worse than MPLS.
I still can't tell you what SDN is :-). I won't suffer it in this decade, thankfully.
I did some retrospective research.
https://en.wikipedia.org/wiki/Multiprotocol_Label_Switching
History
1994: Toshiba presented Cell Switch Router (CSR) ideas to IETF BOF
1996: Ipsilon, Cisco and IBM announced label switching plans
1997: Formation of the IETF MPLS working group
1999: First MPLS VPN (L3VPN) and TE deployments
2000: MPLS traffic engineering
2001: First MPLS Request for Comments (RFCs) released
as I was a co-chair of 1994 BOF and my knowledge on MPLS is mostly on 1997 ID:
https://tools.ietf.org/html/draft-ietf-mpls-arch-00
there seems to be a lot of terminology changes.
My comment to that was in reference to your text, below: "What if an inner label becomes invalidated around the destination, which is hidden, for route scalability, from the equipment around the source?" I've never heard of such an issue in 16 years.
I'm saying that, if some failure occurs and IGP changes, a lot of LSPs must be recomputed, which does not scale if # of LSPs is large, especially in a large network where IGP needs hierarchy (such as OSPF area).
That happens every day, already. Links fail, the IGP re-converges, LDP keeps humming. RSVP-TE too, although all that state does need some consideration, especially if code is buggy. Particularly, where you have LFA/IP-FRR both in the IGP and LDP, I've not come across any issue where IGP re-convergence caused LSP's to fail. In practice, IGP hierarchy (OSPF Areas or IS-IS Levels) doesn't help much if you are running MPLS. FEC's are forged against /32 and /128 addresses. Yes, as with everything else, it's a trade-off. Mark.
I'm saying that, if some failure occurs and IGP changes, a lot of LSPs must be recomputed, which does not scale if # of LSPs is large, especially in a large network where IGP needs hierarchy (such as OSPF area).
Masataka Ohta
Actually, when the IGP changes, LSPs are not recomputed with LDP or SR-MPLS (when used without TE :). The term "LSP" is perhaps what drives your confusion --- in LDP MPLS there is no "Path" - in spite of the acronym (Label Switched *Path*). Labels are locally significant and swapped at each LSR - resulting essentially in a bunch of one-hop crossconnects. In other words, MPLS LDP strictly follows the IGP SPT at each LSR hop. Many thx, R.
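A small sketch of that last sentence: every LSR binds its own local label per FEC and simply cross-connects it to whatever its current IGP next hop advertised, so an IGP change only touches the out-label, never anything path-shaped. Labels are invented; 3 is the reserved implicit-null value:

    local_binding = {   # (router, FEC) -> label that router advertises
        ("P1",  "PE2-loopback"): 24010,
        ("P2",  "PE2-loopback"): 24500,
        ("PE2", "PE2-loopback"): 3,          # implicit null: "pop for me"
    }

    def lfib_entry(router, fec, igp_next_hop):
        in_label = local_binding[(router, fec)]
        out_label = local_binding[(igp_next_hop, fec)]
        action = "pop" if out_label == 3 else "swap"
        return (in_label, action, out_label)

    print(lfib_entry("P1", "PE2-loopback", "P2"))    # before an IGP reroute
    print(lfib_entry("P1", "PE2-loopback", "PE2"))   # after: only out-label/action change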
On 21/Jun/20 15:48, Robert Raszuk wrote:
Actually when IGP changes LSPs are not recomputed with LDP or SR-MPLS (when used without TE :).
"LSP" term is perhaps what drives your confusion --- in LDP MPLS there is no "Path" - in spite of the acronym (Labeled Switch *Path*). Labels are locally significant and swapped at each LSR - resulting essentially with a bunch of one hop crossconnects.
In other words MPLS LDP strictly follows IGP SPT at each LSR hop.
Yep, which is what I tried to explain as well. With LDP, MPLS-enabled hosts simply push, swap and pop. There is no concept of an "end-to-end LSP" as such. We just use the term "LSP" to define an FEC. But really, each node in the FEC's path is making its own push, swap and pop decisions. The LFIB in each node need only be as large as the number of LDP-enabled routers in the network. You can get scenarios where FEC's are also created for infrastructure links, but if you employ filtering to save on FIB slots, you really just need to allocate labels to Loopback addresses only. Mark.
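The filtering mentioned above is conceptually just this (prefixes invented): only /32 loopbacks become FECs, so the LFIB tracks the number of LDP speakers rather than everything the IGP carries.

    import ipaddress

    igp_routes = [
        "10.255.0.1/32", "10.255.0.2/32", "10.255.0.3/32",   # router loopbacks
        "10.10.1.0/31", "10.10.1.2/31", "10.10.1.4/31",      # infrastructure links
    ]

    def wants_label(prefix):
        return ipaddress.ip_network(prefix).prefixlen == 32  # loopbacks only

    fecs = [p for p in igp_routes if wants_label(p)]
    print(len(fecs), fecs)    # one LFIB entry per LDP-enabled router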
The LFIB in each node need only be as large as the number of LDP-enabled routers in the network.
That is true for P routers ... not so much for PEs. Please observe that the label space in each PE router is divided between IGP and BGP, as well as other label-hungry services ... there are many consumers of the local label block. So it is always the case that the LFIB table (max 2^20 entries - 1M) on PEs is much larger than the LFIB on P nodes. Thx, R. On Sun, Jun 21, 2020 at 6:01 PM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 21/Jun/20 15:48, Robert Raszuk wrote:
Actually when IGP changes LSPs are not recomputed with LDP or SR-MPLS (when used without TE :).
"LSP" term is perhaps what drives your confusion --- in LDP MPLS there is no "Path" - in spite of the acronym (Labeled Switch *Path*). Labels are locally significant and swapped at each LSR - resulting essentially with a bunch of one hop crossconnects.
In other words MPLS LDP strictly follows IGP SPT at each LSR hop.
Yep, which is what I tried to explain as well. With LDP, MPLS-enabled hosts simply push, swap and pop. There is no concept of an "end-to-end LSP" as such. We just use the term "LSP" to define an FEC. But really, each node in the FEC's path is making its own push, swap and pop decisions.
The LFIB in each node need only be as large as the number of LDP-enabled routers in the network. You can get scenarios where FEC's are also created for infrastructure links, but if you employ filtering to save on FIB slots, you really just need to allocate labels to Loopback addresses only.
Mark.
On 21/Jun/20 19:34, Robert Raszuk wrote:
That is true for P routers ... not so much for PEs.
Please observe that label space in each PE router is divided for IGP and BGP as well as other label hungy services ... there are many consumers of local label block.
So it is always the case that LFIB table (max 2^20 entries - 1M) on PEs is much larger then LFIB on P nodes.
I should point out that all of my input here is based on simple MPLS forwarding of IP traffic in the global table. In this scenario, labels are only assigned to BGP next-hops, which is typically an IGP Loopback address. Labels don't get assigned to BGP routes in a global table. There is no use for that. Of course, as this is needed in VRF's and other BGP-based VPN services, the extra premium customers pay for that privilege may be considered warranted :-). Mark.
I should point out that all of my input here is based on simple MPLS forwarding of IP traffic in the global table. In this scenario, labels are only assigned to BGP next-hops, which is typically an IGP Loopback address.
Well this is true for one company :) Name starts with j .... The other company, name starting with c, at least some time back by default allocated labels for all routes in the RIB, either connected or static or sourced from IGP. Sure you could always limit that with a knob if desired. The issue with allocating labels only for BGP next hops is that your IP/MPLS LFA breaks (or more directly is not possible) as you do not have a label to the PQ node upon failure. Hint: the PQ node is not even running BGP :). Sure selective folks still count on "IGP Convergence" to restore connectivity. But I hope those will move to much faster connectivity restoration techniques soon.
Labels don't get assigned to BGP routes in a global table. There is no use for that.
Sure - True. Cheers, R,
On 21/Jun/20 22:21, Robert Raszuk wrote:
Well this is true for one company :) Name starts with j ....
Other company name starting with c - at least some time back by default allocated labels for all routes in the RIB either connected or static or sourced from IGP. Sure you could always limit that with a knob if desired.
Juniper allocates labels to the Loopback only. Cisco allocates labels to all IGP and interface routes. Neither allocate labels to BGP routes for the global table.
The issue with allocating labels only for BGP next hops is that your IP/MPLS LFA breaks (or more directly is not possible) as you do not have a label to PQ node upon failure. Hint: PQ node is not even running BGP :).
Wouldn't T-LDP fix this, since LDP LFA is a targeted session? Need to test.
Sure selective folks still count on "IGP Convergence" to restore connectivity. But I hope those will move to much faster connectivity restoration techniques soon.
We are happy :-). Mark.
Wouldn't T-LDP fix this, since LDP LFA is a targeted session?
Nope. You need to get to PQ node via potentially many hops. So you need to have even ordered or independent label distribution to its loopback in place. Best, R. On Sun, Jun 21, 2020 at 10:58 PM Mark Tinka <mark.tinka@seacom.mu> wrote:
On 21/Jun/20 22:21, Robert Raszuk wrote:
Well this is true for one company :) Name starts with j ....
Other company name starting with c - at least some time back by default allocated labels for all routes in the RIB either connected or static or sourced from IGP. Sure you could always limit that with a knob if desired.
Juniper allocates labels to the Loopback only.
Cisco allocates labels to all IGP and interface routes.
Neither allocate labels to BGP routes for the global table.
The issue with allocating labels only for BGP next hops is that your IP/MPLS LFA breaks (or more directly is not possible) as you do not have a label to PQ node upon failure. Hint: PQ node is not even running BGP :).
Wouldn't T-LDP fix this, since LDP LFA is a targeted session?
Need to test.
Sure selective folks still count on "IGP Convergence" to restore connectivity. But I hope those will move to much faster connectivity restoration techniques soon.
We are happy :-).
Mark.
On 21/Jun/20 23:01, Robert Raszuk wrote:
Nope. You need to get to PQ node via potentially many hops. So you need to have even ordered or independent label distribution to its loopback in place.
I have some testing I want to do with IS-IS only announcing the Loopback from a set of routers to the rest of the backbone, and LDP allocating labels for it accordingly, to solve a particular problem. I'll test this out and see what happens re: LDP LFA. Mark.
From: NANOG <nanog-bounces@nanog.org> On Behalf Of Masataka Ohta Sent: Friday, June 19, 2020 5:01 PM
Robert Raszuk wrote:
So I think Ohta-san's point is about scalability of services, not flat underlay RIB and FIB sizes. Many years ago we had requests to support 5M L3VPN routes while the underlay was just 500K IPv4.
That is certainly a problem. However, a worse problem is knowing the label values nested deeply in the MPLS label chain.
Even worse, if the route near the destination expected to pop the label chain goes down, how can the source know that the router has gone down and choose an alternative router near the destination?
Via the IGP or a controller; but for sub-50ms convergence there are edge node protection mechanisms, so the point is the source doesn't even need to know about it for the restoration to happen. adam
On 19/Jun/20 17:13, Robert Raszuk wrote:
So I think Ohta-san's point is about scalability of services, not flat underlay RIB and FIB sizes. Many years ago we had requests to support 5M L3VPN routes while the underlay was just 500K IPv4.
Ah, if the context, then, was l3vpn scaling, yes, that is a known issue. Apart from the global table vs. VRF parity concerns I've always had (one of which was illustrated earlier this week, on this list, with RPKI in a VRF), the other reason I don't do Internet in a VRF is because it was always a trade-off: - More routes per VRF = fewer VRF's. - More VRF's = fewer routes per VRF. Going forward, I believe the l3vpn pressures (for pure VPN services, not Internet in a VRF) should begin to subside as businesses move on-prem workloads to the cloud, bite into the SD-WAN train, and generally, do more stuff over the public Internet than via inter-branch WAN links formerly driven by l3vpn. Time will tell, but in Africa, bar South Africa, l3vpn's were never a big thing, mostly because Internet connectivity was best served from one or two major cities, where most businesses had a branch that warranted connectivity. But even in South Africa (as the rest of our African market), 98% of our business is plain IP. The other 2% is mostly l2vpn. l3vpn's don't really feature, except for some in-house enterprise VoIP carriage + some high-speed in-band management. Even with the older South African operators that made a killing off l3vpn's, these are falling away as their customers either move to the cloud and/or accept SD-WAN thingies.
Last - when I originally discussed just plain MPLS with customers with single application of hierarchical routing (no BGP in the core) frankly no one was interested. Till L3VPN arrived which was game changer and run for new revenue streams ...
The BGP-free core has always sounded like a dark art. More so in the days when hardware was precious, core routers doubled as inline route reflectors and the size of the IPv4 DFZ wasn't rapidly exploding like it is today, and no one was even talking about the IPv6 DFZ. Might be useful speaking with them again, in 2020 :-). Mark.
From: NANOG <nanog-bounces@nanog.org> On Behalf Of Mark Tinka Sent: Friday, June 19, 2020 7:28 PM
On 19/Jun/20 17:13, Robert Raszuk wrote:
So I think Ohta-san's point is about scalability of services, not flat underlay RIB and FIB sizes. Many years ago we had requests to support 5M L3VPN routes while the underlay was just 500K IPv4.
Ah, if the context, then, was l3vpn scaling, yes, that is a known issue.
I wouldn't say it's known to many as not many folks are actually limited by only up to ~1M customer connections, or next level up, only up to ~1M customer VPNs.
Apart from the global table vs. VRF parity concerns I've always had (one of which was illustrated earlier this week, on this list, with RPKI in a VRF),
Well yeah, things work differently in VRFs, not a big surprise. And what about an example of bad flowspec routes/filters cutting the boxes off net -where having those flowspec routes/filters contained within an Internet VRF would not have such an effect. See, it goes either way. Would be interesting to see a comparison of good vs bad for the Internet routes in VRF vs in Internet routes in global/default routing table.
the other reason I don't do Internet in a VRF is because it was always a trade-off:
- More routes per VRF = fewer VRF's. - More VRF's = fewer routes per VRF.
No, that's just a result of having a finite FIB/RIB size - if you want to cut these resources into virtual pieces, you'll naturally get your equations above. But if you actually construct your testing to showcase the delta between how much FIB/RIB space is taken by x prefixes, each in its own VRF, as opposed to all in a single default VRF (global routing table), the delta is negligible. (Yes, negligible even in the case of the per-prefix VPN label allocation method - which I'm assuming no one is using anyway, as it inherently doesn't scale and would limit you to ~1M VPN prefixes - though the per-CE/per-next-hop VPN label allocation method gives one the same functionality as the per-prefix one while pushing the limit to ~1M PE-CE links/IFLs, which from my experience is sufficient for most folks out there). adam
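Back-of-the-envelope arithmetic for the two allocation modes mentioned above (all counts invented, just to show where the 2^20 label space goes on a PE):

    LABEL_SPACE = 2 ** 20                 # ~1M labels per label space

    ce_links = 1800                       # PE-CE links/IFLs on this PE
    vpn_prefixes_per_ce = 500

    per_prefix = ce_links * vpn_prefixes_per_ce   # one label per VPN route
    per_ce = ce_links                             # one label per CE/next hop

    for mode, used in (("per-prefix", per_prefix), ("per-CE/next-hop", per_ce)):
        print(f"{mode}: {used} labels, {used / LABEL_SPACE:.1%} of the space")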
On 21/Jun/20 21:15, adamv0025@netconsultings.com wrote:
I wouldn't say it's known to many as not many folks are actually limited by only up to ~1M customer connections, or next level up, only up to ~1M customer VPNs.
It's probably less of a problem now than it was 10 years ago. But, yes, I don't have any real-world experience.
Well yeah, things work differently in VRFs, not a big surprise. And what about an example of bad flowspec routes/filters cutting the boxes off net -where having those flowspec routes/filters contained within an Internet VRF would not have such an effect. See, it goes either way. Would be interesting to see a comparison of good vs bad for the Internet routes in VRF vs in Internet routes in global/default routing table.
Well, the global table is the basics, and VRF's is where sexy lives :-).
No, that's just a result of having a finite FIB/RIB size -if you want to cut these resources into virtual pieces you'll naturally get your equations above. But if you actually construct your testing to showcase the delta between how much FIB/RIB space is taken by x prefixes with each in a VRF as opposed to all in a single default VRF (global routing table) the delta is negligible. (Yes negligible even in case of per prefix VPN label allocation method -which I'm assuming no one is using anyways as it inherently doesn't scale and would limit you to ~1M VPN prefixes though per-CE/per-next-hop VPN label allocation method gives one the same functionality as per-prefix one while pushing the limit to ~1M PE-CE links/IFLs which from my experience is sufficient for most folks out there).
Like I said, with today's CPU's and memory, probably not an issue. But it's not an area I play in, so those with more experience - like yourself - would know better. Mark.
Mark Tinka wrote:
I wouldn't agree.
MPLS is a purely forwarding paradigm, as is hop-by-hop IP.
As the first person to have proposed the forwarding paradigm of label switching, I have been fully aware from the beginning that:
https://tools.ietf.org/html/draft-ohta-ip-over-atm-01
Conventional Communication over ATM in a Internetwork Layer
The conventional communication, that is communication that does not assume connectivity, is no different from that of the existing IP, of course.
special, prioritized forwarding should be done only by special request by end users (by properly designed signaling mechanism, for which RSVP failed to be) or administration does not scale.
Even with hop-by-hop IP, you need the edge to be routing-aware.
The edge being routing-aware around itself does scale. The edge being routing-aware of the destinations of all the flows over it does not scale, which is the problem of MPLS. Though the lack of equipment scalability went unnoticed by many, thanks to Moore's law, unscalable administration costs a lot. As a result, administration of MPLS has been costing a lot.
I wasn't at the table when the MPLS spec. was being dreamed up,
I was there before poor MPLS was dreamed up.
If you can tell me how NOT running MPLS affords you a "hierarchical, scalable" routing table, I'm all ears.
Are you saying the inter-domain routing table is not "hierarchical, scalable" except for the reason of multihoming? As for the multihoming problem, see, for example: https://tools.ietf.org/html/draft-ohta-e2e-multihoming-03
Whether you forward in IP or in MPLS, scaling routing is an ever clear & present concern.
Not. Even without MPLS, fine tuning of BGP does not scale. However, just as using plain IP router costs less than using MPLS capable IP routers, BGP-only administration costs less than BGP and MPLS administration. For better networking infrastructure, extra cost should be spent for L1, not MPLS or very complicated technologies around it. Masataka Ohta
On 19/Jun/20 17:40, Masataka Ohta wrote:
As the first person to have proposed the forwarding paradigm of label switching, I have been fully aware from the beginning that:
https://tools.ietf.org/html/draft-ohta-ip-over-atm-01
Conventional Communication over ATM in a Internetwork Layer
The conventional communication, that is communication that does not assume connectivity, is no different from that of the existing IP, of course.
special, prioritized forwarding should be done only by special request by end users (by properly designed signaling mechanism, for which RSVP failed to be) or administration does not scale.
I could be wrong, but I get the feeling that you are speaking about RSVP in its original form, where hosts were meant to make calls (CAC) into the network to reserve resources on their behalf. As we all know, that never took off, even though I saw some ideas about it being proposed for mobile phones as well. I don't think there ever was another attempt to get hosts to reserve resources within the network, since the RSVP failure.
Not. Even without MPLS, fine tuning of BGP does not scale.
We all know this, and like I said, that is a current concern.
However, just as using plain IP router costs less than using MPLS capable IP routers, BGP-only administration costs less than BGP and MPLS administration.
For better networking infrastructure, extra cost should be spent for L1, not MPLS or very complicated technologies around it.
In the early 2000's, I would have agreed with that. Nowadays, there is a very good chance that a box you require a BGP DFZ on inherently supports MPLS, likely without extra licensing. Mark.
participants (11)
-
adamv0025@netconsultings.com
-
Baldur Norddahl
-
Dorian Kim
-
Fletcher Kittredge
-
Mark Tinka
-
Masataka Ohta
-
Nick Hilliard
-
Owen DeLong
-
Randy Bush
-
Robert Raszuk
-
Saku Ytti