Re: Intradomain Traffic Engineering

18 Jan 2006

...
1. In the traces I have, there exist several intervals with a huge, 
sudden increase of traffic on some links. The prediction model I use 
cannot predict those 'big spikes'. Do these 'big spikes' really happen 
in operational networks? Or are they merely measurement errors? If they 
really happen, is there a gradual ramp-up of traffic on a smaller time 
scale, say, on the order of tens of seconds? Or do these 'big spikes' 
really occur very quickly, say, in a few seconds?
2. I have the option to make a tradeoff between average-case 
performance and a worst-case performance guarantee, but I don't know 
which one you consider more important. Are ISP networks currently 
optimized for worst-case or average-case performance? Is the tradeoff 
between these two an appealing idea, or are ISP networks perhaps 
already doing it?
This email covers a lot of issues; perhaps it'll start a discussion.

I think the answer depends on how big a core you are talking about. 
Excluding local effects (the operator of the network bounces a link or 
loses a router, etc.), I doubt that a significantly large network sees 
many effects that shift traffic faster than tens of seconds (the upper 
bound on this statement is ~30 seconds).

For example, if you "lose" a BGP session, it may take more than 30 
seconds for the router to notice it. Once it realizes that it's gone, it 
may re-route traffic very rapidly. But it would still take a while (at 
least a few seconds for a local link, more for a backbone link) before 
that traffic really renormalizes. This has more to do with TCP noticing 
packet loss, backing off [only for the traffic that has been affected], 
and starting back up. It takes up to half a second to *establish* a 
single TCP session on an average-latency link.

So, the trick would be to discover that the traffic has gone away or 
gone wonky before the BGP session is dropped. This would allow your 
algorithm to back off until a new /normal/ has been established.
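
To make the "back off until a new normal" idea concrete, here is a 
minimal sketch, not anything from an actual TE system. It assumes you 
have periodic rate samples for a link, tracks a baseline with an 
exponentially weighted moving average, and flags a "backoff" state when 
a sample jumps away from that baseline. The function name and the 
alpha, threshold, and settle parameters are hypothetical and would need 
tuning against real traces.

```python
def classify_samples(rates, alpha=0.2, threshold=0.5, settle=3):
    """Label each rate sample as "normal" or "backoff".

    alpha     -- EWMA weight given to the newest sample (assumed value)
    threshold -- relative deviation from baseline that counts as a shift
    settle    -- quiet samples required before declaring a new normal
    """
    baseline = None
    remaining = 0  # quiet samples still needed before leaving backoff
    states = []
    for rate in rates:
        if baseline is None:
            # The first sample just seeds the baseline.
            baseline = float(rate)
            states.append("normal")
            continue
        deviation = abs(rate - baseline) / max(baseline, 1e-9)
        if deviation > threshold:
            # Sudden shift: restart the baseline at the new level and
            # hold off until the traffic has settled there.
            baseline = float(rate)
            remaining = settle
        else:
            # Quiet sample: fold it into the baseline, count it as settled.
            baseline = alpha * rate + (1 - alpha) * baseline
            if remaining > 0:
                remaining -= 1
        states.append("backoff" if remaining > 0 else "normal")
    return states


# Hypothetical link rates in Mb/s: steady near 100, then a drop to ~20.
example_rates = [100, 102, 98, 101, 20, 22, 21, 20, 21]
example_states = classify_samples(example_rates)
```

With these made-up numbers the drop to 20 Mb/s starts the backoff, and 
the state returns to normal on the third quiet sample at the new level.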

However, talk of traffic engineering and maximum utilization always 
comes into vogue when folks want to squeeze more utilization out of their 
networks without really spending more money. IMO, the best time to use 
TE is when customer-links to your network approach your maximum core 
speed [relative here... there is /core/ in your datacenter/pop and there 
is /core/ that is your network to the point the packets get handed off 
on average]. Often this limit on the operator's core is technology 
imposed (though budgetary concerns get in there too).

I think the technology doesn't really exist at a scalable level to 
operate for the worst-case scenario, despite what some people may say. 
Our traffic measurement/link measurement tools are almost all average... 
and "spot" checks are of only marginal value. I would suggest that this 
is because of the nature of TCP. If the Internet were UDP based, there 
would be a *lot* more flash traffic problems. So those who have a high 
amount of UDP traffic (media streamers, DNS hosts, etc.) would have a 
very different experience.
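
A small worked example of why averaged measurements say little about 
the worst case. This is only an illustration: the 5-minute window, the 
per-second samples, and all of the numbers are assumptions, not data 
from any real link.

```python
def window_average_and_peak(per_second_mbps):
    """Return (average, peak) for one measurement window of rate samples."""
    average = sum(per_second_mbps) / len(per_second_mbps)
    peak = max(per_second_mbps)
    return average, peak


# Hypothetical 300-second window: 290 s at 400 Mb/s plus a 10 s burst
# that fills a 1000 Mb/s link.
window = [400.0] * 290 + [1000.0] * 10
average_mbps, peak_mbps = window_average_and_peak(window)
print(f"average {average_mbps:.0f} Mb/s, peak {peak_mbps:.0f} Mb/s")
```

Here the window average is 420 Mb/s even though the link was full for 
ten seconds, which is the kind of event an average-based tool will not 
show.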

I'm not the first person to say it, and I can't remember the first place 
I heard it... but I'd suggest that the core is not where TE has the most 
benefit. Cores by their nature need to be overengineered. You have 
very little flexibility because the demands on them are wide [they need 
to handle UDP and TCP, and both low-latency applications and those that 
tolerate high latency, with aplomb].

TE belongs to the customer or non-backbone-operating ISP. If I were to 
start an ISP where all residential customer connections were 1Gb/s, I 
could conceivably have thousands of customers operating without needing 
200Gb/s of uplink [assuming that were really feasible for a network with 
very little traffic terminating on the network]. By using TE I could 
shape my peak traffic needs (MLU) to approach my average. This would 
make me a much more desirable customer to sell transit to.
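
The arithmetic behind that claim can be sketched as follows. All of the 
inputs (2000 customers, 20 Mb/s average per customer, the 
peak-to-average factors) are hypothetical numbers chosen for 
illustration; only the 1Gb/s access speed and the 200Gb/s figure come 
from the paragraph above.

```python
def oversubscription_ratio(customers, access_mbps, uplink_mbps):
    """Total sold access capacity divided by uplink capacity."""
    return customers * access_mbps / uplink_mbps


def required_uplink_mbps(customers, avg_per_customer_mbps, peak_to_average):
    """Uplink needed to carry the aggregate peak.

    peak_to_average is the ratio of aggregate peak to aggregate average
    traffic; shaping with TE aims to push it toward 1.
    """
    return customers * avg_per_customer_mbps * peak_to_average


customers = 2000       # assumed customer count
access_mbps = 1000     # 1 Gb/s access links
avg_mbps = 20          # assumed average usage per customer

ratio = oversubscription_ratio(customers, access_mbps, 200_000)
unshaped = required_uplink_mbps(customers, avg_mbps, peak_to_average=3.0)
shaped = required_uplink_mbps(customers, avg_mbps, peak_to_average=1.5)
print(f"oversubscription against 200 Gb/s: {ratio:.0f}:1")
print(f"uplink needed, unshaped: {unshaped / 1000:.0f} Gb/s")
print(f"uplink needed, shaped:   {shaped / 1000:.0f} Gb/s")
```

Under these assumptions, halving the peak-to-average factor halves the 
uplink that has to be bought (120 Gb/s down to 60 Gb/s), and either 
figure is well under 200 Gb/s.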

TE, MLU, and related concerns, while well understood by core operators, 
aren't understood by customers. Core operators may eventually need 
to push these concerns to customers if backbone link speeds do not stay 
far above end-user connection speeds. [on an ICB basis, they are -- 
whenever you want to buy a few OC-48s in a single POP or an OC-192 
customer connection, someone is always going to ask you what your 
traffic looks like and when]. This would be easiest to push to 
customers through differential pricing. Enforcement and analysis of 
*what* is a desirable traffic pattern, and what financial value that 
provides, is where we are largely lacking today. Since customers know 
their traffic and their needs better than a core operator does, they 
would be much better at enforcing traffic flows/engineering. This is 
better than a core that optimizes for its own link utilization instead 
of one that just tries to stay as empty as possible for as long as 
possible.

This is way early in the day for me, so this may not make any sense.

YMMV,

Deepak Jain
AiNET