Elephant in the room - Akamai
Let's talk Akamai. They have shifted 90% of their traffic off IXs and onto our full-route DIA. Anyone else seeing this issue, or have insight as to what is going on over there? We have been asking for help on a resolution for weeks, and all we get is "we are working on it" and now we get no response. We were even sent an LOA, and when the DC went to put in the x-connect, their patch panel was full. How do they not know whether they have ports open or not? I have even reached out to an engineer who is on this list and he does not respond. The last two nights the traffic levels to them have skyrocketed as well. Any insight? Erich Kaiser The Fusion Network
I don't have any insight but can confirm I am seeing the same thing. (Traffic shift back onto transit links) They did tell me they were having some bandwidth issues and are working on it. I am currently awaiting a direct PNI with them but haven't heard from them in some time. ________________________________ From: NANOG <nanog-bounces@nanog.org> on behalf of Kaiser, Erich <erich@gotfusion.net> Sent: Thursday, December 5, 2019 3:03 AM To: NANOG list <nanog@nanog.org> Subject: Elephant in the room - Akamai
On Wed, Dec 4, 2019, 19:05 Kaiser, Erich <erich@gotfusion.net> wrote:
Let's talk Akamai
[...]
The last two nights the traffic levels to them have skyrocketed as well.
Any insight?
Erich Kaiser The Fusion Network
As a CDN, I would usually expect to see traffic *from* Akamai to be the large direction. If you're seeing your traffic *to* them skyrocketing, are you sure you aren't carrying DDoS attack traffic at them? CDNs aren't known for being large traffic sinks. ^_^;; Matt
On 12/5/19 8:48 AM, Matthew Petach wrote:
On Wed, Dec 4, 2019, 19:05 Kaiser, Erich <erich@gotfusion.net <mailto:erich@gotfusion.net>> wrote:
Let's talk Akamai
[...]
The last two nights the traffic levels to them have skyrocketed as well.
Any insight?
Erich Kaiser The Fusion Network
As a CDN, I would usually expect to see traffic *from* Akamai to be the large direction.
If you're seeing your traffic *to* them skyrocketing, are you sure you aren't carrying DDoS attack traffic at them?
CDNs aren't known for being large traffic sinks. ^_^;;
Matt
I think he meant inbound (from). We also saw the same thing.
Yes, inbound. The patterns are not typical; we are talking gigs of traffic moved off the IX side and onto our DIA side. They have reached out to me already. Will see what happens. Will post a follow-up. On Thu, Dec 5, 2019 at 3:15 AM Bryan Holloway <bryan@shout.net> wrote:
On 12/5/19 8:48 AM, Matthew Petach wrote:
On Wed, Dec 4, 2019, 19:05 Kaiser, Erich <erich@gotfusion.net <mailto:erich@gotfusion.net>> wrote:
Let's talk Akamai
[...]
The last two nights the traffic levels to them have skyrocketed as well.
Any insight?
Erich Kaiser The Fusion Network
As a CDN, I would usually expect to see traffic *from* Akamai to be the large direction.
If you're seeing your traffic *to* them skyrocketing, are you sure you aren't carrying DDoS attack traffic at them?
CDNs aren't known for being large traffic sinks. ^_^;;
Matt
I think he meant inbound (from). We also saw the same thing.
Good morning! If you are having Akamai issues you can reach out to me and I will help you out. - Jared
On Dec 4, 2019, at 10:04 PM, Kaiser, Erich <erich@gotfusion.net> wrote:
Let's talk Akamai
They have shifted 90% of their traffic off IXs and onto our full-route DIA. Anyone else seeing this issue, or have insight as to what is going on over there? We have been asking for help on a resolution for weeks, and all we get is "we are working on it" and now we get no response. We were even sent an LOA, and when the DC went to put in the x-connect, their patch panel was full. How do they not know whether they have ports open or not? I have even reached out to an engineer who is on this list and he does not respond.
The last two nights the traffic levels to them have skyrocketed as well.
Any insight?
Erich Kaiser The Fusion Network
I see my Akamai AANP cache utilization at all-time highs the last 2 nights as well. Curious what it is. Jared, you can reply to me off-list if you wish, or on-list if it would benefit the community. Thanks, Aaron
Our AANP cache seems to have done the same in the past 2 nights. Lots of traffic that has never been there before. It has however not reduced the amount of traffic we're getting from AS20940 directly - still hit a new record level last night. We've got a request in with them for a PNI. If things keep growing at this rate, we might need two! Over the years, I've questioned how much the AANP boxes really did for us, as their in:out ratio seemed almost balanced. If the last two nights are an indication, then they're worth keeping. At 09:39 AM 05/12/2019, Aaron Gould wrote:
I see my Akamai AANP cache utilization at all-time highs the last 2 nights as well. Curious what it is.
Jared, you can reply to me off-list if you wish, or on-list if it would benefit the community.
Thanks, Aaron
-- Clayton Zekelman Managed Network Systems Inc. (MNSi) 3363 Tecumseh Rd. E Windsor, Ontario N8W 1H4 tel. 519-985-8410 fax. 519-985-8409
Tarko - wow, gaming again! It's not going away; gaming traffic is growing in a big way, it seems. Clayton - my thoughts exactly! I too have wondered how valuable these AANPs were, but lately I'm seeing good efficiency. Thanks y'all -Aaron
On Thu, 05 Dec 2019 14:41:30 -0600, "Aaron Gould" said:
Tarko. wow, gaming again ! It's not going away. gaming traffic is growing in a big way it seems.
And it's only going to get worse. Sony has already announced that the Playstation 5 will have a (probably) 1-2 terabyte SSD. And even with that, the game packaging is set up to support only downloading the single-player or multi-player portions of a game because images are going to be pushing 100 gigabytes RSN (some are already well over 40gig). So even with the download restructuring, we're probably going to be seeing a lot of people downloading lots of gigabytes on Day 1 (or a few days before, for games that support it), and re-downloading smaller (but still large) amounts when they want to re-play the game...
On 12/5/19 1:44 PM, Valdis Klētnieks wrote:
On Thu, 05 Dec 2019 14:41:30 -0600, "Aaron Gould" said:
Tarko - wow, gaming again! It's not going away; gaming traffic is growing in a big way, it seems.

And it's only going to get worse. Sony has already announced that the Playstation 5 will have a (probably) 1-2 terabyte SSD. And even with that, the game packaging is set up to support only downloading the single-player or multi-player portions of a game because images are going to be pushing 100 gigabytes RSN (some are already well over 40gig).
So even with the download restructuring, we're probably going to be seeing a lot of people downloading lots of gigabytes on Day 1 (or a few days before, for games that support it), and re-downloading smaller (but still large) amounts when they want to re-play the game...
I suspect that it's going to be even worse on the home side. A while ago a friend was here and, unbeknownst to me, he was downloading a big game. The rest of the home network was rendered unusable, and it took me over an hour to figure out what was going on. I knew what to look for -- and even then, given the awful tools that routers support, it was hard -- but just about anybody else would have been on the phone to their provider saying that "INTERTOOBS ARE SLOW!". My suspicion is that the root problem was buffer bloat -- I flashed a new router with OpenWrt and was a little dismayed that the bufferbloat code is a plugin you have to enable. The buffer bloat got a lot better after that, but I forgot to retest the downloading afterward, so I'm not 100% positive. But if it was the problem, we're probably in for a world of hurt, as I doubt that many home routers implement it. Mike
On Thu, 05 Dec 2019 14:18:07 -0800, Michael Thomas said:
My suspicion is that the root problem was buffer bloat -- I flashed a new router with OpenWrt and was a little dismayed that the bufferbloat code is a plugin you have to enable. The buffer bloat got a lot better
Friends don't let friends run factory firmware. :) Hopefully sometime soon the SQM stuff will be added to the default OpenWrt configs for most of the supported routers, if it hasn't been already. It's been in my config since before the Luci support for SQM got created.... The big problem is that a lot of eyeball networks have a lot of CPE boxes that were created before the bufferbloat work was done, and often have no real motivation to push software updates to the CPE (if they even have the ability), and a lot of customers have routers that they bought at Best Buy or Walmart that will *never* get a software update. (I also admit having no idea what percentage of the intermediate routers in the ISPs' networks have gotten de-bloating code.)
On 12/5/19 6:02 PM, Valdis Klētnieks wrote:
(I also admit having no idea what percentage of the intermediate routers in the ISPs' networks have gotten de-bloating code.)
For SP-grade routers, there isn't "code" that needs to be added to combat buffer bloat. All an admin has to do is cut back on the number of packet buffers on each interface -- an interface setting, you see. The reason that consumer-grade devices can contribute to buffer bloat is that the vendor doesn't expose a knob to adjust buffering. At least in most instances with Best Buy and Office Depot routers.
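To make the tradeoff concrete, here is a toy tail-drop queue simulation (not any vendor's actual buffer logic; the arrival and service rates are made up for illustration). It shows why simply shrinking the buffer is a blunt instrument: worst-case queueing delay is capped, but drops go up.

```python
# Toy single-link simulation: packets arrive faster than the link can
# drain them, and the only knob is the tail-drop buffer size.
# Illustrative only -- real router buffer tuning involves far more.

def run_link(buffer_pkts, arrivals=1000, arrival_gap_us=80, service_us=100):
    """FIFO tail-drop queue; returns (drops, worst queueing delay in us)."""
    queue = 0            # packets currently buffered
    drops = 0
    worst_delay = 0
    next_departure = service_us
    now = 0
    for _ in range(arrivals):
        now += arrival_gap_us
        # drain packets that finished service before this arrival
        while queue and next_departure <= now:
            queue -= 1
            next_departure += service_us
        if not queue:
            next_departure = now + service_us
        if queue >= buffer_pkts:
            drops += 1       # buffer full: tail drop
        else:
            queue += 1
            worst_delay = max(worst_delay, queue * service_us)
    return drops, worst_delay

deep = run_link(buffer_pkts=500)     # bloated buffer: no loss, huge delay
shallow = run_link(buffer_pkts=20)   # shallow buffer: bounded delay, loss
```

With a 1.25x overloaded link, the deep buffer absorbs everything at the cost of tens of milliseconds of standing delay, while the shallow buffer caps delay but throws packets away -- which is the loss-versus-latency tradeoff discussed in the follow-up.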
Sent from my iPad
On Dec 5, 2019, at 9:03 PM, Stephen Satchell <list@satchell.net> wrote:
For SP-grade routers, there isn't "code" that needs to be added to combat buffer bloat. All an admin has to do is cut back on the number of packet buffers on each interface -- an interface setting, you see.
A common misconception, and one that disagrees with the research on the topic.

Let me describe this conceptually. Think of a file transfer (a streaming flow can be thought of in those terms, as can web pages, etc.) as four groups of packets: those that have been delivered correctly and therefore don't affect the window or flow rate; those that have been delivered out of order and therefore reduce the window and might get retransmitted even though they need not be resent; those that are sitting in a queue somewhere and therefore add latency; and those that haven't been transmitted yet. If I have a large number of sessions transiting an interface, each one is likely to have a packet or two near the head of the queue; after that, it tends to thin out, with the sessions with the largest windows having packets deep in the queue, and sessions with smaller windows not so much.

If you reduce the queue depth, it does reduce that deep-in-the-queue group - there is no storage deep in the queue to hold it. What it does, however, is increase any given packet's probability of loss (loss being the extreme case of delay, and the byproduct when you reduce delay unintelligently), and therefore the second category of packets - the ones that managed to get through after a packet was lost, and therefore arrived out of order and have some probability of being retransmitted and therefore delivered multiple times.

What AQM technologies attempt to do (we can argue about the relative degree of success of different technologies; I'm talking about them as a class) is identify sessions in that deep-in-the-queue category and cause them to temporarily reduce their windows - keeping most of their outstanding packets near the head of the queue. Reducing their windows has the effect of moving packets out of the network buffers (bufferbloat) and reordering queues in the receiving host to "hasn't been sent yet" in the sending host. That also reduces median latency, meaning that the sessions with reduced windows don't generally "slow down" - they simply keep less of their data streams in the network, with reduced median latency.
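The AQM behavior described above can be sketched in the spirit of CoDel (a small target sojourn time plus a persistence interval). This is a simplified illustration, not the published reference algorithm: the real thing adds a control law that spaces successive drops, and the constants here are just the commonly cited defaults.

```python
# Rough sketch of the CoDel AQM idea: act on a packet only when queue
# sojourn time has stayed above a small target for a full interval,
# nudging long-running flows to shrink their windows. Simplified from
# the published algorithm (no drop-spacing control law).

TARGET_MS = 5      # acceptable standing queue delay
INTERVAL_MS = 100  # how long delay must persist before acting

class CodelLite:
    def __init__(self):
        self.first_above = None   # deadline set when delay first exceeds target

    def should_drop(self, now_ms, sojourn_ms):
        if sojourn_ms < TARGET_MS:
            self.first_above = None        # queue drained enough; reset
            return False
        if self.first_above is None:
            self.first_above = now_ms + INTERVAL_MS
            return False                   # above target, but be patient
        return now_ms >= self.first_above  # persisted a full interval: act

q = CodelLite()
# 8 ms sojourn is above target, but nothing happens until it persists:
early = q.should_drop(now_ms=0, sojourn_ms=8)    # False: start the clock
late = q.should_drop(now_ms=150, sojourn_ms=8)   # True: standing queue
```

The key point matching the text: a transient burst deep in the queue is tolerated, while a *standing* queue (the deep-in-the-queue sessions) triggers the signal that shrinks the offending windows.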
On 12/8/19 3:37 PM, Fred Baker wrote:
Sent from my iPad
On Dec 5, 2019, at 9:03 PM, Stephen Satchell <list@satchell.net> wrote:
For SP-grade routers, there isn't "code" that needs to be added to combat buffer bloat. All an admin has to do is cut back on the number of packet buffers on each interface -- an interface setting, you see.

A common misconception, and one that disagrees with the research on the topic.
Let me describe this conceptually. Think of a file transfer (a streaming flow can be thought of in those terms, as can web pages etc) as four groups of packets - those that have been delivered correctly and therefore don’t affect the window or flow rate, those that have been delivered out of order and therefore reduce the window and might get retransmitted even though they need not be resent, those that are sitting in a queue somewhere and therefore add latency, and those that haven’t been transmitted yet. If I have a large number of sessions transiting an interface, each one is likely to have a packet or two near the head of the queue; after that, it tends to thin out, with the sessions with the largest windows having packets deep in the queue, and sessions with smaller windows not so much.
If you reduce the queue depth, it does reduce that deep-in-the-queue group - there is no storage deep in the queue to hold it. What it does, however, is increase any given packet’s probability of loss (loss being the extreme case of delay, and when you reduce delay unintelligently is the byproduct), and therefore the second category of packets - the ones that managed to get through after a packet was lost, and therefore arrived out of order and have some probability of being retransmitted and therefore delivered multiple times.
What AQM technologies attempt to do (we can argue about the relative degree of success in different technologies; I’m talking about them as a class) is identify sessions in that deep-in-the-queue category and cause them to temporarily reduce their windows - keeping most of their outstanding packets near the head of the queue. Reducing their windows has the effect of moving packets out of the network buffers (bufferbloat) and reordering queues in the receiving host to “hasn’t been sent yet” in the sending host. That also reduces median latency, meaning that the sessions with reduced windows don’t generally “slow down” - they simply keep less of their data streams in the network with reduced median latency.
So are you saying, in effect, that the receiving host is essentially managing the sending host's queue by modulating the receiver's window? That seems really weird to me, and probably means I've got it wrong. How would the receiving host know when and why it should change the window, other than through loss or other measurable things? If everything is chugging away, the receiver doesn't have any idea that the sender is starving other sessions, right? It just seems to me that this is a sending host's queuing problem? Mike
On 12/5/19 6:02 PM, Valdis Klētnieks wrote:
On Thu, 05 Dec 2019 14:18:07 -0800, Michael Thomas said:
My suspicion is that the root problem was buffer bloat -- I flashed a new router with OpenWrt and was a little dismayed that the bufferbloat code is a plugin you have to enable. The buffer bloat got a lot better

Friends don't let friends run factory firmware. :)
Hopefully sometime soon the SQM stuff will be added to the default openwrt configs for most of the supported routers, if it hasn't been already. It's been in my config since before the Luci support for SQM got created....
The big problem is that a lot of eyeball networks have a lot of CPE boxes that were created before the bufferbloat work was done, and often have no real motivation to push software updates to the CPE (if they even have the ability), and a lot of customers have routers that they bought at Best Buy or Walmart that will *never* get a software update.
(I also admit having no idea what percentage of the intermediate routers in the ISPs' networks have gotten de-bloating code.)
So I tested this out again after I sent out my message and it does indeed seem to be just fine: it wasn't an identical test since my friend was over wifi, but that really shouldn't affect things, I'd think. The thing I don't get is that buffer bloat is a creature of the upstream, right? I wouldn't think that the stream of acks sent from downloading the file would put much pressure on the upstream. Which makes me wonder if it's just that the old router itself was saturated and couldn't keep up. Or something. In any case, there are probably zillions of 10 year old routers out in the world, and no matter what exactly caused this for me it will probably happen for zillions of other people too. Hope support desks are ready for the deluge. Mike
Once upon a time, Valdis Klētnieks <valdis.kletnieks@vt.edu> said:
And it's only going to get worse. Sony has already announced that the Playstation 5 will have a (probably) 1-2 terabyte SSD. And even with that, the game packaging is set up to support only downloading the single-player or multi-player portions of a game because images are going to be pushing 100 gigabytes RSN (some are already well over 40gig).
Xbox One X games are already there... I'm a pretty casual gamer, and I have multiple games over 90GB (one is 117GB). -- Chris Adams <cma@cmadams.net>
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it. ~Nick From: NANOG <nanog-bounces@nanog.org> On Behalf Of Kaiser, Erich Sent: Wednesday, December 4, 2019 9:03 PM To: NANOG list <nanog@nanog.org> Subject: Elephant in the room - Akamai
Once upon a time, Fawcett, Nick <nfawcett@corp.mtco.com> said:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here. We'd had Akamai servers for many years, replaced as needed (including one failed server replaced right before they turned them off). Now about 50% of our Akamai traffic comes across transit links, not peering. This seems like it would be rather inefficient for them too... -- Chris Adams <cma@cmadams.net>
On Dec 6, 2019, at 9:59 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Fawcett, Nick <nfawcett@corp.mtco.com> said:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here. We'd had Akamai servers for many years, replaced as needed (including one failed server replaced right before they turned them off). Now about 50% of our Akamai traffic comes across transit links, not peering. This seems like it would be rather inefficient for them too…
There’s an element of scale when it comes to certain content that makes it not viable: if the majority of traffic is VOD with variable bitrates, it requires a lot more capital. Things like downloads of software updates (eg: Patch Tuesday) lend themselves to different optimizations. The hardware has a cost, as does the bandwidth. I’ll say that most places that have a few servers may only see a minor improvement in their in:out. If you’re not peering with us, or are and still see significant traffic via transit, please do reach out. I’m happy to discuss in private or at any NANOG/IETF meeting people are at. We generally have someone at most of the other NOG meetings as well, including RIPE, APRICOT and even GPF etc. I am personally always looking for better ways to serve medium (or small) size providers. - Jared
Speaking as a (very) small operator, we've also been seeing less and less of our Akamai traffic coming to us over peering over the last couple years. I've reached out to Akamai NOC as well as Jared directly on a few occasions and while they've been helpful and their changes usually have some short-term impact, the balance has always shifted back some weeks/months later. I've more or less resigned myself to this being how Akamai wants things, and as we so often have to as small fish, just dealing with it. We're currently seeing about 80% of our AS20940 origin traffic coming from transit, and I'm certain there's a significant additional amount which is difficult to identify coming from on-net caches at our upstream providers (though it appears from the thread that may be reducing as well). Only about 20% is coming from peering where we have significantly more capacity and lower costs. Whatever the algorithm is doing, from my perspective it doesn't make a lot of sense and is pretty frustrating, and I'm somewhat concerned about busting commits and possibly running into congestion for the next big event that does hit us, which would not be a problem if it were delivered over peering. Luckily we're business focussed, so we're not getting hit by these gaming events. Keenan Tims Stargate Connections Inc (AS19171) On 2019-12-06 8:13 a.m., Jared Mauch wrote:
On Dec 6, 2019, at 9:59 AM, Chris Adams <cma@cmadams.net> wrote:
Once upon a time, Fawcett, Nick <nfawcett@corp.mtco.com> said:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.

Same here. We'd had Akamai servers for many years, replaced as needed (including one failed server replaced right before they turned them off). Now about 50% of our Akamai traffic comes across transit links, not peering. This seems like it would be rather inefficient for them too…

There’s an element of scale when it comes to certain content that makes it not viable: if the majority of traffic is VOD with variable bitrates, it requires a lot more capital.
Things like downloads of software updates (eg: patch Tuesday) lend themselves to different optimizations. The hardware has a cost as well as the bandwidth as well.
I’ll say that most places that have a few servers may only see a minor improvement in their in:out. If you’re not peering with us or are and see significant traffic via transit, please do reach out.
I’m happy to discuss in private or at any NANOG/IETF meeting people are at. We generally have someone at most of the other NOG meetings as well, including RIPE, APRICOT and even GPF etc.
I am personally always looking for better ways to serve the medium (or small) size providers better.
- Jared
On 6/Dec/19 21:29, Keenan Tims wrote:
We're currently seeing about 80% of our AS20940 origin traffic coming from transit, and I'm certain there's a significant additional amount which is difficult to identify coming from on-net caches at our upstream providers (though it appears from the thread that may be reducing as well). Only about 20% is coming from peering where we have significantly more capacity and lower costs. Whatever the algorithm is doing, from my perspective it doesn't make a lot of sense and is pretty frustrating, and I'm somewhat concerned about busting commits and possibly running into congestion for the next big event that does hit us, which would not be a problem if it were delivered over peering.
We've had 2 or 3 customers, in the last 3 months, complain about the same thing - where they are seeing Akamai traffic drop over peering but preferred via their transit service with us. We run a number of Akamai AANP caches across our backbone. We are working very closely with Akamai - and the customers - to resolve this, I'll add. Mark.
On 12/6/19 06:46, Fawcett, Nick via NANOG wrote:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here, removed last month, and no more Akamai traffic over peering since.
On Dec 7, 2019, at 12:06 PM, Seth Mattinen <sethm@rollernet.us> wrote:
On 12/6/19 06:46, Fawcett, Nick via NANOG wrote:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here, removed last month, and no more Akamai traffic over peering since.
This last part doesn’t sound right. Can you send me details in private? Thanks, - Jared
Same -- we had an Akamai cache for 15+ years. Then we were notified that it was done and were sent boxes to pack our stuff up and send it back. -----Original Message----- From: "Jared Mauch" <jared@puck.nether.net> Sent: Saturday, December 7, 2019 2:05pm To: "Seth Mattinen" <sethm@rollernet.us> Cc: nanog@nanog.org Subject: Re: Elephant in the room - Akamai
On Dec 7, 2019, at 12:06 PM, Seth Mattinen <sethm@rollernet.us> wrote:
On 12/6/19 06:46, Fawcett, Nick via NANOG wrote:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here, removed last month, and no more Akamai traffic over peering since.
This last part doesn’t sound right. Can you send me details in private? Thanks, - Jared
Has there been any fundamental change in their network architecture that might explain pulling these caches? ________________________________ From: NANOG <nanog-bounces@nanog.org> on behalf of Shawn L via NANOG <nanog@nanog.org> Sent: Saturday, December 7, 2019 8:20 PM To: Jared Mauch <jared@puck.nether.net> Cc: nanog@nanog.org <nanog@nanog.org> Subject: Re: Elephant in the room - Akamai
Has there been any fundamental change in their network architecture that might explain pulling these caches?
Maybe not network architecture, but what if the cache-to-content ratio is dropping dramatically due to changes in consumer behavior and/or a huge increase in the underlying content (such as adoption of higher and multiple-resolution videos)? There has to be a tipping point at which a proportionally small cache becomes almost worthless from a traffic saving perspective. If you run a cluster one presumes you can see what your in/out ratio looks like and where the trend-line is headed. Another possibility might be security. It may be that they need additional security credentials for newer services which they are reluctant to load into remote cache clusters they don't physically control. Mark.
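As a rough illustration of that tipping point, assuming (purely hypothetically) Zipf-distributed object popularity and a cache that always holds the most popular objects: holding the cache size fixed while the catalog grows steadily erodes the hit ratio.

```python
# Back-of-envelope: a fixed-size cache holding the top-C most popular
# objects out of a Zipf(1)-popular catalog of N objects. The hit ratio
# is then roughly H(C)/H(N), with H the harmonic number. The numbers
# and the Zipf assumption are illustrative, not Akamai's actual data.

from math import log

def hit_ratio(cache_items, catalog_items):
    H = lambda n: log(n) + 0.5772  # harmonic number approximation
    return H(cache_items) / H(catalog_items)

small_catalog = hit_ratio(10_000, 100_000)     # ~0.81 offload
big_catalog = hit_ratio(10_000, 10_000_000)    # ~0.59 offload
```

Under these toy assumptions, a 100x catalog growth with no cache growth drops offload from roughly 80% to under 60% -- and bigger per-object sizes (4K video) shrink the effective cache further, which is consistent with the speculation above.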
On Dec 7, 2019, at 5:34 PM, Rod Beck <rod.beck@unitedcablecompany.com> wrote:
Has there been any fundamental change in their network architecture that might explain pulling these caches?
Please see my email on Friday where I outlined a few of the dynamics at play. Akamai isn’t just one thing, it’s an entire basket of products that all have their own resulting behaviors. This is why even though you may peer with us directly you may not see 100% of the traffic from that interconnection. (Take SSL for example, it’s often not served via the clusters in an ISP due to the security requirements we place on those racks, and this is something we treat very seriously!) This is why I’m encouraging people to ping me off-list, because the dynamics at play for one provider don’t match across the board. I know we have thousands of distinct sites that each have their own attributes and composition at play. I’ve been working hard to provide value to our AANP partners as well. I’ll try to stop responding to the list at this point but don’t hesitate to contact me here or via other means if you’re seeing something weird. I know I resolved a problem a few days ago for someone quickly as there was a misconfiguration left around. We all make mistakes and can all do better. - jared https://www.peeringdb.com/asn/20940
On 8/Dec/19 02:19, Jared Mauch wrote:
I’ve been working hard to provide value to our AANP partners as well. I’ll try to stop responding to the list at this point but don’t hesitate to contact me here or via other means if you’re seeing something weird. I know I resolved a problem a few days ago for someone quickly as there was a misconfiguration left around.. We all make mistakes and can all do better.
Problems are part of the gig - otherwise we'd have no reason to get up in the morning. What matters is that there is someone you can find to help you fix them. That's what makes all the difference. So kudos to you, Jared, and the entire team out there at Akamai. Mark.
+100, and thanks to Jared. -Ben
On Dec 7, 2019, at 10:08 PM, Mark Tinka <mark.tinka@seacom.mu> wrote:
On 8/Dec/19 02:19, Jared Mauch wrote:
I’ve been working hard to provide value to our AANP partners as well. I’ll try to stop responding to the list at this point but don’t hesitate to contact me here or via other means if you’re seeing something weird. I know I resolved a problem a few days ago for someone quickly as there was a misconfiguration left around.. We all make mistakes and can all do better.
Problems are part of the gig - otherwise we'd have no reason to get up in the morning.
What matters is that there is someone you can find to help you fix them. That's what makes all the difference.
So kudos to you, Jared, and the entire team out there at Akamai.
Mark.
Taking boxes out of a network does not sound like 'emergent behavior' or unintended consequences. Sounds like a policy change. Perhaps they are being redeployed for better performance, or perhaps shut down to lower costs. Or maybe the cost of transit for Akamai at the margin is less than the cost of peering with 50 billion peers. Disclaimer: Not picking a fight. Better things to do. Regards, Roderick. ________________________________ From: Jared Mauch <jared@puck.nether.net> Sent: Sunday, December 8, 2019 1:19 AM To: Rod Beck <rod.beck@unitedcablecompany.com> Cc: Shawn L <shawnl@up.net>; nanog@nanog.org <nanog@nanog.org> Subject: Re: Elephant in the room - Akamai On Dec 7, 2019, at 5:34 PM, Rod Beck <rod.beck@unitedcablecompany.com> wrote:
Has there been any fundamental change in their network architecture that might explain pulling these caches?
Please see my email on Friday where I outlined a few of the dynamics at play. Akamai isn’t just one thing, it’s an entire basket of products that all have their own resulting behaviors. This is why even though you may peer with us directly you may not see 100% of the traffic from that interconnection. (Take SSL for example, it’s often not served via the clusters in an ISP due to the security requirements we place on those racks, and this is something we treat very seriously!)

This is why I’m encouraging people to ping me off-list, because the dynamics at play for one provider don’t match across the board. I know we have thousands of distinct sites that each have their own attributes and composition at play.

I’ve been working hard to provide value to our AANP partners as well. I’ll try to stop responding to the list at this point but don’t hesitate to contact me here or via other means if you’re seeing something weird. I know I resolved a problem a few days ago for someone quickly as there was a misconfiguration left around. We all make mistakes and can all do better.

- jared

https://www.peeringdb.com/asn/20940
My guess (and it’s just that, since I haven’t been inside Akamai for a couple of years now) is that they are culling the less effective AANPs (from Akamai’s perspective) in favor of redeploying the hardware to more effective locations and/or to eliminate the cost of supporting/refreshing said hardware.

I would guess that the traffic level required to justify the expense of maintaining an AANP (from Akamai’s perspective) probably depends on a great many factors, not all of which would be obvious as viewed from the outside. I would guess that the density of AANPs and ISP interconnection in a given geography would be among the factors that influence that number. I would also guess that the number would tend to rise over time.

Again, just external speculation on my part.

Owen
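[Editor's note: Owen's break-even reasoning can be made concrete with a toy calculation. Every number and variable below is an invented assumption for illustration; none of it reflects actual Akamai or ISP economics, and real CDN capacity planning involves many more factors (space, power, remote hands, offload ratio, refresh cycles).]

```python
# Hypothetical break-even estimate for an embedded cache (AANP-style) deployment.
# All figures are invented for illustration only.

def cache_breaks_even(peak_gbps, offload_ratio, transit_price_per_mbps,
                      monthly_hardware_cost):
    """Return True if the transit spend the cache avoids exceeds its cost."""
    # Traffic the cache serves locally instead of it crossing paid transit.
    offloaded_mbps = peak_gbps * 1000 * offload_ratio
    monthly_savings = offloaded_mbps * transit_price_per_mbps
    return monthly_savings >= monthly_hardware_cost

# A small ISP peaking at 2 Gbps of CDN traffic, 70% served from the cache,
# paying $0.50/Mbps for transit, against $500/month of amortized hardware cost:
# savings = 2000 * 0.7 * 0.50 = $700/month, so the cache pays for itself.
print(cache_breaks_even(2.0, 0.7, 0.50, 500))   # True

# The same cache at a site peaking at only 500 Mbps does not:
print(cache_breaks_even(0.5, 0.7, 0.50, 500))   # False
```

The point of the sketch is Owen's: the threshold moves with transit pricing and traffic density, so sites that once justified hardware may stop doing so over time.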
Yep. Real estate must be one of their largest expenses, and unlike bandwidth it is not going down in price. 😃
On 12/7/19 7:19 PM, Jared Mauch wrote:
Please see my email on Friday where I outlined a few of the dynamics at play. Akamai isn’t just one thing, it’s an entire basket of products that all have their own resulting behaviors. This is why even though you may peer with us directly you may not see 100% of the traffic from that interconnection. (Take SSL for example, it’s often not served via the clusters in an ISP due to the security requirements we place on those racks, and this is something we treat very seriously!)
Does this mean that, if you peer with Akamai at some location, only content locally available at that location will come over that peering session with the rest coming via other means? Does Akamai not have private connectivity to their public peering points? -- Brandon Martin
Not all content is suitable in all locations, based on the physical security or market situation. We have some content that cannot be served in certain places; an example is locations where there are licensing requirements (e.g. ICP for China).

You will see a different mix from our 20940 vs. 16625 as well. Those have different requirements on the security side. If you treat your PKI data seriously, you will appreciate what is done here.

In Marquette, Michigan, there will be different opportunities compared with Amsterdam or Ashburn as well.

Our customer and traffic mix makes it challenging to serve from a platform where you do capital planning over a several-year depreciation cycle. We have thousands of unique sites, and that scale is quite different from serving via a few distinct IXPs and transit providers.

So yes, you will see a difference, and there are things we can do to improve it when there is a variance in the behavior.

- Jared
Let's take a minute and thank Jared for taking the time and responding. Thank you, Jared.
On 12/8/19 10:58 AM, Jared Mauch wrote:

So yes you will see a difference and there are things we can do to improve it when there is a variance in the behavior.

I guess what I'm getting at is that it sounds like, if you cannot source the content locally to the peering link, there's not likely to be an internal connection to the same site from somewhere else within the Akamai network to deliver that content and, instead, the target network should expect it to come in over the "public Internet" via some other connection. Is that accurate?

Thanks for the clarifications.

-- Brandon Martin
<personal hat fully on> I was hired at Akamai to design the network architecture for a global backbone. This is proving to be an interesting challenge: taking a diverse set of products with various requirements and interconnecting them in a way that saves costs and improves performance while my employer's traffic continues to grow. <personal hat off>

<work hat on> Akamai is built to use the paths available to deliver traffic and meet our customers' and our business goals. Not all our sites are interconnected, and it's extremely unlikely (read: possibly never, but who knows) you will see all your traffic come over a direct link or cache. With any sufficiently complex system, plus the acquisitions we have made over my short tenure, it's almost impractical to integrate them all quickly, or possibly at all. I personally want to make sure that we deliver the traffic in a way that makes sense, and a few people have seen those efforts, but there are also many things in progress that are not yet complete or ready for public consumption. I believe there's room here to improve, and each time we can turn a switch or dial a knob to better serve our customers and the end users that we are paid to serve, everyone wins. <work hat off>

Enterprise and consumer ISPs have very different traffic profiles, and I think the genesis of this thread was a direct result of a very consumer-oriented traffic profile that was unexpected. People have wondered why I would spend so much time watching things like Apple rumor websites in the past; it's because that would lead to high-traffic events. You go to where the data is. The same can be said for other large download events or OTT launches. Everyone knows a live event can be big, but it is generally bound by the target audience size.

As software is attacked within minutes or hours after security patches are released, I don't find it surprising these days that systems automatically download whatever they can the moment it's released, from gaming consoles to IoT devices to server and OS patches. If the traffic is causing you pain, I encourage you to reach out so we can look at what might be improved.

- Jared (I swear I'll stop responding... off to make lunch)
Last time I spoke with an Akamai engineer many years ago the network was purely transit. Is that evolving?
* rod.beck@unitedcablecompany.com (Rod Beck) [Sun 08 Dec 2019, 18:18 CET]:
Last time I spoke with an Akamai engineer many years ago the network was purely transit. Is that evolving?
https://conference.apnic.net/data/41/ix_100-akamai-apricot2016-23feb2016_145... Per those slides, PAIX in 2000, LINX, DE-CIX, AMS-IX in 2001, JPIX in 2002, so you must have spoken with an engineer shortly after the company was founded in 1998. -- Niels.
There's no need for speculation. Jared has already said in this thread that's exactly what he was hired for. https://www.youtube.com/watch?v=KXBKnAbW4hQ

-----
Mike Hammett
Intelligent Computing Solutions
Midwest Internet Exchange
The Brothers WISP

----- Original Message -----
From: "Mark Tinka" <mark.tinka@seacom.mu>
To: nanog@nanog.org
Sent: Monday, December 9, 2019 3:57:55 AM
Subject: Re: Elephant in the room - Akamai

On 8/Dec/19 19:17, Rod Beck wrote:

Last time I spoke with an Akamai engineer many years ago the network was purely transit. Is that evolving?

I believe Akamai are building, to a reasonable degree, an on-net backbone.

Mark.
Once upon a time, Brandon Martin <lists.nanog@monmotha.net> said:
I guess what I'm getting at is that it sounds like, if you cannot source the content locally to the peering link, there's not likely to be an internal connection to the same site from somewhere else within the Akamai network to deliver that content and, instead, the target network should expect it to come in over the "public Internet" via some other connection. Is that accurate?
I believe this is true of multiple content networks. For example, we peer with Amazon in a couple of locations, but a significant amount of traffic from their AS comes across transit rather than peering. In old terms, this is "hot potato" routing: the source gets the traffic out of its network as soon as possible, rather than spending internal resources to carry it as close to the destination as it can. -- Chris Adams <cma@cmadams.net>
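[Editor's note: the "hot potato" behavior Chris describes can be sketched as a toy model. Among the exits that can reach the destination, the source network hands traffic off at the exit that is cheapest for itself, ignoring how far the packet still has to travel on the far side. The topology, exit names, and IGP costs below are invented for illustration and do not describe any real network.]

```python
# Toy model of hot-potato exit selection: choose the egress point with the
# lowest internal (IGP) cost from the traffic source, ignoring the distance
# the receiving network must then carry the traffic.

def pick_exit_hot_potato(igp_cost_to_exit, exits_reaching_dest):
    """igp_cost_to_exit: {exit_name: internal cost from the source router}.
    exits_reaching_dest: exits that actually have a route to the destination."""
    candidates = {e: igp_cost_to_exit[e] for e in exits_reaching_dest}
    return min(candidates, key=candidates.get)

# Source router sits in Chicago; peering exists in Chicago and Ashburn,
# transit in Dallas. All three reach the destination prefix.
igp = {"peer-chicago": 5, "peer-ashburn": 40, "transit-dallas": 25}
print(pick_exit_hot_potato(igp, ["peer-chicago", "peer-ashburn", "transit-dallas"]))
# -> peer-chicago: traffic leaves at the nearest exit, even if the other
#    network then hauls it a long way.

# If the peering sessions don't carry the route (e.g. the content isn't
# served from those sites), the cheapest remaining exit wins: transit.
print(pick_exit_hot_potato(igp, ["transit-dallas"]))
# -> transit-dallas
```

This matches the observation in the thread: when a given cluster or peering exit can't source the content, the traffic simply shows up on whatever other path is cheapest for the sender, often transit.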
I think this thread might be a perfect example that when an organization reaches a sufficiently large size, one part of its engineering/operations team may no longer be fully aware of what other work groups are doing. Definitely a structural challenge for ISPs that span very large geographical areas and services/roles. On Sat, Dec 7, 2019 at 11:06 AM Jared Mauch <jared@puck.nether.net> wrote:
On Dec 7, 2019, at 12:06 PM, Seth Mattinen <sethm@rollernet.us> wrote:
On 12/6/19 06:46, Fawcett, Nick via NANOG wrote:
We had three onsite Akamai caches a few months ago. They called us up and said they are removing that service and sent us boxes to pack up the hardware and ship back. We’ve had quite the increase in DIA traffic as a result of it.
Same here, removed last month, and no more Akamai traffic over peering since.
This last part doesn’t sound right.
Can you send me details in private?
Thanks,
- Jared
On Dec 7, 2019, at 3:01 PM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
I think this thread might be a perfect example that when an organization reaches a sufficiently large size, one part of its engineering/operations team may no longer be fully aware of what other work groups are doing. Definitely a structural challenge for ISPs that span very large geographical areas and services/roles.
We are a decent-sized (public) company; you can look at the number of employees if you are curious. I am but one person who can try to influence things.

I'll say that if you have a few servers from us, they're not going to serve the entire content set that our customers have. I do remain open to looking at your individual cases and seeing what can be done to improve things, though. The answer may be nothing, but an e-mail also costs you little.

I've got several people I'm corresponding with and will continue to do so, at least until the paychecks dry up. Given the number of questions I still field about my prior employer, it may even last beyond that point :-)

- Jared
participants (27)
- Aaron Gould
- Ben Cannon
- Brandon Martin
- Bryan Holloway
- Chris Adams
- Clayton Zekelman
- craig washington
- Eric Kuhnke
- Fawcett, Nick
- Fred Baker
- Jared Mauch
- Kaiser, Erich
- Keenan Tims
- Mark Delany
- Mark Tinka
- Matthew Petach
- Mehmet Akcin
- Michael Thomas
- Mike Hammett
- Niels Bakker
- Owen DeLong
- Rod Beck
- Seth Mattinen
- Shawn L
- Stephen Satchell
- Tarko Tikan
- Valdis Klētnieks