Europe-to-US congestion and packet loss on he.net network, and their NOC@ won't even respond
Dear NANOG@,

I'm not exactly sure how else I can get he.net's attention: I've been experiencing congestion issues between my dedicated server and Indiana for a couple of months now, all due to he.net's poor transit, as it turns out. The issue was complicated by the fact that the routes are asymmetric, so the traffic loss appears to be happening somewhere it is not occurring at all.

I will just provide the data and let people draw their own conclusions; any insights are welcome.

Since late September 2013, all four networks involved have been contacted: hetzner, init7, he.net and indiana. All except he.net responded and did troubleshooting. After I pressed he.net about the lack of any kind of response, all they did was ask for a customer number, and that was back in September. I have not heard from their NOC@ since; my requests remain unanswered, save for the "we have received your request" autoreply.

Interestingly, only some of their Europe-to-US routes are blatantly congested, with very obvious packet loss (often making ssh unusable), whereas others appear to be doing just fine (at least, not losing packets and not showing jitter or increased latency). IPv6 routes don't appear affected, for example. IPv4 addresses in North America that are announced directly from AS6939 (e.g. Linode in Fremont) don't appear affected, either. But the multi-homed indiana.edu and wiscnet.net are affected, as is the single-homed ntp1.yycix.ca. Other customers are probably affected as well.

Where's the end to this? Or is the ongoing 0.5+% traffic loss, and the 140+ms average latency on a 114ms route, with random spikes and jitter at certain hours of the day (generally around midnight ET), every day for weeks or even months, an acceptable practice?
From hetzner.de through he.net:

Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Fri Nov 29 21:06:17 PST 2013
HOST: Cns???????                                    Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.clients.your-server.de  600  600  0.0%   0.5   1.0   1.3   4.9   1.1
  2.|-- hos-tr1.juniper1.rz13.hetzner.de            600  600  0.0%   0.1   0.2   1.9  66.0   7.6
  3.|-- core21.hetzner.de                           600  600  0.0%   0.2   0.2   0.2   5.8   0.4
  4.|-- core22.hetzner.de                           600  600  0.0%   0.2   0.2   0.2  19.4   1.2
  5.|-- core1.hetzner.de                            600  600  0.0%   4.8   4.8   4.8  13.2   0.7
  6.|-- juniper1.ffm.hetzner.de                     600  600  0.0%   4.8   4.8   4.8  27.4   1.4
  7.|-- 30gigabitethernet1-3.core1.ams1.he.net      600  600  0.0%  11.2  14.0  14.6  48.7   4.5
  8.|-- 10gigabitethernet1-4.core1.lon1.he.net      600  600  0.0%  18.2  19.6  19.9  53.9   4.1
  9.|-- 10gigabitethernet10-4.core1.nyc4.he.net     600  599  0.2%  87.0 116.1 116.7 145.7  12.4
 10.|-- 100gigabitethernet7-2.core1.chi1.he.net     600  597  0.5% 106.6 135.4 136.1 192.0  13.3
 11.|-- ???                                         600    0 100.0   0.0   0.0   0.0   0.0   0.0
 12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net  600  594  1.0% 113.3 139.3 139.7 166.1  11.4
 13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu        600  596  0.7% 113.2 139.8 140.3 177.3  12.0
 14.|-- ae-0.0.br2.bldc.net.uits.iu.edu             600  595  0.8% 114.2 140.1 140.6 183.2  11.8
 15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu            600  597  0.5% 114.3 140.3 140.8 165.0  11.5
 16.|-- ????c????????.indiana.edu                   600  597  0.5% 114.7 140.7 141.1 161.6  11.4
Fri Nov 29 21:08:52 PST 2013

Cns# unbuffer hping --icmp-ts ????c????????.indiana.edu | \
    perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
    if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
    if (/tsrtt=(\d+)/) { \
    print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       143.5   144 = 87 + 57
1       125.5   126 = 69 + 57
2       143.6   144 = 87 + 57
3       157.9   158 = 102 + 56
4       122.0   122 = 66 + 56
5       141.6   142 = 85 + 57
6       132.2   133 = 76 + 57
7       146.2   146 = 89 + 57
8       145.1   145 = 88 + 57
9       119.9   119 = 63 + 56
10      132.7   132 = 75 + 57
11      140.1   140 = 83 + 57
12      151.0   151 = 94 + 57
13      152.6   152 = 96 + 56
14      129.1   129 = 72 + 57
15      128.5   128 = 71 + 57
^C

Single-homed at he.net:

Cns# date ; mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ntp1.yycix.ca ; date
Fri Nov 29 21:16:14 PST 2013
HOST: Cns???????                   Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.client  600  600  0.0%   0.5   1.0   1.3  10.2   1.2
  2.|-- hos-tr4.juniper2.rz13.het  600  600  0.0%   0.1   0.2   2.0 153.9   9.8
  3.|-- core22.hetzner.de          600  600  0.0%   0.2   0.2   0.2  10.6   0.6
  4.|-- core1.hetzner.de           600  600  0.0%   4.8   4.8   4.8  16.4   0.9
  5.|-- juniper1.ffm.hetzner.de    600  600  0.0%   4.8   4.8   4.8  36.4   1.5
  6.|-- 30gigabitethernet1-3.core  600  600  0.0%  11.2  13.5  14.0  36.6   4.3
  7.|-- 10gigabitethernet1-4.core  600  600  0.0%  18.0  21.5  21.8  43.1   4.0
  8.|-- 10gigabitethernet10-4.cor  600  597  0.5%  93.2 128.0 128.3 157.5   8.9
  9.|-- 10gigabitethernet1-2.core  600  596  0.7% 103.1 139.4 139.6 157.5   8.2
 10.|-- 10gigabitethernet3-1.core  600  597  0.5% 128.2 164.9 165.1 181.9   8.2
 11.|-- 10gigabitethernet1-1.core  600  593  1.2% 138.7 175.9 176.1 192.6   7.8
 12.|-- sebo-systems-inc.gigabite  600  597  0.5% 139.0 176.4 176.5 187.5   6.9
 13.|-- ???                        600    0 100.0   0.0   0.0   0.0   0.0   0.0
 14.|-- ntp1.yycix.ca              600  597  0.5% 141.0 176.9 177.0 186.9   6.9
Fri Nov 29 21:18:32 PST 2013

Cns# traceroute -A ntp1.yycix.ca
traceroute to ntp1.yycix.ca (192.75.191.6), 64 hops max, 40 byte packets
 1  static.??.???.4.46.clients.your-server.de (46.4.???.??) [AS24940]  0.664 ms  0.648 ms  0.453 ms
 2  hos-tr1.juniper1.rz13.hetzner.de (213.239.224.1) [AS24940]  23.985 ms
    hos-tr2.juniper1.rz13.hetzner.de (213.239.224.33) [AS24940]  0.234 ms
    hos-tr3.juniper2.rz13.hetzner.de (213.239.224.65) [AS24940]  0.238 ms
 3  core22.hetzner.de (213.239.245.121) [AS24940]  0.238 ms
    core21.hetzner.de (213.239.245.81) [AS24940]  0.234 ms  0.236 ms
 4  core1.hetzner.de (213.239.245.177) [AS24940]  4.811 ms  4.809 ms
    core22.hetzner.de (213.239.245.162) [AS24940]  0.248 ms
 5  core1.hetzner.de (213.239.245.177) [AS24940]  4.831 ms
    juniper1.ffm.hetzner.de (213.239.245.5) [AS24940]  4.842 ms  4.826 ms
 6  juniper1.ffm.hetzner.de (213.239.245.5) [AS24940]  4.857 ms  4.864 ms
    30gigabitethernet1-3.core1.ams1.he.net (195.69.145.150) [AS1200]  11.233 ms
 7  10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939]  19.869 ms
    30gigabitethernet1-3.core1.ams1.he.net (195.69.145.150) [AS1200]  18.420 ms  11.255 ms
 8  10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939]  115.845 ms  101.875 ms
    10gigabitethernet1-4.core1.lon1.he.net (72.52.92.81) [AS6939, AS6939]  17.249 ms
 9  10gigabitethernet10-4.core1.nyc4.he.net (72.52.92.241) [AS6939, AS6939]  138.302 ms
    10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939]  120.449 ms  139.730 ms
10  10gigabitethernet1-2.core1.tor1.he.net (184.105.222.18) [AS6939]  134.755 ms  104.661 ms
    10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939]  167.282 ms
11  10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939]  139.310 ms
    10gigabitethernet3-1.core1.ywg1.he.net (184.105.223.221) [AS6939]  155.983 ms  155.910 ms
12  sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939]  138.703 ms  178.530 ms
    10gigabitethernet1-1.core1.yyc1.he.net (184.105.223.214) [AS6939]  172.423 ms
13  sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net (216.218.214.250) [AS6939]  158 ms * *
14  * * ntp1.yycix.ca (192.75.191.6) [AS53339]  181.433 ms
Cns#
Cns#
Cns#
Cns# unbuffer hping --icmp-ts ntp1.yycix.ca | perl -ne \
    'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
    if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
    if (/tsrtt=(\d+)/) { \
    print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
0       165.0   165 = 95 + 70
1       156.2   156 = 86 + 70
2       178.9   179 = 109 + 70
3       181.0   181 = 111 + 70
4       178.3   179 = 108 + 71
5       163.8   164 = 94 + 70
6       175.7   176 = 106 + 70
7       173.9   174 = 104 + 70
8       172.6   173 = 103 + 70
9       163.5   164 = 94 + 70
10      181.8   182 = 112 + 70
11      161.9   162 = 92 + 70
12      183.1   184 = 113 + 71
13      174.5   174 = 104 + 70
14      181.8   181 = 111 + 70
15      181.7   181 = 111 + 70
^C
Cns#

From indiana.edu to hetzner.de; notice that the mtr by itself gives a false impression of traffic loss at init7, whereas in reality it's the reverse path through he.net that's causing the loss, as hping confirms:

m: {5134} date ; sudo mtr --report{,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ?????? ; date
Sat Nov 30 00:36:27 EST 2013
HOST: ????c????????.indiana.edu    Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- 129.79.???.?               600  600  0.0%   0.4   0.7   0.9  24.7   1.5
  2.|-- ae-13.0.br2.bldc.net.uits  600  600  0.0%   0.5   0.7   0.9  22.6   1.8
  3.|-- ae-0.0.br2.ictc.net.uits.  600  600  0.0%   1.4   1.7   1.8  20.2   1.6
  4.|-- xe-0-1-0.11.rtr.ictc.indi  600  600  0.0%   1.4   2.1   3.8  66.5   8.1
  5.|-- 64.57.21.13                600  600  0.0%   6.0   7.2   8.4  72.9   8.0
  6.|-- xe-2-2-0.0.ny0.tr-cps.int  600  600  0.0%  32.3  33.9  34.4  81.0   6.9
  7.|-- paix-nyc.init7.net         600  600  0.0%  32.5  35.3  35.5  44.7   3.8
  8.|-- r1lon1.core.init7.net      600  599  0.2% 100.1 104.7 104.9 146.5   7.5
  9.|-- r1nue1.core.init7.net      600  599  0.2% 114.6 115.7 115.7 125.4   2.2
 10.|-- gw-hetzner.init7.net       600  594  1.0% 112.4 141.3 142.4 241.9  18.2
 11.|-- core12.hetzner.de          600  468 22.0% 112.2 142.7 144.0 203.4  20.3
 12.|-- core21.hetzner.de          600  202 66.3% 114.4 143.7 145.0 204.1  20.1
 13.|-- juniper1.rz13.hetzner.de   600  594  1.0% 114.7 141.4 142.1 212.2  14.3
 14.|-- hos-tr2.ex3k11.rz13.hetzn  600  599  0.2% 113.8 123.9 125.5 218.2  21.8
 15.|-- static.88-198-??-??.clien  599  592  1.2% 114.6 137.2 137.9 167.6  13.2
0.244u 1.766s 1:05.52 3.0% 0+0k 0+1io 0pf+0w
Sat Nov 30 00:37:32 EST 2013

m: {5137} sudo script -q /dev/null hping3 --icmp-ts 88.198.??.?? | perl -ne \
    'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
    if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
    if (/tsrtt=(\d+)/) { \
    print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\r\n"; }'
0       131.3   131 = 57 + 74
1       122.4   122 = 56 + 66
2       122.6   123 = 56 + 67
3       127.6   128 = 57 + 71
4       146.5   147 = 57 + 90
5       139.8   140 = 56 + 84
6       131.0   131 = 57 + 74
7       134.6   135 = 57 + 78
8       137.7   138 = 57 + 81
9       148.1   148 = 57 + 91
10      141.2   142 = 57 + 85
11      146.4   146 = 56 + 90
12      153.6   154 = 57 + 97
13      149.4   150 = 57 + 93
14      120.2   121 = 57 + 64
15      120.6   120 = 56 + 64
16      130.7   131 = 57 + 74
17      126.4   126 = 56 + 70
18      117.9   118 = 57 + 61
19      116.9   117 = 57 + 60
20      119.8   119 = 56 + 63
21      132.0   132 = 56 + 76
22      134.2   134 = 56 + 78
23      138.8   139 = 57 + 82

Note the ICMP timestamp data from hping above: it makes it obvious that the congestion is happening on only one path, the one over he.net, and that init7 is in the clear.
Any further insights are welcome. Finding out about the ICMP timestamp feature has so far been the most useful thing in troubleshooting this issue; I'm surprised it's such a little-known method for getting to the bottom of these problems.

However, even after identifying the cause and the responsible party, the problem remains unresolved. Any help appreciated.

Best regards,
Constantine.
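The forward/return split that the perl filter above computes from ICMP timestamp replies (RFC 792) can be expressed compactly. Here is a minimal sketch (mine, not from the original post; the function name and the standalone form are assumptions), with timestamps in milliseconds since midnight UTC:

```python
def split_rtt(originate, receive, transmit, tsrtt):
    """Split an ICMP-timestamp round trip into its two one-way legs.

    originate -- our Originate timestamp from the request (ms since midnight UTC)
    receive, transmit -- the Receive/Transmit timestamps echoed by the remote host
    tsrtt -- the locally measured round-trip time, in ms

    Remote clock skew is baked into each leg (it cancels only in the sum),
    so the split is useful for comparing the two directions of one path,
    not as an absolute one-way delay.
    """
    forward = receive - originate              # outbound leg (+ remote skew)
    ret = originate + tsrtt - transmit         # return leg (- remote skew)
    return forward, ret

# First sample of the hetzner -> indiana run above: 144 = 87 + 57.
fwd, ret = split_rtt(originate=0, receive=87, transmit=87, tsrtt=144)
```

With the return leg steady around 56-57 ms while the forward leg swings between 63 and 102 ms, the congestion is clearly on the Europe-to-US direction only.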
Constantine,

Please mail me offlist if Init7 can be of any help to resolve the case.

--
Fredy Kuenzler
Init7 (Switzerland) Ltd.
St.-Georgen-Strasse 70
CH-8400 Winterthur
Switzerland
http://www.init7.net/
On 30.11.2013 at 08:30, "Constantine A. Murenin" <mureninc@gmail.com> wrote:
Dear NANOG@,

...

Best regards, Constantine.
On Fri, Nov 29, 2013 at 11:30 PM, Constantine A. Murenin <mureninc@gmail.com> wrote:
Dear NANOG@,
...
From hetzner.de through he.net:
Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.

I would take any results you get from such probes with a grain of salt. What results do you get with a saner interval, of at least 1 second?

Matt
On 2013-W48-6 23:19 -0800, Matthew Petach wrote:
On Fri, Nov 29, 2013 at 11:30 PM, Constantine A. Murenin <mureninc@gmail.com> wrote:
Dear NANOG@,
...
From hetzner.de through he.net:
Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.1 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.
I would take any results you get from such probes with a grain of salt. What results do you get with a saner interval, of at least 1 second?
Matt
For what it is worth, I used to think the same, until I saw several providers themselves suggest that 1000 packets be sent at the 0.1 s interval. So this is considered normal and appropriate nowadays.

Anyhow, is this better? Tonight, at a random test time, I saw a 2% traffic loss, and a 151ms avg rtt on this 114ms rtt route.

Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.5 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Sat Nov 30 23:17:13 PST 2013
HOST: Cns???????                                    Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.clients.your-server.de  600  600  0.0%   0.5   1.0   1.3   4.6   1.1
  2.|-- hos-tr1.juniper1.rz13.hetzner.de            600  600  0.0%   0.1   0.2   2.0  58.5   7.9
  3.|-- core21.hetzner.de                           600  600  0.0%   0.2   0.2   0.2  10.2   0.7
  4.|-- core22.hetzner.de                           600  600  0.0%   0.2   0.2   0.2  11.2   0.8
  5.|-- core1.hetzner.de                            600  600  0.0%   4.8   4.8   4.8  25.1   1.3
  6.|-- juniper1.ffm.hetzner.de                     600  600  0.0%   4.8   4.8   4.8  13.9   0.6
  7.|-- 30gigabitethernet1-3.core1.ams1.he.net      600  595  0.8%  11.2  14.3  15.2 121.4   7.4
  8.|-- 10gigabitethernet1-4.core1.lon1.he.net      600  600  0.0%  18.2  21.0  21.3  51.2   4.0
  9.|-- 10gigabitethernet10-4.core1.nyc4.he.net     600  592  1.3%  86.9 125.9 126.4 160.7  10.6
 10.|-- 100gigabitethernet7-2.core1.chi1.he.net     600  591  1.5% 106.6 145.1 145.4 190.9  10.5
 11.|-- ???                                         600    0 100.0   0.0   0.0   0.0   0.0   0.0
 12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net  600  589  1.8% 114.3 148.9 149.2 167.9   9.1
 13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu        600  589  1.8% 113.4 149.2 149.5 173.4   9.3
 14.|-- ae-0.0.br2.bldc.net.uits.iu.edu             600  590  1.7% 114.5 150.2 150.5 175.6   9.3
 15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu            600  589  1.8% 114.3 150.5 150.8 181.0   9.1
 16.|-- ????c????????.indiana.edu                   600  589  1.8% 114.8 150.7 151.0 170.7   9.0
Sat Nov 30 23:24:06 PST 2013

The ICMP timestamp request/reply test still indicates that only one path is affected: the one from Europe to the US over he.net.
Cns# date ; unbuffer hping --icmp-ts --count 30 ????c????????.indiana.edu | \
    perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
    if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
    if (/tsrtt=(\d+)/) { \
    print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
Sun Dec 1 00:55:46 PST 2013
0       151.3   151 = 91 + 60
1       154.2   154 = 93 + 61
2       127.8   127 = 67 + 60
3       123.6   123 = 63 + 60
4       136.9   137 = 76 + 61
5       149.6   149 = 89 + 60
6       147.4   147 = 87 + 60
7       133.5   133 = 73 + 60
8       152.2   152 = 92 + 60
9       137.3   137 = 77 + 60
10      143.7   144 = 84 + 60
11      124.5   124 = 64 + 60
12      141.4   141 = 81 + 60
13      118.0   118 = 58 + 60
14      153.6   154 = 94 + 60
15      137.7   138 = 78 + 60
16      119.9   120 = 60 + 60
17      130.6   131 = 71 + 60
18      144.6   145 = 85 + 60
19      138.8   139 = 79 + 60
20      155.7   156 = 96 + 60
21      128.8   129 = 69 + 60
22      153.0   153 = 93 + 60
23      146.5   147 = 87 + 60
24      137.2   138 = 77 + 61
25      153.3   154 = 94 + 60
26      146.3   147 = 87 + 60
27      150.1   151 = 91 + 60
28      150.5   150 = 90 + 60
29      143.5   143 = 83 + 60

Cheers,
Constantine.
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.

...

For what it is worth, I used to think the same, until I saw several providers themselves suggest that 1000 packets should be sent, with the 0.1 s interval. So, this is considered normal and appropriate nowadays.
Disagree. You are of course free to use whatever rate you want between your own end points. If you want a response from routers on the public Internet, you should *expect* it to be rate-limited. I certainly don't think that a 0.1s interval is appropriate, and I configure control-plane policing on "my" routers accordingly.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no
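Such policers are commonly modelled as token buckets; a toy model (entirely illustrative, with an invented rate and bucket depth) shows why probes at 10/s can report loss that probes at 1/s never see:

```python
def police(arrival_times, rate=5.0, burst=5):
    """Toy token-bucket policer: `rate` tokens/s refill, bucket depth `burst`.

    Each packet needs one token to be answered; returns the drop count.
    """
    tokens, last, dropped = float(burst), 0.0, 0
    for t in arrival_times:
        # refill proportionally to the time elapsed, capped at the bucket depth
        tokens = min(float(burst), tokens + (t - last) * rate)
        last = t
        if tokens >= 1.0:
            tokens -= 1.0       # packet answered
        else:
            dropped += 1        # policer drops the excess
    return dropped

# 100 probes at 10/s exceed the 5 pps policer and lose roughly half;
# the same 100 probes at 1/s are all answered.
fast_drops = police([i * 0.1 for i in range(100)])
slow_drops = police([float(i) for i in range(100)])
```

Note that such drops happen in the router's control plane, so they appear as loss at that hop in mtr without implying any loss for traffic forwarded through it.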
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.

For what it is worth, I used to think the same, until I saw several providers themselves suggest that 1000 packets should be sent, with the 0.1 s interval. So, this is considered normal and appropriate nowadays.
matthew is correct

go back to your old way of thinking. while some providers may tolerate fast pings, few if any grown-ups do. and even those who think they do have routing engines which consider all pings as low-priority rubbish, to be dropped when there is any real work to do.

randy
On 1.12.2013 11:49, Randy Bush wrote:
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.

For what it is worth, I used to think the same, until I saw several providers themselves suggest that 1000 packets should be sent, with the 0.1 s interval. So, this is considered normal and appropriate nowadays.
matthew is correct
go back to your old way of thinking. while some providers may tolerate fast pings, few if any grown-ups do. and even those who think they do have routing engines which consider all pings as low-priority rubbish, to be dropped when there is any real work to do.
From a router control-plane perspective, rate-limiting should always be expected, and result evaluation should take that into account. From the router's perspective, a packet with TTL=1 is typically handled in software, by a CPU with limited power (compared to modern forwarding hardware), and it is not the router's primary job to answer every TTL=1 packet - that is the correct view.

But the reports provided ALSO show end-to-end packet loss, which can never be caused by control-plane policers on transit routers, because those packets never hit a router CPU. And there we get to basic network neutrality - everyone should treat all data equally, independently of the protocol used for data transport.

Daniel
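Daniel's distinction can be applied mechanically when reading an mtr report; a rough sketch (mine, not from the thread; the 0.5-point margin is an arbitrary choice): loss at an intermediate hop that does not persist to the destination is most likely a control-plane policer artifact, while the final hop's figure is the real end-to-end loss.

```python
def classify_hops(loss_pct):
    """Classify per-hop loss percentages from an mtr report (in path order).

    The last entry is the destination, i.e. the true end-to-end loss;
    intermediate loss well above it points at ICMP rate-limiting,
    not at packets actually being dropped in transit.
    """
    end_to_end = loss_pct[-1]
    verdicts = []
    for hop, loss in enumerate(loss_pct, start=1):
        if loss > end_to_end + 0.5:
            verdicts.append((hop, "likely rate-limited (control plane)"))
        elif loss > 0:
            verdicts.append((hop, "consistent with end-to-end loss"))
    return end_to_end, verdicts

# The indiana -> hetzner mtr earlier in the thread: 22% and 66% loss at the
# hetzner core hops vanish by the final hop (1.2%), so they are policer
# artifacts, while the he.net reverse path carries the real loss.
e2e, verdicts = classify_hops([0.0, 0.0, 0.2, 1.0, 22.0, 66.3, 1.0, 0.2, 1.2])
```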
For anyone following the story: the weeks-long congestion on he.net remains; however, hetzner has switched my original route to an alternative uplink. I'm no longer experiencing the he.net evening jitter that would bring my avg rtt from the non-congested 114ms to an average of 140ms and more during the busy times. Other sites that have an uplink from he.net are still affected, with noticeable jitter, delay and some packet loss (which seems to increase as the avg rtt increases).

Was:

Tue Dec 3 22:25:38 PST 2013
Wed Dec 4 07:25:38 CET 2013
HOST: Cns???????                                    Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.clients.your-server.de  900  900  0.0%   0.5   0.9   1.3   7.3   1.2
  2.|-- hos-tr1.juniper1.rz13.hetzner.de            900  900  0.0%   0.2   0.2   2.1 101.2   9.0
  3.|-- core21.hetzner.de                           900  900  0.0%   0.2   0.2   0.2  19.7   1.1
  4.|-- core22.hetzner.de                           900  897  0.3%   0.2   0.2   0.2  19.6   1.1
  5.|-- core1.hetzner.de                            900  900  0.0%   4.8   4.8   4.8  16.4   1.1
  6.|-- juniper1.ffm.hetzner.de                     900  893  0.8%   4.8   4.8   4.8  19.1   0.7
  7.|-- 30gigabitethernet1-3.core1.ams1.he.net      900  900  0.0%  11.2  14.0  14.8 123.4   6.5
  8.|-- 10ge1-4.core1.lon1.he.net                   900  900  0.0%  18.2  19.9  20.2  59.7   3.9
  9.|-- 10ge10-4.core1.nyc4.he.net                  900  896  0.4%  86.9 113.6 114.2 148.4  11.4
 10.|-- 100ge7-2.core1.chi1.he.net                  900  898  0.2% 106.6 133.0 133.6 184.6  12.4
 11.|-- ???                                         900    0 100.0   0.0   0.0   0.0   0.0   0.0
 12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net  900  899  0.1% 113.3 136.7 137.1 162.3  10.8
 13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu        900  895  0.6% 113.3 137.2 137.7 183.3  11.1
 14.|-- ae-0.0.br2.bldc.net.uits.iu.edu             900  900  0.0% 114.3 137.8 138.2 162.8  10.7
 15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu            900  899  0.1% 114.4 137.9 138.4 167.5  10.7
 16.|-- m???c????????.indiana.edu                   900  898  0.2% 114.5 138.1 138.5 159.0  10.4
Tue Dec 3 22:43:16 PST 2013
Wed Dec 4 07:43:16 CET 2013
0       154.8   155 = 94 + 61
1       127.9   128 = 67 + 61
2       125.0   125 = 64 + 61
3       131.5   132 = 71 + 61
4       131.2   132 = 71 + 61
5       117.4   118 = 57 + 61
6       132.9   133 = 72 + 61
7       143.4   144 = 83 + 61
8       138.7   139 = 78 + 61
9       131.5   131 = 70 + 61
10      142.2   142 = 81 + 61
11      152.2   152 = 91 + 61
12      125.2   125 = 64 + 61
13      125.6   125 = 64 + 61
14      140.8   141 = 80 + 61
15      150.9   151 = 89 + 62
Tue Dec 3 22:43:31 PST 2013
Wed Dec 4 07:43:31 CET 2013

Now:

Tue Dec 3 23:01:20 PST 2013
Wed Dec 4 08:01:20 CET 2013
HOST: Cns???????                                            Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.clients.your-server.de          900  900  0.0%   0.5   0.9   1.3   7.0   1.1
  2.|-- hos-tr1.juniper1.rz13.hetzner.de                    900  900  0.0%   0.1   0.2   4.3 245.4  18.9
  3.|-- core21.hetzner.de                                   900  900  0.0%   0.2   0.2   0.3  99.4   3.5
  4.|-- core12.hetzner.de                                   900  900  0.0%   2.7   2.8   2.8  23.1   1.9
  5.|-- juniper4.rz2.hetzner.de                             900  899  0.1%   2.8   2.8   2.8  29.0   1.1
  6.|-- te0-0-2-1.nr21.b040138-0.nue01.atlas.cogentco.com   900  900  0.0%   3.1   3.3   3.4   7.1   0.6
  7.|-- te2-4.ccr01.nue01.atlas.cogentco.com                900  900  0.0%   3.1   6.6  25.9 244.4  51.0
       `|-- 154.25.0.13
  8.|-- te0-2-0-3.ccr22.muc01.atlas.cogentco.com            900  900  0.0%   5.7   5.8   5.9  11.1   0.5
  9.|-- be2229.ccr22.fra03.atlas.cogentco.com               900  900  0.0%  11.1  11.4  11.4  14.4   0.5
       `|-- 154.54.38.189
 10.|-- te0-3-0-2.mpd22.ams03.atlas.cogentco.com            900  900  0.0%  17.7  17.9  17.9  49.1   1.1
       `|-- 154.54.39.193
 11.|-- be2276.ccr22.lon13.atlas.cogentco.com               900  900  0.0%  25.3  27.4  27.4  33.2   1.3
       `|-- 154.54.37.106
        |-- 154.54.58.69
 12.|-- te0-7-0-33.ccr22.bos01.atlas.cogentco.com           900  900  0.0%  93.1  93.9  93.9  98.8   1.2
       `|-- 154.54.44.189
        |-- 130.117.0.185
 13.|-- te7-8.ccr02.alb02.atlas.cogentco.com                900  900  0.0%  96.8 119.9 128.8 364.2  56.9
       `|-- 154.54.27.153
 14.|-- te7-8.ccr01.buf02.atlas.cogentco.com                900  900  0.0% 103.3 126.8 135.8 444.9  59.0
 15.|-- te0-5-0-2.ccr21.cle04.atlas.cogentco.com            900  900  0.0% 108.6 109.5 109.5 113.5   1.3
 16.|-- te3-2.ccr01.cmh02.atlas.cogentco.com                900  900  0.0% 111.1 134.8 143.5 467.0  59.4
 17.|-- te4-7.ccr01.cvg02.atlas.cogentco.com                900  900  0.0% 113.8 122.5 126.2 467.9  39.6
 18.|-- te3-3.ccr01.ind01.atlas.cogentco.com                900  899  0.1% 116.1 137.8 145.4 457.7  56.0
 19.|-- 38.104.214.6                                        900  900  0.0% 115.8 118.3 118.5 206.0   8.1
 20.|-- xe-0-1-0.11.rtr.ictc.indiana.gigapop.net            900  900  0.0% 115.9 116.7 116.7 128.4   1.5
 21.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu                900  900  0.0% 115.9 117.0 117.1 168.6   3.6
 22.|-- ae-0.0.br2.bldc.net.uits.iu.edu                     900  900  0.0% 116.8 117.7 117.7 130.8   1.8
 23.|-- ae-10.0.cr3.bldc.net.uits.iu.edu                    900  899  0.1% 116.8 117.8 117.8 135.1   1.9
 24.|-- m???c????????.indiana.edu                           900  899  0.1% 117.1 117.9 117.9 122.0   1.2
Tue Dec 3 23:19:56 PST 2013
Wed Dec 4 08:19:56 CET 2013
0       120.1   120 = 57 + 63
1       117.3   117 = 54 + 63
2       120.3   120 = 57 + 63
3       117.3   117 = 54 + 63
4       117.4   118 = 54 + 64
5       117.3   118 = 54 + 64
6       120.0   120 = 57 + 63
7       117.3   118 = 54 + 64
8       117.3   118 = 54 + 64
9       117.6   118 = 54 + 64
10      117.3   118 = 54 + 64
11      117.4   118 = 54 + 64
12      117.4   117 = 53 + 64
13      118.6   118 = 55 + 63
14      117.5   117 = 53 + 64
15      117.3   117 = 53 + 64
Tue Dec 3 23:20:11 PST 2013
Wed Dec 4 08:20:11 CET 2013

Yet:

Cns# sh -c 'while (true); do date; env TZ=Europe/Berlin date; mtr --report{,-wide,-cycles=900} --interval 1 --order "SRL BGAWV" -4 ntp1.yycix.ca; date; env TZ=Europe/Berlin date; unbuffer hping --icmp-ts --count 16 ntp1.yycix.ca | perl -ne "if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {(\$s,\$p)=(\$1,\$2);} if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {(\$o,\$r,\$t)=(\$1,\$2,\$3);} if (/tsrtt=(\d+)/)
{print \$s,qq/\t/,\$p,qq/\t/,\$1,qq/ = /,\$r-\$o,qq/ + /,\$o+\$1-\$t,qq/\n/;}"; done' Tue Dec 3 23:32:37 PST 2013 Wed Dec 4 08:32:37 CET 2013 HOST: Cns??????? Snt Rcv Loss% Best Gmean Avg Wrst StDev 1.|-- static.??.???.4.46.clients.your-server.de 900 900 0.0% 0.5 0.9 1.2 9.7 1.2 2.|-- hos-tr4.juniper2.rz13.hetzner.de 900 900 0.0% 0.2 0.2 2.0 138.3 8.9 3.|-- core22.hetzner.de 900 900 0.0% 0.2 0.2 0.2 8.5 0.4 4.|-- core1.hetzner.de 900 900 0.0% 4.8 4.8 4.8 16.0 0.7 5.|-- juniper1.ffm.hetzner.de 900 900 0.0% 4.8 4.8 4.8 13.8 0.7 6.|-- 30gigabitethernet1-3.core1.ams1.he.net 900 900 0.0% 11.2 14.2 14.9 106.2 6.2 7.|-- 10ge1-4.core1.lon1.he.net 900 900 0.0% 18.0 20.5 20.8 55.6 4.1 8.|-- 10ge10-4.core1.nyc4.he.net 900 895 0.6% 86.4 119.3 120.0 158.6 13.1 9.|-- 10ge1-2.core1.tor1.he.net 900 896 0.4% 96.4 128.4 129.1 159.6 12.9 10.|-- 10ge3-1.core1.ywg1.he.net 900 895 0.6% 121.8 152.2 152.7 181.5 12.6 11.|-- 10ge1-1.core1.yyc1.he.net 900 888 1.3% 135.7 168.0 168.5 197.4 13.0 12.|-- sebo-systems-inc.gigabitethernet2-23.core1.yyc1.he.net 900 890 1.1% 138.4 168.2 168.6 190.9 12.5 13.|-- ??? 900 0 100.0 0.0 0.0 0.0 0.0 0.0 14.|-- ntp1.yycix.ca 900 893 0.8% 138.7 168.8 169.2 191.9 12.2 Tue Dec 3 23:49:07 PST 2013 Wed Dec 4 08:49:07 CET 2013 0 180.0 180 = 109 + 71 1 155.9 156 = 84 + 72 2 143.6 144 = 72 + 72 3 179.8 180 = 108 + 72 4 161.0 161 = 89 + 72 5 155.9 156 = 84 + 72 6 168.0 168 = 97 + 71 7 164.5 165 = 93 + 72 8 177.9 178 = 107 + 71 9 159.6 160 = 88 + 72 10 179.1 180 = 108 + 72 11 178.6 178 = 106 + 72 12 149.8 149 = 78 + 71 13 166.2 166 = 94 + 72 14 146.8 147 = 75 + 72 15 153.7 154 = 82 + 72 Tue Dec 3 23:49:22 PST 2013 Wed Dec 4 08:49:22 CET 2013 Cheers, Constantine. On 2013-W48-7 01:11 -0800, Constantine A. Murenin wrote:
Anyhow, is this better?
Last night, at a random test time, I saw 2% traffic loss and a 151 ms average RTT on this 114 ms route.
Cns# date ; mtr --report{,-wide,-cycles=600} --interval 0.5 --order "SRL BGAWV" -4 ????c????????.indiana.edu ; date
Sat Nov 30 23:17:13 PST 2013
HOST: Cns???????                                    Snt  Rcv Loss%  Best Gmean   Avg  Wrst StDev
  1.|-- static.??.???.4.46.clients.your-server.de   600  600  0.0%   0.5   1.0   1.3   4.6   1.1
  2.|-- hos-tr1.juniper1.rz13.hetzner.de             600  600  0.0%   0.1   0.2   2.0  58.5   7.9
  3.|-- core21.hetzner.de                            600  600  0.0%   0.2   0.2   0.2  10.2   0.7
  4.|-- core22.hetzner.de                            600  600  0.0%   0.2   0.2   0.2  11.2   0.8
  5.|-- core1.hetzner.de                             600  600  0.0%   4.8   4.8   4.8  25.1   1.3
  6.|-- juniper1.ffm.hetzner.de                      600  600  0.0%   4.8   4.8   4.8  13.9   0.6
  7.|-- 30gigabitethernet1-3.core1.ams1.he.net       600  595  0.8%  11.2  14.3  15.2 121.4   7.4
  8.|-- 10gigabitethernet1-4.core1.lon1.he.net       600  600  0.0%  18.2  21.0  21.3  51.2   4.0
  9.|-- 10gigabitethernet10-4.core1.nyc4.he.net      600  592  1.3%  86.9 125.9 126.4 160.7  10.6
 10.|-- 100gigabitethernet7-2.core1.chi1.he.net      600  591  1.5% 106.6 145.1 145.4 190.9  10.5
 11.|-- ???                                          600    0 100.0   0.0   0.0   0.0   0.0   0.0
 12.|-- et-11-0-0.945.rtr.ictc.indiana.gigapop.net   600  589  1.8% 114.3 148.9 149.2 167.9   9.1
 13.|-- xe-0-3-0.11.br2.ictc.net.uits.iu.edu         600  589  1.8% 113.4 149.2 149.5 173.4   9.3
 14.|-- ae-0.0.br2.bldc.net.uits.iu.edu              600  590  1.7% 114.5 150.2 150.5 175.6   9.3
 15.|-- ae-10.0.cr3.bldc.net.uits.iu.edu             600  589  1.8% 114.3 150.5 150.8 181.0   9.1
 16.|-- ????c????????.indiana.edu                    600  589  1.8% 114.8 150.7 151.0 170.7   9.0
Sat Nov 30 23:24:06 PST 2013
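In case the column abbreviations in these reports are unclear: each row reduces one hop's per-probe RTT samples to sent/received counts, a loss percentage, best/worst, arithmetic and geometric means, and a standard deviation. A rough Python sketch of that aggregation (mine, not mtr's actual code; the function name and toy numbers are made up):

```python
import math

def mtr_summary(rtts, sent):
    """Summarize per-probe RTTs (ms) the way an mtr report row does.

    rtts: RTTs of the probes that were answered; sent: probes sent.
    Gmean is the geometric mean, which damps outlier spikes compared
    with the arithmetic Avg, so a large Avg-vs-Gmean gap plus a high
    StDev is the jitter signature visible in the tables above.
    """
    rcv = len(rtts)
    loss = 100.0 * (sent - rcv) / sent
    avg = sum(rtts) / rcv
    gmean = math.exp(sum(math.log(r) for r in rtts) / rcv)
    stdev = math.sqrt(sum((r - avg) ** 2 for r in rtts) / rcv)
    return {"Snt": sent, "Rcv": rcv, "Loss%": round(loss, 1),
            "Best": min(rtts), "Gmean": round(gmean, 1),
            "Avg": round(avg, 1), "Wrst": max(rtts),
            "StDev": round(stdev, 1)}

# Toy example: 3 of 4 probes answered -> 25% loss.
print(mtr_summary([110.0, 120.0, 160.0], sent=4))
```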
The ICMP timestamp request/reply test still indicates that only one path is affected: the one from Europe to the US over he.net.
Cns# date ; unbuffer hping --icmp-ts --count 30 ????c????????.indiana.edu | \
    perl -ne 'if (/icmp_seq=(\d+) rtt=(\d+\.\d)/) {($s, $p) = ($1, $2);} \
    if (/ate=(\d+) Receive=(\d+) Transmit=(\d+)/) {($o, $r, $t) = ($1, $2, $3);} \
    if (/tsrtt=(\d+)/) { \
    print $s, "\t", $p, "\t", $1, " = ", $r - $o, " + ", $o + $1 - $t, "\n"; }'
Sun Dec 1 00:55:46 PST 2013
 0  151.3  151 = 91 + 60
 1  154.2  154 = 93 + 61
 2  127.8  127 = 67 + 60
 3  123.6  123 = 63 + 60
 4  136.9  137 = 76 + 61
 5  149.6  149 = 89 + 60
 6  147.4  147 = 87 + 60
 7  133.5  133 = 73 + 60
 8  152.2  152 = 92 + 60
 9  137.3  137 = 77 + 60
10  143.7  144 = 84 + 60
11  124.5  124 = 64 + 60
12  141.4  141 = 81 + 60
13  118.0  118 = 58 + 60
14  153.6  154 = 94 + 60
15  137.7  138 = 78 + 60
16  119.9  120 = 60 + 60
17  130.6  131 = 71 + 60
18  144.6  145 = 85 + 60
19  138.8  139 = 79 + 60
20  155.7  156 = 96 + 60
21  128.8  129 = 69 + 60
22  153.0  153 = 93 + 60
23  146.5  147 = 87 + 60
24  137.2  138 = 77 + 61
25  153.3  154 = 94 + 60
26  146.3  147 = 87 + 60
27  150.1  151 = 91 + 60
28  150.5  150 = 90 + 60
29  143.5  143 = 83 + 60
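For reference, the arithmetic behind the hping/perl one-liner: an ICMP timestamp reply (RFC 792) carries Originate, Receive, and Transmit timestamps in milliseconds since midnight UTC, so each measured RTT can be split into a forward leg (Receive - Originate) and a return leg (Originate + tsrtt - Transmit). That is how the "144 = 87 + 57"-style lines are produced: the forward (Europe-to-US) leg swings wildly while the return leg stays flat. A minimal Python sketch of the same computation (the function name and sample values are illustrative; it assumes both clocks are NTP-synchronized and there is no midnight rollover):

```python
def split_rtt(originate, receive, transmit, tsrtt):
    """Split an ICMP-timestamp RTT into its one-way components.

    originate/receive/transmit are the 'ms since midnight UTC' fields
    from the ICMP timestamp reply; tsrtt is the measured round-trip
    time in ms. Only meaningful when both ends have sane clocks.
    """
    forward = receive - originate            # our send -> their receive
    back = originate + tsrtt - transmit      # their send -> our receive
    return forward, back

# Hypothetical sample resembling the output above: a 144 ms RTT that
# decomposes into 87 ms forward and 57 ms return, i.e. the congestion
# sits on the Europe-to-US direction.
fwd, back = split_rtt(originate=100000, receive=100087, transmit=100087, tsrtt=144)
print(fwd, back)  # 87 57
```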
Cheers, Constantine.
Matthew Petach <mpetach@netflight.com> writes:
Using a 1/10th of a second interval is rather anti-social. I know we rate-limit ICMP traffic down, and such a short interval would be detected as attack traffic, and treated as such.
This should be obvious to everyone here, but just in case: there's also a huge difference between hammering the control plane of every router along the path due to TTL expiration (mtr) and trying to smoke out intermittent performance problems between endpoints with a few hundred packets/second of various sizes of ICMP or UDP *between those endpoints*. Folks should expect the former to be rate-limited; a reasonable control-plane policing policy is not optional these days. -r
participants (7)
- Constantine A. Murenin
- Daniel Suchy
- Fredy Kuenzler
- Matthew Petach
- Randy Bush
- Rob Seastrom
- sthaug@nethelp.no