On Tuesday 07 April 2009 22:10:24 Charles Wyble wrote:
Been troubleshooting a very strange problem for a couple of weeks now.
I have a few hundred systems deployed throughout the United States utilizing EVDO connectivity with Verizon as a carrier. They are stationary.
Over the past few weeks clusters of them in SF and Lewisville TX and a few other areas have been failing intermittently. They are offline for several days, then online for a few days then go offline again. They are running Linux and PPPD.
Do they maintain a continuous data link in normal operation (like, say, connectivity for a LAN, or backhaul for a camera or some such), or do they request the data link when they need to send [whatever] (like a discrete SCADA system)? My (user only) experience is that cellular data service doesn't handle long sessions well.
On 8/04/2009, at 10:27 PM, Alexander Harrowell wrote:
Do they maintain a continuous data link in normal operation (like, say, connectivity for a LAN, or backhaul for a camera or some such), or do they request the data link when they need to send [whatever] (like a discrete SCADA system)? My (user only) experience is that cellular data service doesn't handle long sessions well.
I've had great success with it. We have done live audio streaming over IP through a cellular service before. 64kbps ogg encoding. About 7 or so hours in one session. We used to do a cheap live broadcast from an outdoor event for a radio station. -- Nathan Ward
Any good clueful network Engineers from Equinix on-list? If so, please contact me off-line as I noticed some oddball network behavior at some of your peering points.
Regards,
Stefan Fouant: NeuStar, Inc.
Principal Network Engineer
46000 Center Oak Plaza Sterling, VA 20166
[ T ] +1 571 434 5656 [ M ] +1 202 210 2075 [ E ] stefan.fouant@neustar.biz [ W ] www.neustar.biz
* Stefan.Fouant@neustar.biz (Fouant, Stefan) [Wed 08 Apr 2009, 17:04 CEST]:
Any good clueful network Engineers from Equinix on-list? If so, please contact me off-line as I noticed some oddball network behavior at some of your peering points.
You do realise that the people who run an Internet exchange only manage the Ethernet switch and have no influence on participants' routing, right? If you're seeing odd things on your router directly connected to the IX switch you should have a better way of contacting your vendor than through the nanog mailing list. -- Niels.
Niels - this was an issue with the internet exchange netblock being leaked out to upstream providers and causing peering adjacencies to be established through indirect paths. It wasn't an issue with the router and it wasn't an issue with a peer. Thanks for your concern though... I think we got it handled now :)
Stefan Fouant
-----Original Message----- From: Niels Bakker [mailto:niels=nanog@bakker.net] Sent: Wednesday, April 08, 2009 12:17 PM To: nanog@nanog.org Subject: Re: Equinix contact
* Stefan.Fouant@neustar.biz (Fouant, Stefan) [Wed 08 Apr 2009, 17:04 CEST]:
Any good clueful network Engineers from Equinix on-list? If so, please contact me off-line as I noticed some oddball network behavior at some of your peering points.
You do realise that the people who run an Internet exchange only manage the Ethernet switch and have no influence on participants' routing, right?
If you're seeing odd things on your router directly connected to the IX switch you should have a better way of contacting your vendor than through the nanog mailing list.
-- Niels.
Alexander Harrowell wrote:
On Tuesday 07 April 2009 22:10:24 Charles Wyble wrote:
Been troubleshooting a very strange problem for a couple of weeks now.
I have a few hundred systems deployed throughout the United States utilizing EVDO connectivity with Verizon as a carrier. They are stationary.
Over the past few weeks clusters of them in SF and Lewisville TX and a few other areas have been failing intermittently. They are offline for several days, then online for a few days then go offline again. They are running Linux and PPPD.
Do they maintain a continuous data link in normal operation (like, say, connectivity for a LAN, or backhaul for a camera or some such), or do they request the data link when they need to send [whatever] (like a discrete SCADA system)? My (user only) experience is that cellular data service doesn't handle long sessions well.
I have a few Sprint EVDO cards. They go into standby when nothing is actively going on and fire up within seconds when there is something to do. I regularly use everything from SSH to streaming video without any issues. I only notice the delay with SSH when I don't type anything for a few minutes and it has to come active again, but I can leave it idle for hours and it never drops. As far as the OP goes, let them replace the cards if they think that's the problem. You and I may suspect something else is up, but if that's on their checklist, it is what it is. ~Seth
Seth Mattinen <sethm@rollernet.us> writes:
I have a few Sprint EVDO cards. They go into standby when nothing is actively going on and fire up within seconds when there is something to do. I regularly use everything from SSH to streaming video without any issues. I only notice the delay with SSH when I don't type anything for a few minutes and it has to come active again, but I can leave it idle for hours and it never drops.
Interesting. When I got my Sprint EVDO card (u727) a year and a half ago, they were pretty nasty about gunning down (bidirectional spoofed RST coming out of the middle of the network somewhere) any TCP sessions that were idle for ten minutes or more. Quite repeatable and verified on the downlow by People With Insight that this was in fact expected behavior from boxes that were in the middle of the network due to "politics" (unlike Verizon, Sprint appears to put no restrictions on inbound connections to the evdo-host). Putting this:
ServerAliveInterval 60
in ~/.ssh/config was an effective work-around. I have not revisited the issue to see if Sprint has corrected this behavior. Perhaps budget constraints or customer complaints have caused Sprint to revisit the necessity of having extraneous hardware in their network.
-r
On Apr 9, 2009, at 7:15 AM, Robert E. Seastrom wrote:
Seth Mattinen <sethm@rollernet.us> writes:
I have a few Sprint EVDO cards. They go into standby when nothing is actively going on and fire up within seconds when there is something to do. I regularly use everything from SSH to streaming video without any issues. I only notice the delay with SSH when I don't type anything for a few minutes and it has to come active again, but I can leave it idle for hours and it never drops.
Interesting. When I got my Sprint EVDO card (u727) a year and a half ago, they were pretty nasty about gunning down (bidirectional spoofed RST coming out of the middle of the network somewhere) any TCP sessions that were idle for ten minutes or more.
We observe this same kind of behavior with firewalls in the path watching for dead sessions they can clean up. It appears they send RSTs to both end points when they decide a session has gone away, as that'll let end hosts figure it out sooner. The same workaround of turning on keep-alives once a minute solves this too. The behavior in the case of firewalls makes sense, as state tables have to be cleaned up eventually.
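For anyone wanting the same once-a-minute keep-alive workaround in their own TCP client rather than in ssh, here is a minimal Python sketch. The function name `enable_keepalive` and its default values are made up for illustration; `SO_KEEPALIVE` is widely available, while the per-socket timer options are not present on every platform, so the sketch sets them only where Python exposes them.

```python
import socket

def enable_keepalive(sock, idle_s=60, interval_s=60, probes=5):
    """Turn on TCP keep-alives and tune the timers where the platform allows it."""
    # Ask the kernel to send keep-alive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket timer options are not available everywhere, so each
    # one is applied only if this Python build exposes the constant.
    for name, value in (("TCP_KEEPIDLE", idle_s),      # idle time before first probe
                        ("TCP_KEEPINTVL", interval_s),  # gap between probes
                        ("TCP_KEEPCNT", probes)):       # failed probes before giving up
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Where the timer options are missing, the connection falls back to the system-wide keepalive default, which may be far longer than a middlebox's idle timeout.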
Daniel Senie <dts@senie.com> writes:
We observe this same kind of behavior with firewalls in the path watching for dead sessions they can clean up. It appears they send RSTs to both end points when they decide a session has gone away, as that'll let end hosts figure it out sooner. The same workaround of turning on keep-alives once a minute solves this too. The behavior in the case of firewalls makes sense, as state tables have to be cleaned up eventually.
While I agree with you that the behavior makes perfect sense, I submit that the controls are often set improperly (by default or due to configuration by underskilled technicians) - that is to say, without taking into account the likely behavior of TCP when the connection is in fact still open. Consider the default keepalive interval on a selection of operating systems:
FreeBSD - 7200 seconds:
root@clack [17] # sysctl -a | grep keepidle
net.inet.tcp.keepidle: 7200000
root@clack [18] #
MacOSX - 7200 seconds:
[Superfly:~] root# sysctl -a | grep keepidle
net.inet.tcp.keepidle: 7200000
[Superfly:~] root#
Windows XP - 7200 seconds: http://support.microsoft.com/kb/314053
(notice a pattern here?)
Seems to me that a well-engineered firewall will have enough memory in it that (in the application for which it is specified, with anticipated traffic levels) it doesn't have to be over-aggressive and try cleaning up flows that haven't seen any traffic in less than, say, two hours and ten minutes.
-r
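The timing argument can be checked with a few lines of arithmetic. This is only an illustrative Python sketch: the helper name `keepalive_survives` is hypothetical, the 7200000 figure is the sysctl value quoted in this message (read as milliseconds, which matches the "7200 seconds" label), and the ten-minute timeout is the Sprint behavior reported earlier in the thread.

```python
def keepalive_survives(keepidle_ms, middlebox_idle_timeout_s):
    """True if the first keep-alive probe is sent before the middlebox's idle timeout."""
    return keepidle_ms / 1000 < middlebox_idle_timeout_s

DEFAULT_KEEPIDLE_MS = 7_200_000  # net.inet.tcp.keepidle as quoted above

if __name__ == "__main__":
    # Hours of idle time before the OS sends its first probe by default.
    print(DEFAULT_KEEPIDLE_MS / 1000 / 3600)
    # Default keepalive against a ten-minute idle timeout.
    print(keepalive_survives(DEFAULT_KEEPIDLE_MS, 10 * 60))
    # Default keepalive against a two-hours-ten-minutes idle timeout.
    print(keepalive_survives(DEFAULT_KEEPIDLE_MS, 2 * 3600 + 10 * 60))
    # Once-a-minute keepalive against a ten-minute idle timeout.
    print(keepalive_survives(60_000, 10 * 60))
```

With the default two-hour idle timer, a session outlives only middleboxes whose idle timeout is longer than two hours, which is why the suggested floor is two hours and ten minutes.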
On Thu, Apr 09, 2009 at 11:45:08AM -0400, Robert E. Seastrom wrote:
Daniel Senie <dts@senie.com> writes:
We observe this same kind of behavior with firewalls in the path watching for dead sessions they can clean up. It appears they send RSTs to both end points when they decide a session has gone away, as that'll let end hosts figure it out sooner. The same workaround of turning on keep-alives once a minute solves this too. The behavior in the case of firewalls makes sense, as state tables have to be cleaned up eventually.
Ish. RFC 3360 argues against extraneous RSTs in general, in addition to some specific cases (response to malformed or unknown TCP options, etc.).
While I agree with you that the behavior makes perfect sense, I submit that the controls are often set improperly (by default or due to configuration by underskilled technicians) - that is to say, without taking into account the likely behavior of TCP when the connection is in fact still open. Consider the default keepalive interval on a selection of operating systems:
FreeBSD - 7200 seconds:
root@clack [17] # sysctl -a | grep keepidle
net.inet.tcp.keepidle: 7200000
root@clack [18] #
MacOSX - 7200 seconds:
[Superfly:~] root# sysctl -a | grep keepidle
net.inet.tcp.keepidle: 7200000
[Superfly:~] root#
Windows XP - 7200 seconds: http://support.microsoft.com/kb/314053
(notice a pattern here?)
You mean adherence to the minimum per Host Requirements (RFC 1122)?
Seems to me that a well-engineered firewall will have enough memory in it that (in the application for which it is specified, with anticipated traffic levels) it doesn't have to be over-aggressive and try cleaning up flows that haven't seen any traffic in less than, say, two hours and ten minutes.
TCP vs application keepalives have been a religious topic for ages. It would seem that generous host idle windows in the modern Internet (increased speed, throughput, mobility, availability and hostility since RFC 1122 was written) are a bit odd.
Joe, who thinks purposefully long-lived TCP applications should certainly have their own keep-alives
-- RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
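As one illustration of an application carrying its own keep-alives, a long-lived client can track when it last sent anything and emit a small no-op message once the link has been quiet for an interval. The Python sketch below is an assumption-laden example, not anything proposed in the thread: the `Heartbeat` class, its method names and the 60-second default are all hypothetical.

```python
import time

class Heartbeat:
    """Tracks when an application-level keep-alive message is due."""

    def __init__(self, interval_s=60.0, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock              # injectable so the logic can be tested
        self.last_sent = clock()

    def note_traffic(self):
        """Call whenever the application sends anything on the connection."""
        self.last_sent = self.clock()

    def due(self):
        """True once the connection has been quiet for a full interval."""
        return self.clock() - self.last_sent >= self.interval_s
```

A send loop would check `due()` periodically, write its protocol's no-op message when it returns true, and then call `note_traffic()`. Keeping the clock injectable avoids sleeping in tests.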
On Thu, 09 Apr 2009 07:15:44 -0400 "Robert E. Seastrom" <rs@seastrom.com> wrote:
Seth Mattinen <sethm@rollernet.us> writes:
I have a few Sprint EVDO cards. They go into standby when nothing is actively going on and fire up within seconds when there is something to do. I regularly use everything from SSH to streaming video without any issues. I only notice the delay with SSH when I don't type anything for a few minutes and it has to come active again, but I can leave it idle for hours and it never drops.
Interesting. When I got my Sprint EVDO card (u727) a year and a half ago, they were pretty nasty about gunning down (bidirectional spoofed RST coming out of the middle of the network somewhere) any TCP sessions that were idle for ten minutes or more. Quite repeatable and verified on the downlow by People With Insight that this was in fact expected behavior from boxes that were in the middle of the network due to "politics" (unlike Verizon, Sprint appears to put no restrictions on inbound connections to the evdo-host). Putting this:
ServerAliveInterval 60
in ~/.ssh/config was an effective work-around. I have not revisited the issue to see if Sprint has corrected this behavior. Perhaps budget constraints or customer complaints have caused Sprint to revisit the necessity of having extraneous hardware in their network.
I use a Verizon Wireless u727; before that, I used a PCMCIA card. I've never had problems with drops on idle. *However* -- if there was a packet from the wrong IP address, the older card would drop the connection -- apparently, that behavior was required by the spec. (I haven't checked if the newer one will do that.) So, if the EVDO connection dropped while I had, say, an IMAP or ssh session open, and I dialed back in, the next TCP packet would cause EVDO to drop again... I finally "fixed" it by creating ipfilter rules in my ppp-up script to block all "bad" packets from going out. --Steve Bellovin, http://www.cs.columbia.edu/~smb
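The filtering idea can be sketched as a simple predicate. This Python example assumes, and it is only an assumption, that a "bad" packet is one whose source address is not the address assigned to the current PPP session; the function name `should_block` is hypothetical and the addresses come from documentation ranges. The actual fix described above used ipfilter rules, which are not reproduced here.

```python
import ipaddress

def should_block(src, current_link_addr):
    """True when an outbound packet carries a source address other than the link's current one."""
    return ipaddress.ip_address(src) != ipaddress.ip_address(current_link_addr)

if __name__ == "__main__":
    current = "198.51.100.7"                   # address assigned after redialing
    print(should_block("192.0.2.10", current))   # left over from the earlier session
    print(should_block("198.51.100.7", current)) # belongs to the current session
```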
"Steven M. Bellovin" <smb@cs.columbia.edu> writes:
On Thu, 09 Apr 2009 07:15:44 -0400 "Robert E. Seastrom" <rs@seastrom.com> wrote:
Seth Mattinen <sethm@rollernet.us> writes:
I have a few Sprint EVDO cards. They go into standby when nothing is actively going on and fire up within seconds when there is something to do. I regularly use everything from SSH to streaming video without any issues. I only notice the delay with SSH when I don't type anything for a few minutes and it has to come active again, but I can leave it idle for hours and it never drops.
Interesting. When I got my Sprint EVDO card (u727) a year and a half ago, they were pretty nasty about gunning down (bidirectional spoofed RST coming out of the middle of the network somewhere) any TCP sessions that were idle for ten minutes or more. Quite repeatable and verified on the downlow by People With Insight that this was in fact expected behavior from boxes that were in the middle of the network due to "politics" (unlike Verizon, Sprint appears to put no restrictions on inbound connections to the evdo-host). Putting this:
ServerAliveInterval 60
in ~/.ssh/config was an effective work-around. I have not revisited the issue to see if Sprint has corrected this behavior. Perhaps budget constraints or customer complaints have caused Sprint to revisit the necessity of having extraneous hardware in their network.
I use a Verizon Wireless u727; before that, I used a PCMCIA card. I've never had problems with drops on idle. *However* -- if there was a packet from the wrong IP address, the older card would drop the connection -- apparently, that behavior was required by the spec. (I haven't checked if the newer one will do that.) So, if the EVDO connection dropped while I had, say, an IMAP or ssh session open, and I dialed back in, the next TCP packet would cause EVDO to drop again... I finally "fixed" it by creating ipfilter rules in my ppp-up script to block all "bad" packets from going out.
Interesting. I never had that behavior exhibited on my old PCMCIA card on Verizon or on my u727 on Sprint. What OS platform were you on lappie-wise? I've thought on a couple of occasions that a "geek bake-off" between EVDO and 3G providers looking for technical jack moves on the providers' part would make for a nice NANOG lightning talk. Sadly, I haven't the time to devote to such an endeavor. -r
On Thu, 09 Apr 2009 11:12:57 -0400 "Robert E. Seastrom" <rs@seastrom.com> wrote:
I use a Verizon Wireless u727; before that, I used a PCMCIA card. I've never had problems with drops on idle. *However* -- if there was a packet from the wrong IP address, the older card would drop the connection -- apparently, that behavior was required by the spec. (I haven't checked if the newer one will do that.) So, if the EVDO connection dropped while I had, say, an IMAP or ssh session open, and I dialed back in, the next TCP packet would cause EVDO to drop again... I finally "fixed" it by creating ipfilter rules in my ppp-up script to block all "bad" packets from going out.
Interesting. I never had that behavior exhibited on my old PCMCIA card on Verizon or on my u727 on Sprint. What OS platform were you on lappie-wise?
I run NetBSD but I know that the problem also showed up on Linux -- a friend who worked for an equipment vendor also saw it, and he checked the actual EVDO specs. We suspect the problem doesn't show up for Windows users because Windows appears to terminate all connections with extreme prejudice when the link goes away, so there won't be any TCP transmissions to induce the failure.
I've thought on a couple of occasions that a "geek bake-off" between EVDO and 3G providers looking for technical jack moves on the providers' part would make for a nice NANOG lightning talk. Sadly, I haven't the time to devote to such an endeavor.
-r
--Steve Bellovin, http://www.cs.columbia.edu/~smb
"Steven M. Bellovin" <smb@cs.columbia.edu> writes:
Interesting. I never had that behavior exhibited on my old PCMCIA card on Verizon or on my u727 on Sprint. What OS platform were you on lappie-wise?
I run NetBSD but I know that the problem also showed up on Linux -- a friend who worked for an equipment vendor also saw it, and he checked the actual EVDO specs.
We suspect the problem doesn't show up for Windows users because Windows appears to terminate all connections with extreme prejudice when the link goes away, so there won't be any TCP transmissions to induce the failure.
Didn't have the problem manifest itself on MacOSX (10.4, 10.5) either... -r
Do they maintain a continuous data link in normal operation (like, say, connectivity for a LAN, or backhaul for a camera or some such), or do they request the data link when they need to send [whatever] (like a discrete SCADA system)? My (user only) experience is that cellular data service doesn't handle long sessions well.
Continuous operation. They have been working fine for some time. We have about 20 locations that aren't working, and over 200 that are working just fine.
participants (10)
- Alexander Harrowell
- Charles Wyble
- Daniel Senie
- Fouant, Stefan
- Joe Provo
- Nathan Ward
- Niels Bakker
- Robert E. Seastrom
- Seth Mattinen
- Steven M. Bellovin