Re: outages, quality monitoring, trouble tickets, etc
From: Scott Huddle <huddle@mci.net>
I consider this list a place for ISPs to discuss general policy and planning issues that affect all of us. It is a very inappropriate place to discuss problems with a specific provider.
As a general policy I wish all network providers would implement at a minimum a network notification list. Bonus points for a network status WWW page, NetNews, FAX, and PR Newswire distribution.
From: bsimpson@morningstar.COM (William Allen Simpson)
I firmly DISAGREE! None of us are particularly interested in hearing every jot and tittle about every network flap, but _BIG_ ones and their resolution are important to bring to this list! How else to get a handle on what the real problems are? How else to help each other avoid repeating the problem in the future?
This is always a tough call. When is news no longer news? When it happens all the time. The slow intermittent problems are the hardest problems to fix. I end up tracking problems from Europe to Australia for customers, so I'm interested in network outages all over the world. On the other hand, I don't (usually) care when Bill's PC is turned off at night. Maybe take a cue from the (failed) NetNews Distribution: or MBONE-sublist mechanism.
What to do about backbone telco providers which hate announcing their network problems? Maybe when the NSFNET backbone fractured into multiple backbones, the NSF should have taken a lesson from the breakup of AT&T and established a Network Reliability Council. Any telephone network outage affecting more than 50,000 (now lowered to 15,000 I believe) lines has to be reported to the FCC's NRC.
Who cares if you connect to three (plus one) NAPs? In the post-NSFNET era, Internet-wide reliability requires Internet-wide information. Reporting network reliability problems is more important than how many NAPs your network connects to.
Has anybody else noticed how hard it is to get trouble tickets these days? Once upon a time, I just called the NSF NOC, and got a report to them in real time, so the problem could be fixed quickly. Nowadays, NOCs seem to want you to send email with 24 or 48 hour turnaround, or go through 2 layers of service representatives. Pretty hard to send email to them when their link is down, or go through "regular" support in the middle of the night!
Welcome to the new and improved Internet. More clueless people call NOCs these days (is it plugged in?) so more caller screening is done. Likewise there are more clueless people working in NOCs, so there are more levels before reaching someone who even understands what the problem is.
NOC-to-NOC communication has been a long-standing Internet problem. But now there are more NOCs, more different ways to contact them, with no common conventions. Even though it's out of date, I still keep my Internet Manager's Phonebook published by BBN in 1990.
ANS, MCI, and Sprint issued press releases a few months ago about their joint agreement. I don't know how well their joint agreement is working, but I suspect there will be more such agreements between network operators in the future. As the sheer number of people involved grows, everyone is going to filter their calls, e-mail, etc. If you aren't on the list, you get dumped in the "take care of after hell freezes" pile.
In the meantime, keep a stack of business cards and a special rolodex, with the magic names and telephone numbers that get you directly to someone who can understand (and maybe even fix) the problem. Interestingly enough, the people usually don't change; but the employers do.
--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
Affiliation given for identification not representation
......... Sean Donelan is rumored to have said:
]
] >From: Scott Huddle <huddle@mci.net>
] > I consider this list a place for ISPs to discuss general policy and
] > planning issues that affect all of us. It is a very inappropriate
] > place to discuss problems with a specific provider.
Scott, if a certain provider *cough you_know_who* is causing our connectivity to go to hell through their lame non-aggregation policies (now fixed) then where else would the issue be discussed?
] As a general policy I wish all network providers would implement
] at a minimum a network notification list. Bonus points for a network
] status WWW page, NetNews, FAX, and PR Newswire distribution.
Hmm, I wonder if the Trib' would be interested in knowing when the DS3 from Pennsauken is down.....
] Who cares if you connect to three (plus one) NAPs. In the post-NSFNET
] era, Internet-wide reliability requires Internet-wide information.
] Reporting network reliability problems is more important than how
] many NAPs your network connects.
At a cursory glance I would agree with you. However, let us delve into this a bit deeper. Donning my idiot hat, may I point out that the _most_ important thing is network reliability -Period-.
While I agree w/ you that accountability is important, I can't agree that a simple outage list is very terribly useful. With all due respect to the Sprint folx, their lists are often vague and noninformative. As of late the MCI tickets have been more and more coming and less and less useful.
Back to accountability, (with kindest respect) Sean, you haven't a lamb's foot to stand on when Barrnet isn't looking into a problem. In my opinion this is an issue that needs to be brought to fruition. Either develop some world policy of Internet Connectivity or perhaps we should all realize that the only person we can hold accountable is Mr. Upstream.
There are a few ways I can think of this working, in (my opinion of the) order of their potential for success:
o Sprint, MCI, ANS jointly fund an Inet trouble tracking NOC
o The Federal Government's FCC encompasses USA's Inet Traffic as a medium
o NSPs voluntarily subscribe to a policy of notification of problems to a global mail list.
Your page at DRA is quite good, however the consensus among upper management (not just at our site) is "Why should other people know when we're broke?". And the sad thing is, I am tempted to agree with them.
If you call our NOC and you ask about a connectivity issue, you will get a straight answer. Perhaps not from the first person you get, maybe not the second, but my people will escalate it until you do. The fact that we don't advertise this is not a deterrent to the quality of the information, only the convenience.
] >Has anybody else noticed how hard it is to get trouble tickets these
] >days? Once upon a time, I just called the NSF NOC, and got a report to
] >them in real time, so the problem could be fixed quickly. Nowadays,
] >NOCs seem to want you to send email with 24 or 48 hour turnaround, or go
] >through 2 layers of service representatives. Pretty hard to send email
] >to them when their link is down, or go through "regular" support in the
] >middle of the night!
I don't know how all the other NSPs work, but if there is ever an issue wrt connectivity or systems we HAVE a trouble ticket and we WILL provide it on request. With kindest respect, I understand your desire to get it "on demand" but with a bit more work you can get it from our NOC.
] Welcome to the new and improved Internet. More clueless people call
] NOCs these days (is it plugged in?)
You can't imagine how humorous this is.... :) I truly feel sorry for the poor chaps at INSC....
] so more caller screening is done.
To a point, but if the person on our end of the phone doesn't know the answer, they aren't allowed to say as such, they escalate the issue until it's resolved. "I don't know" has to be followed by a promise here. Is this not common?
] NOC-to-NOC communication has been a long standing Internet problem.
Hmm, I'm not sure I would terribly agree. When MCI or Sprint has a problem, we have not had any latency issue getting to them. Likewise w/ wacky issues causing us to get with Sura, Barr, Cerf, Westnet, etc...
] no common conventions. Even though it's out of date, I still keep my
] Internet Manager's Phonebook published by BBN in 1990.
Sounds like a good market... :)
] In the meantime, keep a stack of business cards and a special rolodex,
] with the magic names and telephone numbers that get you directly to
] someone who can understand (and maybe even fix) the problem. Interesting
] enough, the people usually don't change; but the employers do.
^ You have openings? ;-)
Enter the "Backbone Cabal". I can call you when I need to know what's up w/ DRA. If apropos, you shoot me to the less clueful person. You call us, ditto. I've got the same folx at ANS, Sprint, MCI, etc.. That's why we're important, we know who can do what and occasionally how to find them.
Do you really want outage and downtime on public record, or do you want easier access to clueful folx?
-alan
Do you really want outage and downtime on public record, or do you want easier access to clueful folx?
A clueful expert system running on a data base collecting information from all over the place, and answering questions automatically, would be a good start. Like an automatic responder at something like this FCC NRC someone mentioned earlier. No need to send me random outage email, until I perceive a problem. I get enough email even without that. I care about fixing problems. The problem is that there is no working procedure if a problem is being perceived, that results in information to a user in near real-time. The underlying issue is no overall Internet management, or at least coordination, at this time. In a computer network we *do* have the technology to do such things, you know. It requires willingness to do it more than it needs technology at this point of time.
On Thu, 23 Nov 1995, Hans-Werner Braun wrote:
In a computer network we *do* have the technology to do such things, you know. It requires willingness to do it more than it needs technology at this point of time.
I think the response being generated (if nowhere else) on this list shows that there IS a willingness to implement such a system. The "best" design for a global trouble database is unclear at best; a lot of issues re: data collection, information distribution, etc. will have to be resolved. I, for one, am willing to devote whatever time and computing resources are required of me to support a project like this, and I think most other providers share my position. Being better informed is an advantage to everyone.
// Matt Zimmerman  Chief of System Management  NetRail, Inc.
// Work..........mdz@netrail.net | Play...gemini@alcor.netrail.net
// (703) 524-4800 [voice]  (703) 524-4802 [data]  (703) 534-5033 [fax]
With all due respect, Alan wrote: ...
While I agree w/ you that accountability is important I can't agree that a simple outage list is very terribly useful. With all due respect to the Sprint folx, their lists are often vague and noninformative. As of late the MCI tickets have been more and
Sorry, SprintLink has 255/255 in my book. For instance, long before Gordon Cook sent his flamebait nanog-way, SPRINT had already identified a power failure at sl-stk-9 and its one hour downtime.
On at least three occasions that I can recall, Sean has sent out innage (non-outage ;) notes indicating _in_detail_ what he was going to install on what routers, where it had been tried, and which functionality he was expecting to get out of this. Some of the details are the stuff that Cisco has yet to implement in a production release.
On at least four other occasions, all in the last month, Sean, Elliot Alby, Peter Lothberg, and one other SPRINT individual [or consultant] posted details as to an outage, ETR, details, up-to-the-minute stuff, and a resolution.
While they had their problems before, and will doubtless have them again, right now SprintLink's backbone has a good trouble-reporting mechanism.
Back to accountability, (with kindest respect) Sean, you haven't a lambs foot to stand when Barrnet isn't looking into a problem. In
I have a ticket open with barrnet since 07/95 as to simple loss in the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"] closes that 4-month-old ticket, talk.
Do you really want outage and downtime on public record, or do you want easier access to clueful folx?
I'll take both, thanks very much. Frankly I'll give up on having immediate access to clueful folx. We're all f'busy. I'll take access to folks-clueful-enough-to-fix-it without my having to educate them. So far, the folks at the SL INSC (Diana, Muhammed, Pat, etc.) have done fine by me.
Ehud
gavron@Hearts.ACES.COM
p.s. You might be inclined to think "My, how easy it is to impress him. SL must have really bought him lunch." Well let me tell you -- it is easy to impress me -- with professionalism, quality, competence, and attention to detail. SL has those.
......... Ehud Gavron is rumored to have said:
] > due respect to the Sprint folx, their lists are often vague and
] > noninformative.
]
] Sorry, SprintLink has 255/255 in my book.
]
] On at least three occasions that I can recall, Sean has sent
] out innage (non-outage ;) notes indicating _in_detail_ what he was
Erm, Sean != Sprint. (That's arguable, you know ;)
I'm not here to downgrade anyone, Jove knows my folks have been less than 100% at times. However, I gain little of substance from the notes. So we're different. Sure, they're nice, and a bit informative, but they don't help me fix connectivity problems. They don't really even help me explain things to my customers. (Don't take me off SL-outage! :)
] While they had their problems before, and will doubtless have them
] again, right now SprintLink's backbone has a good trouble-reporting
] mechanism.
Agreed, but I direct back to the central issue, that being HOW does this trouble reporting mechanism improve the quality of the overall Internet connectivity?
] > Back to accountability, (with kindest respect) Sean, you haven't a
] > lambs foot to stand when Barrnet isn't looking into a problem. In
]
] I have a ticket open with barrnet since 07/95 as to simple loss in
] the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"]
] closes that 4-month-old ticket, talk.
Why should they talk to you? Do you pay them a service fee? That's my base issue, there is a hierarchy, and you can't skip rope to the other guy. It just doesn't work, there's nothing in the system to encourage it.
[ access to information or do you....]
] > want easier access to clueful folx?
]
] I'll take both, thanks very much.
I'm not sure where my responsibility to you lies. You are another person on this wacky Internet we've created. Why should I allocate 60 hours of my staff time to design an integrated web reporting mechanism for you?
Sean Donelan has a terribly good point, he's my customer, and his words mean a lot, but I can't agree w/ him that he should/could demand the same thing from another ex-NSFnet regional, or from Sprint.
I certainly see no reason why I should do this work for you. MFS and the RA folx do it because their customers demand it. Along the way they provide the information to y'all. So we end up in a socialist system where maybe I'm demanded to provide this for my customers, and maybe along the way I'll provide it to the Inet community, but there's no motivation for me to do it globally.
Someone ought to convince me that knowing where someone else's problem exists makes it easier for me to fix problems of mine own. There seems to be this large obsession with linking information to action. If you get an update you think something's happening. Perhaps it's needed, but stuff will happen whether your hand is held or not.
Brash in my Thanksgiving Vegetarianism,
-alan
Alan wrote: [I wrote:]
] I have a ticket open with barrnet since 07/95 as to simple loss in
] the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"]
] closes that 4-month-old ticket, talk.
Why should they talk to you? Do you pay them a service fee?
Their customer has instructed the Barrnet NOC and staff to treat me as a consultant/employee of the customer, authorized to speak for them. (4-month open ticket. Problem duplicated at will. Large packet loss. Inexcusable.) ...
I certainly see no reason why I should do this work for you.
Fair enough... Do no work for other than your customers. This isn't Atlas Shrugged. If I need something from you I'll get one of your customers to sign off on it, or I'm just another leech.
Brash in my Thanksgiving Vegetarianism,
-alan
Ehud
--
Ehud Gavron (EG76) gavron@Hearts.ACES.COM
participants (5): Alan Hannan, Ehud Gavron, hwb@upeksa.sdsc.edu, Matt Zimmerman, Sean Donelan