Re: outages, quality monitoring, trouble tickets, etc
Sorry, another long one. Relevance to NANOG, well, NOCs are customers too. And "remote" NOCs often report problems that affect your paying customers too. Customer service should be of interest to operations folks, at least to the extent the problems are getting reported to the right people to fix. I doubt I can change anyone's mind that providing explanations to customers and non-customers when the network has problems is good for business. In the future I will simply recommend to customers to buy services from NSPs which do provide explanations when their networks fail. Since I haven't found a perfect network yet, I suspect it includes everyone on this list.
Why should they talk to you? Do you pay them a service fee? That's my base issue, there is a hierarchy, and you can't skip rope to the other guy. It just doesn't work, there's nothing in the system to encourage it.
The hierarchy is dead. None of the old NSFnet regionals have a monopoly on service in their regions any more. Outside of the US, there are still a few monopoly providers, but they are a rare breed. If you aren't providing the level of service I need, I'll go to someone who can. If XYZ's NOC gives me better service than ABC's NOC, I'll recommend XYZ to my customers.
Sean Donelan has a terribly good point, he's my customer, and his words mean a lot, but I can't agree w/ him that he should/could demand the same thing from another ex-NSFnet regional, or from Sprint. I certainly see no reason why I should do this work for you.
Because it is in their self-interest? You are correct I can't make anyone run their network how I would like it run, not even MIDNET (GI).

But I can point out long-term problems and code of silence is costing such providers money, and has already cost them customers. For example, I really wish my direct providers would stop munging BGP announcements, or explain why they are doing it. If I have made a mistake, I would like to fix it. Otherwise I will come to the conclusion those providers' NOCs are not up to the job and find a different provider that can do the job.

When someone (anyone) reports a problem affecting connectivity with your network, more than likely the reverse is also true for your paying customers.

DRA has a bunch of customers connected through just about every major NSP in North America and a couple of other continents. The only time "I" call another NSP is when the process has become totally FUBARed. When I call another NSP, it is usually that NSP's last chance to keep a paying customer on their network.

I might call BARRNET because the University of California-Davis has reported problems reaching DRA to DRA's help desk, and the problem hasn't been resolved. No, BARRNET doesn't *have* to talk to me. And I will report the same back to the customer. However, I suspect it is in BARRNET's self-interest to work with me in resolving the problem to ensure UC-Davis has end-to-end reliability.

I track network reliability by dollars (not packet loss, not latency). I measure network providers, good and bad, by how many of our customers have used their own dollars to buy private lines to St. Louis because they couldn't get the reliability they needed from the network provider. It is not a pretty picture. <http://dranet.dra.com/dranet.html> has a picture where our private line customers are located.
If you are an NSP, every one of those green boxes (some boxes represent many paying customers) is an arrow through the heart of your (former) customers' view of your network reliability. If you are an NSP in one of those areas, not dealing with these problems or providing coherent explanations has cost you cold-hard cash. Money is something I expect most upper managers to understand. DRA makes its profits elsewhere. DRAnet is simply a vertical market VPN used to sell access to other things. I'm happy to use the Internet and other NSPs to provide that VPN, when the quality exists. On the other hand, if I have to manage a not-so-virtual VPN with private lines to achieve the required level of quality, I do. Maybe Adam Smith's invisible hand will correct this eventually.
There seems to be this large obsession with linking information to action. If you get an update you think something's happening. Perhaps it's needed, but stuff will happen whether your hand is held or not.
As I said before: Ideally I want a reliable network. If you can't provide a perfectly reliable network I want an explanation when I can't get through. And I want the problem fixed. The better the explanation, the longer I'm willing to give you to fix the problem. If I get no explanation, I expect the problem to already be fixed.

The current situation is the customer gets neither the explanation nor action solving the problem. My proof is the DRAnet map. DRA's customers take a very, very long time to budget money. Those green boxes represent customers whose problems went unanswered, and unsolved for a long time before they gave up on their NSP and expended their own dollars for a private line to St. Louis.

Since the technicians seem to be having a very difficult time fixing the network, I thought upper management could meet my other goal. Give the customer an explanation.

I'm not pointing fingers at any particular NSP, because frankly I don't have enough fingers to point. Everyone had problems. Yes, even DRA's NOC has fallen down a few times. I'm not asking for perfection, but an explanation when things don't work, while you fix the problem.

The Internet is a global cooperative network. If people don't cooperate, the global nature of the network fails. Since your customers may in fact want to use the Internet to communicate globally, problems affect customers globally.

When I go to the US Post Office, sometimes there is a sign on the wall that postal service to Timbuktu may be delayed because Timbuktu's main post office was blown up. I have no idea how many postal customers in Olivette, Missouri send mail to Timbuktu. Even though the US Postal Service has no control over rebuilding Timbuktu's main post office, the US Postal Service has discovered it is good customer service to inform their customers why their mail to Timbuktu may be delayed. Can't NSPs provide their customers an explanation at least as well as the US Post Office?
-- Sean Donelan, Data Research Associates, Inc, St. Louis, MO Affiliation given for identification not representation
.........
Sean Donelan is rumored to have said:

] Customer service should be of interest to operations folks, at least
] to the extent the problems are getting reported to the right people to fix.

It certainly is here.

] I doubt I can change anyone's mind that providing explanations to
] customers and non-customers when the network has problems is good for
] business.

I agree with you that it is important.

] In the future I will simply recommend to customers to buy
] services from NSPs which do provide explanations when their networks
] fail. Since I haven't found a perfect network yet, I suspect it
] includes everyone on this list.

alan> rope to the other guy. It just doesn't work, there's nothing in
alan> the system to encourage it.

What I mean by saying this is NOT that I don't think a per-NSP trouble reporting mechanism is a good idea. What I'm saying is that within our Internet arrangement today, I don't see that it's terribly capitalistically useful for NSP-A to advertise internal problems to NSP-B. There is no doubt in my mind that it IS terribly useful for NSP-A to advertise internal problems to NSP-A's customers, as well as to NSP-B if they inquire on behalf of NSP-B's customers wrt an outage internal to NSP-A.

You're right the migration of customers is a good metric, but it's hard to quantify that migration wrt trouble reporting to management.

A friend at MFS brings up a good point, that being that the COREN agreement stipulated for a trouble reporting list. Perhaps we could work to develop a scalable model of such for world wide Internet use, or adapt that to this. Any other suggestions?

] If you aren't providing the level of service I need, I'll go to someone who
] can. If XYZ's NOC gives me better service than ABC's NOC, I'll
] recommend XYZ to my customers.

Adam Smith's rules _will_ follow us into the Internet. Agreed.

] > Sprint. I certainly see no reason why I should do this work for
] > you.
]
] Because it is in their self-interest? You are correct I can't make
] anyone run their network how I would like it run, not even MIDNET (GI).
]
] But I can point out long-term problems and code of silence is costing such
] providers money, and has already cost them customers.

It's not a code of silence. That's my point, that being that historically when we are asked about problems we give darn good answers. That we don't directly advertise problem attention or resolution is not correlative to our response to requests.

Should we provide darned good answers?
- YES

Should we provide automated Darned Good Answers to our customers?
- YES, it would be nice but not a NEED, rather a nifty service (IMHO)

Should we provide automated Darned Good Answers to other NSPs?
- YES, it would be nice but not a NEED, rather a nifty service and lower priority than #2.

] I might call BARRNET because the University of California-Davis has
] reported problems reaching DRA to DRA's help desk, and the problem hasn't
] been resolved. No, BARRNET doesn't *have* to talk to me. And I will
] report the same back to the customer. However, I suspect it is in
] BARRNET's self-interest to work with me in resolving the problem
] to ensure UC-Davis has end-to-end reliability.

I agree it is too. However, when I hear people complaining about bad NOCs, I think it is important to point out that there is no mechanism in place to hold those other NSPs accountable as the person complaining is rarely the customer of the NSP. Yes it's in our long term interest, but that doesn't mean there's something in place to encourage it other than honest intention.

] I track network reliability by dollars (not packet loss, not latency).
] I measure network providers, good and bad, by how many of our customers
] have used their own dollars to buy private lines to St. Louis because
] they couldn't get the reliability they needed from the network provider.

Ouch.

] As I said before: Ideally I want a reliable network. If you can't
] provide a perfectly reliable network I want an explanation when I
] can't get through. And I want the problem fixed. The better the
] explanation, the longer I'm willing to give you to fix the problem. If
] I get no explanation, I expect the problem to already be fixed.

This is a good point, and I have been more convinced that it is important. Because of this discussion I am going to work to develop an automated WWW status page.

] The current situation is the customer gets neither the explanation nor
] action solving the problem.

I appreciate that NSP response is not always ideal. However, I would encourage all people who get a less than exceptional response from a NOC technician to escalate the question so as to improve the NOC quality. No, this isn't something you should have to do, and it's not something that makes anyone terribly proud but it does tend to improve the service by natural tech selection.

] Since the technicians seem to be having a very difficult time fixing
] the network, I thought upper management could meet my other goal. Give
] the customer an explanation.

This is done when they ask, and due to your and others' concern, I am going to work to develop an automated web page showing down time problems.

] The Internet is a global cooperative network. If people don't cooperate,
] the global nature of the network fails.

Agreed.

] Can't NSPs provide their customers an explanation at least as well as
] the US Post Office?

Yes, it's possible, and due to this discussion, I am going to work to build one as nice as FedEx's.... Anyone want to volunteer joint development? :)

-alan
On Sat, 25 Nov 1995, Alan Hannan wrote:
Should we provide automated Darned Good Answers to our customers? - YES, it would be nice but not a NEED, rather a nifty service (IMHO)
Automated answers would be great...but what about implementation? "Press 1 for an automated status report...<click>" Keeping customer service staff well-informed (perhaps via an internal automated system) might be a better solution.
Should we provide automated Darned Good Answers to other NSPs? - YES, it would be nice but not a NEED, rather a nifty service and lower priority than #2.
I'm afraid I have to disagree...in a network of the level of complexity of today's Internet (in fact, in any system where communication between two points is dependent on more than just an "upstream" entity), connectivity issues are MORE likely to be caused by interaction with other NSPs. Dissemination of problem information between providers helps everyone diagnose difficulties and keep their customers better informed with respect to current status and predictions for the near future (solutions). A mailing list for this purpose seems like overkill...if dozens of NSPs were to be informed every time JoeNet has a problem, even if their service were not to be affected, the noise overload would reduce the informative value of the list, as well as provider attention to it. But how to determine when a problem is important enough to be distributed? A more interactive shared system (ticket-based?) makes more sense, but may prove far more difficult to design. Problem classification, impact, severity, and location are all issues here, as well as the problem of associating such a record of a problem with its effects. That is, when a provider "discovers" a problem, how are they to know if it has already been "registered", and if so, how to reference the information associated with it? [need for explanations]
This is a good point, and I have been more convinced that it is important.
Because of this discussion I am going to work to develop an automated WWW status page.
Good response, but how sound is the choice of implementation? If there is a problem with your network, there is no small chance that those most interested in acquiring this information would not be able to reach your server to do so.
] The current situation is the customer gets neither the explanation nor ] action solving the problem. I appreciate that NSP response is not always ideal. However, I would encourage all people who get a less than exceptional response from a NOC technician to escalate the question so as to improve the NOC quality. No, this isn't something you should have to do, and it's not something that makes anyone terribly proud but it does tend to improve the service by natural tech selection.
I hate to say it, but what may be needed here is standardization. NOC operating procedure varies greatly between providers, and the proper escalation, etc. of a problem may not be clear. // Matt Zimmerman Chief of System Management NetRail, Inc. // Work..........mdz@netrail.net | Play...gemini@alcor.netrail.net // (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
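The "has this problem already been registered?" question raised above can be sketched in a few lines. This is only an illustration under loud assumptions: the `Ticket` fields, the `SharedRegistry` class, and the idea of matching reports on a normalized (location, classification) pair are all hypothetical and not something anyone in this thread proposed in detail. A real shared system would need a far better matching rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Ticket:
    """One shared problem record (all field names are hypothetical)."""
    ticket_id: str
    location: str
    classification: str
    severity: int                      # assumed scale: 1 is the most severe
    impact: str
    reporters: List[str] = field(default_factory=list)


class SharedRegistry:
    """Toy in-memory registry that merges duplicate problem reports."""

    def __init__(self) -> None:
        self._tickets: Dict[Tuple[str, str], Ticket] = {}

    @staticmethod
    def _key(location: str, classification: str) -> Tuple[str, str]:
        # Assumption: two reports describe the same problem when they
        # name the same location and classification, ignoring case/spacing.
        return (location.strip().lower(), classification.strip().lower())

    def find(self, location: str, classification: str) -> Optional[Ticket]:
        """Return the already-registered ticket, if there is one."""
        return self._tickets.get(self._key(location, classification))

    def report(self, provider: str, location: str, classification: str,
               severity: int, impact: str) -> Ticket:
        """Register a new problem or attach the provider to an existing one."""
        ticket = self.find(location, classification)
        if ticket is None:
            ticket = Ticket(
                ticket_id=f"T{len(self._tickets) + 1:04d}",
                location=location,
                classification=classification,
                severity=severity,
                impact=impact,
            )
            self._tickets[self._key(location, classification)] = ticket
        else:
            # Keep the most severe rating seen so far.
            ticket.severity = min(ticket.severity, severity)
        if provider not in ticket.reporters:
            ticket.reporters.append(provider)
        return ticket


if __name__ == "__main__":
    registry = SharedRegistry()
    registry.report("JoeNet", "exchange-1", "routing", 3, "some routes flapping")
    merged = registry.report("OtherNet", "Exchange-1", "Routing", 2, "loss to JoeNet")
    print(merged.ticket_id, merged.severity, merged.reporters)
```

The second report lands on the first ticket instead of opening a new one, which gives every provider a single reference to quote to its customers.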
On Sat, 25 Nov 1995, Matt Zimmerman wrote:
connectivity issues are MORE likely to be caused by interaction with other NSPs. Dissemination of problem information between providers helps everyone diagnose difficulties and keep their customers better informed with respect to current status and predictions for the near future (solutions).
Agreed, but it has to be done in an "easy" manner. I'm sure that several of the NSPs have concerns as to what this information will be used for. Everyone likes to portray the image of having a 99.98% uptime whenever possible, even though most folks realize that it just plain isn't possible, at least today. This sort of leads into the question of the various NOCs' integration with whatever central repository of information we are shooting to provide. When provider X opens a ticket, will it automatically be reflected in the 'central' database? I doubt folks will go for that based on security alone. Or how about provider X's NOC staff fire off an Email to incident-report@outages.com? How will they be trained or reimbursed for their time spent on this service? [..facts about how useless mailing lists are removed..]
A more interactive shared system (ticket-based?) makes more sense, but may prove far more difficult to design. Problem classification, impact, severity, and location are all issues here, as well as the problem of associating such a record of a problem with its effects. That is, when a provider "discovers" a problem, how are they to know if it has already been "registered", and if so, how to reference the information associated with it?
Such an idea is already being discussed in several smoke filled rooms. :) Remedy/ARS has the ability to accept input for incident reports and queries to its database via an Email form. One could write a Web page containing the necessary parameters in a form, and then transpose that to an Email sent to the AR system. Implementing such a system is really based around cost issues, as the coding is relatively trivial. (CGIs come to mind) (I used the above example because it's something we've done in the past and I know works, there are probably others) On the issue of connectivity -- agreed; some lonely site should not be allowed to be the only host. However -- if connectivity between certain NSPs also falls apart, you're equally screwed. Some sort of distribution of the "centralized" source of information would be needed. I foresee the most difficult part of the process being convincing all of the associated Operations groups to share their outage information. Providing a simple mechanism for either the customer service, or operations staff to disseminate outage information to the "server," would be equally challenging. If step (a) were to be overcome, I would assume that writing a procedure to fit (b). -jh-
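The "Web form transposed to an Email" idea above might look roughly like the sketch below. It only builds the message with Python's standard `email.message.EmailMessage`; it does not send anything. The field names, the `name: value` body layout, and the sender address are assumptions made for illustration, and nothing here reflects the actual template format that Remedy/ARS expects. The recipient is the hypothetical address mentioned earlier in the thread.

```python
from email.message import EmailMessage
from typing import Dict


def form_to_email(fields: Dict[str, str], sender: str, recipient: str) -> EmailMessage:
    """Turn submitted form fields into a plain-text incident report email.

    The 'name: value' body layout is an assumed format, not the real
    template of any particular trouble-ticket system.
    """
    # Collapse whitespace so a multi-line summary cannot break the header.
    summary = " ".join(str(fields.get("summary", "(no summary)")).split())

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Incident report: {summary}"

    # One field per line, sorted so the output is predictable.
    body = "\n".join(f"{name}: {value}" for name, value in sorted(fields.items()))
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    report = form_to_email(
        {"summary": "circuit down", "severity": "2", "location": "exchange-1"},
        sender="noc@example.net",                    # hypothetical sender
        recipient="incident-report@outages.com",     # address floated in the thread
    )
    print(report)
```

A CGI handler would call something like `form_to_email` with the parsed form values and hand the result to the local mail system; that delivery step is left out here on purpose.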
"Jonathan" == Jonathan Heiliger <loco@mfst.com> writes:
Jonathan> Everyone likes to portray
Jonathan> the image of having a 99.98% uptime whenever
Jonathan> possible, even though most folks realize
Jonathan> that it just plain isn't possible

Well, more importantly, what on earth does a number like that mean?

Sean.
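One narrow, purely arithmetic reading of the 99.98% figure is the total downtime it leaves room for. The sketch below assumes uptime is measured as a fraction of wall-clock time over a fixed period, which is only one possible definition. It says nothing about what is being measured (a link, a router, end-to-end reachability) or for whom, which is the harder part of the question. Under that assumption, 99.98% works out to about 105 minutes per 365-day year, or roughly 8.6 minutes in a 30-day month.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted by an availability figure.

    Assumes availability is a plain fraction of wall-clock time over
    the period; other definitions would give other answers.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.98, days)
        print(f"99.98% over a {label}: about {minutes:.1f} minutes of downtime")
```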
participants (5)
- Alan Hannan
- Jonathan Heiliger
- Matt Zimmerman
- Sean Donelan
- Sean Doran