Re: DACS blamed for MCI / CSX train problems last week
[apologies in advance for the pitiful layout that Outlook Express will no doubt inflict on you after I click send]

-----Original Message-----
From: Sean Donelan <sean@donelan.com>
To: nanog@merit.edu <nanog@merit.edu>
Date: Saturday, May 06, 2000 3:04 PM
Subject: DACS blamed for MCI / CSX train problems last week
This isn't the first DACS problem, nor is MCI the only carrier affected by DACS problems. Bell Atlantic had a multi-day DACS problem, and I've experienced 18+ hour MCI DACS problems in the past. What is it about DACS systems that seems to lead to such catastrophic problems?
I don't believe there is anything inherently unreliable in a DACS (we call them Digital Cross-Connects, or DCCs, down here). In fact, most DCCs I have seen are extraordinarily redundant and would probably quite happily continue to function at 100% after a good seeing-to with an axe. The problem lies elsewhere.

"Modern" telecommunications equipment seems to come equipped with two configuration paradigms:

[1] will usually be a vendor-supplied Unix solution, probably running on Sun or HP hardware, which provides a full database back-end and a simple point-and-click interface for end-to-end service management. This is what the engineering planning people see when the system is demonstrated to them, and this obvious money-saving wonder is what will justify the inflated budget required to purchase it.

[2] will probably be 9600 bps terminal access to a protracted series of cryptic menus, about sixteen levels deep, through which the only practical way to navigate is to type ahead protracted strings such as "1,4,9,13-17,3,3,1" and then walk away for ten minutes to drink coffee.

[1] will usually stop working properly about four weeks after the vendors disappear from site. Various technicians who have either a vague idea of the network technology, or a self-taught-at-home expertise in Linux, will attempt to fix the problem. The IT department will be called in, and will limp through a SunOS 4.1.3 install from QIC tape, and may even get as far as completing a successful reinstallation of the vendor's management system (at which point it will become clear that the back-end is an embedded version of Informix, for which the licence keys are either missing or have mysteriously transformed themselves into one-week demo licences for some completely unrelated product).

Hence [1] will quickly cease to be a useful tool for provisioning services. Of course, the business doesn't stop just because one management system is broken; we always have [2].
[2] is extremely vulnerable to radical accidental reprogramming of the network due to caffeine-induced finger-shake, but quickly becomes a preferred tool for programming the network for a certain subset of the network provisioning technicians. Those who discover that they can script routine tasks using shareware terminal packages will also become able to perform radical accidental reprogramming of the network whilst standing on the keyboard, reaching back to tape laser-printed Dilbert cartoons or semi-pornographic calendars to cubicle walls.

Any change using [2] will naturally never be reflected in the database of [1]. This fact will be recognised whenever someone manages to get [1] temporarily back on its feet, and will trigger a seemingly endless series of network audit projects, which will never succeed due to the underground popularity of [2] and the fact that anybody competent enough to perform a network audit does their very best to escape from the project at the earliest opportunity, leaving no indication whatsoever of what they have found out so far.

The unreliability of any network records means that any troubleshooting is likely to involve far more random guesswork than is healthy, and the probability of a cascade of outages following routine maintenance on a minor issue is much higher than you would expect.

On the other hand, it could be simply that marketing departments have observed that customers are happy with the phrase "DACS outage" and therefore use it as a generic term to describe any incident which might otherwise cause a customer to complain :)
participants (1)
-
Joe Abley