Re: External Events (was Re: www.etrade.com has no DNS A record !)

26 Jan 2000

...
On Tue, 25 January 2000, John Hawkinson wrote:
...
Is your goal to get the word out to network providers of people
who use E*TRADE? Do you really expect that many of them will
forward this announcement or make good use of it? Should
a message be sent to NANOG every time CNN, Netscape, or Yahoo
go down?
Am I missing something here? [Like a sense of humor?]
On Tue, Jan 25, 2000 at 08:40:46PM -0800, Sean Donelan wrote:
...
External events have an effect on network service and network operators.
Why do most NOCs have one or more monitors tuned to CNN and the Weather
channel all day and all night?  Ok, I know the real reason, but what is
the reason the sales people tell prospective clients?
The question is really one of editorial policy and how significant is
any individual event.  I don't think there is really one answer which
can cover everything.
This is true. That is part of why I asked the question the way I did:
...
...
Is your goal to get the word out to network providers of people
who use E*TRADE? Do you really expect that many of them will
forward this announcement or make good use of it? Should
a message be sent to NANOG every time CNN, Netscape, or Yahoo
go down?
While most people interpreted it rhetorically, it was actually
asked with a significant literal component. When asking the
list a question like this, though, it's hard to know how to
contend with the potential silent majority versus the exuberant
minority (I've heard from some people who agreed with the
position I espoused).

It appears that there is a significant population among
the NANOG readership who benefit from this sort of notification.
Personally, I believe that the notification is useful and valuable,
however my opinion is mostly that NANOG is not the right place for it.

This is an opinion I have held for a long time, and it was solidified
back when a mailing list called nsr@merit.edu existed. I believe it 
stood for "Network Status Reporting". It's awful hard to find archives
of it any more (hey, merit!), but google.com has one message cached
which demonstrates the flavor:

| To: nsr@merit.edu 
| Subject: 07/22/94 NSFNET Backbone Unreachable 10:00 - 11:00 UTC 
| From: ANS Network Operations Center <noc@noc.ans.net> 
| Date: Fri, 22 Jul 1994 11:19:20 GMT 
| 
| 07/22/94 NSFNET Backbone Unreachable 10:00 - 11:00 GMT.
| 
| At 10:00 UTC gated  exited on all core routers and ENSS's.
| All networks announced by NSFNET sites were unreachable or 
| experienced varying degrees of instability during this window
| while gated was restarted across the NSFNET backbone.  The
| cause of this outage is currently being pursued by our engineers.
| 
| Stephen Powell
| ANS Network Operations Center

In many cases, though, notifications were sent to nsr about
circuit outages and individual ENSS outages. I believe the
charter of the list said that it was appropriate for all sorts
of outage reporting, not simply NSFnet backbone reports; however,
I rarely remember that happening, even then.

Similarly, the Internet Monthly Report from Anne Cooper at ISI
would summarize notable events and regionals (and anybody else,
it seemed) would submit monthly reports of significant events.

You didn't see discussion of high-level issues on the NSR list,
and that was the right thing; issue-discussion was separate from
operational notification. I find that separation to be
incredibly useful. Perhaps it is because at this point I deal
less with day-to-day operational issues (company scaling), but
I think even in the heyday I would have felt the same.

Bill Simpson points out:

/ In the case of a small rural ISP with less than 4,000 customers, an
/ amazing number of folks called about our "problem", and the NANOG list
/ is just about the first place I look for a heads up or explanation.

And of course, NANOG doesn't carry information about most of these
outages, and while I think it should not, that doesn't mean I think
those outages should go unreported.

I would propose that we consider creating a mechanism for that sort
of outage reporting. It seems to me that there are two broad categories:

a) Official outage reporting from the organization experiencing the
   outage
b) Unofficial outage reporting from someone affected by the outage.

Both are valuable and occur in different ways, and unfortunately it is
the case that in today's business climate, the latter is likely to be
more accurate and detailed.

The obvious implementations that occur to me are i) A mailing list
like NSR; just bring it back, potentially moderate it to ensure that
the usage is consistent with the charter, and redirect postings from
NANOG to such a list. ii) A web-based format where people can note
outages, and comment on them usefully (perhaps ala slashdot?).
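
As a rough illustration of how postings to either mechanism might be
tagged by category (a) or (b), here is a minimal sketch. It is not
taken from NSR or any existing list: the field names, the subject
format (loosely modeled on the NSR sample above), and the in_charter
check are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Source(Enum):
    OFFICIAL = "official"      # category (a): the organization having the outage
    UNOFFICIAL = "unofficial"  # category (b): someone affected by the outage


@dataclass
class OutageReport:
    source: Source
    reporter: str     # address of whoever is posting the report
    affected: str     # service, site, or network that is down
    start: datetime   # timezone-aware start time of the outage
    summary: str

    def subject(self) -> str:
        # Normalize to UTC so reports from different posters line up.
        stamp = self.start.astimezone(timezone.utc).strftime("%m/%d/%y %H:%M UTC")
        return f"[{self.source.value}] {stamp} {self.affected}: {self.summary}"


def in_charter(report: OutageReport) -> bool:
    """Hypothetical moderation check: a report must name what is affected,
    say what happened, and carry a timezone-aware start time."""
    return (
        bool(report.affected.strip())
        and bool(report.summary.strip())
        and report.start.tzinfo is not None
    )
```

A moderator (or a web form) could reject anything for which
in_charter() returns False and use subject() to keep postings
uniform, whichever of the two implementations is chosen.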

I think both of those ideas could work, though both have been tried
and not worked very well for various reasons [what ever happened to
outage@dal.net?].

I would ask, however, that someone *not* take this message as the impetus
to go out and set up such a thing, but instead try to listen to
reasoned discussion and coordinate it with the community.

Back to Sean:
...
The Internet (RTM) worm affected only VAX and Sun computers, an estimated
10% of the Internet of the day.  If you didn't use Sun or VAXen, it would
have been an irrelevant event for you.
Not only that, it affected *hosts* (unless of course, you were using
Suns or VAXen as gateways, as I'm sure many people were). Surely hosts
are outside the scope of nanog? ;-)
Seriously, though, I think it is terribly unfair to compare something like
an Internet-wide worm to a simple DNS misconfiguration. The latter is one
person's problem and can be fixed with a quick phone call to the right person
(Assuming you can find that person, 20 phone calls later), whereas the former
is a huge management problem that cannot be easily dealt with.
...
When AOL forgot to put a GUARDIAN password on its domains, and they
were changed to a tiny ISP, if you didn't use AOL it may have been
irrelevant to you.
For the most part, yes, though I believe that this caused real operational
effects for large volumes of mail queued on mail servers of network providers
in North America, and so was operationally relevant. Failed DNS queries
to E*TRADE just don't have the same level of visibility. They may affect
customers equally, but they affect providers not-at-all.
...
When Cisco, Bay and GATED BGP implementations had a disagreement on
whether ASNs could be repeated in an as-path, it may have been
irrelevant to you if you used a different BGP implementation or
router.
You're being really off-the-wall here. It's quite clear that a statistically
significant fraction of North American network operators use those implementations,
so discussion is merited. Especially because there is *something* to discuss,
not merely "Oh, look, it's broken. We can now wait until they fix it."
...
Whether a particular NSI problem, an E*Trade problem, or an Ebay problem,
or a Cisco CCO problem is really significant enough to talk about
semi-publicly is tough.  It would be nice if each company was willing to
make timely disclosures about problems.
E*TRADE's annual report for 1999 makes some disclosures about infrastructure
failures, by the way.
...
But as we've seen time and time again, companies would prefer
never to acknowledge they had any problem until it becomes
impossible to ignore (e.g. Worldcom's 10 days of hell last summer).
Indeed. Just because they should be reported doesn't mean they should
be reported to NANOG.

I think outage notification and operational issue discussion are
different things and should go to different places.

That worked well for the NSFnet with nsr@merit.edu split from
regional-techs@merit.edu. The Internet has only grown since then,
so the scaling benefits would be much more sizable now.

Opinions?

--jhawk