interesting discussion. at least we're talking about networking now. :-) wrt sean's comment, the only thing i can think he means by 'partition' is that the networks may have power and may be in some routing table, just not the routing table of any of renesys's (or routeviews' or ripe's) peers. in that case, i guess i would agree. our use of 'outage' is a special case of 'partition' where the whole internet is on one side and the networks in question may be on the other. they may route somewhere, just not to the internet. quick question below...
There are some inconsistent terms used in computer dependability research, but I prefer and use two key definitions: failure (something is offline) and outage (customer sees the service offline).
not sure i understand these definitions. i'm happy to use any well-defined terms (vocabulary never being worth fighting over). again, when i use 'outage' i mean: previously in global internet tables of a consensus of a large peerset and now removed from those tables. which is that in your terms?
Looking at the routing tables you see failures.
not necessarily, if i'm understanding your definitions (which i guess i'm not).
If a prefix goes away completely and utterly, and is truly unreachable, then anyone trying to see it is going to see an outage. But you can have a lot of intermediate cases where routes are mostly down but not completely, or where parts of the net can see it but other parts can't due to the vagaries of route propagation and partial failures.
yes. we cover all of these by having a large peerset and integrating our data across them. the outages that we report are not from a particular point on the net. they are from a consensus of a large, selected peerset.
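a rough sketch of what "a consensus of a large peerset" could look like in code. this is an illustration only, not renesys's actual method: the function names, the 0.9 and 0.1 thresholds, the four-peer data set and the "partial" and "not-global" labels are all assumptions made up for the example. it also matches prefixes by exact string, so it ignores covering (less specific) routes, which a real analysis would have to handle.

```python
from typing import Dict, Set

# Hypothetical data model: peer name -> set of prefixes in that peer's table.
Tables = Dict[str, Set[str]]


def visibility(prefix: str, tables: Tables) -> float:
    """Fraction of peers whose table contains the prefix (exact match only)."""
    if not tables:
        return 0.0
    seen = sum(1 for routes in tables.values() if prefix in routes)
    return seen / len(tables)


def classify(prefix: str, before: Tables, after: Tables,
             seen_threshold: float = 0.9,   # assumed: "in the global table"
             gone_threshold: float = 0.1    # assumed: "removed from the table"
             ) -> str:
    """Label a prefix by comparing peerset visibility before and after."""
    was = visibility(prefix, before)
    now = visibility(prefix, after)
    if was < seen_threshold:
        return "not-global"   # never globally routed, so not counted
    if now <= gone_threshold:
        return "outage"       # consensus: gone from (nearly) all peers
    if now < seen_threshold:
        return "partial"      # some peers still carry a route
    return "reachable"


# Toy snapshots from four hypothetical peers, using documentation prefixes.
before = {
    "peer-a": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
    "peer-b": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
    "peer-c": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
    "peer-d": {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"},
}
after = {
    "peer-a": {"198.51.100.0/24", "203.0.113.0/24"},
    "peer-b": {"198.51.100.0/24", "203.0.113.0/24"},
    "peer-c": {"203.0.113.0/24"},
    "peer-d": {"203.0.113.0/24"},
}

for p in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"):
    print(p, classify(p, before, after))
```

the point of the thresholds is that no single vantage point decides anything: a prefix only counts as an outage when it was widely visible before and nearly every peer has since lost it, while the intermediate cases fall out as "partial".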
And there are situations where the route is down but the service is still up.
unless you use words differently, this is not true. by 'service' i mean 'IP service'. if the route is down, no one can reach anything associated with that route, obviously. do you mean 'service' as local loop service?
There are other network monitoring groups that do end-to-end connectivity tests from geographically distributed clients out to sample systems around the net. Some do this for research and some for hire as network monitoring.
I think what they do is much closer to identifying true outages than your method.
yes, that may be. those are good ways of identifying certain kinds of outages. the problem is that they only measure what they measure. frequently these systems measure well-connected sites monitoring well-connected sites. this creates a bias in the data, tending to suggest that no big event ever really impacts the internet. this is obviously a false conclusion.
for reference, compare the analysis of the 2003 US blackouts from keynote: http://www.keynote.com/news_events/releases_2003/03august14.html (summary: nothing to see here, move along) with those from renesys: http://www.renesys.com//resource_library/blackout_results.html (summary: >4K prefixes disappeared from the global table impacting connectivity to hospitals, schools, government and lots of businesses).
i would agree that our method of routing table analysis has significant limitations and needs to be combined with other data. but it's a fantastic way of showing a lower bound on what was affected: prefixes without entries in the global table almost certainly have no service.
t.
--
_____________________________________________________________________
todd underwood
director of operations & security
renesys - interdomain intelligence
todd@renesys.com www.renesys.com