Re: Ph.D. student looking for data on network failure causes
Hello network operators! I'm a Ph.D. student at UC Berkeley working for Dave Patterson on the ROC project, which is investigating techniques for improving the availability and manageability of large-scale Internet services and systems. I'm currently conducting a study of the root causes (hardware, software, human, etc.) and durations of failures in such systems. To do this, I have been examining the operations trouble ticket databases from several large-scale Internet services (of the Hotmail, eBay, Yahoo!, etc. type).
In doing this research, it has become apparent that for many services (especially geographically distributed ones, e.g. those that use multiple colocation facilities), a major cause of problems is failures of various types in the Internet. Thus I've become interested in finding out the types and root causes of problems in wide-area networks, e.g. within the kinds of large-scale ASes that are administered by the folks on this list. I'm not sure how your services track failures and problems; the problem tracking databases at the services I've examined have been a great source of data about problem scale, symptoms, root causes, durations, steps (and missteps!) taken in diagnosing and fixing problems, etc.
I'm writing to the list because I'm very interested in working with network operators to study the causes of failures in large networks. I realize this type of data is very sensitive to your organizations. I would be happy to talk offline with anyone who is interested in the possibility of sharing data, about how I've overcome the multitude of objections that have been raised by folks I have solicited for data (protecting their customers' privacy, securing datasets when they are not examined on the premises of the services, anonymizing and aggregating data in reporting, etc.). I'm interested in the relative causes of failures, *not* overall availability numbers. As a result of the precautions we've taken, several household-name Internet services have allowed me to examine and report on the problems their services have experienced.
If you're interested in discussing the possibility of sharing access to this kind of data about your service, please contact me. I'm willing to examine data on the premises of your service, to anonymize it fully, to submit any results I want to publish to your organization prior to publication, to sign any necessary NDAs, etc. In return, I'm happy to share with you any insights I have about the problems your service experiences, and you'll contribute to the world's knowledge of why bad things happen to good networks. :-)
If you're not the right person in your organization to contact with this request, but you think your organization might be interested in participating in this study, perhaps you could forward this email to the appropriate person or let me know who the right person to contact in your organization would be.
By the way, just a clarification of my original message: the results of this study will (eventually) be published in some academic forum. I'll post to the NANOG mailing list a pointer to any results when they are available, so there's no need to email me separately to indicate your interest in the results. (Interest in providing data, on the other hand, would be most welcome!)
Thanks,
David
Participants (1): David L. Oppenheimer