Hello,
Is there anyone with clue from UCSF on-list? Or if someone knows how to put me in contact with them, that would be great.
We are not able to query their DNS servers from our network. We've got users not able to access anything UCSF due to this.
Thus far, their response has been to manually put DNS entries into our users hosts file, not actually fix the real issue.
Thanks, -Robert
On 08/01/12 10:51 , Robert Glover wrote:
We are not able to query their DNS servers from our network. We've got users not able to access anything UCSF due to this.
I am querying them OK. I am in US AZ. I am also able to reach manana.garlic.com.
[hyperion]/usr/local# dig www.ucsf.edu @ucsfns2.ucsf.edu
www.ucsf.edu. 3600 IN A 64.54.132.50
ucsf.edu. 3600 IN NS ucsfns1.ucsf.edu.
ucsf.edu. 3600 IN NS adns2.Berkeley.edu.
ucsf.edu. 3600 IN NS adns1.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns2.ucsf.edu.
adns1.Berkeley.edu. 172800 IN A 128.32.136.3
adns1.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::3
adns2.Berkeley.edu. 172800 IN A 128.32.136.14
adns2.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::e
ucsfns1.ucsf.edu. 3600 IN A 128.218.254.10
ucsfns2.ucsf.edu. 3600 IN A 128.218.254.40
;; Query time: 41 msec
;; SERVER: 128.218.254.40#53(128.218.254.40)
;; WHEN: Wed Aug 1 11:02:46 2012
;; MSG SIZE rcvd: 270
Ditto on that from TWTC in Milwaukee, WI.
# dig www.ucsf.edu @ucsfns2.ucsf.edu
; <<>> DiG 9.8.1-P1 <<>> www.ucsf.edu @ucsfns2.ucsf.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49793
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 6
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.ucsf.edu. IN A
;; ANSWER SECTION:
www.ucsf.edu. 3600 IN A 64.54.132.50
;; AUTHORITY SECTION:
ucsf.edu. 3600 IN NS adns1.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns2.ucsf.edu.
ucsf.edu. 3600 IN NS adns2.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns1.ucsf.edu.
;; ADDITIONAL SECTION:
adns1.Berkeley.edu. 172800 IN A 128.32.136.3
adns1.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::3
adns2.Berkeley.edu. 172800 IN A 128.32.136.14
adns2.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::e
ucsfns1.ucsf.edu. 3600 IN A 128.218.254.10
ucsfns2.ucsf.edu. 3600 IN A 128.218.254.40
;; Query time: 63 msec
;; SERVER: 128.218.254.40#53(128.218.254.40)
;; WHEN: Wed Aug 1 13:48:51 2012
;; MSG SIZE rcvd: 259
On Wed, Aug 1, 2012 at 1:07 PM, Henry Stryker <henry@hup.org> wrote:
On 08/01/12 10:51 , Robert Glover wrote:
We are not able to query their DNS servers from our network. We've got users not able to access anything UCSF due to this.
I am querying them OK. I am in US AZ. I am also able to reach manana.garlic.com.
[hyperion]/usr/local# dig www.ucsf.edu @ucsfns2.ucsf.edu
www.ucsf.edu. 3600 IN A 64.54.132.50
ucsf.edu. 3600 IN NS ucsfns1.ucsf.edu.
ucsf.edu. 3600 IN NS adns2.Berkeley.edu.
ucsf.edu. 3600 IN NS adns1.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns2.ucsf.edu.
adns1.Berkeley.edu. 172800 IN A 128.32.136.3
adns1.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::3
adns2.Berkeley.edu. 172800 IN A 128.32.136.14
adns2.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::e
ucsfns1.ucsf.edu. 3600 IN A 128.218.254.10
ucsfns2.ucsf.edu. 3600 IN A 128.218.254.40
;; Query time: 41 msec
;; SERVER: 128.218.254.40#53(128.218.254.40)
;; WHEN: Wed Aug 1 11:02:46 2012
;; MSG SIZE rcvd: 270
In message <50196C8E.60100@garlic.com>, Robert Glover writes:
Hello,
Is there anyone with clue from UCSF on-list? Or if someone knows how to put me in contact with them, that would be great.
We are not able to query their DNS servers from our network. We've got users not able to access anything UCSF due to this.
Thus far, their response has been to manually put DNS entries into our users hosts file, not actually fix the real issue.
Please, please don't misuse "DNS entries". Host files DO NOT and NEVER HAVE taken "DNS entries". They contain hostname/address mappings but they are not and never have been DNS entries.
Thanks, -Robert
-- Mark Andrews, ISC 1 Seymour St., Dundas Valley, NSW 2117, Australia PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
I should have been a little more forthcoming with information. We are having issues with getting responses from these servers: NSMEDCTR1.UCSFMEDICALCENTER.ORG NSMEDCTR2.UCSFMEDICALCENTER.ORG Which are authoritative for "ucsfmedctr.org" and "ucsfmedicalcenter.org". We ARE able to resolve ucsf.edu and things associated with that entity, just NOT the medical center. Thanks, -Robert
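When a name server answers for others but not from one particular network, it can help to probe it directly and distinguish a silent timeout from an error reply. The following is a minimal sketch using only the Python standard library; the function names `build_query` and `probe` are hypothetical, and the sketch assumes plain UDP on port 53 with no EDNS. It is not a substitute for dig, only an illustration of what a single query looks like on the wire.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, txid: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (RFC 1035 wire format) for `name`."""
    # Header: ID, flags (0x0000 = standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte terminates the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A) followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server_ip: str, name: str, timeout: float = 3.0) -> str:
    """Send one UDP query to `server_ip` and classify the outcome."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    # A usable reply has at least a 12-byte header and echoes our query ID.
    if len(reply) < 12 or reply[:2] != query[:2]:
        return "malformed reply"
    rcode = reply[3] & 0x0F  # low four bits of the second flags byte
    return f"reply received, rcode={rcode}"
```

Calling `probe()` with the address of one of the affected servers and a name in the affected zone from both a working and a non-working network would show whether the queries time out or come back with an error code. A timeout from only one network would be consistent with packets being dropped somewhere on the path, though it does not by itself say where.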
Hi all,
I am looking for literature on the (monetary) costs of misconfigurations in an operational ISP network. Are there any such studies I can benefit from?
In a larger context, are there any thorough studies exploring the cost of building and running a large ISP network?
Best,
-Murat
========================================
Murat Yuksel
Associate Professor
Graduate Director
Department of Computer Science and Engineering
University of Nevada, Reno
1664 N. Virginia Street, MS 171, Reno, NV 89557.
Phone: +1 (775) 327 2246, Fax: +1 (775) 784 1877
E-mail: yuksem@cse.unr.edu
Web: http://www.cse.unr.edu/~yuksem
========================================
Hi Murat,
I have never seen any literature on this topic. But I think it is not too difficult to calculate (or estimate).
A misconfiguration will, at the least, have an impact in two areas: network outage and rework. For the network outage, you have to use the SLAs to calculate the cost (how much customer revenue you lost) due to that outage. On the other hand, there is the time and effort spent fixing the misconfiguration. The fix could be removing the misconfiguration and applying a new, correct configuration, or just correcting the misconfiguration in place. This rework translates into time, and time can be translated into money spent.
Regards
On 8/2/12, Murat Yuksel <yuksem@cse.unr.edu> wrote:
Hi all,
I am looking for literature on the (monetary) costs of misconfigurations in an operational ISP network. Are there any such studies I can benefit from?
In a larger context, are there any thorough studies exploring the cost of building and running a large ISP network?
Best,
-Murat
========================================
Murat Yuksel
Associate Professor
Graduate Director
Department of Computer Science and Engineering
University of Nevada, Reno
1664 N. Virginia Street, MS 171, Reno, NV 89557.
Phone: +1 (775) 327 2246, Fax: +1 (775) 784 1877
E-mail: yuksem@cse.unr.edu
Web: http://www.cse.unr.edu/~yuksem
========================================
-- Sent from my mobile device ./diogo -montagner JNCIE-SP 0x41A
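The two components named in the message above (outage cost derived from the SLA, plus rework time converted to money) can be sketched as a small calculation. This is only an illustration under stated assumptions: the `Incident` fields, the per-hour credit scheme, and all figures in the usage note are hypothetical, and real SLA terms vary by contract.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    outage_hours: float       # duration of the customer-visible outage
    affected_mrr: float       # monthly recurring revenue of affected customers
    sla_credit_rate: float    # assumed: fraction of monthly fee credited per outage hour
    rework_hours: float       # engineer time spent finding and fixing the misconfig
    hourly_labor_cost: float  # loaded cost of one engineer hour


def known_cost(incident: Incident) -> float:
    """Sum the two 'known' components: SLA credits and rework labor."""
    sla_credits = (
        incident.outage_hours * incident.sla_credit_rate * incident.affected_mrr
    )
    rework = incident.rework_hours * incident.hourly_labor_cost
    return sla_credits + rework
```

For example, `known_cost(Incident(2.0, 50000.0, 0.05, 6.0, 120.0))` combines two hours of outage against 50,000 of monthly revenue at an assumed 5% credit per hour with six hours of rework at 120 per hour. The sketch covers only the two components listed above and nothing else.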
On Wed, Aug 1, 2012 at 8:08 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
A misconfiguration will, at the least, have an impact in two areas: network outage and rework. For the network outage, you have to use the SLAs to calculate the cost (how much customer revenue you lost) due to that outage. On the other hand, there is the time and effort spent fixing the misconfiguration. The fix could be removing the misconfiguration and applying a new, correct configuration, or just correcting the misconfiguration in place. This rework translates into time, and time can be translated into money spent.
Isn't the largest cost omitted (or at least glossed over) here? Namely, lost customers due to the outage. That's why people have SLAs and rework the network at all -- to avoid that cost. -- Darius Jahandarie
Hi Darius,
You are right: the loss of a customer due to those things. However, I would classify this as an unknown (in terms of risk analysis), because the others I mentioned can be calculated and estimated (they are known). But it is very hard to estimate whether a customer will cancel the contract because of 1 or n network outages. In theory, if the customer's SLA is repeatedly not met, there is some probability that he will cancel the contract.
Regards
On 8/2/12, Darius Jahandarie <djahandarie@gmail.com> wrote:
On Wed, Aug 1, 2012 at 8:08 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
A misconfiguration will, at the least, have an impact in two areas: network outage and rework. For the network outage, you have to use the SLAs to calculate the cost (how much customer revenue you lost) due to that outage. On the other hand, there is the time and effort spent fixing the misconfiguration. The fix could be removing the misconfiguration and applying a new, correct configuration, or just correcting the misconfiguration in place. This rework translates into time, and time can be translated into money spent.
Isn't the largest cost omitted (or at least glossed over) here? Namely, lost customers due to the outage. That's why people have SLAs and rework the network at all -- to avoid that cost.
-- Darius Jahandarie
-- Sent from my mobile device ./diogo -montagner JNCIE-SP 0x41A
On Wed, Aug 1, 2012 at 5:32 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
Hi Darius,
You are right: the loss of a customer due to those things. However, I would classify this as an unknown (in terms of risk analysis), because the others I mentioned can be calculated and estimated (they are known). But it is very hard to estimate whether a customer will cancel the contract because of 1 or n network outages. In theory, if the customer's SLA is repeatedly not met, there is some probability that he will cancel the contract.
Regards
On the end customer side, I've done a bunch of reliability / risk cost assessments for various customers over the years. It's never easy.
For an ISP... customers are fairly locked in, but for big networks and customers, especially multihoming customers, business goes where they want it.
SLA costs are easy. Predicting the final financial impact is hard.
-- -george william herbert george.herbert@gmail.com
Quantifying the business costs would be very complex. Here are some reports and research papers that may be a starting point:
[1] Juniper Networks, Inc., “What's Behind Network Downtime?,” pp. 1–12, May 2008.
[2] R. Mahajan, D. Wetherall, and T. Anderson, “Understanding BGP misconfiguration,” Proceedings of the 2002 conference on Applications, 2002.
[3] A. Medem, R. Teixeira, N. Feamster, and M. Meulle, “Joint analysis of network incidents and intradomain routing changes,” Network and Service Management (CNSM), 2010 International Conference on, pp. 198–205, 2010.
[4] D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage, “California fault lines: understanding the causes and impact of network failures,” presented at the SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 conference on SIGCOMM, 2010.
[5] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S. Pasupathy, “An empirical study on configuration errors in commercial and open source systems,” presented at the SOSP '11: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011.
[6] Z. Kerravala, “As the Value of Enterprise Networks Escalates, So Does the Need for Configuration Management,” cs.princeton.edu, 01-Jan.-2004. [Online]. Available: https://www.cs.princeton.edu/courses/archive/fall10/cos561/papers/Yankee04.p.... [Accessed: 09-May-2012].
[7] W. Enck, P. McDaniel, S. Sen, and P. Sebos, “Configuration management at massive scale: System design and experience,” USENIX '07, Jun. 2007.
[8] R. D. Doverspike, K. K. Ramakrishnan, and C. Chase, “Structural overview of ISP networks,” Guide to Reliable Internet Services and Applications, pp. 19–93, 2010.
On 2 August 2012 10:46, George Herbert <george.herbert@gmail.com> wrote:
On Wed, Aug 1, 2012 at 5:32 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
Hi Darius,
You are right: the loss of a customer due to those things. However, I would classify this as an unknown (in terms of risk analysis), because the others I mentioned can be calculated and estimated (they are known). But it is very hard to estimate whether a customer will cancel the contract because of 1 or n network outages. In theory, if the customer's SLA is repeatedly not met, there is some probability that he will cancel the contract.
Regards
On the end customer side, I've done a bunch of reliability / risk cost assessments for various customers over the years. It's never easy.
For an ISP... customers are fairly locked in, but for big networks and customers, especially multihoming customers, business goes where they want it.
SLA costs are easy. Predicting the final financial impact is hard.
-- -george william herbert george.herbert@gmail.com
The misconfiguration cost is usually not calculable in itself. But I think the more important issue is, "How do we prevent it?" I would spend more time on prevention than assessing the cost.
I can think of several minor provisioning issues that cost us more in customer relations than everything else put together, and a couple of significant ones after which it seemed like nothing had happened. And I am not sure I could have predicted the outcome the day before the event if someone had handed me the scenario to assess.
The reason: when it happens, the CURRENT situation is as much a driver of the impact as the actual event. It even goes back to the emotional state of the customer: whether his toast was burned this morning, whether he/she had a fight with the spouse, who flipped him the bird during his drive in, and a lot of other things that dictate mental state.
I would be very loath to use a vendor whose approach is that all they are concerned about is what an error costs them. I want them to be more concerned about what it costs their customer (me) and what they can do to prevent it.
Proper Prior Preparation Prevents Piss Poor Performance. Training, sound processes, good management practices, good maintenance, and good personnel selection go a long way.
To quote Chief Gassaway (fire chief with good stuff on the web for any business), "Luck validates bad practices." The REB translation: "We did it this way for years and nothing bad happened." In Chief Gassaway's business, bad practices cause Line of Duty Deaths. In ours they cause outages, lost revenue and possibly bankruptcy. Remember, if your company goes belly up, you are out of a job...
http://www.samatters.com/2012/07/31/positive-reinforcement-of-undesirable-behavior/
Ralph Brandt
-----Original Message-----
From: George Herbert [mailto:george.herbert@gmail.com]
Sent: Wednesday, August 01, 2012 9:17 PM
To: Diogo Montagner
Cc: nanog@nanog.org
Subject: Re: cost of misconfigurations
On Wed, Aug 1, 2012 at 5:32 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
Hi Darius,
You are right: the loss of a customer due to those things. However, I would classify this as an unknown (in terms of risk analysis), because the others I mentioned can be calculated and estimated (they are known). But it is very hard to estimate whether a customer will cancel the contract because of 1 or n network outages. In theory, if the customer's SLA is repeatedly not met, there is some probability that he will cancel the contract.
Regards
On the end customer side, I've done a bunch of reliability / risk cost assessments for various customers over the years. It's never easy. For an ISP... customers are fairly locked in, but for big networks and customers, especially multihoming customers, business goes where they want it. SLA costs are easy. Predicting the final financial impact is hard. -- -george william herbert george.herbert@gmail.com
On Aug 2, 2012, at 10:31 AM, Brandt, Ralph wrote:
The misconfiguration cost is usually not calculable in itself. But I think the more important issue is, "How do we prevent it?" I would spend more time on prevention than assessing the cost.
Lots of people have developed best practices on these topics. The problem is pushing against the business side and keeping these in place, and not letting the bar be low at your upstream and peers.
There is a secondary issue that is still unaddressed. Some vendors still send all routes they receive out to all external peers in the absence of a policy. This is something I want to see corrected, as it will require a bit more intelligence when it comes to BGP policy to provide the expected behavior.
- Jared
I do not think occasional outages cause significant loss of customers. Customers get angry easily, but once an issue is fixed, they get happy quickly. Customers have very short memories, and the cost and hassle of changing services is often significant. Outages are never good, but it is better to concentrate on fixing the issue than to panic about customers canceling their service.
Many times the cause of an outage is totally out of your control. For example, most of our outages are caused by Verizon's aging and neglected copper cable plant. I often wish some company had the balls to file a class action lawsuit over Verizon's neglect of their copper plant, but NOBODY wants to piss off their ILEC, including us.
-----Original Message-----
From: Diogo Montagner [mailto:diogo.montagner@gmail.com]
Sent: Wednesday, August 01, 2012 8:32 PM
To: Darius Jahandarie; Murat Yuksel; nanog@nanog.org
Subject: Re: cost of misconfigurations
Hi Darius,
You are right: the loss of a customer due to those things. However, I would classify this as an unknown (in terms of risk analysis), because the others I mentioned can be calculated and estimated (they are known). But it is very hard to estimate whether a customer will cancel the contract because of 1 or n network outages. In theory, if the customer's SLA is repeatedly not met, there is some probability that he will cancel the contract.
Regards
On 8/2/12, Darius Jahandarie <djahandarie@gmail.com> wrote:
On Wed, Aug 1, 2012 at 8:08 PM, Diogo Montagner <diogo.montagner@gmail.com> wrote:
A misconfiguration will, at the least, have an impact in two areas: network outage and rework. For the network outage, you have to use the SLAs to calculate the cost (how much customer revenue you lost) due to that outage. On the other hand, there is the time and effort spent fixing the misconfiguration. The fix could be removing the misconfiguration and applying a new, correct configuration, or just correcting the misconfiguration in place. This rework translates into time, and time can be translated into money spent.
Isn't the largest cost omitted (or at least glossed over) here? Namely, lost customers due to the outage. That's why people have SLAs and rework the network at all -- to avoid that cost.
-- Darius Jahandarie
-- Sent from my mobile device ./diogo -montagner JNCIE-SP 0x41A
On 8/1/12, Diogo Montagner <diogo.montagner@gmail.com> wrote:
I think it's more complicated than that; the cost of misconfiguration is almost inseparable in some cases from the cost of configuration in general. Not all misconfigs are equal, so you might want to concentrate on a specific kind of misconfiguration, or a specific misconfig impact, e.g. "an erroneous filter is applied, causing routes to be accepted from an EGP peer without restriction".
Especially with misconfigurations that might not have an immediately discovered impact, the business impact beyond the cost to discover and resolve may not be apparent; it depends on details of the misconfig, such as how trivial or 'obvious' the error should be and how consistent the problems it causes are. At least if you concentrate on a certain specific type of misconfig and specific impact, you can have a basis for comparison and approximation, though only for that type.
The "fix" to some types of misconfigs might sometimes be to update the design documentation, so the "misconfig" is no longer a misconfiguration; so then you can start asking how you define "misconfig" in the first place, and about the costs of having erroneous or missing documentation. That is hard, because the "costs" of updating documentation and of finding errors, less-than-optimal practices, or possible improvements in configurations are affected by the long-term "costs" or loss of efficiency resulting from failing to correct documentation and failing to review and improve arguably suboptimal configurations.
Some misconfigs or suboptimal configs are discovered by review or other measures before there is any operational impact. Some misconfigs are "safe" or "harmless" by coincidence, but can cause issues later when the network is expanded further according to a design that does not anticipate the misconfig, so the cost there is increased risk.
Not all possible misconfigurations of a network cause an outage; some misconfigurations are actually design errors, not operator errors. Not all network issues are outages; some configuration errors are just things like "some entries in an access-list are dead weight, e.g. can never be reached or are not necessary", and the impact of this error is wasted memory resources, or increased complexity / more unnecessary stuff for humans to look at. (The entry might not have been dead weight when originally added.) Correcting the dead-weight ACL entry situation then is an improvement in efficiency.
Not all misconfigurations are detected, either; sometimes not even misconfigs that caused issues.
An example of a misconfiguration that would occur frequently in some kinds of environments, and might not break an uptime SLA, would be one causing suboptimal performance or reduced cost-effectiveness (e.g. an early upgrade required due to an unrecognized misconfiguration), or configuration dead weight using so much memory that hardware upgrades become needed. On some networks, there might not be a formal SLA, and the end user might not notice or take issue with it.
Another is loss of fault resilience (e.g. the failover path won't work): no SLA is violated if the fault tolerance wasn't required by the SLA, and the configuration error might go undetected for years if regular failover testing was not performed. It might be corrected before there is an issue... then the cost of "increased risk" during the period in which the misconfig wasn't service-affecting could be quite nebulous.
I have never seen any literature on this topic. But I think it is not too difficult to calculate (or estimate). [snip] A misconfiguration will, at the least, have an impact in two areas: network outage and rework. For the network outage, you have to use the SLAs to calculate the cost (how much customer revenue you lost) due to that outage. On the other hand, there is the time and effort spent fixing the misconfiguration. The fix could be removing the [snip]
-- -JH
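The dead-weight access-list entries described above (entries that can never be reached) can be found mechanically in simple cases. Below is a minimal sketch, assuming a simplified ACL model that is not any vendor's syntax: each entry is an (action, source prefix) pair, evaluation is first-match, and an entry counts as dead when a single earlier entry already covers its whole prefix. The name `shadowed_entries` is hypothetical. Entries shadowed only by a combination of several earlier entries are not detected.

```python
import ipaddress
from typing import List, Tuple

# (action, source prefix), e.g. ("permit", "10.0.0.0/8"); a simplified model.
Entry = Tuple[str, str]


def shadowed_entries(acl: List[Entry]) -> List[int]:
    """Return indexes of entries that can never match under first-match rules."""
    seen = []  # networks of all earlier entries, in order
    dead = []  # indexes of entries fully covered by one earlier entry
    for idx, (_action, prefix) in enumerate(acl):
        net = ipaddress.ip_network(prefix)
        # subnet_of() requires both networks to be the same IP version.
        if any(
            net.version == earlier.version and net.subnet_of(earlier)
            for earlier in seen
        ):
            dead.append(idx)
        seen.append(net)
    return dead
```

Given the list `[("permit", "10.0.0.0/8"), ("deny", "10.1.0.0/16")]`, the second entry is reported because every address in 10.1.0.0/16 already matched the first entry. As the message notes, such an entry may not have been dead weight when it was added, so a report like this is a prompt for review rather than proof of an error.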
On 08/01/12 16:22 , Robert Glover wrote:
We are having issues with getting responses from these servers:
NSMEDCTR1.UCSFMEDICALCENTER.ORG NSMEDCTR2.UCSFMEDICALCENTER.ORG
Which are authoritative for "ucsfmedctr.org" and "ucsfmedicalcenter.org".
Those servers respond to my queries from here in AZ:
# dig www.ucsfmedicalcenter.org @nsmedctr2.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 41 msec
;; SERVER: 64.54.50.50#53(64.54.50.50)
;; WHEN: Wed Aug 1 17:36:36 2012
;; MSG SIZE rcvd: 93
# dig www.ucsfmedicalcenter.org @nsmedctr1.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 54 msec
;; SERVER: 64.54.42.50#53(64.54.42.50)
;; WHEN: Wed Aug 1 17:37:41 2012
;; MSG SIZE rcvd: 93
Also responds here in Ohio on TW.
On Wed, Aug 1, 2012 at 8:44 PM, Henry Stryker <henry@hup.org> wrote:
On 08/01/12 16:22 , Robert Glover wrote:
We are having issues with getting responses from these servers:
NSMEDCTR1.UCSFMEDICALCENTER.ORG NSMEDCTR2.UCSFMEDICALCENTER.ORG
Which are authoritative for "ucsfmedctr.org" and "ucsfmedicalcenter.org".
Those servers respond to my queries from here in AZ:
# dig www.ucsfmedicalcenter.org @nsmedctr2.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 41 msec
;; SERVER: 64.54.50.50#53(64.54.50.50)
;; WHEN: Wed Aug 1 17:36:36 2012
;; MSG SIZE rcvd: 93
# dig www.ucsfmedicalcenter.org @nsmedctr1.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 54 msec
;; SERVER: 64.54.42.50#53(64.54.42.50)
;; WHEN: Wed Aug 1 17:37:41 2012
;; MSG SIZE rcvd: 93
participants (15)
- Brandt, Ralph
- Brian Henson
- Darius Jahandarie
- Diogo Montagner
- Eric Wieling
- George Herbert
- Grant Ridder
- Henry Stryker
- Jared Mauch
- Jimmy Hess
- Mark Andrews
- Murat Yuksel
- Randy Bush
- Robert Glover
- Simon Knight