I think it's more complicated than that; in some cases the cost of misconfiguration is almost inseparable from the cost of configuration in general. Not all misconfigs are equal, so you might want to concentrate on a specific kind of misconfiguration, or a specific misconfig impact, e.g. "an erroneous filter is applied, causing routes to be accepted from an EGP peer without restriction."

Especially with misconfigurations whose impact is not discovered immediately, the business impact beyond the cost to discover and resolve may not be apparent; it depends on the details of the misconfig, such as how trivial or 'obvious' the error should have been and how consistently it causes problems. At least if you concentrate on one specific type of misconfig and one specific impact, you have a basis for comparison and approximation, though only for that type.

The "fix" for some types of misconfigs might sometimes be to update the design documentation, so the "misconfig" is no longer a misconfiguration. Then you can start asking how you define "misconfig" in the first place, and what the costs of erroneous or missing documentation are. That is hard, because the cost of updating documentation, finding errors, spotting departures from best practice, and identifying possible configuration improvements has to be weighed against the long-term cost, or lost efficiency, of failing to correct documentation and failing to review and improve arguably suboptimal configurations.

Some misconfigs or suboptimal configs are discovered by review or other measures before there is any operational impact. Some misconfigs are "safe" or "harmless" by coincidence, but can cause issues later when the network is expanded according to a design that does not anticipate the misconfig; the cost there is increased risk.

Not all possible misconfigurations of a network cause an outage; some misconfigurations are actually design errors, not operator errors. Not all network issues are outages, either. Some configuration errors are just things like entries in an access-list that are dead weight, e.g. can never be reached or are no longer necessary, and the impact of that error is wasted memory resources, or increased complexity: more unnecessary stuff for humans to look at. (The entry might not have been dead weight when originally added.) Correcting the dead-weight ACL entries is then an improvement in efficiency; a rough sketch of flagging such entries follows below.

Not all misconfigurations are detected, either, possibly including some misconfigs that caused issues. Examples of misconfigurations that occur frequently in some kinds of environments and might not break an uptime SLA would be suboptimal performance or reduced cost-effectiveness (e.g. an early upgrade required because of an unrecognized misconfiguration), or configuration dead weight using so much memory that hardware upgrades become needed. On some networks there might not be a formal SLA, and the end user might not notice or take issue with it.

Another example is loss of fault resilience (e.g. a failover path that won't work): no SLA is violated if the fault tolerance wasn't required by the SLA, and the configuration error might go undetected for years if regular failover testing isn't performed. It might even be corrected before there is ever an issue... in which case the cost of the "increased risk" during the period when the misconfig wasn't service-affecting could be quite nebulous.
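To illustrate the dead-weight ACL case, here is a rough sketch (mine, purely illustrative) of flagging entries that can never be reached because an earlier entry already covers their whole prefix. It assumes a simplified (action, prefix) entry format and ignores protocol/port matching; the example list is made up.

    # Rough sketch: flag ACL entries whose prefix is fully covered by an
    # earlier entry, so they can never match.  Entry format and match
    # semantics are simplified assumptions; real ACLs also match on
    # protocol and ports, which this ignores.
    from ipaddress import ip_network

    def dead_weight_entries(acl):
        """Return (index, entry) pairs shadowed by an earlier entry."""
        dead = []
        for i, (action, prefix) in enumerate(acl):
            net = ip_network(prefix)
            for _earlier_action, earlier_prefix in acl[:i]:
                if net.subnet_of(ip_network(earlier_prefix)):
                    dead.append((i, (action, prefix)))
                    break
        return dead

    example_acl = [
        ("deny",   "10.0.0.0/8"),
        ("permit", "192.0.2.0/24"),
        ("deny",   "10.1.1.0/24"),   # dead weight: covered by 10.0.0.0/8
    ]
    for index, entry in dead_weight_entries(example_acl):
        print("entry %d is dead weight: %s" % (index, entry))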
On 8/1/12, Diogo Montagner <diogo.montagner@gmail.com> wrote:
> I never saw any literature about this topic. But I think it is not too
> difficult to calculate (or estimate).
[snip]
> A misconfiguration will, at least, impact on two points: network outage
> and re-work. For the network outage, you have to use the SLAs to
> calculate the cost (how much you lost from the customers' revenue) due
> to that outage. On the other hand, there is the time efforts spent to
> fix the misconfiguration. Under the fix, it could be removing the
[snip]
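For reference, a back-of-envelope version of that outage-plus-rework estimate might look like the sketch below; every figure in it is hypothetical.

    # Back-of-envelope cost of one misconfiguration, along the lines
    # described above: outage cost derived from the SLA / lost revenue,
    # plus the rework needed to fix it.  All figures are hypothetical.
    outage_minutes  = 45        # service-affecting time caused by the misconfig
    revenue_per_min = 200.00    # customer revenue at risk per minute (assumed)
    sla_credits     = 1500.00   # credits owed under the SLA (assumed)
    rework_hours    = 6         # engineer time to diagnose, fix, and re-test
    hourly_rate     = 90.00     # loaded cost per engineer-hour (assumed)

    outage_cost = outage_minutes * revenue_per_min + sla_credits
    rework_cost = rework_hours * hourly_rate
    print("estimated cost: %.2f" % (outage_cost + rework_cost))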
--
-JH