On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:

Hey Sabri,

> Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.

I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes. In my tenure I've not seen the frequency of these bugs change at all; NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market. We've modelled the commercial router market so that a poor quality NOS is good for business and a good quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not really trying to make a good NOS. I think it's emergent behaviour: people follow the market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug free, many of its customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need support contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue.

For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when those are found, it's big news. Meanwhile we don't go a month without hitting a novel bug in one of our NOS, and no one cares about it; it's business as usual.

I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone, as the CPU does so much that the C compiler has no control over; it's not really even executing the code as written anymore. MSFT estimated that around 70% of their security vulnerabilities are related to memory safety. We could accomplish significant improvements in software quality if we ditched C, let the computer do more formal correctness checks at compile time, and designed languages which lend themselves to this.

We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defects, and very little engineering time is spent on those.
And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time.

-- 
  ++ytti