Re: Telia Not Withdrawing v6 Routes

17 Nov 2020

      On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:

Hey Sabri,
...
> Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs
> like that get introduced. One would expect that after 20+ years of writing BGP code,
> handling a withdrawal would be easy-peasy.

I don't think this is related to skill, as if there were some hard
programming problem that DE couldn't solve. These are honest mistakes.
In my tenure I've not seen the frequency of these bugs change at all;
NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market. We've modelled the commercial
router market so that a poor-quality NOS is good for business and a
good-quality NOS is bad for business. I don't think this is in anyone's
formal business plan, or that companies even realise they are not
trying to make a good NOS. I think it's emergent behaviour: people
follow the market demand unknowingly.
If we suddenly had one commercial NOS that was 100% bug free, many of
its customers would stop buying support and would rely on spare HW and
Internet forums for configuration help. A lot of us only need contracts
to deal with the novel bugs we all find on a regular basis, so a good
NOS would immediately reduce revenue. For some reason end users of
Windows, macOS or Linux almost never find novel bugs, and when one is
found, it's big news. Meanwhile we don't go a month without hitting a
novel bug in one of our NOS, and no one cares about it; it's business
as usual.

I also put a lot of the blame on C. It was a terrific language when
compiling had to be fast: basically a macro assembler. Now the utility
of being 'close to HW' is gone, because the CPU does so much that the C
compiler has no control over; it's not really even executing the code
as written anymore. MSFT has estimated that about 70% of the
vulnerabilities they assign a CVE are related to memory safety. We
could accomplish significant improvements in software quality if we
ditched C, let the computer do more formal correctness checks at
compile time, and designed languages which lend themselves to this.
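
To make 'checks at compile time' a bit more concrete, here is a minimal
sketch in Rust. It is purely illustrative: the Update and Rib types are
hypothetical, hugely simplified stand-ins for a real BGP implementation,
and this is not a claim about what went wrong in the case discussed in
this thread. The point is only that the compiler refuses to build the
program if the withdraw case is left unhandled.

```rust
use std::collections::HashMap;

// Hypothetical, heavily simplified model of a BGP UPDATE.
// Not taken from any vendor's NOS.
#[derive(Debug, Clone, PartialEq)]
enum Update {
    Announce { prefix: String, next_hop: String },
    Withdraw { prefix: String },
}

// Toy RIB: maps a prefix to its next hop.
#[derive(Default)]
struct Rib {
    routes: HashMap<String, String>,
}

impl Rib {
    // `match` on an enum must cover every variant. If the `Withdraw`
    // arm is deleted, this function fails to compile (non-exhaustive
    // patterns) instead of silently leaving a stale route behind.
    fn apply(&mut self, update: Update) {
        match update {
            Update::Announce { prefix, next_hop } => {
                self.routes.insert(prefix, next_hop);
            }
            Update::Withdraw { prefix } => {
                self.routes.remove(&prefix);
            }
        }
    }

    fn has_route(&self, prefix: &str) -> bool {
        self.routes.contains_key(prefix)
    }
}
```

Announcing 2001:db8::/32 and then withdrawing it leaves has_route()
returning false. Of course this only catches a forgotten case, not a
wrong one, so it is one example of the class of mistakes a compiler can
rule out, not a cure for NOS bugs in general.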

We constantly misattribute problems (like in this post) to config or
HW, while the most common reasons for outages are pilot error and SW
defects, and very little engineering time is spent on those. And often
the time spent improving the first two increases the risk of the latter
two, reducing mean availability over time.

-- 
  ++ytti

Saku Ytti