Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least? e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N. Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref. Matt
This same issue happened in Los Angeles a number of years ago, but for both IPv4 and v6. They need to set up sane BGP timers, and/or advocate the use of BFD for BGP sessions, both customer-facing and internal. Ryan On Nov 15 2020, at 5:58 pm, Matt Corallo <nanog@as397444.net> wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N.
Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref.
Matt
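As a rough illustration of Ryan's point about timers (a sketch with illustrative numbers, not a tuning recommendation): with common default BGP timers a silently dead peer can pin stale routes for minutes, while BFD detects the failure in under a second.

```python
# Worst-case failure-detection arithmetic; numbers are illustrative only.

def bgp_detection_worst_case(hold_time_s: float) -> float:
    """Without BFD, a silently dead peer is only declared down when the
    BGP hold timer expires, so stale routes can linger this long."""
    return hold_time_s

def bfd_detection_worst_case(tx_interval_ms: float, multiplier: int) -> float:
    """BFD declares the session down after `multiplier` consecutive
    missed control packets."""
    return tx_interval_ms * multiplier / 1000.0

# Common defaults: keepalive 60s, hold 180s.
print(bgp_detection_worst_case(180.0))      # 180.0 seconds
# A typical aggressive BFD setting: 300ms interval, multiplier 3.
print(bfd_detection_worst_case(300.0, 3))   # 0.9 seconds
```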
Probably a ghost route. Such things happen :( https://labs.ripe.net/Members/romain_fontugne/bgp-zombies Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so, by the way, they use 6PE). Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails... Some years ago we experienced something similar (it was a router of TI Sparkle still advertising a prefix of ours in Asia to their clients, which they were previously receiving from our former transit GTT – we were advertising it in Europe...).
Le 16 nov. 2020 à 02:58, Matt Corallo <nanog@as397444.net> a écrit :
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
e.g. 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ; it has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go anywhere; traces die immediately after a hop or with a !N.
Wouldn't be a problem except that I needed to withdraw another route due to a separate issue, which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref.
Matt
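Olivier's "zombie" scenario and his re-advertise-then-re-withdraw workaround can be sketched with a toy RIB. This is purely illustrative: the `lost` flag stands in for whatever swallowed the original withdrawal inside the transit network, and the attributes are made up.

```python
# Toy model of a BGP "zombie" route: a withdrawal is lost somewhere in
# the network, leaving a stale route behind. Re-advertising the exact
# same prefix refreshes the state, and a second withdrawal (this time
# delivered) finally removes it.

class ToyRib:
    def __init__(self):
        self.routes = {}  # prefix -> path attributes

    def announce(self, prefix, attrs):
        self.routes[prefix] = attrs

    def withdraw(self, prefix, lost=False):
        if lost:
            return  # models the withdrawal never reaching this router
        self.routes.pop(prefix, None)

rib = ToyRib()
rib.announce("2620:6e:a003::/48", {"as_path": [1299, 397444]})
rib.withdraw("2620:6e:a003::/48", lost=True)    # the zombie is born
assert "2620:6e:a003::/48" in rib.routes        # still visible in the LG

rib.announce("2620:6e:a003::/48", {"as_path": [1299, 397444]})  # re-advertise
rib.withdraw("2620:6e:a003::/48")                               # re-withdraw
assert "2620:6e:a003::/48" not in rib.routes
```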
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix. Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well. Must just be something with my particular prefixes, oh well. Matt On 11/15/20 10:40 PM, Olivier Benghozi wrote:
Probably a ghost route. Such things happen :(
https://labs.ripe.net/Members/romain_fontugne/bgp-zombies
Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so by the way they use 6PE).
Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails...
Some years ago we experienced something similar (it was a router of TI Sparkle still advertising a prefix of ours in Asia to their clients, which they were previously receiving from our former transit GTT – we were advertising it in Europe...).
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
Le 16 nov. 2020 à 04:44, Matt Corallo <nanog@as397444.net> a écrit :
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix.
Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well.
Must just be something with my particular prefixes, oh well.
Matt
Maybe? It's never been an issue before. In this case the route does have a depref community on Telia, which is why one wouldn't expect it via the same path, but the other ghost route in question never had anything similar. Matt
On Nov 15, 2020, at 23:07, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
Le 16 nov. 2020 à 04:44, Matt Corallo <nanog@as397444.net> a écrit :
Yeah, I did try that on that test prefix, but it just stuck around anyway. I don't care too much; it's just some stale test prefix.
Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC, and which has always been announced from other places and was dropped/re-announced as well.
Must just be something with my particular prefixes, oh well.
Matt
For those curious, Johan indicated on Twitter this was a JunOS bug. https://twitter.com/gustawsson/status/1328298914785730561 Matt
On Nov 15, 2020, at 23:13, Matt Corallo <nanog@as397444.net> wrote:
Maybe? It's never been an issue before. In this case the route does have a depref community on Telia, which is why one wouldn't expect it via the same path, but the other ghost route in question never had anything similar.
Matt
On Nov 15, 2020, at 23:07, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
Maybe one of the routers on the path doesn't like the large communities inside those routes? :) By the way, we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
I have seen issues like this in a network that I operated. In that particular case, it was an internal IPv4 10/8 route which was withdrawn, along with a few hundred other routes. The withdrawal was initiated on a DC exit router, in a Clos network with leaf, spine, and superspine layers. On the spine layer, I observed that BGP withdrawals, although being received, were not processed by the control plane. Further investigation, and working with the vendor's TAC, revealed that on that particular platform, the BGP process would stop processing withdrawals in a very nasty race condition that was very difficult to reproduce. This was the first (and so far only) time in my 20+ years of working with BGP that I've observed such a weird bug. Since I operated the entire network, it was fairly easy to find the culprit. The why took some more time. If I were in your shoes, I'd ping Telia's NOC to see what's going on. I would not be surprised if they'd be hitting a similar issue. Thanks, Sabri
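The class of bug Sabri describes, where withdrawals are received but never acted on, often comes down to update queueing or coalescing logic rather than the protocol itself. A hypothetical sketch (not the actual vendor bug) of how naive per-prefix coalescing can eat a withdrawal:

```python
# Hypothetical sketch: coalescing a burst of BGP messages per prefix.
# BGP semantics say the *latest* message for a prefix wins (a new UPDATE
# implicitly replaces earlier state). A "first message wins" coalescer
# silently drops the withdrawal, leaving a stuck route.

def coalesce_buggy(messages):
    """Wrong: keeps the first message seen per prefix."""
    out = {}
    for kind, prefix in messages:
        out.setdefault(prefix, kind)
    return out

def coalesce_correct(messages):
    """Right: the latest message per prefix wins."""
    out = {}
    for kind, prefix in messages:
        out[prefix] = kind
    return out

burst = [("announce", "10.0.0.0/8"), ("withdraw", "10.0.0.0/8")]
assert coalesce_buggy(burst) == {"10.0.0.0/8": "announce"}    # route stuck
assert coalesce_correct(burst) == {"10.0.0.0/8": "withdraw"}  # route gone
```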
See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS. https://twitter.com/gustawsson/status/1328298914785730561 Matt On 11/16/20 2:13 PM, Sabri Berisha wrote:
----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
I have seen issues like this in a network that I operated. In that particular case, it was an internal IPv4 10/8 route which was withdrawn, along with a few hundred other routes. The withdrawal was initiated on a DC exit router, in a Clos network with leaf, spine, and superspine layers. On the spine layer, I observed that BGP withdrawals, although being received, were not processed by the control plane.
Further investigation, and working with the vendor's TAC, revealed that on that particular platform, the BGP process would stop processing withdrawals in a very nasty race condition that was very difficult to reproduce.
This was the first (and so far only) time in my 20+ years of working with BGP that I've observed such a weird bug. Since I operated the entire network, it was fairly easy to find the culprit. The why took some more time.
If I were in your shoes, I'd ping Telia's NOC to see what's going on. I would not be surprised if they'd be hitting a similar issue.
Thanks,
Sabri
----- On Nov 16, 2020, at 11:45 AM, Matt Corallo nanog@as397444.net wrote: Hi,
See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this was due to a (now-worked-around) bug in JunOS.
Interesting. A long time ago, in a galaxy far, far away, when I was a JTAC engineer, policy was that once a PR was hit in the field, it would be marked public. Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy. Thanks, Sabri
On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
Handling a withdrawal is easy. Handling one correctly, without race conditions, when you're seeing withdrawals and additions from multiple BGP sessions concurrently, while also maintaining RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
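To make Valdis's point concrete: even a stripped-down Loc-RIB sketch (illustrative only, nothing like a production implementation) has to track which session contributed each path and re-run best-path selection on every withdrawal:

```python
# Toy Loc-RIB: paths are keyed by (prefix, session), and any withdrawal
# forces a fresh best-path decision. Real implementations must do this
# concurrently across sessions while keeping the FIB consistent.

class LocRib:
    def __init__(self):
        self.paths = {}  # prefix -> {session_id: local_pref}

    def update(self, prefix, session, local_pref):
        self.paths.setdefault(prefix, {})[session] = local_pref

    def withdraw(self, prefix, session):
        sessions = self.paths.get(prefix, {})
        sessions.pop(session, None)
        if not sessions:
            self.paths.pop(prefix, None)

    def best(self, prefix):
        """Best path = highest local preference (only tiebreaker modeled)."""
        sessions = self.paths.get(prefix)
        if not sessions:
            return None
        return max(sessions, key=sessions.get)

rib = LocRib()
rib.update("10.0.0.0/8", "spine1", 100)
rib.update("10.0.0.0/8", "spine2", 200)
assert rib.best("10.0.0.0/8") == "spine2"
rib.withdraw("10.0.0.0/8", "spine2")
assert rib.best("10.0.0.0/8") == "spine1"   # falls back to next-best path
rib.withdraw("10.0.0.0/8", "spine1")
assert rib.best("10.0.0.0/8") is None       # fully withdrawn
```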
Surely they can just put them in an array. ;) On Mon, Nov 16, 2020, 21:54 Valdis Klētnieks <valdis.kletnieks@vt.edu> wrote:
On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
Handling a withdrawal is easy.
Handling one correctly, without race conditions, when you're seeing withdrawals and additions from multiple BGP sessions concurrently, while also maintaining RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
On 17.11.2020 around 02:36 Sabri Berisha wrote:
Interesting. A long time ago, in a galaxy far, far away, when I was a JTAC engineer, policy was that once a PR was hit in the field, it would be marked public.
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
New code, new features, new problems. E.g. public PR1323306 describes a stuck-BGP situation. (And the fixed code should also address a hidden PR that causes down/stale sessions, leading to stuck routes even without a both-side GRES event.) All very, very special cases... but some of us will find / get hit by them (unfortunately). Markus
On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote: Hey Sabri,
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes. I've not seen the frequency of these bugs change at all in my tenure; NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue. For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when those are found, it's big news. Meanwhile, we don't go a month without hitting a novel bug in one of our NOS, and no one cares; it's business as usual.

I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone: the CPU does so much that the C compiler has no control over that it's not really even executing the same code as written anymore. MSFT estimated that >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C, allow the computer to do more formal correctness checks at compile time, and design languages which lend themselves to this.

We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defect, and very little engineering time is spent on those.
And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time. -- ++ytti
On 11/17/20 08:54, Saku Ytti wrote:
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help.
Not to mention that many of us would not need to be around to babysit all this dodgy software. Definitely bad for business :-). Mark.
On Behalf Of Mark Tinka Sent: Tuesday, November 17, 2020 4:32 PM
On 11/17/20 08:54, Saku Ytti wrote:
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help.
Not to mention that many of us would not need to be around to babysit all this dodgy software.
Definitely bad for business :-).
We're already being obsoleted by "self-driving networks"; there's no limit to what one can automate... But then one needs someone to babysit all the automation systems. adam
Saku Ytti Sent: Tuesday, November 17, 2020 6:55 AM
On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:
Hey Sabri,
Also, in the case that I described, it wasn't a Junos device. Makes me wonder how bugs like that get introduced. One would expect that after 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
I don't think this is related to skill, or that there was some hard programming problem that DE couldn't solve. These are honest mistakes. I've not seen the frequency of these bugs change at all in my tenure; NOS bugs are as common now as they were in the 90s.
I put most of the blame on the market: we've modelled the commercial router market so that a poor-quality NOS is good for business and a good-quality NOS is bad for business. I don't think this is in anyone's formal business plan, or that companies even realise they are not trying to make a good NOS; I think it's emergent behaviour due to the market, and people follow that market demand unknowingly. If we suddenly had one commercial NOS which was 100% bug-free, many of their customers would stop buying support and would rely on spare HW and Internet forums for configuration help. A lot of us only need contracts to deal with the novel bugs all of us find on a regular basis, so a good NOS would immediately reduce revenue. For some reason Windows, macOS or Linux almost never have novel bugs that the end user finds, and when those are found, it's big news. Meanwhile, we don't go a month without hitting a novel bug in one of our NOS, and no one cares; it's business as usual.
I also put a lot of blame on C. It was a terrific language when compiling had to be fast; basically a macro assembler. Now the utility of being 'close to HW' is gone: the CPU does so much that the C compiler has no control over that it's not really even executing the same code as written anymore. MSFT estimated that >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C, allow the computer to do more formal correctness checks at compile time, and design languages which lend themselves to this.
We constantly misattribute problems (like in this post) to config or HW, while the most common reasons for outages are pilot error and SW defect, and very little engineering time is spent on those. And often the time spent improving the first two increases the risk of the latter two, reducing mean availability over time.
I agree with everything but the last statement.
From my experience, most of the SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, and at the intended scale; that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).
adam
On 11/18/20 14:58, adamv0025@netconsultings.com wrote:
From my experience, most of the SPs spend considerable time testing for SW defects on the features (and combinations of features) that will be used, and at the intended scale,
I'm not so sure about that, actually. I'd say there are some ISPs that spend some (or a considerable) amount of time testing for software defects. My anecdotal experience is that most ISPs have neither the time, tools, nor resources to do significant testing of software. More like, "is the version anything after R1, has it been around long enough, has it been recommended by TAC, are the -nsp lists raving on about it, is it a maintenance release, is the caveat list too long, does my vendor SE approve", type-thing.
that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).
Which the majority of ISPs likely will never test for. Mark.
I also put a lot of blame on C, it was a terrific language when compiling had to be fast. Basically macro assembler. Now the utility of being 'close to HW' is gone, as the CPU does so much C compiler has no control over, it's not really even executing the same code as-written anymore. MSFT estimated >70% of their bugs are related to memory safety. We could accomplish significant improvements in software quality if we'd ditch C and allow the computer to do more formal correctness checks at compile time and design languages which lend towards this.
Agree 1000%. I think this is greatly compounded by current generations of programmers who come out of school without having had much experience with low-level memory management, having mostly worked in more modern languages that handle such things in a much better way. Moving from college Python to mature C code with a hellscape of pointers must be a pretty jarring transition. :) On Tue, Nov 17, 2020 at 1:56 AM Saku Ytti <saku@ytti.fi> wrote:
participants (11)
- adamv0025@netconsultings.com
- Mark Tinka
- Markus Weber (FvD)
- Matt Corallo
- Neil Hanlon
- Olivier Benghozi
- Ryan Hamel
- Sabri Berisha
- Saku Ytti
- Tom Beecher
- Valdis Klētnieks