Hello,

I noticed that we regressed and started failing the test at https://isbgpsafeyet.com/. Investigating, I found that some of our routes were in the validation state "unknown" when they should have been either valid or invalid. This includes the test prefix, which we received via NL-IX (and via Cogent on IPv6).

We do, however, have plenty of prefixes that are validated and received from the same sources.

This is a Juniper MX204 router running 20.1R1.11. I tried a few things, including "clear bgp neighbor xxx soft-inbound" (which is supposed to rerun the import policy, where the RPKI check and marking happen), but that did not fix it. Doing a "clear bgp neighbor xxx", which disconnects the peer and reconnects after a slight delay, did fix the issue. But I have to do that for every peer we received the prefix from, and potentially we could have trouble with every peer we have :-(

This router was software upgraded and rebooted two days ago. I suspect a race condition: what if the router started its BGP sessions before it was able to communicate with the RPKI validation server, or before the RPKI database was synchronized?

I find it a bit disappointing that we ended up with a bad validation state this easily, and that apparently there is little I can do about it except walk through all our peers and reset their BGP sessions, which frankly is an unacceptable disruption of traffic flow.

Regards,

Baldur
Any default route to a non-ROV-enabled upstream? Do you receive the test prefix from more than one upstream, so that the previous test success could have been a function of upstream ROV?

Rubens

On Tue, Jun 16, 2020 at 8:35 PM Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
> Hello
> I noticed that we regressed and started failing the test at https://isbgpsafeyet.com/. Investigating I found that we apparently had some routes in the validation state "unknown" that should have been either invalid or valid. Including the test prefix which was received via NL-IX (and Cogent on IPv6).
> We do however have plenty of prefixes that are validated and received from the same sources.
> This is a Juniper MX204 router running 20.1R1.11. I tried a few things including "clear bgp neighbor xxx soft-inbound" (supposed to rerun the import policy where RPKI marking and check happens) which did not fix it. Doing a "clear bgp neighbor xxx", which disconnects the peer and reconnects after a slight delay, did however fix the issue. But I have to do that for every peer we received the prefix from and potentially we could have trouble with every peer we have :-(
> This router was software upgraded and rebooted two days ago. I suspect a race condition. What if the router started BGP sessions before it was able to communicate with the RPKI validation server or before the RPKI database was synchronized?
> I find it a bit disappointing that we this easily ended up with a bad validation state and apparently there is little I can do about it, except for walking through all our peers and BGP reset them. Which frankly is an unacceptable disruption of traffic flow.
> Regards,
> Baldur
On Wed, Jun 17, 2020 at 1:43 AM Rubens Kuhl <rubensk@gmail.com> wrote:
> Any default route to a non-ROV enabled upstream ? Do you receive the test prefix from more than one upstream and the previous test success could be a function of upstream ROV ?
No, this is how it looks:

admin@gc-edge1> show route 2606:4700:7000::6715:f40f

internet.inet6.0: 92472 destinations, 288208 routes (90838 active, 0 holddown, 6565 hidden)
+ = Active Route, - = Last Active, * = Both

2606:4700:7000::/48*[BGP/170] 1d 21:46:42, MED 100, localpref 100, from 2001:7f8:13::a503:4307:1
                      AS path: 13335 I, validation-state: unknown
                    > to 2001:7f8:13::a501:3335:1 via nl-ix
                    [BGP/170] 1d 21:46:39, MED 100, localpref 100, from 2001:7f8:13::a503:4307:2
                      AS path: 13335 I, validation-state: unknown
                    > to 2001:7f8:13::a501:3335:1 via nl-ix
                    [BGP/170] 1d 21:46:50, MED 290, localpref 100
                      AS path: 174 37100 13335 I, validation-state: unknown
                    > to 2001:978:2:d::25:1 via cogent

admin@gc-edge1> show route 103.21.244.14

internet.inet.0: 818706 destinations, 2528384 routes (816242 active, 4 holddown, 32715 hidden)
+ = Active Route, - = Last Active, * = Both

103.21.244.0/24    *[BGP/170] 1d 21:35:34, MED 100, localpref 100, from 193.239.117.0
                      AS path: 13335 I, validation-state: unknown
                    > to 193.239.117.114 via nl-ix
                    [BGP/170] 1d 21:35:29, MED 100, localpref 100, from 193.239.116.255
                      AS path: 13335 I, validation-state: unknown
                    > to 193.239.117.114 via nl-ix

Plenty of prefixes in valid state:

admin@gc-edge1> show route table internet.inet.0 validation-state valid

internet.inet.0: 811569 destinations, 2519989 routes (809383 active, 1 holddown, 28989 hidden)
+ = Active Route, - = Last Active, * = Both

1.9.0.0/16         *[BGP/170] 08:05:51, MED 100, localpref 100, from 193.239.117.0
                      AS path: 6939 4788 I, validation-state: valid
                    > to 193.239.116.14 via nl-ix
                    [BGP/170] 08:04:24, MED 100, localpref 100, from 193.239.116.255
                      AS path: 6939 4788 I, validation-state: valid
                    > to 193.239.116.14 via nl-ix
1.9.250.0/24       *[BGP/170] 08:05:48, MED 100, localpref 100, from 193.239.117.0
                      AS path: 6939 4788 I, validation-state: valid
                    > to 193.239.116.14 via nl-ix
                    [BGP/170] 08:04:21, MED 100, localpref 100, from 193.239.116.255
                      AS path: 6939 4788 I, validation-state: valid
                    > to 193.239.116.14 via nl-ix
1.32.218.0/24       [BGP/170] 2d 05:48:33, MED 210, localpref 100
                      AS path: 1299 2914 64050 4842 I, validation-state: valid
                    > to 62.115.180.72 via telia
                    [BGP/170] 2d 05:47:35, MED 290, localpref 100
                      AS path: 174 2914 64050 4842 I, validation-state: valid
                    > to 149.6.137.49 via cogent
etc

After clearing the relevant BGP sessions, the Cloudflare invalid prefixes are gone from our routing table and we pass the test again.

Regards,

Baldur
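
For anyone wanting to see which sessions are affected before resetting anything, here is a rough sketch (not something from the router or from Juniper) that tallies routes in the "unknown" state per outgoing interface from saved "show route" text. The function names parse_routes and unknown_by_interface are made up for illustration, the sample is an excerpt of the output above, and the regular expressions only assume the text layout shown in this thread, so they may need adjusting for other output variants.

```python
import re
from collections import Counter

# A route entry starts with an optional prefix, an optional active marker,
# and a "[BGP/<preference>]" tag (layout assumed from the output in this thread).
ENTRY_RE = re.compile(r"^(?P<prefix>[0-9A-Fa-f:.]+/\d+)?\s*[*+-]?\s*\[BGP/\d+\]")
STATE_RE = re.compile(r"validation-state:\s*(\S+)")
VIA_RE = re.compile(r"\bvia\s+(\S+)")

# Excerpt of the "show route" output quoted in this thread.
SAMPLE = """
internet.inet6.0: 92472 destinations, 288208 routes (90838 active, 0 holddown, 6565 hidden)
+ = Active Route, - = Last Active, * = Both

2606:4700:7000::/48*[BGP/170] 1d 21:46:42, MED 100, localpref 100, from 2001:7f8:13::a503:4307:1
                      AS path: 13335 I, validation-state: unknown
                    > to 2001:7f8:13::a501:3335:1 via nl-ix
                    [BGP/170] 1d 21:46:39, MED 100, localpref 100, from 2001:7f8:13::a503:4307:2
                      AS path: 13335 I, validation-state: unknown
                    > to 2001:7f8:13::a501:3335:1 via nl-ix
                    [BGP/170] 1d 21:46:50, MED 290, localpref 100
                      AS path: 174 37100 13335 I, validation-state: unknown
                    > to 2001:978:2:d::25:1 via cogent
1.32.218.0/24       [BGP/170] 2d 05:48:33, MED 210, localpref 100
                      AS path: 1299 2914 64050 4842 I, validation-state: valid
                    > to 62.115.180.72 via telia
                    [BGP/170] 2d 05:47:35, MED 290, localpref 100
                      AS path: 174 2914 64050 4842 I, validation-state: valid
                    > to 149.6.137.49 via cogent
"""


def parse_routes(text):
    """Return one dict per BGP route entry: prefix, validation state, interface."""
    routes = []
    prefix = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        entry = ENTRY_RE.match(line)
        if entry:
            # Additional paths for the same prefix omit the prefix column.
            if entry.group("prefix"):
                prefix = entry.group("prefix")
            current = {"prefix": prefix, "state": None, "via": None}
            routes.append(current)
            continue
        if current is None:
            continue  # table header lines before the first route entry
        state = STATE_RE.search(line)
        if state:
            current["state"] = state.group(1)
        via = VIA_RE.search(line)
        if via:
            current["via"] = via.group(1)
    return routes


def unknown_by_interface(routes):
    """Count routes in validation state 'unknown' per outgoing interface."""
    counts = Counter()
    for route in routes:
        if route["state"] == "unknown":
            counts[route["via"]] += 1
    return counts


if __name__ == "__main__":
    for interface, count in unknown_by_interface(parse_routes(SAMPLE)).items():
        print(interface, count)
```

This only reports where the "unknown" routes sit; it does not touch the router, and whether a session actually needs a reset is still a judgment call.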