mitigations for the BGP vortex attack
Hi.

I recently read an interesting research paper that I don't think has been shared here yet: https://www.usenix.org/system/files/usenixsecurity25-stoeger.pdf

I would throw out two additional thoughts related to "Partial Mitigations" (6.1):

1. Use of modern, multi-threaded routing daemons which are able to take advantage of multi-core control plane CPUs.

2. Control plane CPU usage monitoring, plus monitoring of the per-neighbor rate of sent and received BGP UPDATE messages. For example, Junos supports this via SNMP, telemetry or RPC.

In the "Safe BGP Communities" (6.2) paragraph the paper advises networks to cease supporting the "Lower Local Pref Below Peer" community. As the BGP vortex forms only when both the "Lower Local Pref Below Peer" and "Selective NOPEER" communities are present, perhaps an alternative approach would be to cease supporting those two communities in combination. In other words, both communities would still be allowed separately, but not together. I did a brief analysis on RIPE RIS data for NTT, GTT and Sparkle "Lower Local Pref Below Peer" and "Selective NOPEER" type communities, and the combination where both types are attached to the same prefix is rare.

Finally, I wonder if or how much the attack is amplified if the adversary ensures (for example, using communities) that each announced prefix has unique path attributes and thus there is an UPDATE message per prefix.

Martin
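(A rough illustration of the "allowed separately, but not together" idea above, as a minimal Python sketch of the check such an ingress policy would perform. The community values are made-up placeholders, not any provider's actual "Lower Local Pref Below Peer" or "Selective NOPEER" values.)

# Placeholder community values; substitute the provider's documented ones.
LOWER_LOCAL_PREF_BELOW_PEER = {"65000:80"}
SELECTIVE_NOPEER = {"65000:3001", "65000:3002", "65000:3003"}

def violates_combined_policy(communities):
    # Reject only the combination that lets a BGP vortex form: at least one
    # community from BOTH groups attached to the same route.
    has_lower_pref = bool(communities & LOWER_LOCAL_PREF_BELOW_PEER)
    has_nopeer = bool(communities & SELECTIVE_NOPEER)
    return has_lower_pref and has_nopeer

print(violates_combined_policy({"65000:80"}))                # False: allowed alone
print(violates_combined_policy({"65000:3001"}))              # False: allowed alone
print(violates_combined_policy({"65000:80", "65000:3001"}))  # True: combination rejected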
On a quick first read, this seems like very much one of those things that is theoretically possible, but highly implausible in the real world.

1. This would be a lot of money for an attacker to spend, connecting to 3 specific ASNs, just to slow down convergence.

2. p3619 : "Then each new prefix will be propagated in parallel."

Not really. Even if you assume the AS A sent a single UPDATE with 1 NLRI for each prefix, ASes B C D are going to aggregate multiple NLRI changes in a single UPDATE message to each other. This isn't going to cause the amplification claimed.

3. p3620, 5.1 Experiment Infrastructure

Their virtualized test setup is many orders of magnitude less powerful than the actual hardware run by the ASNs that would theoretically be susceptible to this. The software run on this hardware is also WAY more optimized than FRR and BIRD are, especially at the massive BGP scale that they run.

4. p3622, 5.3 BGP Vortices Delay Network Convergence, Methodology

This methodology is bad. "I waited X seconds to see" is meaningless. In a controlled environment, you can set things up to see exactly how long convergence takes. You don't need to handwave it.

The real DFZ sees almost constant update splashing and oscillations similar to this 24/7/365, none of it malicious. And it has for years.
On Mon, Oct 6, 2025 at 10:14 AM Tom Beecher via NANOG <nanog@lists.nanog.org> wrote:
On a quick first read, this seems like very much one of those things that is theoretically possible, but highly implausible in the real world.
1. This would be a lot of money for an attacker to spend, connecting to 3 specific ASNs, just to slow down convergence.
To be fair, in Appendix A, the authors point out that the same effect can be had through downstream connections, so long as the upstream network isn't filtering BGP communities. So, you can get the same effect by buying a single BGP connection to a 4th, tier 2 network, so long as the upstream you've chosen a) doesn't strip BGP communities inbound from customers, b) doesn't strip BGP communities before propagating routes upstream, and c) connects to a trio of ASNs that are mutual peers of each other. So, I could trigger this via a simple downstream BGP adjacency through Cogent, for example, for relatively little money.
3. p3620, 5.1 Experiment Infrastructure
Their virtualized test setup is many orders of magnitude less powerful than the actual hardware run by the ASNs that would theoretically be susceptible to this. The software run on this hardware is also WAY more optimized than FRR and BIRD are, especially at the massive BGP scale that they run.
4. p3622, 5.3 BGP Vortices Delay Network Convergence, Methodology
This methodology is bad. "I waited X seconds to see" is meaningless. In a controlled environment, you can set things up to see exactly how long convergence takes. You don't need to handwave it.
The real DFZ sees almost constant update splashing and oscillations similar to this 24/7/365, none of it malicious. And it has for years.
I had to chuckle at this part:

p 3620, Discussion: "To put the results above in perspective, a recent report [28] shows that, in 2024, the APNIC R&D Center AS (AS 131072) received around 200000 BGP updates per day, or 2.3 per second. Thus, the fact that a single BGP Vortex attack, based only on 21 ASes, can induce tens of thousands of updates per period highlights the potential impact a BGP Vortex attack can have on the global routing system. Clearly, then, the practical impact of the abstract results described above depends on many factors, but most importantly:"

Yes, on a typical boring day on the Internet, that's about right. However, taking that rate as though it's indicative of what core routers can *handle* is laughable. Flap a transit adjacency, and your router is going to be processing 1M+ BGP update messages, hopefully in a small number of minutes. If my core routers can't deal with at least 200,000 BGP updates a minute, I'm going to be in a world of hurt every time an upstream neighbor session drops and re-establishes.

Likewise, on page 3625, the paper says: "Rexford et al. [43] and Labovitz et al. [31] showed that while routes to popular destinations tend to be stable over time, network changes can trigger convergence delays lasting tens of minutes."

The two studies cited were performed in 2000 and 2002, a quarter of a century ago. I will confess, I'm still using network hardware from that era... in my home network. Any network connecting to the BGP core of the Internet that's running hardware from that era... may ${deity} have mercy on your CPU cores. ^_^;

While this is an interesting demonstration of something we've all had a gut-level understanding probably takes place all the time due to inconsistent policies and unintentional overlooking of implementation details between peers, there are simpler ways to attack the DFZ core with more devastating impact.

The amount of sleep I'd be losing worrying about this is negligible. Of course, that needs to be understood in the context of just how little sleep I tend to get in general. ^_^;

Thanks!

Matt
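(The back-of-envelope comparison above can be written out explicitly; a short Python sketch using only the figures quoted in this thread, nothing measured.)

SECONDS_PER_DAY = 24 * 60 * 60

# Quiet-day rate cited from the paper: ~200000 UPDATEs/day at AS 131072.
print(200_000 / SECONDS_PER_DAY)          # ~2.3 UPDATEs/second

# A transit session flap: the "1M+ update messages" figure, drained over a
# few minutes, gives the rate a core router is routinely expected to absorb.
for minutes in (2, 5, 10):
    print(minutes, "min ->", round(1_000_000 / (minutes * 60)), "UPDATEs/second")

# The "at least 200,000 BGP updates a minute" floor, expressed per second:
print(round(200_000 / 60), "UPDATEs/second")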
Hi.
2. p3619 : "Then each new prefix will be propagated in parallel."
Not really. Even if you assume the AS A sent a single UPDATE with 1 NLRI for each prefix, ASes B C D are going to aggregate multiple NLRI changes in a single UPDATE message to each other. This isn't going to cause the amplification claimed.
Perhaps the authors meant that each UPDATE message sent by AS A has unique path attributes, thus ensuring that ASes B, C and D cannot aggregate multiple NLRIs into a single UPDATE message.

I tried to replicate the "BGP Vortices Delay Network Convergence" test demonstrated in paragraph 5.3. The setup (drawing: https://gist.github.com/tonusoo/1cced39aa6ae53143d12623a05f02331) is very similar to figure 4b on page 3621, but all my routers are running BIRD 3 (single-thread mode). Router "rY" (ingress) injects a real BGP feed into the lab setup, router "rX" (upstream) periodically advertises and withdraws 50 routes, and router "rK" injects 5k prefixes for the BGP vortex. Running a packet capture on the Linux bridge connecting, for example, the "rN" and "rM" routers confirms that the BGP vortex is ongoing and I'm seeing well over 10k UPDATE messages per second. However, I might be doing something wrong, but I don't see the delays shown in figure 5a on page 3622. That is, 50 routes advertised or withdrawn by "rX" are propagated to "rZ" within a few hundred milliseconds and not delayed for 10+ seconds.

Martin
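(On the unique-path-attributes point: a speaker can pack into one UPDATE only NLRI that share an identical attribute set, so tagging every prefix with a distinct community forces one UPDATE per prefix. A toy Python sketch of that packing logic, not any daemon's actual implementation.)

from collections import defaultdict

def pack_updates(routes):
    # Toy model: NLRI sharing the exact same attributes (here AS_PATH plus
    # communities) can be carried in a single UPDATE message.
    groups = defaultdict(list)
    for prefix, as_path, communities in routes:
        groups[(as_path, frozenset(communities))].append(prefix)
    return list(groups.values())

shared = [("203.0.113.%d/32" % i, "65001 65002", {"65001:100"}) for i in range(50)]
print(len(pack_updates(shared)))   # 1 UPDATE carries all 50 prefixes

unique = [("203.0.113.%d/32" % i, "65001 65002", {"65001:100", "65001:%d" % (1000 + i)})
          for i in range(50)]
print(len(pack_updates(unique)))   # 50 UPDATEs, one per prefix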
Matthew Petach via NANOG wrote on 06/10/2025 22:23:
While this is an interesting demonstration of something we've all had a gut-level understanding probably takes place all the time due to inconsistent policies and unintentional overlooking of implementation details between peers, there are simpler ways to attack the DFZ core with more devastating impact.
more to the point, the moment you implement both filtering and propagation of subnets in a routing protocol which allows next-hop resolution, you can no longer deterministically defend against this entire class of threats. This is well known, e.g. connecting up a GRE VPN and inserting the NH of the endpoint into the tunnel. Or being careless with prefix redistribution between routing protocols and finding out that it's very easy to shoot yourself in the foot. Hopefully most network engineers have done this accidentally in their lives so that they learn to be aware of it as something that can happen. Obviously the singe marks on my fingers are from ... other things and definitely none of the above (looks at floor awkwardly).

Anyway, the principle of all these things is the same: oscillatory invalidation of the next-hop IP address.
The amount of sleep I'd be losing worrying about this is negligible.
Indeed. On a separate issue, it's frustrating when people see the need to brand their discoveries with breathless names and often (not in this case) cutesy logos. What makes a vulnerability relevant is the product of its exploitability and its impact. I'd rate this one as being worth having router high-CPU triggers on the NMS and possibly churn counters, but not much more. Nick
On Tue, Oct 7, 2025 at 9:55 AM Martin Tonusoo via NANOG < nanog@lists.nanog.org> wrote:
Hi.
2. p3619 : "Then each new prefix will be propagated in parallel."
Not really. Even if you assume the AS A sent a single UPDATE with 1 NLRI for each prefix, ASes B C D are going to aggregate multiple NLRI changes in a single UPDATE message to each other. This isn't going to cause the amplification claimed.
Perhaps the authors meant that each UPDATE message sent by AS A has unique path attributes, thus ensuring that ASes B, C and D cannot aggregate multiple NLRIs into a single UPDATE message.
I tried to replicate the "BGP Vortices Delay Network Convergence" test demonstrated in paragraph 5.3. The setup (drawing: https://gist.github.com/tonusoo/1cced39aa6ae53143d12623a05f02331) is very similar to figure 4b on page 3621, but all my routers are running BIRD 3 (single-thread mode). Router "rY" (ingress) injects a real BGP feed into the lab setup, router "rX" (upstream) periodically advertises and withdraws 50 routes, and router "rK" injects 5k prefixes for the BGP vortex. Running a packet capture on the Linux bridge connecting, for example, the "rN" and "rM" routers confirms that the BGP vortex is ongoing and I'm seeing well over 10k UPDATE messages per second. However, I might be doing something wrong, but I don't see the delays shown in figure 5a on page 3622. That is, 50 routes advertised or withdrawn by "rX" are propagated to "rZ" within a few hundred milliseconds and not delayed for 10+ seconds.
Looking at figure 6, it appears that the larger component is the time between when the BGP update message arrived at the bystander-AS and when FRR finished logging the update message in its logs. As the methodology claims: "By subtracting the time a route advertisement arrived at the bystander-AS from when it was logged in the FRR's BGP log, we computed the processing time on the bystander-AS."

As someone who has dealt with logging of debugging output from programs that need to be as real-time as possible, the logging functions are generally written to be asynchronous and separate from the main processing path, so that delays in the logging subsystem don't hold up the real work the program is doing. Using the appearance of a log message as an indicator of precise timing of when a RIB update happened is handwavy at best, and flat-out wrong at worst. The timestamp at which the zlog subsystem of FRR got the BGP update log message is unlikely to be the same timestamp at which the RIB itself was updated.

Indeed, when researching FRR logging timestamps, it says:

"Performance impact: Debug-level logging can significantly increase the load on the system and may not capture precise, real-time updates without impacting performance, especially for frequent RIB updates."

So, you end up with a double-whammy; turning on debug logging to see the logs for the routing updates significantly increases the load on the box running FRR, which in turn slows down the rate at which it can process update messages coming in.

I think we've all known for years the perils of turning on extensive debug messages on routers. How many of us have had the awkward moment of a partner shaking us awake in bed saying "what happened? You were shouting 'undebug all! undebug all!' in your sleep. Were you having a nightmare?"

I suspect if you turn on verbose debugging logging on "rZ", you might find that suddenly route updates to the RIB slow down noticeably. This has less to do with the actions of a route vortex, and much more to do with hitting the CPU of your router over the head repeatedly with the blunt hammer of sprintf. ^_^;;

Matt
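(The asynchronous-logging point is easy to demonstrate outside of FRR. A small Python sketch using the standard library's queue-based log handlers shows how far the moment a log line is actually written can lag behind the event it describes once the logger is put under a burst; nothing here is FRR-specific.)

import logging
import logging.handlers
import queue
import time

class SlowSink(logging.Handler):
    # Stand-in for an expensive log sink (formatting, disk, vty). Tracks how
    # late each record is written relative to when the event was logged.
    def __init__(self):
        super().__init__()
        self.max_lag = 0.0
    def emit(self, record):
        time.sleep(0.001)  # pretend each write costs 1 ms
        self.max_lag = max(self.max_lag, time.time() - record.created)

q = queue.Queue()
sink = SlowSink()
listener = logging.handlers.QueueListener(q, sink)
listener.start()

log = logging.getLogger("update-demo")
log.addHandler(logging.handlers.QueueHandler(q))  # producer never blocks on I/O
log.setLevel(logging.DEBUG)

for i in range(2000):                # burst of "processed update" events
    log.debug("processed update %d", i)

listener.stop()                      # drains the queue before returning
print("worst lag between event and log write: %.2f s" % sink.max_lag)
# With 2000 queued records at ~1 ms each, the last log line appears roughly
# two seconds after the event it records, so log timestamps make a poor
# convergence clock.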
Hi.
Using the appearance of a log message as an indicator of precise timing of when a RIB update happened is handwavy at best, and flat-out wrong at worst.
Exactly. Once I replaced BIRD with FRR on router "rZ" (bystander; https://gist.github.com/tonusoo/1cced39aa6ae53143d12623a05f02331), I indeed observed propagation delays of 10+ seconds. Input queues of bgpd were constantly full, the related TCP receive queues were extremely high and this caused "rZ" to frequently send TCP messages, with the receive window set to zero, to its BGP neighbors "rM" and "rN".

Setting the scheduling priority of bgpd and zebra (the daemon responsible for updating the kernel routing table) to the highest possible value did lower the input queue of bgpd, but it was still very high. Dropping the 5k oscillating routes in the ingress route-map of FRR allowed bgpd to consume all the CPU resources of the virtual machine, as it no longer had to compete with zebra, but it was still not enough to keep the bgpd input queue low.

The research paper includes scripts for generating the router configurations: https://zenodo.org/records/16739858 These configurations include a few elements that are uncommon in production routers and which may, more or less, affect the performance of the routing daemon:

* FRR in "bystander" dumps the UPDATE messages to an MRT file
* logging of processed BGP route advertisements. This was already explained by Matthew Petach.

Martin
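(One way to spot the symptom described above from the shell of the router VM: list established BGP sessions whose TCP receive queue has backed up, which is what eventually turns into zero-window advertisements. A rough Python sketch that assumes a Linux host, iproute2's "ss" utility and its usual "State Recv-Q Send-Q Local Peer" column layout.)

import subprocess

RECVQ_WARN = 65536  # bytes; arbitrary threshold chosen for this sketch

out = subprocess.run(
    ["ss", "-tn", "( sport = :179 or dport = :179 )"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

for line in out[1:]:                      # first line is the column header
    fields = line.split()
    if len(fields) < 5:
        continue
    state, recv_q, send_q, local, peer = fields[:5]
    if state == "ESTAB" and int(recv_q) > RECVQ_WARN:
        print("%s <-> %s: Recv-Q %s bytes, BGP reader is falling behind"
              % (local, peer, recv_q))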
On Thu, Oct 16, 2025 at 01:59:05PM +0300, Martin Tonusoo via NANOG wrote:
Once I replaced BIRD with FRR on router "rZ" (bystander; https://gist.github.com/tonusoo/1cced39aa6ae53143d12623a05f02331), I indeed observed propagation delays of 10+ seconds.
The industry standard concepts of Adj-RIB-Out & MRAI seem extremely relevant to some of the claims made in this *checks notes* paper-that-also-promotes SCION (??!!).

The paper's authors could not entirely avoid commenting on MRAI, but in a brief paragraph they stated: """MRAI could theoretically mitigate the BGP Vortex attack by slowing down route oscillations""".

In my opinion the lede is being buried a bit here: if it is argued there is a potential problem, then in MRAI there is a potential and deployable solution that'll work on quite a few platforms! Why not just set MRAI to something very small, like 1 or 2 seconds, on systems that support it, and call it a day?

Kind regards,

Job
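(To put a rough number on the damping effect, a toy Python model of a per-prefix, per-peer minimum advertisement interval applied to a route that oscillates far faster than the timer. This is a sketch of the idea, not of RFC 4271's exact state machine; in particular, real MRAI applies to advertisements rather than to every change.)

def updates_sent(oscillation_interval, mrai, window=60.0):
    # One prefix changes every `oscillation_interval` seconds; an UPDATE
    # towards a peer is emitted only if at least `mrai` seconds have passed
    # since the previous one (mrai=0 means no rate limiting).
    sent, last_sent, t = 0, float("-inf"), 0.0
    while t < window:
        if t - last_sent >= mrai:
            sent += 1
            last_sent = t
        t += oscillation_interval
    return sent

# A vortex flipping one route 100 times per second, observed for a minute:
for mrai in (0, 1, 2, 5):
    print("MRAI %ds -> %5d UPDATEs/minute per prefix per peer"
          % (mrai, updates_sent(0.01, mrai)))
# No MRAI: 6000 UPDATEs/minute for this single prefix; a 1-2 second MRAI
# collapses that to roughly 60 and 30.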
Hi.
Why not just set MRAI to something very small, like 1 or 2 seconds, on systems that support it, and call it a day?
Yes, that would definitely help too. It's also suggested by the authors in paragraph 6.1 ("Partial Mitigations").

To put some numbers behind it, I replaced the router "rM" running BIRD in a virtual machine with a physical router (1) running Junos, which has a rate-limiting functionality similar (2) to MRAI called "out-delay". When the "out-delay" was not configured, that is, it was 0 seconds, the average UPDATE/sec rate in the BGP vortex was around 5900 (3). From the FRR router "rZ" point of view the situation looked similar to what I described in my previous e-mail: input queues of bgpd were filled up and "rZ" frequently sent TCP packets with a zero receive window to its BGP neighbors.

Once the "out-delay" was set to 1 second on the "rM" Juniper router, the UPDATE message rate in the BGP vortex dropped to 1700 messages per second and, as far as I tested, the route propagation delay from "rX" to "rZ" ranged between 1.2 and 1.8 seconds.

(1) Juniper MX960; RE-S-1800X4-16G-S routing engine; 4-core Intel Xeon 5500 family CPU C5518 at 1.73 GHz; Junos version 23.4R2.13; two shard threads (junos-bgpshard0, junos-bgpshard1) and two update-IO threads (bgp-updio-0, bgp-updio-1) able to run on different cores

(2) The difference between the MRAI timer (https://datatracker.ietf.org/doc/html/rfc4271#section-9.2.1.1) and Juniper's "out-delay" is very well explained, for example, in the "On Update Rate-Limiting in BGP" research paper (https://web-backend.simula.no/sites/default/files/publications/Simula.simula...), in paragraph "II. RATE-LIMITING IMPLEMENTATIONS"

(3) https://gist.github.com/tonusoo/1cced39aa6ae53143d12623a05f02331?permalink_c...

Martin
participants (5)
- Job Snijders
- Martin Tonusoo
- Matthew Petach
- Nick Hilliard
- Tom Beecher