Analysing traffic in context of rejecting RPKI invalids using pmacct
Dear all,

Whether to deploy RPKI Origin Validation with an "invalid == reject" policy really is a business decision. One has to weigh the pros and cons: What are the direct and indirect costs of accepting misconfigurations or hijacks for my company? What is the cost of deploying RPKI? What is the cost of honoring misconfigured RPKI ROAs? There are a few thousand misconfigured ROAs; what does this mean for me?

To answer these questions, Paolo Lucente and I worked to extend the pmacct traffic analysis engine (http://pmacct.net/) so that it can perform the RFC 6811 Origin Validation procedure and present the outcome as a property in the flow aggregation process.

Pmacct has the ability to ingest BGP feeds and correlate the BGP data with the sFlow/NetFlow/IPFIX data. This allows for fantastic business intelligence: you can see exactly how much traffic is flowing from what customers to what endpoints for what reason!

Pmacct implemented Origin Validation in a cute way: it separates RPKI invalid BGP announcements into two categories:

a) "invalid with no overlapping or alternative route" (aka will be blackholed if 'invalid == reject')
b) "invalid but an overlapping unknown/valid announcement also exists" (end-to-end connectivity can still work).

Because pmacct separates out the various kinds of (invalid) BGP announcements, operators don't have to deploy *anything* in their network to get a good grasp of what their connectivity to the rest of the Internet would look like after deploying an "invalid == reject" policy. No changes to your network configurations are required to make use of this feature; you don't need to tag routes with communities or do other tricks. All the analysis happens inside pmacct.

Of course we tested this first in the NTT global backbone AS 2914! At the moment of writing, we're seeing less than a handful of gigabits per second being sent towards BGP announcements that are RPKI Invalid and for which no alternative route exists.
In the context of NTT's backbone that amount of traffic is just statistical noise. This is a very encouraging sign; it may help us move towards the goal of deploying RPKI Origin Validation in AS 2914.

Nusenu wrote a great blog post on where these RPKI ROA misconfigurations are located. I recommend reading their posts to develop a better understanding of the problem space: https://medium.com/@nusenu/where-are-rpki-unreachable-networks-located-65c7a...

Even if you don't intend to deploy RPKI Origin Validation (or are single-homed), pmacct's RPKI capabilities can be useful in forensic investigations. It'll be easier to analyse how much and what kind of traffic was sent to a possible hijack, and for what period of time. This will help you when writing RFOs!

If you want to test-drive this feature, fetch pmacct version 1.7.3-rc1 from https://github.com/pmacct/pmacct/releases/tag/1.7.3-rc1

Documentation on how to configure the feature:
https://github.com/pmacct/pmacct/blob/master/QUICKSTART#L1783-#L1833
https://github.com/pmacct/pmacct/blob/master/CONFIG-KEYS#L2626-#L2647

Let us know what you think! Or if you'd like to chat telemetry with Paolo or me about analysing the effects of BGP hijacks and RPKI, we'll both be at the San Francisco NANOG meeting next week!

Kind regards,

Job

ps. Dear Kentik & Deepfield, please copy+paste this feature! We'll happily share development notes with you, you can even look at pmacct's source code for inspiration. :-)
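For readers who want to see the two invalid categories spelled out, here is a minimal Python sketch of the idea. It is not pmacct's code: the names `VRP`, `Route`, `validate` and `classify` are hypothetical, the table is scanned linearly rather than with a trie, and "overlapping" is simplified to "an equal or less-specific route that is not invalid exists".

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class VRP:
    """Validated ROA Payload (hypothetical container for this sketch)."""
    prefix: object      # ipaddress network object
    max_length: int
    asn: int


@dataclass(frozen=True)
class Route:
    """A BGP announcement reduced to prefix and origin AS."""
    prefix: object
    origin: int


def covers(outer, inner):
    # True when `inner` is equal to, or more specific than, `outer`.
    return outer.version == inner.version and inner.subnet_of(outer)


def validate(route, vrps):
    """RFC 6811 style outcome: 'valid', 'unknown' (NotFound) or 'invalid'."""
    covering = [v for v in vrps if covers(v.prefix, route.prefix)]
    if not covering:
        return "unknown"
    for v in covering:
        if (route.prefix.prefixlen <= v.max_length
                and v.asn == route.origin and v.asn != 0):
            return "valid"
    return "invalid"


def classify(route, rib, vrps):
    """Split invalids into the two categories described above."""
    state = validate(route, vrps)
    if state != "invalid":
        return state
    for other in rib:
        if other == route:
            continue
        # Assumption: any covering route that is not invalid keeps
        # end-to-end connectivity working after 'invalid == reject'.
        if covers(other.prefix, route.prefix) and validate(other, vrps) != "invalid":
            return "invalid_with_alternative"
    return "invalid_no_alternative"


# Made-up example data using documentation prefixes and private-use ASNs.
vrps = [VRP(ip_network("192.0.2.0/24"), 24, 64500),
        VRP(ip_network("198.51.100.0/24"), 24, 64500)]
rib = [Route(ip_network("192.0.2.0/24"), 64501),     # wrong origin
       Route(ip_network("192.0.0.0/22"), 64502),     # covering, no ROA
       Route(ip_network("198.51.100.0/25"), 64500)]  # too specific

for r in rib:
    print(r.prefix, classify(r, rib, vrps))
```

Flow records matched to a route in the `invalid_no_alternative` bucket are the traffic that would be blackholed; everything else still has a path.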
On Tue, Feb 12, 2019 at 1:15 PM Job Snijders <job@ntt.net> wrote:
ps. Dear Kentik & Deepfield, please copy+paste this feature! We'll happily share development notes with you, you can even look at pmacct's source code for inspiration. :-)
Thanks Job,

I just wanted to reach back out to you and the NANOG community that we've implemented this feature. Currently Kentik can match flow data with the following validation states:

- VALID = Prefix fits in ROA, and ROA ASN and Prefix Origin Match
- UNKNOWN = we haven't found any matching ROA
- INVALID - ASN mismatch = BGP prefix fits in the ROA prefix's length BUT the ROA ASN differs from the Prefix Origin ASN
- INVALID - Prefix length out of bounds = the BGP prefix doesn't have an ROA with large enough Max-Length to refer to
- INVALID - ASN 0 specified = there is a matching ROA w/ the right max-length but the ASN associated w/ it is 0 (explicit invalid)

If anyone would like more information please hit me up offline.

-Steve
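The five states listed above can be sketched as a small Python function. This is an illustration only, not Kentik's implementation: `validation_state` is a hypothetical name, ROAs are plain `(prefix, max_length, asn)` tuples, and the precedence used when several ROAs cover the same prefix (valid first, then length, then AS 0, then ASN mismatch) is an assumption of this sketch.

```python
from ipaddress import ip_network


def validation_state(prefix, origin, roas):
    """Return one of the five state labels for a (prefix, origin) pair.

    roas: iterable of (roa_prefix, max_length, asn) tuples (hypothetical shape).
    """
    covering = [(p, m, a) for p, m, a in roas
                if p.version == prefix.version and prefix.subnet_of(p)]
    if not covering:
        return "UNKNOWN"
    # ROAs whose Max-Length is large enough for this announcement.
    fits = [(p, m, a) for p, m, a in covering if prefix.prefixlen <= m]
    if any(a == origin and a != 0 for _, _, a in fits):
        return "VALID"
    if not fits:
        return "INVALID - Prefix length out of bounds"
    if all(a == 0 for _, _, a in fits):
        return "INVALID - ASN 0 specified"
    return "INVALID - ASN mismatch"


# Made-up ROAs using documentation prefixes and private-use ASNs.
roas = [(ip_network("203.0.113.0/24"), 24, 64500),
        (ip_network("198.51.100.0/24"), 24, 0)]

for pfx, asn in [("203.0.113.0/24", 64500), ("203.0.113.0/24", 64501),
                 ("203.0.113.128/25", 64500), ("198.51.100.0/24", 64500),
                 ("192.0.2.0/24", 64500)]:
    print(pfx, asn, validation_state(ip_network(pfx), asn, roas))
```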
On 12-March-2019, Steve Meuse writes:
ps. Dear Kentik & Deepfield, please copy+paste this feature! We'll happily share development notes with you, you can even look at pmacct's source code for inspiration. :-)
Thanks Job, I just wanted to reach back out to you and the NANOG community that we've implemented this feature. Currently Kentik can match flow data with the following validation state:
- VALID = Prefix fits in ROA, and ROA ASN and Prefix Origin Match
- UNKNOWN = we haven't found any matching ROA
- INVALID - ASN mismatch = BGP prefix fits in the ROA prefix's length BUT the ROA ASN differs from the Prefix Origin ASN
- INVALID - Prefix length out of bounds = the BGP prefix doesn't have an ROA with large enough Max-Length to refer to
- INVALID - ASN 0 specified = there is a matching ROA w/ the right max-length but the ASN associated w/ it is 0 (explicit invalid)
Hi Steve,

Thanks for the update, but based on that description I'm not certain that you implemented the same thing that pmacct built, which IMO is what is needed by those considering deploying a drop-invalids policy. (Perhaps you omitted mentioning that ability in your description but included it in your implementation.)

Citing from Job's description:
Pmacct implemented Origin Validation in a cute way: it separates out RPKI invalid BGP announcements into two categories:
a) "invalid with no overlapping or alternative route" (aka will be blackholed if 'invalid == reject')
b) "invalid but an overlapping unknown/valid announcement also exists" (end-to-end connectivity can still work).
Networks contemplating Origin Validation need to be able to predict how their traffic with the rest of the Internet would change after deploying a drop-invalid-routes policy.

When we (AS7018) were preparing to begin dropping invalid routes received from peers earlier this year, that is exactly the kind of analysis we did. In our case we rolled our own with a two-pass process: we first found all the traffic to/from invalid routes by a BGP community we gave them, then outside of our flow analysis tool we further filtered the traffic for invalid routes which were covered by less-specific not-invalid routes. What remained was the traffic we would lose once invalid routes were dropped. Had the pmacct capability existed at that time, we would have used it.

Regarding the ability to further partition invalid traffic into the three sub-categories you mentioned: that would not have been of interest to us at the time we did our analysis, and it's not clear to me how it would be useful to a network as it contemplates adopting a drop-invalids policy. In this context, the reason a route is invalid is not important; what is important is whether it is covered by a non-invalid route or not.

Thanks.

Jay B.
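The two-pass process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling AS7018 used: `traffic_at_risk` is a hypothetical name, flows are assumed to be already keyed by the BGP prefix they matched, and the set of invalid prefixes stands in for "routes carrying the invalid community".

```python
from ipaddress import ip_network


def traffic_at_risk(flows, invalid_prefixes, not_invalid_prefixes):
    """Bytes that would be lost once invalid routes are dropped.

    flows: iterable of (matched_bgp_prefix, byte_count) pairs (assumed shape).
    """
    # Pass 1: keep only traffic that matched an invalid route.
    to_invalid = [(p, b) for p, b in flows if p in invalid_prefixes]

    # Pass 2: discard traffic whose invalid route is covered by an equal
    # or less-specific route that is not invalid.
    def has_cover(p):
        return any(c.version == p.version and p.subnet_of(c)
                   for c in not_invalid_prefixes)

    return sum(b for p, b in to_invalid if not has_cover(p))


# Made-up example data.
flows = [(ip_network("192.0.2.0/24"), 1000),
         (ip_network("198.51.100.0/25"), 500),
         (ip_network("203.0.113.0/24"), 700)]
invalid = {ip_network("192.0.2.0/24"), ip_network("198.51.100.0/25")}
not_invalid = [ip_network("192.0.0.0/22"), ip_network("203.0.113.0/24")]

print(traffic_at_risk(flows, invalid, not_invalid))
```

In this made-up data only the traffic towards 198.51.100.0/25 has no covering route, so only its bytes count as lost.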
On Tue, Mar 12, 2019 at 9:26 AM Jay Borkenhagen <jayb@att.com> wrote:
Thanks for the update, but based on that description I'm not certain that you implemented the same thing that pmacct built, which IMO is what is needed by those considering deploying a drop-invalids policy. (Perhaps you omitted mentioning that ability in your description but included it in your implementation.)
Thanks Jay, you are correct. As we were talking through the logic we realized we missed that bit.

Internally, we're working through the logic to understand: if there is a covering route, is that route valid, and if not, will we recurse and look for another covering route that is valid?

Either way, we'll be updating our software with that functionality shortly.

-Steve
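One way to picture the "recurse to the next covering route" question is the Python sketch below. It is a sketch under assumptions, not any vendor's logic: `fallback_route` is a hypothetical name, the table is reduced to one validation state per prefix, and a covering route counts as usable when it is not invalid (valid or unknown).

```python
from ipaddress import ip_network


def fallback_route(prefix, states):
    """Walk towards less-specific prefixes until a usable route is found.

    states: dict mapping a network object to 'valid', 'unknown' or
    'invalid' for every prefix present in the routing table (assumed shape).
    Returns the covering prefix traffic would fall back to, or None.
    """
    candidate = prefix
    while candidate.prefixlen > 0:
        candidate = candidate.supernet()   # one bit less specific
        state = states.get(candidate)
        # Invalid covering routes would be dropped too, so keep walking.
        if state is not None and state != "invalid":
            return candidate
    return None


# Made-up table: the /24 and its covering /23 are both invalid,
# the /16 above them has no ROA.
states = {ip_network("192.0.2.0/24"): "invalid",
          ip_network("192.0.2.0/23"): "invalid",
          ip_network("192.0.0.0/16"): "unknown"}

print(fallback_route(ip_network("192.0.2.0/24"), states))
```

A return value of `None` corresponds to the "no alternative route" case, where the traffic would be dropped.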
Thanks for the update, but based on that description I'm not certain that you implemented the same thing that pmacct built, which IMO is what is needed by those considering deploying a drop-invalids policy. (Perhaps you omitted mentioning that ability in your description but included it in your implementation.)
Thanks Jay, you are correct. As we were talking through the logic we realized we missed that bit. Internally, we're working through the logic to understand: if there is a covering route, is that route valid, and if not, will we recurse and look for another covering route that is valid?
daniele's pam paper and ripe preso laid it out pretty well

Daniele Iamartino, Cristel Pelsser, Randy Bush. "Measuring BGP Route Origin Registration and Validation," PAM 2015.
https://archive.psg.com//141223.route-origin-pam2015.pdf
https://ripe69.ripe.net/presentations/103-route-origin-validation.pdf
participants (4): Jay Borkenhagen, Job Snijders, Randy Bush, Steve Meuse