CGNAT growing pains

8 Oct 2024

We started rolling out CGNAT about 6 months ago.  It was smooth sailing 
for the first few months, but we eventually did run into a number of 
issues.

Our customer base is primarily FTTH with "dynamic" IP assignment via DHCP. 
Since connections are always-on, customer ONTs/routers get an IP assigned, 
and then when the lease is renewed, they request a new lease for the 
existing IP, and, in general, that request is granted.  This gives 
customers the mistaken impression they have a static IP.  My impression, 
from working with some customers who've needed to be moved from CGNAT back 
to a public IP, is that customers who are doing port-forwarding don't even 
bother with dynamic DNS.  They just know they can connect to their IP, as 
they've never seen it change.  We do offer/sell static IP, but pre-CGNAT, 
it was strictly for business customers, i.e. a residential customer could 
only get static IP service by converting their account to a business 
account.  That may change in the near future.

One issue we didn't foresee has been IP Geo.  We all knew that streaming 
services like Netflix use IP Geo to determine what content should be made 
available, but that's, AFAIK, limited by country or region. 
What we didn't anticipate is services like Hulu Live TV doing IP Geo down 
to the city level to determine which local channels are a subscriber's 
local channels.  We're using Juniper MX gear and SPC3 cards for our CGNAT 
routers, each one having a single large external pool.  Since we serve 
most of FL, one external pool can't IP Geo correctly for customers as far 
apart as Miami and Jacksonville hitting the same CGNAT router.  We don't 
currently have an acceptable solution to this other than moving impacted 
customers off CGNAT.

One of the great unknowns (at least for us) with CGNAT was what our PBA 
settings should be.  i.e.  How large each port-block should be, and how 
many port-blocks to allow per customer.  We started with 256x4.  It seemed 
to work.  We eventually noticed that we were logging port-block exceeded 
errors.  This is one aspect where Juniper's CGNAT support is lacking. 
There's a counter for these errors, and it's available via SNMP, but 
there's no way to attribute the errors to subscriber IPs.  We're polling 
the MIB and graphing it, so we know it's a continuing issue and can see 
when it's incrementing faster/slower, but Junos provides no means for 
determining if "PBEs" are all being caused by a single customer, a handful 
of customers, etc.  We have a JTAC case open on this.  As a quick & 
hopeful fix, we increased both the port-block size and the block limit. 
That helped, but didn't stop the errors.  It also cut our CGNAT ratio by 
more than half (64:1 -> 28:1).  If we stay at this ratio, we'll need much 
larger external pools than originally anticipated.  Tuning these settings 
is kind 
of painful as JTAC strongly recommends bouncing the CGNAT service anytime 
CGNAT related config changes are made.  This means briefly breaking 
Internet access for all CGNAT'd customers.  For the PBEs, JTAC's 
suggestions so far have been to shorten some of the timeouts in the config 
and to keep doing what we're doing, which is a cron job that essentially 
does a "show services nat source port-block", parses the output looking 
for subscriber IPs that have used up the ports in several of their 
port-blocks, then does a "show services sessions source-prefix ..." and 
logs all of this.  This at least gives us snapshots of "who's a heavy user 
right now" and lets us look at how they were using all their ports, e.g. 
was it BitTorrent, are they compromised and scanning the Internet for more 
systems to compromise, is it legit-looking traffic (just lots of it), 
etc.?
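
To make the ratio arithmetic concrete, here's a rough Python sketch of how 
block size and block limit translate into subscribers per external IP and 
pool size.  It assumes every external IP contributes all 65536 ports to 
the pool (real pools may reserve some), and the 10,000 subscriber count is 
a made-up number for illustration, not our actual customer count.  The 
function names are mine, not anything from Junos.

```python
import math

def subscribers_per_external_ip(block_size, max_blocks, usable_ports=65536):
    """Worst-case sharing ratio if every subscriber can use all their blocks.

    usable_ports=65536 is an assumption; adjust if the pool excludes
    low/reserved ports.
    """
    ports_per_subscriber = block_size * max_blocks
    return usable_ports // ports_per_subscriber

def external_ips_needed(subscribers, ratio):
    """External pool size needed to serve `subscribers` at `ratio`:1."""
    return math.ceil(subscribers / ratio)

if __name__ == "__main__":
    # Our starting PBA settings: 256 ports per block, 4 blocks per customer.
    print(subscribers_per_external_ip(256, 4))   # 64, i.e. 64:1

    # Hypothetical 10,000 CGNAT'd subscribers at the old and new ratios.
    print(external_ips_needed(10_000, 64))       # 157
    print(external_ips_needed(10_000, 28))       # 358
```

At the same subscriber count, going from 64:1 to 28:1 more than doubles 
the external addresses needed.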

The latest CGNAT issue involves a customer with a Palo Alto Networks 
firewall connected to our network, several of whose employees are our FTTH 
customers.  On their PANW firewall, they're doing IP Geo based filtering, 
limiting access to internal servers to "US IPs".  Since we only CGNAT 
traffic to the external Internet, their on-net employees hit the firewall 
from their 100.64/10 IPs and get blocked.  I suggested they whitelist 
100.64/10, saying we block traffic from 100.64/10 from entering our 
network via peering and transit, so they can be assured anything from 
100.64/10 came from inside our network / our customers.  They say the 
firewall won't let them whitelist 100.64.0.0/10, giving an error that it's 
invalid IP space.
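
For reference, 100.64.0.0/10 is the Shared Address Space reserved by RFC 
6598, so the prefix itself is well-formed.  Here's a minimal Python sketch 
of the membership check we'd like the firewall to make, using only the 
standard ipaddress module.  The function name and the sample addresses are 
made up for illustration, and this says nothing about how PAN-OS validates 
address objects.

```python
import ipaddress

# RFC 6598 Shared Address Space, used here for CGNAT subscriber addresses.
CGNAT_SPACE = ipaddress.ip_network("100.64.0.0/10")

def is_on_net_cgnat(source_ip: str) -> bool:
    """Return True if source_ip falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(source_ip) in CGNAT_SPACE

if __name__ == "__main__":
    # Illustrative addresses only: the first two are the edges of the
    # range, the last two are outside it.
    for ip in ("100.64.0.1", "100.127.255.255", "100.128.0.0", "8.8.8.8"):
        print(ip, is_on_net_cgnat(ip))
```

The range covers 100.64.0.0 through 100.127.255.255, so anything in there 
that reaches the firewall would be one of our on-net customers, given the 
filtering at our peering and transit edges described above.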

I know we're not the first to implement CGNAT, so I'm curious if others 
have run into these sorts of issues, or others we haven't run into yet, 
and if so, how you solved them.

----------------------------------------------------------------------
  Jon Lewis, MCP :)              |  I route
  Blue Stream Fiber, Sr. Neteng  |  therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________