Great point.  We don't need geo-diversity for websites with the IP address issue, so we could design for that case specially on a one-off basis.

For throughput, it shouldn't matter where we're located, but we often find websites serving different content based on the source IP of the traffic, so having a presence closer to the user is useful. Then again, that concern is orthogonal to the original question, because geo-IP doesn't make much sense with an anycast IP. For those websites that need a stable IP for NACLs *and* serve different content based on source IP, we'd have to use your suggestion of a predictable 3-5 IPs per site.
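
To make the 3-5 IPs per site idea concrete, here is a rough sketch of one way it could work, not what we actually run. The region names, the addresses (documentation ranges), and the function names are all made up for illustration. It hashes each site against the egress IPs and keeps one IP per region, so a given site always sees the same short list while different sites spread across the whole pool.

```python
import hashlib

# Hypothetical pool of NAT gateway egress IPs per region.
# The addresses are from the documentation ranges, not real allocations.
EGRESS_POOL = {
    "us-east-1": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    "eu-west-1": ["198.51.100.10", "198.51.100.11", "198.51.100.12"],
    "ap-south-1": ["203.0.113.10", "203.0.113.11", "203.0.113.12"],
}


def _score(site: str, ip: str) -> int:
    """Stable pseudo-random score for a (site, ip) pair."""
    digest = hashlib.sha256(f"{site}|{ip}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def egress_ips_for(site: str, pool=EGRESS_POOL) -> dict:
    """Pick one egress IP per region for a site (rendezvous hashing).

    The same site always maps to the same IPs, so the list can be
    handed to the site owner for allowlisting.
    """
    return {
        region: max(ips, key=lambda ip: _score(site, ip))
        for region, ips in pool.items()
    }


if __name__ == "__main__":
    for site in ("example.com", "example.org"):
        print(site, egress_ips_for(site))
```

With three regions this gives each site three IPs, one in each region, so sites that vary content by source IP can still be crawled from a nearby location.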



On Wed, Jul 28, 2021 at 11:27 AM Glenn McGurrin via NANOG <nanog@nanog.org> wrote:
I'd had a similar thought/question, though keeping the geo diversity,
you manage the crawlers, and are making contact individually with these
sites from what you have stated (and so don't need a one-size-fits-all
list for public posting), so why not have a restricted subset of the
crawlers handle sites with these issues (which subset may be unique per
site, which makes maintaining even load balancing not overly complex
/limiting, especially as you are using nat anyway, so multiple servers
can be behind each IP and that number can vary).  That lets you have
geo diversity (or even multi cloud diversity) for every site, but each
site that needs this IP whitelisting only needs 3-5 IPs at any site,
but yet you can distribute load over a much larger overall set of
machines and nat gateways.

As I understand it, even CDNs that anycast TCP (externally or internally
[load balancing via routers and multipath]) do something similar by
spreading load over multiple IPs at the DNS layer first.

As the transition to IPv6 happens you may have it easier as getting a
large enough allocation to allow for splitting it out into multiple
subnets advertised from different locations without providers dropping
the route as too long a prefix is much easier on the v6 side, so you
could give one /36 or /40 or even /44 out to whitelist but have /48s at
each location.  For sites with IPv6 support that may help now, but it
won't help all sites for quite some time, though the number that support
v6 is slowly getting better.  For the foreseeable future you still need
to handle the v4 side one way or another though.
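
As an illustration of the v6 split described above, a minimal sketch using Python's standard ipaddress module. The aggregate is taken from the 2001:db8::/32 documentation range and the location names are hypothetical. It carves one /48 per location out of a single /40, so the /40 is the only prefix a site owner would need to whitelist.

```python
import ipaddress

# Hypothetical aggregate (from the IPv6 documentation range) and locations.
AGGREGATE = ipaddress.ip_network("2001:db8::/40")
LOCATIONS = ["us-east-1", "eu-west-1", "ap-south-1"]


def per_location_prefixes(aggregate, locations, new_prefix=48):
    """Assign consecutive subnets of the aggregate to locations, in order."""
    subnets = aggregate.subnets(new_prefix=new_prefix)
    return dict(zip(locations, subnets))


plan = per_location_prefixes(AGGREGATE, LOCATIONS)

if __name__ == "__main__":
    print("whitelist:", AGGREGATE)
    for location, prefix in plan.items():
        print(location, "announces", prefix)
```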

On 7/28/2021 10:21 AM, William Herrin wrote:
> On Wed, Jul 28, 2021 at 6:04 AM Vimal <j.vimal@gmail.com> wrote:
>> My intention is to run a web-crawling service on a public cloud. This service
>> is geographically distributed, and therefore will run in multiple regions
>> around the world inside AWS... this means there will be multiple AWS VPCs,
>> each with their own NAT gateway, and traffic destined to websites
>> that we crawl will appear to come from this NAT gateway's IP address.
>
> Hello,
>
> AWS does not provide the ability to attach anycasted IP addresses to a
> NAT gateway, regardless of whether it would work, so that's the end of
> your quest.
>
>> The reason I want a predictable IP is to communicate this IP to website
>> owners so they can allow access from these IPs into their networks.
>> I chose IP as an example; it can also be a subnet, but what I don't want to
>> provide is a list of 100 different IP addresses without any predictability.
>
> If you bring your own IP addresses, you can attach separate /24s of
> them to your VPCs in each region, providing you with a single
> predictable range of source addresses. You will find it difficult and
> expensive to acquire that many IP addresses from the regional
> registries for the purpose you describe.
>
>
> Silly question but: for a web crawler, why do you care whether it has
> the limited geographic distribution that a cloud service provides?
> It's a parallel batch task. It doesn't exactly matter whether you have
> minimum latency.
>
> Regards,
> Bill Herrin
>
>
>


--
Vimal