
Timeout behind GCP NAT

I seem to have a problem similar to this question, in that I get the following timeout when building a Docker image using a GitLab runner on GCP:

Put https://registry.gitlab.com/v2/[redacted-repo]: dial tcp 35.227.35.254:443: i/o timeout

At that moment, my Google Cloud NAT gives the following log output:

{
   "insertId": "rh7b2jfleq0wx",
   "jsonPayload": {
     "allocation_status": "DROPPED",
     "endpoint": {
       "project_id": "gitlab-autoscale-runners",
       "vm_name": "runner-5dblbjek-auto-scale-runner-1589446683-0b220f90",
       "region": "europe-west4",
       "zone": "europe-west4-b"
     },
     "connection": {
       "protocol": 6,
       "src_port": 42446,
       "src_ip": "some-ip",
       "dest_ip": "some-ip",
       "dest_port": 443
     },
     "vpc": {
       "vpc_name": "default",
       "subnetwork_name": "default",
       "project_id": "gitlab-autoscale-runners"
     },
     "gateway_identifiers": {
       "gateway_name": "gitlab-runner-gateway",
       "router_name": "gitlab-runner-router",
       "region": "europe-west4"
     }
   },
   "resource": {
     "type": "nat_gateway",
     "labels": {
       "region": "europe-west4",
       "router_id": "7964886332834186727",
       "gateway_name": "gitlab-runner-gateway",
       "project_id": "gitlab-autoscale-runners"
     }
   },
   "timestamp": "2020-05-14T10:17:55.195614735Z",
   "labels": {
     "nat.googleapis.com/nat_ip": "",
     "nat.googleapis.com/instance_name": "runner-5dblbjek-auto-scale-runner-1589446683-0b220f90",
     "nat.googleapis.com/network_name": "default",
     "nat.googleapis.com/subnetwork_name": "default",
     "nat.googleapis.com/router_name": "gitlab-runner-router",
     "nat.googleapis.com/instance_zone": "europe-west4-b"
   },
   "logName": "projects/gitlab-autoscale-runners/logs/compute.googleapis.com%2Fnat_flows",
   "receiveTimestamp": "2020-05-14T10:18:00.422135520Z"
 }

The aforementioned question seems to indicate a problem with overutilized NAT ports. I have confirmed that this is not the problem in our case by using the Google Cloud CLI, see below.

$ gcloud compute routers get-nat-mapping-info gitlab-runner-router
---
instanceName: runner-5dblbjek-auto-scale-runner-1589446683-0b220f90
interfaceNatMappings:
- natIpPortRanges:
  - some-ip:1024-1055
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 32
  sourceAliasIpRange: ''
  sourceVirtualIp: some-ip
- natIpPortRanges:
  - some-ip:32768-32799
  numTotalDrainNatPorts: 0
  numTotalNatPorts: 32
  sourceAliasIpRange: ''
  sourceVirtualIp: some-ip

So I seem to be using only some 64 ports.

The Google Cloud router advertises the following status:

kind: compute#routerStatusResponse
result:
  natStatus:
  - autoAllocatedNatIps:
    - some-ip
    minExtraNatIpsNeeded: 0
    name: gitlab-runner-gateway
    numVmEndpointsWithNatMappings: 3
  network: https://www.googleapis.com/compute/v1/projects/gitlab-autoscale-runners/global/networks/default  
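
The status above matches the output of the router's get-status command, presumably invoked as:

$ gcloud compute routers get-status gitlab-runner-router --region=europe-west4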

The same Docker image does build successfully when run locally or on a shared GitLab runner, i.e. when not behind the NAT.

How do I prevent the timeout when building this Docker image behind the Google Cloud NAT?

asked Jan 18 '26 by milo526

1 Answer

Looking at the Cloud NAT output, it shows the allocation status DROPPED. The recommended action is to increase the minimum ports per VM instance to a sufficiently large range (4096 ports) and let it run for a few days. I am suggesting this number as a starting point; if it only reduces the drops, keep increasing by a factor of 2 until no drops are received. If you receive no DROPPED status at 4096 ports, you can decrease the setting until you find a median where you no longer receive DROPPED status yet do not have an abundance of idle NAT ports open. You are currently reaching 64 connections; port usage represents the number of connections to unique destinations (destination IP:port, protocol) for a VM.
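
You can raise this on the existing NAT gateway with the gcloud CLI. A minimal sketch, assuming the gateway and router names from your log output; the logging flags are optional but make any further DROPPED events easy to spot:

$ gcloud compute routers nats update gitlab-runner-gateway \
    --router=gitlab-runner-router \
    --region=europe-west4 \
    --min-ports-per-vm=4096 \
    --enable-logging \
    --log-filter=ERRORS_ONLY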

Looking at your configuration, there are currently 64 ports allocated per VM (as you mention in the description). That means each VM instance within that VPC gets 64 NAT IP:port combinations to connect externally, so it can connect to at most 64 unique destinations (destination IP address, destination port, and protocol) at the same time. During your testing it looks like you were reaching that limit.
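
To confirm the current setting, you can describe the NAT configuration (again assuming the names from your question; the minPortsPerVm field shows the configured value, and the default is 64):

$ gcloud compute routers nats describe gitlab-runner-gateway \
    --router=gitlab-runner-router \
    --region=europe-west4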

Each NAT IP assigned to a Cloud NAT gateway has 64512 usable ports, so with the default configuration of 64 ports per VM, the NAT gateway assigns a block of 64 NAT IP:port combinations to each VM within the subnetworks selected in the NAT gateway configuration. That means with this configuration you can run 1008 VMs (64512 divided by 64), but each VM can connect to only 64 unique destinations at the same time. Depending on your application/use case, you would need to increase the minimum ports per VM if you need more simultaneous connections.
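
As a worked example of that calculation:

64512 usable ports per NAT IP / 64 ports per VM = 1008 VMs per NAT IP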

For example, with 1 NAT IP and 1024 minimum ports per VM, you can run 63 VMs, and each VM can connect to 1024 unique destinations. If you need to run more VMs, you need to allocate more NAT IPs; by adding a second IP, you double the NAT capacity. Since you have chosen auto-allocation of NAT IPs, NAT IPs will be automatically created and assigned as you create more VMs in the subnetwork. In that case, you only need to tweak the minimum-ports-per-VM configuration to meet your traffic demand.
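
The same arithmetic at that setting:

64512 ports per NAT IP / 1024 ports per VM = 63 VMs per NAT IP
2 NAT IPs: 2 x 63 = 126 VMs, each still with 1024 ports

so capacity scales linearly with the number of auto-allocated NAT IPs.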

Please note that once a connection is terminated, the NAT gateway holds the NAT IP:port combination on a 2-minute timer before it can be reused [1], so keep the ports configuration a little higher than your peak traffic requires.
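
To illustrate with assumed numbers: if a build opens roughly 30 short-lived connections per minute to unique destinations, the 2-minute reuse timer alone keeps about 30 x 2 = 60 NAT ports reserved, on top of the ports held by connections that are still open.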

More details on port calculations are available here [2].

[1] https://cloud.google.com/nat/docs/overview#specs-timeouts

[2] https://cloud.google.com/nat/docs/ports-and-addresses#ports

answered Jan 20 '26 by Anurag Sharma