ECS Fargate container is not using VPC Endpoints to pull from ECR

I've been stuck on this for a week.

I have a Fargate service running in a private subnet. I want to limit the containers' access to the private network alone, but I'm not able to pull an image from my private ECR repo over the private network.

  • Created VPC endpoints for ecr.api, ecr.dkr, S3 (gateway type, with the routes added to the private subnets' route tables), and CloudWatch Logs. Enabled private DNS for them where possible, and also opened their security groups to 0.0.0.0/0 (for testing purposes).
  • For the Fargate security group, I opened up ingress/egress to the entire VPC CIDR, and also to the security groups attached to the VPC endpoints.
  • Verified IAM permissions on both sides: the ECR repo has a policy that allows all users to perform actions on the repo, and the Fargate task role also contains all the relevant IAM permissions.
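For context, the endpoint setup described in the first bullet corresponds to Terraform roughly along these lines (a sketch, not my exact config; `aws_security_group.endpoint_sg` and `data.aws_route_table.private` are illustrative names, while the subnet data sources mirror the ones used later in the question):

```hcl
# Interface endpoints for the ECR API and the ECR Docker registry.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = data.aws_vpc.vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = [data.aws_subnet.private-1.id, data.aws_subnet.private-2.id]
  security_group_ids  = [aws_security_group.endpoint_sg.id]
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = data.aws_vpc.vpc.id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = [data.aws_subnet.private-1.id, data.aws_subnet.private-2.id]
  security_group_ids  = [aws_security_group.endpoint_sg.id]
}

# Gateway endpoint for S3, associated with the private route tables.
# Gateway endpoints are routed via the route table; they create no ENIs.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.vpc.id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [data.aws_route_table.private.id]
}
```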

When the container is launched, I get the following error:

CannotPullContainerError: ref pull has been retried 5 time(s): failed to copy: httpReadSeeker: failed open: failed to do request: Get 956469741060.dkr.ecr.us-east-1.amazonaws.com/my-ecr-repo:latest: dial tcp 52.216.78.32:443: i/o timeout

So the container is still trying to pull the ECR image over a public IP (my VPC CIDR is 10.0.0.0/16). Needless to say, the Fargate container is able to pull the ECR image once I open egress to 0.0.0.0/0, but I want to avoid that and only allow ingress/egress within the private subnets.

I confirmed the VPC endpoint configuration by launching an EC2 instance in a private subnet and running nslookup against all of the VPC endpoints mentioned above. All of them return private IPs, which tells me the endpoints themselves are configured correctly.

Because of the EC2 nslookup test, I assume the issue is within my Fargate configuration. This is how the Terraform setup looks:

resource "aws_ecs_cluster" "test_sdk" {
  name = "test-sdk-${var.stage}"
}

resource "aws_ecs_task_definition" "test_task_def" {
  family                   = "test-sdk-${var.stage}"
  network_mode             = "awsvpc"
  task_role_arn            = aws_iam_role.ecs_task_execution_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  requires_compatibilities = ["FARGATE"]
  cpu                      = 4096
  memory                   = 8192
  container_definitions = jsonencode([
    {
      "name" : "test-container",
      "image" : "${data.aws_caller_identity.self.account_id}.dkr.ecr.${var.region}.amazonaws.com/test-sdk-${var.stage}:latest",
      "essential" : true,
      "portMappings" : [
        {
          "containerPort" : var.container_port,
          "hostPort" : var.container_port
        }
      ],
      "logConfiguration" : {
        "logDriver" : "awslogs",
        "options" : {
          "awslogs-group" : "ecs-test-${var.stage}",
          "awslogs-region" : "${var.region}",
          "awslogs-stream-prefix" : "streaming"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "test_service" {
  name            = "test-service"
  cluster         = aws_ecs_cluster.test_sdk.id
  task_definition = aws_ecs_task_definition.test_task_def.arn
  launch_type     = "FARGATE" 
  desired_count   = 1

  network_configuration {
    subnets = [data.aws_subnet.private-1.id, data.aws_subnet.private-2.id]
    security_groups = [aws_security_group.test-sg.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.test-tg.arn
    container_name   = "test-container"
    container_port   = var.container_port
  }
}

# Create a security group allowing traffic on container port
resource "aws_security_group" "test-sg" {
  name   = "test-sg-${var.stage}"
  vpc_id = data.aws_vpc.vpc.id

  ingress {
    from_port   = var.container_port
    to_port     = var.container_port
    protocol    = "tcp"
    cidr_blocks = [
       data.aws_subnet.private-1.cidr_block,
       data.aws_subnet.private-2.cidr_block
      ]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [
       data.aws_subnet.private-1.cidr_block,
       data.aws_subnet.private-2.cidr_block
      ] # Allow traffic from private subnet
  }

  egress {
    from_port   = var.container_port
    to_port     = var.container_port
    protocol    = "tcp"
    cidr_blocks = [
      data.aws_subnet.private-1.cidr_block,
      data.aws_subnet.private-2.cidr_block
    ] # Allow traffic to the private subnets
  }

  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [
      data.aws_subnet.private-1.cidr_block,
      data.aws_subnet.private-2.cidr_block
    ] # Allow HTTPS to the private subnets
  }

}

# Create Application Load Balancer
resource "aws_lb" "test" {
  name               = "test-lb-${var.stage}"
  internal           = true
  load_balancer_type = "application"
  security_groups    = [aws_security_group.test-sg.id]
  subnets            = [data.aws_subnet.private-1.id, data.aws_subnet.private-2.id]
}

# Create Target Group
resource "aws_lb_target_group" "test-tg" {
  name     = "test-tg-${var.stage}"
  port     = var.container_port
  protocol = "HTTP"
  target_type = "ip"
  vpc_id   = data.aws_vpc.vpc.id
  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 90
    path                = "/"
    matcher             = "200-399"
    port                = var.container_port
    protocol            = "HTTP"
    timeout             = 40
    unhealthy_threshold = 2
  }
}

# Create listener
resource "aws_lb_listener" "test-listener" {
  load_balancer_arn = aws_lb.test.arn
  port              = var.container_port
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.test-tg.arn
  }
}

# IAM
resource "aws_iam_role" "ecs_task_execution_role" {
  name = "tf-${var.project}-${var.stage}-ecs-task-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume_role_policy.json
  inline_policy {
    name = "test-sdk-ecr-repo-policy"
    policy = jsonencode({
      "Version" : "2012-10-17",
      "Statement" : [
        {
          "Effect" : "Allow",
          "Action" : [
            "ecr:GetAuthorizationToken",
            "ecr:BatchCheckLayerAvailability",
            "ecr:GetDownloadUrlForLayer",
            "ecr:BatchGetImage",
            "logs:CreateLogGroup",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "secretsmanager:GetSecretValue",
            "events:PutEvents"
          ],
          "Resource" : "*"
        }
      ]
    })
  }
}

data "aws_iam_policy_document" "ecs_assume_role_policy" {
  statement {
    actions = [
      "sts:AssumeRole"
    ]
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}
asked Dec 06 '25 by Niv Shitrit
1 Answer

If you are using an S3 gateway endpoint, it does not create a network interface in the VPC. So even though your security group allows traffic to your VPC CIDR, that rule never matches the S3 traffic, and fetching the image layers (which ECR stores in S3 under the covers) won't work without some modification.

So as you mentioned in your comment, the solution is to add the prefix list ID that was created for S3 to the security group. Essentially, this is adding the S3 IP addresses as an allowlist for outbound communication.
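In Terraform terms, a minimal sketch of that change, added to the task's security group (the `aws_vpc_endpoint.s3` reference is illustrative; use whatever name your gateway endpoint resource actually has):

```hcl
# A gateway endpoint exports a prefix list ID covering the S3 service's
# IP ranges. Referencing it in an egress rule lets the Fargate task
# reach S3 over HTTPS to download the ECR image layers.
egress {
  from_port       = 443
  to_port         = 443
  protocol        = "tcp"
  prefix_list_ids = [aws_vpc_endpoint.s3.prefix_list_id]
}
```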

The AWS documentation on gateway endpoints outlines the details.

AWS also documents the differences and pros/cons of using an S3 gateway endpoint versus an interface endpoint.
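If you'd rather keep purely CIDR/SG-based rules, an S3 interface endpoint is a possible alternative, since it does create ENIs in your subnets (a sketch reusing the question's data sources; note that interface endpoints are billed hourly, unlike gateway endpoints):

```hcl
# Interface endpoint for S3: creates ENIs with private IPs in the
# private subnets, so the existing security group rules scoped to the
# subnet CIDRs will match the S3 traffic.
resource "aws_vpc_endpoint" "s3_interface" {
  vpc_id             = data.aws_vpc.vpc.id
  service_name       = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type  = "Interface"
  subnet_ids         = [data.aws_subnet.private-1.id, data.aws_subnet.private-2.id]
  security_group_ids = [aws_security_group.test-sg.id]
}
```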

answered Dec 10 '25 by Shawn


